Demo: Optimizing Gemma inference on NVIDIA GPUs with TensorRT-LLM

Subscribers:
2,370,000
Published on ● Video Link: https://www.youtube.com/watch?v=y4QljAMsXr0



Duration: 12:21
926 views


Even the smallest large language models are compute-intensive, which significantly affects the cost of your generative AI application. Your ability to increase throughput and reduce latency can make or break many business cases. NVIDIA TensorRT-LLM is an open-source tool that lets you considerably speed up the execution of your models, and in this talk we demonstrate its application to Gemma.

Watch more videos of Gemma Developer Day 2024 → https://goo.gle/440EAIV
Subscribe to Google for Developers → https://goo.gle/developers

#Gemma #GemmaDeveloperDay







Tags:
Google
developers