Optimize, deploy, and benchmark an open-source LLM with vLLM (Transcript)

Learn more: https://bit.ly/3RtV5Lk

Introducing Fast & Efficient LLM Inference with vLLM, a short course built in partnership with Red Hat and taught by Cedric Clyburn, Senior Developer Advocate at Red Hat.

Serving open-source LLMs efficiently, for many users at low latency and reasonable cost, comes down mostly to memory management. Two things compete for that memory: the model weights and the KV cache. A 70-billion-parameter model takes around 140 GB of memory just for the weights at 16-bit precision, while the KV cache grows with every request you serve. In this course, you'll learn to shrink the weights through quantization and serve the model with vLLM, the widely adopted open-source serving system, taking advantage of the memory management techniques it provides, such as PagedAttention and prefix caching.
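The memory figures above follow from simple arithmetic, sketched below as a rough illustration rather than anything taken from the course. The weight estimate assumes the 140 GB figure refers to 16-bit weights (2 bytes per parameter) and counts 1 GB as 10^9 bytes. The KV cache estimate uses a hypothetical model shape (80 layers, 8 key/value heads, head dimension 128) chosen only for illustration, and the function names are made up for this sketch.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def kv_cache_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """KV cache bytes stored for one token of one request.

    The leading factor of 2 accounts for one key and one value
    vector per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    params = 70e9  # the 70-billion-parameter model from the text

    # Weights at 16-bit, 8-bit, and 4-bit precision.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(params, bits):.0f} GB")

    # Hypothetical model shape, for illustration only.
    per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
    context_tokens = 4096
    print(f"KV cache per token: {per_token} bytes")
    print(f"KV cache for one {context_tokens}-token request: "
          f"{per_token * context_tokens / 1e9:.2f} GB")
```

Under these assumptions the weights drop from 140 GB at 16 bits to 70 GB at 8 bits and 35 GB at 4 bits, which is the saving quantization targets. The KV cache, at 327,680 bytes per token for the hypothetical shape, adds about 1.34 GB for a single 4096-token request and scales with the number of concurrent requests.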

You'll run the full optimize-deploy-benchmark workflow on a real model: compressing an open-source Qwen model with LLM Compressor, serving it with vLLM, and benchmarking your deployment under realistic traffic using GuideLLM and lm-eval.

By the end, you'll have taken a real model through that full workflow and built the intuition to navigate the tradeoffs between accuracy, speed, and cost.

Enroll now: https://bit.ly/3RtV5Lk
