Description
When I first tried running an open-source LLM on my everyday GPU, the result was humbling: out-of-memory errors, sluggish responses, and the sinking feeling that deploying LLMs was only for those with expensive infrastructure.
That’s when I discovered vLLM, a framework built around continuous batching and efficient memory management. For me, it was a breakthrough: suddenly, the same hardware that had choked on smaller models could serve 7B-parameter models with near real-time responses. Combined with quantization and compression from HuggingFace’s ecosystem, vLLM turned deployment from a frustrating experiment into a workable production pipeline.
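As a rough illustration of the kind of setup the talk describes, here is a minimal sketch of serving a quantized 7B model with vLLM’s offline Python API. The model ID, quantization scheme, and memory settings below are assumptions for illustration, not the exact configuration from the talk.

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized 7B model; the model ID and settings here are
# illustrative assumptions, not the talk's exact configuration.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # hypothetical 7B model choice
    quantization="awq",            # weight-only quantization to fit a consumer GPU
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may reserve
    max_model_len=4096,            # cap context length to bound the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Continuous batching: vLLM schedules all prompts together and returns
# completions as sequences finish, instead of padding to a fixed batch.
outputs = llm.generate(
    [
        "Explain continuous batching in one paragraph.",
        "Why does quantization reduce GPU memory usage?",
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```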
In this talk, I’ll walk through the journey of optimising LLM inference step by step:
- How I used HuggingFace Optimum and vLLM to run 7B-parameter models on consumer-grade GPUs.
- Tricks that reduced GPU memory usage by more than half, making real-time responses possible.
- Benchmarks comparing latency, throughput, and cost across different optimisation strategies (a measurement sketch follows this list).
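To make the benchmarking point concrete, the following is a minimal sketch of how latency and throughput could be measured for a vLLM deployment. The model, prompt set, and batch size are placeholder assumptions, not the actual benchmark configuration compared in the talk.

```python
import time

from vllm import LLM, SamplingParams

# Hypothetical benchmark harness: measures end-to-end batch latency and
# generated-token throughput. Model and prompts are placeholders.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = [f"Summarize document {i} in two sentences." for i in range(32)]

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Batch latency: {elapsed:.2f} s")
print(f"Throughput:    {generated_tokens / elapsed:.1f} tokens/s")
```

Repeating a harness like this across configurations (unquantized vs. quantized, different batch sizes, different max context lengths) is one way to produce the latency, throughput, and cost comparisons the bullet refers to.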