15–16 Nov 2025
Indian Institute of Science
Asia/Kolkata timezone

Optimising LLM Inference on Resource-Constrained Hardware: A Study with vLLM, Quantization, and Compression

15 Nov 2025, 13:05
20m
Indian Institute of Science

Bengaluru, India
Talk (20 mins)
Track: Artificial Intelligence & Machine Learning (AI/ML)

Speaker

Abdul Hakkeem P A
AI Researcher, Master's Student at Cochin University of Science and Technology

Description

When I first tried running an open-source LLM on my everyday GPU, the result was humbling: out-of-memory errors, sluggish responses, and the sinking feeling that deploying LLMs was only for those with expensive infrastructure.

That’s when I discovered vLLM, a framework built around continuous batching and efficient memory management. For me, it was a breakthrough: suddenly, the same hardware that had choked on much smaller models could serve 7B-parameter models with near real-time responses. Combined with quantization and compression from the Hugging Face ecosystem, vLLM turned deployment from a frustrating experiment into a workable production pipeline.
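
To make that concrete, here is a minimal sketch of vLLM's offline inference API. The model name, prompts, and sampling settings are illustrative assumptions, not the exact configuration from my experiments:

    # Minimal vLLM offline inference. vLLM applies continuous batching
    # internally, so a list of prompts shares the GPU without padding
    # every sequence to the length of the longest one.
    from vllm import LLM, SamplingParams

    prompts = [
        "Explain continuous batching in one sentence.",
        "Why does a paged KV cache help LLM serving?",
    ]
    sampling = SamplingParams(temperature=0.7, max_tokens=128)

    # Illustrative 7B model; any Hugging Face causal LM id works here.
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)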

In this talk, I’ll walk through the journey of optimising LLM inference step by step:

  • How I used Hugging Face Optimum and vLLM to run 7B-parameter models on consumer-grade GPUs.

  • Tricks that reduced GPU memory usage by more than half, making real-time responses possible (a sketch of typical memory knobs follows this list).

  • Benchmarks comparing latency, throughput, and cost across the different optimisation strategies (a minimal measurement harness is also sketched below).
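
On the memory side, vLLM exposes several knobs that stack with weight quantization. The checkpoint and values below are assumptions for the sketch, not the exact settings behind the >50% saving:

    # Loading a pre-quantized model with a capped VRAM budget.
    from vllm import LLM

    llm = LLM(
        model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # illustrative AWQ-quantized checkpoint
        quantization="awq",           # 4-bit AWQ weights instead of fp16
        gpu_memory_utilization=0.85,  # cap the fraction of VRAM vLLM pre-allocates
        max_model_len=4096,           # shorter context window -> smaller KV cache
        dtype="float16",
    )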
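
For the latency and throughput numbers, a simple harness like the following is enough; the prompt set, batch size, and model are placeholders:

    # Time a batch of requests and report end-to-end latency and
    # generated-token throughput.
    import time
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
    sampling = SamplingParams(max_tokens=256)
    prompts = ["Summarise the benefits of quantization."] * 32

    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch latency: {elapsed:.2f} s")
    print(f"throughput:    {generated / elapsed:.1f} tokens/s")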
