Description
When I first tried running an open-source LLM on my everyday GPU, the result was humbling: out-of-memory errors, sluggish responses, and the sinking feeling that deploying LLMs was only for those with expensive infrastructure.
That’s when I discovered vLLM, a framework built around continuous batching and efficient memory management. For me, it was a breakthrough: suddenly, the same hardware that had choked on smaller models could serve 7B-parameter models with near real-time responses. Combined with quantization and compression from HuggingFace’s ecosystem, vLLM turned deployment from a frustrating experiment into a workable production pipeline.
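As a rough illustration of the kind of setup the talk describes, here is a minimal sketch of serving a quantized 7B model with vLLM’s offline Python API. The model ID, quantization scheme, and memory settings below are assumptions for illustration, not the exact configuration from the talk.

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized 7B model; the model ID and settings here are
# illustrative assumptions, not the talk's exact configuration.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # hypothetical 7B model choice
    quantization="awq",            # weight-only quantization to fit a consumer GPU
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may reserve
    max_model_len=4096,            # cap context length to bound the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Continuous batching: vLLM schedules all prompts together and returns
# completions as sequences finish, instead of padding to a fixed batch.
outputs = llm.generate(
    [
        "Explain continuous batching in one paragraph.",
        "Why does quantization reduce GPU memory usage?",
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```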
In this talk, I’ll walk through the journey of optimising LLM inference step by step:
- How I used HuggingFace Optimum and vLLM to run 7B-parameter models on consumer-grade GPUs.
- Tricks that reduced GPU memory usage by more than half, making real-time responses possible.
- Benchmarks comparing latency, throughput, and cost across different optimisation strategies (a measurement sketch follows this list).
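To make the benchmarking point concrete, the following is a minimal sketch of how latency and throughput could be measured for a vLLM deployment. The model, prompt set, and batch size are placeholder assumptions, not the actual benchmark configuration compared in the talk.

```python
import time

from vllm import LLM, SamplingParams

# Hypothetical benchmark harness: measures end-to-end batch latency and
# generated-token throughput. Model and prompts are placeholders.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = [f"Summarize document {i} in two sentences." for i in range(32)]

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Batch latency: {elapsed:.2f} s")
print(f"Throughput:    {generated_tokens / elapsed:.1f} tokens/s")
```

Repeating a harness like this across configurations (unquantized vs. quantized, different batch sizes, different max context lengths) is one way to produce the latency, throughput, and cost comparisons the bullet refers to.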