Description
Modern AI systems are no longer bottlenecked by models; they are bottlenecked by infrastructure. Training and deploying state-of-the-art models require managing terabytes of multimodal data, orchestrating distributed GPU clusters, and ensuring reproducibility, data consistency, and fault tolerance. The difference between a successful AI project and an abandoned prototype often comes down to the invisible layer of infrastructure: how data is stored, streamed, preprocessed, and served for training and inference.
In this talk, we will unpack why building robust AI infrastructure has become the most important problem in both academia and industry. We will explore how open-source tools can level the playing field, enabling even small teams, whether they are doing research or building products, to handle data and computation at scale with far less overhead. I will introduce Ray, an emerging distributed computing framework, and demonstrate how it simplifies complex workflows: scaling from a laptop to multi-GPU clusters, streaming petabyte-scale datasets, and orchestrating training and inference pipelines without additional complexity.
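To make the "laptop to cluster" idea concrete, here is a minimal sketch of the Ray task pattern, not taken from the talk itself: the function name and workload are illustrative placeholders, and the same script scales out when pointed at a running cluster.

```python
# Minimal sketch of Ray's task API (pip install ray). The workload is a
# placeholder for real preprocessing such as decoding or tokenizing a shard.
import ray

# With no arguments this starts a local Ray runtime on the laptop; on a
# running cluster, ray.init(address="auto") reuses the same code unchanged.
ray.init()

@ray.remote
def preprocess(shard: int) -> int:
    # Stand-in for per-shard work distributed across available workers.
    return shard * shard

# Fan the work out across all available resources and gather the results.
futures = [preprocess.remote(i) for i in range(16)]
print(ray.get(futures))
```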
Crux of this workshop:
- A clear understanding of the design trade-offs in large-scale AI infrastructure (storage formats, ingestion, orchestration, inference).
- A practical guide to using Ray, vLLM, KubeRay, and related tools on Ubuntu, from distributed training and dataset versioning in academic research to building scalable pipelines and robust model serving in industrial deployments (see the serving sketch after this list).
- Common pitfalls and how to avoid them when building resilient AI infrastructure.
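As a taste of the model-serving side, below is a hedged sketch of vLLM's offline inference API, one of the tools named above. The model id is a small placeholder (any accessible Hugging Face model works), and it assumes vLLM is installed on a GPU machine; it is not the exact pipeline shown in the workshop.

```python
# Minimal vLLM offline inference sketch (pip install vllm, GPU required).
from vllm import LLM, SamplingParams

# Small placeholder model; swap in the model you actually serve.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate completions for a batch of prompts.
outputs = llm.generate(["What does robust AI infrastructure look like?"], params)
for out in outputs:
    print(out.outputs[0].text)
```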