25–27 Oct 2024
The Hague, Netherlands
Europe/Amsterdam timezone

Simplify AI Infrastructure with Kubernetes Operators

Not scheduled
25m

Churchillplein 10, 2517 JW Den Haag, Netherlands
Talk (25 Minutes) Data, MLOps and AI/ML

Speakers

Ganeshkumar Ashokavardhanan
Microsoft
Tariq Ibrahim
NVIDIA Corp

Description

ML applications often require specialized hardware and additional configuration to run efficiently and reliably on Kubernetes. However, managing the cluster lifecycle and the diversity and complexity of hardware configuration across nodes can be challenging. How can we simplify and automate this process to ensure a smooth experience for Kubernetes users? How can we speed up GPU node provisioning? Kubernetes Operators, along with Ubuntu pre-compiled drivers, offer a great solution. In this session, we will go over operators and demonstrate how they can help automate the installation, configuration, and lifecycle management of AI-ready infrastructure end to end, from cluster provisioning and Kubernetes node configuration to deep learning model deployments.

We will demo an LLM fine-tuning workload to showcase how existing operators in the ecosystem, such as the Cluster API Operator, the GPU Operator, and the Kubernetes AI Toolchain Operator, can be used to simplify the infrastructure, and show how using Ubuntu pre-compiled drivers speeds up GPU node provisioning. Finally, we will discuss the challenges and best practices of using operators in production.

Session author's bio

Ganeshkumar is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this Kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine learning workloads and provide various driver configurations. He also has a prior academic and research background focused on deep learning at UC Berkeley and the Berkeley Artificial Intelligence Research lab.

Tariq Ibrahim is a Senior Cloud Platform Engineer on the Cloud Native team at NVIDIA, where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator. He has also contributed to several cloud native OSS projects, such as kube-state-metrics, Istio, external-dns, and cluster-api.

Social Media X: https://x.com/ganeshkumar_av, LinkedIn: https://www.linkedin.com/in/ganeshkumar-ashok/
Level of Difficulty Intermediate
