Overview

An overview of Kubeflow Trainer

What is Kubeflow Trainer

Kubeflow Trainer is a Kubernetes-native project designed for fine-tuning large language models (LLMs) and for scalable, distributed training of machine learning (ML) models across various frameworks, including PyTorch, JAX, TensorFlow, and XGBoost.

You can integrate other ML libraries, such as HuggingFace, DeepSpeed, or Megatron-LM, with Kubeflow Trainer to orchestrate their training on Kubernetes.

Kubeflow Trainer allows you to develop your LLMs with the Kubeflow Python SDK and to build Kubernetes-native Training Runtimes with the Kubernetes Custom Resources APIs.
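The following is a minimal sketch of that SDK workflow: it submits a simple PyTorch training function as a distributed TrainJob. The names used here (TrainerClient, CustomTrainer, num_nodes, resources_per_node, and the torch-distributed runtime) follow current SDK examples but may differ between SDK releases, so treat the snippet as illustrative rather than a definitive API reference.

```python
# Minimal sketch: submit a distributed PyTorch TrainJob with the Kubeflow
# Python SDK. Parameter names and the "torch-distributed" runtime reflect
# current SDK examples and may change between releases.
from kubeflow.trainer import TrainerClient, CustomTrainer


def train_fn():
    """Training function executed on every node of the TrainJob."""
    import torch.distributed as dist

    # Kubeflow Trainer configures the distributed environment (ranks, world size).
    dist.init_process_group(backend="gloo")
    print(f"rank={dist.get_rank()} world_size={dist.get_world_size()}")

    # ... model definition, data loading, and training loop go here ...

    dist.destroy_process_group()


client = TrainerClient()

# Scale from a single machine to many nodes by changing num_nodes.
job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=train_fn,
        num_nodes=2,
        resources_per_node={"cpu": "4", "memory": "8Gi"},
    ),
)
print(f"Created TrainJob: {job_name}")
```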

Kubeflow Trainer Tech Stack

Who is this for

Kubeflow Trainer is designed for two primary user personas, each with specific resources and responsibilities:

Kubeflow Trainer Personas

User Personas

Kubeflow Trainer documentation is organized around these user personas:

Kubeflow Trainer Introduction

Watch the following KubeCon + CloudNativeCon 2024 talk, which provides an overview of Kubeflow Trainer:

Why use Kubeflow Trainer

Kubeflow Trainer supports key phases of the AI/ML lifecycle, including model training and LLM fine-tuning, as shown in the diagram below:

AI/ML Lifecycle Trainer

Key Benefits

  • Simple and Scalable for Distributed Training and LLM Fine-Tuning

Effortlessly scale your model training from a single machine to large distributed Kubernetes clusters using Kubeflow Python APIs and supported Training Runtimes.

  • Extensible and Portable

Deploy Kubeflow Trainer on any cloud platform with a Kubernetes cluster and integrate your own ML frameworks in any programming language.

  • Blueprints for LLM Fine-Tuning

Fine-tune the latest LLMs on Kubernetes with ready-to-use Kubeflow LLM blueprints.

  • Reduce GPU Cost

Kubeflow Trainer implements custom dataset and model initializers to reduce GPU cost by offloading I/O tasks to CPU workloads and to streamline asset initialization across distributed training nodes; see the sketch after this list.

  • Seamless Kubernetes Integration

Optimize GPU utilization and gang scheduling for ML workloads by leveraging Kubernetes projects such as Kueue, Coscheduling, Volcano, or YuniKorn.
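As an illustration of the initializer-based GPU cost reduction referenced above, here is a hedged sketch of pre-loading a dataset and model in CPU-only initializer steps before GPU training starts. The class and argument names (Initializer, HuggingFaceDatasetInitializer, HuggingFaceModelInitializer, storage_uri) and the hf:// URIs are assumptions drawn from SDK examples and may not match your SDK version exactly.

```python
# Hedged sketch of dataset/model initializers: assets are downloaded by
# CPU-only initializer steps and shared with the GPU training nodes, so GPUs
# are not idle during I/O. Class names and storage_uri values below are
# assumptions; check the SDK reference for your version.
from kubeflow.trainer import (
    CustomTrainer,
    HuggingFaceDatasetInitializer,
    HuggingFaceModelInitializer,
    Initializer,
    TrainerClient,
)


def fine_tune_fn():
    # The initialized dataset and model are made available to the training
    # nodes through volumes provisioned by the Training Runtime.
    ...


client = TrainerClient()

job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    # I/O-heavy downloads run in initializer steps on CPU workloads.
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(storage_uri="hf://tatsu-lab/alpaca"),
        model=HuggingFaceModelInitializer(storage_uri="hf://meta-llama/Llama-3.2-1B"),
    ),
    trainer=CustomTrainer(func=fine_tune_fn, num_nodes=2),
)
print(f"Created TrainJob: {job_name}")
```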

Next steps

Run your first Kubeflow TrainJob by following the Getting Started guide.
