Overview

An overview of Kubeflow Trainer

What is Kubeflow Trainer

Kubeflow Trainer is a Kubernetes-native project designed for fine-tuning large language models (LLMs) and for scalable, distributed training of machine learning (ML) models across various frameworks, including PyTorch, JAX, TensorFlow, and XGBoost.

You can integrate other ML libraries, such as HuggingFace, DeepSpeed, or Megatron-LM, with Kubeflow Trainer to orchestrate their training on Kubernetes.

Kubeflow Trainer allows you to develop your LLMs with the Kubeflow Python SDK and to build Kubernetes-native Training Runtimes with the Kubernetes Custom Resources APIs.
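The following is a minimal sketch of that SDK workflow: it submits a simple PyTorch training function as a distributed TrainJob. The names used here (TrainerClient, CustomTrainer, num_nodes, resources_per_node, and the torch-distributed runtime) follow current SDK examples but may differ between SDK releases, so treat the snippet as illustrative rather than a definitive API reference.

```python
# Minimal sketch: submit a distributed PyTorch TrainJob with the Kubeflow
# Python SDK. Parameter names and the "torch-distributed" runtime reflect
# current SDK examples and may change between releases.
from kubeflow.trainer import TrainerClient, CustomTrainer


def train_fn():
    """Training function executed on every node of the TrainJob."""
    import torch.distributed as dist

    # Kubeflow Trainer configures the distributed environment (ranks, world size).
    dist.init_process_group(backend="gloo")
    print(f"rank={dist.get_rank()} world_size={dist.get_world_size()}")

    # ... model definition, data loading, and training loop go here ...

    dist.destroy_process_group()


client = TrainerClient()

# Scale from a single machine to many nodes by changing num_nodes.
job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=train_fn,
        num_nodes=2,
        resources_per_node={"cpu": "4", "memory": "8Gi"},
    ),
)
print(f"Created TrainJob: {job_name}")
```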

Kubeflow Trainer Tech Stack

Who is this for

Kubeflow Trainer is designed for two primary user personas, each with specific resources and responsibilities:

Kubeflow Trainer Personas

User Personas

Kubeflow Trainer documentation is organized around these user personas:

Kubeflow Trainer Introduction

Watch the following KubeCon + CloudNativeCon 2024 talk, which provides an overview of Kubeflow Trainer:

Why use Kubeflow Trainer

Kubeflow Trainer supports key phases of the AI/ML lifecycle, including model training and LLM fine-tuning, as shown in the diagram below:

AI/ML Lifecycle Trainer

Key Benefits

  • Simple and Scalable for Distributed Training and LLM Fine-Tuning

Effortlessly scale your model training from a single machine to large distributed Kubernetes clusters using Kubeflow Python APIs and supported Training Runtimes.

  • Extensible and Portable

Deploy Kubeflow Trainer on any cloud platform with a Kubernetes cluster and integrate your own ML frameworks in any programming language.

  • Blueprints for LLM Fine-Tuning

Fine-tune the latest LLMs on Kubernetes with ready-to-use Kubeflow LLM blueprints.

  • Reduce GPU Cost

Kubeflow Trainer implements custom dataset and model initializers to reduce GPU cost by offloading I/O tasks to CPU workloads and to streamline asset initialization across distributed training nodes; see the sketch after this list.

  • Seamless Kubernetes Integration

Optimize GPU utilization and gang scheduling for ML workloads by leveraging Kubernetes projects such as Kueue, Coscheduling, Volcano, or YuniKorn.
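As an illustration of the initializer-based GPU cost reduction referenced above, here is a hedged sketch of pre-loading a dataset and model in CPU-only initializer steps before GPU training starts. The class and argument names (Initializer, HuggingFaceDatasetInitializer, HuggingFaceModelInitializer, storage_uri) and the hf:// URIs are assumptions drawn from SDK examples and may not match your SDK version exactly.

```python
# Hedged sketch of dataset/model initializers: assets are downloaded by
# CPU-only initializer steps and shared with the GPU training nodes, so GPUs
# are not idle during I/O. Class names and storage_uri values below are
# assumptions; check the SDK reference for your version.
from kubeflow.trainer import (
    CustomTrainer,
    HuggingFaceDatasetInitializer,
    HuggingFaceModelInitializer,
    Initializer,
    TrainerClient,
)


def fine_tune_fn():
    # The initialized dataset and model are made available to the training
    # nodes through volumes provisioned by the Training Runtime.
    ...


client = TrainerClient()

job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    # I/O-heavy downloads run in initializer steps on CPU workloads.
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(storage_uri="hf://tatsu-lab/alpaca"),
        model=HuggingFaceModelInitializer(storage_uri="hf://meta-llama/Llama-3.2-1B"),
    ),
    trainer=CustomTrainer(func=fine_tune_fn, num_nodes=2),
)
print(f"Created TrainJob: {job_name}")
```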

Next steps

Run your first Kubeflow TrainJob by following the Getting Started guide.
