Overview

An overview of the Training Operator

What is the Training Operator

The Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with different ML frameworks such as PyTorch, TensorFlow, XGBoost, JAX, and others.

You can integrate other ML libraries such as HuggingFace, DeepSpeed, or Megatron-LM with the Training Operator to orchestrate their ML training on Kubernetes.

The Training Operator allows you to use Kubernetes workloads to effectively train your large models via the Kubernetes Custom Resources APIs or the Training Operator Python SDK.
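
For example, a distributed training job can be created with a few lines of Python. The sketch below is a minimal, hypothetical example that assumes the `TrainingClient` API from the `kubeflow-training` Python SDK; method names and parameters may differ between SDK versions, and the job name is a placeholder.

```python
from kubeflow.training import TrainingClient

def train_func():
    # Runs inside every worker Pod. For PyTorchJob, the Training Operator
    # injects the PyTorch distributed environment (MASTER_ADDR, RANK, WORLD_SIZE, ...).
    import torch.distributed as dist
    dist.init_process_group(backend="gloo")
    print(f"Hello from rank {dist.get_rank()} of {dist.get_world_size()}")

client = TrainingClient()

# Create a PyTorchJob that runs train_func on two workers.
client.create_job(
    name="pytorch-dist-example",  # placeholder job name
    train_func=train_func,
    num_workers=2,
)

# Inspect the job's Pod logs once it is running.
client.get_job_logs(name="pytorch-dist-example")
```

The same Python workflow applies to the other supported job kinds, while the underlying Custom Resources can also be written directly as YAML manifests.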

The Training Operator implements a centralized Kubernetes controller to orchestrate distributed training jobs.

You can run high-performance computing (HPC) tasks with the Training Operator and MPIJob, since it supports running Message Passing Interface (MPI) on Kubernetes, which is heavily used for HPC. The Training Operator implements the V1 API version of the MPI Operator. For the MPI Operator V2 version, please follow this guide to install MPI Operator V2.

(Diagram: Training Operator overview.)

The Training Operator is responsible for scheduling the appropriate Kubernetes workloads to implement various distributed training strategies for different ML frameworks.

Why use the Training Operator

The Training Operator addresses the Model Training and Model Fine-Tuning steps in the AI/ML lifecycle, as shown in the diagram below:

(Diagram: the Training Operator in the AI/ML lifecycle.)

  • The Training Operator simplifies running distributed training and fine-tuning.

You can easily scale your model training from a single machine to a large-scale distributed Kubernetes cluster using the APIs and interfaces provided by the Training Operator.

  • The Training Operator is extensible and portable.

You can deploy the Training Operator on any cloud where you have a Kubernetes cluster, and you can integrate your own ML frameworks, written in any programming language, with the Training Operator.

  • The Training Operator is integrated with the Kubernetes ecosystem.

You can leverage advanced Kubernetes scheduling projects such as Kueue, Volcano, and YuniKorn with the Training Operator to reduce the cost of your ML training resources.

Custom Resources for ML Frameworks

To perform distributed training, the Training Operator implements the following Custom Resources, one for each ML framework:

ML Framework | Custom Resource
------------ | ---------------
PyTorch      | PyTorchJob
TensorFlow   | TFJob
XGBoost      | XGBoostJob
MPI          | MPIJob
PaddlePaddle | PaddleJob
JAX          | JAXJob
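
To illustrate how these Custom Resources are consumed, the sketch below submits a minimal PyTorchJob through the generic Kubernetes Python client; the container image, namespace, and replica counts are placeholder values, and the equivalent YAML manifest could just as well be applied with kubectl.

```python
from kubernetes import client, config

# A minimal PyTorchJob expressed as a Python dict (placeholder image and namespace).
pytorchjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "pytorch-simple", "namespace": "kubeflow"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [
                    {"name": "pytorch", "image": "example.com/train:latest"},
                ]}},
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [
                    {"name": "pytorch", "image": "example.com/train:latest"},
                ]}},
            },
        }
    },
}

# Submit the Custom Resource; the Training Operator controller
# reconciles it into Master and Worker Pods.
config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="kubeflow",
    plural="pytorchjobs",
    body=pytorchjob,
)
```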

Next steps
