Overview
Note
The Kubeflow Trainer project is currently in alpha status, and its APIs may change. If you are using Kubeflow Training Operator V1, refer to this migration document.
For legacy Kubeflow Training Operator V1 documentation, see these guides.
What is Kubeflow Trainer
Kubeflow Trainer is a Kubernetes-native project designed for fine-tuning large language models (LLMs) and for scalable, distributed training of machine learning (ML) models across various frameworks, including PyTorch, JAX, TensorFlow, and XGBoost.
You can integrate other ML libraries, such as HuggingFace, DeepSpeed, or Megatron-LM, with Kubeflow Trainer to orchestrate their training on Kubernetes.
Kubeflow Trainer allows you to effortlessly develop your LLMs with the Kubeflow Python SDK and build Kubernetes-native Training Runtimes with Kubernetes Custom Resources APIs.
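For example, you can submit a simple distributed PyTorch TrainJob from a Python session. The following is a minimal sketch assuming the alpha Python SDK surface (`TrainerClient`, `CustomTrainer`, and a pre-installed `torch-distributed` runtime); since the project is in alpha, these names may change between releases.

```python
from kubeflow.trainer import CustomTrainer, TrainerClient

def train_func():
    # Runs on every training node; the runtime launches it under torchrun,
    # which sets the environment variables used by init_process_group().
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")
    print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()

client = TrainerClient()

# Scaling from a single machine to a distributed job is a matter of
# changing num_nodes; the runtime handles the rendezvous configuration.
job_name = client.train(
    trainer=CustomTrainer(
        func=train_func,
        num_nodes=2,
        resources_per_node={"cpu": 2, "memory": "8Gi"},
    ),
    runtime=client.get_runtime("torch-distributed"),
)
print(f"Created TrainJob: {job_name}")
```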
Who is this for
Kubeflow Trainer is designed for three primary user personas, each with specific resources and responsibilities:
User Personas
The Kubeflow Trainer documentation is organized around these user personas:
- ML Users: engineers and scientists who develop AI models using the Kubeflow Python SDK and TrainJob.
- Cluster Operators: administrators responsible for managing Kubernetes clusters and Kubeflow Training Runtimes.
- Contributors: open source contributors working on the Kubeflow Trainer project.
Kubeflow Trainer Introduction
Watch the following KubeCon + CloudNativeCon 2024 talk, which provides an overview of Kubeflow Trainer:
Why use Kubeflow Trainer
Kubeflow Trainer supports key phases of the AI/ML lifecycle, including model training and LLM fine-tuning, as shown in the diagram below:
Key Benefits
- Simple and Scalable for Distributed Training and LLM Fine-Tuning
Effortlessly scale your model training from a single machine to large distributed Kubernetes clusters using Kubeflow Python APIs and supported Training Runtimes.
- Extensible and Portable
Deploy Kubeflow Trainer on any cloud platform with a Kubernetes cluster and integrate your own ML frameworks in any programming language.
- Blueprints for LLM Fine-Tuning
Fine-tune the latest LLMs on Kubernetes with ready-to-use Kubeflow LLM blueprints.
- Reduce GPU Cost
Kubeflow Trainer implements custom dataset and model initializers to reduce GPU cost by offloading I/O tasks to CPU workloads and to streamline asset initialization across distributed training nodes, as sketched in the example after this list.
- Seamless Kubernetes Integration
Optimize GPU utilization and enable gang scheduling for ML workloads by leveraging Kubernetes projects such as Kueue, Coscheduling, Volcano, or YuniKorn.
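The dataset and model initializers mentioned under Reduce GPU Cost can be attached to a TrainJob from the same Python SDK. The sketch below assumes the alpha SDK's `Initializer`, `HuggingFaceDatasetInitializer`, and `HuggingFaceModelInitializer` types, and uses illustrative HuggingFace URIs and a placeholder `fine_tune` function; treat the exact names as subject to change while the APIs are alpha.

```python
from kubeflow.trainer import (
    CustomTrainer,
    HuggingFaceDatasetInitializer,
    HuggingFaceModelInitializer,
    Initializer,
    TrainerClient,
)

def fine_tune():
    # Placeholder training function. By the time it runs, the dataset and
    # base model have been downloaded to a shared volume by CPU-only
    # initializer Pods, so the GPU nodes spend no time on I/O.
    ...

client = TrainerClient()

job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    # The I/O-heavy downloads run as CPU workloads before the training
    # nodes start, keeping expensive accelerators from sitting idle.
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(storage_uri="hf://tatsu-lab/alpaca"),
        model=HuggingFaceModelInitializer(storage_uri="hf://meta-llama/Llama-3.2-1B"),
    ),
    trainer=CustomTrainer(func=fine_tune, num_nodes=2),
)
```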
Next steps
Run your first Kubeflow TrainJob by following the Getting Started guide.