Migrating to Kubeflow Trainer V2

How to migrate to the new Kubeflow Trainer V2.

Overview

Kubeflow Trainer is a significant update to the Kubeflow Training Operator project.

The key features introduced by Kubeflow Trainer are:

The new CRDs: TrainJob, TrainingRuntime, and ClusterTrainingRuntime APIs. These APIs enable the creation of templates for distributed model training and LLM fine-tuning. It abstracts the Kubernetes complexities, providing more intuitive experience for data scientists and ML engineers.
The Kubeflow Python SDK: to further enhance ML user experience and to provide seamless integration with Kubeflow Trainer APIs.
Custom dataset and model initializer: to streamline assets initialization across distributed training nodes and to reduce GPU cost by offloading I/O tasks to CPU workloads.
Enhanced MPI support: featuring MPI-Operator V2 features with SSH-based optimization to boost MPI performance.

TODO (andreyvelich): Add docs for migration.

Was this page helpful?

Thank you for your feedback!

We're sorry this page wasn't helpful. If you have a moment, please share your feedback so we can improve.