Architecture

The Training Operator Architecture

Old Version

This page is about Kubeflow Training Operator V1. For the latest information, check the Kubeflow Trainer V2 documentation.

Follow this guide for migrating to Kubeflow Trainer V2.

What is the Training Operator Architecture?

The original design was drafted in April 2021 and is available here for reference. The goal was to provide a unified Kubernetes operator that supports multiple machine learning and deep learning frameworks. This was done with a “Frontend” operator that decomposes the job into different configurable Kubernetes components (e.g., Role, PodTemplate, Fault-Tolerance), watches all Role Custom Resources, and manages pod performance. The dedicated “Backend” operator was not implemented; its responsibilities were instead consolidated into the “Frontend” operator.

The benefits of this approach were:

Shared testing and release infrastructure
Unlocked production-grade features like manifests and metadata support
Simpler Kubeflow releases
A Single Source of Truth (SSOT) for other Kubeflow components to interact with
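
To make the decomposition into roles, pod templates, and fault-tolerance settings more concrete, the sketch below builds a PyTorchJob manifest as a plain Python dictionary. It is only an illustration: the helper names `build_pytorchjob` and `expected_pod_count`, the job name, and the container image are hypothetical and do not come from the Training Operator code base. The field layout (`pytorchReplicaSpecs`, `replicas`, `restartPolicy`, `template`) follows the V1 PyTorchJob custom resource.

```python
import json


def build_pytorchjob(name, image, workers, restart_policy="OnFailure"):
    """Return a PyTorchJob manifest as a dict (hypothetical helper, illustrative values)."""

    def replica(count):
        return {
            "replicas": count,                # number of pods playing this role
            "restartPolicy": restart_policy,  # fault-tolerance behaviour for this role
            "template": {                     # ordinary Kubernetes pod template
                "spec": {"containers": [{"name": "pytorch", "image": image}]}
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),       # one role per replica type
                "Worker": replica(workers),
            }
        },
    }


def expected_pod_count(manifest):
    """Sum the replicas over all roles declared in the manifest."""
    specs = manifest["spec"]["pytorchReplicaSpecs"]
    return sum(role["replicas"] for role in specs.values())


if __name__ == "__main__":
    # "demo-job" and the image name are placeholders, not real resources.
    job = build_pytorchjob("demo-job", "example.com/train:latest", workers=2)
    print(json.dumps(job, indent=2))
    print("pods the operator would manage:", expected_pod_count(job))
```

Each replica type carries its own count, restart policy, and pod template, which is roughly how a single job resource can be split into separately configurable components.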

The V1 Training Operator architecture is shown in the diagram below:

Training Operator V1 Architecture

The diagram displays PyTorchJob and its configured communication methods, but each framework can have its own approach(es) to communicating across pods. Additionally, each framework can have its own set of configurable resources.

As a concrete example, PyTorch has several communication backends available; see the source code documentation for the full list.
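
As a rough sketch of how a training script running inside a PyTorchJob pod might pick one of these backends, the example below chooses between `gloo` and `nccl` (two of PyTorch's built-in backends) and reads the rendezvous environment variables `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK`. The helper names and the fallback defaults are assumptions made for illustration, and the GPU-based selection rule is one common convention rather than something the operator enforces.

```python
import os


def choose_backend(use_gpu):
    """Pick a backend name: 'nccl' for GPU training, 'gloo' otherwise (a common convention)."""
    return "nccl" if use_gpu else "gloo"


def read_rendezvous_env(env=None):
    """Collect the rendezvous settings from the environment.

    The fallback values are assumptions for running outside a cluster.
    """
    env = os.environ if env is None else env
    return {
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
    }


def init_distributed():
    """Initialise torch.distributed; requires PyTorch and the env variables above."""
    import torch  # imported lazily so the helpers above work without PyTorch
    import torch.distributed as dist

    backend = choose_backend(torch.cuda.is_available())
    # "env://" tells PyTorch to read MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK
    # directly from the environment.
    dist.init_process_group(backend=backend, init_method="env://")
    return backend


if __name__ == "__main__":
    print("rendezvous settings:", read_rendezvous_env())
    print("backend for a CPU-only pod:", choose_backend(use_gpu=False))
```

Only `init_distributed` needs PyTorch installed; the other two helpers are plain Python and can be exercised locally to check how a pod's environment would be interpreted.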


Last modified March 29, 2025: website: Add dark theme (#3981) (4f092f15)