Architecture

The Training Operator Architecture

What is the Training Operator Architecture?

The original design was drafted in April 2021 and is available here for reference. The goal was to provide a unified Kubernetes operator that supports multiple machine learning/deep learning frameworks. This was done by having a “Frontend” operator that decomposes the job into different configurable Kubernetes components (e.g., Role, PodTemplate, Fault-Tolerance, etc.), watches all Role Custom Resources, and manages pod performance. The dedicated “Backend” operator was not implemented; its responsibilities were instead consolidated into the “Frontend” operator.

The benefits of this approach were:

  1. Shared testing and release infrastructure
  2. Production-grade features such as manifests and metadata support
  3. Simpler Kubeflow releases
  4. A Single Source of Truth (SSOT) for other Kubeflow components to interact with

The V1 Training Operator architecture is shown in the diagram below:

Training Operator V1 Architecture

The diagram displays PyTorchJob and its configured communication methods, but each framework can have its own approach(es) to communicating across pods. Additionally, each framework can have its own set of configurable resources.
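To make the idea of per-framework configurable resources more concrete, the following is a minimal sketch of a PyTorchJob manifest built as a plain Python dictionary. It assumes the `kubeflow.org/v1` API, in which a PyTorchJob declares `Master` and `Worker` replica specs, each with a replica count, a restart policy, and a pod template. The helper name `build_pytorchjob`, the job name, and the image are hypothetical and exist only for illustration.

```python
import json


def build_pytorchjob(name, image, workers=2, restart_policy="OnFailure"):
    """Return a PyTorchJob manifest as a dict (no cluster access required).

    Hypothetical helper: the name, image, and defaults are illustrative.
    """

    def replica(count):
        # Each role pairs a replica count and restart policy with a pod template.
        return {
            "replicas": count,
            "restartPolicy": restart_policy,
            "template": {
                "spec": {
                    # The training container is conventionally named "pytorch".
                    "containers": [{"name": "pytorch", "image": image}]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


# Placeholder values; substitute a real job name and training image.
manifest = build_pytorchjob("example-job", "example.com/train:latest", workers=2)
print(json.dumps(manifest, indent=2))
```

Other frameworks follow the same pattern with their own kind and replica roles, which is what allows a single operator to manage them.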

As a concrete example, PyTorch has several communication backends available; see the source code documentation for the full list.
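As a rough illustration of how a training script might pick one of these backends, the sketch below chooses between `gloo` and `nccl` and reads the rendezvous settings from the environment. It assumes each pod receives the `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` environment variables, which is what the default `env://` initialization in `torch.distributed` expects. The helper names are hypothetical, and the list of backends covers only the three long-standing ones; which backends are actually usable depends on the PyTorch build.

```python
import os

# Long-standing torch.distributed backends; availability depends on the build.
KNOWN_BACKENDS = ("gloo", "nccl", "mpi")


def choose_backend(cuda_available, requested=None):
    """Pick a backend: honor an explicit request, else nccl on GPU, gloo on CPU."""
    if requested is not None:
        if requested not in KNOWN_BACKENDS:
            raise ValueError(f"unknown backend: {requested!r}")
        return requested
    return "nccl" if cuda_available else "gloo"


def rendezvous_from_env(env=os.environ):
    """Read the rendezvous settings that env:// initialization relies on."""
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }


def init_distributed(requested=None):
    """Initialize the default process group (requires PyTorch to be installed)."""
    import torch
    import torch.distributed as dist

    backend = choose_backend(torch.cuda.is_available(), requested)
    # With no init_method given, PyTorch uses env:// and reads the variables above.
    dist.init_process_group(backend=backend)
    return backend
```

Inside a training container, calling `init_distributed()` once at startup would be enough under these assumptions; the two pure helpers can be exercised without PyTorch installed.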
