Distributed Training with the Training Operator

How the Training Operator performs distributed training on Kubernetes

This page describes the distributed training strategies that can be used with the Training Operator.

Distributed Training for PyTorch

This diagram shows how the Training Operator creates PyTorch workers for the ring all-reduce algorithm.

Distributed PyTorchJob

You are responsible for writing the training code using native PyTorch Distributed APIs and creating a PyTorchJob with the required number of workers and GPUs using the Training Operator Python SDK. Then, the Training Operator creates Kubernetes pods with the appropriate environment variables for the torchrun CLI to start the distributed PyTorch training job.
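For example, a PyTorchJob can be created from Python with the Training Operator SDK (the kubeflow-training package). The snippet below is a minimal sketch that assumes the TrainingClient.create_job API of recent SDK releases; the job name is hypothetical, and argument names can differ between SDK versions.

```python
from kubeflow.training import TrainingClient


def train_func():
    """Native PyTorch Distributed training code goes here (see the FSDP sketch below)."""
    ...


# A minimal sketch, assuming the create_job API of recent
# kubeflow-training SDK releases; argument names can vary by version.
TrainingClient().create_job(
    name="pytorch-ring-allreduce",    # hypothetical job name
    train_func=train_func,
    num_workers=4,                    # one pod per worker in the ring
    resources_per_worker={"gpu": 1},  # request one GPU per worker
)
```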

At the end of the ring all-reduce algorithm, gradients are synchronized across every worker (g1, g2, g3, g4) and the model is trained.

You can define various distributed strategies supported by PyTorch in your training code (e.g. PyTorch FSDP), and the Training Operator will set the appropriate environment variables for torchrun.
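For instance, a training function that uses PyTorch FSDP only needs to initialize the default process group from the environment that torchrun provides. The following is a minimal sketch with a toy model and no real training loop:

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# torchrun populates RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and
# LOCAL_RANK from the environment variables that the Training Operator
# sets on each worker pod, so env-based initialization just works.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()  # toy model for illustration
model = FSDP(model)                         # shard parameters across workers

# ... training loop: forward, backward, optimizer.step() ...

dist.destroy_process_group()
```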

Distributed Training for TensorFlow

This diagram shows how the Training Operator creates the TensorFlow parameter server (PS) and workers for PS distributed training.

Distributed TFJob

You are responsible for writing the training code using native TensorFlow Distributed APIs and creating a TFJob with the required number of PSs, workers, and GPUs using the Training Operator Python SDK. Then, the Training Operator creates Kubernetes pods with the appropriate TF_CONFIG environment variable to start the distributed TensorFlow training job.
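Concretely, TF_CONFIG is a JSON document that describes the cluster and the role of the current task. The sketch below shows roughly what a worker pod might receive; the hostnames, ports, and replica counts are illustrative, and the real value is injected by the Training Operator:

```python
import json
import os

# Illustrative TF_CONFIG for a TFJob with two PSs and two workers.
# The Training Operator injects the real value; the addresses below
# are placeholders.
tf_config = {
    "cluster": {
        "ps": ["tfjob-demo-ps-0:2222", "tfjob-demo-ps-1:2222"],
        "worker": ["tfjob-demo-worker-0:2222", "tfjob-demo-worker-1:2222"],
    },
    # Each pod receives a different "task" entry naming its own role.
    "task": {"type": "worker", "index": 0},
}

# Inside a pod created by the Training Operator, training code reads
# the injected value back from the environment:
cluster_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
```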

The parameter server splits the training data across the workers and averages the model weights using the gradients produced by each worker.

You can define various distributed strategies supported by TensorFlow in your training code, and the Training Operator will set the appropriate TF_CONFIG environment variable.
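For example, training code for parameter server training can pick up TF_CONFIG through a cluster resolver. The following is a minimal sketch based on the TensorFlow 2 ParameterServerStrategy API; details such as the exact strategy class and the coordinator setup can vary between TensorFlow versions:

```python
import tensorflow as tf

# The Training Operator injects TF_CONFIG into every pod; each task
# learns its role (ps, worker, or chief) from it.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

if cluster_resolver.task_type in ("worker", "ps"):
    # Workers and parameter servers run a standard TensorFlow server
    # and wait for the coordinator to dispatch work to them.
    server = tf.distribute.Server(
        cluster_resolver.cluster_spec(),
        job_name=cluster_resolver.task_type,
        task_index=cluster_resolver.task_id,
        protocol=cluster_resolver.rpc_layer or "grpc",
    )
    server.join()
else:
    # The chief builds the strategy and defines the model under its scope.
    strategy = tf.distribute.experimental.ParameterServerStrategy(
        cluster_resolver)
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # toy model
        model.compile(optimizer="sgd", loss="mse")
    # ... model.fit(...) or a ClusterCoordinator-based custom loop ...
```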
