Distributed Training with the Training Operator
This page shows the different distributed training strategies that you can use with the Training Operator.
Distributed Training for PyTorch
This diagram shows how the Training Operator creates PyTorch workers for the ring all-reduce algorithm.
You are responsible for writing the training code using native PyTorch Distributed APIs and creating a PyTorchJob with the required number of workers and GPUs using the Training Operator Python SDK. Then, the Training Operator creates Kubernetes pods with the appropriate environment variables for the torchrun CLI to start the distributed PyTorch training job. At the end of the ring all-reduce algorithm, gradients are synchronized across every worker (g1, g2, g3, g4) and the model is trained.
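For example, the training function below uses DistributedDataParallel and is submitted as a PyTorchJob through the SDK. This is a minimal sketch, not a definitive recipe: it assumes the kubeflow-training package's TrainingClient.create_job interface (with PyTorchJob as the default job kind) and one GPU per worker; the model, job name, and resource values are placeholders.

```python
from kubeflow.training import TrainingClient


def train_func():
    # Imports live inside the function because the SDK typically ships the
    # function source to each worker pod.
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # torchrun (started by the Training Operator) provides RANK, WORLD_SIZE,
    # LOCAL_RANK, MASTER_ADDR, and MASTER_PORT to every worker pod.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Placeholder model and data; replace with your own.
    model = DistributedDataParallel(torch.nn.Linear(10, 1).cuda())
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        optimizer.zero_grad()
        loss = model(torch.randn(32, 10).cuda()).sum()
        loss.backward()  # gradients are all-reduced across workers here
        optimizer.step()

    dist.destroy_process_group()


# Placeholder job name and resources; adjust to your cluster and SDK version.
TrainingClient().create_job(
    name="pytorch-ddp-example",
    train_func=train_func,
    num_workers=4,
    resources_per_worker={"gpu": "1"},
)
```

When the job runs, each worker's backward pass triggers the ring all-reduce shown in the diagram.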
You can define various distributed strategies supported by PyTorch in your training code (e.g. PyTorch FSDP), and the Training Operator will set the appropriate environment variables for torchrun.
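For instance, switching the sketch above from DDP to FSDP only changes how the model is wrapped; the torchrun environment variables are consumed the same way. Again a sketch, assuming PyTorch's torch.distributed.fsdp module and a placeholder model:

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def train_func():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # FSDP shards parameters, gradients, and optimizer state across the
    # workers created by the Training Operator, instead of replicating them.
    model = FSDP(torch.nn.Linear(10, 1).cuda())
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        optimizer.zero_grad()
        model(torch.randn(32, 10).cuda()).sum().backward()
        optimizer.step()

    dist.destroy_process_group()
```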
Distributed Training for TensorFlow
This diagram shows how the Training Operator creates the TensorFlow parameter server (PS) and workers for PS distributed training.
You are responsible for writing the training code using native TensorFlow Distributed APIs and creating a TFJob with the required number of PSs, workers, and GPUs using the Training Operator Python SDK. Then, the Training Operator creates Kubernetes pods with the appropriate TF_CONFIG environment variable to start the distributed TensorFlow training job. The parameter server splits the training data for every worker and averages the model weights based on the gradients produced by each worker.
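For reference, TF_CONFIG is a JSON document that describes the cluster and the pod's own role in it, and the training code typically consumes it through a cluster resolver rather than parsing it by hand. The sketch below assumes TensorFlow 2.x parameter server training; the host names in the comment are hypothetical, and the model is a placeholder.

```python
import tensorflow as tf

# The Training Operator sets TF_CONFIG on every pod. A worker's value looks
# roughly like this (host names are hypothetical):
#
# {
#   "cluster": {
#     "chief":  ["tfjob-example-chief-0:2222"],
#     "worker": ["tfjob-example-worker-0:2222", "tfjob-example-worker-1:2222"],
#     "ps":     ["tfjob-example-ps-0:2222"]
#   },
#   "task": {"type": "worker", "index": 0}
# }

# TFConfigClusterResolver reads TF_CONFIG from the environment.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

if cluster_resolver.task_type in ("worker", "ps"):
    # Workers and parameter servers run a server and wait for the chief.
    server = tf.distribute.Server(
        cluster_resolver.cluster_spec(),
        job_name=cluster_resolver.task_type,
        task_index=cluster_resolver.task_id,
        protocol="grpc",
    )
    server.join()
else:
    # The chief builds the strategy; variables are placed on the PS tasks and
    # gradients from every worker are applied to them.
    strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")
```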
You can define various distributed strategies supported by TensorFlow in your training code, and the Training Operator will set the appropriate TF_CONFIG environment variable.
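As an illustration, the sketch below uses MultiWorkerMirroredStrategy, a different TensorFlow strategy that also reads TF_CONFIG, and submits it as a TFJob. It assumes the kubeflow-training TrainingClient.create_job interface accepts job_kind="TFJob" together with a train_func; the job name, synthetic data, and resources are placeholders, so check the SDK reference for the exact arguments.

```python
from kubeflow.training import TrainingClient


def train_func():
    # Imports inside the function so the SDK can ship its source to each pod.
    import tensorflow as tf

    # MultiWorkerMirroredStrategy performs synchronous all-reduce training
    # across the workers, using the cluster layout from TF_CONFIG.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")

    # Placeholder synthetic data; replace with a real tf.data pipeline.
    x = tf.random.normal((1024, 10))
    y = tf.random.normal((1024, 1))
    model.fit(x, y, epochs=3, batch_size=64)


# Placeholder name and resources; adjust to your cluster and SDK version.
TrainingClient().create_job(
    name="tfjob-mirrored-example",
    job_kind="TFJob",
    train_func=train_func,
    num_workers=2,
    resources_per_worker={"gpu": "1"},
)
```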