Job Scheduling
This guide describes how to use Kueue, Volcano Scheduler, and Scheduler Plugins with coscheduling to support gang-scheduling in Kubeflow, allowing jobs to run multiple pods at the same time.
Running jobs with gang-scheduling
The Training Operator and the MPI Operator support running jobs with gang-scheduling using Kueue, Volcano Scheduler, and Scheduler Plugins with coscheduling.
Using Kueue with Training Operator Jobs
Follow this guide to learn how to use Kueue with Training Operator Jobs and manage queues for your ML training jobs.
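As a minimal sketch of that flow, a TFJob can be submitted through Kueue by creating it in a suspended state and labeling it with the target LocalQueue. The queue name `user-queue` below is an assumed placeholder, not a name Kueue defines:

```yaml
# Hypothetical example: submit a TFJob through a Kueue LocalQueue.
# "user-queue" is an assumed placeholder name.
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-kueue
  namespace: kubeflow
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  runPolicy:
    suspend: true   # Kueue unsuspends the job once it is admitted
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-mnist-with-summaries:latest
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
```

Kueue holds the job in the queue while it is suspended and resumes it only when quota is available, which gives gang-like admission at the queue level.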
Scheduler Plugins with coscheduling
You must first install the Scheduler Plugins with coscheduling in your cluster, either as the Kubernetes default scheduler or as a secondary scheduler, and then configure the operator to select the scheduler name for gang-scheduling as follows:
- training-operator

  ```yaml
  ...
  spec:
    containers:
      - command:
          - /manager
          - --gang-scheduler-name=scheduler-plugins   # add this flag
        image: kubeflow/training-operator
        name: training-operator
  ...
  ```
- mpi-operator (scheduler-plugins installed as the default scheduler)

  ```yaml
  ...
  spec:
    containers:
      - args:
          - --gang-scheduling=default-scheduler   # add this flag
          - -alsologtostderr
          - --lock-namespace=mpi-operator
        image: mpioperator/mpi-operator:0.4.0
        name: mpi-operator
  ...
  ```
- mpi-operator (scheduler-plugins installed as a secondary scheduler)

  ```yaml
  ...
  spec:
    containers:
      - args:
          - --gang-scheduling=scheduler-plugins-scheduler   # add this flag
          - -alsologtostderr
          - --lock-namespace=mpi-operator
        image: mpioperator/mpi-operator:0.4.0
        name: mpi-operator
  ...
  ```
- Follow the instructions in the kubernetes-sigs/scheduler-plugins repository to install the Scheduler Plugins with coscheduling.
Note: The Scheduler Plugins and the operators in Kubeflow achieve gang-scheduling by using PodGroup. The operator creates the PodGroup for the job automatically.
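For reference, the PodGroup the operator creates looks roughly like the sketch below. The name, namespace, and replica count are illustrative assumptions; minMember corresponds to the job's total pod count:

```yaml
# Illustrative sketch of an operator-created PodGroup for a 2-worker job;
# the metadata values are assumptions for illustration.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: tfjob-simple
  namespace: kubeflow
spec:
  minMember: 2   # the coscheduling plugin waits until all 2 pods can be placed
```

You normally never create this object yourself; it is shown only to make the operator's behavior concrete.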
If you install the Scheduler Plugins in your cluster as a secondary scheduler, you need to specify the scheduler name in the CustomJob resources (e.g., TFJob), for example:
```yaml
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: scheduler-plugins-scheduler   # add this field
          containers:
            - name: tensorflow
              image: kubeflow/tf-mnist-with-summaries:latest
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
```
If you install the Scheduler Plugins as a default scheduler, you don’t need to specify the scheduler name in CustomJob resources (e.g., TFJob).
Volcano Scheduler
You must first install the Volcano scheduler in your cluster as a secondary scheduler of Kubernetes, and then configure the operator to select the scheduler name for gang-scheduling as follows:
- training-operator

  ```yaml
  ...
  spec:
    containers:
      - command:
          - /manager
          - --gang-scheduler-name=volcano   # add this flag
        image: kubeflow/training-operator
        name: training-operator
  ...
  ```
- mpi-operator

  ```yaml
  ...
  spec:
    containers:
      - args:
          - --gang-scheduling=volcano   # add this flag
          - -alsologtostderr
          - --lock-namespace=mpi-operator
        image: mpioperator/mpi-operator:0.4.0
        name: mpi-operator
  ...
  ```
- Follow the instructions in the volcano repository to install Volcano.
Note: The Volcano scheduler and the operator in Kubeflow achieve gang-scheduling by using PodGroup. The operator creates the PodGroup for the job automatically.
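For reference, a Volcano PodGroup looks roughly like the sketch below. The name and member count are illustrative assumptions; minMember corresponds to the job's total pod count:

```yaml
# Illustrative sketch of an operator-created Volcano PodGroup;
# the metadata values are assumptions for illustration.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tfjob-gang-scheduling
spec:
  minMember: 2   # e.g. one Worker pod plus one PS pod
```

As with the Scheduler Plugins, you do not create this object yourself; the operator manages it for the job.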
The YAML for scheduling your job as a gang with the Volcano scheduler is the same as for a non-gang scheduler, for example:
```yaml
apiVersion: "kubeflow.org/v1beta1"
kind: "TFJob"
metadata:
  name: "tfjob-gang-scheduling"
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          containers:
            - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=gpu
                - --data_format=NHWC
              image: gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              resources:
                limits:
                  nvidia.com/gpu: 1
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    PS:
      replicas: 1
      template:
        spec:
          containers:
            - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=cpu
                - --data_format=NHWC
              image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              resources:
                limits:
                  cpu: "1"
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
```
About gang-scheduling
When using Volcano Scheduler or the Scheduler Plugins with coscheduling to apply gang-scheduling, a job can run only if there are enough resources for all of its pods. Otherwise, all of the pods remain in a pending state, waiting for enough resources. For example, if a job requiring N pods is created and there are only enough resources to schedule N-2 pods, then all N pods of the job will stay pending.
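The all-or-nothing behavior can be illustrated outside Kubernetes with a toy model. The functions below are illustrative names, not real scheduler APIs; they only contrast gang admission with greedy pod-by-pod placement:

```python
# Toy model (not Kubernetes code): gang scheduling admits a job's pods
# all-or-nothing, while a non-gang scheduler places pods greedily.

def gang_schedule(free_slots: int, pods_needed: int) -> int:
    """Number of pods scheduled under gang-scheduling."""
    # Either every pod of the job fits, or none is scheduled.
    return pods_needed if free_slots >= pods_needed else 0

def greedy_schedule(free_slots: int, pods_needed: int) -> int:
    """Number of pods scheduled by a non-gang scheduler."""
    # Pods are placed one by one until resources run out.
    return min(free_slots, pods_needed)

# A job needing N pods on a cluster with room for only N-2 of them:
N = 5
print(gang_schedule(N - 2, N))    # 0: all N pods stay pending together
print(greedy_schedule(N - 2, N))  # 3: partial placement, wasting resources
```

The greedy case is why gang-scheduling matters for distributed training: partially placed workers hold resources while the job still cannot make progress.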
Note: under high workloads, if a pod of the job dies while the job is still running, other pods may get a chance to occupy the freed resources, which can cause a deadlock.
Troubleshooting
If you keep running into RBAC-related problems with your Volcano scheduler, try adding the following rules to the ClusterRole used by the Volcano scheduler:
```yaml
- apiGroups:
    - '*'
  resources:
    - '*'
  verbs:
    - '*'
```