How to use Trial Templates
This guide describes how to configure Trial template parameters and use a custom Kubernetes CRD in Katib Trials. You will learn how to change the Trial template specification, how to use Kubernetes ConfigMaps to store templates, and how to modify the Katib controller to support your Kubernetes CRD in Katib Experiments.
Katib dynamically supports any kind of Kubernetes CRD as a Trial’s Worker. The Katib examples include Trial Workers based on Kubernetes Job, Kubeflow TFJob, PyTorchJob, MXJob and XGBoostJob, and Tekton Pipelines.
To use your own Kubernetes resource, follow the steps below.
How to use Trial Template
To run a Katib Experiment, you have to specify a Trial template for the Worker job where the actual model training runs.
Configure Trial Template Specification
The Trial template specification is located under .spec.trialTemplate of your Experiment.
To define a Trial, you should specify these parameters in .spec.trialTemplate:
trialParameters - the list of parameters that are used in the Trial template during Experiment execution.
  Note: Your Trial template must contain each parameter from trialParameters. You can set these parameters in any field of your template, except .metadata.name and .metadata.namespace. For example, your training container can receive hyperparameters as command-line arguments or as environment variables.
  Your Experiment’s Suggestion produces trialParameters before running the Trial. Each trialParameter has this structure:
    name - the parameter name that is replaced in your template.
    description (optional) - the description of the parameter.
    reference - the parameter name that the Experiment’s Suggestion returns. Usually, for hyperparameter tuning, parameter references are equal to the parameter names in the Experiment search space. For example, in the grid example the search space has two parameters (lr and momentum) and trialParameters contains each of these parameters in reference.
You have to define your Trial template in one of the trialSpec or configMap sources.
  Note: Your template must omit .metadata.name and .metadata.namespace.
  To set the parameters from trialParameters, use the expression ${trialParameters.<parameter-name>} in your template. Katib automatically replaces it with the appropriate value from the Suggestion. For example, --lr=${trialParameters.learningRate} passes the learningRate parameter.
    trialSpec - the Trial template in unstructured format. The template should be valid YAML.
    configMap - the Kubernetes ConfigMap specification where the Trial template is located. This ConfigMap must have the label katib.kubeflow.org/component: trial-templates and contain key-value pairs, where key: <template-name>, value: <template-yaml>. Check the example of the ConfigMap with Trial templates.
      The configMap specification should have:
        configMapName - the ConfigMap name with the Trial templates.
        configMapNamespace - the ConfigMap namespace with the Trial templates.
        templatePath - the ConfigMap’s data path to the template.
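The parameters above can be sketched as one small Experiment that uses the trialSpec source. This is an illustrative sketch, not a definitive manifest: the Experiment name, namespace, objective, search space bounds, training image and script path are assumptions chosen for the example. The parts that follow from the text are the trialParameters entries (name, description, reference) and the ${trialParameters.<parameter-name>} placeholders.

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: trial-template-example      # hypothetical name
  namespace: kubeflow               # assumed namespace
spec:
  objective:
    type: minimize
    goal: 0.001
    objectiveMetricName: loss
  algorithm:
    algorithmName: random
  parallelTrialCount: 2
  maxTrialCount: 6
  maxFailedTrialCount: 2
  # Search space: the names here are what "reference" points to.
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: momentum
      parameterType: double
      feasibleSpace:
        min: "0.5"
        max: "0.9"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate          # used as ${trialParameters.learningRate}
        description: Learning rate for the training model
        reference: lr               # search space parameter name
      - name: momentum
        description: Momentum for the training model
        reference: momentum
    # The template omits .metadata.name and .metadata.namespace.
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest   # assumed image
                command:
                  - "python3"
                  - "/opt/pytorch-mnist/mnist.py"                         # assumed script path
                  - "--epochs=1"
                  - "--lr=${trialParameters.learningRate}"
                  - "--momentum=${trialParameters.momentum}"
            restartPolicy: Never
```

Note that the template parameter name (learningRate) and the search space name (lr) can differ; reference is what links the two.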
The .spec.trialTemplate parameters below control Trial behavior. If a parameter has a default value, you can omit it from the Experiment YAML.
retain - indicates that the Trial’s resources are not cleaned up after the Trial is complete. Check the example with the retain: true parameter.
  The default value is false.
primaryPodLabels - the labels of the Trial Worker’s Pod or Pods. The Katib metrics collector is injected into these Pods.
  Note: If primaryPodLabels is omitted, the Katib metrics collector wraps all of the Worker’s Pods. Check the example with primaryPodLabels.
  The default value for Kubeflow TFJob, PyTorchJob, MXJob, and XGBoostJob is job-role: master
  The primaryPodLabels default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source, you have to set primaryPodLabels manually.
primaryContainerName - the name of the training container where the actual model training runs. The Katib metrics collector wraps this container to collect the required metrics for a single Experiment optimization step.
successCondition - the Trial Worker’s object status in which the Trial’s job has succeeded. This condition must be in GJSON format. Check the example with successCondition.
  The default value for Kubernetes Job is: status.conditions.#(type=="Complete")#|#(status=="True")#
  The default value for Kubeflow TFJob, PyTorchJob, MXJob, and XGBoostJob is: status.conditions.#(type=="Succeeded")#|#(status=="True")#
  The successCondition default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source, you have to set successCondition manually.
failureCondition - the Trial Worker’s object status in which the Trial’s job has failed. This condition must be in GJSON format. Check the example with failureCondition.
  The default value for Kubernetes Job and Kubeflow TFJob, PyTorchJob, MXJob, and XGBoostJob is: status.conditions.#(type=="Failed")#|#(status=="True")#
  The failureCondition default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source, you have to set failureCondition manually.
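Because the defaults do not apply to the configMap source, it can help to see the behavior parameters written out explicitly. The following is a minimal sketch under stated assumptions: the ConfigMap name (trial-templates), its namespace, the data key (defaultTrialTemplate.yaml), the image and the script path are illustrative choices, and the second document is only a fragment of an Experiment. The conditions are the Kubernetes Job defaults quoted above.

```yaml
# ConfigMap that stores the Trial template.
apiVersion: v1
kind: ConfigMap
metadata:
  name: trial-templates             # assumed ConfigMap name
  namespace: kubeflow               # assumed namespace
  labels:
    katib.kubeflow.org/component: trial-templates   # required label
data:
  # key: <template-name>, value: <template-yaml>
  defaultTrialTemplate.yaml: |-
    apiVersion: batch/v1
    kind: Job
    spec:
      template:
        spec:
          containers:
            - name: training-container
              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
                - "--lr=${trialParameters.learningRate}"
          restartPolicy: Never
---
# Fragment of an Experiment: only .spec.trialTemplate is shown.
spec:
  trialTemplate:
    retain: true                    # keep the Trial's resources after completion
    primaryContainerName: training-container
    # Set explicitly, because defaults only apply to the trialSpec source.
    successCondition: 'status.conditions.#(type=="Complete")#|#(status=="True")#'
    failureCondition: 'status.conditions.#(type=="Failed")#|#(status=="True")#'
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
    configMap:
      configMapName: trial-templates
      configMapNamespace: kubeflow
      templatePath: defaultTrialTemplate.yaml
```

primaryPodLabels is left out here on the assumption that the Job runs a single Pod, in which case wrapping all of the Worker’s Pods is the intended behavior.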
Use Metadata in Trial Template
You can’t specify .metadata.name and .metadata.namespace in your Trial template, but you can get this data during the Experiment run. This is useful, for example, if you want to append the Trial’s name to your model storage.
To do this, point .trialParameters[x].reference to the appropriate metadata parameter and use .trialParameters[x].name in your Trial template.
The table below shows the connection between the .trialParameters[x].reference value and Trial metadata.
Reference | Trial metadata |
---|---|
${trialSpec.Name} | Trial name |
${trialSpec.Namespace} | Trial namespace |
${trialSpec.Kind} | Kubernetes resource kind for the Trial's worker |
${trialSpec.APIVersion} | Kubernetes resource APIVersion for the Trial's worker |
${trialSpec.Labels[custom-key]} | Trial's worker label with custom-key key |
${trialSpec.Annotations[custom-key]} | Trial's worker annotation with custom-key key |
Check the example of using Trial metadata.
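As a rough illustration of the table above, the fragment below passes the Trial name and a Worker label to the training container as environment variables. It is a sketch only: the parameter names (trialName, jobLabel), the environment variable names, the label value, the image and the script path are hypothetical, and only .spec.trialTemplate of the Experiment is shown.

```yaml
spec:
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: trialName                      # hypothetical parameter name
        description: Name of the current Trial
        reference: "${trialSpec.Name}"
      - name: jobLabel                       # hypothetical parameter name
        description: Value of the Worker label with the custom-key key
        reference: "${trialSpec.Labels[custom-key]}"
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      metadata:
        labels:
          custom-key: custom-label           # illustrative label value
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest   # assumed image
                command:
                  - "python3"
                  - "/opt/pytorch-mnist/mnist.py"                         # assumed script path
                  - "--lr=${trialParameters.learningRate}"
                env:
                  # Metadata is consumed like any other Trial parameter.
                  - name: TRIAL_NAME
                    value: "${trialParameters.trialName}"
                  - name: JOB_LABEL
                    value: "${trialParameters.jobLabel}"
            restartPolicy: Never
```

The template still omits .metadata.name and .metadata.namespace; labels and annotations under .metadata are what the Labels[...] and Annotations[...] references read.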
Use CRDs with Trial Template
It is possible to use your own Kubernetes CRD or another Kubernetes resource (e.g. Kubernetes CronJob) as a Trial Worker without modifying the Katib controller source code or building a new image. As long as your CRD creates Kubernetes Pods, allows a sidecar container to be injected into these Pods, and has succeeded and failed statuses, you can use it in Katib.
To do that, you need to modify Katib components before installing Katib on your Kubernetes cluster. Accordingly, you have to know your CRD’s API group and version and the CRD object’s kind. You also need to know which resources your custom object creates. Check the Kubernetes guide to learn more about CRDs.
Follow these two simple steps to integrate your custom CRD in Katib:
1. Modify the Katib controller ClusterRole’s rules with a new rule that gives Katib access to all resources that are created by the Trial. To learn more about ClusterRole, check the Kubernetes guide.
   In the case of Tekton Pipelines, a Trial creates a Tekton PipelineRun, and the Tekton PipelineRun then creates Tekton TaskRuns. Therefore, the Katib controller ClusterRole should have access to pipelineruns and taskruns:
       - apiGroups:
           - tekton.dev
         resources:
           - pipelineruns
           - taskruns
         verbs:
           - "get"
           - "list"
           - "watch"
           - "create"
           - "delete"
2. Modify the Katib Config controller parameters with the new entity:
       trialResources:
         - <object-kind>.<object-API-version>.<object-API-group>
   For example, to support Tekton Pipelines:
       trialResources:
         - PipelineRun.v1beta1.tekton.dev
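To show where the trialResources entry might sit, here is a minimal sketch of a Katib Config. It assumes the KatibConfig layout (apiVersion config.kubeflow.org/v1beta1 with an init.controller section) used by recent Katib releases; older releases configure the controller differently, so treat the surrounding structure as an assumption and check the Katib Config shipped with your installation.

```yaml
apiVersion: config.kubeflow.org/v1beta1   # assumed KatibConfig API version
kind: KatibConfig
init:
  controller:
    # Each entry follows <object-kind>.<object-API-version>.<object-API-group>.
    trialResources:
      - Job.v1.batch                      # Kubernetes Job
      - PipelineRun.v1beta1.tekton.dev    # the new custom resource
```

Existing entries should stay in the list; the custom resource is appended to them.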
After these changes, deploy Katib as described in the installation guide and wait until the katib-controller Pod is created. You can check the logs from the Katib controller to verify your resource integration:
$ kubectl logs $(kubectl get pods -n kubeflow -o name | grep katib-controller) -n kubeflow | grep '"CRD Kind":"PipelineRun"'
{"level":"info","ts":1628032648.6285546,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"tekton.dev","CRD Version":"v1beta1","CRD Kind":"PipelineRun"}
If you ran the above steps successfully, you should be able to use your custom object YAML in the Experiment’s Trial template source spec.
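For instance, a Trial template that uses a Tekton PipelineRun as the Worker could look roughly like the fragment below. This is a sketch under several assumptions: it relies on Tekton labeling TaskRun Pods with tekton.dev/pipelineTask, on Tekton naming step containers step-<step-name>, and on PipelineRun reporting a Succeeded condition whose status is True or False. The task name, image and script path are illustrative, and only .spec.trialTemplate of the Experiment is shown.

```yaml
spec:
  trialTemplate:
    retain: true
    # Select the Pod of the training task (label assumed to be set by Tekton).
    primaryPodLabels:
      tekton.dev/pipelineTask: model-training
    # Tekton is assumed to prefix step container names with "step-".
    primaryContainerName: step-model-training
    successCondition: 'status.conditions.#(type=="Succeeded")#|#(status=="True")#'
    failureCondition: 'status.conditions.#(type=="Succeeded")#|#(status=="False")#'
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
    trialSpec:
      apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      spec:
        params:
          - name: lr
            value: "${trialParameters.learningRate}"
        pipelineSpec:
          params:
            - name: lr
              description: Learning rate for the training model
          tasks:
            - name: model-training          # illustrative task name
              params:
                - name: lr
                  value: $(params.lr)
              taskSpec:
                params:
                  - name: lr
                    description: Learning rate for the training model
                steps:
                  - name: model-training
                    image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest   # assumed image
                    command:
                      - "python3"
                      - "/opt/pytorch-mnist/mnist.py"                         # assumed script path
                      - "--epochs=1"
                      - "--lr=$(params.lr)"
```

The conditions and primaryPodLabels are set explicitly because Katib only documents defaults for Kubernetes Job and the Kubeflow training jobs.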
We appreciate your feedback on using various CRDs in Katib. It would be great if you could let us know about your Experiments. The developer guide is a good starting point for learning how to contribute to the project.
Next steps
Understand the Katib metrics collector capabilities.
Learn about Katib Configuration.