Katib Experiment Lifecycle

What happens after an Experiment is created

When a user creates an Experiment, the Katib Experiment controller, Suggestion controller, and Trial controller work together to tune the hyperparameters of the user's machine learning model. The Experiment workflow is as follows:

Katib Workflow
  1. The Experiment is submitted to the Kubernetes API server. The Katib Experiment mutating and validating webhooks are called to set default values for the Experiment and to validate the CR, respectively.

  2. The Experiment controller creates the Suggestion.

  3. The Suggestion controller creates the algorithm deployment and service based on the new Suggestion.

  4. Once the Suggestion controller verifies that the algorithm service is ready, it calls the service to generate spec.request - len(status.suggestions) sets of hyperparameters and appends them to status.suggestions.

  5. The Experiment controller detects that the Suggestion has been updated and creates a Trial for each new set of hyperparameters.

  6. The Trial controller creates a worker Job based on the runSpec from the Trial, using the new set of hyperparameters.

  7. The related job controller (Kubernetes batch Job, Kubeflow TFJob, Tekton Pipeline, etc.) creates the Kubernetes Pods.

  8. The Katib Pod mutating webhook is called to inject the metrics collector sidecar container into the candidate Pods.

  9. While the ML model container runs, the metrics collector container collects metrics from the injected Pod and persists them to the Katib DB backend.

  10. When the ML model training ends, the Trial controller updates the status of the corresponding Trial.

  11. When the Trial finishes, the Experiment controller increases the request field of the corresponding Suggestion if more Trials are needed, and the workflow returns to step 4. If the Experiment meets one of its end conditions (it exceeds maxTrialCount or maxFailedTrialCount, or reaches the goal), the Experiment controller finishes the Experiment.
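
The loop between steps 4 and 11 can be illustrated with a small Python model. This is a minimal sketch, not Katib's controller code: the names Suggestion, sets_to_generate, end_reason, simulate, and the batch parameter are hypothetical, the objective is assumed to be maximized, and the exact comparison operators used for the end conditions are assumptions. Steps 5 to 10 are collapsed into reading a canned metric per Trial.

```python
from dataclasses import dataclass, field


@dataclass
class Suggestion:
    """Toy stand-in for the Suggestion CR (not the real Katib type)."""
    request: int = 0                                  # models spec.request
    suggestions: list = field(default_factory=list)   # models status.suggestions


def sets_to_generate(suggestion):
    """Step 4: spec.request - len(status.suggestions), floored at zero."""
    return max(suggestion.request - len(suggestion.suggestions), 0)


def end_reason(succeeded, failed, best, max_trial_count,
               max_failed_trial_count, goal):
    """Step 11: return why the Experiment should end, or None to continue.

    Assumption: a maximized objective; the operators are illustrative only.
    """
    if best is not None and best >= goal:
        return "goal"
    if failed > max_failed_trial_count:
        return "maxFailedTrialCount"
    if succeeded + failed >= max_trial_count:
        return "maxTrialCount"
    return None


def simulate(results, batch, max_trial_count, max_failed_trial_count, goal):
    """Replay the loop with canned Trial results (None marks a failed Trial).

    `results` must hold at least `max_trial_count` entries.
    Returns (end reason, number of hyperparameter sets generated).
    """
    if batch < 1:
        raise ValueError("batch must be at least 1")
    suggestion = Suggestion()
    succeeded = failed = 0
    best = None
    pending = iter(results)
    while True:
        reason = end_reason(succeeded, failed, best, max_trial_count,
                            max_failed_trial_count, goal)
        if reason is not None:
            return reason, len(suggestion.suggestions)
        # Step 11: raise the request, capped at max_trial_count in total.
        suggestion.request = min(len(suggestion.suggestions) + batch,
                                 max_trial_count)
        # Step 4: the algorithm service fills the gap with new sets.
        for _ in range(sets_to_generate(suggestion)):
            suggestion.suggestions.append({"trial": len(suggestion.suggestions)})
            # Steps 5-10 collapsed: the Trial runs and reports its metric.
            metric = next(pending)
            if metric is None:
                failed += 1
            else:
                succeeded += 1
                best = metric if best is None else max(best, metric)
```

For example, simulate([0.5, 0.7, 0.95, 0.99], batch=2, max_trial_count=10, max_failed_trial_count=1, goal=0.9) runs two rounds of two Trials each and then stops because the goal was reached. Computing the difference in sets_to_generate, rather than a fixed batch size, means that hyperparameter sets already stored in status.suggestions are never requested twice.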

Last modified October 15, 2024: docs: add diagram to reference (#3906) (3c7d3de)