This developer guide is for people who want to contribute to the Katib project. If you're interested in using Katib in your machine learning project, see the following user guides:
- Concepts in Katib, hyperparameter tuning, and neural architecture search.
- Getting started with Katib.
- Detailed guide to configuring and running a Katib experiment.
- Go (1.19 or later)
- Docker (20.10 or later)
- Docker Buildx (0.8.0 or later)
- Java (8 or later)
- Python (3.9 or later)
- kustomize (4.0.5 or later)
Build the Katib images from source code as follows:
make build REGISTRY=<image-registry> TAG=<image-tag>
To use your custom images for the Katib components, modify the Kustomization file and the Katib Config.
You can deploy Katib v1beta1 manifests into a Kubernetes cluster as follows:
make deploy
You can undeploy Katib v1beta1 manifests from a Kubernetes cluster as follows:
make undeploy
If you modify the Katib controller APIs, you have to regenerate the deepcopy, clientset, listers, informers, OpenAPI, and Python SDK code for the changed APIs. You can update the necessary files as follows:
make generate
Below is a list of command-line flags accepted by the Katib controller:
Name | Type | Default | Description |
---|---|---|---|
enable-grpc-probe-in-suggestion | bool | true | Enable gRPC probe in suggestions |
experiment-suggestion-name | string | "default" | The implementation of suggestion interface in experiment controller |
metrics-addr | string | ":8080" | The address that the metrics endpoint binds to |
healthz-addr | string | ":18080" | The address that the healthz endpoint binds to |
trial-resources | []schema.GroupVersionKind | null | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
webhook-inject-securitycontext | bool | false | Inject the securityContext of container[0] in the sidecar |
webhook-port | int | 8443 | The port number to be used for admission webhook server |
enable-leader-election | bool | false | Enable leader election for katib-controller. Enabling this will ensure there is only one active katib-controller. |
leader-election-id | string | "3fbc96e9.katib.kubeflow.org" | The ID for leader election. |
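The `trial-resources` flag expects each entry in the form `Kind.version.group`. The following is a minimal Python sketch of how such a string can be split into its parts. It is only an illustration of the format, not the controller's own parser; the names `GroupVersionKind` and `parse_trial_resource` are hypothetical.

```python
from typing import NamedTuple


class GroupVersionKind(NamedTuple):
    """Illustrative container for the three parts of a trial resource."""
    group: str
    version: str
    kind: str


def parse_trial_resource(arg: str) -> GroupVersionKind:
    """Parse a 'Kind.version.group' string such as 'TFJob.v1.kubeflow.org'."""
    # Split on the first two dots only, because the group may itself contain dots.
    parts = arg.split(".", 2)
    # Assumption for this sketch: all three parts must be present and non-empty.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Kind.version.group, got {arg!r}")
    kind, version, group = parts
    return GroupVersionKind(group=group, version=version, kind=kind)
```

For example, `parse_trial_resource("TFJob.v1.kubeflow.org")` yields the group `kubeflow.org`, the version `v1`, and the kind `TFJob`.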
Below is a list of command-line flags accepted by the Katib DB Manager:
Name | Type | Default | Description |
---|---|---|---|
connect-timeout | time.Duration | 60s | Timeout after which a database connection attempt fails with an error |
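As a rough illustration of what a connection timeout like `connect-timeout` means in practice, the Python sketch below retries a connection attempt until it succeeds or the timeout elapses. It is a generic pattern under stated assumptions, not the DB Manager's implementation: the function name `connect_with_timeout`, the `connect` callable, and the 5-second retry interval are all hypothetical.

```python
import time
from typing import Callable


def connect_with_timeout(
    connect: Callable[[], object],
    timeout: float = 60.0,   # mirrors the 60s default in the table above
    interval: float = 5.0,   # assumed retry interval, not taken from Katib
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call connect() repeatedly until it succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return connect()
        except ConnectionError as err:
            # Give up if waiting one more interval would pass the deadline.
            if time.monotonic() + interval > deadline:
                raise TimeoutError(f"could not connect within {timeout}s") from err
            sleep(interval)
```

The `sleep` parameter exists only so that the wait can be replaced in tests.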
Please see workflow-design.md.
Katib uses three Kubernetes admission webhooks.
- `validator.experiment.katib.kubeflow.org` - Validating admission webhook that validates the Katib Experiment before creation.
- `defaulter.experiment.katib.kubeflow.org` - Mutating admission webhook that sets the default values in the Katib Experiment before creation.
- `mutator.pod.katib.kubeflow.org` - Mutating admission webhook that injects the metrics collector sidecar container into the training pod. Learn more about Katib's metrics collector in the Kubeflow documentation.
You can find the YAMLs for the Katib webhooks here.
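To make the role of a mutating webhook such as `mutator.pod.katib.kubeflow.org` more concrete, here is a minimal Python sketch of the general shape of a Kubernetes `AdmissionReview` response that appends a sidecar container to a Pod through a JSON Patch. It shows the generic admission protocol only and is not Katib's actual injection logic; the function name `mutate_pod` and the sidecar definition passed to it are hypothetical.

```python
import base64
import json


def mutate_pod(review: dict, sidecar: dict) -> dict:
    """Build an AdmissionReview response that appends `sidecar` to the Pod."""
    request = review["request"]
    # JSON Patch (RFC 6902): the "/-" suffix appends to the end of the array.
    patch = [{"op": "add", "path": "/spec/containers/-", "value": sidecar}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],  # the response must echo the request UID
            "allowed": True,
            "patchType": "JSONPatch",
            # The API server expects the patch as base64-encoded JSON.
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

A validating webhook returns the same envelope without `patchType` and `patch`, and sets `allowed` to `False` to reject the object.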
Note: If you are using a private Kubernetes cluster, you have to allow traffic via `TCP:8443` by specifying the firewall rule and you have to update the master plane CIDR source range to use the Katib webhooks.
Katib uses the custom `cert-generator` Kubernetes Job to generate certificates for the webhooks.
Once Katib is deployed in the Kubernetes cluster, the `cert-generator` Job follows these steps:

- Generate the self-signed CA certificate and private key.
- Generate a public certificate and private key signed with the CA key generated in the previous step.
- Create a Kubernetes Secret with the signed certificate. The Secret is named `katib-webhook-cert` and carries the `cert-generator` Job's `ownerReference`, so that it is cleaned up once Katib is uninstalled.

  Once the Secret is created, the Katib controller Deployment spawns the Pod, since the controller mounts the `katib-webhook-cert` Secret as a volume.
- Patch the webhooks with the `CABundle`.
You can find the `cert-generator` source code here.
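The last two steps of the Job can be sketched as plain Kubernetes objects. The Python snippet below builds a Secret manifest that carries a Job `ownerReference` and a JSON Patch that sets a webhook's `caBundle`. It is a sketch under stated assumptions, not the `cert-generator` code: the function names, the `tls.crt` and `tls.key` data keys, and the webhook index are assumptions made for illustration.

```python
import base64


def _b64(data: bytes) -> str:
    """Kubernetes stores Secret data and caBundle values base64-encoded."""
    return base64.b64encode(data).decode()


def build_cert_secret(cert_pem: bytes, key_pem: bytes,
                      job_name: str, job_uid: str, namespace: str) -> dict:
    """Build a Secret manifest owned by the Job, so it is garbage-collected with it."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": "katib-webhook-cert",
            "namespace": namespace,
            "ownerReferences": [{
                "apiVersion": "batch/v1",
                "kind": "Job",
                "name": job_name,
                "uid": job_uid,
            }],
        },
        # Assumed key names; the real Secret may use different ones.
        "data": {"tls.crt": _b64(cert_pem), "tls.key": _b64(key_pem)},
    }


def ca_bundle_patch(ca_pem: bytes, webhook_index: int = 0) -> list:
    """Build a JSON Patch that sets caBundle on one webhook of a configuration."""
    path = f"/webhooks/{webhook_index}/clientConfig/caBundle"
    # "add" on an object member also overwrites an existing value (RFC 6902).
    return [{"op": "add", "path": path, "value": _b64(ca_pem)}]
```

Because the owner is the Job itself, deleting the Job during uninstall lets the Kubernetes garbage collector remove the Secret as well.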
Please see new-algorithm-service.md.
Please see Katib UI README.
Please see proposals.