This repository contains a simple machine learning workflow consisting of data ingestion and preparation, model training, serving and monitoring. It demonstrates how computer-aided diagnosis can be used to predict pneumonia from a collection (a volume) of CT images.
It is based on work by Hasib Zunair.
- Openshift
- OpenDataHub
- Jupyter Notebooks with iPyWidgets
- Numpy
- Tensorflow
- Python requests library
- Seldon Core
- Prometheus
- Grafana
├── 01-inference-3d-image-classification-cli.py Python script for inferencing
├── 01-inference-3d-image-classification.ipynb Visualization and inferencing
├── 02-training-3d-image-classification.ipynb Notebook script for training
├── 02-training-3d-image-classification.py Python script for training
├── 3d_image_classification.h5 Trained model artifact
├── Dockerfile For s2i builds
├── MyModel.py Seldon Model Server Code
├── ct-data.zip Validation data for inferencing
├── requirements-notebook.txt
├── requirements.txt
└── resources Kubernetes Objects
    ├── 06-seldon-mymodel-servicemonitor.yaml
    ├── 07-mymodel-seldon-deploy-from-quay.yaml
    └── grafana-dashboards
        ├── NVIDIA-DCGM-dashboard.json GPU Metrics
        └── seldon-dashboard.json Model Server Metrics
Train a 17-layer Convolutional Neural Network to predict the presence of COVID-19 related pneumonia from 3D CT imagery.
Build the training Python stack.
pip install pip tensorflow nibabel matplotlib -Uq
- 200 COVID-19 related 3D CT image studies
- Each study contains 36-54 slices of 512x512 pixels (voxels) each.
- Total size is ~2GB (compressed)
- ~20 minutes to preprocess and train on an NVIDIA Tesla T4 GPU
- ML framework: Keras/Tensorflow
Data Source: Chest CT Scans with COVID-19 Related Findings.
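Because each study holds a different number of slices (36-54), the volumes usually have to be brought to a common depth before they can be batched for training. The sketch below shows one simple way to do that by picking evenly spaced slice indices. It is an illustration only: the function name resample_indices and the target depth of 64 are assumptions, not values taken from the training script or notebook.

```python
from typing import List


def resample_indices(depth: int, target: int) -> List[int]:
    """Return `target` slice indices spread evenly over a study of `depth` slices."""
    if depth < 1 or target < 1:
        raise ValueError("depth and target must be positive")
    if target == 1:
        return [depth // 2]  # a single slice: take the middle one
    # Distance (in source slices) between two consecutive output slices.
    step = (depth - 1) / (target - 1)
    return [round(i * step) for i in range(target)]


TARGET_DEPTH = 64  # hypothetical value, chosen only for this example

for depth in (36, 54):
    indices = resample_indices(depth, TARGET_DEPTH)
    # Every study now yields the same number of slices; indices may repeat
    # when the study is shallower than the target depth.
    print(depth, len(indices), indices[0], indices[-1])
```

Nearest-index sampling keeps the example dependency-free. An interpolating resize is a common alternative when smoother volumes are needed.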
- Change to the resources directory.
cd 3d-image-classification/resources
- Create a project called ml-mon
oc new-project ml-mon
- Using the Openshift console UI, install an instance of the following community operators from OperatorHub into the ml-mon namespace.
- OpenDataHub
  - JupyterHub, S3, ODH Dashboard
- Prometheus
- Grafana
- Seldon
  - Install the Seldon Core operator into all namespaces in the cluster (default).
- Create an instance of Prometheus and Grafana in the ml-mon namespace.
Expected Output
oc get pods -n ml-mon -w
NAME READY STATUS RESTARTS AGE
$ oc get pods -n ml-mon
NAME READY STATUS RESTARTS AGE
grafana-deployment-8fbf7c944-7895m 1/1 Running 0 5h35m
grafana-operator-controller-manager-6ff698d9fc-xvk28 2/2 Running 0 5h35m
prometheus-example-0 2/2 Running 0 5h35m
prometheus-operator-7b9ccd45c6-7v8td 1/1 Running 0 5h35m
Create routes for Prometheus and Grafana.
oc expose svc prometheus-operated
oc expose svc grafana-service
Obtain the Grafana admin credentials to log in to the Grafana console.
oc get secrets grafana-admin-credentials -o=jsonpath='{@.data.GF_SECURITY_ADMIN_USER}' | base64 --decode
admin
oc get secrets grafana-admin-credentials -o=jsonpath='{@.data.GF_SECURITY_ADMIN_PASSWORD}' | base64 --decode
ABcdRqpfdsEfpg==
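The base64 --decode step is needed because Kubernetes stores Secret values base64-encoded. A minimal Python sketch of the same decoding follows, assuming the secret's data map has already been fetched as a dictionary. The helper name decode_secret_data and the sample value are made up for illustration.

```python
import base64


def decode_secret_data(data):
    """Decode every base64-encoded value of a Secret's `data` map to text."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }


# Hypothetical sample: "YWRtaW4=" is the base64 encoding of "admin".
sample = {"GF_SECURITY_ADMIN_USER": "YWRtaW4="}
print(decode_secret_data(sample)["GF_SECURITY_ADMIN_USER"])  # admin
```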
- Create a Prometheus Service Monitor
oc create -f 06-seldon-mymodel-servicemonitor.yaml
servicemonitor.monitoring.coreos.com/mymodel-mygroup created
- Log in to the Grafana console. The username and password can be obtained from the grafana-admin-credentials secret.
- Within Grafana, configure a Prometheus data source called prometheus with a URL of prometheus-operated.ml-mon:9090
- Import the Seldon dashboard from the resources/seldon-dashboard.json file.
- Deploy the Seldon model server and wait for the classifier pod to become ready. Two services should be created by the Seldon deployer.
oc create -f 07-mymodel-seldon-deploy-from-quay.yaml
seldondeployment.machinelearning.seldon.io/mymodel created
oc get pods
NAME READY STATUS RESTARTS AGE
mymodel-mygroup-0-classifier-57647887d9-98qqb 2/2 Running 0 118s
oc get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mymodel-mygroup ClusterIP 10.217.5.143 <none> 8000/TCP,5001/TCP 20s
mymodel-mygroup-classifier ClusterIP 10.217.4.127 <none> 9000/TCP 2m4s
- Create a route for the Seldon model server.
oc expose svc mymodel-mygroup
Curl the prometheus endpoint and confirm it is able to scrape metrics from the classifier pod.
curl -X GET $(oc get route mymodel-mygroup -o jsonpath='{.spec.host}')/prometheus
...
promhttp_metric_handler_requests_total{code="200"} 5
Jupyter Notebook dependencies
pip install tensorflow jupyterlab ipywidgets scipy
- Log in to OpenDataHub
- Start the JupyterHub server and choose the Standard Data Science notebook image.
- Clone this GitHub repo
- Run the 01-inference-3d-image-classification notebook.
- Find the notebook cell with the predict function and modify the url variable to point to the route that was created.
echo $(oc get route mymodel-mygroup -o jsonpath='{.spec.host}')/api/v1.0/predictions
- Run the notebook and select a study to make a few predictions to trigger Seldon activity.
Within 30 seconds or so there should be activity on the Seldon Grafana Dashboard.
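For reference, a request to the predictions endpoint can be sketched in plain Python as shown below. This is not the notebook's predict function: it uses the standard library instead of requests to stay self-contained, the helper names are invented, and the input shape that MyModel.py expects is an assumption. The only parts taken from the steps above are the /api/v1.0/predictions path and the route host; the ndarray envelope is the Seldon Core REST payload format.

```python
import json
import urllib.request


def prediction_url(route_host, scheme="http"):
    """Build the Seldon REST predictions URL from the route host name."""
    return f"{scheme}://{route_host}/api/v1.0/predictions"


def build_payload(volume):
    """Wrap a nested list (the CT volume) in Seldon's ndarray envelope."""
    return {"data": {"ndarray": volume}}


def predict(route_host, volume, timeout=60):
    """POST the volume to the model server and return the parsed JSON reply."""
    body = json.dumps(build_payload(volume)).encode("utf-8")
    request = urllib.request.Request(
        prediction_url(route_host),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling predict with the host printed by the oc get route command and a volume converted to nested lists (for example with a NumPy array's tolist method) would send one request and produce one data point on the dashboard.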
Optionally, configure Grafana to watch Openshift's built-in Prometheus Data Source so a GPU dashboard can be created. This data source will scrape metrics from the NVIDIA DCGM exporter.
Grant the Grafana service account the cluster-monitoring-view role so it can use Openshift's Prometheus in the openshift-monitoring namespace.
oc adm policy add-cluster-role-to-user cluster-monitoring-view -z grafana-service-account -n ml-mon
Get the Prometheus token.
oc serviceaccounts get-token prometheus-k8s -n openshift-monitoring
Add this token to the example Grafana data source yaml.
httpHeaderValue1: 'Bearer ${BEARER_TOKEN}'
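The ${BEARER_TOKEN} placeholder can be filled in by hand or with a small script. The sketch below shows one way to do the substitution in Python, assuming the YAML text has been read into a string. The helper name render_token and the token value are placeholders for illustration.

```python
from string import Template


def render_token(yaml_text, token):
    """Replace ${BEARER_TOKEN} in the data source YAML with the real token."""
    # safe_substitute leaves any other $-placeholders in the file untouched.
    return Template(yaml_text).safe_substitute(BEARER_TOKEN=token)


line = "httpHeaderValue1: 'Bearer ${BEARER_TOKEN}'"
print(render_token(line, "abc123"))  # httpHeaderValue1: 'Bearer abc123'
```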
Create the data source object.
oc apply -f 03-prometheus-grafanadatasource.yaml
Import the Seldon and GPU dashboards from the included json files.
Open the Prometheus and Grafana dashboards to visualize the API activity.
curl -X GET $(oc get route mymodel-mygroup -o jsonpath='{.spec.host}')/prometheus
seldon_api_executor_server_requests_seconds_sum{code="200",deployment_name="mymodel",method="post",predictor_name="mygroup",predictor_version="",service="predictions"} 4.714845908
seldon_api_executor_server_requests_seconds_count{code="200",deployment_name="mymodel",method="post",predictor_name="mygroup",predictor_version="",service="predictions"} 5
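The _sum and _count series together give the mean request latency: here 4.714845908 s over 5 requests, or roughly 0.94 s per prediction. A minimal sketch of that calculation on the scraped text follows. The parser is deliberately simplified (it assumes one series per metric name and no trailing timestamps), the function names are invented, and the label sets in the sample are shortened.

```python
def parse_samples(text):
    """Map each sample's metric name (labels stripped) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        try:
            samples[series.split("{", 1)[0]] = float(value)
        except ValueError:
            continue  # not a sample line (for example an elided "...")
    return samples


def mean_latency_seconds(samples, prefix="seldon_api_executor_server_requests_seconds"):
    """Average seconds per request: total time divided by request count."""
    return samples[prefix + "_sum"] / samples[prefix + "_count"]


scraped = """
seldon_api_executor_server_requests_seconds_sum{code="200",service="predictions"} 4.714845908
seldon_api_executor_server_requests_seconds_count{code="200",service="predictions"} 5
"""
print(round(mean_latency_seconds(parse_samples(scraped)), 3))  # 0.943
```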
$ oc create -f resources/07-mymodel-seldon-deploy-from-quay.yaml
Error from server (InternalError): error when creating "resources/07-mymodel-seldon-deploy-from-quay.yaml": Internal error occurred: failed calling webhook "v1.vseldondeployment.kb.io": Post "https://seldon-webhook-service.odh.svc:443/validate-machinelearning-seldon-io-v1-seldondeployment?timeout=30s": service "seldon-webhook-service" not found
This can happen after ODH has been re-installed into a different project. To fix it, delete the old webhook.
oc get MutatingWebhookConfiguration,ValidatingWebhookConfiguration -A
oc delete validatingwebhookconfiguration.admissionregistration.k8s.io/seldon-validating-webhook-configuration-odh
cd 3d-image-classification
oc new-build --strategy docker --docker-image registry.redhat.io/ubi8/python-36 --name mymodel -l app=mymodel --binary
oc start-build mymodel --from-dir=. --follow
oc get is
NAME IMAGE REPOSITORY TAGS UPDATED
mymodel image-registry.openshift-image-registry.svc:5000/bk-models/mymodel latest 7 seconds ago
Edit mymodel-seldon-deploy.yaml to confirm that the image location matches what the image stream reports. Then deploy the model server and wait for the pod to become ready.
oc apply -f resources/mymodel-seldon-deploy.yaml
oc get pods
NAME READY STATUS RESTARTS AGE
mymodel-mygroup-0-classifier-7c6b44569c-qmzk6 2/2 Running 0 61s
Expose the service
oc expose svc <svc-name>
To trigger a redeploy after a new build, patch the deployment with a fresh label. This does not always work, so the pod may have to be deleted.
oc patch deployment <deployment-name> -p "{\"spec\": {\"template\": {\"metadata\": { \"labels\": { \"redeploy\": \"$(date +%s)\"}}}}}"
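The patch works by changing a label on the pod template: any change to the template makes the Deployment roll out new pods, and a timestamp guarantees the value differs each time. The escaped JSON is easy to get wrong in a shell, so the same document can be generated with a short script. This is a sketch, and the function name redeploy_patch is hypothetical.

```python
import json
import time


def redeploy_patch(timestamp=None):
    """Return a JSON patch that sets a `redeploy` label on the pod template."""
    stamp = str(int(time.time() if timestamp is None else timestamp))
    patch = {
        "spec": {"template": {"metadata": {"labels": {"redeploy": stamp}}}}
    }
    return json.dumps(patch)


print(redeploy_patch())
```

The printed string can be passed to the -p option of oc patch in place of the hand-escaped JSON.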