The Confidential Inferencing Beta is a collaboration between Microsoft Research, Azure Confidential Compute, Azure Machine Learning, and Microsoft’s ONNX Runtime project. It is provided as-is, as a beta, to showcase a hosting option that prevents the party hosting the machine learning model from accessing either the inferencing request or its corresponding response.
As part of this implementation, the secure Trusted Execution Environment (TEE) generates a private (ECDH) key which is secured within the enclave and used to decrypt incoming inference requests. The client (reference code also provided) first obtains the server's public key and the attestation report proving that the key was created by the TEE. It then completes a key exchange to derive an encryption key for its request.
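The exchange described above can be illustrated with a toy sketch. It swaps ECDH for classic finite-field Diffie-Hellman over a small prime so that it runs with only the Python standard library, and it replaces the attestation report with a plain hash that binds the server's public key. The names `keygen`, `derive_key`, and `fake_report` are hypothetical, and none of this reflects the actual wire protocol or the confonnx implementation; it only shows why both sides end up with the same encryption key.

```python
import hashlib
import secrets

# Toy parameters: a 127-bit Mersenne prime. Far too small for real use,
# and NOT the elliptic-curve group used by the actual server.
P = 2**127 - 1
G = 3


def keygen():
    """Return a (private, public) key pair for the toy group."""
    priv = secrets.randbelow(P - 3) + 2  # value in [2, P - 2]
    pub = pow(G, priv, P)
    return priv, pub


def derive_key(own_priv, peer_pub):
    """Derive a 32-byte symmetric key from the shared secret."""
    shared = pow(peer_pub, own_priv, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()


def fake_report(pub):
    """Stand-in for an attestation report: a hash that binds the public key."""
    return hashlib.sha256(pub.to_bytes(16, "big")).hexdigest()


# Server side (inside the TEE in the real system).
server_priv, server_pub = keygen()
report = fake_report(server_pub)

# Client side: check that the report matches the key, then do the exchange.
assert fake_report(server_pub) == report
client_priv, client_pub = keygen()
client_key = derive_key(client_priv, server_pub)

# The server derives the same key from the client's public key.
server_key = derive_key(server_priv, client_pub)
print(client_key == server_key)
```

In the real system the report is produced by the TEE and verified against the expected enclave hash, which a plain hash cannot replace.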
Currently, the provided AKS deployment example only works on a single node, because scaling to multiple nodes requires provisioning the same private key to all inference enclaves. Developers can use the key provider interface to plug in their own key distribution solution, but there is no officially supported implementation at this stage. We welcome open source contributions.
To make this tutorial easier to follow, we first describe how to locally build and run the inference server on an Azure Confidential Computing VM, then separately describe the steps to build and deploy a container on Azure Kubernetes Service.
Setting up a local deployment on an ACC VM
We run this deployment flow on an ACC VM so that you can test the server locally during the deployment. Local testing requires Intel SGX support, which is enabled on DC-series VMs from Azure Confidential Computing. If you don't need to test the server locally, an ACC VM is not required: skip to the AKS deployment tutorial after you have built the server image and Python client.
Note: Azure subscriptions have a default quota of 8 cores, and the development VM uses some of them. We recommend a DC2s v2 VM (2 cores) as the build machine, so that the remaining cores can be used by the ACC AKS cluster.
Open Neural Network Exchange (ONNX) is an open standard format for representing machine learning models. ONNX is supported by a community of partners who have implemented it in many frameworks and tools. Most frameworks (PyTorch, TensorFlow, etc.) support converting models to ONNX.
This repository depends on an Open Enclave port of the generally available ONNX Runtime in [external/onnxruntime]. Make sure the ONNX model you use is supported by the provided runtime (in general though, there should be no issues).
In this step you will build a server image, which you can test locally, and eventually deploy to an AKS cluster. You will need to follow these 3 main steps:
- Clone this git repo.
- Build a generic container.
- Bundle the generic container with your own ONNX model.
To assure your clients that your code is secure, you need to provide them with the container code. They can then check that the enclave hash they generate (enclave_info.txt) matches your hosted server. You can also provide the model hash, so they know which model binary was used to produce the inferencing results.
To use your server, your clients would need to use a proprietary protocol which will make sure the target server is secure before sending it the encrypted inferencing request. This git repository provides an open source Python library called confonnx which can be used to call the server with the proprietary protocol.
Note: During the deployment we use the command line, but the command line interface is not ideal for production. Consider using the API directly (more information below).
Note: The client library is also used to generate the hash of the ONNX model and create inference test data.
Once you have the server image, you can run it on your VM via Docker and use the client library to test that everything is working properly. Once you have built and tested the confidential inference server container on your VM, you are ready to deploy on an AKS cluster. Remember that without a key management solution you can only deploy on a single node.
This section describes the steps needed to build and run a confidential inference server on an Azure Confidential Compute VM. Note: The following commands were tested on an ACC VM running Ubuntu 18.04.
Follow the steps on how to Deploy an Azure Confidential Computing VM
Notes: You will need an empty resource group.
- Image: Ubuntu Server 18.04 (Gen 2)
- Choose SSH public key option
- VM size: 1x Standard DC2s v2
- Public inbound ports: SSH (Linux) / RDP (Windows)
You can now SSH into your machine: go to the ACC VM resource in the portal, choose Connect, and select SSH.
ssh -i <private key path> <username>@<ip address>
# <username>@accvm:~$
sudo apt update
Clone this repository:
git clone https://github.com/microsoft/onnx-server-openenclave
cd onnx-server-openenclave
For running the inference client, the Azure DCAP Client has to be installed. (note: This requirement may be removed in a future release.) To install the Azure DCAP Client on Ubuntu 18.04, run:
echo "deb [arch=amd64] https://packages.microsoft.com/ubuntu/18.04/prod bionic main" | sudo tee /etc/apt/sources.list.d/msprod.list
wget -qO - https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
sudo apt update
sudo apt install az-dcap-client
Check version:
python3 --version
# Python 3.7.5
Install:
sudo apt install python3
sudo apt-get install python3-pip
Install Docker:
sudo apt install docker.io
Build the Python package (this will take some time):
PYTHON_VERSION=3.7 docker/client/build.sh
The folder dist/Release/lib/python now contains the .whl file for the requested Python version, for example confonnx-0.1.0-cp37-cp37m-linux_x86_64.whl.
Note: manylinux wheels can be built with TYPE=manylinux, however those do not support enclave identity validation yet. The non-manylinux wheels built above should work on Ubuntu 18.04 and possibly other versions.
Install the built library:
python3.7 -m pip install dist/Release/lib/python/confonnx-0.1.0-cp37-cp37m-linux_x86_64.whl
Open enclave.conf and adjust enclave parameters as necessary:
- Debug: Set to 0 for deployment. If left as 1, an attacker has access to the enclave memory.
- NumTCS: Set to number of available cores in deployment VM.
- NumHeapPages: In-enclave heap memory, increase if out-of-memory errors occur, for example with large models.
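As a sanity check before deployment, the parameters listed above can be inspected with a small script. This is a minimal sketch that assumes enclave.conf consists of simple KEY=VALUE lines; the helper names `parse_enclave_conf` and `deployment_warnings` and the sample values are hypothetical and not part of this repository.

```python
def parse_enclave_conf(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def deployment_warnings(settings, available_cores):
    """Return human-readable warnings for settings that look unsafe or mismatched."""
    warnings = []
    if settings.get("Debug") != "0":
        warnings.append("Debug is not 0: enclave memory is accessible to an attacker")
    if int(settings.get("NumTCS", "0")) != available_cores:
        warnings.append("NumTCS does not match the number of available cores")
    return warnings


# Hypothetical file contents, for illustration only.
sample = """\
Debug=1
NumHeapPages=1024
NumTCS=1
"""

for warning in deployment_warnings(parse_enclave_conf(sample), available_cores=2):
    print(warning)
```

To check the real file, pass the result of `open("enclave.conf").read()` in place of `sample`.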
By default, an enclave signing key pair is created if it doesn't exist yet. To use your own, copy the private key to enclave.pem in the repository root.
Run the following to build the server using Docker (this will take a while):
docker/server/build.sh
The server binaries are stored in dist/Release. In the subfolder bin/ you will also find an enclave_info.txt file. This file contains the enclave hash mrenclave, which clients need in order to validate the enclave's identity before sending inference requests.
The inference server uses the ONNX Runtime and hence the model has to be converted into ONNX format first. See the ONNX Tutorials page for an overview of available converters. Make sure the target runtime (see external/onnxruntime) supports the ONNX model version.
For testing, you can download pre-trained ONNX models from the ONNX Model Zoo.
This guide will use one of the pre-trained MNIST models from the Zoo.
curl https://media.githubusercontent.com/media/onnx/models/master/vision/classification/mnist/model/mnist-7.onnx --output model.onnx
To ensure that inference requests are only sent to inference servers that are loaded with a specific model, we can compute the model hash and have the client verify it before sending the inferencing request. Note that this is an optional feature.
python3 -m confonnx.hash_model model.onnx --out model.hash
# 0d715376572e89832685c56a65ef1391f5f0b7dd31d61050c91ff3ecab16c032
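The 64-character hex string above has the shape of a SHA-256 digest. The source does not state how confonnx.hash_model computes it, so the sketch below is not a replacement for that tool and may produce a different value. It only shows one common way to fingerprint a model file, in case you want an independent integrity check on the file you distribute. The function name `sha256_file` is hypothetical.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large models are never fully loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_file("model.onnx")` returns a 64-character hex string for the downloaded model file.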
We are now ready to bundle the model and the server into a Docker image ready for deployment:
# Adjust model path and image name if needed.
MODEL_PATH=model.onnx IMAGE_NAME=model-server docker/server/build_image.sh
Before testing the server we need some inference test data. We can use the following tool to create random data according to the model schema:
python3 -m confonnx.create_test_inputs --model model.onnx --out input.json
Start the server with:
sudo docker run --rm --name model-server-test --device=/dev/sgx -p 8888:8888 model-server
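When scripting the test, it can help to wait until the container accepts connections before sending a request. The helper below is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical name, and an open TCP port only indicates that something is listening, not that the enclave has finished loading the model.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

With the server started as above, `wait_for_port("localhost", 8888)` returns True once the port is reachable.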
Now we can send our first inference request:
python3 -m confonnx.main --url http://localhost:8888/ --enclave-hash "<mrenclave>" --enclave-model-hash-file model.hash --json-in input.json --json-out output.json
The inference result is stored in output.json.
Note: Add --enclave-allow-debug if Debug is set to 1 in enclave.conf.
To stop the server, run:
sudo docker stop model-server-test
See the dedicated AKS deployment tutorial.
In the above instructions, the command line inference client was used. This client is not meant for production scenarios and offers restricted functionality.
Using the Python API directly has the following advantages:
- Simple inference input/output format (dictionary of numpy arrays).
- Efficient handling of multiple requests (avoiding repeated key exchanges).
- Custom error handling.
Example:
import numpy as np
from confonnx.client import Client
client = Client('https://...', auth_key='password123', enclave_hash='<mrenclave>', enclave_model_hash='...')
result = client.predict({
'image': np.random.random_sample((5,128,128)) # five 128x128 images
})
print(result['digit'])
# [1,6,2,1,0]
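Building on the example above, the sketch below reuses a single client for several requests and collects failures instead of aborting on the first one. The helper `predict_many` is hypothetical. It relies only on the `predict` method shown above, and it catches the generic `Exception` because the library's specific exception types are not described here.

```python
def predict_many(client, batches):
    """Send several requests through one client; return (results, errors).

    Each entry is tagged with the index of the batch it belongs to.
    """
    results, errors = [], []
    for index, inputs in enumerate(batches):
        try:
            results.append((index, client.predict(inputs)))
        except Exception as exc:  # assumption: specific exception types are unknown
            errors.append((index, exc))
    return results, errors
```

Passing the same `Client` instance for all batches is what allows the library to avoid repeated key exchanges, as noted in the list of advantages.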
Currently not, but support for it will be added in a future release.
The server includes experimental and undocumented options for model protection which advanced users may use at their own risk. No support is provided for these options.
Full support for model protection will come in a future release.
Currently not, though this is planned for a future release.
Currently not, though community contributions are highly welcomed to support this.
The Python client is a thin wrapper around C++ code (see confonnx/client and external/confmsg). This code can be used as a basis for building a custom native client.