This repository contains a blueprint that creates and secures a Google Kubernetes Engine (GKE) cluster that is ready to host custom apps distributed by a third party. The blueprint uses federated learning as an example use case for hosting custom third-party apps inside your cluster. Specifically, the blueprint creates and configures a GKE cluster and related infrastructure so that the cluster is ready to participate in cross-silo federated learning.
Federated learning is a machine learning approach that allows a loose federation of participants (e.g. a group of organisations) to collaboratively improve a shared model without sharing any sensitive data. In cross-silo federated learning, each participant uses its own data and compute resources, called a silo. Each silo trains the shared model using only its local data and compute resources. Training results are shared with the federation owner, who updates the shared model and redistributes it to the silos for further training rounds, and the process repeats. This way, silos can collaborate to improve the model without sharing data.
This blueprint suggests using a GKE cluster as the compute infrastructure for a silo. The cluster is designed to host containerised apps, distributed by the federation owner, that train the model against local data and manage interaction between the silo and the federation owner. Because these apps are created by the federation owner, they need to be treated as untrusted or semi-trusted workloads within the silo cluster. Therefore, the silo cluster is configured according to security best practices, and additional controls are put in place to isolate and constrain the trainer workloads. The blueprint uses Anthos features to automate and optimise the configuration and security of the cluster.
The initial version of the blueprint creates infrastructure in Google Cloud. It can be extended to Anthos clusters running on premises or on other public clouds.
This blueprint is focussed on creating and configuring GKE clusters. The following items are out of scope for the blueprint:
- Creation and orchestration of the federated learning workflows.
- Management of the federated learning consortium.
- Preparation of local training data.
- Deployment and management of the federated learning apps.
- Communication requirements between the cluster and the federation owner.
To deploy this blueprint you need:
- A Google Cloud project with billing enabled
- Owner permissions on the project
- It is expected that you deploy the blueprint using Cloud Shell.
- You create the infrastructure using Terraform. The blueprint uses a local backend; it is recommended to configure a remote backend for anything other than experimentation.
This repository has the following key directories:
- `terraform`: contains the Terraform code used to create the project-level infrastructure and resources, for example a GKE cluster, VPC network, firewall rules etc. It also installs Anthos components into the cluster.
- `configsync`: contains the cluster-level resources and configurations that are applied to your GKE cluster.
- `tenant-config-pkg`: a kpt package that you can use as a template to configure new tenants in the GKE cluster.
The blueprint uses a multi-tenant architecture. The federated learning workloads are treated as a tenant within the cluster. These tenant workloads are grouped in a dedicated namespace, and isolated on dedicated cluster nodes. This way, you can apply security controls and policies to the nodes and namespace that host the tenant workloads.
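As later sections describe, this node isolation relies on node taints on the tenant node pool and matching tolerations and node affinity on tenant pods. The following is only a minimal sketch of that scheduling model; the taint key, node label, and workload names are assumptions rather than the blueprint's actual values.

```yaml
# Illustrative only: the key and value names below are assumptions, not the
# blueprint's actual configuration. A pod with this toleration and
# nodeAffinity can be scheduled only onto nodes tainted and labelled for
# the tenant.
apiVersion: v1
kind: Pod
metadata:
  name: example-trainer          # hypothetical tenant workload
  namespace: fltenant1
spec:
  tolerations:
  - key: tenant                  # assumed taint key on the tenant node pool
    operator: Equal
    value: fltenant1
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: tenant          # assumed node label on the tenant node pool
            operator: In
            values: ["fltenant1"]
  containers:
  - name: trainer
    image: example.com/trainer:latest   # placeholder image
```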
The following diagram describes the infrastructure created by the blueprint:
The infrastructure created by the blueprint includes:
- A VPC network and subnet.
- A private GKE cluster. The blueprint helps you create GKE clusters that implement recommended security settings, such as those described in the GKE hardening guide. For example, the blueprint helps you:
  - Limit exposure of your cluster nodes and control plane to the internet by creating a private GKE cluster with authorised networks.
  - Use shielded nodes that use a hardened node image with the containerd runtime.
  - Harden isolation of tenant workloads using GKE Sandbox.
  - Enable Dataplane V2 for optimised Kubernetes networking.
  - Encrypt cluster secrets at the application layer.
- Two GKE node pools.
  - You create a dedicated node pool to exclusively host tenant apps and resources. The nodes have taints to ensure that only tenant workloads are scheduled onto the tenant nodes.
  - Other cluster resources are hosted in the default node pool.
- VPC firewall rules:
  - Baseline rules that apply to all nodes in the cluster.
  - Additional rules that apply only to the nodes in the tenant node pool (targeted using the node Service Account below). These firewall rules limit egress from the tenant nodes.
- Cloud NAT to allow egress to the internet.
- Cloud DNS configured to enable Private Google Access, so that apps within the cluster can access Google APIs without traversing the internet.
- Service Accounts used by the cluster:
  - A dedicated Service Account used by the nodes in the tenant node pool.
  - A dedicated Service Account for use by tenant apps (via Workload Identity, discussed later; a sketch follows this list).
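As a minimal sketch of how tenant apps consume the second service account, the Kubernetes service account they run as carries a Workload Identity annotation that points at the Google Cloud service account to impersonate. The account names and project ID below are placeholders, not the accounts the blueprint actually creates.

```yaml
# Placeholder names and project ID; the blueprint's Terraform creates the
# actual accounts. Pods running as this Kubernetes service account can call
# Google APIs as the annotated Google Cloud service account.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tenant-apps              # hypothetical Kubernetes service account
  namespace: fltenant1
  annotations:
    iam.gke.io/gcp-service-account: tenant-apps@PROJECT_ID.iam.gserviceaccount.com
```

On the Google Cloud side, the corresponding service account also needs an IAM binding (roles/iam.workloadIdentityUser) that lets the Kubernetes service account impersonate it; the project-level Terraform is the natural place to create that wiring.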
The following diagram describes the apps and resources within the GKE cluster:
The cluster includes:
- Config Sync, which keeps cluster configuration in sync with config defined in a Git repository.
  - The config defined by the blueprint includes namespaces, service accounts, network policies, Policy Controller policies and Istio resources that are applied to the cluster.
  - See the configsync directory for the full set of resources applied to the cluster.
- Policy Controller enforces policies ('constraints') for your clusters. These policies act as 'guardrails' and prevent any changes to your cluster that violate security, operational, or compliance controls.
  - Example policies enforced by the blueprint include:
    - Selected constraints similar to PodSecurityPolicy.
    - Selected constraints from the template library, including:
      - Prevent creation of external services (Ingress, NodePort/LoadBalancer services).
      - Allow pods to pull container images only from a named set of repos.
  - See the resources in the configsync/policycontroller directory for details of the constraints applied by this blueprint.
- Anthos Service Mesh (ASM) is powered by Istio and enables managed, observable, and secure communication across your services. The blueprint includes service mesh configuration that is applied to the cluster using Config Sync. The following points describe how this blueprint configures the service mesh (a sketch of the root-namespace resources follows this list).
  - The root istio namespace (istio-system) is configured with:
    - A PeerAuthentication resource that allows only STRICT mTLS communication between services in the mesh.
    - AuthorizationPolicies that:
      - by default deny all communication between services in the mesh,
      - allow communication to a set of known external hosts (such as example.com).
    - An Egress Gateway that acts as a forward proxy at the edge of the mesh.
    - VirtualService and DestinationRule resources that route traffic from sidecar proxies through the egress gateway to external destinations.
  - The tenant namespace is configured for automatic sidecar proxy injection; see the next section.
  - Note that the mesh does not include an Ingress Gateway.
  - See the servicemesh directory for the cluster-level mesh config.
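The root-namespace posture described above boils down to two resources: mesh-wide strict mTLS and a default deny. The following is a sketch under assumed resource names; see the servicemesh directory for the actual manifests applied by Config Sync.

```yaml
# Mesh-wide strict mTLS: defined in the root namespace (istio-system),
# so it applies to every workload in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Default deny: an AuthorizationPolicy with an empty spec in the root
# namespace rejects requests unless another policy explicitly allows them.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all                 # assumed name
  namespace: istio-system
spec: {}
```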
The blueprint configures a dedicated namespace for tenant apps and resources:
- The tenant namespace is part of the service mesh. Pods in the namespace receive sidecar proxy containers. The namespace-level mesh resources include:
  - A Sidecar resource that allows egress only to known hosts (outboundTrafficPolicy: REGISTRY_ONLY).
  - An AuthorizationPolicy that defines the allowed communication paths within the namespace. The blueprint only allows requests that originate from within the same namespace. This policy applies in addition to the root policy in the istio-system namespace.
- The tenant namespace has network policies to limit traffic to and from pods in the namespace. For example, the network policy (a sketch follows this list):
  - By default, denies all ingress and egress traffic to/from the pods. This acts as a baseline 'deny all' rule.
  - Allows traffic between pods in the namespace.
  - Allows egress to required cluster resources such as kube-dns, the service mesh control plane, and the GKE metadata server.
  - Allows egress to Google APIs (via Private Google Access).
- The pods in the tenant namespace are hosted exclusively on nodes in the dedicated tenant node pool.
  - Any pod deployed to the tenant namespace automatically receives a toleration and nodeAffinity to ensure that it is scheduled only onto tenant nodes.
  - The toleration and nodeAffinity are automatically applied using Policy Controller mutations.
- The apps in the tenant namespace use a dedicated Kubernetes service account that is linked to a Google Cloud service account using Workload Identity. This way you can grant appropriate IAM roles to interact with any required Google APIs.
- The blueprint includes a sample RBAC ClusterRole that grants users permissions to interact with limited resource types. The tenant namespace includes a sample RoleBinding that grants the role to an example user.
  - For example, different teams might be responsible for managing apps within each tenant namespace.
  - Users and teams managing tenant apps should not have permissions to change cluster configuration or modify service mesh resources.
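The sketch below shows the shape of the baseline network policies described above, using assumed policy names and the blueprint's default tenant namespace; the real policies also carry the egress exceptions for kube-dns, the mesh control plane, the GKE metadata server, and Google APIs.

```yaml
# Illustrative baseline: deny all ingress and egress by default, then allow
# traffic between pods in the same namespace. Names are assumptions; the
# blueprint's actual policies add further egress exceptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny             # assumed name
  namespace: fltenant1
spec:
  podSelector: {}                # selects every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace     # assumed name
  namespace: fltenant1
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
  ingress:
  - from:
    - podSelector: {}            # any pod in this namespace
  egress:
  - to:
    - podSelector: {}            # any pod in this namespace
```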
- Open Cloud Shell.
- Fork or clone this repo.
- Change into the directory that contains the Terraform code:

  ```
  cd terraform
  ```

- Review the `terraform.tfvars` file and replace values appropriately.
- Set a Terraform environment variable for your project ID:

  ```
  export TF_VAR_project_id=[YOUR_PROJECT_ID]
  ```

- Initialise Terraform:

  ```
  terraform init
  ```

- Create the plan; review it so you know what's going on:

  ```
  terraform plan -out terraform.out
  ```

- Apply the plan to create the cluster. Note this may take ~15 minutes to complete:

  ```
  terraform apply terraform.out
  ```
See testing for some manual tests you can perform to verify the setup.
Out of the box, the blueprint is configured with a single tenant called 'fltenant1'. Adding another tenant is a two-stage process:
- Create the project-level infrastructure and resources for the tenant (node pool, service accounts, firewall rules...). You do this by updating the Terraform config and re-applying it.
- Configure cluster-level resources for the tenant (namespace, network policies, service mesh policies...). You do this by instantiating and configuring a new version of the tenant kpt package, and then applying it to the cluster (a sketch of the kind of resource this produces follows below).
See the relevant section in testing for instructions.
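To make the second stage more concrete, the kind of cluster-level resource an instantiated tenant package defines is a namespace that is enrolled in the service mesh and scoped to the new tenant. The manifest below is only an assumption about what such output might look like, not the package's actual contents; in particular, the sidecar-injection label must match the Anthos Service Mesh revision running in your cluster.

```yaml
# Hypothetical output of an instantiated tenant package; the tenant name and
# the revision label value are assumptions.
apiVersion: v1
kind: Namespace
metadata:
  name: fltenant2                # hypothetical second tenant
  labels:
    istio.io/rev: asm-managed    # assumed ASM revision label enabling sidecar injection
```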