This repository contains a blueprint that creates and secures a Google Kubernetes Engine (GKE) cluster that is ready to host custom apps distributed by a third party.
You can use this blueprint to implement Federated Learning (FL) use cases on Google Cloud.
This blueprint suggests controls that you can use to help configure and secure GKE clusters that host custom apps distributed by third-party tenants. These custom apps are treated as untrusted workloads within the cluster. Therefore, the cluster is configured according to security best practices to isolate and constrain these workloads from other workloads and from the cluster control plane.
This blueprint provisions cloud resources on Google Cloud. After the initial provisioning, you can extend the infrastructure to GKE clusters running on premises or on other public clouds.
This blueprint is aimed at cloud platform administrators and data scientists who need to provision and configure a secure environment to run potentially untrusted workloads in their Google Cloud environment.
This blueprint assumes that you are familiar with GKE and Kubernetes.
To deploy this blueprint you need:
- A Google Cloud project with billing enabled.
- An account with either the Project Owner role (full access) or Granular Access roles.
- The `serviceusage.googleapis.com` API enabled on the project. For more information about enabling APIs, see Enabling and disabling services.
- A Git repository to store the environment configuration.
You can choose either Project Owner access (full access) or Granular Access for more fine-tuned permissions.
The service account will have full administrative access to the project.
- `roles/owner`: Full administrative access to the project (Project Owner role).
The service account will be assigned the following roles to limit access to required resources:
- `roles/artifactregistry.admin`: Grants full administrative access to Artifact Registry, allowing management of repositories and artifacts.
- `roles/browser`: Provides read-only access to browse resources in a project.
- `roles/cloudkms.admin`: Provides full administrative control over Cloud KMS (Key Management Service) resources.
- `roles/compute.networkAdmin`: Grants full control over Compute Engine network resources.
- `roles/container.clusterAdmin`: Provides full control over Kubernetes Engine clusters, including creating and managing clusters.
- `roles/gkehub.editor`: Grants permission to manage Google Kubernetes Engine Hub features.
- `roles/iam.serviceAccountAdmin`: Grants full control over managing service accounts in the project.
- `roles/resourcemanager.projectIamAdmin`: Allows managing IAM policies and roles at the project level.
- `roles/servicenetworking.serviceAgent`: Allows managing service networking configurations.
- `roles/serviceusage.serviceUsageAdmin`: Grants permission to enable and manage services and APIs for a project.
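As a rough sketch of how the granular roles above could be granted, the following Python snippet builds one `gcloud projects add-iam-policy-binding` command per role and prints them for review. It does not run anything. The project ID and the service account email are placeholders chosen for illustration, not values defined by this blueprint.

```python
import shlex

# Granular roles listed in this section.
GRANULAR_ROLES = [
    "roles/artifactregistry.admin",
    "roles/browser",
    "roles/cloudkms.admin",
    "roles/compute.networkAdmin",
    "roles/container.clusterAdmin",
    "roles/gkehub.editor",
    "roles/iam.serviceAccountAdmin",
    "roles/resourcemanager.projectIamAdmin",
    "roles/servicenetworking.serviceAgent",
    "roles/serviceusage.serviceUsageAdmin",
]


def binding_commands(project_id, service_account_email):
    """Return one gcloud command (as an argument list) per granular role."""
    member = f"serviceAccount:{service_account_email}"
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member={member}", f"--role={role}"]
        for role in GRANULAR_ROLES
    ]


if __name__ == "__main__":
    # Placeholder project ID and service account email (assumptions).
    for cmd in binding_commands(
            "my-project-id",
            "terraform@my-project-id.iam.gserviceaccount.com"):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them lets you check each binding before you apply it.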
You create the infrastructure using Terraform. The blueprint uses a local Terraform backend, but we recommend configuring a remote backend for anything other than experimentation.
This repository has the following key directories:
- `examples`: contains examples that build on top of this blueprint.
- `terraform`: contains the Terraform code used to create the project-level infrastructure and resources, for example, a GKE cluster, a VPC network, and firewall rules. It also installs Anthos components into the cluster.
- `configsync`: contains the cluster-level resources and configurations that are applied to your GKE cluster.
- `tenant-config-pkg`: a kpt package that you can use as a template to configure new tenants in the GKE cluster.
The following diagram describes the architecture that you create with this blueprint:
As shown in the preceding diagram, the blueprint helps you to create and configure the following infrastructure components:
- A Virtual Private Cloud (VPC) network and subnet.
- A private GKE cluster that helps you:
  - Isolate cluster nodes from the internet.
  - Limit exposure of your cluster nodes and control plane to the internet by creating a private GKE cluster with authorized networks.
  - Use shielded cluster nodes that use a hardened node image with the containerd runtime.
  - Enable Dataplane V2 for optimized Kubernetes networking.
  - Encrypt cluster secrets at the application layer.
- Dedicated GKE node pools.
  - You create a dedicated node pool to exclusively host tenant apps and resources. The nodes have taints to ensure that only tenant workloads are scheduled onto the tenant nodes.
  - Other cluster resources are hosted in the main node pool.
- VPC firewall rules:
  - Baseline rules that apply to all nodes in the cluster.
  - Additional rules that apply only to the nodes in the tenant node pool. These firewall rules limit ingress to and egress from tenant nodes.
- Cloud NAT to allow egress to the internet.
- Cloud DNS records to enable Private Google Access so that apps within the cluster can access Google APIs without traversing the internet.
- Service accounts:
  - Dedicated service account for the nodes in the tenant node pool.
  - Dedicated service account for tenant apps to use with Workload Identity.
- Support for using Google Groups for Kubernetes RBAC.
- A Cloud Source Repository to store configuration descriptors.
- An Artifact Registry repository to store container images.
The following diagram shows the cluster-level resources that you create and configure with the blueprint.
As shown in the preceding diagram, in the blueprint, you use the following to create and configure the cluster-level resources:
- Anthos Config Management Config Sync, to sync cluster configuration and policies from a Git repository.
  - When you provision the resources using this blueprint, the tooling initializes a Git repository for Config Sync to consume, and automatically renders the relevant templates and commits changes.
  - The tooling automatically commits any modification to templates in the Config Sync repository on each run of the provisioning process.
- Anthos Config Management Policy Controller, to enforce policies ('constraints') on resources in the cluster.
- Anthos Service Mesh, to control and help secure network traffic.
- A dedicated namespace and node pool for tenant apps and resources. Custom apps are treated as a tenant within the cluster.
- Policies and controls applied to the tenant namespace:
  - Allow egress only to known hosts.
  - Allow requests that originate from within the same namespace.
  - By default, deny all ingress and egress traffic to and from pods. This acts as a baseline 'deny all' rule.
  - Allow traffic between pods in the namespace.
  - Allow egress to required cluster resources such as Kubernetes DNS, the service mesh control plane, and the GKE metadata server.
  - Allow egress to Google APIs only using Private Google Access.
  - Run tenant pods exclusively on nodes in the dedicated tenant node pool.
  - Use a dedicated Kubernetes service account that is linked to a Cloud Identity and Access Management service account using Workload Identity.
Users and teams managing tenant apps should not have permissions to change cluster configuration or modify service mesh resources.
- Open Cloud Shell.
- Initialize the local repository where the environment configuration will be stored:
    ACM_REPOSITORY_PATH= # Path on the host running Terraform to store environment configuration
    ACM_REPOSITORY_URL= # URL of the repository to store environment configuration
    ACM_BRANCH= # Name of the Git branch in the repository that Config Sync will sync with
    git clone "${ACM_REPOSITORY_URL}" --branch "${ACM_BRANCH}" "${ACM_REPOSITORY_PATH}"
- Clone this Git repository.
- Change into the directory that contains the Terraform code:
    cd [REPOSITORY]/terraform
  Where `[REPOSITORY]` is the path to the directory where you cloned this repository.
- Initialize Terraform:
    terraform init
- Initialize the following Terraform variables:
    project_id = # Google Cloud project ID where to provision resources with the blueprint.
    acm_branch = # Use the same value that you used for ${ACM_BRANCH}
    acm_repository_path = # Use the same value that you used for ${ACM_REPOSITORY_PATH}
    acm_repository_url = # Use the same value that you used for ${ACM_REPOSITORY_URL}
    acm_secret_type = # Secret type to authenticate with the Config Sync Git repository
    acm_source_repository_fqdns = # FQDNs of source repository for Config Sync to allow in the Network Firewall Policy
  For more information about setting `acm_secret_type`, see Grant access to Git.
  If you don't provide all the necessary inputs, Terraform exits with an error and provides information about the missing inputs. For example, you can create a Terraform variables initialization file and set inputs there. For more information about providing these inputs, see Terraform input variables.
- Review the proposed changes, and apply them:
    terraform apply
  The provisioning process may take about 15 minutes to complete.
- Wait for the Cloud Service Mesh custom resource definitions to be available:
    /bin/sh -c 'while ! kubectl wait crd/controlplanerevisions.mesh.cloud.google.com --for condition=established --timeout=60m --all-namespaces; do echo "crd/controlplanerevisions.mesh.cloud.google.com not yet available, waiting..."; sleep 5; done'
- Wait for the Cloud Service Mesh custom resources to be available:
    /bin/sh -c 'while ! kubectl -n istio-system wait ControlPlaneRevision --all --timeout=60m --for condition=Reconciled; do echo "ControlPlaneRevision not yet available, waiting..."; sleep 5; done'
- Commit and push generated configuration files to the environment configuration repository:
    git -C "${ACM_REPOSITORY_PATH}" add .
    git -C "${ACM_REPOSITORY_PATH}" commit -m "Config update: $(date -u +'%Y-%m-%dT%H:%M:%SZ')"
    git -C "${ACM_REPOSITORY_PATH}" push -u origin "${ACM_BRANCH}"
  Every time you modify the environment configuration, you need to commit and push changes to the environment configuration repository.
- Grant the Config Sync agent access to the Git repository where the environment configuration will be stored.
- Wait for the GKE cluster to be reported as ready in the GKE Kubernetes clusters dashboard.
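One way to provide the Terraform inputs from the steps above is a variables file. The following Python sketch renders a `terraform.tfvars` body from a dictionary. The variable names come from this README; every value is a placeholder, and treating `acm_source_repository_fqdns` as a list of strings is an assumption that you should check against the variable definitions in the `terraform` directory.

```python
import json


def render_tfvars(values):
    """Render a dict of simple values as terraform.tfvars text.

    JSON string and list literals are also valid HCL expressions, so
    json.dumps is enough for strings and lists of strings.
    """
    lines = [f"{name} = {json.dumps(value)}" for name, value in values.items()]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
example_inputs = {
    "project_id": "my-project-id",
    "acm_branch": "main",
    "acm_repository_path": "/home/user/acm-repository",
    "acm_repository_url": "https://example.com/org/acm-repository.git",
    "acm_secret_type": "token",
    "acm_source_repository_fqdns": ["example.com"],  # assumed to be a list
}

if __name__ == "__main__":
    print(render_tfvars(example_inputs), end="")
```

You could write the output to a file named `terraform.tfvars` in the `terraform` directory, where Terraform loads it automatically.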
After the deployment completes, the GKE cluster is ready to host untrusted workloads. To familiarize yourself with the environment that you provisioned, you can also deploy the following examples in the GKE cluster:
Federated learning is typically split into cross-silo and cross-device federated learning. Cross-silo federated computation is where the participating members are organizations or companies, and the number of members is usually small (for example, fewer than a hundred).
Cross-device computation is a type of federated computation where the participating members are end user devices such as mobile phones and vehicles. The number of members can reach up to a scale of millions or even tens of millions.
You can deploy a cross-device infrastructure by following this README.md.
This blueprint dynamically provisions a runtime environment for each tenant you configure.
To add another tenant:
- Add its name to the list of tenants to configure using the `tenant_names` variable.
- Follow the steps to Deploy the blueprint again.
To open an SSH session against a node of the cluster, you use an IAP tunnel because cluster nodes don't have external IP addresses:
gcloud compute ssh --tunnel-through-iap node_name
Where `node_name` is the name of the Compute Engine instance to connect to.
This section describes common issues and troubleshooting steps.
If Terraform reports errors when you run `plan` or `apply` because it can't get the status of a resource inside a GKE cluster, and it also reports that it needs to update the `cidr_block` of the `master_authorized_networks` block of that cluster, it might be that the instance that runs Terraform is not part of any CIDR that is authorized to connect to that GKE cluster control plane.
To solve this issue, you can try updating the `cidr_block` by targeting the GKE cluster specifically when applying changes:
terraform apply -target module.gke
Then, you can try running `terraform apply` again, without any resource targeting.
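To check whether the machine that runs Terraform falls inside an authorized range before you retry, you can compare its public IP address with the configured CIDR blocks. The following Python sketch uses only the standard library; the addresses are documentation examples, not values from this blueprint.

```python
import ipaddress


def is_authorized(client_ip, authorized_cidrs):
    """Return True if client_ip belongs to at least one authorized CIDR block."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ip in ipaddress.ip_network(cidr, strict=False)
        for cidr in authorized_cidrs
    )


if __name__ == "__main__":
    # Example addresses from the documentation ranges (assumptions).
    cidrs = ["203.0.113.0/24", "10.0.0.0/8"]
    print(is_authorized("203.0.113.7", cidrs))
    print(is_authorized("198.51.100.1", cidrs))
```

If the check returns `False` for your address, the control plane rejects the connection until the authorized networks are updated.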
If Terraform reports `connect: cannot assign requested address` errors when you run Terraform, try running the command again.
If Terraform reports errors about the format of the fleet membership configuration, it may mean that the Fleet API initialization didn't complete when Terraform tried to add the GKE cluster to the fleet. Example:
Error creating FeatureMembership: googleapi: Error 400: InvalidValueError for
field membership_specs["projects/<project number>/locations/global/memberships/<cluster name>"].feature_spec:
does not match a current membership in this project. Keys should be in the form: projects/<project number>/locations/{l}/memberships/{m}
If this error occurs, try running `terraform apply` again.
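Because the two transient errors above are resolved by running the command again, you might wrap the command in a small retry helper. The following Python sketch is one possible approach, not part of the blueprint tooling; the attempt count and delay are arbitrary defaults.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, delay_seconds=30,
                     runner=subprocess.run, sleep=time.sleep):
    """Run a command until it exits with code 0 or attempts run out.

    Returns the 1-based number of the attempt that succeeded.
    The runner and sleep parameters exist so the helper can be tested
    without running real commands.
    """
    for attempt in range(1, attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)
    raise RuntimeError(f"{command!r} still failing after {attempts} attempts")


# Example (not executed here): run_with_retries(["terraform", "apply"])
```

Keep the number of attempts low: an error that persists after a few retries is probably not transient and needs investigation.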
If `istio-ingress` or `istio-egress` Pods fail to run because GKE cannot download their container images and GKE reports `ImagePullBackOff` errors, see Troubleshoot gateways for details about the potential root cause. You can inspect the status of these Pods in the GKE Workloads Dashboard.
If this happens, wait for the cluster to complete the initialization, and delete the Deployment that has this issue. Config Sync will deploy it again with the correct container image identifiers.
When you run `terraform destroy` to remove resources that this reference architecture provisioned and configured, you might get the following errors:
- Dangling network endpoint groups (NEGs):
    Error waiting for Deleting Network: The network resource 'projects/PROJECT_NAME/global/networks/NETWORK_NAME' is already being used by 'projects/PROJECT_NAME/zones/ZONE_NAME/networkEndpointGroups/NETWORK_ENDPOINT_GROUP_NAME'.
  If this happens:
  - Open the NEGs dashboard for your project.
  - Delete all the NEGs that were associated with the GKE cluster that Terraform deleted.
  - Run `terraform destroy` again.
This section discusses the controls that you apply with the blueprint to help you secure your GKE cluster.
Creating clusters according to security best practices.
The blueprint helps you create a GKE cluster which implements the following security settings:
- Limit exposure of your cluster nodes and control plane to the internet by creating a private GKE cluster with authorized networks.
- Use shielded nodes that use a hardened node image with the `containerd` runtime.
- Increase isolation of tenant workloads using GKE Sandbox.
- Encrypt cluster secrets at the application layer.
For more information about GKE security settings, refer to Hardening your cluster's security.
VPC firewall rules govern which traffic is allowed to or from Compute Engine VMs. The rules let you filter traffic at VM granularity, depending on Layer 4 attributes.
You create a GKE cluster with the default GKE cluster firewall rules. These firewall rules enable communication between the cluster nodes and GKE control plane, and between nodes and Pods in the cluster.
You apply additional firewall rules to the nodes in the tenant node pool. These firewall rules restrict egress traffic from the tenant nodes. This approach lets you increase the isolation of the tenant nodes. By default, all egress traffic from the tenant nodes is denied. Any required egress must be explicitly configured. For example, you use the blueprint to create firewall rules to allow egress from the tenant nodes to the GKE control plane, and to Google APIs using Private Google Access. The firewall rules are targeted to the tenant nodes using the tenant node pool service account.
The blueprint helps you create a dedicated namespace to host the third-party apps. The namespace and its resources are treated as a tenant within your cluster. You apply policies and controls to the namespace to limit the scope of resources in the namespace.
Network policies enforce Layer 4 network traffic flows by using Pod-level firewall rules. Network policies are scoped to a namespace.
In the blueprint, you apply network policies to the tenant namespace that hosts the third-party apps. By default, the network policy denies all traffic to and from pods in the namespace. Any required traffic must be explicitly allowlisted. For example, the network policies in the blueprint explicitly allow traffic to required cluster services, such as the cluster internal DNS and the Anthos Service Mesh control plane.
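To illustrate the 'deny all, then allowlist' pattern described above, the following Python sketch builds two Kubernetes NetworkPolicy manifests and prints them as JSON, which `kubectl apply -f` accepts. The policy names and the `tenant-a` namespace are hypothetical, and the manifests are a simplified illustration rather than the policies that the blueprint ships.

```python
import json


def default_deny_policy(namespace):
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        # An empty podSelector selects all pods in the namespace. Listing both
        # policy types without any rules blocks traffic in both directions.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_same_namespace_policy(namespace):
    """Allow traffic between pods that live in the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress", "Egress"],
            # A bare podSelector peer matches pods in the policy's namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
            "egress": [{"to": [{"podSelector": {}}]}],
        },
    }


if __name__ == "__main__":
    for policy in (default_deny_policy("tenant-a"),
                   allow_same_namespace_policy("tenant-a")):
        print(json.dumps(policy, indent=2))
```

Network policies are additive, so the second policy opens only the in-namespace traffic while everything else stays denied by the first.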
Config Sync keeps your GKE clusters in sync with configs stored in a Git repository. The Git repository acts as the single source of truth for your cluster configuration and policies. Config Sync is declarative. It continuously checks cluster state and applies the state declared in the configuration file in order to enforce policies, which helps to prevent configuration drift.
You install Config Sync into your GKE cluster. You configure Config Sync to sync cluster configurations and policies from the Git repository associated with the blueprint. The synced resources include the following:
- Cluster-level Anthos Service Mesh configuration
- Cluster-level security policies
- Tenant namespace-level configuration and policy including network policies, service accounts, RBAC rules, and Anthos Service Mesh configuration
Anthos Policy Controller is a dynamic admission controller for Kubernetes that enforces CustomResourceDefinition-based (CRD-based) policies that are executed by the Open Policy Agent (OPA).
Admission controllers are Kubernetes plugins that intercept requests to the Kubernetes API server before an object is persisted, but after the request is authenticated and authorized. You can use admission controllers to limit how a cluster is used.
You install Policy Controller into your GKE cluster. The blueprint includes example policies to help secure your cluster. You automatically apply the policies to your cluster using Config Sync. You apply the following policies:
- Selected policies to help enforce Pod security. For example, you apply policies that prevent pods running privileged containers and that require a read-only root file system.
- Policies from the Policy Controller template library. For example, you apply a policy that disallows services with type NodePort.
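To make the intent of these example policies concrete, the following Python sketch checks a Pod spec and a Service spec against the three rules mentioned above. It is only a local illustration: Policy Controller evaluates constraints inside the cluster at admission time and does not use this code, and the function names are hypothetical. For brevity, the sketch inspects only the `containers` field.

```python
def pod_violations(pod_spec):
    """Return messages for containers that break the example Pod rules."""
    violations = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        context = container.get("securityContext", {})
        if context.get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")
        if not context.get("readOnlyRootFilesystem", False):
            violations.append(f"{name}: root file system must be read-only")
    return violations


def service_violations(service_spec):
    """Return a message if the Service uses the NodePort type."""
    if service_spec.get("type") == "NodePort":
        return ["services of type NodePort are not allowed"]
    return []


if __name__ == "__main__":
    pod = {"containers": [{"name": "app",
                           "securityContext": {"privileged": True}}]}
    for message in pod_violations(pod) + service_violations({"type": "NodePort"}):
        print(message)
```

A check like this can run in a pre-commit hook to catch obvious problems before Config Sync applies a change, but it does not replace admission control in the cluster.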
Anthos Service Mesh helps you monitor and manage an Istio-based service mesh. A service mesh is an infrastructure layer that helps create managed, observable, and secure communication across your services.
Anthos Service Mesh helps simplify the management of secure communications across services in the following ways:
- Managing authentication and encryption of traffic for supported protocols within the cluster using mutual Transport Layer Security (mTLS). Anthos Service Mesh manages the provisioning and rotation of mTLS keys and certificates for Anthos workloads without disrupting communications. Regularly rotating mTLS keys is a security best practice that helps reduce exposure in the event of an attack.
- Letting you configure network security policies based on service identity rather than on the IP address of peers on the network. Anthos Service Mesh is used to configure identity-aware access control (firewall) policies that let you create security policies that are independent of the network location of the workload. This approach simplifies the process of setting up service-to-service communications policies.
- Letting you configure policies that permit access from certain clients.
The blueprint guides you to install Anthos Service Mesh in your cluster. You configure the tenant namespace for automatic sidecar proxy injection. This approach ensures that apps in the tenant namespace are part of the mesh. You automatically configure Anthos Service Mesh using Config Sync. You configure the mesh to do the following:
- Enforce mTLS communication between services in the mesh.
- Limit outbound traffic from the mesh to only known hosts.
- Limit authorized communication between services in the mesh. For example, apps in the tenant namespace are only allowed to communicate with apps in the same namespace, or with a set of known external hosts.
- Route all outbound traffic through a mesh gateway where you can apply further traffic controls.
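As an illustration of the first and third items above, the following Python sketch builds an Istio `PeerAuthentication` resource that requires mTLS and an `AuthorizationPolicy` that allows requests only from the same namespace, then prints both as JSON. The resource names and the `tenant-a` namespace are hypothetical; the configuration that Config Sync actually applies may differ.

```python
import json


def strict_mtls(namespace):
    """Require mTLS for all workloads in the namespace."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


def allow_same_namespace(namespace):
    """Allow only requests whose source workload is in the same namespace."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            "action": "ALLOW",
            "rules": [{"from": [{"source": {"namespaces": [namespace]}}]}],
        },
    }


if __name__ == "__main__":
    for resource in (strict_mtls("tenant-a"), allow_same_namespace("tenant-a")):
        print(json.dumps(resource, indent=2))
```

The source namespace in an authorization rule is taken from the peer's mTLS identity, which is why strict mTLS and namespace-based authorization are usually configured together.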
Node taints and node affinity are Kubernetes mechanisms that let you influence how pods are scheduled onto cluster nodes.
Tainted nodes repel pods. Kubernetes will not schedule a Pod onto a tainted node unless the Pod has a toleration for the taint. You can use node taints to reserve nodes for use only by certain workloads or tenants. Taints and tolerations are often used in multi-tenant clusters. See the dedicated nodes with taints and tolerations documentation for more information.
Node affinity lets you constrain pods to nodes with particular labels. If a pod has a node affinity requirement, Kubernetes will not schedule the Pod onto a node unless the node has a label that matches the affinity requirement. You can use node affinity to ensure that pods are scheduled onto appropriate nodes.
You can use node taints and node affinity together to ensure tenant workload pods are scheduled exclusively onto nodes reserved for the tenant.
The blueprint helps you control the scheduling of the tenant apps in the following ways:
- Creating a GKE node pool dedicated to the tenant. Each node in the pool has a taint related to the tenant name.
- Automatically applying the appropriate toleration and node affinity to any Pod targeting the tenant namespace. You apply the toleration and affinity using Policy Controller mutations.
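The taint and toleration matching described above can be sketched in a few lines of Python. This is a simplified model of the Kubernetes scheduling rules, meant only to show why a Pod without the tenant toleration cannot land on a tenant node. The taint key `tenant` and the value `tenant-a` are hypothetical; check the node pool definition for the taint that the blueprint actually sets.

```python
def toleration_matches(toleration, taint):
    """Return True if a single toleration tolerates a single taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if key:
        if key != taint["key"]:
            return False
    elif operator != "Exists":
        # An empty key is only valid with the Exists operator,
        # in which case it matches every taint key.
        return False
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations, node_taints):
    """Return True if every NoSchedule or NoExecute taint is tolerated."""
    blocking = [t for t in node_taints
                if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(
        any(toleration_matches(tol, taint) for tol in pod_tolerations)
        for taint in blocking
    )


if __name__ == "__main__":
    tenant_taint = {"key": "tenant", "value": "tenant-a", "effect": "NoExecute"}
    tenant_toleration = {"key": "tenant", "operator": "Equal",
                         "value": "tenant-a", "effect": "NoExecute"}
    print(can_schedule([tenant_toleration], [tenant_taint]))
    print(can_schedule([], [tenant_taint]))
```

A toleration only permits scheduling onto the tainted nodes. Node affinity is what keeps tenant Pods from being scheduled elsewhere, which is why the blueprint applies both.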
It is a security best practice to adopt a principle of least privilege for your Google Cloud projects and resources like GKE clusters. This way, the apps that run inside your cluster, and the developers and operators that use the cluster, have only the minimum set of permissions required.
The blueprint helps you use least privilege service accounts in the following ways:
- Each GKE node pool receives its own service account. For example, the nodes in the tenant node pool use a service account dedicated to those nodes. The node service accounts are configured with the minimum required permissions.
- The cluster uses Workload Identity to associate Kubernetes service accounts with Google service accounts. This way, the tenant apps can be granted limited access to any required Google APIs without downloading and storing a service account key. For example, you can grant the service account permissions to read data from a Cloud Storage bucket.
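The link between a Kubernetes service account and a Google service account in Workload Identity relies on two strings: the IAM member that is granted `roles/iam.workloadIdentityUser` on the Google service account, and an annotation on the Kubernetes service account. The following Python sketch builds both. All names are placeholders, and the blueprint's Terraform and Config Sync configuration normally creates these bindings for you.

```python
def workload_identity_member(project_id, namespace, ksa_name):
    """IAM member that represents a Kubernetes service account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def ksa_annotation(gsa_email):
    """Annotation that points a Kubernetes service account at a Google one."""
    return {"iam.gke.io/gcp-service-account": gsa_email}


if __name__ == "__main__":
    # Placeholder project, namespace, and account names (assumptions).
    print(workload_identity_member("my-project-id", "tenant-a", "tenant-a-apps"))
    print(ksa_annotation("tenant-a-apps@my-project-id.iam.gserviceaccount.com"))
```

Because the binding is expressed through IAM and an annotation, no service account key has to be created, downloaded, or mounted into the Pod.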
The blueprint helps you restrict access to cluster resources in the following ways:
- You create a sample Kubernetes RBAC role with limited permissions to manage apps. You can grant this role to the users and groups who operate the apps in the tenant namespace. This way, those users only have permissions to modify app resources in the tenant namespace. They do not have permissions to modify cluster-level resources or sensitive security settings like Anthos Service Mesh policies.