Provision cross-device FL cloud infrastructure #336

Merged on Dec 5, 2023.

Commits (48):
68d88e8
feat: Adding exclusion of Terraform generated files
laurentgrangeau Nov 7, 2023
accd698
feat: Add Tensorboard Dockerfile
laurentgrangeau Nov 7, 2023
adbad26
feat: Add server Dockerfile
laurentgrangeau Nov 7, 2023
eeeb68e
feat: Add client Dockerfile
laurentgrangeau Nov 7, 2023
b0f8926
fix: Superlinter errors
laurentgrangeau Nov 7, 2023
3277b92
feat: Add manifests for deploying cross-device example
laurentgrangeau Nov 14, 2023
568551b
feat: Add TF script for cross-device example
laurentgrangeau Nov 14, 2023
e836e90
fix: Remove unwanted folder
laurentgrangeau Nov 14, 2023
37af830
fix: Moving manifests in a separate PR
laurentgrangeau Nov 14, 2023
1f64d3d
fix: Remove test file
laurentgrangeau Nov 14, 2023
2f491ae
Merge branch 'main' into cross-device-fl
laurentgrangeau Nov 14, 2023
8a23d18
feat: Make cross-device an optional module
laurentgrangeau Nov 14, 2023
be17718
Merge branch 'cross-device-fl' of /~https://github.com/GoogleCloudPlatf…
laurentgrangeau Nov 14, 2023
aa0e5bb
fix: Move to a module under the folder
laurentgrangeau Nov 14, 2023
6e76395
feat: Finalize cross-device module
laurentgrangeau Nov 14, 2023
0276a49
fix: Superlinter errors
laurentgrangeau Nov 14, 2023
71008ca
fix: Superlinter errors
laurentgrangeau Nov 14, 2023
4b6ca93
Merge branch 'main' into cross-device-fl
laurentgrangeau Nov 15, 2023
d671043
fix: PR comments
laurentgrangeau Nov 15, 2023
f85fe91
fix: PR comments
laurentgrangeau Nov 15, 2023
553a386
fix: Review of the PR
laurentgrangeau Nov 15, 2023
e4c3f1e
Merge branch 'main' into cross-device-fl
laurentgrangeau Nov 16, 2023
eb6fd5d
fix: PR comments
laurentgrangeau Nov 16, 2023
0c50238
fix: Remove gitignore tfvars
laurentgrangeau Nov 16, 2023
dd4624d
fix: Superlinter errors
laurentgrangeau Nov 16, 2023
3c91ee3
fix: PR comments
laurentgrangeau Nov 17, 2023
6db1900
fix: PR comments
laurentgrangeau Nov 17, 2023
0efd503
fix: PR comments
laurentgrangeau Nov 20, 2023
da80dd3
fix: Unused variable
laurentgrangeau Nov 20, 2023
8882783
fix: PR comments
laurentgrangeau Nov 27, 2023
d542df0
fix: Roles for SA in WI
laurentgrangeau Nov 27, 2023
617afa2
fix: Don't analyze SQL
laurentgrangeau Nov 27, 2023
27fa931
fix: Typo
laurentgrangeau Nov 27, 2023
3e7adbb
feat: Add roles to SA in namespace
laurentgrangeau Nov 30, 2023
1c39775
fix: Bugs
laurentgrangeau Nov 30, 2023
1fcc9f1
fix: Superlinter
laurentgrangeau Nov 30, 2023
c544812
fix: PR reviews
laurentgrangeau Nov 30, 2023
c7da50a
fix: SA namespace
laurentgrangeau Nov 30, 2023
ef89730
feat: Add instructions in the README
laurentgrangeau Nov 30, 2023
54dc3f2
Merge branch 'main' into cross-device-fl
laurentgrangeau Nov 30, 2023
311f983
fix: PR comments
laurentgrangeau Dec 4, 2023
f5b2e9d
fix: Rewrite README.md
laurentgrangeau Dec 4, 2023
2e4d78f
fix: Refactor README.md
laurentgrangeau Dec 4, 2023
935c699
fix: PR comments
laurentgrangeau Dec 5, 2023
9d2a443
Merge branch 'main' into cross-device-fl
laurentgrangeau Dec 5, 2023
c108fb0
fix: Move prerequisites
laurentgrangeau Dec 5, 2023
dadcb1f
Merge branch 'cross-device-fl' of /~https://github.com/GoogleCloudPlatf…
laurentgrangeau Dec 5, 2023
bb3b387
fix: Remove minimum nodes
laurentgrangeau Dec 5, 2023

feat: Add instructions in the README
laurentgrangeau committed Nov 30, 2023
commit ef89730c0fb2bbbb44e156f42202aab322ead9ed
46 changes: 43 additions & 3 deletions terraform/cross-device/README.md
@@ -1,13 +1,53 @@
# Cross-device Federated Learning

This module is an example of an end-to-end demo for cross-device Federated Learning. This example deploys six workloads:

- `aggregator`: an offline job that reads device gradients and calculates the aggregated result with differential privacy (DP)
- `collector`: an offline job that runs periodically to query active tasks and encrypted gradients, and decides when to kick off aggregation
- `modelupdater`: an offline job that listens to events and publishes results so that devices can download them
- `task-assignment`: a front-end service that distributes training tasks to devices
- `task-management`: an offline job that manages tasks
- `task-scheduler`: an offline job that either runs periodically or is triggered by events

This example builds on top of the infrastructure that the
[blueprint provides](../../../../README.md), and follows the best practices the
blueprint establishes.

## Prerequisites

- A POSIX-compliant shell
- Git (tested with version 2.41)
- Docker (tested with version 20.10.21)

## Infrastructure

This module creates the following resources:

- A Spanner instance for storing the training status
- Pub/Sub topics that act as message buses between the microservices
- Cloud Storage buckets for storing the trained models
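
As a rough illustration of the three resource kinds listed above, the following sketch uses the Google provider's `google_spanner_instance`, `google_pubsub_topic`, and `google_storage_bucket` resources. All names, the region, and the sizing are assumptions made for this example; they are not the values this module uses.

```hcl
# Illustrative only: names, region, and sizing are assumptions,
# not the values defined by the cross-device module.

# Spanner instance that could hold the training status.
resource "google_spanner_instance" "training_status" {
  name             = "fl-training-status"
  config           = "regional-europe-west1"
  display_name     = "FL training status"
  processing_units = 100
}

# One Pub/Sub topic acting as a message bus between microservices.
resource "google_pubsub_topic" "aggregator" {
  name = "fl-aggregator-topic"
}

# Bucket that could store trained models.
resource "google_storage_bucket" "models" {
  name                        = "example-project-fl-models" # bucket names must be globally unique
  location                    = "EU"
  uniform_bucket_level_access = true
}
```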

To deploy this solution, set the `cross-device` flag to `true`.
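
A common Terraform pattern for an optional module is to gate it with `count` on a boolean variable. The sketch below shows that pattern under stated assumptions: the variable declarations, the module path `./cross-device`, and the `namespace` input are hypothetical and may not match how this repository wires the flag.

```hcl
# Hypothetical sketch: not the repository's actual wiring.
variable "cross-device" {
  description = "Whether to deploy the cross-device example."
  type        = bool
  default     = false
}

variable "tenant_namespace" {
  description = "Namespace that receives the cross-device workloads."
  type        = string
}

module "cross_device" {
  # Instantiate the module only when the flag is true.
  count  = var.cross-device ? 1 : 0
  source = "./cross-device" # assumed path, relative to the terraform directory

  namespace = var.tenant_namespace # assumed input name
}
```

When a module uses `count`, its outputs are addressed by index, for example `module.cross_device[0].<output>`.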

To ensure end-to-end confidentiality, you need to enable confidential nodes.

Confidential nodes require a VM family that supports this feature, such as **N2D** or **C2D**.
When using confidential nodes, set `enable_confidential_nodes` to `true` and `cluster_tenant_pool_machine_type` to `n2d-standard-8`. In addition, to run the minimum number of replicas required during deployment, you need at least 4 nodes.

The cross-device workloads are deployed in a namespace. Set the `tenant_namespace` variable to the name of the namespace in which you want to deploy them.

### Containers running in different namespaces, in the same GKE cluster

1. Provision infrastructure by following the instructions in the [main README](../../../../README.md).
1. From Cloud Shell, change the working directory to the `terraform` directory.
1. Initialize the following Terraform variables:

```hcl
enable_confidential_nodes = true
cluster_tenant_pool_machine_type = "n2d-standard-4"
cluster_default_pool_machine_type = "n2d-standard-4"
cross-device = true
tenant_namespace = "main"
```

1. Run `terraform apply`, and wait for Terraform to complete the provisioning process.
1. Open the [GKE Workloads Dashboard](https://cloud.google.com/kubernetes-engine/docs/concepts/dashboards#workloads)
and wait for the workload Deployments and Services to be ready.