This repo is used to deploy an EKS cluster to AWS. CI/CD is managed through Spacelift.
.: Contains references to all the "Things" that are going to be deployed
├── common-resources: Resources that are environment independent
│ ├── contexts: Contexts that we'll attach across environments
│ └── policies: Rego policies that can be attached to 0..* spacelift stacks
├── dev: Development/sandbox environment
│ ├── spacelift: Terraform scripts to manage spacelift resources
│ │ └── dpe-k8s/dpe-sandbox: Spacelift specific resources to manage the CI/CD pipeline
│ └── stacks: The deployable cloud resources
│ ├── dpe-auth0: Stack used to provision and setup auth0 IDP (Identity Provider) settings
│ ├── dpe-sandbox-k8s: K8s + supporting AWS resources
│ └── dpe-sandbox-k8s-deployments: Resources deployed inside of a K8s cluster
└── modules: Templatized collections of terraform resources that are used in a stack
├── apache-airflow: K8s deployment for apache airflow
│ └── templates: Resources used during deployment of airflow
├── argo-cd: K8s deployment for Argo CD, a declarative, GitOps continuous delivery tool for Kubernetes.
│ └── templates: Resources used during deployment of this helm chart
├── cert-manager: Handles provisioning TLS certificates for the cluster
├── envoy-gateway: API Gateway for the cluster securing and providing secure traffic into the cluster
├── postgres-cloud-native: Used to provision a postgres instance
├── postgres-cloud-native-operator: Operator that manages the lifecycle of postgres instances on the cluster
├── demo-network-policies: K8s deployment for a demo showcasing how to use network policies
├── demo-pod-level-security-groups-strict: K8s deployment for a demo showcasing how to use pod level security groups in strict mode
├── sage-aws-eks: Sage specific EKS cluster for AWS
├── sage-aws-eks-addons: Sets up additional resources that need to be installed post creation of the EKS cluster
├── sage-aws-k8s-node-autoscaler: K8s node autoscaler using spotinst ocean
├── sage-aws-ses: AWS SES (Simple email service) setup
├── sage-aws-vpc: Sage specific VPC for AWS
├── signoz: SigNoz provides APM, logs, traces, metrics, exceptions, & alerts in a single tool
├── trivy-operator: K8s deployment for trivy, along with a few supporting charts for security scanning
│ └── templates: Resources used during deployment of these helm charts
├── victoria-metrics: K8s deployment for victoria metrics, a promethus like tool for cluster metric collection
│ └── templates: Resources used during deployment of these helm charts
This root main.tf
contains all the "Things" that are going to be deployed.
In this top level directory you'll find that the terraform files are bringing together
everything that should be deployed in spacelift declerativly. The items declared in
this top level directory are as follows:
- A single root administrative stack that is responsible for taking each and every resource to deploy it to spacelift.
- A spacelift space that everything is deployed under called
environment
. - Reference to the
terraform-registry
modules directory. - Reference to
common-resources
or reusable resources that are not environment specific. - The environment specific resources such as
dev
,staging
, orprod
This structure is looking to /~https://github.com/antonbabenko/terraform-best-practices/tree/master/examples for inspiration.
The VPC used in this project is created with the AWS VPC Terraform module. It contains a number of defaults for our use-case at sage. Head on over to the module definition to learn more.
AWS EKS is a managed kubernetes cluster that handles many of the tasks around running a k8s cluster. On-top of it we are providing the configurable parameters in order to run a number of workloads.
API access to the kubernetes cluster endpoint is set to Public and private
.
Reading:
This allows one outside of the VPC to connect via kubectl
and related tools to
interact with kubernetes resources. By default, this API server endpoint is public to
the internet, and access to the API server is secured using a combination of AWS
Identity and Access Management (IAM) and native Kubernetes Role Based Access Control
(RBAC).
You can enable private access to the Kubernetes API server so that all communication between your worker nodes and the API server stays within your VPC. You can limit the IP addresses that can access your API server from the internet, or completely disable internet access to the API server.
This section describes the VPC CNI (Container Network Interface) that is being used within the EKS cluster. The plugin is responsible for allocating VPC IP addresses to Kubernetes nodes and configuring the necessary networking for Pods on each node.
Allows us to assign EC2 security groups directly to pods running in AWS EKS clusters.
This can be used as an alternative or in conjunction with Kubernetes network policies
.
See modules/demo-pod-level-security-groups-strict
for more context on how this works.
Controls network traffic within the cluster, for example pod to pod traffic.
See modules/demo-network-policies
for more context on how this works.
Further reading:
- https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html
- https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html
- https://aws.amazon.com/blogs/containers/introducing-security-groups-for-pods/
- https://kubernetes.io/docs/concepts/services-networking/network-policies/
We use spot.io to manage the nodes attached to each of the EKS cluster. This tool has scale-to-zerio capabilities and will dynamically add or removes nodes from the cluster depending on the required demand. The autoscaler is templatized and provided as a terraform module to be used within an EKS stack.
Setup of spotio (Manual per AWS Account):
- Subscribe through the AWS Marketplace: https://aws.amazon.com/marketplace/saas/ordering?productId=bc241ac2-7b41-4fdd-89d1-6928ec6dae15
- "Set up your account" on the spotio website and link it to an existing organization
- Link the account through the AWS UI:
- Create a policy (See the JSON in the spotio UI)
- Create a role (See instructions in the spotio UI)
After this has been setup the last item is to get an API token from the spotio UI and add it to the AWS secret manager.
- Log into the spot UI and go to https://console.spotinst.com/settings/v2/tokens/permanent
- Create a new Permanent token, name it
{AWS-Account-Name}-token
or similar - Copy the token and create an
AWS Secrets Manager
Plaintext secret namedspotinst_token
with a descriptionSpot.io token
To connect to the EKS stack running in AWS you'll need to make sure that you have SSO setup for the account you'll be using. Once setup run the commands below:
# Login with the profile you're using to authenticate. For example mine is called
# `dpe-prod-admin`
aws sso login --profile dpe-prod-admin
# Update your kubeconfig with the proper values. This is saying "Authenticate with
# AWS using my SSO session for the profile `dpe-prod-admin`. After authenticated
# assuming that we want to use the `role/eks_admin_role` to connect to the k8s
# cluster". This will update your kubeconfig with permissions to access the cluster.
aws eks update-kubeconfig --region us-east-1 --name dpe-k8 --profile dpe-prod-admin
AWS Guard duty is being used to perform audit trails for the EKS cluster, it involves 2 components for the cluster:
The initial configuration of these is handled through the securitycentral
IT account.
Runtime Monitoring is installed manually via terraform modules, allowing it to be torn
down when we destroy the VPC and EKS cluster.
In addition to this scanning that is in place we are also taking advantage of the trivy-operator, a "Kubernetes-native security toolkit". The use of this tool will give us regular scans of the resources that we are deploying to the kubernetes cluster. As resources are added trivy will spin up and add more to the existing reports. In addition policy-report has been installed to the cluster to give a UI to review the results without needing to dig into kubernetes resources. The use of these reports will be regularly reviewed as new applications are added to the cluster, in addition the use of the SBOM (Software bill of materials) will allow us to review for any security advisories.
Deployment of applications to the kubernetes cluster is handled through the combination of terraform (.tf) scripts, spacelift (CICD tool), and ArgoCd or Flux CD (Declarative definitions for applications).
To start of the deployment journey the first step is to create a new terraform module that encapsulating everything that is required to deploy any cloud resources, in addition to defining any kubernetes specific resources that will be deployed to the cluster.
This is supplemental information to what is defined within the modules readme
- Create a new directory for your module within
./modules
, it should be named after what you are deploying. - At a minimum you must define a
main.tf
andversions.tf
script that define:- What cloud resources you are deploying
- The providers required for deploying those cloud resources
- You may also define any number of additional files and resources that are specific to the module you are creating.
Deployment of applications and kubernetes specific resources to the cluster is handled via Argo CD, which is a declarative, GitOps continuous delivery tool for Kubernetes. The usage of this tool instead of terraform allows for continous monitoring of Kubernetes resources to align expected state with the actual state of the cluster. It has extensive support to deploy helm charts, and Kubernetes yaml files.
The declarative setup for ArgoCD is done through defining specific Kubernetes resources as defined in the ArgoCD Custom Resource Definition (CRD).
For our use cases we will typically be creating an Application Specification
which defines a number of pointers and configuration elements for ArgoCD to deploy to
the Kubernetes cluster. In addition we are taking advantage of Multiple Sources for an Application
to install public helm charts with our custom values.yaml
files without the need to
create our own helm chart and host it in our repository. See the following readme for a real example
we are using for deploying out Apache Airflow
instance.
As of August 2024 the only access to resources on the kubernetes cluster is occuring through kubectl port forward sessions. No internet facing load balancers are avaiable to connect to.
Using a tool like K9s
navigate to the pod in question and start a port forward session.
Then open up a browser and go to the localhost
and port you have specified in your
port forward session.
(Area of future work to have better secrets management): https://sagebionetworks.jira.com/browse/IBCDPE-1038
Most, but not all, resources will have a login page where you'll need to enter in a username and password. These resources will have a Kubernetes secret in base64 that you'll be able to look at and get the appropriate username/password to log into the tool. Once you obtain the Base64 data you'll need to decode it and then log into the tool. Examples:
- ArgoCD: Secret is named
argocd-initial-admin-secret
with a default username ofadmin
- Grafana: Secret is named
victoria-metrics-k8s-stack-grafana
with a default username ofadmin
As docker limits the number of unauthenticated pulls for images for anonymous accounts
we are using a DPE service account named dpesagebionetworks
to authenticate all pulls
from docker. To accomplish this task the following steps were taken:
- Log into docker hub with credentials stored in lastpass
- Create a new Personal access token for the account
- Copy the personal access token into the Spacelift stack for "Kubernetes Deployments" as an environment variable named "TF_VAR_docker_access_token"
- Update the
variables.tf
in the relevant module to include the variable as shown below - Update the
main.tf
to add a new kubernetes secret (Below) for all namespaces where an authenticated docker pull needs to occur - Update any helm charts to reference the authentication as described in this document
- Deploy the changes via terraform to the kubernetes cluster and apply the changes via ArgoCD or FluxCD
To add to variables.tf
variable "docker_server" {
description = "The docker registry URL"
default = "https://index.docker.io/v1/"
type = string
}
variable "docker_username" {
description = "Username to log into docker for authenticated pulls"
default = "dpesagebionetworks"
type = string
}
variable "docker_access_token" {
description = "The access token to use for docker authenticated pulls. Created via by setting 'TF_VAR_docker_access_token' within spacelift as an environment variable"
type = string
}
variable "docker_email" {
description = "The email for the docker account"
default = "dpe@sagebase.org"
type = string
}
To add to main.tf
resource "kubernetes_secret" "docker-cfg" {
metadata {
name = "docker-cfg"
namespace = var.namespace
}
type = "kubernetes.io/dockerconfigjson"
data = {
".dockerconfigjson" = jsonencode({
auths = {
"${var.docker_server}" = {
"username" = var.docker_username,
"password" = var.docker_access_token,
"email" = var.docker_email
"auth" = base64encode("${var.docker_username}:${var.docker_access_token}")
}
}
})
}
}
If you need to fully tear down all of the infra start at the smallest point and work outwards. Destroy items in this order:
- Go into the argoCD UI and delete all applications
- Run
tofu destroy --auto-approve
as a task in spacelift for the Kubernetes Deployments stack - Run
tofu destroy --auto-approve
as a task in spacelift for the infrastructure deployment stack
Here are some instructions on setting up spacelift.
This document describes the abbreviated process below: https://docs.spacelift.io/integrations/cloud-providers/aws#setup-guide
- Create a new role and set it's name to something unique within the account, such as
spacelift-admin-role
- Description: "Role for spacelift CICD to assume when deploying resources managed by terraform"
- Use the custom trust policy below:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::324880187172:root"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringLike": {
"sts:ExternalId": "sagebionetworks@*"
}
}
},
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::{{AWS ACCOUNT ID}}:root"
},
"Action": "sts:AssumeRole"
}
]
}
- Attach a few policies to the role:
PowerUserAccess
- Create an inline policy to allow interaction with IAM (Needed if TF is going to be creating, editing, and deleting IAM roles/policies):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"iam:*Role",
"iam:*RolePolicy",
"iam:*RolePolicies",
"iam:*Policy",
"iam:*PolicyVersion",
"iam:*OpenIDConnectProvider",
"iam:*InstanceProfile",
"iam:ListPolicyVersions",
"iam:ListGroupsForUser",
"iam:ListAttachedUserPolicies"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"iam:CreateUser",
"iam:AttachUserPolicy",
"iam:ListPolicies",
"iam:TagUser",
"iam:GetUser",
"iam:DeleteUser",
"iam:CreateAccessKey",
"iam:ListAccessKeys",
"iam:DeleteAccessKey"
],
"Resource": "arn:aws:iam::{{AWS ACCOUNT ID}}:user/smtp_user"
}
]
}
- Add a new
spacelift_aws_integration
resources to thecommon-resources/aws-integrations
directory.