Production-grade Kubernetes cluster setup on AWS with Terraform and Helm, including Istio for service mesh, Kafka for messaging, and KMS keys for encryption. Configured for high availability, scalability, and secure management of resources.
This project sets up a robust infrastructure on AWS using Terraform, Helm, and Kubernetes, primarily focused on deploying and managing services in an EKS (Elastic Kubernetes Service) cluster. It includes various components and configurations to support a production-grade environment.
- EKS Cluster: The project provisions an EKS cluster using Terraform, which serves as the foundation for deploying Kubernetes workloads.
- Namespaces: Several Kubernetes namespaces are created for organizational purposes:
  - `istio-system`: For Istio service mesh components.
  - `cert-manager`: For managing TLS certificates.
  - `cve-processor`, `kafka`, `cve-consumer`, `eks-autoscaler`, `cve-operator`, `llm-cve`: For specific applications and services.
  - `amazon-cloudwatch`, `monitoring`, `metrics-server`: For monitoring and logging.
- KMS Keys:
  - EKS Secrets Key: Manages encryption for EKS secrets, with key rotation enabled.
  - EBS Key: Manages encryption for EBS volumes, with specific IAM permissions and key rotation.
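The two keys above can be sketched in Terraform roughly as follows. This is an illustrative sketch, not the project's actual code: the resource names and the IAM-policy comment are assumptions.

```hcl
# Illustrative sketch of the two KMS keys described above; resource
# names are assumptions, not the project's actual identifiers.

resource "aws_kms_key" "eks_secrets" {
  description             = "KMS key for EKS secrets encryption"
  deletion_window_in_days = 7
  enable_key_rotation     = true
}

resource "aws_kms_key" "ebs" {
  description              = "KMS key for EBS encryption"
  key_usage                = "ENCRYPT_DECRYPT"
  customer_master_key_spec = "SYMMETRIC_DEFAULT"
  deletion_window_in_days  = 7
  enable_key_rotation      = true
  # The EBS key would additionally carry a key policy granting the
  # node group's IAM role permission to use it for volume encryption.
}
```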
- Security Groups:
  - `eks_node_group_allow_istio_sg`: Allows incoming TCP traffic on port 15017 from any IP address to facilitate communication for Istio components (15017 is the port used by Istiod's sidecar-injection webhook).
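A hedged Terraform sketch of that rule is shown below; the security-group name matches the one above, but the VPC reference and resource wiring are assumptions for illustration.

```hcl
# Sketch of the Istio webhook security-group rule described above.
# The aws_vpc reference is a placeholder for the project's VPC resource.

resource "aws_security_group" "eks_node_group_allow_istio_sg" {
  name   = "eks_node_group_allow_istio_sg"
  vpc_id = aws_vpc.eks_vpc.id

  ingress {
    description = "Istiod sidecar-injection webhook"
    from_port   = 15017
    to_port     = 15017
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```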
This infrastructure setup provides a comprehensive solution for deploying, managing, and scaling applications in a Kubernetes environment on AWS, integrating with monitoring, logging, and security services.
- Istio Base and Istiod:
  - Istio is deployed to provide service mesh capabilities for the cluster, enabling secure and observable communication between microservices.
  - The Istio base chart installs the Custom Resource Definitions (CRDs) required for Istio's operation.
  - Istiod serves as the control plane, responsible for traffic routing, service discovery, and security, with features like mutual TLS and logging.
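As an illustration of the mutual-TLS capability mentioned above, a mesh-wide STRICT policy can be expressed with Istio's `PeerAuthentication` resource. This is a generic Istio example, not necessarily a manifest from this project:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applying to the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT            # reject plaintext traffic between sidecars
```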
- Istio Ingress Gateway:
  - The Istio Ingress Gateway manages the entry point for external traffic into the cluster, enabling secure and controlled access to services.
  - Custom configurations include proxy settings for enhanced control and reliability in managing incoming traffic.
- Cert-Manager:
  - Cert-Manager, when enabled, automates the issuance and renewal of TLS certificates within the Kubernetes cluster.
  - It supports certificate management for workloads to ensure secure communication.
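Cert-manager issues certificates through an `Issuer` or `ClusterIssuer`; given the Route53 hosted zone referenced later in this document, a DNS-01 setup might look like the sketch below. The issuer name, email, region, and hosted-zone ID are placeholders, and this is a generic cert-manager example rather than this repo's manifest:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod            # hypothetical issuer name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com        # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            hostedZoneID: your_hosted_zone_id   # placeholder
```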
- Kafka:
  - Kafka is deployed to handle event streaming, providing a highly scalable platform for processing real-time data within the cluster.
  - SASL authentication and SSL/TLS encryption secure communication between Kafka brokers, producers, and consumers.
  - Kafka is integrated with Prometheus to monitor its health and performance.
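A client connecting to a broker secured with SASL/SCRAM would typically carry settings like the properties fragment below. This is a generic Kafka client example: the bootstrap address matches the in-cluster service name used elsewhere in this document, the credentials are placeholders, and the security protocol (`SASL_SSL` vs. `SASL_PLAINTEXT`) depends on how the broker listeners are configured.

```properties
bootstrap.servers=kafka.kafka.svc.cluster.local:9092
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="user1" \
  password="<kafka-password>";
```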
- Prometheus and Grafana Stack:
  - Prometheus monitors cluster resources, collects metrics from Kubernetes components, and provides alerts.
  - Grafana is integrated for visualization, offering pre-configured dashboards for real-time insight into cluster health and workloads.
  - Kafka metrics are exported via the Prometheus Kafka Exporter, enabling detailed monitoring of Kafka pods.
- Fluent Bit for CloudWatch:
  - Fluent Bit gathers logs from the cluster and ships them to Amazon CloudWatch for centralized log management.
  - This helps monitor application logs and system events for enhanced observability and troubleshooting.
- Cluster Autoscaler:
  - Automatically adjusts the number of nodes in the EKS cluster based on resource demand.
  - Ensures optimal resource allocation by adding or removing nodes as workloads fluctuate.
  - Integrated with IAM roles and service accounts for secure scaling operations.
- Metrics Server:
  - Metrics Server collects resource usage data from Kubernetes nodes and pods, providing the metrics needed for autoscaling.
  - This enables the Kubernetes Horizontal Pod Autoscaler (HPA) to scale pods based on real-time CPU and memory usage.
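An HPA driven by these metrics looks roughly like the manifest below. The target Deployment name and thresholds are hypothetical, chosen only to illustrate the mechanism against one of this project's namespaces:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cve-consumer-hpa          # hypothetical name
  namespace: cve-consumer
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cve-consumer            # hypothetical target workload
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out above 70% average CPU
```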
- Cluster Autoscaler values:

```yaml
awsRegion: us-east-1
image:
  repository: "vkneu7/eks-autoscaler"
  tag: v1.30.0-amd64
  pullPolicy: IfNotPresent
  pullSecrets:
    - name: docker-hub-pat
rbac:
  create: true
  pspEnabled: false
  clusterScoped: true
serviceAccount:
  annotations: {}
  create: true
  name: "cluster-autoscaler-service-account"
  automountServiceAccountToken: true
secrets:
  dbPassword: "placeholder_for_db_password"
  dockerConfigJson: "placeholder_for_docker_config_json"
  kafkaPassword: "placeholder_for_kafka_password"
```
- Fluent Bit values:

```yaml
clusterName: csye7125
regionName: us-east-1
fluentBitHttpPort: "2020"
fluentBitHttpServer: "On"
fluentBitReadFromHead: "Off"
fluentBitReadFromTail: "On"
```
- Kafka values:

```yaml
sasl:
  enabledMechanisms: PLAIN,SCRAM-SHA-256,SCRAM-SHA-512
  client:
    users:
      - user1
    passwords: "*****"
controller:
  resourcesPreset: "medium"
provisioning:
  enabled: true
  numPartitions: 1
  replicationFactor: 1
  topics:
    - name: cve
      partitions: 3
      replicationFactor: 3
      config:
        max.message.bytes: 64000
        flush.messages: 1
  # Wait for the Istio sidecar to report ready, and shut it down
  # when the provisioning job exits so the pod can complete.
  postScript: |
    trap "curl --max-time 2 -s -f -XPOST http://127.0.0.1:15020/quitquitquit" EXIT;
    while ! curl -s -f http://127.0.0.1:15020/healthz/ready; do
      sleep 1;
    done;
    echo "Ready!"
```
- PostgreSQL values:

```yaml
global:
  postgresql:
    auth:
      postgresPassword: "git"
      username: "web_app"
      password: "*******"
      database: "cve"
    service:
      ports:
        postgresql: "5432"
primary:
  resourcesPreset: "small"
  labels:
    app: cve-db
  podLabels:
    app: cve-db
metrics:
  ## @param metrics.enabled Start a prometheus exporter
  ##
  enabled: true
```
- Prometheus-Grafana values:

```yaml
grafana:
  adminPassword: "******"  # admin password for the Grafana dashboard
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: "postgres-metrics"
        static_configs:
          - targets:
              - "postgresql-metrics.cve-consumer.svc.cluster.local:9187"
      - job_name: "kafka-jmx-metrics"
        static_configs:
          - targets:
              - "kafka-jmx-metrics.kafka.svc.cluster.local:5556"
```
- Kafka Exporter values:

```yaml
kafkaServer:
  - kafka.kafka.svc.cluster.local:9092
prometheus:
  serviceMonitor:
    enabled: true
    namespace: monitoring
    apiVersion: "monitoring.coreos.com/v1"
    interval: "30s"
    additionalLabels:
      release: prometheus
    targetLabels: []
sasl:
  enabled: true
  handshake: true
  scram:
    enabled: true
    mechanism: scram-sha256
    # add username and password
    username: user1
    password: "****"
```
Ensure the following tools are installed and configured before proceeding: Terraform, the AWS CLI, kubectl, and Helm.
- Clone the repository:

```sh
git clone git@github.com:your-username/infra-aws.git
cd infra-aws
```
- Add a terraform.tfvars file. Example below:

```hcl
provider_region = "us-east-1"

# VPC
eks_vpc_cidr_block = "10.0.0.0/16"
eks_vpc_tag_name   = "eks-vpc"

# Public subnet 1
eks_public_subnet1_availability_zone = "us-east-1a"
eks_public_subnet1_cidr_block        = "10.0.1.0/24"
eks_public_subnet_tag                = "eks-public-subnet"

# Public subnet 2
eks_public_subnet2_availability_zone = "us-east-1b"
eks_public_subnet2_cidr_block        = "10.0.2.0/24"

# Public subnet 3
eks_public_subnet3_availability_zone = "us-east-1c"
eks_public_subnet3_cidr_block        = "10.0.3.0/24"

# Private subnet 1
eks_private_subnet1_availability_zone = "us-east-1a"
eks_private_subnet1_cidr_block        = "10.0.4.0/24"
eks_private_subnet_tag                = "eks-private-subnet"

# Private subnet 2
eks_private_subnet2_availability_zone = "us-east-1b"
eks_private_subnet2_cidr_block        = "10.0.5.0/24"

# Private subnet 3
eks_private_subnet3_availability_zone = "us-east-1c"
eks_private_subnet3_cidr_block        = "10.0.6.0/24"

# Route table
route_table_cidr_block = "0.0.0.0/0"

# EKS module
eks_cluster_name                = "csye7125"
eks_cluster_version             = "1.29"
eks_cluster_authentication_mode = "API_AND_CONFIG_MAP"

# EKS managed node group
ami_type        = "AL2_x86_64"
min_size        = 3
max_size        = 7
desired_size    = 4
instance_types  = ["c3.large"]
capacity_type   = "ON_DEMAND"
max_unavailable = 1

# Block device mappings
device_name     = "/dev/xvda"
ebs_volume_size = 20
ebs_volume_type = "gp2"

# Tags
environment_tag = "dev"

# Cluster encryption config
cluster_encryption_config_resources = ["secrets"]
cluster_enabled_log_types           = ["api", "audit", "authenticator", "controllerManager", "scheduler"]

# KMS keys
eks_secrets_key_description = "KMS key for EKS secrets encryption"
deletion_window_in_days     = 7
ebs_key_description         = "KMS key for EBS encryption"
ebs_key_usage               = "ENCRYPT_DECRYPT"
customer_master_key_spec    = "SYMMETRIC_DEFAULT"

github_token       = "your_github_api_key"
autoscaler_version = "1.0.0"

# Ingress values
istio_ingress_values = <<-EOT
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
      external-dns.alpha.kubernetes.io/hostname: "grafana.prod.skynetx.me,cve.prod.skynetx.me"
    ports:
      - port: 80
        targetPort: 8080
        name: http2
      - port: 443
        targetPort: 8443
        name: https
      - port: 15021
        targetPort: 15021
        name: status-port
    type: LoadBalancer
EOT

# EKS Blueprints addons
cert_manager_route53_hosted_zone_arns = ["arn:aws:route53:::hostedzone/your_hosted_zone_id"]
route53_hosted_zone                   = "prod.skynetx.me"
```
- Set up the AWS profile on the CLI:

```sh
export AWS_PROFILE=dev
```

- Initialize Terraform:

```sh
terraform init
```
- Review the Terraform configuration: Modify variables in terraform.tfvars as needed for your environment.
- Plan infrastructure changes:

```sh
terraform plan
```

- Apply infrastructure changes:

```sh
terraform apply
```
- Verify the infrastructure: After Terraform applies the changes successfully, verify the infrastructure on AWS.
Instructions for tearing down the infrastructure:

- Destroy infrastructure:

```sh
terraform destroy
```

- Confirm destruction: Terraform will prompt you to confirm destruction. Enter `yes` to proceed with tearing down the infrastructure.