Set min nodes to 0 for worker and user. #2168

Merged 108 commits on Feb 10, 2024
Commits
058028c
Set min nodes to 0 for worker and user.
Dec 22, 2023
68f34b0
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
pt247 Dec 22, 2023
7031d46
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
pt247 Dec 22, 2023
125f390
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
pt247 Dec 22, 2023
1b8a6c1
Move user scheduler to general and add tags and labels to asg and nod…
Jan 3, 2024
26150be
Change the tag key for autoscaler to work.
Jan 3, 2024
492835f
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 3, 2024
8123adc
Retain previous node selectors.
Jan 3, 2024
d602e7c
resolve git merge.
Jan 3, 2024
fd141de
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 3, 2024
6533542
Formatting changes.
Jan 3, 2024
7d26843
Formatting changes.
Jan 3, 2024
3e5044f
ASG needs to exist before it's tagged, moving aws_autoscaling_group_ta…
Jan 5, 2024
1197fe3
Remove original node selector, as it prevents the pod from matching nodese…
Jan 5, 2024
1d406a3
Terraform format changes.
Jan 5, 2024
8ca133b
Minor formatting changes.
Jan 5, 2024
2079175
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
pt247 Jan 5, 2024
a2c891a
Upgrade provider version.
Jan 5, 2024
214b2f3
Pin hashicorp/aws version.
Jan 5, 2024
9d825b6
Pin hashicorp/aws version.
Jan 5, 2024
0d8b692
Revert Pin hashicorp/aws version.
Jan 5, 2024
4f82abf
Pin hashicorp/aws version.
Jan 6, 2024
b04cd1b
Pin hashicorp/aws version.
Jan 6, 2024
4855087
Set aws region.
Jan 6, 2024
1850624
Move aws_autoscaling_group_tag to initialization stage.
Jan 6, 2024
ee3b171
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 6, 2024
30c0025
Revert aws version change.
Jan 6, 2024
2e18b86
Revert aws version change.
Jan 6, 2024
73fbc6c
Fix terraform template.
Jan 7, 2024
bfe9ce2
Move aws_autoscaling_group_tag back to kubernetes stage.
Jan 7, 2024
7996faa
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 7, 2024
7c9f6ec
Add a module for ASG tagging at 03-kubernetes-initialize/aws-asg-tagg…
Jan 7, 2024
2f33394
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 7, 2024
8228996
Terraform formatting changes.
Jan 7, 2024
c375059
Pin provider version.
Jan 7, 2024
9fe79a2
Tagging aws region.
Jan 7, 2024
55200f3
Terraform formatting changes.
Jan 7, 2024
34f318f
Tagging changes.
Jan 7, 2024
9257555
Update versions.tf provide region in AWS configuration
pt247 Jan 8, 2024
cd9b5f8
Update versions.tf removed AWS from stage level
pt247 Jan 8, 2024
d4c4251
Update versions.tf move AWS provider to level of tagging
pt247 Jan 8, 2024
8a4ec12
Update versions.tf
pt247 Jan 8, 2024
2bac3be
Add region to input vars of kubernetes_initialize.
Jan 8, 2024
30d0e5b
Revert version changes.
Jan 8, 2024
433e9cb
Fix provider error.
Jan 8, 2024
c123bad
Remove if condition from for each.
Jan 8, 2024
5d8c344
Fix aws-asg-taggin.tf.
Jan 8, 2024
342774d
Formatting changes.
Jan 8, 2024
3a12e60
Fix aws-asg-taggin.tf.
Jan 8, 2024
033d76b
Add aws validation for tagging.
Jan 8, 2024
8d53afd
Count not supported using length.
Jan 8, 2024
0acd071
Revert module level aws filtering.
Jan 8, 2024
28eef1f
Revert module level aws filtering.
Jan 8, 2024
f25decc
Terraform format changes.
Jan 8, 2024
9157bdd
Fix aws-asg-tagging.tf.
Jan 8, 2024
cff6ca3
Add aws provider region.
Jan 8, 2024
9307175
Remove aws filter from stage level and add it to module level.
Jan 8, 2024
ef3b2f7
Terraform format changes.
Jan 8, 2024
1a00157
Remove aws filter from stage level and add it to module level.
Jan 8, 2024
2db4db9
Terraform format changes.
Jan 8, 2024
e3a3ea3
Get aws_get_asg_node_group_mapping from stage 2 output.
Jan 9, 2024
192020b
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 9, 2024
560fc1a
Get aws_get_asg_node_group_mapping from stage 2 output.
Jan 9, 2024
970b2a5
Get aws_get_asg_node_group_mapping from stage 2 output.
Jan 9, 2024
a0315c3
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 9, 2024
29e7f65
Terraform format changes.
Jan 9, 2024
852a929
Fix stage level filter for tagging.
Jan 9, 2024
301a2db
Try and remove tagging from phase-03
Jan 9, 2024
b6c3ca4
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
pt247 Jan 9, 2024
f9992a0
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
costrouc Jan 10, 2024
9b06091
Add back tagging config to phase-03.
Jan 9, 2024
a2206cd
Add precondition check if provider is aws for testing.
Jan 11, 2024
5970eef
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 11, 2024
4a253d3
Terraform fmt changes.
Jan 11, 2024
5764dc9
Add debugging.
Jan 11, 2024
ecafb70
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
Jan 11, 2024
b5e1c9f
Add debugging.
Jan 11, 2024
9fd765b
Add debugging.
Jan 11, 2024
3faba0d
Add debugging.
Jan 11, 2024
a471245
Send empty asg_node_group_map to tagging module.
Jan 11, 2024
ed1b591
Remove trace logs.
Jan 11, 2024
177a8bf
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
Jan 15, 2024
13b69e9
Minor change to trigger CI.
Jan 15, 2024
a6b2d98
Add tagging in post deploy of phase 3, and remove tagging in phase 3.
Jan 17, 2024
4ea6b45
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 17, 2024
703bce5
Fix post deploy for stage 2.
Jan 17, 2024
023513e
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 17, 2024
5b41feb
Fix post deploy for stage 2.
Jan 17, 2024
35549e2
Resolve merge conflict.
Jan 17, 2024
019704f
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 17, 2024
4177986
Merge branch 'nebari-dev:develop' into 2154-aws-set-minimum-notes-to-0
pt247 Jan 17, 2024
32465f0
Remove changes that are no longer needed.
Jan 17, 2024
240f36f
Add back original tags.
Jan 17, 2024
7037637
Remove unused tagging from stage 3.
Jan 17, 2024
fa31975
Remove extra config.
Jan 17, 2024
46bb93c
Fix dask config.
Jan 17, 2024
94b2ba2
Revert changes for autoscaling.
Jan 17, 2024
11da557
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
pt247 Jan 17, 2024
7e8b003
Revert changes to jupyterhub and dask servers, temporarily placing sched…
Jan 18, 2024
2bd5686
Fix revert.
Jan 18, 2024
e395098
Conditionally add labels for AWS.
Jan 18, 2024
4235594
[pre-commit.ci] Apply automatic pre-commit fixes
pre-commit-ci[bot] Jan 18, 2024
e882fa6
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
Jan 18, 2024
fd14645
Fix dask config.
Jan 18, 2024
6d28bb2
Merge develop.
Jan 18, 2024
5c45b2d
Fix jupyterhub nodeselectors.
Jan 18, 2024
ab7f8f6
Merge branch 'nebari-dev:develop' into 2154-aws-set-minimum-notes-to-0
pt247 Feb 8, 2024
f1f9534
Merge branch 'develop' into 2154-aws-set-minimum-notes-to-0
dcmcand Feb 8, 2024
40 changes: 40 additions & 0 deletions src/_nebari/provider/cloud/amazon_web_services.py
@@ -143,6 +143,46 @@ def aws_get_vpc_id(name: str, namespace: str, region: str) -> Optional[str]:
     return None


+def set_asg_tags(asg_node_group_map: Dict[str, str], region: str) -> None:
+    """Set tags for AWS node scaling from zero to work."""
+    session = aws_session(region=region)
+    autoscaling_client = session.client("autoscaling")
+    tags = []
+    for asg_name, node_group in asg_node_group_map.items():
+        tags.append(
+            {
+                "Key": "k8s.io/cluster-autoscaler/node-template/label/dedicated",
+                "Value": node_group,
+                "ResourceId": asg_name,
+                "ResourceType": "auto-scaling-group",
+                "PropagateAtLaunch": True,
+            }
+        )
+    autoscaling_client.create_or_update_tags(Tags=tags)
+
+
+def aws_get_asg_node_group_mapping(
+    name: str, namespace: str, region: str
+) -> Dict[str, str]:
+    """Return a dictionary of autoscaling groups and their associated node groups."""
+    asg_node_group_mapping = {}
+    session = aws_session(region=region)
+    eks = session.client("eks")
+    node_groups_response = eks.list_nodegroups(
+        clusterName=f"{name}-{namespace}",
+    )
+    node_groups = node_groups_response.get("nodegroups", [])
+    for nodegroup in node_groups:
+        response = eks.describe_nodegroup(
+            clusterName=f"{name}-{namespace}", nodegroupName=nodegroup
+        )
+        node_group_name = response["nodegroup"]["nodegroupName"]
+        auto_scaling_groups = response["nodegroup"]["resources"]["autoScalingGroups"]
+        for auto_scaling_group in auto_scaling_groups:
+            asg_node_group_mapping[auto_scaling_group["name"]] = node_group_name
+    return asg_node_group_mapping
+
+
 def aws_get_subnet_ids(name: str, namespace: str, region: str) -> List[str]:
     """Return list of subnet IDs for the EKS cluster named `{name}-{namespace}`."""
     session = aws_session(region=region)
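As a rough illustration of the two helpers added in this file, the sketch below separates the pure data handling from the AWS calls. The helper names `map_asgs_to_node_group` and `build_asg_tags` are hypothetical and do not appear in the PR, and the sample response is hand-written to match only the fields the diff reads (`nodegroupName` and `resources.autoScalingGroups[].name`).

```python
from typing import Dict, List

# Tag key copied from the diff; it carries the "dedicated" node label on the ASG.
ASG_LABEL_TAG_KEY = "k8s.io/cluster-autoscaler/node-template/label/dedicated"


def map_asgs_to_node_group(describe_response: dict) -> Dict[str, str]:
    """Hypothetical helper: {asg_name: node_group_name} for one describe_nodegroup-style response."""
    nodegroup = describe_response["nodegroup"]
    return {
        asg["name"]: nodegroup["nodegroupName"]
        for asg in nodegroup["resources"]["autoScalingGroups"]
    }


def build_asg_tags(asg_node_group_map: Dict[str, str]) -> List[dict]:
    """Hypothetical helper: the Tags payload that set_asg_tags assembles in the diff."""
    return [
        {
            "Key": ASG_LABEL_TAG_KEY,
            "Value": node_group,
            "ResourceId": asg_name,
            "ResourceType": "auto-scaling-group",
            "PropagateAtLaunch": True,
        }
        for asg_name, node_group in asg_node_group_map.items()
    ]


# Hand-written sample; the ASG name is made up for illustration.
sample_response = {
    "nodegroup": {
        "nodegroupName": "worker",
        "resources": {"autoScalingGroups": [{"name": "eks-worker-asg-example"}]},
    }
}
mapping = map_asgs_to_node_group(sample_response)
tags = build_asg_tags(mapping)
print(tags)
```

An empty mapping produces an empty tag list. In this PR, `post_deploy` only calls `set_asg_tags` when the mapping is non-empty.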
25 changes: 23 additions & 2 deletions src/_nebari/stages/infrastructure/__init__.py
@@ -145,6 +145,17 @@ class AWSInputVars(schema.Base):
     tags: Dict[str, str] = {}


+def _calculate_asg_node_group_map(config: schema.Main):
+    if config.provider == schema.ProviderEnum.aws:
+        return amazon_web_services.aws_get_asg_node_group_mapping(
+            config.project_name,
+            config.namespace,
+            config.amazon_web_services.region,
+        )
+    else:
+        return {}
+
+
 def _calculate_node_groups(config: schema.Main):
     if config.provider == schema.ProviderEnum.aws:
         return {
@@ -438,10 +449,10 @@ class AmazonWebServicesProvider(schema.Base):
     node_groups: Dict[str, AWSNodeGroup] = {
         "general": AWSNodeGroup(instance="m5.2xlarge", min_nodes=1, max_nodes=1),
         "user": AWSNodeGroup(
-            instance="m5.xlarge", min_nodes=1, max_nodes=5, single_subnet=False
+            instance="m5.xlarge", min_nodes=0, max_nodes=5, single_subnet=False
         ),
         "worker": AWSNodeGroup(
-            instance="m5.xlarge", min_nodes=1, max_nodes=5, single_subnet=False
+            instance="m5.xlarge", min_nodes=0, max_nodes=5, single_subnet=False
         ),
     }
     existing_subnet_ids: List[str] = None
@@ -814,6 +825,16 @@ def set_outputs(
         outputs["node_selectors"] = _calculate_node_groups(self.config)
         super().set_outputs(stage_outputs, outputs)

+    @contextlib.contextmanager
+    def post_deploy(
+        self, stage_outputs: Dict[str, Dict[str, Any]], disable_prompt: bool = False
+    ):
+        asg_node_group_map = _calculate_asg_node_group_map(self.config)
+        if asg_node_group_map:
+            amazon_web_services.set_asg_tags(
+                asg_node_group_map, self.config.amazon_web_services.region
+            )
+
     @contextlib.contextmanager
     def deploy(
         self, stage_outputs: Dict[str, Dict[str, Any]], disable_prompt: bool = False
@@ -39,6 +39,10 @@ resource "aws_eks_node_group" "main" {
     max_size = var.node_groups[count.index].max_size
   }

+  labels = {
+    "dedicated" = var.node_groups[count.index].name
+  }
+
   lifecycle {
     ignore_changes = [
       scaling_config[0].desired_size,
@@ -53,7 +57,9 @@ resource "aws_eks_node_group" "main" {
   ]

   tags = merge({
-    "kubernetes.io/cluster/${var.name}" = "shared"
+    # "kubernetes.io/cluster/${var.name}" = "shared"
+    "k8s.io/cluster-autoscaler/node-template/label/dedicated" = var.node_groups[count.index].name
+    propagate_at_launch = true
   }, var.tags)
 }

3 changes: 3 additions & 0 deletions src/_nebari/stages/kubernetes_services/__init__.py
@@ -362,6 +362,7 @@ class JupyterhubInputVars(schema.Base):
     idle_culler_settings: Dict[str, Any] = Field(alias="idle-culler-settings")
     argo_workflows_enabled: bool = Field(alias="argo-workflows-enabled")
     jhub_apps_enabled: bool = Field(alias="jhub-apps-enabled")
+    cloud_provider: str = Field(alias="cloud-provider")


 class DaskGatewayInputVars(schema.Base):
@@ -411,6 +412,7 @@ def input_vars(self, stage_outputs: Dict[str, Dict[str, Any]]):
         realm_id = stage_outputs["stages/06-kubernetes-keycloak-configuration"][
             "realm_id"
         ]["value"]
+        cloud_provider = self.config.provider.value
         jupyterhub_shared_endpoint = (
             stage_outputs["stages/02-infrastructure"]
             .get("nfs_endpoint", {})
@@ -486,6 +488,7 @@ def input_vars(self, stage_outputs: Dict[str, Dict[str, Any]]):
             ),
             jupyterhub_stared_storage=self.config.storage.shared_filesystem,
             jupyterhub_shared_endpoint=jupyterhub_shared_endpoint,
+            cloud_provider=cloud_provider,
             jupyterhub_profiles=self.config.profiles.dict()["jupyterlab"],
             jupyterhub_image=_split_docker_image_name(
                 self.config.default_images.jupyterhub
6 changes: 6 additions & 0 deletions src/_nebari/stages/kubernetes_services/template/jupyterhub.tf
@@ -55,6 +55,10 @@ variable "idle-culler-settings" {
   type = any
 }

+variable "cloud-provider" {
+  description = "Name of cloud provider."
+  type = string
+}

 module "kubernetes-nfs-server" {
   count = var.jupyterhub-shared-endpoint == null ? 1 : 0
@@ -88,6 +92,8 @@ module "jupyterhub" {
   name = var.name
   namespace = var.environment

+  cloud-provider = var.cloud-provider
+
   external-url = var.endpoint
   realm_id = var.realm_id

@@ -114,8 +114,12 @@ def list_dask_environments():


 def base_node_group(options):
+    key = config["worker-node-group"]["key"]
+    if config.provider.value == "aws":
+        key = "dedicated"
     default_node_group = {
-        config["worker-node-group"]["key"]: config["worker-node-group"]["value"]
+        key: config["worker-node-group"]["value"],
+        # config["worker-node-group"]["key"]: config["worker-node-group"]["value"],
     }

     # check `worker_extra_pod_config` first
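The key override in `base_node_group` can be read as a small pure function. The sketch below is only an illustration under assumptions: `worker_node_selector` is a hypothetical name, the provider is passed in as a plain string rather than read from the gateway config, and the node group key and value are made-up sample data.

```python
from typing import Dict


def worker_node_selector(worker_node_group: Dict[str, str], provider: str) -> Dict[str, str]:
    """Hypothetical helper mirroring the key override shown in base_node_group."""
    key = worker_node_group["key"]
    if provider == "aws":
        # On AWS this PR labels every node group with "dedicated",
        # so the selector key is swapped to match that label.
        key = "dedicated"
    return {key: worker_node_group["value"]}


# Illustrative values only; real keys and values come from the deployment config.
group = {"key": "example.com/nodegroup", "value": "worker"}
aws_selector = worker_node_selector(group, "aws")
other_selector = worker_node_selector(group, "gcp")
print(aws_selector, other_selector)
```

Only the selector key changes on AWS; the value stays the configured worker node group value.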
@@ -16,8 +16,11 @@ resource "random_password" "jhub_apps_jwt_secret" {
 }

 locals {
-  jhub_apps_secrets_name = "jhub-apps-secrets"
-  jhub_apps_env_var_name = "JHUB_APP_JWT_SECRET_KEY"
+  jhub_apps_secrets_name           = "jhub-apps-secrets"
+  jhub_apps_env_var_name           = "JHUB_APP_JWT_SECRET_KEY"
+  singleuser_nodeselector_key      = var.cloud-provider == "aws" ? "dedicated" : var.user-node-group.key
+  userscheduler_nodeselector_key   = var.cloud-provider == "aws" ? "dedicated" : var.user-node-group.key
+  userscheduler_nodeselector_value = var.cloud-provider == "aws" ? var.general-node-group.value : var.user-node-group.value
 }

 resource "kubernetes_secret" "jhub_apps_secrets" {
@@ -174,14 +177,14 @@ resource "helm_release" "jupyterhub" {
       singleuser = {
         image = var.jupyterlab-image
         nodeSelector = {
-          "${var.user-node-group.key}" = var.user-node-group.value
+          "${local.singleuser_nodeselector_key}" = var.user-node-group.value
         }
       }

       scheduling = {
         userScheduler = {
           nodeSelector = {
-            "${var.user-node-group.key}" = var.user-node-group.value
+            "${local.userscheduler_nodeselector_key}" = local.userscheduler_nodeselector_value
           }
         }
       }
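To make the three new locals easier to follow, here is a minimal Python sketch of the same conditional logic. It is not part of the PR: `NodeGroup` and `jupyterhub_node_selectors` are hypothetical names, and the sample keys and values are placeholders rather than Nebari defaults.

```python
from typing import Dict, NamedTuple


class NodeGroup(NamedTuple):
    """Stand-in for a Terraform node group object with `key` and `value`."""

    key: str
    value: str


def jupyterhub_node_selectors(
    cloud_provider: str, user: NodeGroup, general: NodeGroup
) -> Dict[str, Dict[str, str]]:
    """Hypothetical mirror of the locals: pick selector keys and values per provider."""
    on_aws = cloud_provider == "aws"
    singleuser_key = "dedicated" if on_aws else user.key
    userscheduler_key = "dedicated" if on_aws else user.key
    # On AWS the user scheduler is pinned to the general node group,
    # so it does not depend on a user node existing.
    userscheduler_value = general.value if on_aws else user.value
    return {
        "singleuser": {singleuser_key: user.value},
        "userScheduler": {userscheduler_key: userscheduler_value},
    }


# Placeholder values for illustration.
user_group = NodeGroup(key="example.com/nodegroup", value="user")
general_group = NodeGroup(key="example.com/nodegroup", value="general")
aws_selectors = jupyterhub_node_selectors("aws", user_group, general_group)
gcp_selectors = jupyterhub_node_selectors("gcp", user_group, general_group)
print(aws_selectors)
print(gcp_selectors)
```

On providers other than AWS both selectors resolve to the previous behaviour, the user node group key and value.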
@@ -168,6 +168,11 @@ variable "jupyterlab-pioneer-log-format" {
   type = string
 }

+variable "cloud-provider" {
+  description = "Name of cloud provider."
+  type = string
+}
+
 variable "initial-repositories" {
   description = "Map of folder location and git repo url to clone"
   type = string