Load balancer initialisation fails #1642
-
Description

I had some issues with the load balancer after an upgrade, switched to the metal/klipper LB, and I'm now trying to switch back. I dug around a bit and found some things, but am stuck now. tofu creates the expected load balancer with the set name (k3s-nginx), but it stays unconfigured. There are some issues at the cloud controller manager describing similar problems, and I'm not sure who's to blame: hetznercloud/hcloud-cloud-controller-manager#811 & hetznercloud/hcloud-cloud-controller-manager#812. Parts of the output of kubectl logs -f -n kube-system deployments/hcloud-cloud-controller-manager:
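(A narrower way to pull only the load-balancer lines out of that log, as a rough sketch — the grep pattern and --tail value are arbitrary, and kubectl is assumed to already point at the affected cluster:)

  # Filter the cloud-controller-manager logs for load-balancer related lines
  kubectl -n kube-system logs deployments/hcloud-cloud-controller-manager --tail=500 \
    | grep -i 'load.balancer'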
Kube.tf file

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token
  source       = "kube-hetzner/kube-hetzner/hcloud"
  version      = "2.15.4"

  ssh_public_key       = data.hcloud_ssh_key.admin_key.public_key
  ssh_private_key      = null
  ssh_hcloud_key_label = "role=admin"
  ssh_max_auth_tries   = 10
  hcloud_ssh_key_id    = data.hcloud_ssh_key.admin_key.id

  control_plane_nodepools = [
    {
      name         = "control-plane-fsn1",
      server_type  = "cx22",
      location     = "fsn1",
      labels       = [],
      taints       = [],
      count        = 1
      zram_size    = "2G"
      kubelet_args = ["kube-reserved=cpu=250m,memory=1500Mi,ephemeral-storage=1Gi", "system-reserved=cpu=250m,memory=300Mi"]
    },
    {
      name         = "control-plane-nbg1",
      server_type  = "cx22",
      location     = "nbg1",
      labels       = [],
      taints       = [],
      count        = 1
      zram_size    = "2G"
      kubelet_args = ["kube-reserved=cpu=250m,memory=1500Mi,ephemeral-storage=1Gi", "system-reserved=cpu=250m,memory=300Mi"]
    },
    {
      name         = "control-plane-hel1",
      server_type  = "cx22",
      location     = "hel1",
      labels       = [],
      taints       = [],
      count        = 1
      zram_size    = "2G"
      kubelet_args = ["kube-reserved=cpu=250m,memory=1500Mi,ephemeral-storage=1Gi", "system-reserved=cpu=250m,memory=300Mi"]
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-small",
      server_type = "cx22",
      location    = "fsn1",
      labels = [
        "node.longhorn.io/create-default-disk=config",
      ],
      taints    = [],
      zram_size = "2G"
      nodes = {
        "0" : {
        },
        "1" : {
          server_type : "cx32",
          location = "nbg1",
        },
      }
    },
    {
      name        = "agent-arm-small",
      server_type = "cax21",
      location    = "fsn1",
      labels = [
        "node.longhorn.io/create-default-disk=config",
      ],
      zram_size = "2G"
      taints    = [],
      count     = 2,
    },
  ]

  enable_wireguard       = true
  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"
  base_domain            = "${var.subdomain}.${var.domain}"

  enable_csi_driver_smb  = true
  enable_longhorn        = true
  longhorn_namespace     = "longhorn-system"
  longhorn_fstype        = "ext4"
  longhorn_replica_count = 3

  ingress_controller       = "nginx"
  system_upgrade_use_drain = true
  initial_k3s_channel      = "stable"

  /* k3s_registries = <<-EOT
    mirrors:
      hub.my_registry.com:
        endpoint:
          - "hub.my_registry.com"
    configs:
      hub.my_registry.com:
        auth:
          username: username
          password: password
  EOT */

  additional_k3s_environment = {
    "CONTAINERD_HTTP_PROXY" : "http://localhost:1055",
    "CONTAINERD_HTTPS_PROXY" : "http://localhost:1055",
    "NO_PROXY" : "127.0.0.0/8,10.128.0.0/9,10.0.0.0/10,",
  }

  preinstall_exec = [
    "curl -vL https://registry.gitlab.com",
  ]

  k3s_exec_agent_args = "--kubelet-arg image-gc-high-threshold=50 --kubelet-arg=image-gc-low-threshold=45"

  extra_firewall_rules = []

  enable_cert_manager = true
  dns_servers         = []
  lb_hostname         = "${var.subdomain}.${var.domain}"

  extra_kustomize_parameters = {
    vpn_domain = var.vpn_domain,
  }

  create_kubeconfig    = false
  create_kustomization = false

  longhorn_values = <<EOT
defaultSettings:
  createDefaultDiskLabeledNodes: true
  defaultDataPath: /var/longhorn
  node-down-pod-deletion-policy: delete-both-statefulset-and-deployment
persistence:
  defaultFsType: ext4
  defaultClassReplicaCount: 3
  defaultClass: true
EOT
}

Screenshots

No response

Platform

Linux
Replies: 1 comment
-
This behavior stems from how the Hetzner Cloud Controller Manager (hcloud-ccm) handles load balancer names. The CCM typically attempts to create or rename a load-balancer resource to match the Kubernetes Service object. In your setup, there is already a load balancer named k3s-nginx (created by Terraform), and the CCM is also trying to manage (or rename) another load balancer to the same name, which leads to the uniqueness_error.

In effect, you ended up with two load balancers:

- k3s-nginx: unconfigured, owned by Terraform (or tofu), but recognized by the CCM as already existing.
- nginx-ingress-nginx-controller: the one the CCM created for the ingress controller's Service.

Because of this naming collision, the CCM's rename logic tries to rename the randomly created LB to k3s-nginx, which cannot succeed while that name is already taken. This scenario is also mentioned in hetznercloud/hcloud-cloud-controller-manager#811 and hetznercloud/hcloud-cloud-controller-manager#812, which discuss how the CCM tries to unify the load-balancer name with the name configured on the Service.

Outcome: No fix is needed in this module itself. It's a known quirk/bug in the Hetzner CCM name-handling logic when there is an existing LB resource with the same name that the CCM tries to manage. The recommended approach is to let either Terraform or the CCM own the LB fully; mixing both can cause these collisions.
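As a minimal sketch of what "letting one side own the LB fully" can look like in practice — assuming the hcloud CLI and kubectl are configured for this project and cluster; the namespace (nginx), the placeholder LB name, and the use of the load-balancer.hetzner.cloud/name annotation are assumptions, not something stated above, so verify the real names with the list command first:

  # Show the load balancers that currently exist in the Hetzner project
  hcloud load-balancer list

  # Option A: keep the Terraform-managed "k3s-nginx" LB. Delete the LB the CCM
  # created for the ingress Service (placeholder name below; use the name printed
  # by the list command), then point the CCM at the existing LB via its name
  # annotation so it stops trying to create or rename another one.
  hcloud load-balancer delete <ccm-created-lb-name>
  kubectl -n nginx annotate service nginx-ingress-nginx-controller \
    load-balancer.hetzner.cloud/name=k3s-nginx --overwrite

  # Option B: let the CCM own the LB instead. Remove the k3s-nginx LB from the
  # Terraform/tofu side so the CCM can claim that name without a uniqueness error.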