🐛 Make KCP pre-terminate hook more robust #11161

sbueringer
Member

Signed-off-by: Stefan Büringer <buringerst@vmware.com>

What this PR does / why we need it:
Follow-up to #11137
Additional context: https://kubernetes.slack.com/archives/C8TSNPY4T/p1725952675583209

This PR ensures that KCP always removes the pre-terminate hook from Machines if KCP is deleted.

Before this PR this was only done if a Machine didn't have a deletionTimestamp.

There are edge cases where a Machine could have a deletionTimestamp before (or at the same time as) KCP:

  • All objects of the cluster are deleted at the same time (this is not technically supported, but some folks are running into this case)
  • The Machine already had the deletionTimestamp (e.g. through a scale down or remediation) when the Cluster/KCP objects were deleted.
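
To make the behavior change concrete, here is a minimal, self-contained sketch of the two variants. It is not the code from this PR: the `Machine` struct is a simplified stand-in for the real Cluster API type, the function names are hypothetical, and the annotation key is assumed for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed annotation key, used here for illustration only.
const preTerminateHookAnnotation = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/kcp-cleanup"

// Machine is a simplified, hypothetical stand-in for the Cluster API Machine type.
type Machine struct {
	Name              string
	Annotations       map[string]string
	DeletionTimestamp *time.Time
}

// removeHooksBefore models the behavior described as "before this PR":
// Machines that already have a deletionTimestamp are skipped.
func removeHooksBefore(machines []*Machine) {
	for _, m := range machines {
		if m.DeletionTimestamp != nil {
			continue
		}
		delete(m.Annotations, preTerminateHookAnnotation)
	}
}

// removeHooksAfter models the behavior described for this PR:
// the hook is removed from every Machine once KCP is being deleted.
func removeHooksAfter(machines []*Machine) {
	for _, m := range machines {
		delete(m.Annotations, preTerminateHookAnnotation)
	}
}

// hasHook reports whether the Machine still carries the pre-terminate hook.
func hasHook(m *Machine) bool {
	_, ok := m.Annotations[preTerminateHookAnnotation]
	return ok
}

// newMachines returns one Machine that is already being deleted and one that is not.
func newMachines() []*Machine {
	now := time.Now()
	return []*Machine{
		{
			Name:              "cp-deleting",
			Annotations:       map[string]string{preTerminateHookAnnotation: ""},
			DeletionTimestamp: &now,
		},
		{
			Name:        "cp-running",
			Annotations: map[string]string{preTerminateHookAnnotation: ""},
		},
	}
}

func main() {
	before := newMachines()
	removeHooksBefore(before)

	after := newMachines()
	removeHooksAfter(after)

	for i := range before {
		fmt.Printf("%s: hook left with old logic=%t, with new logic=%t\n",
			before[i].Name, hasHook(before[i]), hasHook(after[i]))
	}
}
```

With the old logic, the Machine that already had a deletionTimestamp keeps its hook, so nothing is left to remove it once KCP itself is gone. That is the stuck-deletion edge case listed above.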

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 10, 2024
@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-area PR is missing an area label size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 10, 2024
@sbueringer sbueringer added the area/provider/control-plane-kubeadm Issues or PRs related to KCP label Sep 10, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-area PR is missing an area label label Sep 10, 2024
@sbueringer
Member Author

/cherry-pick release-1.8

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.8 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer
Member Author

/test ?

@k8s-ci-robot
Contributor

@sbueringer: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-latestk8s-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-31-1-32-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

In response to this:

/test ?

@sbueringer
Member Author

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

@sbueringer
Member Author

/assign @chrischdi @fabriziopandini @neolit123

(fyi, we have a patch release coming up this evening, we should get this fix in)

@@ -595,11 +600,19 @@ func (r *KubeadmControlPlaneReconciler) reconcileDelete(ctx context.Context, con
"Failed to delete control plane Machines for cluster %s control plane: %v", klog.KObj(controlPlane.Cluster), err)
return ctrl.Result{}, err
}

log.Info("Waiting for control plane Machines to go away")
Member

i'd replace 'to go away' with something more formal e.g. 'to be deleted'

Member Author

But what does "to be deleted" mean? Setting the deletionTimestamp? It's already on there :)

Member

@neolit123 neolit123 Sep 10, 2024

IMO, 'to be deleted' != 'to have deletionTimestamp applied'.
delete is the HTTP method and kubectl call, etc.

perhaps 'removed' will make it clearer.

Member Author

@sbueringer sbueringer Sep 10, 2024

I'm not sure. The most precise way to say this is something like "Waiting for control plane Machines to not exist anymore" (+ "in the apiserver"? but maybe the apiserver part would sound a bit strange :))

Member Author

Done

Member

"Waiting for control plane Machines to not exist anymore"

+1, or anything formal.

@neolit123
Member

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 10, 2024
Signed-off-by: Stefan Büringer <buringerst@vmware.com>
@sbueringer sbueringer force-pushed the pr-make-kcp-pre-terminate-more-robust branch from db292e7 to 4c83455 Compare September 10, 2024 10:23
@sbueringer
Member Author

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

@fabriziopandini
Member

Thanks for the quick fix!
/lgtm
/approve

pending tests

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 8cfd0ad0f5bd121695da96d4069ce0e54c5d10f5

Member

@chrischdi chrischdi left a comment

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrischdi, fabriziopandini, neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [chrischdi,fabriziopandini,neolit123]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sbueringer
Member Author

/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main

(all 3 were about to fail)

@sbueringer
Member Author

Please don't retest. I'll assess flakes individually and consider overrides.

@sbueringer
Member Author

@fabriziopandini
Member

double checked

  • pull-cluster-api-e2e-latestk8s-main
    "msg":"Cluster still has descendants - need to requeue",, "descendants":"Machine pools: self-hosted-4z83gz-mp-0-h46pb" ...

(so it seems to be a well-known flake)

@sbueringer
Member Author

Unbelievable - all green

@k8s-ci-robot k8s-ci-robot merged commit 3d22daf into kubernetes-sigs:main Sep 10, 2024
25 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.9 milestone Sep 10, 2024
@sbueringer sbueringer deleted the pr-make-kcp-pre-terminate-more-robust branch September 10, 2024 13:42
@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #11165

In response to this:

/cherry-pick release-1.8

@Danil-Grigorev
Member

/cherry-pick release-1.7

For evaluation, seeing similar issue in 1.7.2 version

@k8s-infra-cherrypick-robot

@Danil-Grigorev: #11161 failed to apply on top of branch "release-1.7":

Applying: Make KCP pre-terminate hook more robust
Using index info to reconstruct a base tree...
M	controlplane/kubeadm/internal/controllers/controller.go
M	controlplane/kubeadm/internal/controllers/controller_test.go
M	internal/controllers/machine/machine_controller.go
Falling back to patching base and 3-way merge...
Auto-merging internal/controllers/machine/machine_controller.go
Auto-merging controlplane/kubeadm/internal/controllers/controller_test.go
Auto-merging controlplane/kubeadm/internal/controllers/controller.go
CONFLICT (content): Merge conflict in controlplane/kubeadm/internal/controllers/controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Make KCP pre-terminate hook more robust
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.7

For evaluation, seeing similar issue in 1.7.2 version

@sbueringer
Member Author

sbueringer commented Sep 10, 2024

@Danil-Grigorev If you are seeing a similar effect, it should not have the same root cause. The change that triggered this issue, and that we are fixing here, was never backported to release-1.7.

@sbueringer
Member Author

This is the chain of PRs and fixes that got us here. We needed it for 1.31 support, so it was never backported to 1.7:
