-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve machine/controller.Reconcile to intelligently block deleting … #350
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: spew The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/assign @roberthbailey @k4leung4 |
should /~https://github.com/kubernetes-sigs/cluster-api/blob/master/cloud/vsphere/config/configtemplate.go#L131 be updated with the env as well? |
pkg/controller/machine/controller.go
Outdated
"sigs.k8s.io/cluster-api/pkg/controller/sharedinformers" | ||
"sigs.k8s.io/cluster-api/util" | ||
) | ||
|
||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a single const you should just put it on one line instead of in a block:
const NodeNameEnvVar = "NODE_NAME"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
pkg/controller/machine/controller.go
Outdated
if c.nodeName == "" { | ||
c.nodeName = os.Getenv(NodeNameEnvVar) | ||
if c.nodeName == "" { | ||
glog.Infof("environment variable %v is not set, this controller will not protect against deleting its own machine", NodeNameEnvVar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a warning instead of info (since we generally do expect the env var to be set)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good call.
pkg/controller/machine/controller.go
Outdated
if cfg.ControllerConfig.InCluster && util.IsMaster(machine) { | ||
glog.Infof("skipping reconciling master machine object %v", name) | ||
if !c.isDeleteAllowed(machine) { | ||
glog.Infof("skipping reconciling of machine object %v", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not your code but you should fix it while here: capitalize the first line of log messages (unlike error strings).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this instance, will follow up in another PR for the other instances.
node, err := c.kubernetesClientSet.CoreV1().Nodes().Get(c.nodeName, metav1.GetOptions{}) | ||
if err != nil { | ||
glog.Infof("unable to determine if controller's node is associated with machine '%v', error getting node named '%v': %v", machine.Name, c.nodeName, err) | ||
return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this return false? I'm imagining a case where temporarily lose connectivity (what if the apiserver pod is being restarted?) or a single http request failing causes the controller to delete itself. If we return false will the reconcile loop try again? Or do we need retry logic around this call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I debated this one back and forth and finally decided that we could err on the side of deleting in the "api server not available" scenarios. I biased for optimizing the "delete cluster" experience. For example, if for whatever reason you are trying to delete a cluster that has the API server that doesn't work or is crash looping or the pod is deleted then you would be unable to delete the machine controller node if this check returned false.
You are correct -- that the API server can be temporarily unavailable, timing out, etc, can & will happen. However, I chose to not deal with it because the correct answer is retries.
We aren't currently doing retries in this file and before introducing it I think I'd prefer a more holistic approach (i.e. the client should support a retry policy that wraps all calls and makes it so no one needs to write retry code explicitly). The reason why I think we should wait is that this is not yet production code and therefore it is probably better to optimize for code readability and ease of making future changes rather than stability of the system (the system which doesn't even exist yet).
Happy to hear your thoughts. And maybe kubernetes clients already support retries and we just need to turn some stuff on or import it into the cluster API clients.
pkg/controller/machine/controller.go
Outdated
glog.Infof("unable to determine if controller's node is associated with machine '%v', error getting node named '%v': %v", machine.Name, c.nodeName, err) | ||
return true | ||
} | ||
// don't allow this controller to delete its underlying node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Comments should be full sentences, starting with a capital letter and ending with a period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
nodeUID: "another-uid", | ||
}, | ||
{ | ||
name: "Delete machine, controller on the same node", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like there is at least one test missing. Maybe a specific set of tests for isDeleteAllowed would make sense as that function has a bunch of branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree, i'll add it.
Updated post review comments. |
pkg/controller/machine/controller.go
Outdated
@@ -25,14 +25,16 @@ import ( | |||
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | |||
"k8s.io/client-go/kubernetes" | |||
|
|||
"os" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
standard imports should always be in the first group
/~https://github.com/golang/go/wiki/CodeReviewComments#imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed -- i'll try to figure out how to automate this.
lgtm |
…a node associated with the controller machine. This will enable external machine controllers to delete any machine/node, including masters.
/lgtm |
…a node associated with the controller machine. This will enable external machine controllers to delete any machine/node, including masters. (kubernetes-sigs#350)
…x-yaml-out-dir Fix the -o OUT_DIR opt for generating yaml
…a node associated with the controller machine. This will enable external machine controllers to delete any machine/node, including masters.
What this PR does / why we need it:
This PR improves the machine/controller package's deletion protection to only apply if the node that the machine controller is located on is the node associated with the machine to be deleted. This makes it possible for master nodes to be deleted by machine controller. This also removes the need for the "InCluster" flag so it has been removed.
The way that the machine controller determines if it is on the same node is by using the downward API to inject the name of the actual node that the pod is running on into a NODE_NAME environment variable. You can see that this environment variable is loaded in the controller.Init function. An edge case is handled where the node's name matches but the UID does not match (imagine that our external cluster has a node name of "minikube" and for whatever reason our node in the cluster has a name of "minikube", we'll still be able to delete because the UIDs won't match.
Special notes for your reviewer:
Release note:
@kubernetes/kube-deploy-reviewers