Force-pushed from f048ee2 to 5ad6c34
This PR is missing the DCO check. Please rebase and follow the details of the DCO check linked here to make progress on this.
@kingdonb done
Force-pushed from fc37093 to d82e9cc
Apologies for the length of time that has elapsed. I am planning out the 1.23.2 release at the moment, and I think we can include your change this time. I'm just reviewing the details to make sure I understand. Thank you for documenting the reasoning behind your change; it should be simple for me to verify.

Do you think there is any test that could be added to make it even clearer what has gone wrong and how this fixes it? I don't want to overcomplicate this, but we love tests, and if you are inclined to add one it will make this PR even better.

(Also, 🌮 yay, first-time contributor! Thanks very much for submitting this.)
I wanted to reiterate the issue described here for my own understanding: Flux keeps track of sync errors, and when it encounters one, it puts that resource aside for a minute, holding the error in an array. Flux keeps trying each resource that failed, even before it changes, in case other changes to neighboring parts of the cluster have created new conditions that now permit a successful sync. (One pretty likely example: we've installed a HelmRelease that installs a CRD, and we have some custom resources to apply based on that CRD. Once the HelmRelease is reconciled, our custom resources can be synced, since the CRD has been applied. The error is resolved now...)

If the error does not resolve itself from simple retries as described, the next potential resolution comes when another commit fixes the error so a sync can complete successfully. Maybe a syntax error has been corrected, or something else, and so the individual resource that failed can now also be synced successfully. In our example from before, the custom resource that failed to apply once is not actually removed from the `syncErrors` list on success, and so it keeps being re-applied individually, potentially forever (or at least until another error afflicts that resource).

My theory connecting #3450 to this issue: if you accumulate enough of these errors, or perhaps simply the wrong kind of errors, they could somehow cause the sync to fail outright, or spin the Flux sync loop until it finally times out. That sounds plausible, so I've pinged the subscribers to that issue to see whether they can report if this PR fixes any of their issues. It would be great to get confirmation that this solves someone's problem in the wild. Thanks again @osela for your PR.
Force-pushed from d82e9cc to 9bf9647
Force-pushed from 9bf9647 to f37f609
Force-pushed from f37f609 to ee43de0
Force-pushed from f1e15b0 to f6405e1
Signed-off-by: Or Sela <orsela1@gmail.com>
Signed-off-by: Kingdon Barrett <kingdon@weave.works>
Force-pushed from f6405e1 to f30fbd7
Not quite -- Flux always tries to apply everything. Usually it will try to apply everything all at once, but this can partially fail if there's a problem with any one item. In that case, each item is applied individually (#872). But applying things individually is slow. So, on the assumption that the things that were problematic last time will be problematic this time too, the list of resources that didn't apply is kept across syncs, and those items are not included in the "everything all at once" apply (#1410).
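For readers unfamiliar with that apply loop, here is a minimal Go sketch of the strategy described above, under the assumption of hypothetical `Resource`, `applyAll`, and `applyOne` stand-ins (the real fluxcd/flux code is more involved): bulk-apply everything that was not in last sync's error set, fall back to per-item applies when the bulk apply fails, and always apply last sync's failures individually.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource, applyAll and applyOne are hypothetical stand-ins; the real flux
// types and apply machinery are more involved.
type Resource struct{ ID string }

// sync bulk-applies everything that succeeded last time, falls back to
// per-resource applies when the bulk apply fails, and always applies last
// sync's failures individually. It returns the new error set.
func sync(
	resources []Resource,
	prevErrors map[string]error,
	applyAll func([]Resource) error,
	applyOne func(Resource) error,
) map[string]error {
	newErrors := map[string]error{}

	var bulk, retries []Resource
	for _, r := range resources {
		if _, failedLastTime := prevErrors[r.ID]; failedLastTime {
			retries = append(retries, r) // excluded from the bulk apply
		} else {
			bulk = append(bulk, r)
		}
	}

	// Fast path: everything at once.
	if err := applyAll(bulk); err != nil {
		// Slow path: apply each item to find the individual culprits.
		for _, r := range bulk {
			if err := applyOne(r); err != nil {
				newErrors[r.ID] = err
			}
		}
	}

	// Resources that failed last time are applied one by one.
	for _, r := range retries {
		if err := applyOne(r); err != nil {
			newErrors[r.ID] = err
		}
	}
	return newErrors
}

func main() {
	applyAll := func(rs []Resource) error { return errors.New("partial failure") }
	applyOne := func(r Resource) error {
		if r.ID == "custom-resource" {
			return errors.New("no matching CRD")
		}
		return nil
	}
	resources := []Resource{{"helm-release"}, {"custom-resource"}}

	errs := sync(resources, nil, applyAll, applyOne)
	fmt.Println(errs) // map[custom-resource:no matching CRD]
}
```

The detail that matters for this PR is what happens to that carried-over error set when a later sync succeeds, which the next comment and the sketch after it address.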
I think you and the fix are correct. If there were no errors, `Sync` exits before setting the errors, so the old ones are carried over to the next sync. The solution is to call `setSyncErrors` unconditionally.
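To illustrate the shape of the bug and the fix, here is a simplified sketch, assuming stand-in names `Cluster`, `setSyncErrors`, and `ResourceError` rather than the exact daemon code:

```go
package main

import "fmt"

// Hypothetical stand-ins for the real types; the actual flux code differs.
type ResourceError struct {
	ResourceID string
	Error      string
}

type Cluster struct {
	syncErrors map[string]ResourceError
}

// setSyncErrors replaces the remembered error set with the latest one.
func (c *Cluster) setSyncErrors(errs []ResourceError) {
	c.syncErrors = map[string]ResourceError{}
	for _, e := range errs {
		c.syncErrors[e.ResourceID] = e
	}
}

// Buggy shape: an early return on a fully successful sync means the
// previous sync's errors are never cleared from c.syncErrors.
func syncBuggy(c *Cluster, errs []ResourceError) {
	if len(errs) == 0 {
		return // stale entries survive
	}
	c.setSyncErrors(errs)
}

// Fixed shape: setSyncErrors runs unconditionally, so a successful sync
// resets the map and previously failed resources stop being retried.
func syncFixed(c *Cluster, errs []ResourceError) {
	c.setSyncErrors(errs)
}

func main() {
	c := &Cluster{}
	c.setSyncErrors([]ResourceError{{"crd/foo", "no matching CRD"}})

	syncBuggy(c, nil)
	fmt.Println("after buggy successful sync:", len(c.syncErrors)) // 1: stale

	syncFixed(c, nil)
	fmt.Println("after fixed successful sync:", len(c.syncErrors)) // 0: cleared
}
```

With the early return no longer in front of `setSyncErrors`, a clean sync resets the map to empty, so previously failed resources stop being applied individually on every subsequent sync.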
Looks correct to me ✅
This PR is in
1.24.0 is released 🎉
This PR resets the `syncErrors` field on successful syncs.

I started playing with flux and almost immediately noticed an unexpected behavior:

At this point, I would expect flux to stop applying the previously failed objects individually, but it doesn't, because the objects are still in `Cluster`'s `syncErrors` map. They would only be cleared once the sync fails again. Am I missing something basic here?