Force-pushed from f048ee2 to 5ad6c34
This PR is missing the DCO check. Please rebase and follow the details of the DCO check linked here to make progress on this.
@kingdonb done
Force-pushed from fc37093 to d82e9cc
Apologies for the length of time that has elapsed. I am planning out the 1.23.2 release at the moment, and I think we can include your change this time. I'm just reviewing the details to make sure I understand. Thank you for documenting the reasoning behind your change; it should be simple for me to verify.

Do you think there is any test that could be added to make it even clearer what has gone wrong and how this fixes it? I don't want to overcomplicate this, but we love tests, and if you are inclined to add one it will make this PR even better.

(Also, 🌮 yay, first-time contributor! Thanks very much for submitting this.)
I wanted to reiterate the issue described here for my own understanding: Flux keeps track of sync errors, and when it encounters one, it puts that resource aside for a minute, holding the error in an array. Flux keeps trying each resource that failed, even before it changes, in case other changes to neighboring parts of the cluster have created new conditions that now permit a successful sync. (One pretty likely example: we've installed a HelmRelease that installs a CRD, and we have some custom resources to apply based on that CRD. Once the HelmRelease is reconciled, our custom resources can be synced, since the CRD has been applied. The error is resolved now...)

If the error does not resolve itself from simple retries as described, the next potential resolution comes when another commit fixes the error so a sync can complete successfully. Maybe a syntax error has been corrected, or something else, and so the individual resource that failed can now also be synced successfully. In our example from before, the custom resource that failed to apply once is not actually removed from the `syncErrors` list on success, and so it keeps being re-applied individually, potentially forever (or at least until another error afflicts that resource).

My theory connecting #3450 to this issue: if you accumulate enough of these errors, or perhaps simply the wrong kind of errors, they could somehow cause the sync to fail outright, or spin the Flux sync loop until it finally times out. That sounds plausible, so I've pinged the subscribers to that issue to see whether they can report if this PR fixes any of their issues. It would be great to get confirmation that this solves someone's problem in the wild. Thanks again @osela for your PR.
Force-pushed from d82e9cc to 9bf9647
Force-pushed from 9bf9647 to f37f609
Force-pushed from f37f609 to ee43de0
Force-pushed from f1e15b0 to f6405e1
Signed-off-by: Or Sela <orsela1@gmail.com>
Signed-off-by: Kingdon Barrett <kingdon@weave.works>
Force-pushed from f6405e1 to f30fbd7
Not quite -- Flux always tries to apply everything. Usually it will try to apply everything all at once, but this can partially fail if there's a problem with any one item. In that case, each item is applied individually (#872). But applying things individually is slow. So, on the assumption that the things that were problematic last time will be problematic this time too, the list of resources that didn't apply is kept across syncs, and those items are not included in the "everything all at once" apply (#1410).
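For readers unfamiliar with that apply loop, here is a minimal Go sketch of the strategy described above, under the assumption of hypothetical `Resource`, `applyAll`, and `applyOne` stand-ins (the real fluxcd/flux code is more involved): bulk-apply everything that was not in last sync's error set, fall back to per-item applies when the bulk apply fails, and always apply last sync's failures individually.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource, applyAll and applyOne are hypothetical stand-ins; the real flux
// types and apply machinery are more involved.
type Resource struct{ ID string }

// sync bulk-applies everything that succeeded last time, falls back to
// per-resource applies when the bulk apply fails, and always applies last
// sync's failures individually. It returns the new error set.
func sync(
	resources []Resource,
	prevErrors map[string]error,
	applyAll func([]Resource) error,
	applyOne func(Resource) error,
) map[string]error {
	newErrors := map[string]error{}

	var bulk, retries []Resource
	for _, r := range resources {
		if _, failedLastTime := prevErrors[r.ID]; failedLastTime {
			retries = append(retries, r) // excluded from the bulk apply
		} else {
			bulk = append(bulk, r)
		}
	}

	// Fast path: everything at once.
	if err := applyAll(bulk); err != nil {
		// Slow path: apply each item to find the individual culprits.
		for _, r := range bulk {
			if err := applyOne(r); err != nil {
				newErrors[r.ID] = err
			}
		}
	}

	// Resources that failed last time are applied one by one.
	for _, r := range retries {
		if err := applyOne(r); err != nil {
			newErrors[r.ID] = err
		}
	}
	return newErrors
}

func main() {
	applyAll := func(rs []Resource) error { return errors.New("partial failure") }
	applyOne := func(r Resource) error {
		if r.ID == "custom-resource" {
			return errors.New("no matching CRD")
		}
		return nil
	}
	resources := []Resource{{"helm-release"}, {"custom-resource"}}

	errs := sync(resources, nil, applyAll, applyOne)
	fmt.Println(errs) // map[custom-resource:no matching CRD]
}
```

The detail that matters for this PR is what happens to that carried-over error set when a later sync succeeds, which the next comment and the sketch after it address.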
I think you and the fix are correct. If there were no errors, `Sync` exits before setting the errors, so the old ones are carried over to the next sync. The solution is to call `setSyncErrors` unconditionally.
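To illustrate the shape of the bug and the fix, here is a simplified sketch, assuming stand-in names `Cluster`, `setSyncErrors`, and `ResourceError` rather than the exact daemon code:

```go
package main

import "fmt"

// Hypothetical stand-ins for the real types; the actual flux code differs.
type ResourceError struct {
	ResourceID string
	Error      string
}

type Cluster struct {
	syncErrors map[string]ResourceError
}

// setSyncErrors replaces the remembered error set with the latest one.
func (c *Cluster) setSyncErrors(errs []ResourceError) {
	c.syncErrors = map[string]ResourceError{}
	for _, e := range errs {
		c.syncErrors[e.ResourceID] = e
	}
}

// Buggy shape: an early return on a fully successful sync means the
// previous sync's errors are never cleared from c.syncErrors.
func syncBuggy(c *Cluster, errs []ResourceError) {
	if len(errs) == 0 {
		return // stale entries survive
	}
	c.setSyncErrors(errs)
}

// Fixed shape: setSyncErrors runs unconditionally, so a successful sync
// resets the map and previously failed resources stop being retried.
func syncFixed(c *Cluster, errs []ResourceError) {
	c.setSyncErrors(errs)
}

func main() {
	c := &Cluster{}
	c.setSyncErrors([]ResourceError{{"crd/foo", "no matching CRD"}})

	syncBuggy(c, nil)
	fmt.Println("after buggy successful sync:", len(c.syncErrors)) // 1: stale

	syncFixed(c, nil)
	fmt.Println("after fixed successful sync:", len(c.syncErrors)) // 0: cleared
}
```

With the early return no longer in front of `setSyncErrors`, a clean sync resets the map to empty, so previously failed resources stop being applied individually on every subsequent sync.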
Looks correct to me ✅
This PR is in
1.24.0 is released 🎉
This PR resets the `syncErrors` field on successful syncs.

I started playing with flux and almost immediately noticed an unexpected behavior:

At this point, I would expect flux to stop applying the previously failed objects individually, but it doesn't, because the objects are still in `Cluster`'s `syncErrors` map. They would only be cleared once the sync fails again. Am I missing something basic here?