Skip to content
This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

reset sync errors on successful sync #3156

Merged
merged 1 commit into from
Aug 19, 2021
Merged

Conversation

osela
Copy link
Contributor

@osela osela commented Jun 24, 2020

This PR resets the syncErrors field on successful syncs.

I started playing with flux and almost immediately noticed an unexpected behavior:

  • Flux syncs a few objects and some of them fail for expected reasons.
  • In every future sync, flux tries to apply the failed objects individually.
  • A commit is made which makes the failed objects sync successfully.

At this point, I would expect flux to stop applying the previously failed objects individually, but it doesn't because the objects are still in Cluster's syncErrors map. They would only be cleared once the sync fails again.

Am I missing something basic here?

@kingdonb kingdonb self-assigned this Apr 27, 2021
@kingdonb kingdonb force-pushed the reset-sync-errors branch from f048ee2 to 5ad6c34 Compare April 27, 2021 01:19
@kingdonb kingdonb added the blocked-dco Needs DCO to be fixed up label May 1, 2021
@kingdonb
Copy link
Member

kingdonb commented May 1, 2021

This PR is missing the DCO check. Please rebase and follow the details of the DCO check linked here to make progress on this.

@osela osela force-pushed the reset-sync-errors branch from 5ad6c34 to fc37093 Compare May 1, 2021 16:29
@osela
Copy link
Contributor Author

osela commented May 1, 2021

@kingdonb done

@kingdonb kingdonb added bug and removed blocked-dco Needs DCO to be fixed up labels May 23, 2021
@kingdonb kingdonb added this to the 1.22.3 milestone May 23, 2021
@kingdonb kingdonb force-pushed the reset-sync-errors branch from fc37093 to d82e9cc Compare May 23, 2021 14:55
@kingdonb kingdonb modified the milestones: 1.23.1, 1.23.2 Jul 22, 2021
@kingdonb
Copy link
Member

Apologies for the length of time that has elapsed, I am planning out the 1.23.2 release at the moment and I think we can include your change this time. I'm just reviewing the details to make sure I understand.

Thank you for documenting the reasoning behind your change, it should be simple for me to verify. Do you think there is any test which could be added that makes it even clearer what has gone wrong and how this fixes it?

I don't want to overcomplicate this, but we love tests and if you might be inclined to add one it will make this PR even better.

(Also, 🌮 yay first-time contributor! Thanks very much for submitting this.)

@kingdonb
Copy link
Member

I wanted to reiterate the issue described here for my own understanding: Flux keeps track of sync errors and when it encounters one, it puts that resource aside a minute, holding the error in an array syncErrors so it can be tried again later.

Flux keeps trying each resource that failed, even before it changes, in case any other changes to neighboring parts of the cluster have occurred that may have caused new conditions which can now permit a successful sync.

(One pretty likely example: we've installed a HelmRelease that installs a CRD, and we have some custom resources to apply based on that CRD. Once the HelmRelease is reconciled, our Custom Resources can be synced now, since the CRD was applied. The error is resolved now...)

If the error does not resolve itself from simple retries as described, the next potential resolution comes when another commit fixes the error so a sync can complete successfully. Maybe a syntax error has been corrected, or something else, and so the individual resource that failed, itself, can now also be synced successfully.

In our example from before, the CRD resource that failed to apply once, is not actually removed from the syncErrors list on success and thus keeps being re-applied, potentially forever (or at least until another error afflicts that resource.)

My theory connecting #3450 to this issue: if you accumulate enough of these errors, or perhaps simply the wrong kind of errors, they could somehow cause the sync to fail outright, or potentially spin the Flux sync loop until it finally times out.

I think that sounds plausible, so I've pinged the subscribers to that issue to see if they can report on whether this PR fixes any of their issues. It would be great to get confirmation if this solves someone's problem in the wild.

Thanks again @osela for your PR.

@kingdonb kingdonb requested a review from hiddeco July 27, 2021 22:14
@kingdonb kingdonb requested a review from stefanprodan July 30, 2021 13:39
@kingdonb kingdonb force-pushed the reset-sync-errors branch from d82e9cc to 9bf9647 Compare July 30, 2021 13:41
@kingdonb kingdonb mentioned this pull request Jul 30, 2021
@kingdonb kingdonb modified the milestones: 1.23.2, 1.23.3 Aug 4, 2021
@kingdonb kingdonb force-pushed the reset-sync-errors branch from 9bf9647 to f37f609 Compare August 4, 2021 18:42
@kingdonb kingdonb modified the milestones: 1.23.3, 1.24.0 Aug 17, 2021
@kingdonb kingdonb force-pushed the reset-sync-errors branch 2 times, most recently from f1e15b0 to f6405e1 Compare August 18, 2021 15:22
Signed-off-by: Or Sela <orsela1@gmail.com>
Signed-off-by: Kingdon Barrett <kingdon@weave.works>
@squaremo
Copy link
Member

Flux keeps track of sync errors and when it encounters one, it puts that resource aside a minute, holding the error in an array syncErrors so it can be tried again later.

Flux keeps trying each resource that failed, even before it changes,

Not quite -- Flux always tries to apply everything. Usually it will try to apply everything all at once, but this can partially fail if there's a problem with any one item. In that case, each item is applied individually (#872). But it is slow to apply things individually. So: on the assumption that the things that were problematic last time are the things that will be problematic this time, the list of resources that didn't apply is kept over syncs, and those items are not included in the "everything all at once" (#1410).

if you accumulate enough of these errors, or perhaps simply the wrong kind of errors, they could somehow cause the sync to fail outright

setSyncErrors recreates its set of errors from scratch when called, and Sync starts with an empty list each time; I don't think there's an opportunity for errors to accumulate.

At this point, I would expect flux to stop applying the previously failed objects individually, but it doesn't because the objects are still in Cluster's syncErrors map. They would only be cleared once the sync fails again.

Am I missing something basic here?

I think you and the fix are correct. If there were no errors, Sync will exit before setting the errors, so they will be carried over to the next sync. The solution is to call setSyncErrors unconditionally.

Copy link
Member

@squaremo squaremo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct to me ✅

@kingdonb kingdonb merged commit 371e94c into fluxcd:master Aug 19, 2021
@kingdonb
Copy link
Member

This PR is in fluxcd/flux-prerelease:master-371e94c4 and it will be included in the upcoming 1.24.0 release 🚀

@kingdonb
Copy link
Member

1.24.0 is released 🎉

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants