sync-loop context deadline exceeded while pushing into git a lot of changes #1857
Comments
So it's a |
@squaremo I tried setting it to 2-4 minutes with 1.10 and it didn't work (I mean I still got |
I think this may not work as to my knowledge all releases are made through jobs and the job provides its own context with 60 second timeout. |
It's a bit strange isn't it. I can speculate on some explanations:
Can we have the full log lines, please? (I mean with the files and line numbers -- you can still redact sensitive details :-) |
Yeah, sure. But i don't believe it is about disk. I mean it is running on NVMe at Hetzner.
|
@squaremo it is standard log output. If i can somehow increase verbosity let me know |
OK, I have a better guess: as Hidde suggests, it's running into the job timeout. The reason it looks like I suspect it's invoking |
@squaremo ok, I'll look into it in an hour. Prometheus is not running in our cluster yet. The initial setup of Prometheus and Grafana was quite expensive, so we decided to roll it back for the time being. |
|
Nice, thank you @dananichev.
This is what I was looking for. This metric measures how long it takes to write the changes back into the files. The histogram buckets (the For now it'll be tricky to do anything about the writes individually taking about half a second -- the problem, briefly, is that fluxd execs a Python program to do it, because Python has the only round-tripping (i.e., comment-preserving) YAML parser I could find. We could give you the opportunity to configure the job timeout (current hard-wired value: 60s). This would be a bit of a sticking plaster solution, and may have side-effects, like delaying syncs. |
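To make the arithmetic concrete, here is a rough sketch of why half-second writes run into a 60s job timeout. The 60s figure and the "about half a second" per write come from the comment above; the function names are made up for illustration, and the estimate assumes writes happen one after another and ignores the time spent on git clone, commit and push.

```python
import math

JOB_TIMEOUT_S = 60.0     # hard-wired job timeout mentioned in this thread
SECONDS_PER_WRITE = 0.5  # rough per-write cost from the comment above (an estimate)


def estimated_job_seconds(updates: int, per_write: float = SECONDS_PER_WRITE) -> float:
    """Estimate total write time if updates are applied strictly one after another."""
    return updates * per_write


def max_updates_within(timeout_s: float = JOB_TIMEOUT_S,
                       per_write: float = SECONDS_PER_WRITE) -> int:
    """Largest number of sequential updates whose writes alone fit in the timeout."""
    return math.floor(timeout_s / per_write)


if __name__ == "__main__":
    # A log later in this thread reports updates=136 for a single release job.
    print(estimated_job_seconds(136))  # writes alone, in seconds
    print(max_updates_within())        # updates that fit before the deadline
```

Under these assumptions a job with more than about 120 updates cannot finish in time, whatever the disk speed.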
Some other potential solutions:
|
I guess for now the only solution is to use the same workaround I described in the first post? |
Yes, in currently released Flux, I think that's the only workaround -- sorry for the lump of extra work :-( If you can find some way to ratchet forward the images in manifests when they are created (when "we are forced to recreate all manifests inside git repository" happens), so they are less likely to have an upgrade when deployed, that would help too of course. |
We don't always know the size of a job before it's started, since calculating what to do is usually part of the job. Automated updates are an exception -- it figures some things out before it queues the job; so it could give the job longer to run, if we wanted.
I guess we could parallelise the update of distinct files, by adding a bit of complexity. |
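A minimal sketch of what parallelising per-file updates could look like, assuming each update can be described as a (file path, container, new image) tuple. That tuple shape, `apply_to_file` and the worker count are all hypothetical; this is not how fluxd is implemented. Updates to the same file stay in one batch so that no two workers ever write to the same file.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical shape of one update: (file path, container name, new image).
Update = Tuple[str, str, str]


def group_by_file(updates: Iterable[Update]) -> Dict[str, List[Update]]:
    """Collect all updates that touch the same file into one batch."""
    grouped: Dict[str, List[Update]] = defaultdict(list)
    for update in updates:
        grouped[update[0]].append(update)
    return dict(grouped)


def apply_in_parallel(updates: Iterable[Update],
                      apply_to_file: Callable[[str, List[Update]], None],
                      workers: int = 4) -> List[str]:
    """Run apply_to_file once per distinct file, several files at a time."""
    grouped = group_by_file(updates)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {path: pool.submit(apply_to_file, path, batch)
                   for path, batch in grouped.items()}
        for future in futures.values():
            future.result()  # re-raise any error from a worker
    return sorted(grouped)
```

Threads are enough here if the per-file work is an external process, since the workers then spend their time waiting on it rather than holding the interpreter.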
@squaremo I guess the proper way would be to use some kind of YAML parser and change only those values we need, but we don't have enough resources for this right now. As a workaround I've half-manually set all images to their latest versions. But I hope there will be some kind of native solution (from Flux) in the future. For now I will also try to mitigate this on my side. |
@squaremo now i've got new error: ...
ts=2019-03-31T17:31:08.168210362Z caller=images.go:79 component=sync-loop service=pgs-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)"
ts=2019-03-31T17:31:08.168262474Z caller=images.go:79 component=sync-loop service=trzh-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)"
ts=2019-03-31T17:31:08.168312371Z caller=images.go:79 component=sync-loop service=yaprav1-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)"
ts=2019-03-31T17:31:08.168505733Z caller=images.go:79 component=sync-loop service=simp-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)"
ts=2019-03-31T17:31:08.168610363Z caller=images.go:79 component=sync-loop service=yamalec-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)"
ts=2019-03-31T17:31:08.170085374Z caller=loop.go:103 component=sync-loop event=refreshed url=git@gitlab.**.ru:*/*.git branch=master HEAD=bf833d219f0025ba21bc3efc977de99decb6aab8
ts=2019-03-31T17:31:08.170167927Z caller=loop.go:111 component=sync-loop jobID=ade62567-e227-47e9-2636-6b41c235775d state=in-progress
ts=2019-03-31T17:31:22.631248812Z caller=releaser.go:58 component=sync-loop jobID=ade62567-e227-47e9-2636-6b41c235775d type=release updates=136
ts=2019-03-31T17:32:03.452508149Z caller=loop.go:121 component=sync-loop jobID=ade62567-e227-47e9-2636-6b41c235775d state=done success=false err="fork/exec /usr/bin/git: argument list too long"
And an error is I didn't reset git repo's state or anything like that. Just usual Flux workflow (build new image -> push into registry -> wait for an update). Metrics:
|
I have hit this issue too unfortunately when building out a new ~40 deployment environment. A My workaround was to release workloads individually until they one by one populated the patchfile. Note I'm using patchfile, which I initially thought was the problem, but in the target environment a build only takes a second or two. It's not clear how many components would need to be updated to make this a problem, but it at least seems to be tracking individual upstream images OK. |
Hi all, I am also hitting this problem
What is the general suggestion for a fix in the meantime? |
Have you tried increasing |
Sync timeout is at |
Maybe .... It would be good to know where and why it's getting stuck though. |
Every time I try to add the |
I am not sure what you mean |
|
@tomjohnburton the flag you are looking for is |
Okay, I've put a 10-minute timeout, |
AMAZING, that worked. Thanks all |
Hi all, We're seeing a similar issue now, but it seems to be specific only to automated releases. i.e. as a sync without any automated releases works fine with the default
I'm wondering whether #2805 might help, since we're using this. I'll try it anyway and report back. |
I was looking into this a bit more, and wanted to understand whether this is related to As far as I can tell, during a release I was wondering what was passed into In our case, that is basically a We have a monorepo, so there's a lot.
I'm not sure I've got the inputs to
In this example it's 9 seconds for a single deployment - it's possible this runs multiple times, right? When I was hitting the context deadline, it was updating 3 workloads. Also, this was on my laptop, but our flux instance probably has less CPU than my laptop to be honest, so it could be slower in-cluster :) |
Looked at our input to One issue I'm seeing is that we're including all our CustomResourceDefinitions (as part of the These files are large - especially the upstream prometheus ones. Here's one such example: /~https://github.com/coreos/prometheus-operator/blob/master/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml With CRD's
Without it's much faster :)
In our case, CRDs make up only 16 out of ~283 manifests to parse. Should be easy enough for us to ignore them, but not sure whether Noted that whilst |
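For anyone wanting to check how much of their own repo is CRDs, here is a small sketch that lists the CustomResourceDefinition manifests under a directory, largest first. It uses a plain regex on a top-level `kind:` line rather than a YAML parser, which is an assumption about how the files are laid out, and it only looks at `*.yaml` files. The function names are made up for illustration; this only finds candidates to exclude, it does not change what Flux parses.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Matches an unindented "kind: CustomResourceDefinition" line anywhere in the file.
CRD_KIND = re.compile(r"^kind:\s*CustomResourceDefinition\s*$", re.MULTILINE)


def is_crd(text: str) -> bool:
    """Return True if the manifest text declares a top-level CRD."""
    return bool(CRD_KIND.search(text))


def find_crds(root: str) -> List[Tuple[str, int]]:
    """Return (path, size in characters) for every CRD manifest, largest first."""
    hits = []
    for path in sorted(Path(root).rglob("*.yaml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if is_crd(text):
            hits.append((str(path), len(text)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    for manifest, size in find_crds("."):
        print(f"{size:>10}  {manifest}")
```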
Seems like we also now hit this issue,
Seems like setting policy via CLI still happens, issue is with auto-updates. |
Perhaps commit message gets too long 🤔 because it is generated |
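One way to test that hypothesis outside Flux: the `argument list too long` error reported earlier in this thread is what the kernel returns when the arguments passed to an exec are too large, and a commit message passed with `-m` is one such argument. The sketch below builds a message in the same shape as the one in the original report and hands it to `git commit` on standard input with `-F -`, so its length no longer counts against that limit. The function names are hypothetical, and this says nothing about how Flux itself builds or passes the message.

```python
import subprocess
from typing import List


def build_commit_message(images: List[str]) -> str:
    """Build a message shaped like the 'Auto-release multiple images' commits above."""
    lines = ["Auto-release multiple images", ""]
    lines += [f" - {image}" for image in images]
    return "\n".join(lines) + "\n"


def commit_with_message_on_stdin(repo_dir: str, message: str) -> None:
    """Commit all tracked changes, reading the message from stdin instead of argv."""
    subprocess.run(
        ["git", "commit", "--no-verify", "-a", "-F", "-"],
        cwd=repo_dir,
        input=message,
        text=True,
        check=True,  # raise CalledProcessError if git exits non-zero
    )
```

If a commit with the long message succeeds this way but fails with `-m`, the message length is the likely culprit for that particular error; it would not by itself explain the context deadline.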
It is definitely a performance issue somewhere. I added 9 HelmReleases of a chart with 8 sub-charts and it had to do 72 updates in total, 8 tag updates per HelmRelease. With no CPU limit and 6 minutes it could not finish whatever I did.
I'll try to investigate more and come back. For now disabling auto update on those 9 HelmReleases was the only thing that helped. |
Another workaround to this is to manually update the This is only helpful if you subsequently will be updating only a few images at a time. |
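The "few images at a time" idea can be sketched as plain batching: split the pending updates into small groups and release one group per job. The batch size and the `batches` helper are assumptions for illustration, not a Flux feature.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


if __name__ == "__main__":
    pending = [f"workload-{n}" for n in range(7)]  # placeholder names
    for group in batches(pending, 3):
        print(group)
```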
Possibly related to #3450 If we have an active report of this issue with someone who can reproduce it, we can follow it up, but according to the Migration Timetable Flux v1 is formally superseded since about this time last month. Bugs can still be fixed, but soon only CVE fixes will be accepted. Closing for now. Please open a new report if you wish to pursue this. Thanks for using Flux! |
Once in a while we are forced to recreate all manifests inside the git repository (limit tweaks, new pods and containers, and so on). This causes Flux to perform an automated release, based on the tag annotation, for every manifest in the repository (I believe we have ~1000 manifests, maybe fewer). That in turn triggers a git sync, which fails.
git commit: running git command: git [commit --no-verify -a -m Auto-release multiple images\n\n - gitlab.***.ru:4567/***:latest-4985eb31bb68b72c377f44dd3905c4213eba2f01\n - gitlab.***.ru:4567/***:latest-e92483f1319148a21624249f472d8b885fbedbe4\n - gitlab.***.ru:4567/***:latest-10a671033011403ce0ebad6701ccb96d9a5a0fb6\n - gitlab.***.ru:4567/***:latest-dee8e41f3f2a1413954050356cfc3c68112715aa\n - gitlab.***.ru:4567/***:latest-ed6b6773ae051e9808ddd837b80a3e4177973cec\n - gitlab.***.ru:4567/***:latest-eb816707b76f8589dc4b667331e1c655f4966696\n - gitlab.***.ru:4567/***:latest-34eb94841504daa85785910e60c1e2e7cdae439b\n]: context deadline exceeded
So far the only workaround I have found:
This workaround takes a lot of time to "fix" this issue.
Is there anything I can do to permanently fix it?
Thanks,
Dmitry