clusterctl upgrade tests are flaky #9688
Comments
@kubernetes-sigs/cluster-api-release-team These flakes are very disruptive to the test signal right now. It would be great if someone could prioritize investigating and fixing them ahead of the releases. /triage accepted |
/help |
@killianmuldoon: Guidelines: Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Note that each branch has a different number of variants, enumerated below, of this test which may be responsible for some unevenness in the signal:
|
I am looking into this one. |
I will be pairing up with @adilGhaffarDev on this one since it is happening more frequently. /assign @adilGhaffarDev |
Adding a bit more explanation regarding failures. We have three failures in clusterctl upgrade:
It might be something related to DockerMachinePool, we might need to backport the recent fixes related to DockerMachinePool. Another interesting thing is I don't see this failure on
|
This is not an error. These are just info messages that surface that we are calling |
Update on this issue.
|
@adilGhaffarDev So the clusterctl upgrade test is 100% stable apart from "failed to discovery ownerGraph types flake is still happening but only when upgrading from (v0.4=>current)"? Is not showing anything for me |
sorry for the bad link, here is a more persistent link: https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&test=clusterctl%20upgrades%20&xjob=.*-provider-.* Maybe not 100% stable; there are very minor flakes that happen sometimes. But |
@adilGhaffarDev cluster-api/test/framework/cluster_proxy.go Line 258 in adce020
(https://github.com/kubernetes-sigs/cluster-api/pull/9737/files) That doesn't mean the underlying errors are fixed unfortunately. |
|
Sounds good! Nope I didn't see any. Just wanted to clarify that the errors would look different now. But if the same step works now, it should be fine. Just not sure what changed as I don't remember fixing/changing anything there. |
This is the new error that was happening after your PR; it seems like it stopped happening after 07-12-2023. The only PR on 07-12-2023 that might have fixed this seems to be this one: #9819, but I am not sure. |
#9819 Should not be related. This func is called later in clusterctl_upgrade.go (l.516). While the issue happens in l.389. So this is the error we get there
This is the corresponding output (under "open stdout")
So looks like the mgmt cluster was not reachable. Thx for digging into this. I would say let's ignore this error for now as it's not occurring anymore. Good enough for me to know the issue stopped happening (I assumed it might be still there and just looks different). |
A little more explanation of the clusterctl upgrade failure. Now we are seeing only one flake when upgrading from
This failure happens in the post-upgrade step, where we are calling |
🤔 : may be helpful to collect cert-manager resources + logs to analyse this. Or is this locally reproducible? |
I haven't been able to reproduce it locally. I have run it multiple times. |
@chrischdi thank you for working on it, now we are not seeing this flake too much, nice work. On k8s triage I can see that now ownergraph flake is only happening in |
Note: this is a different flake, not directly ownergraph but similar. It happens at a different place though. cluster-api/test/e2e/clusterctl_upgrade.go Lines 553 to 568 in 487ed95
We could probably also ignore the x509 errors here and ensure that the last try in |
We could also add an Eventually before to wait until the List call works and then keep the Consistently the same. Btw, thx folks, really nice work on this issue! |
I will open a PR with your suggestion |
#10301 did not fix the issue, failure is still there for |
/priority important-soon |
I implemented a fix at #10469 which should fix the situation. |
This issue is labeled with You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/ /remove-triage accepted |
/triage accepted |
agreed, i think a new issue would be helpful. the incoming release CI team can prioritize this. @chandankumar4 @adilGhaffarDev @Sunnatillo is there a summary of where we stand? if not, i'll take a shot at refreshing the investigation and can open the new issue. seems like we do have flakes on |
From my observation I would say there are two main flakes that are occurring in clusterctl upgrade tests:
The first flake happens more often and when upgrading from the latest versions; the second flake happens mostly when uplifting from older releases. I agree that we should close this issue and open a new one separately for each flake. |
@chrischdi Was looking into some of these issues and is about to write an update here. Let's wait for that before closing this issue |
Sorry folks, took longer than expected. According to aggregated failures of the last two weeks, we still have some flakiness on our clusterctl upgrade tests. But it looks like none of them are the ones in the initial post:
Link to check if messages changed or we have new flakes on clusterctl upgrade tests: here |
thank you for putting this together @chrischdi -- you mind if i copy paste this refreshed summary into a new issue and close the current one? |
Feel free to go ahead with that |
Doesn't hurt to start with a clean slate to reduce confusion :) |
/close in favor of #11133 |
@cahillsf: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
The clusterctl upgrade tests have been significantly flaky in the last couple of weeks, with flakes occurring on main, release-1.4 and release-1.5.
The flakes are occurring across many forms of the clusterctl upgrade tests including v0.4=>current, v1.3=>current and v1.0=>current.
The failures take a number of forms, including but not limited to:
exec.ExitError: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&xjob=.*-provider-.*#f5ccd02ae151196a4bf1
failed to find releases: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&test=.*clusterctl%20upgrades.*&xjob=.*-provider-.*#983e849a73bad197d73b
failed to discovery ownerGraph types: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&test=.*clusterctl%20upgrades.*&xjob=.*-provider-.*#176363ebfcd19172c1ac
There's an overall triage for tests with clusterctl upgrades in the name here: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&test=.*clusterctl%20upgrades.*&xjob=.*-provider-.*
/kind flake