Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Increase staggered build timeout to 180 min #18568

Merged
merged 2 commits into from
Jun 16, 2020

Conversation

josephevans
Copy link
Contributor

Description

The sanity build already has a timeout of 180 minutes. The staggered full build pipeline currently has a timeout of 20 minutes, but we are seeing timeouts because sometimes the sanity build takes longer than this (because of having to rebuild docker caches for branches other than master.) This results in the rest of the builds failing to get triggered. Increasing this timeout to 180 minutes allows the sanity build to complete and won't prevent the rest of the builds from being triggered.

@mxnet-bot
Copy link

Hey @josephevans , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-cpu, unix-cpu, edge, centos-cpu, website, miscellaneous, centos-gpu, windows-gpu, unix-gpu, clang, sanity]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@marcoabreu
Copy link
Contributor

I think 180 min is too high for a simple docker rebuild from scratch. 60 min tops, everything else will just delay the display of a stuck job.

@josephevans
Copy link
Contributor Author

I think 180 min is too high for a simple docker rebuild from scratch. 60 min tops, everything else will just delay the display of a stuck job.

I agree 180min is a little excessive, but the sanity build already has this value as a timeout. It doesn't make sense to have a timeout less than the sanity build. If we want to revisit the timeouts in the individual builds, that should fall under a separate project/PR.

@ChaiBapchya
Copy link
Contributor

This was observed for this PR #18560
Specifically build number 3: http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/full-build/view/change-requests/job/PR-18560/3/console

Where sanity build exceeded the 20min timeout.

@marcoabreu
Copy link
Contributor

marcoabreu commented Jun 15, 2020

Well, I'm not going to give way to the broken window theory. I'm rather a fan of the boy scout rule.

Feel free to reduce/align the timeout of the underlying sanity job while you're at it.

@josephevans
Copy link
Contributor Author

Well, I'm not going to give way to the broken window theory. I'm rather a fan of the boy scout rule.

Feel free to reduce/align the timeout of the underlying sanity job while you're at it.

The sanity build timeout was increased from 120 to 180 minutes over a year ago. We clearly had some issues with it timing out in the past, so I'd like to avoid having this happen again in the future.

9c3253db1f

@marcoabreu
Copy link
Contributor

marcoabreu commented Jun 15, 2020

Exactly, it was done one and a half year ago and I personally merged it. But since then, the system has been reworked by @leezu .

There is a system in place in cloud watch which records all the metrics. Since I no longer work at Amazon, I lost access to them. If you can present me with a graph of the maximum durations, I'll be happy to reconsider. But until proven otherwise, I think it's safe to go back to a more reasonable level.

We don't want a PR to be hanging for three hours just because the sanity check got into a hiccup.

Before this timeout was totally fine because it didn't matter. But since the decision was made zo safe a few bucks and in turn sacrifice the development speed, I'm gonna be harsh on the critical path. And this pipeline was intentionally put on the critical path, so I'm treating it as a such a critical piece in my reviews.

@sandeep-krishnamurthy I understand the push towards frugality, but then I'd expect insisting on highest standards if a critical path is introduced. This is no means meant against Joseph, but if the newly introduced critical path is actually causing issues, I'd like to see it properly addressed and not just duct taped. This breaking point was known when the switch to a sequential system was made and now the impact is being noticed. So I'd appreciate if it could be addressed (or reverted) properly and not expect Joseph to just quickly fix it without allocating some time and priority for him.

@josephevans
Copy link
Contributor Author

Thanks @marcoabreu for your very thoughtful comments. I've decreased the timeouts for sanity build to 60 min as well as the staggered build trigger.

@marcoabreu marcoabreu merged commit f91b989 into apache:v1.x Jun 16, 2020
josephevans added a commit to josephevans/mxnet that referenced this pull request Jun 18, 2020
* Increase staggered build timeout to 180 min, since sanity build has 180 min timeout.

* Decrease timeout so everyone is happy.

Co-authored-by: Joe Evans <joeev@amazon.com>
josephevans added a commit to josephevans/mxnet that referenced this pull request Jun 18, 2020
* Increase staggered build timeout to 180 min, since sanity build has 180 min timeout.

* Decrease timeout so everyone is happy.

Co-authored-by: Joe Evans <joeev@amazon.com>
leezu pushed a commit that referenced this pull request Jun 26, 2020
* Increase staggered build timeout to 180 min, since sanity build has 180 min timeout.

* Decrease timeout so everyone is happy.

Co-authored-by: Joe Evans <joeev@amazon.com>

Co-authored-by: Joe Evans <joeev@amazon.com>
TaoLv added a commit that referenced this pull request Jul 2, 2020
* Increase staggered build timeout to 180 min, since sanity build has 180 min timeout.

* Decrease timeout so everyone is happy.

Co-authored-by: Joe Evans <joeev@amazon.com>

Co-authored-by: Joe Evans <joeev@amazon.com>
Co-authored-by: Tao Lv <tao.a.lv@intel.com>
ChaiBapchya pushed a commit to ChaiBapchya/mxnet that referenced this pull request Aug 15, 2020
* Increase staggered build timeout to 180 min, since sanity build has 180 min timeout.

* Decrease timeout so everyone is happy.

Co-authored-by: Joe Evans <joeev@amazon.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants