-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reconfigure CI pipelines for PR and merge queue #11526
Conversation
One or more of the the following people are requested to review this:
|
This commit refactors the pipeline organisation of our CI to reduce the total number of jobs run on any given PR merge, and to reduce the critical path length of merging. In particular: - the lint, docs and QPY tests are combined into one job for a single CI worker. These three jobs individually add up to less than a test run, and moving QPY from the first-stage PR test run to this rebalances the (now) two jobs in the stage to be more equal in runtime. - a new pipeline is added specifically for the merge queue. Previously this reused the same two-stage PR pipeline. The two-stage system unnecessarily lengthened the critical path, as when a PR enters the merge queue, it as already passed PR CI, and so is highly likely to pass all jobs. In addition to flattening to a single stage, one each of the macOS and Windows jobs are removed to lower the amount of VMs needed and reduce the chances of a timeout (these OSes are more likely to get a dodgy VM and time out than Linux). The only way a new test failure should appear (other than a flaky test) is by a logical merge conflict, which would be quite unlikely to only affect a particular Python version. To make it easier to run the lint and docs jobs together, and ensure that both run even if there's a failure in one, the full lint configuration is merged back to `tox.ini`, and that's used to do the linting. This makes it more consistent for developers, as well.
0ddb999
to
e849c51
Compare
Pull Request Test Coverage Report for Build 7548743872Warning: This coverage report may be inaccurate.We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
💛 - Coveralls |
The lint job will fail unless the Rust components are installed into the source directory, since that's where `pylint` will look for them to resolve the imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this LGTM, just one question inline
# Clean up Sphinx detritus. | ||
rm -rf docs/_build/html/{.doctrees,.buildinfo} | ||
displayName: 'Run Docs build' | ||
- bash: tox run -e docs,lint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good idea to leverage tox for lint too. Save a lot of config logic here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
laziness is often a good motivator for single-source-of-truth configuration lol
# `pylint` will examine the source code, not the version that would otherwise be | ||
# installed in `site-packages`, so we use an editable install to make sure the | ||
# compiled modules are built into a valid place for it to find them. | ||
package = editable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should do this globally in the tox config. I'm constantly hitting this locally when using tox which is why I pushed #11380
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge #11380 in that separate PR after this, maybe? To me, this change is required to fix the lint job (at the moment, tox -e lint
only works if you happen to already have built the rust extensions in place yourself, hence the CI failure earlier), but #11380 isn't strictly needed, it's a change to the config to deliberately override some of tox's isolation as a trade-off to get better re-test performance.
Ha, the templating of the merge queue CI is bust - it crashed when added to the queue. |
Dunno how I borked that - I'd assumed I'd copy-pasted the condition, but apparently not. Hopefully better now. |
* Reconfigure CI pipelines for PR and merge queue This commit refactors the pipeline organisation of our CI to reduce the total number of jobs run on any given PR merge, and to reduce the critical path length of merging. In particular: - the lint, docs and QPY tests are combined into one job for a single CI worker. These three jobs individually add up to less than a test run, and moving QPY from the first-stage PR test run to this rebalances the (now) two jobs in the stage to be more equal in runtime. - a new pipeline is added specifically for the merge queue. Previously this reused the same two-stage PR pipeline. The two-stage system unnecessarily lengthened the critical path, as when a PR enters the merge queue, it as already passed PR CI, and so is highly likely to pass all jobs. In addition to flattening to a single stage, one each of the macOS and Windows jobs are removed to lower the amount of VMs needed and reduce the chances of a timeout (these OSes are more likely to get a dodgy VM and time out than Linux). The only way a new test failure should appear (other than a flaky test) is by a logical merge conflict, which would be quite unlikely to only affect a particular Python version. To make it easier to run the lint and docs jobs together, and ensure that both run even if there's a failure in one, the full lint configuration is merged back to `tox.ini`, and that's used to do the linting. This makes it more consistent for developers, as well. * Allow cargo as an external in 'tox' * Add display name to lint job * Use editable install for tox lint job The lint job will fail unless the Rust components are installed into the source directory, since that's where `pylint` will look for them to resolve the imports. * Fix typo in merge-queue stage condition (cherry picked from commit ed79d42) # Conflicts: # .azure/lint-linux.yml # tox.ini
@Mergifyio backport stable/0.46 |
✅ Backports have been created
|
* Reconfigure CI pipelines for PR and merge queue This commit refactors the pipeline organisation of our CI to reduce the total number of jobs run on any given PR merge, and to reduce the critical path length of merging. In particular: - the lint, docs and QPY tests are combined into one job for a single CI worker. These three jobs individually add up to less than a test run, and moving QPY from the first-stage PR test run to this rebalances the (now) two jobs in the stage to be more equal in runtime. - a new pipeline is added specifically for the merge queue. Previously this reused the same two-stage PR pipeline. The two-stage system unnecessarily lengthened the critical path, as when a PR enters the merge queue, it as already passed PR CI, and so is highly likely to pass all jobs. In addition to flattening to a single stage, one each of the macOS and Windows jobs are removed to lower the amount of VMs needed and reduce the chances of a timeout (these OSes are more likely to get a dodgy VM and time out than Linux). The only way a new test failure should appear (other than a flaky test) is by a logical merge conflict, which would be quite unlikely to only affect a particular Python version. To make it easier to run the lint and docs jobs together, and ensure that both run even if there's a failure in one, the full lint configuration is merged back to `tox.ini`, and that's used to do the linting. This makes it more consistent for developers, as well. * Allow cargo as an external in 'tox' * Add display name to lint job * Use editable install for tox lint job The lint job will fail unless the Rust components are installed into the source directory, since that's where `pylint` will look for them to resolve the imports. * Fix typo in merge-queue stage condition (cherry picked from commit ed79d42) # Conflicts: # .azure/lint-linux.yml # tox.ini
…1587) * Reconfigure CI pipelines for PR and merge queue (#11526) * Reconfigure CI pipelines for PR and merge queue This commit refactors the pipeline organisation of our CI to reduce the total number of jobs run on any given PR merge, and to reduce the critical path length of merging. In particular: - the lint, docs and QPY tests are combined into one job for a single CI worker. These three jobs individually add up to less than a test run, and moving QPY from the first-stage PR test run to this rebalances the (now) two jobs in the stage to be more equal in runtime. - a new pipeline is added specifically for the merge queue. Previously this reused the same two-stage PR pipeline. The two-stage system unnecessarily lengthened the critical path, as when a PR enters the merge queue, it as already passed PR CI, and so is highly likely to pass all jobs. In addition to flattening to a single stage, one each of the macOS and Windows jobs are removed to lower the amount of VMs needed and reduce the chances of a timeout (these OSes are more likely to get a dodgy VM and time out than Linux). The only way a new test failure should appear (other than a flaky test) is by a logical merge conflict, which would be quite unlikely to only affect a particular Python version. To make it easier to run the lint and docs jobs together, and ensure that both run even if there's a failure in one, the full lint configuration is merged back to `tox.ini`, and that's used to do the linting. This makes it more consistent for developers, as well. * Allow cargo as an external in 'tox' * Add display name to lint job * Use editable install for tox lint job The lint job will fail unless the Rust components are installed into the source directory, since that's where `pylint` will look for them to resolve the imports. * Fix typo in merge-queue stage condition (cherry picked from commit ed79d42) # Conflicts: # .azure/lint-linux.yml # tox.ini * Fix conflict --------- Co-authored-by: Jake Lishman <jake.lishman@ibm.com>
…1586) * Reconfigure CI pipelines for PR and merge queue (#11526) * Reconfigure CI pipelines for PR and merge queue This commit refactors the pipeline organisation of our CI to reduce the total number of jobs run on any given PR merge, and to reduce the critical path length of merging. In particular: - the lint, docs and QPY tests are combined into one job for a single CI worker. These three jobs individually add up to less than a test run, and moving QPY from the first-stage PR test run to this rebalances the (now) two jobs in the stage to be more equal in runtime. - a new pipeline is added specifically for the merge queue. Previously this reused the same two-stage PR pipeline. The two-stage system unnecessarily lengthened the critical path, as when a PR enters the merge queue, it as already passed PR CI, and so is highly likely to pass all jobs. In addition to flattening to a single stage, one each of the macOS and Windows jobs are removed to lower the amount of VMs needed and reduce the chances of a timeout (these OSes are more likely to get a dodgy VM and time out than Linux). The only way a new test failure should appear (other than a flaky test) is by a logical merge conflict, which would be quite unlikely to only affect a particular Python version. To make it easier to run the lint and docs jobs together, and ensure that both run even if there's a failure in one, the full lint configuration is merged back to `tox.ini`, and that's used to do the linting. This makes it more consistent for developers, as well. * Allow cargo as an external in 'tox' * Add display name to lint job * Use editable install for tox lint job The lint job will fail unless the Rust components are installed into the source directory, since that's where `pylint` will look for them to resolve the imports. * Fix typo in merge-queue stage condition (cherry picked from commit ed79d42) # Conflicts: # .azure/lint-linux.yml # tox.ini * Fix conflict --------- Co-authored-by: Jake Lishman <jake.lishman@ibm.com>
This commit has two major goals: - fix the caching of the QPY files for both the `main` and `stable/*` branches - increase the number of compatibility tests between the different symengine versions that might be involved in the generation and loading of the QPY files. Achieving both of these goals also means that it is sensible to move the job to GitHub Actions at the same time, since it will put more pressure on the Azure machine concurrency we use. Caching ------- The previous QPY tests attempted to cache the generated files for each historical version of Qiskit, but this was unreliable. The cache never seemed to hit on backport branches, which was a huge slowdown in the critical path to getting releases out. The cache restore keys were also a bit lax, meaning that we might accidentally have invalidated files in the cache by changing what we wanted to test, but the restore keys wouldn't have changed. The cache files would fail to restore as a side-effect of ed79d42 (Qiskitgh-11526); QPY was moved to be on the tail end of the lint run, rather than in a test run. This meant that it was no longer run as part of the push event when updating `main` or one of the `stable/*` branches. In Azure (and GitHub Actions), the "cache" action accesses a _scoped_ cache, not a universal one for the repository [^1][^2]. Approximately, base branches each have their own scope, and PR events open a new scope that is a child of the target branch, the default branch, and the source branch, if appropriate. A cache task can read from any of its parent scopes, but write events go to the most local scope. This means that we haven't been writing to long-standing caches for some time now. PRs would typically miss the cache on the first attempt, hit their cache for updates, then miss again once entering the merge queue. The fix for this is to run the QPY job on branch-update events as well. The post-job cache action will then write out to a reachable cache for all following events. Cross-symengine tests --------------------- We previously were just running a single test with differing versions of symengine between the loading and generation of the QPY files. This refactors the QPY `run_tests.sh` script to run a full pairwise matrix of compatibility tests, to increase the coverage. [^1]: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache [^2]: https://learn.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#cache-isolation-and-security
This commit has two major goals: - fix the caching of the QPY files for both the `main` and `stable/*` branches - increase the number of compatibility tests between the different symengine versions that might be involved in the generation and loading of the QPY files. Achieving both of these goals also means that it is sensible to move the job to GitHub Actions at the same time, since it will put more pressure on the Azure machine concurrency we use. Caching ------- The previous QPY tests attempted to cache the generated files for each historical version of Qiskit, but this was unreliable. The cache never seemed to hit on backport branches, which was a huge slowdown in the critical path to getting releases out. The cache restore keys were also a bit lax, meaning that we might accidentally have invalidated files in the cache by changing what we wanted to test, but the restore keys wouldn't have changed. The cache files would fail to restore as a side-effect of ed79d42 (Qiskitgh-11526); QPY was moved to be on the tail end of the lint run, rather than in a test run. This meant that it was no longer run as part of the push event when updating `main` or one of the `stable/*` branches. In Azure (and GitHub Actions), the "cache" action accesses a _scoped_ cache, not a universal one for the repository [^1][^2]. Approximately, base branches each have their own scope, and PR events open a new scope that is a child of the target branch, the default branch, and the source branch, if appropriate. A cache task can read from any of its parent scopes, but write events go to the most local scope. This means that we haven't been writing to long-standing caches for some time now. PRs would typically miss the cache on the first attempt, hit their cache for updates, then miss again once entering the merge queue. The fix for this is to run the QPY job on branch-update events as well. The post-job cache action will then write out to a reachable cache for all following events. Cross-symengine tests --------------------- We previously were just running a single test with differing versions of symengine between the loading and generation of the QPY files. This refactors the QPY `run_tests.sh` script to run a full pairwise matrix of compatibility tests, to increase the coverage. [^1]: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache [^2]: https://learn.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#cache-isolation-and-security
…13273) This commit has two major goals: - fix the caching of the QPY files for both the `main` and `stable/*` branches - increase the number of compatibility tests between the different symengine versions that might be involved in the generation and loading of the QPY files. Achieving both of these goals also means that it is sensible to move the job to GitHub Actions at the same time, since it will put more pressure on the Azure machine concurrency we use. Caching ------- The previous QPY tests attempted to cache the generated files for each historical version of Qiskit, but this was unreliable. The cache never seemed to hit on backport branches, which was a huge slowdown in the critical path to getting releases out. The cache restore keys were also a bit lax, meaning that we might accidentally have invalidated files in the cache by changing what we wanted to test, but the restore keys wouldn't have changed. The cache files would fail to restore as a side-effect of ed79d42 (gh-11526); QPY was moved to be on the tail end of the lint run, rather than in a test run. This meant that it was no longer run as part of the push event when updating `main` or one of the `stable/*` branches. In Azure (and GitHub Actions), the "cache" action accesses a _scoped_ cache, not a universal one for the repository [^1][^2]. Approximately, base branches each have their own scope, and PR events open a new scope that is a child of the target branch, the default branch, and the source branch, if appropriate. A cache task can read from any of its parent scopes, but write events go to the most local scope. This means that we haven't been writing to long-standing caches for some time now. PRs would typically miss the cache on the first attempt, hit their cache for updates, then miss again once entering the merge queue. The fix for this is to run the QPY job on branch-update events as well. The post-job cache action will then write out to a reachable cache for all following events. Cross-symengine tests --------------------- We previously were just running a single test with differing versions of symengine between the loading and generation of the QPY files. This refactors the QPY `run_tests.sh` script to run a full pairwise matrix of compatibility tests, to increase the coverage. [^1]: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache [^2]: https://learn.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#cache-isolation-and-security
…13273) This commit has two major goals: - fix the caching of the QPY files for both the `main` and `stable/*` branches - increase the number of compatibility tests between the different symengine versions that might be involved in the generation and loading of the QPY files. Achieving both of these goals also means that it is sensible to move the job to GitHub Actions at the same time, since it will put more pressure on the Azure machine concurrency we use. Caching ------- The previous QPY tests attempted to cache the generated files for each historical version of Qiskit, but this was unreliable. The cache never seemed to hit on backport branches, which was a huge slowdown in the critical path to getting releases out. The cache restore keys were also a bit lax, meaning that we might accidentally have invalidated files in the cache by changing what we wanted to test, but the restore keys wouldn't have changed. The cache files would fail to restore as a side-effect of ed79d42 (gh-11526); QPY was moved to be on the tail end of the lint run, rather than in a test run. This meant that it was no longer run as part of the push event when updating `main` or one of the `stable/*` branches. In Azure (and GitHub Actions), the "cache" action accesses a _scoped_ cache, not a universal one for the repository [^1][^2]. Approximately, base branches each have their own scope, and PR events open a new scope that is a child of the target branch, the default branch, and the source branch, if appropriate. A cache task can read from any of its parent scopes, but write events go to the most local scope. This means that we haven't been writing to long-standing caches for some time now. PRs would typically miss the cache on the first attempt, hit their cache for updates, then miss again once entering the merge queue. The fix for this is to run the QPY job on branch-update events as well. The post-job cache action will then write out to a reachable cache for all following events. Cross-symengine tests --------------------- We previously were just running a single test with differing versions of symengine between the loading and generation of the QPY files. This refactors the QPY `run_tests.sh` script to run a full pairwise matrix of compatibility tests, to increase the coverage. [^1]: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache [^2]: https://learn.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#cache-isolation-and-security (cherry picked from commit af8be25)
…13273) (#13380) This commit has two major goals: - fix the caching of the QPY files for both the `main` and `stable/*` branches - increase the number of compatibility tests between the different symengine versions that might be involved in the generation and loading of the QPY files. Achieving both of these goals also means that it is sensible to move the job to GitHub Actions at the same time, since it will put more pressure on the Azure machine concurrency we use. Caching ------- The previous QPY tests attempted to cache the generated files for each historical version of Qiskit, but this was unreliable. The cache never seemed to hit on backport branches, which was a huge slowdown in the critical path to getting releases out. The cache restore keys were also a bit lax, meaning that we might accidentally have invalidated files in the cache by changing what we wanted to test, but the restore keys wouldn't have changed. The cache files would fail to restore as a side-effect of ed79d42 (gh-11526); QPY was moved to be on the tail end of the lint run, rather than in a test run. This meant that it was no longer run as part of the push event when updating `main` or one of the `stable/*` branches. In Azure (and GitHub Actions), the "cache" action accesses a _scoped_ cache, not a universal one for the repository [^1][^2]. Approximately, base branches each have their own scope, and PR events open a new scope that is a child of the target branch, the default branch, and the source branch, if appropriate. A cache task can read from any of its parent scopes, but write events go to the most local scope. This means that we haven't been writing to long-standing caches for some time now. PRs would typically miss the cache on the first attempt, hit their cache for updates, then miss again once entering the merge queue. The fix for this is to run the QPY job on branch-update events as well. The post-job cache action will then write out to a reachable cache for all following events. Cross-symengine tests --------------------- We previously were just running a single test with differing versions of symengine between the loading and generation of the QPY files. This refactors the QPY `run_tests.sh` script to run a full pairwise matrix of compatibility tests, to increase the coverage. [^1]: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache [^2]: https://learn.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#cache-isolation-and-security (cherry picked from commit af8be25) Co-authored-by: Jake Lishman <jake.lishman@ibm.com>
Summary
This commit refactors the pipeline organisation of our CI to reduce the total number of jobs run on any given PR merge, and to reduce the critical path length of merging. In particular:
the lint, docs and QPY tests are combined into one job for a single CI worker. These three jobs individually add up to less than a test run, and moving QPY from the first-stage PR test run to this rebalances the (now) two jobs in the stage to be more equal in runtime.
a new pipeline is added specifically for the merge queue. Previously this reused the same two-stage PR pipeline. The two-stage system unnecessarily lengthened the critical path, as when a PR enters the merge queue, it as already passed PR CI, and so is highly likely to pass all jobs. In addition to flattening to a single stage, one each of the macOS and Windows jobs are removed to lower the amount of VMs needed and reduce the chances of a timeout (these OSes are more likely to get a dodgy VM and time out than Linux). The only way a new test failure should appear (other than a flaky test) is by a logical merge conflict, which would be quite unlikely to only affect a particular Python version.
To make it easier to run the lint and docs jobs together, and ensure that both run even if there's a failure in one, the full lint configuration is merged back to
tox.ini
, and that's used to do the linting. This makes it more consistent for developers, as well.Details and comments
This should automatically already work with the branch-protection rules, but there's like a 99% chance I've messed something up in the Azure configuration on the first attempt, so let's see if CI passes.