Add support for 'release-debug' build for ATDM Trilinos builds and use to avoid timeouts #3633

Closed
bartlettroscoe opened this issue Oct 16, 2018 · 6 comments
Labels
  • ATDM Config (Issues that are specific to the ATDM configuration settings)
  • ATDM DevOps (Issues that will be worked by the Coordinated ATDM DevOps teams)
  • client: ATDM (Any issue primarily impacting the ATDM project)
  • type: enhancement (Issue is an enhancement, not a bug)

Comments

@bartlettroscoe
Member

bartlettroscoe commented Oct 16, 2018

CC: @fryeguy52, @mhoemmen, @rppawlo, @bathmatt, @micahahoward, @trilinos/kokkos, @trilinos/kokkos-kernels

Next Action Status

ATDM Trilinos scripts now support a release-debug build type and this has been used in new release-debug builds on 'waterman'. Converting debug builds to release-debug builds on other platforms will be done in follow-on issues ...

Description

Currently, the ATDM Trilinos builds support a debug build and an opt build. The debug build uses CMAKE_BUILD_TYPE=DEBUG (with -O0) and enables runtime debug-mode checking, while the opt build uses CMAKE_BUILD_TYPE=RELEASE (with -O3) and disables runtime debug-mode checking. The problem with this approach is that some of the Trilinos tests (especially many of the Kokkos and KokkosKernels tests) run many times slower with -O0 than with -O3. This has caused many tests to time out at 10 minutes in debug builds even though they finish in well under 10 minutes in opt builds (e.g. #2964, #2921, #2461).

A solution that we discussed was to change most debug builds into release-debug builds that will set CMAKE_BUILD_TYPE=RELEASE (with -O3) but enable runtime debug-mode checking.
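
To make the three combinations concrete, here is a minimal sketch of how each build type could translate into CMake cache arguments. The function name is hypothetical, and Trilinos_ENABLE_DEBUG is assumed here to be the option that controls runtime debug-mode checking; the issue text itself does not name the option, so treat this as an illustration rather than what ATDMDevEnvSettings.cmake actually does.

```shell
# Hypothetical helper: print CMake cache arguments for a given build type.
# ASSUMPTION: Trilinos_ENABLE_DEBUG is the switch for runtime debug-mode checking.
cmake_args_for_build_type() (
  case "$1" in
    DEBUG)         echo "-DCMAKE_BUILD_TYPE=DEBUG -DTrilinos_ENABLE_DEBUG=ON" ;;
    RELEASE)       echo "-DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=OFF" ;;
    # release-debug: optimized compiler flags, but runtime checking stays on
    RELEASE_DEBUG) echo "-DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON" ;;
    *)             echo "unknown build type: $1" >&2; exit 1 ;;
  esac
)

cmake_args_for_build_type RELEASE_DEBUG
```

The only difference between RELEASE and RELEASE_DEBUG in this sketch is the debug-checking switch, which is the point of the new build type.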

Proposed solution

The idea would be to add a new release-debug keyword that matches before opt or debug, which will set ATDM_CONFIG_BUILD_TYPE=RELEASE_DEBUG, and then update the file ATDMDevEnvSettings.cmake accordingly. That will be easy. The harder part will be updating the tweaks *.cmake files and all of the Jenkins jobs to accommodate the name change. NOTE: Calling this release-debug as opposed to opt-debug should hopefully be clearer.
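
The reason the new keyword has to match first is that the string "release-debug" also contains "debug". A minimal sketch of that ordering, assuming the build type is derived from the build name with simple substring matching (the function name is hypothetical, and the RELEASE and DEBUG values for the existing keywords are assumptions; only RELEASE_DEBUG is stated above):

```shell
# Hypothetical helper: map an ATDM-style build name to a build type.
# Order matters: "release-debug" contains "debug", so it must be tested first.
build_type_from_name() (
  case "$1" in
    *release-debug*) echo "RELEASE_DEBUG" ;;
    *opt*)           echo "RELEASE" ;;   # assumed value for 'opt' builds
    *debug*)         echo "DEBUG" ;;     # assumed value for 'debug' builds
    *)               echo "unrecognized build name: $1" >&2; exit 1 ;;
  esac
)

build_type_from_name "cuda-9.2-release-debug-Power9-Volta70"  # prints RELEASE_DEBUG
build_type_from_name "gnu-debug-openmp-Power9-Volta70"        # prints DEBUG
build_type_from_name "cuda-9.2-opt-Power9-Volta70"            # prints RELEASE
```

If the `*debug*` pattern were listed before `*release-debug*`, every release-debug build name would be classified as a plain debug build.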

We can still leave some full debug builds to help support full GDB debugging by the ATDM APPs teams, but they should be used sparingly (because we are constantly dealing with timeouts with full debug builds).

Tasks:

  • ???
@bartlettroscoe bartlettroscoe added type: enhancement Issue is an enhancement, not a bug client: ATDM Any issue primarily impacting the ATDM project ATDM Config Issues that are specific to the ATDM configuration settings labels Oct 16, 2018
@bartlettroscoe bartlettroscoe changed the title Add support for 'opt-debug' build for ATDM Trilinos builds and use to avoid timeouts Add support for 'release-debug' build for ATDM Trilinos builds and use to avoid timeouts Oct 16, 2018
@bartlettroscoe
Member Author

FYI: I think we should call this new build type release-debug instead of opt-debug since that might be more clear.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Oct 17, 2018
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Oct 17, 2018
This will allow many of the ATDM Trilinos 'debug' builds to be switched to
'release-debug' builds and help to avoid a bunch of timeouts that we are
dealing with.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Oct 17, 2018
…rilinos#3633)

I renamed the 'cuda' builds to 'cuda-9.2' builds since that is what they are
and that matches the Jenkins drive names.

I kept the existing cuda-9.2-debug-Power9-Volta70 build since there are
currently no tests timing out in that build and I figured that the CUDA
build was most likely the one a developer would want to run with a debugger.
But I created a cuda-9.2-release-debug-Power9-Volta70 build so that we can
avoid having to disable slow Kokkos, KokkosKernels, and other tests that run
super slow with -O0.

I just changed the build gnu-debug-openmp-Power9-Volta70 to a
gnu-release-debug-openmp-Power9-Volta70 build since I don't think it is as
important to run this build with a debugger and the full 'debug' build
currently has some timing-out tests for Kokkos and KokkosKernels as described
in trilinos#3336.  If the APP teams tell us they want a full
gnu-debug-openmp-Power9-Volta70 build, we will add one back.

NOTE: By having both 'debug' and 'release-debug' builds, we can be free to
disable some slow tests in the 'debug' build and not lose any runtime debug
checking since these tests will be running in the 'release-debug' build.  So
going forward, if a test times out in the 'debug' build but not the
'release-debug' build, then we will just disable it in the 'debug' build and
move on.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Oct 17, 2018
…rilinos#3633)

I kept the existing cuda-9.2-debug-Power9-Volta70 build since there are
currently no tests timing out in that build and I figured that the CUDA
build was most likely the one a developer would want to run with a debugger.
But I created a new cuda-9.2-release-debug-Power9-Volta70 build so that we can
avoid having to disable slow Kokkos, KokkosKernels, and other tests that run
super slow with -O0.

I changed the build gnu-debug-openmp-Power9-Volta70 to a
gnu-release-debug-openmp-Power9-Volta70 build since I don't think it is as
important to run this build with a debugger and the full 'debug' build
currently has some timing-out tests for Kokkos and KokkosKernels as
described in trilinos#3336.  (The new gnu-release-debug-openmp-Power9-Volta70 build
has not had any timeouts.)  If the APP teams tell us they want a full
gnu-debug-openmp-Power9-Volta70 build, then we will add one back and deal with
the timeouts.

NOTE: By having both 'debug' and 'release-debug' builds, we can be free to
disable some slow tests in the full 'debug' build and not lose much runtime
debug checking since these tests will be running in the 'release-debug' build
(with runtime debug checking enabled).  So going forward, if a test times out
in the 'debug' build but not the 'release-debug' build, then we will just
disable it in the 'debug' build and move on.
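
The rule in the NOTE above amounts to a set difference between two lists of
timed-out tests.  A minimal sketch, assuming the timeouts for each build have
been saved to plain text files with one test name per line (the function name,
file layout, and test names below are all made up for illustration):

```shell
# Hypothetical helper: print the tests that timed out in the 'debug' build
# but not in the 'release-debug' build (candidates to disable in 'debug').
# $1: file listing 'debug' timeouts, one test name per line
# $2: file listing 'release-debug' timeouts, one test name per line
tests_to_disable_in_debug() (
  awk -v rd_file="$2" '
    # Load the release-debug timeouts first (works even if the file is empty).
    BEGIN { while ((getline line < rd_file) > 0) also_slow[line] = 1 }
    # Print each debug timeout that is not also a release-debug timeout.
    !($0 in also_slow)
  ' "$1"
)

# Example with made-up test names.
tmpdir=$(mktemp -d)
printf '%s\n' ExampleTest_A ExampleTest_B > "$tmpdir/debug.txt"
printf '%s\n' ExampleTest_B > "$tmpdir/release-debug.txt"
tests_to_disable_in_debug "$tmpdir/debug.txt" "$tmpdir/release-debug.txt"  # prints ExampleTest_A
rm -rf "$tmpdir"
```

A test that times out in both builds (ExampleTest_B here) is not printed, since
disabling it in 'debug' would not leave it covered anywhere.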

I also renamed the 'cuda' builds to 'cuda-9.2' builds since that is what they
are and that matches the Jenkins drive names.
trilinos-autotester added a commit that referenced this issue Oct 18, 2018
…se-debug

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Add 'release-debug' build type (#3633) and change some 'waterman' builds (#3336)
PR Author: bartlettroscoe
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Oct 18, 2018
…terman' cuda-9.2-debug build (trilinos#3336)

Now that this test is running and passing in the new build
Trilinos-atdm-waterman-cuda-9.2-release-debug (see trilinos#3659 and trilinos#3633), it is
fine to disable it in this full -O0 'debug' build.

@bartlettroscoe
Member Author

CC: @fryeguy52, @trilinos/intrepid2, @trilinos/piro, @trilinos/framework

After the merge of #3659 yesterday, today's new release-debug builds on 'waterman' (shown here) are actually showing some new failures in the Intrepid2 and Piro packages that did not occur in the opt or debug builds on that machine. We will create separate Trilinos GitHub issues for those failures, but I thought that it would be useful to point this out here.

What this shows is that some code in Trilinos behaves differently on some platforms when optimized compiler flags are combined with runtime debug checking than it does in the opt builds (optimized flags, debug checking disabled) or the debug builds (non-optimized flags, debug checking enabled). What is interesting is that the GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build actually uses optimized compiler flags with runtime debug checking enabled, so it is not as if we don't have testing for this code path in PR testing. We just don't test all platforms in PR testing (obviously).

If I were a Trilinos customer, my default development build would be against a release-debug build instead of a debug build because the code runs pretty fast but still checks for runtime errors and reports them nicely. Therefore, release-debug builds should be a higher priority to test than debug or even opt/release builds.

@mperego
Contributor

mperego commented Oct 18, 2018

@bartlettroscoe I had a look at the failures and they all seem to be due to tolerances being too tight. I can try to fix this later today or tomorrow.

@bartlettroscoe
Member Author

I was going to switch over more debug builds to release-debug builds on platforms where we had disabled tests that were timing out, but given the new failures appearing in the release-debug builds on 'waterman' that I described above, I am going to punt on this for now. If we get new tests timing out in only the debug builds but not the opt builds on these other platforms, then we can switch some of these debug builds over to release-debug builds.

Therefore, I am going to close this issue and we will deal with the new failures in other issues.

@bartlettroscoe
Member Author

bartlettroscoe commented Oct 18, 2018

> @bartlettroscoe I had a look at the failures and they seem to be all due to tolerances being too tight. I can try to fix this later today or tomorrow

@mperego, these same failures have been happening in lots of other builds as well, as described in #2474. Let's continue the discussion of the Piro test failure there.

@bartlettroscoe
Member Author

@mperego, sorry, the issue for the Piro test failures is #2474.

mperego added a commit to mperego/Trilinos that referenced this issue Oct 21, 2018
          Compute error using l2 norms of arrays instead of element-wise magnitude to avoid issues w/ relative errors for zero entries.
          This should partially address issues trilinos#3633 and trilinos#2474
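
To illustrate why a norm-based error is more robust than an element-wise relative error, here is a minimal sketch (not the code from the commit above; the function name and sample values are made up). With a reference entry of exactly zero, the element-wise relative error |x_i - y_i| / |y_i| divides by zero, while the relative l2 error ||x - y|| / ||y|| stays small as long as the whole array is close.

```shell
# Hypothetical helper: relative l2 error ||x - y||_2 / ||y||_2.
# $1: computed values, $2: reference values (space-separated lists).
rel_l2_error() (
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " "); m = split(b, y, " ")
    if (n != m) { print "length mismatch" > "/dev/stderr"; exit 1 }
    for (i = 1; i <= n; i++) {
      d = x[i] - y[i]
      num += d * d          # accumulates ||x - y||^2
      den += y[i] * y[i]    # accumulates ||y||^2
    }
    if (den == 0) { print "reference norm is zero" > "/dev/stderr"; exit 1 }
    printf "%.3e\n", sqrt(num) / sqrt(den)
  }'
)

# The second reference entry is exactly zero, so an element-wise relative
# error would be undefined there; the l2 version is well behaved.
rel_l2_error "1.0 1e-14" "1.0 0.0"   # prints 1.000e-14
```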
@bartlettroscoe bartlettroscoe removed the ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams label Oct 27, 2018
@bartlettroscoe bartlettroscoe added ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams and removed ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams labels Oct 27, 2018
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Dec 5, 2018
…ebug-pt build (trilinos#2464, trilinos#3633)

We really need to switch most of these 'debug' builds to 'release-debug'
builds (see trilinos#3633).

Also, the Trilinos CUDA PR build really needs to be a cuda-9.2-release-debug
build since that runs more tests and catches more issues than either a
cuda-9.2-opt or cuda-9.2-debug build (see trilinos#3939).