Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Travis tests on arm64, ppc64le, s390x #3744

Merged
merged 1 commit into from
Apr 18, 2020

Conversation

fingolfin
Copy link
Member

@fingolfin fingolfin commented Nov 14, 2019

Note that s390x is big endian which would be really nice in light of issue #2362.

And testing on arm64 might help with debugging issue #3919.

Resolves #2362.

@fingolfin fingolfin added do not merge PRs which are not yet ready to be merged (e.g. submitted for discussion, or test results) topic: ci Anything related to GitHub Actions, Codecov, AppVeyor, Coveralls, Travis, ... labels Nov 14, 2019
@fingolfin fingolfin force-pushed the mh/s390x branch 2 times, most recently from b912bec to 174744b Compare November 15, 2019 09:30
@ChrisJefferson

This comment has been minimized.

@fingolfin
Copy link
Member Author

fingolfin commented Nov 22, 2019

While the aarch64 crash is puzzling, in the end I only really care about s390x, as it is big endian, and there it runs fine and gets to the test suite (I guess @dimpase will be happy ;-). But then we see an actual issues, due to the hash function (as mentioned on issue #2362). I see no way to get identical hashes here, though (because the hash function does not know whether our data consists of 16bit or 32bit words). So that leaves the following options:

  1. Just drop those tests and pretend nothing happened ;-)
  2. we already have different hashes for 32 vs 64 bit archs; so I simply add two more values, for BE.
  3. we decide that we rather have non-reproducible hashes, and always scramble them on each GAP startup with some kind of global random seed (no idea how we'd obtain that, but that's an irrelevant detail for now). I think globally if I was creating GAP from scratch, without concerns for existing code and users, I'd like this most, but it's a rather major break from what we do right now everywhere else (where we strongly try to keep RNG states reproducible across GAP launches). Anyway, in this case, the test of course also needs to be removed/replaced (e.g. by testing that pairs of equal but non-identical objects have the same hash)

Other suggestions?

@hulpke
Copy link
Contributor

hulpke commented Nov 22, 2019

If you want an opinion, I would chose 2. It is not a big extra cost (adding two more definitions), and unlikely to force further cases in the forseeable future.

@fingolfin
Copy link
Member Author

BTW, there is a second test failure seen on both ppc64le and s390x which I right now can't explain, caused by different results given by OPERS_CACHE_INFO() right after calling CLEAR_CACHE_INFO();

########> Diff in /home/travis/build/gap-system/gap/tst/testinstall/kernel/opers.tst:266
# Input is:
opcheck{[1..11]};
# Expected output:
[ 0, 0, 0, 7, 0, 0, 2, 0, 0, 0, 0 ]
# But found:
[ 0, 0, 0, 6, 0, 0, 2, 0, 0, 0, 0 ]

I have absolutely no idea right now what might be causing this. At first I thought the hash function for the method selection cache is not BE safe, but then I noticed the issue also pops up on ppc64le. Huh.

@fingolfin

This comment has been minimized.

@fingolfin

This comment has been minimized.

@fingolfin

This comment has been minimized.

@fingolfin
Copy link
Member Author

So I dug down on the weird difference in the output of OPERS_CACHE_INFO, where different OperationHit values are shown (7 on x86, but 6 on these three architectures), with the most genius debugging tool invented by mankind, printf. I set up things so that whenever OperationHit is incremented, its new value is printed as well as the name of the operation for which the cache was searched. I also logged the GAP call tree, so I know that all hits are inside the GAP library function RunTests

On aarch64 and s390x, I got this (for powerpc64le the Travis had another issue, but I am confident it'll be the same:

OperationHit = 1 for ReadLine
OperationHit = 2 for CloseStream
OperationHit = 3 for CloseStream
OperationHit = 4 for InputTextString
OperationHit = 5 for OutputTextString
OperationHit = 6 for PrintFormattingStatus
OperationHit = 7 for ReadLine
OperationHit = 8 for CloseStream

In contrast, on my Mac on x86, I got this:

OperationHit = 1 for ReadLine
OperationHit = 2 for CloseStream
OperationHit = 3 for CloseStream
OperationHit = 4 for Int          <--------- this is new
OperationHit = 5 for InputTextString
OperationHit = 6 for OutputTextString
OperationHit = 7 for PrintFormattingStatus
OperationHit = 8 for ReadLine
OperationHit = 9 for CloseStream

I inserted some more code to get a backtrace of the GAP code leading to this, and I found out that the Int call in question is the one in lib/test.gi:213; i.e., in this code:

      Print("\r# line ", pos[i],
            " of ", nrlines,
            " (", Int(pos[i] / nrlines * 100), "%)",
            "\c");

I still don't have a clue why it is in the method cache the one time but not the other. But OK, some minor difference in the environment might lead to a different path being taken somewhere and then an Int call is made differently, resulting in a slightly different cache content... So now I have at least hope that this might not be due to random memory corruption.

It also suggests a simple workaround: insert a call Int(1/2) before the problematic test. Let's see if that works.

@fingolfin fingolfin force-pushed the mh/s390x branch 3 times, most recently from ee8dab9 to 06b91f0 Compare April 15, 2020 11:38
@coveralls
Copy link

coveralls commented Apr 16, 2020

Coverage Status

Coverage increased (+0.0003%) to 84.331% when pulling c1f6ee6 on fingolfin:mh/s390x into 1ea7bc2 on gap-system:master.

@fingolfin fingolfin removed the do not merge PRs which are not yet ready to be merged (e.g. submitted for discussion, or test results) label Apr 16, 2020
@ChrisJefferson
Copy link
Contributor

Wow, thanks for battling this. Let's see how stable it is long term, but I'm happy to merge it (assuming all the tests pass of course)

@fingolfin
Copy link
Member Author

For the backport, see #3966

Besides just adding three jobs to the Travis build matrix, we have to
deal with some issues with the VMs for the three new platforms:

- Always call the right pip matching the `python` binary; and
  prepare for `python` being Python 3.
  For details, see <https://travis-ci.community/t/8165>

- Work around the fact that the pip cache dir is owned by root.
  For details, see <https://travis-ci.community/t/7822/6>

- Use travis_terminate liberally in .travis.yml to workaround another
  issue with the three new platforms; as a side benefit, we also get
  Travis to reliably and early on abort CI builds with failures.
  For details, see <https://travis-ci.community/t/7659/>.

- Adjust a test for OPERS_CACHE_INFO to pass on the three new platforms.
@fingolfin fingolfin merged commit f676ec2 into gap-system:master Apr 18, 2020
@fingolfin fingolfin deleted the mh/s390x branch April 18, 2020 23:37
@fingolfin
Copy link
Member Author

fingolfin commented Apr 18, 2020

Backported to stable-4.11 via 4d3e491 from PR #3966.

@dimpase
Copy link
Member

dimpase commented Sep 25, 2022

as Travis is gone for GAP (?), what's the status of this? No more s390 tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-to-4.11-DONE release notes: not needed PRs introducing changes that are wholly irrelevant to the release notes topic: ci Anything related to GitHub Actions, Codecov, AppVeyor, Coveralls, Travis, ...
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test, or block, big-endian CPUs
6 participants