Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime build may not be reproducible on Mac #609

Open
mvdan opened this issue Nov 14, 2022 · 13 comments
Open

runtime build may not be reproducible on Mac #609

mvdan opened this issue Nov 14, 2022 · 13 comments
Labels
bug Something isn't working

Comments

@mvdan
Copy link
Member

mvdan commented Nov 14, 2022

We've been seeing gogarble.txt failures in CI on Mac for a couple of weeks now. They happen with Go 1.19.x

I was finally able to grab the two binaries which should be equal but are not, attached below. Here is the start of their diffoscope with LLVM installed:

--- file1-3602764619
+++ file2-1889520802
├── strings -a -n 8 {}
│ @@ -13,15 +13,14 @@
│  __noptrdata
│  __noptrbss
│  __LINKEDIT
│  /usr/lib/dyld
│  /usr/lib/libSystem.B.dylib
│  /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
│  /System/Library/Frameworks/Security.framework/Versions/A/Security
│ -/usr/lib/libobjc.A.dylib
│  UUUUUUUUH!
│  33333333H!
│  D$pH9P@w
│  t*H9HPt$
│  debugCal
│  debugCal
│  debugCalH9
│ @@ -33143,15 +33142,14 @@
│  _getrlimit
│  _getsockname
│  _getsockopt
│  _mach_absolute_time
│  _mach_timebase_info
│  _madvise
│  _nanosleep
│ -_objc_msgSend
│  _pthread_attr_destroy
│  _pthread_attr_getstacksize
│  _pthread_attr_init
│  _pthread_attr_setdetachstate
│  _pthread_cond_broadcast
│  _pthread_cond_init
│  _pthread_cond_signal
├── llvm-readobj --file-headers {}
│ @@ -3,16 +3,16 @@
│  Arch: x86_64
│  AddressSize: 64bit
│  MachHeader {
│    Magic: Magic64 (0xFEEDFACF)
│    CpuType: X86-64 (0x1000007)
│    CpuSubType: CPU_SUBTYPE_X86_64_ALL (0x3)
│    FileType: Executable (0x2)
│ -  NumOfLoadCommands: 18
│ -  SizeOfLoadCommands: 2432
│ +  NumOfLoadCommands: 17
│ +  SizeOfLoadCommands: 2376
│    Flags [ (0x85)
│      MH_DYLDLINK (0x4)
│      MH_NOUNDEFS (0x1)
│      MH_TWOLEVEL (0x80)
│    ]
│    Reserved: 0x0
│  }
├── llvm-readobj --needed-libs {}
│ @@ -2,9 +2,8 @@
│  Format: Mach-O 64-bit x86-64
│  Arch: x86_64
│  AddressSize: 64bit
│  NeededLibraries [
│    /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
│    /System/Library/Frameworks/Security.framework/Versions/A/Security
│    /usr/lib/libSystem.B.dylib
│ -  /usr/lib/libobjc.A.dylib
│  ]
├── llvm-readobj --symbols {}
│ @@ -1,66975 +1,66975 @@
│  
│  Format: Mach-O 64-bit x86-64
│  Arch: x86_64
│  AddressSize: 64bit
│  Symbols [
│    Symbol {
│ -    Name: _runtime.text (2024)
│ +    Name: _runtime.text (2010)
│      Type: Section (0xE)
│      Section: __text (0x1)
│      RefType: UndefinedNonLazy (0x0)
│      Flags [ (0x0)
│      ]
│      Value: 0x1000018C0
│    }
│    Symbol {
│ -    Name: _internal/cpu.Initialize (2038)
│ +    Name: _internal/cpu.Initialize (2024)
│      Type: Section (0xE)
│      Section: __text (0x1)
│      RefType: UndefinedNonLazy (0x0)
│      Flags [ (0x0)
│      ]
│      Value: 0x1000018C0
│    }
│    Symbol {
│ -    Name: _internal/cpu.processOptions (2063)
│ +    Name: _internal/cpu.processOptions (2049)
│      Type: Section (0xE)
│      Section: __text (0x1)
│      RefType: UndefinedNonLazy (0x0)
│      Flags [ (0x0)

It looks like, for some reason, the original build dynamically links against libc, but the rebuild does not. I'm not sure if msgSend (presumably from //go:cgo_import_dynamic libc_sendmsg sendmsg "libc.so" in Go's standard library) is relevant here, but it appears to be the only string difference as well.

Here is the last build log plus the diffed files:

Some observations:

  • This is a problem in Go 1.19.2 (per the log above), so if Go is to blame, it is not a recent problem.
  • I don't see anything about darwin, the toolchain, or libc for the released 1.19.3 nor the upcoming 1.19.4.
  • So far, we haven't reproduced outside of GitHub Actions. It could be an issue with how they installed Go or the C/libc toolchain. It would be useful to see if we can reproduce the error on other Mac machines.
  • This issue only pops up on Mac. We've been able to reproduce four times on CI, but not on any other OS. So this at least tells me that the bug is rather unlikely to be in garble itself - we only treat Mac as special in one single GOOS != "darwin" line, and it seems fairly innocent.
  • This started happening on about a quarter of all Mac builds on CI recently. I struggle to imagine what recent change in garble could have caused this. The first failure was 13 days ago in fix garble with newer Go tip versions #601 (comment), so any one of the handful of changes in garble over the last month could have caused it.

My current best guess is that we broke reproducibility on Darwin with 7c28663. The syscall package is quite special as it does a log of linker directive magic, like the //go:cgo_import_dynamic I showed above. We might be obfuscating those names, but we're definitely not touching those directives - we only update the ones we explicitly support, like //go:linkname.

@mvdan
Copy link
Member Author

mvdan commented Nov 14, 2022

If anyone has a Mac machine, please try to reproduce via go test -run=Script/gogarble. You might need to run the test ten times or so, via go test -run=Script/gogarble -count=10 -failfast, as the failure doesn't always happen on CI.

@mvdan
Copy link
Member Author

mvdan commented Nov 14, 2022

I think my guess is right. With GOGARBLE=* GOOS=darwin, I can see that we obfuscated the syscall package incorrectly:

//go:cgo_import_dynamic libc_sendmsg sendmsg "/usr/lib/libSystem.B.dylib"

func cIuq_0pP(vSRMCYjCQ int, vpOlW3CefzM unsafe.Pointer, zRkUUkY29iiY int, oxBSrYC_IB unsafe.Pointer, xQTI5Uez int, iRW8rFu0X9nX *YnjcAiI5vCyi) (zPzblqaRGRXzBE int, cGQJIxmS error) {

Note how we obfuscated the declaration of libc_sendmsg, but not its appearance in the linker directive.

@mvdan
Copy link
Member Author

mvdan commented Nov 14, 2022

If the syscall revert doesn't do the trick, then I think the macos-latest switch from 11 to 12 could be the culprit: https://github.blog/changelog/2022-10-03-github-actions-jobs-running-on-macos-latest-are-now-running-on-macos-12/

That changed the versions of a lot of versions in the native toolchain, so it could have introduced problems. It looks like it was rolled out for us in mid October, which lines up with the first failure - assuming we got lucky for a few times until we got the first failure.

@kortschak
Copy link
Contributor

Running the test above on a Catalina (10.15.7) install with go1.19.3 does not fail, but on Monterey (12.6.1) does. I don't have an 11 installation to test on.

@mvdan
Copy link
Member Author

mvdan commented Nov 17, 2022

@kortschak thanks, that's interesting. Can you reproduce repeatedly? I wonder if it only happens once on a clean install of Go, given that the standard library comes pre-built. Perhaps it becomes different once re-built for the first time.

mvdan added a commit that referenced this issue Nov 17, 2022
In the hopes that it fixes the sporadic gogarble.txtar failures on Mac,
where rebuilding runtime sometimes leads to a different binary.

For #609.
mvdan added a commit that referenced this issue Nov 17, 2022
In the hopes that it fixes the sporadic gogarble.txtar failures on Mac,
where rebuilding runtime sometimes leads to a different binary.

For #609.
mvdan added a commit that referenced this issue Nov 17, 2022
In the hopes that it fixes the sporadic gogarble.txtar failures on Mac,
where rebuilding runtime sometimes leads to a different binary.

For #609.
mvdan added a commit that referenced this issue Nov 17, 2022
In the hopes that it fixes the sporadic gogarble.txtar failures on Mac,
where rebuilding runtime sometimes leads to a different binary.

For #609.
mvdan added a commit that referenced this issue Nov 17, 2022
In the hopes that it fixes the sporadic gogarble.txtar failures on Mac,
where rebuilding runtime sometimes leads to a different binary.

For #609.
@mvdan
Copy link
Member Author

mvdan commented Nov 17, 2022

Nope, obfuscating the syscall package is not the problem. I reverted that piece in 7879624, and fired off five CI runs by force-pushing to ci-test (which you can see above), and one already failed in the same way.

@kortschak
Copy link
Contributor

I was able to get the failure consistently on the Monterey (just rechecked this twice) and never on the Catalina.

Will the bincmp outputs be useful for you?

@mvdan
Copy link
Member Author

mvdan commented Nov 17, 2022

Will the bincmp outputs be useful for you?

Yes, if anything to check that they are the same as the original post.

consistently

Huh, on CI it just fails about a third of the time.

@mvdan
Copy link
Member Author

mvdan commented Nov 17, 2022

If you could spare the time - could you check whether v0.7.2 has the same test failure? If not, could you bisect to find which commit introduces the problem? Hopefully the bisection shouldn't be too costly if you can reproduce via go test -run=Script/gogarble.

@mvdan
Copy link
Member Author

mvdan commented Nov 17, 2022

Aha! I just pushed v0.7.2 (with reduced CI to save time) to ci-test, and it failed: /~https://github.com/burrowers/garble/actions/runs/3492436397/jobs/5846215678

So it seems like the breakage was indeed the MacOS bump to version 12, and not anything that we changed recently in garble.

So, either MacOS 12 somehow broke Go (e.g. its build system), or garble does something that only happens to break on MacOS 12 and later - and so we never noticed before.

mvdan added a commit to mvdan/garble-fork that referenced this issue Nov 29, 2022
Since it's not yet clear whether it's a bug on garble's end,
and the regression was the MacOS version bump and not any recent change
we did.

For burrowers#609.
mvdan added a commit that referenced this issue Nov 29, 2022
Since it's not yet clear whether it's a bug on garble's end,
and the regression was the MacOS version bump and not any recent change
we did.

For #609.
@mvdan
Copy link
Member Author

mvdan commented Jun 3, 2023

I've re-enabled macos-latest and I haven't seen any errors on the 20 or so builds CI has done on Mac since. Fingers crossed that whatever it was got fixed. There's a small chance that it was related to the os/exec issues caused by testscript per rogpeppe/go-internal#200.

@mvdan
Copy link
Member Author

mvdan commented Jun 14, 2023

Whatever it was, it appears to be fixed :)

@mvdan mvdan closed this as completed Jun 14, 2023
@mvdan
Copy link
Member Author

mvdan commented Sep 23, 2023

Womp womp, not fixed at all. Happened again with Go 1.21.1 on MacOS 12.6.9: /~https://github.com/burrowers/garble/actions/runs/6284752892/job/17066435869

             > exec garble build -a runtime
            > exec garble build -o=out_rebuild ./stdimporter
            > bincmp out_rebuild out
            diffoscope not found; skipping
            wrote files to bincmp_output/file1-2443663803 and bincmp_output/file2-890863538
            FAIL: testdata/script/gogarble.txtar:52: out_rebuild and out differ; diffoscope above, size diff: -32
$ diffoscope file1-2443663803 file2-890863538
--- file1-2443663803
+++ file2-890863538
├── strings -a -n 8 {}
│ @@ -15,15 +15,14 @@
│  __common
│  __LINKEDIT
│  /usr/lib/dyld
│  /usr/lib/libresolv.9.dylib
│  /usr/lib/libSystem.B.dylib
│  /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
│  /System/Library/Frameworks/Security.framework/Versions/A/Security
│ -/usr/lib/libobjc.A.dylib
│  l$ M9,$u
│  UUUUUUUUH!
│  33333333H!
│  D$pH9P@w
│  L$hH9J0uBH
│  debugCal
│  debugCal
│ @@ -33710,15 +33709,14 @@
│  _getsockopt
│  _issetugid
│  _mach_absolute_time
│  _mach_timebase_info
│  _madvise
│  _nanosleep
│  _notify_is_valid_token
│ -_objc_msgSend
│  _pthread_attr_getstacksize
│  _pthread_attr_init
│  _pthread_attr_setdetachstate
│  _pthread_attr_setstacksize
│  _pthread_cond_broadcast
│  _pthread_cond_init
│  _pthread_cond_signal
├── llvm-readobj --file-headers {}
│ @@ -3,16 +3,16 @@
│  Arch: x86_64
│  AddressSize: 64bit
│  MachHeader {
│    Magic: Magic64 (0xFEEDFACF)
│    CpuType: X86-64 (0x1000007)
│    CpuSubType: CPU_SUBTYPE_X86_64_ALL (0x3)
│    FileType: Executable (0x2)
│ -  NumOfLoadCommands: 19
│ -  SizeOfLoadCommands: 2568
│ +  NumOfLoadCommands: 18
│ +  SizeOfLoadCommands: 2512
│    Flags [ (0x85)
│      MH_DYLDLINK (0x4)
│      MH_NOUNDEFS (0x1)
[...]

@mvdan mvdan reopened this Sep 23, 2023
mvdan added a commit to mvdan/garble-fork that referenced this issue Sep 23, 2023
CI ran into the failure again; reopened burrowers#609 for now.
mvdan added a commit to mvdan/garble-fork that referenced this issue Sep 24, 2023
CI ran into the failure again; reopened burrowers#609 for now.
lu4p pushed a commit that referenced this issue Sep 24, 2023
CI ran into the failure again; reopened #609 for now.
@lu4p lu4p added the bug Something isn't working label Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

No branches or pull requests

3 participants