Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build system: link object files #14754

Merged
merged 5 commits into from
Sep 10, 2020

Conversation

leandrolanzieri
Copy link
Contributor

Contribution description

This PR is another attempt to use the generated object files for the linking step instead of the archives. This makes the linking process more transparent and helps pointing out possible redefinition of symbols that may not have been caught before. It also saves a small amount of time by no dealing with the archive creation. The UNDEF list of objects is not needed anymore.

The PR also fixes the bindist example which is broken in current master. The archive seems to be malformed and the second linking is not possible. The example now copies the object files for the second linking.

Current problem

Currently we are using mostly archives in the linking step (only object files specified in the UNDEF makefile variable will be linked directly). That means that:

  • As we link archives and not objects the order really matters, because of the algorithm the linker uses to pick objects from those archives. We may be masking errors of multiple symbol definitions, which could be dangerous (see the example bellow).

  • We are generating one archive per module (even when some modules don't have any object files, e.g. sys or gnrc). This produced some issue in OSx apparently and we found a workaround to be able to link empty archives.

  • As archives may have circular dependencies between them, we are forced to use the --start-group and --end-group options for all our archives. According to man ld, "Using this option has a significant performance cost".

An example of the problem

As order matters, and we use alphabetical order to link the archives, we are currently not overriding the weakkly defined panic_arch function, and using always the version provided by the core. To see this, just compile some example on an architecture that overrides this (e.g. any cortexm), and inspect the binary (nm bin/samr21-xpro/gcoap_example.elf), the symbol
type will be W. This could happen with other symbols in other modules, and depends on which other modules are being used and their order in the linking process.

Testing procedure

  • Green CI
  • We should test this on different platforms. Keep in mind that binaries will change, because of the problem explained above.

Issues/PRs references

Alternative to #8711
Initial suggestion in #5757

@leandrolanzieri leandrolanzieri added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: build system Area: Build system labels Aug 13, 2020
@cgundogan cgundogan added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Aug 13, 2020
@leandrolanzieri leandrolanzieri force-pushed the pr/no_archives branch 2 times, most recently from 2d6358f to 9000eed Compare August 13, 2020 15:02
@cgundogan cgundogan added CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR and removed CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Aug 15, 2020
@leandrolanzieri
Copy link
Contributor Author

Many changes that are currently in the PR can and should be split. I'll do so if we agree with the change.

@maribu
Copy link
Member

maribu commented Aug 18, 2020

Could this fix the issue that randomly a weak symbol is linked in instead of a strong symbol? I had this issue when working with a local PKG and in the end resorted to using guardimg the weak symbol with #ifdef HAS_FOO, as otherwise I ended up with the weak symbol being preferred over the strong one from time to time.

@leandrolanzieri
Copy link
Contributor Author

Could this fix the issue that randomly a weak symbol is linked in instead of a strong symbol? I had this issue when working with a local PKG and in the end resorted to using guardimg the weak symbol with #ifdef HAS_FOO, as otherwise I ended up with the weak symbol being preferred over the strong one from time to time.

Yes. That's because of the way the linker is picking the objects that provide needed symbols from the archives, which depends of the order of the archives on the list, and the name of the objects. Using objects it should link the strong symbol, and throw an error when more than one strong symbol is found. For examples of this, see the panic_arch example above, and the multiple fixes that were added to the PR because of multiple definitions of symbols.

@leandrolanzieri
Copy link
Contributor Author

@leandrolanzieri Would you mind confirming that this commit indeed caused a regression with libhydrogen and the combination of board and compiler?

I'm on it. I could not reproduce at first but when building in Docker the error pops up.

@leandrolanzieri
Copy link
Contributor Author

@leandrolanzieri Would you mind confirming that this commit indeed caused a regression with libhydrogen and the combination of board and compiler?

I'm on it. I could not reproduce at first but when building in Docker the error pops up.

The regression seems to be caused by the commit you pointed out. I'm debugging to figure out why the call to hydro_mem_ct_cmp_u32 from hydro_secretbox_decrypt is returning an error. The weird thing as I mentioned is that I can only reproduce when using the Docker build.

My clang version outside docker is:

clang version 8.0.0-3 (tags/RELEASE_800/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

@bergzand
Copy link
Member

bergzand commented Sep 23, 2020

I've opened an issue to track the regression, see #15066

@benpicco
Copy link
Contributor

benpicco commented Oct 4, 2020

This broke the output on esp8266 #15137

(That might also be the reason why esp32 tests are failing on CI)

@maribu
Copy link
Member

maribu commented Oct 4, 2020

Interestingly, the output of elf_diff of e80bb4b and e24a64f for examples/default on the ESP8266 is huge. The number of symbols linked in and their names remained unchanged, but many symbols changed in size.

Seemingly, symbol exception_handler from cpu/esp_common/exceptions.c was never linked in for the ESP8266 previously (and some other same-named symbol likely from the SDK got linked in instead) - but it now is.

@leandrolanzieri
Copy link
Contributor Author

Seemingly, symbol exception_handler from cpu/esp_common/exceptions.c was never linked in for the ESP8266 previously (and some other same-named symbol likely from the SDK got linked in instead) - but it now is.

This could be the case for other functions defined as weak, which did not get overridden before and now are. I'll look into this.

@gschorcht
Copy link
Contributor

Seemingly, symbol exception_handler from cpu/esp_common/exceptions.c was never linked in for the ESP8266 previously (and some other same-named symbol likely from the SDK got linked in instead)

That's not true. The exception_handler function from cpu/esp_common/exceptions.c was definitely used before to print the register set as well as the running processes and to restart the system in case of crashes.

@leandrolanzieri
Copy link
Contributor Author

Interestingly, the output of elf_diff of e80bb4b and e24a64f for examples/default on the ESP8266 is huge. The number of symbols linked in and their names remained unchanged, but many symbols changed in size.

I'm also getting a big output with this tool. Interestingly, it points out that one symbol disappeared and one symbol is new (main):

Symbols Disappeared
===================
_exit-0x8: 0 bytes
   388c      	beqz.n	a8, 40210017 <_irom0_text_start+0x7>
   b44023        	excw
   101b      	addi.n	a1, a0, 1
   322040        	excw
   __ED_SOURCE_START____ED_SOURCE_END__

New Symbols
===========
main-0x14: 0 bytes
   5388      	l32i.n	a8, a3, 20
   984023        	excw
   402353        	excw
   53a8      	l32i.n	a10, a3, 20
   d04023        	excw
   402353        	excw
   53d8      	l32i.n	a13, a3, 20
   214023        	excw
   __ED_SOURCE_START____ED_SOURCE_END__

@maribu
Copy link
Member

maribu commented Oct 5, 2020

That's not true. The exception_handler function from cpu/esp_common/exceptions.c was definitely used before to print the register set as well as the running processes and to restart the system in case of crashes.

Strange. It changed by 100 Byte and the machine code looks pretty different. I wonder what else could explain this :-/

Here is by the way the output of elf_diff for examples/default build with the docker toolchain: https://mari-bu.de/pr14754.html (Beware: The file is huge (as in multiple MiBs), that will take some seconds to load from my Raspberry Pi home server.)

I sadly forgot to overwrite RIOT_VERSION, but that results only in small changes in the generated binary anyway.

@leandrolanzieri
Copy link
Contributor Author

Please see #15186

@JannesVolkens
Copy link
Contributor

Because of this PR, it is somehow no longer possible to build anything on the nucleo-f207zg :/
I tried running the test under /tests/buttons and it gave me this output:

jannes@Latitude:~/RIOT/tests/buttons$ BOARD=nucleo-f207zg make all flash term
Building application "tests_buttons" for "nucleo-f207zg" with MCU "stm32".

"make" -C /home/jannes/RIOT/boards/nucleo-f207zg
"make" -C /home/jannes/RIOT/boards/common/nucleo
"make" -C /home/jannes/RIOT/core
"make" -C /home/jannes/RIOT/cpu/stm32
"make" -C /home/jannes/RIOT/cpu/cortexm_common
"make" -C /home/jannes/RIOT/cpu/cortexm_common/periph
"make" -C /home/jannes/RIOT/cpu/stm32/bootloader
"make" -C /home/jannes/RIOT/cpu/stm32/periph
"make" -C /home/jannes/RIOT/cpu/stm32/stmclk
"make" -C /home/jannes/RIOT/cpu/stm32/vectors
"make" -C /home/jannes/RIOT/drivers
"make" -C /home/jannes/RIOT/drivers/periph_common
"make" -C /home/jannes/RIOT/sys
"make" -C /home/jannes/RIOT/sys/auto_init
"make" -C /home/jannes/RIOT/sys/isrpipe
"make" -C /home/jannes/RIOT/sys/newlib_syscalls_default
"make" -C /home/jannes/RIOT/sys/pm_layered
"make" -C /home/jannes/RIOT/sys/stdio_uart
"make" -C /home/jannes/RIOT/sys/test_utils/interactive_sync
"make" -C /home/jannes/RIOT/sys/tsrb
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/core/kernel_init.o: In function `kernel_init':
/home/jannes/RIOT/core/kernel_init.c:72: multiple definition of `kernel_init'
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/core/init.o:/home/jannes/RIOT/core/init.c:78: first defined here
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/stm32_clk/stmclk_fx.o: In function `stmclk_init_sysclk':
/home/jannes/RIOT/cpu/stm32/stmclk/stmclk_fx.c:167: multiple definition of `stmclk_init_sysclk'
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/stm32_clk/stmclk_f2f4f7.o:/home/jannes/RIOT/cpu/stm32/stmclk/stmclk_f2f4f7.c:465: first defined here
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/stm32_vectors/STM32F207xx.o: In function `dummy_handler':
/home/jannes/RIOT/cpu/stm32/vectors/STM32F207xx.c:14: multiple definition of `dummy_handler'
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/cpu/vectors.o:/home/jannes/RIOT/cpu/stm32f2/vectors.c:28: first defined here
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/stm32_vectors/STM32F207xx.o:/home/jannes/RIOT/cpu/stm32/vectors/STM32F207xx.c:94: multiple definition of `vector_cpu'
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/cpu/vectors.o:/home/jannes/RIOT/cpu/stm32f2/vectors.c:109: first defined here
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/stm32_vectors/vectors_f2.o: In function `dummy_handler':
/home/jannes/RIOT/cpu/stm32/vectors/vectors_f2.c:28: multiple definition of `dummy_handler'
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/cpu/vectors.o:/home/jannes/RIOT/cpu/stm32f2/vectors.c:28: first defined here
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/stm32_vectors/vectors_f2.o:/home/jannes/RIOT/cpu/stm32/vectors/vectors_f2.c:109: multiple definition of `vector_cpu'
/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/cpu/vectors.o:/home/jannes/RIOT/cpu/stm32f2/vectors.c:109: first defined here
collect2: error: ld returned 1 exit status
/home/jannes/RIOT/tests/buttons/../../Makefile.include:584: recipe for target '/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/tests_buttons.elf' failed
make: *** [/home/jannes/RIOT/tests/buttons/bin/nucleo-f207zg/tests_buttons.elf] Error 1

@aabadie
Copy link
Contributor

aabadie commented Oct 9, 2020

it is somehow no longer possible to build anything on the nucleo-f207zg

Can you try clean flash term targets ? You have object files from older builds.

@JannesVolkens
Copy link
Contributor

Oh damn, I didn't see this :D.
Thank you @aabadie, everything is working now.

gschorcht added a commit to gschorcht/RIOT-Xtensa-ESP that referenced this pull request Dec 1, 2021
With the change from using archive files to using object files in PR RIOT-OS#14754, the module `esp_wifi_enterprise` was no longer compilable. The reason for this was that the file `bignum.c` was present in two different vendor sub-directories and created two identical object files, which then led to a symbol conflict when linking. This commit removes one of these identical files. The one that is used in all `esp_wifi` variants is kept, the one that is only used in `esp_wifi_enterprise` is dropped.
dylad pushed a commit to dylad/RIOT that referenced this pull request Dec 2, 2021
With the change from using archive files to using object files in PR RIOT-OS#14754, the module `esp_wifi_enterprise` was no longer compilable. The reason for this was that the file `bignum.c` was present in two different vendor sub-directories and created two identical object files, which then led to a symbol conflict when linking. This commit removes one of these identical files. The one that is used in all `esp_wifi` variants is kept, the one that is only used in `esp_wifi_enterprise` is dropped.
fjmolinas pushed a commit to fjmolinas/RIOT that referenced this pull request Dec 6, 2021
With the change from using archive files to using object files in PR RIOT-OS#14754, the module `esp_wifi_enterprise` was no longer compilable. The reason for this was that the file `bignum.c` was present in two different vendor sub-directories and created two identical object files, which then led to a symbol conflict when linking. This commit removes one of these identical files. The one that is used in all `esp_wifi` variants is kept, the one that is only used in `esp_wifi_enterprise` is dropped.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: build system Area: Build system CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Impact: major The PR changes a significant part of the code base. It should be reviewed carefully Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 3-testing The PR was tested according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.