
New bitshuffle functions #567

Merged
merged 11 commits into main from new-bitshuffle on Nov 1, 2023

Conversation

FrancescAlted
Member

Here is a new implementation of bitshuffle for AVX512 coming from upstream. I also took the opportunity to sync the SSE2, AVX2 and ARM/NEON implementations (however, the latter is still much slower than the generic solution, so it is not enabled by default).

Preliminary results using AVX512 on Zen4 (AMD 7950X) point to a speed-up of 10% to 15%, which is quite good (especially as Zen4 does not have a full 512-bit register implementation).

Finally, the AVX512 code path will be compiled by default on UNIX platforms (I still need to figure out which MSVC versions support AVX512), and it will be used when AVX512 (more specifically, AVX512F and AVX512BW) support is detected at runtime.
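For reference, runtime detection along these lines can be sketched with CPUID plus XGETBV. This is a hypothetical helper under my own naming, not the actual blosc_get_cpu_features() code, and it assumes GCC or Clang on x86:

```c
#include <cpuid.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of runtime AVX512 detection (hypothetical helper).  Checks
   AVX512F and AVX512BW via CPUID leaf 7, and verifies that the OS
   saves ZMM state via XGETBV before trusting those bits. */
static bool cpu_has_avx512(void) {
  unsigned int eax, ebx, ecx, edx;
  /* CPUID leaf 1: OSXSAVE must be set before XGETBV is legal */
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !((ecx >> 27) & 1))
    return false;
  uint32_t xcr0_lo, xcr0_hi;
  __asm__("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
  (void)xcr0_hi;
  if ((xcr0_lo & 0xE6) != 0xE6)   /* XMM, YMM and ZMM state enabled */
    return false;
  /* CPUID leaf 7, subleaf 0: extended feature flags */
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    return false;
  return ((ebx >> 16) & 1)        /* AVX512F  */
      && ((ebx >> 30) & 1);       /* AVX512BW */
}
```

The XCR0 check matters because a CPU can report AVX512F/BW in CPUID while the OS has not enabled ZMM state saving, in which case AVX512 instructions would fault.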

Although it is slower than our existing code, and of course,
slower than the generic code (at least on an ARM M1), at least
this is completely correct, even for typesizes 1, 2 and 4 (the
previous code was not).

Also, syncing sources with the bitshuffle project makes things
easier when porting docs and fixes from there.
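For readers unfamiliar with the filter, here is a naive reference sketch of what bitshuffle computes (a hypothetical helper for illustration, not the synced upstream code): it transposes the bit matrix of the data so that bit plane b of every element becomes contiguous in the output.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive reference bitshuffle (hypothetical helper; the real SIMD
   kernels are far more involved).  For data viewed as size/typesize
   little-endian elements, it gathers bit plane b of every element
   into one contiguous run of output bits.  Requires size to be a
   multiple of typesize and the element count a multiple of 8. */
static void bitshuffle_ref(const uint8_t *src, uint8_t *dst,
                           size_t size, size_t typesize) {
  size_t nelem = size / typesize;
  size_t nbits = typesize * 8;              /* bit planes per element */
  for (size_t b = 0; b < nbits; b++) {
    for (size_t i = 0; i < nelem; i++) {
      /* bit b of element i */
      uint8_t bit = (src[i * typesize + b / 8] >> (b % 8)) & 1;
      size_t obit = b * nelem + i;          /* destination bit index */
      if (bit)
        dst[obit / 8] |= (uint8_t)(1u << (obit % 8));
      else
        dst[obit / 8] &= (uint8_t)~(1u << (obit % 8));
    }
  }
}
```

After this transform, elements whose values differ only in a few low bits produce long runs of identical bytes, which is why fast codecs like LZ4 compress bitshuffled data so well.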
ivilata added a commit to ivilata/c-blosc2 that referenced this pull request Oct 27, 2023
ivilata added a commit to ivilata/python-blosc2 that referenced this pull request Oct 29, 2023
FrancescAlted pushed a commit that referenced this pull request Oct 30, 2023
@FrancescAlted
Member Author

This should be mostly ready, but for some reason fuzzing is failing in CI. I have tried to reproduce that on my Ubuntu box, but I cannot. Also, valgrind seems fine for the dataset causing the issue for compress_chunk_fuzzer. I am curious whether this is a genuine problem, or whether we should ignore it. @nmoinvaz can you shed some light on this one? If you cannot reproduce it either, can you suggest a workaround to avoid the failure in CI?

@nmoinvaz
Member

nmoinvaz commented Oct 30, 2023

The log shows that the input causing the crash is base64 HR2dnZ3///////////////89////////////EyvuU/////8=. You should decode it and feed the decoded file into compress_chunk_fuzzer. If compress_chunk_fuzzer reproduces the crash, it is a real issue. It is the most basic check.

@FrancescAlted
Member Author

Yes, I did exactly that, but I am unable to reproduce the issue:

$ base64 -d > input.fuz
HR2dnZ3///////////////89////////////EyvuU/////8=
$ build/tests/fuzz/compress_chunk_fuzzer input.fuz
Running 1 inputs
Running: build/tests/fuzz/compress_chunk_fuzzer input.fuz
Done:    input.fuz: (35 bytes)
$ echo $?
0

Am I missing something?

@nmoinvaz
Member

I haven't tried it, but I can later. Interesting that it fails in blosc_get_cpu_features. Perhaps those Google fuzzing CI machines don't have AVX512 support?

@nmoinvaz
Member

It doesn't look like you are doing anything controversial in that function.

@FrancescAlted
Member Author

This is what I think. Do you know if there is a way to silence the error and treat it as a false positive?

@nmoinvaz
Member

nmoinvaz commented Nov 1, 2023

Here are some of my findings:

  • I can't reproduce it on a machine that supports AVX-512. I tried x86 and x64 build on MSVC.
  • I don't know of a way to disable a particular input in the OSS-Fuzz CI.
  • set(COMPILER_SUPPORT_AVX512 TRUE) is missing for MSVC branch in CMake.
  • zlib-ng 2.1.x supports AVX-512 but the version 2.0.x you are using doesn't.

@FrancescAlted
Member Author

Here are some of my findings:

  • I can't reproduce it on a machine that supports AVX-512. I tried x86 and x64 build on MSVC.
  • I don't know of a way to disable a particular input in the OSS-Fuzz CI.
  • set(COMPILER_SUPPORT_AVX512 TRUE) is missing for MSVC branch in CMake.

Right. Done in 5cd384c. And hey, this seems to fix the OSS-Fuzz issue, although frankly I don't understand why.

Incidentally, I also activated a solution based on __builtin_cpu_supports that works well on modern GCC and Clang compilers. This looks cleaner than the one based on __cpuid/_xgetbv, which still works for MSVC and Intel compilers.
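A hedged sketch of what that cleaner path can look like (hypothetical helper name, not the exact committed code):

```c
#include <stdbool.h>

/* Detection via __builtin_cpu_supports (modern GCC and Clang).
   Hypothetical helper name.  The builtin also accounts for OS
   support of the AVX register state, so no explicit XGETBV call
   is needed here. */
static bool cpu_has_avx512(void) {
#if defined(__GNUC__) || defined(__clang__)
  return __builtin_cpu_supports("avx512f") &&
         __builtin_cpu_supports("avx512bw");
#else
  return false;  /* fall back to a __cpuid/_xgetbv path elsewhere */
#endif
}
```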

  • zlib-ng 2.1.x supports AVX-512 but the version 2.0.x you are using doesn't.

Yup, I tried to update to zlib-ng 2.1.x a while ago, but it still doesn't work. For more context, see also zlib-ng/zlib-ng#1560. I suppose the solution is close, but I have not figured it out yet.

@FrancescAlted FrancescAlted requested a review from ivilata November 1, 2023 10:33
Contributor

@ivilata ivilata left a comment


LGTM! Congrats on making this work!

@FrancescAlted
Member Author

I am going to merge this. BTW, the ALTIVEC implementation for bitshuffle needed a small API change. @kif or @t20100, please check this out on a POWER machine when you have the opportunity. Thanks!

@FrancescAlted FrancescAlted merged commit accf9ff into main Nov 1, 2023
@FrancescAlted FrancescAlted deleted the new-bitshuffle branch November 1, 2023 12:32
FrancescAlted pushed a commit to Blosc/python-blosc2 that referenced this pull request Nov 1, 2023
@FrancescAlted
Member Author

FrancescAlted commented Nov 1, 2023

I have done some intensive benchmarks using an AMD 7950X3D CPU with AVX512 support, and this acceleration seems to work best with fast compressors like LZ4 in compression mode, where up to 20% faster operation can be seen:

[plot: LZ4 compression speed vs compression level]

It seems that there is some benefit for decompression too, but not as large:

[plot: LZ4 decompression speed vs compression level]

All in all, this looks like a nice addition.

@FrancescAlted
Member Author

For completeness, here are the results for Zstd. As this is quite a bit more CPU-intensive than LZ4, the speed-up due to the bitshuffle acceleration is not as evident, but one can still see up to 15% speed-ups:

[plot: Zstd compression speed vs compression level]

As for decompression, there is almost no difference (and sometimes AVX2 seems to do a better job):

[plot: Zstd decompression speed vs compression level]

@oscarbg

oscarbg commented Nov 6, 2023

Hi @FrancescAlted ,
I'm on a 7950X and willing to test. Sorry if this is a noob question, but is there any simple way (command-line launch arguments) to reproduce your findings?

EDIT: does running b2bench use AVX512 at all?

./b2bench  
Blosc version: 2.11.1 ($Date:: 2023-11-05 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,zlib,zstd
Supported compression libraries:
  BloscLZ: 2.5.3
  LZ4: 1.9.4
  Zlib: 1.2.11.zlib-ng
  Zstd: 1.5.5
Using compressor: blosclz
Using shuffle type: shuffle
Running suite: single
--> 8, 8388608, 4, 19, blosclz, shuffle
********************** Run info ******************************
Blosc version: 2.11.1 ($Date:: 2023-11-05 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 8388608 bytes	Type size: 4 bytes
Working set: 256.0 MB		Number of threads: 8
********************** Running benchmarks *********************
memcpy(write):		 2330.8 us, 3432.3 MB/s
memcpy(read):		  273.0 us, 29299.4 MB/s
Compression level: 0
comp(write):	  316.4 us, 25281.3 MB/s	  Final bytes: 8388640  Ratio: 1.00
decomp(read):	  182.4 us, 43866.3 MB/s	  OK
Compression level: 1
comp(write):	  158.3 us, 50550.9 MB/s	  Final bytes: 2143456  Ratio: 3.91
decomp(read):	   77.5 us, 103181.5 MB/s	  OK
Compression level: 2
comp(write):	  282.7 us, 28301.3 MB/s	  Final bytes: 681248  Ratio: 12.31
decomp(read):	  112.7 us, 70966.5 MB/s	  OK
Compression level: 3
comp(write):	  291.5 us, 27441.3 MB/s	  Final bytes: 681408  Ratio: 12.31
decomp(read):	  128.0 us, 62516.1 MB/s	  OK
Compression level: 4
comp(write):	  278.0 us, 28780.0 MB/s	  Final bytes: 594880  Ratio: 14.10
decomp(read):	  111.1 us, 72017.0 MB/s	  OK
Compression level: 5
comp(write):	  264.2 us, 30275.4 MB/s	  Final bytes: 594880  Ratio: 14.10
decomp(read):	  110.4 us, 72441.1 MB/s	  OK
Compression level: 6
comp(write):	  280.8 us, 28485.8 MB/s	  Final bytes: 594880  Ratio: 14.10
decomp(read):	  110.8 us, 72176.4 MB/s	  OK
Compression level: 7
comp(write):	  197.4 us, 40519.7 MB/s	  Final bytes: 326800  Ratio: 25.67
decomp(read):	   88.7 us, 90185.7 MB/s	  OK
Compression level: 8
comp(write):	  140.7 us, 56852.8 MB/s	  Final bytes: 179712  Ratio: 46.68
decomp(read):	   74.0 us, 108113.4 MB/s	  OK
Compression level: 9
comp(write):	  256.6 us, 31172.7 MB/s	  Final bytes: 105404  Ratio: 79.59
decomp(read):	  121.1 us, 66054.6 MB/s	  OK

Round-trip compr/decompr on 2.5 GB
Elapsed time:	    0.2 s, 25091.6 MB/s

thanks..

@FrancescAlted
Member Author

Yes, b2bench is the tool that I used to produce the plots (with /~https://github.com/Blosc/c-blosc2/blob/main/bench/plot-speeds.py). However, you need to use bitshuffle to see actual accelerations with AVX512 (shuffle does not support AVX512 yet). On my 7950X3D machine:

$ bench/b2bench lz4 bitshuffle suite 32
<snip>
********************** Run info ******************************
Blosc version: 2.10.6.dev ($Date:: 2023-10-05 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 8388608 bytes	Type size: 4 bytes
Working set: 256.0 MB		Number of threads: 32
********************** Running benchmarks *********************
memcpy(write):		 2356.1 us, 3395.5 MB/s
memcpy(read):		  239.2 us, 33445.5 MB/s
Compression level: 0
comp(write):	  273.0 us, 29301.7 MB/s	  Final bytes: 8388640  Ratio: 1.00
decomp(read):	  202.1 us, 39589.4 MB/s	  OK
Compression level: 1
comp(write):	  223.5 us, 35797.6 MB/s	  Final bytes: 334368  Ratio: 25.09
decomp(read):	  164.2 us, 48725.5 MB/s	  OK
Compression level: 2
comp(write):	  142.8 us, 56020.6 MB/s	  Final bytes: 297760  Ratio: 28.17
decomp(read):	  160.9 us, 49724.9 MB/s	  OK
Compression level: 3
comp(write):	  134.2 us, 59609.9 MB/s	  Final bytes: 275936  Ratio: 30.40
decomp(read):	  170.7 us, 46861.0 MB/s	  OK
Compression level: 4
comp(write):	  156.2 us, 51218.7 MB/s	  Final bytes: 108592  Ratio: 77.25
decomp(read):	  121.7 us, 65730.7 MB/s	  OK
Compression level: 5
comp(write):	  156.3 us, 51196.3 MB/s	  Final bytes: 108592  Ratio: 77.25
decomp(read):	  134.8 us, 59337.6 MB/s	  OK
Compression level: 6
comp(write):	  191.1 us, 41860.8 MB/s	  Final bytes: 102088  Ratio: 82.17
decomp(read):	  150.8 us, 53056.4 MB/s	  OK
Compression level: 7
comp(write):	  188.0 us, 42552.3 MB/s	  Final bytes: 102088  Ratio: 82.17
decomp(read):	  156.5 us, 51131.4 MB/s	  OK
Compression level: 8
comp(write):	  188.8 us, 42378.3 MB/s	  Final bytes: 102088  Ratio: 82.17
decomp(read):	  137.3 us, 58267.1 MB/s	  OK
Compression level: 9
comp(write):	  189.9 us, 42125.4 MB/s	  Final bytes: 102088  Ratio: 82.17
decomp(read):	  167.7 us, 47711.1 MB/s	  OK

Round-trip compr/decompr on 80.0 GB
Elapsed time:	    8.7 s, 20726.4 MB/s

@t20100
Contributor

t20100 commented Nov 6, 2023

Hi,

I updated c-blosc2 to 2.11.1 in hdf5plugin: It builds and tests pass on ppc64le.

@oscarbg

oscarbg commented Nov 7, 2023

Yes, b2bench is the tool that I used to produce the plots (with /~https://github.com/Blosc/c-blosc2/blob/main/bench/plot-speeds.py). However, you need to use bitshuffle to see actual accelerations with AVX512 (shuffle does not support AVX512 yet). On my 7950X3D machine:

$ bench/b2bench lz4 bitshuffle suite 32
<snip>

thanks for the details.. my setup shows nearly identical performance..

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Nov 15, 2023
Changes from 2.11.1 to 2.11.2
=============================

* Added support for ARMv7l platforms (Raspberry Pi).  The NEON version
  of the bitshuffle filter was not compiling there, and besides, it offered
  no performance advantage over the generic bitshuffle version (it is
  actually 2x to 3x slower). So bitshuffle-neon.c has been disabled by
  default on all ARM platforms.

* Also, unaligned access has been disabled on all non-64-bit ARM platforms.
  It turned out that, at least on the armv7l CPU in the Raspberry Pi 4, the
  `__ARM_FEATURE_UNALIGNED` C macro was asserted by the compiler (both gcc
  and clang), but binaries actually raised a "Bus error".

* Thanks to Ben Nuttall for providing a Raspberry Pi for tracking down these
  issues.


Changes from 2.11.0 to 2.11.1
=============================

* Fix ALTIVEC header.  Only affects IBM POWER builds. Thanks to
  Michael Kuhn for providing a patch.


Changes from 2.10.5 to 2.11.0
=============================

* New AVX512 support for the bitshuffle filter.  This is a backport of the upstream
  bitshuffle project (/~https://github.com/kiyo-masui/bitshuffle).  Expect up to [20%
  better compression speed](Blosc/c-blosc2#567 (comment))
  on AMD Zen4 architecture (7950X3D CPU).

* Add c-blosc2 package definition for Guix.  Thanks to Ivan Vilata.

* Properly check calls to `strtol`.

* Export the `b2nd_copy_buffer` function. This may be useful for other projects
  dealing with multidimensional arrays in memory. Thanks to Ivan Vilata.

* Better check that nthreads must be >= 1 and <= INT16_MAX.

* Fix compile arguments for armv7l. Thanks to Ben Greiner.
angshine pushed a commit to angshine/python-blosc2 that referenced this pull request Nov 21, 2023