
New bitshuffle functions #567

Merged
merged 11 commits into main from new-bitshuffle on Nov 1, 2023

Conversation

FrancescAlted
Member

Here is a new implementation of bitshuffle for AVX512 coming from upstream. I also took the opportunity to sync the SSE2, AVX2 and ARM/NEON implementations (however, the latter is still much slower than the generic solution, so it is not enabled by default).

Preliminary results using AVX512 on Zen4 (AMD 7950X) point to a speed-up of 10% to 15%, which is quite good (especially as Zen4 does not have a full 512-bit register implementation).

Finally, the AVX512 code path will be compiled by default on UNIX platforms (I still need to figure out which MSVC versions support AVX512), and it will be used when AVX512 (more specifically, AVX512F and AVX512BW) support is detected at runtime.
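For reference, runtime detection along these lines can be sketched with CPUID plus XGETBV. This is a hypothetical helper under my own naming, not the actual blosc_get_cpu_features() code, and it assumes GCC or Clang on x86:

```c
#include <cpuid.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of runtime AVX512 detection (hypothetical helper).  Checks
   AVX512F and AVX512BW via CPUID leaf 7, and verifies that the OS
   saves ZMM state via XGETBV before trusting those bits. */
static bool cpu_has_avx512(void) {
  unsigned int eax, ebx, ecx, edx;
  /* CPUID leaf 1: OSXSAVE must be set before XGETBV is legal */
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !((ecx >> 27) & 1))
    return false;
  uint32_t xcr0_lo, xcr0_hi;
  __asm__("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
  (void)xcr0_hi;
  if ((xcr0_lo & 0xE6) != 0xE6)   /* XMM, YMM and ZMM state enabled */
    return false;
  /* CPUID leaf 7, subleaf 0: extended feature flags */
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    return false;
  return ((ebx >> 16) & 1)        /* AVX512F  */
      && ((ebx >> 30) & 1);       /* AVX512BW */
}
```

The XCR0 check matters because a CPU can report AVX512F/BW in CPUID while the OS has not enabled ZMM state saving, in which case AVX512 instructions would fault.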

Although it is slower than our existing code, and of course,
slower than the generic code (at least on an ARM M1), at least
this is completely correct, even for typesizes 1, 2 and 4 (the
previous code was not).

Also, syncing sources with the bitshuffle project makes things
easier when porting docs and fixes from there.
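For readers unfamiliar with the filter, here is a naive reference sketch of what bitshuffle computes (a hypothetical helper for illustration, not the synced upstream code): it transposes the bit matrix of the data so that bit plane b of every element becomes contiguous in the output.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive reference bitshuffle (hypothetical helper; the real SIMD
   kernels are far more involved).  For data viewed as size/typesize
   little-endian elements, it gathers bit plane b of every element
   into one contiguous run of output bits.  Requires size to be a
   multiple of typesize and the element count a multiple of 8. */
static void bitshuffle_ref(const uint8_t *src, uint8_t *dst,
                           size_t size, size_t typesize) {
  size_t nelem = size / typesize;
  size_t nbits = typesize * 8;              /* bit planes per element */
  for (size_t b = 0; b < nbits; b++) {
    for (size_t i = 0; i < nelem; i++) {
      /* bit b of element i */
      uint8_t bit = (src[i * typesize + b / 8] >> (b % 8)) & 1;
      size_t obit = b * nelem + i;          /* destination bit index */
      if (bit)
        dst[obit / 8] |= (uint8_t)(1u << (obit % 8));
      else
        dst[obit / 8] &= (uint8_t)~(1u << (obit % 8));
    }
  }
}
```

After this transform, elements whose values differ only in a few low bits produce long runs of identical bytes, which is why fast codecs like LZ4 compress bitshuffled data so well.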
ivilata added a commit to ivilata/c-blosc2 that referenced this pull request Oct 27, 2023
ivilata added a commit to ivilata/python-blosc2 that referenced this pull request Oct 29, 2023
FrancescAlted pushed a commit that referenced this pull request Oct 30, 2023
@FrancescAlted
Member Author

This should be mostly ready, but for some reason fuzzing is failing in CI. I have tried to reproduce that on my Ubuntu box, but I cannot. Also, valgrind seems fine for the dataset causing the issue for compress_chunk_fuzzer. I am curious whether this is a genuine problem, or whether we should ignore it. @nmoinvaz can you shed some light on this one? If you cannot reproduce it either, can you suggest a workaround to avoid the failure in CI?

@nmoinvaz
Member

nmoinvaz commented Oct 30, 2023

The log shows that the input causing the crash is base64 HR2dnZ3///////////////89////////////EyvuU/////8=. You should decode it and feed the decoded file into compress_chunk_fuzzer. If compress_chunk_fuzzer reproduces the crash, it is a real issue. It is the most basic check.

@FrancescAlted
Member Author

Yes, I did exactly that, but I am unable to reproduce the issue:

$ base64 -d > input.fuz
HR2dnZ3///////////////89////////////EyvuU/////8=
$ build/tests/fuzz/compress_chunk_fuzzer input.fuz
Running 1 inputs
Running: build/tests/fuzz/compress_chunk_fuzzer input.fuz
Done:    input.fuz: (35 bytes)
$ echo $?
0

Am I missing something?

@nmoinvaz
Member

I haven't tried it, but I can later. Interesting that it fails in blosc_get_cpu_features. Perhaps those Google fuzzing CI machines don't have AVX512 support?

@nmoinvaz
Member

It doesn't look like you are doing anything controversial in that function.

@FrancescAlted
Member Author

This is what I think. Do you know if there is a way to silence the error and treat it as a false positive?

@nmoinvaz
Member

nmoinvaz commented Nov 1, 2023

Here are some of my findings:

  • I can't reproduce it on a machine that supports AVX-512. I tried x86 and x64 build on MSVC.
  • I don't know of a way to disable a particular input in the OSS-Fuzz CI.
  • set(COMPILER_SUPPORT_AVX512 TRUE) is missing for MSVC branch in CMake.
  • zlib-ng 2.1.x supports AVX-512 but the version 2.0.x you are using doesn't.

@FrancescAlted
Member Author

Here are some of my findings:

  • I can't reproduce it on a machine that supports AVX-512. I tried x86 and x64 build on MSVC.
  • I don't know of a way to disable a particular input in the OSS-Fuzz CI.
  • set(COMPILER_SUPPORT_AVX512 TRUE) is missing for MSVC branch in CMake.

Right. Done in 5cd384c. And hey, this seems to fix the OSS-Fuzz issue, although frankly I don't understand why.

Incidentally, I also activated a solution based on __builtin_cpu_supports that works well on modern GCC and Clang compilers. This looks cleaner than the one based on __cpuid/_xgetbv, which still works for MSVC and Intel compilers.
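A hedged sketch of what that cleaner path can look like (hypothetical helper name, not the exact committed code):

```c
#include <stdbool.h>

/* Detection via __builtin_cpu_supports (modern GCC and Clang).
   Hypothetical helper name.  The builtin also accounts for OS
   support of the AVX register state, so no explicit XGETBV call
   is needed here. */
static bool cpu_has_avx512(void) {
#if defined(__GNUC__) || defined(__clang__)
  return __builtin_cpu_supports("avx512f") &&
         __builtin_cpu_supports("avx512bw");
#else
  return false;  /* fall back to a __cpuid/_xgetbv path elsewhere */
#endif
}
```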

  • zlib-ng 2.1.x supports AVX-512 but the version 2.0.x you are using doesn't.

Yup, I tried to update to zlib-ng 2.1.x a while ago, but it still doesn't work. For more context, see also zlib-ng/zlib-ng#1560. I suppose the solution is close, but I have not figured it out yet.

@FrancescAlted FrancescAlted requested a review from ivilata November 1, 2023 10:33
Contributor

@ivilata ivilata left a comment


LGTM! Congrats on making this work!

@FrancescAlted
Member Author

I am going to merge this. BTW, the ALTIVEC implementation for bitshuffle needed a small API change. @kif or @t20100, please check this out on a POWER machine when you have the opportunity. Thanks!

@FrancescAlted FrancescAlted merged commit accf9ff into main Nov 1, 2023
@FrancescAlted FrancescAlted deleted the new-bitshuffle branch November 1, 2023 12:32
FrancescAlted pushed a commit to Blosc/python-blosc2 that referenced this pull request Nov 1, 2023
@FrancescAlted
Member Author

FrancescAlted commented Nov 1, 2023

I have done some intensive benchmarks using an AMD 7950X3D CPU with AVX512 support, and this acceleration seems to work best with fast compressors like LZ4 in compression mode, where up to 20% faster operation can be seen:

[plot: LZ4 compression speed vs compression level]

It seems that there is some benefit for decompression too, but not as large:

[plot: LZ4 decompression speed vs compression level]

All in all, this looks like a nice addition.

@FrancescAlted
Member Author

For completeness, here are the results for Zstd. As this is quite a bit more CPU-intensive than LZ4, the speed-up due to the bitshuffle acceleration is not as evident, but one can still see up to 15% speed-ups:

[plot: Zstd compression speed vs compression level]

As for decompression, there is almost no difference (and sometimes AVX2 seems to do a better job):

[plot: Zstd decompression speed vs compression level]

@oscarbg

oscarbg commented Nov 6, 2023

Hi @FrancescAlted ,
I'm on a 7950X and willing to test. Sorry if this is a noob question, but is there any simple way (command-line launch arguments) to reproduce your findings?

EDIT: does running b2bench use AVX512 at all?

./b2bench  
Blosc version: 2.11.1 ($Date:: 2023-11-05 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,zlib,zstd
Supported compression libraries:
  BloscLZ: 2.5.3
  LZ4: 1.9.4
  Zlib: 1.2.11.zlib-ng
  Zstd: 1.5.5
Using compressor: blosclz
Using shuffle type: shuffle
Running suite: single
--> 8, 8388608, 4, 19, blosclz, shuffle
********************** Run info ******************************
Blosc version: 2.11.1 ($Date:: 2023-11-05 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 8388608 bytes	Type size: 4 bytes
Working set: 256.0 MB		Number of threads: 8
********************** Running benchmarks *********************
memcpy(write):		 2330.8 us, 3432.3 MB/s
memcpy(read):		  273.0 us, 29299.4 MB/s
Compression level: 0
comp(write):	  316.4 us, 25281.3 MB/s	  Final bytes: 8388640  Ratio: 1.00
decomp(read):	  182.4 us, 43866.3 MB/s	  OK
Compression level: 1
comp(write):	  158.3 us, 50550.9 MB/s	  Final bytes: 2143456  Ratio: 3.91
decomp(read):	   77.5 us, 103181.5 MB/s	  OK
Compression level: 2
comp(write):	  282.7 us, 28301.3 MB/s	  Final bytes: 681248  Ratio: 12.31
decomp(read):	  112.7 us, 70966.5 MB/s	  OK
Compression level: 3
comp(write):	  291.5 us, 27441.3 MB/s	  Final bytes: 681408  Ratio: 12.31
decomp(read):	  128.0 us, 62516.1 MB/s	  OK
Compression level: 4
comp(write):	  278.0 us, 28780.0 MB/s	  Final bytes: 594880  Ratio: 14.10
decomp(read):	  111.1 us, 72017.0 MB/s	  OK
Compression level: 5
comp(write):	  264.2 us, 30275.4 MB/s	  Final bytes: 594880  Ratio: 14.10
decomp(read):	  110.4 us, 72441.1 MB/s	  OK
Compression level: 6
comp(write):	  280.8 us, 28485.8 MB/s	  Final bytes: 594880  Ratio: 14.10
decomp(read):	  110.8 us, 72176.4 MB/s	  OK
Compression level: 7
comp(write):	  197.4 us, 40519.7 MB/s	  Final bytes: 326800  Ratio: 25.67
decomp(read):	   88.7 us, 90185.7 MB/s	  OK
Compression level: 8
comp(write):	  140.7 us, 56852.8 MB/s	  Final bytes: 179712  Ratio: 46.68
decomp(read):	   74.0 us, 108113.4 MB/s	  OK
Compression level: 9
comp(write):	  256.6 us, 31172.7 MB/s	  Final bytes: 105404  Ratio: 79.59
decomp(read):	  121.1 us, 66054.6 MB/s	  OK

Round-trip compr/decompr on 2.5 GB
Elapsed time:	    0.2 s, 25091.6 MB/s

thanks..

@FrancescAlted
Member Author

Yes, b2bench is the tool that I used to produce the plots (with /~https://github.com/Blosc/c-blosc2/blob/main/bench/plot-speeds.py). However, you need to use bitshuffle to see actual accelerations with AVX512 (shuffle does not support AVX512 yet). On my 7950X3D machine:

$ bench/b2bench lz4 bitshuffle suite 32
<snip>
********************** Run info ******************************
Blosc version: 2.10.6.dev ($Date:: 2023-10-05 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 8388608 bytes	Type size: 4 bytes
Working set: 256.0 MB		Number of threads: 32
********************** Running benchmarks *********************
memcpy(write):		 2356.1 us, 3395.5 MB/s
memcpy(read):		  239.2 us, 33445.5 MB/s
Compression level: 0
comp(write):	  273.0 us, 29301.7 MB/s	  Final bytes: 8388640  Ratio: 1.00
decomp(read):	  202.1 us, 39589.4 MB/s	  OK
Compression level: 1
comp(write):	  223.5 us, 35797.6 MB/s	  Final bytes: 334368  Ratio: 25.09
decomp(read):	  164.2 us, 48725.5 MB/s	  OK
Compression level: 2
comp(write):	  142.8 us, 56020.6 MB/s	  Final bytes: 297760  Ratio: 28.17
decomp(read):	  160.9 us, 49724.9 MB/s	  OK
Compression level: 3
comp(write):	  134.2 us, 59609.9 MB/s	  Final bytes: 275936  Ratio: 30.40
decomp(read):	  170.7 us, 46861.0 MB/s	  OK
Compression level: 4
comp(write):	  156.2 us, 51218.7 MB/s	  Final bytes: 108592  Ratio: 77.25
decomp(read):	  121.7 us, 65730.7 MB/s	  OK
Compression level: 5
comp(write):	  156.3 us, 51196.3 MB/s	  Final bytes: 108592  Ratio: 77.25
decomp(read):	  134.8 us, 59337.6 MB/s	  OK
Compression level: 6
comp(write):	  191.1 us, 41860.8 MB/s	  Final bytes: 102088  Ratio: 82.17
decomp(read):	  150.8 us, 53056.4 MB/s	  OK
Compression level: 7
comp(write):	  188.0 us, 42552.3 MB/s	  Final bytes: 102088  Ratio: 82.17
decomp(read):	  156.5 us, 51131.4 MB/s	  OK
Compression level: 8
comp(write):	  188.8 us, 42378.3 MB/s	  Final bytes: 102088  Ratio: 82.17
decomp(read):	  137.3 us, 58267.1 MB/s	  OK
Compression level: 9
comp(write):	  189.9 us, 42125.4 MB/s	  Final bytes: 102088  Ratio: 82.17
decomp(read):	  167.7 us, 47711.1 MB/s	  OK

Round-trip compr/decompr on 80.0 GB
Elapsed time:	    8.7 s, 20726.4 MB/s

@t20100
Contributor

t20100 commented Nov 6, 2023

Hi,

I updated c-blosc2 to 2.11.1 in hdf5plugin: It builds and tests pass on ppc64le.

@oscarbg

oscarbg commented Nov 7, 2023

Yes, b2bench is the tool that I used to produce the plots (with /~https://github.com/Blosc/c-blosc2/blob/main/bench/plot-speeds.py). However, you need to use bitshuffle to see actual accelerations with AVX512 (shuffle does not support AVX512 yet). On my 7950X3D machine:

$ bench/b2bench lz4 bitshuffle suite 32
<snip>

thanks for the details.. my setup shows nearly identical performance..

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Nov 15, 2023
Changes from 2.11.1 to 2.11.2
=============================

* Added support for ARMv7l platforms (Raspberry Pi).  The NEON version
  of the bitshuffle filter was not compiling there, and besides, it offered
  no performance advantage over the generic bitshuffle version (it is
  actually 2x to 3x slower). So bitshuffle-neon.c has been disabled by
  default on all ARM platforms.

* Also, unaligned access has been disabled on all non-64-bit ARM platforms.
  It turned out that, at least on the armv7l CPU in the Raspberry Pi 4, the
  `__ARM_FEATURE_UNALIGNED` C macro was asserted by the compiler (both gcc
  and clang), but binaries actually raised a "Bus error".

* Thanks to Ben Nuttall for providing a Raspberry Pi for tracking down these
  issues.


Changes from 2.11.0 to 2.11.1
=============================

* Fix ALTIVEC header.  Only affects IBM POWER builds. Thanks to
  Michael Kuhn for providing a patch.


Changes from 2.10.5 to 2.11.0
=============================

* New AVX512 support for the bitshuffle filter.  This is a backport of the upstream
  bitshuffle project (/~https://github.com/kiyo-masui/bitshuffle).  Expect up to [20%
  better compression speed](Blosc/c-blosc2#567 (comment))
  on AMD Zen4 architecture (7950X3D CPU).

* Add c-blosc2 package definition for Guix.  Thanks to Ivan Vilata.

* Properly check calls to `strtol`.

* Export the `b2nd_copy_buffer` function. This may be useful for other projects
  dealing with multidimensional arrays in memory. Thanks to Ivan Vilata.

* Better check that nthreads must be >= 1 and <= INT16_MAX.

* Fix compile arguments for armv7l. Thanks to Ben Greiner.
angshine pushed a commit to angshine/python-blosc2 that referenced this pull request Nov 21, 2023