Partially reading NDArray slow via NFS share #188

Closed
FabianIsensee opened this issue Apr 21, 2024 · 6 comments · Fixed by #198

Comments

@FabianIsensee

Hey there,

first and foremost I would like to express my gratitude for this amazing work! It's super fast + addresses exactly the pain points that have been plaguing us for a long time.

We are currently exploring using blosc2 as a new interface for storing and loading preprocessed data for training AI-based segmentation models. Our nnU-Net repository is quite popular and optimizing its pipeline would benefit an entire community. Especially now that datasets are getting larger, the ability to efficiently store and partially read files is becoming increasingly valuable.

Let me briefly explain what it is we are doing. During training, we need to read the training data over and over again. Due to the large images that are typical for our domain (around 500x500x500 pixels, sometimes 20000x20000 or 1000x1000x1000), we need to take smaller crops that the GPU can actually handle. These crops are usually around 192x192x192 or smaller. To maximize the variability of the data we present, we do not compute these crops offline but rather partially read them from the larger images on the fly while training runs.

Currently, this is implemented with numpy memmaps. Preprocessed images are stored on some fast file system (either local SSDs or fast network shares). We use uncompressed numpy arrays (np.save) that we open with np.load([...], mmap_mode='r'). This is fast (provided I/O keeps up) and uses basically no CPU at all. But it requires a lot of storage and ... well ... I/O no longer keeps up.
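
For illustration, a minimal sketch of this memmap-based loading (the file name and crop origin are placeholders, not our actual code):

import numpy as np

# np.load with mmap_mode='r' maps the .npy file instead of reading it,
# so only the pages touched by the crop are actually fetched from disk.
data = np.load("case_0001.npy", mmap_mode='r')   # e.g. shape (1, 690, 563, 563)
z, y, x = 100, 200, 150                          # hypothetical crop origin
crop = np.array(data[:, z:z + 192, y:y + 192, x:x + 192])  # copying triggers the actual reads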

The blosc2 NDArray offers exactly what we need to alleviate our problems: fast, partial reading of compressed arrays. An initial prototype I implemented does exactly that: CPU usage barely increases over the numpy memmap baseline, the storage requirement is drastically reduced (compression ratio ~3-4), and I/O as reported by iotop is down as well. At least when running the code on my local workstation.

On our compute cluster, when using the local SSDs of the compute nodes, blosc2 performance is as expected. The problem lies in the NFS share: if the data is located there, performance tanks. I have a hard time pinpointing what the problem is, especially because reading memmaps with higher I/O requirements runs smoothly.
When using blosc2, CPUs are barely used and almost all processes report that they are waiting for I/O. But when checking the network activity I see that we are only reading ~15MB/s and (curiously) also transmitting ~15MB/s? These numbers are well below what our NFS file shares are capable of (>1.5GB/s). Do you have any idea why this could be?
Could it be that blosc2 sends a large number of small read requests? But even if each chunk were read individually, that should still be fewer reads than we currently need with numpy memmaps, since a crop touches fewer chunks than there are contiguous rows to extract from an uncompressed array. Is there something I am not seeing?
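
As a rough sanity check of that argument, here is a back-of-the-envelope estimate using the chunk shape from the array info further below (alignment is treated pessimistically for blosc2):

import math

# Worst-case number of read requests for a single, unaligned 192^3 crop.
crop = (192, 192, 192)
chunk = (80, 64, 64)                 # spatial part of chunks=(1, 80, 64, 64)

# blosc2: at most one read per chunk the crop overlaps
chunks_touched = math.prod(math.ceil(c / s) + 1 for c, s in zip(crop, chunk))
# uncompressed memmap: one contiguous run along the last axis per (z, y) pair
rows_touched = crop[0] * crop[1]

print(chunks_touched, rows_touched)  # ~64 chunk reads vs 36864 contiguous rows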

Here is some additional information in the hope that it helps narrow down the problem:

This is how we save arrays:

cparams = {
    'codec': blosc2.Codec.ZSTD,
    'clevel': 8,
    'nthreads': 8
}
blosc2.asarray(np.ascontiguousarray(data), urlpath=output_filename_truncated + '.b2nd',
               chunks=chunks, blocks=blocks, cparams=cparams)

This is how we open the files to partially read/uncompress the arrays:

data = blosc2.open(urlpath=data_b2nd_file, mode='r', dparams={'nthreads': 1})
crop = data[crop_slice]

Info for a representative training image

type    : NDArray
shape   : (1, 690, 563, 563)
chunks  : (1, 80, 64, 64)
blocks  : (1, 16, 16, 16)
dtype   : float32
cratio  : 4.83
cparams : {'blocksize': 16384,
 'clevel': 8,
 'codec': <Codec.ZSTD: 5>,
 'codec_meta': 0,
 'filters': [<Filter.NOFILTER: 0>,
             <Filter.NOFILTER: 0>,
             <Filter.NOFILTER: 0>,
             <Filter.NOFILTER: 0>,
             <Filter.NOFILTER: 0>,
             <Filter.SHUFFLE: 1>],
 'filters_meta': [0, 0, 0, 0, 0, 0],
 'nthreads': 8,
 'splitmode': <SplitMode.ALWAYS_SPLIT: 1>,
 'typesize': 4,
 'use_dict': 0}
dparams : {'nthreads': 1}

When using numpy arrays, I observed that the OS is caching read operations. For smaller datasets, this substantially reduces the amount of reads over time, reducing the I/O burden. I have not observed read caching with blosc2. Is there a specific reason for that?

Thank you for your support and for providing this wonderful package! If there is anything else you need, please let me know!
Best,
Fabian

@JanSellner
Contributor

JanSellner commented Apr 22, 2024

We did some further testing and gained some additional insights into this problem. We compared file reading of NumPy's memmap with blosc2 and observed the file accesses with inotify.

Preparation:

import blosc2
import numpy as np
a = np.random.randn(100, 150)

# Write Numpy file
fp = np.memmap("test.mmap", dtype='float32', mode='w+', shape=a.shape)
fp[:] = a[:]
fp.flush()

# Write Blosc file
blosc2.asarray(a, chunks=(1, 150), urlpath="test.b2nd", mode="w")

Reading the file with numpy:

newfp = np.memmap("test.mmap", dtype='float32', mode='r', shape=(100, 150))
newfp[0, :]

inotifywait -m --timefmt '%H:%M' --format '%T %w %e %f' test.mmap yields:

Setting up watches.
Watches established.
test.mmap OPEN
test.mmap CLOSE_NOWRITE,CLOSE

Reading the file with blosc2:

array2 = blosc2.open("test.b2nd")

array2.info
type    : NDArray
shape   : (100, 150)
chunks  : (1, 150)
blocks  : (1, 150)
dtype   : float32
cratio  : 1.02
cparams : {'blocksize': 600,
 'clevel': 1,
 'codec': <Codec.ZSTD: 5>,
 'codec_meta': 0,
 'filters': [<Filter.NOFILTER: 0>,
             <Filter.NOFILTER: 0>,
             <Filter.NOFILTER: 0>,
             <Filter.NOFILTER: 0>,
             <Filter.NOFILTER: 0>,
             <Filter.SHUFFLE: 1>],
 'filters_meta': [0, 0, 0, 0, 0, 0],
 'nthreads': 16,
 'splitmode': <SplitMode.ALWAYS_SPLIT: 1>,
 'typesize': 4,
 'use_dict': 0}
dparams : {'nthreads': 16}

array2[0, :]

inotifywait -m --timefmt '%H:%M' --format '%T %w %e %f' test.b2nd yields:

Setting up watches.
Watches established.
test.b2nd OPEN
test.b2nd ACCESS
test.b2nd CLOSE_NOWRITE,CLOSE
test.b2nd OPEN
test.b2nd ACCESS
test.b2nd CLOSE_NOWRITE,CLOSE
test.b2nd OPEN
test.b2nd ACCESS
test.b2nd CLOSE_NOWRITE,CLOSE
test.b2nd OPEN
test.b2nd ACCESS
test.b2nd CLOSE_NOWRITE,CLOSE
test.b2nd OPEN
test.b2nd ACCESS
test.b2nd CLOSE_NOWRITE,CLOSE
test.b2nd OPEN
test.b2nd ACCESS
test.b2nd CLOSE_NOWRITE,CLOSE
test.b2nd OPEN
test.b2nd ACCESS
test.b2nd CLOSE_NOWRITE,CLOSE
test.b2nd OPEN
test.b2nd ACCESS
test.b2nd CLOSE_NOWRITE,CLOSE
test.b2nd OPEN
test.b2nd ACCESS
test.b2nd CLOSE_NOWRITE,CLOSE
test.b2nd OPEN
test.b2nd ACCESS
test.b2nd CLOSE_NOWRITE,CLOSE

Even though inotify obviously doesn't catch accesses to memory-mapped files, there are many more open/close file events in the case of blosc2. This could explain the bottleneck, since small file accesses can be very slow with NFS (or any remote storage file system) due to the overhead (including latency) involved with every request.

Since the blosc2 NDArray is targeted at fast partial file accesses, I think it would be very valuable if blosc2 used memory-mapped files internally instead of opening the file over and over. Memory-mapped files have the additional advantage that users can indicate in advance which pages will be needed in the future (MADV_WILLNEED). In the case of blosc2, this could be translated into an intention to load certain chunks. What do you think?
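
To illustrate the idea, here is a minimal sketch using Python's mmap module (not the blosc2 API; Linux-specific because of MADV_WILLNEED; the file name and byte range are placeholders):

import mmap
import os

# Map the file read-only and hint that a byte range -- e.g. the chunks
# belonging to the next crop -- will be needed soon, so the kernel can
# prefetch it instead of faulting pages in one by one.
fd = os.open("test.b2nd", os.O_RDONLY)
size = os.fstat(fd).st_size
with mmap.mmap(fd, size, prot=mmap.PROT_READ) as mm:
    mm.madvise(mmap.MADV_WILLNEED, 0, size)  # offset must be page-aligned
    header = mm[:512]                        # later reads are served from the page cache
os.close(fd)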

A workaround on the user side would be to open the file memory-mapped in the first place via Python's mmap module. However, this requires passing a file handle to blosc2.open instead of a string, which currently does not seem to be supported. Is there perhaps any other loading function which can operate on a file handle?

@FrancescAlted
Member

FrancescAlted commented Apr 22, 2024

Nice investigation work. Actually, the C library does have a plugin mechanism for registering user-defined I/O functions: /~https://github.com/Blosc/c-blosc2/blob/main/include/blosc2.h#L1096 . You can see an example of how this works at: /~https://github.com/Blosc/c-blosc2/blob/main/tests/test_udio.c#L102 . You could implement your own udio on top of mmap and give it a try.

Unfortunately, there are no wrappers for registering the udio mechanism in python-blosc2 yet, but if the C experiment works well for you, adding the registration at the Python level should not be hard.

@JanSellner
Contributor

@FrancescAlted Thank you very much for your support, the API hints were really helpful :-)

We could reproduce the issue in our NFS infrastructure and can also report that mmap does indeed solve the problem. Here are the benchmarking results where we compared default blosc2 (Benchmark 1) with blosc2 mmap (Benchmark 2) and numpy memmap (Benchmark 3):

Benchmark 1: build/blosc_mmap default test_files_nfs/test_real0.b2nd
  Time (mean ± σ):      1.522 s ±  0.082 s    [User: 0.027 s, System: 0.091 s]
  Range (min … max):    1.453 s …  1.705 s    10 runs

Benchmark 2: build/blosc_mmap mmap test_files_nfs/test_real0.b2nd
  Time (mean ± σ):      31.9 ms ±   1.1 ms    [User: 22.7 ms, System: 7.9 ms]
  Range (min … max):    30.0 ms …  33.8 ms    97 runs

Benchmark 3: python main.py test_files_nfs/test_real0.npy
  Time (mean ± σ):     171.9 ms ±   2.6 ms    [User: 128.5 ms, System: 39.6 ms]
  Range (min … max):   169.0 ms … 178.6 ms    17 runs

Summary
  'build/blosc_mmap mmap test_files_nfs/test_real0.b2nd' ran
    5.38 ± 0.20 times faster than 'python main.py test_files_nfs/test_real0.npy'
   47.67 ± 3.04 times faster than 'build/blosc_mmap default test_files_nfs/test_real0.b2nd'

The times are for reading one crop slice (128, 128, 128) out of one image (480, 512, 512). For some reason, the many file accesses/openings/closings of default blosc2 are extremely harmful in the NFS environment.

You can find our full benchmarking code and our udio prototype here: /~https://github.com/JanSellner/blosc2_nfs_issue. Some remarks on our implementation:

udio interface

Unfortunately, the udio mechanism, even though extremely helpful, does not perfectly fit the mmap approach. This can be seen in the read function (a similar problem exists for the write method):

int64_t test_read(void *ptr, int64_t size, int64_t nitems, void *stream) {
    memcpy(ptr, mmap_file.addr + mmap_file.offset, size * nitems);
    return nitems;
}

The memcpy here is actually not necessary because with mmap it would be possible to use the address mmap_file.addr + mmap_file.offset directly instead of copying the data into the address specified by the pointer ptr. Ideally, there would be an interface which allows just returning a pointer. Especially given that blosc2 is advertised as being faster than memcpy, it would be nice to have an implementation which does not require unnecessary copies of the compressed data ;-)

MADV_WILLNEED

I also experimented a bit with madvise but could not observe that MADV_WILLNEED actually has any effect. So, I am not sure whether preloading the data as soon as possible would be beneficial (in theory, it should help to interleave the data loading and decompression steps). But this would also be more complicated to implement. Hence, it is currently not included.

Local file loading

The mmap approach is not only helpful in NFS environments but can also speed up blosc2 when reading files locally (here are results where the data is already in the OS cache):

Benchmark 1: build/blosc_mmap default test_files/test_real0.b2nd
  Time (mean ± σ):      19.3 ms ±   0.8 ms    [User: 6.5 ms, System: 12.7 ms]
  Range (min … max):    17.9 ms …  22.1 ms    149 runs

Benchmark 2: build/blosc_mmap mmap test_files/test_real0.b2nd
  Time (mean ± σ):       8.4 ms ±   0.7 ms    [User: 6.2 ms, System: 2.2 ms]
  Range (min … max):     7.5 ms …  10.5 ms    319 runs

Benchmark 3: python main.py test_files/test_real0.npy
  Time (mean ± σ):      58.9 ms ±   5.6 ms    [User: 41.7 ms, System: 17.1 ms]
  Range (min … max):    53.9 ms …  84.6 ms    50 runs

Summary
  'build/blosc_mmap mmap test_files/test_real0.b2nd' ran
    2.28 ± 0.20 times faster than 'build/blosc_mmap default test_files/test_real0.b2nd'
    6.97 ± 0.85 times faster than 'python main.py test_files/test_real0.npy'

Next steps: Given the results, I think it would be super cool and beneficial for many users to have built-in support for mmap in blosc2, ideally with an API adjustment to the read/write methods so that unnecessary calls to memcpy can be omitted. If you are interested, I could assist with a PR. What do you think?

@FrancescAlted
Member

Great work! This clearly demonstrates the benefits of using the mmap interface for doing I/O; it was completely unexpected to me that this would accelerate I/O even on local filesystems, so I'm definitely +1 on a PR for supporting mmap. Also, I don't think there would be anybody using the udio interface other than C-Blosc2 itself, so +1 if you want to change the signature of the read and write functions.

Do you think that you would be able to provide support for mmap on Windows and Mac too, or would these require different code paths? At any rate, happy to discuss the details in a PR.

@FabianIsensee
Author

With the merge of Blosc/c-blosc2#604 we now have mmap support in the c-blosc2 interface. That's awesome! Thank you so much @FrancescAlted and @JanSellner!!!
It would be great to be able to use that from Python as well. Are you planning on exposing that functionality in the Python bindings in the near future?

@FabianIsensee
Author

Thanks a lot, @FrancescAlted & @JanSellner!
Do you have a roadmap for the next release so that we can use this cool feature? If we want to integrate it, for example into nnU-Net, we need it to be installable via PyPI.
Best,
Fabian
