Partially reading NDArray slow via NFS share #188
We did some further testing and have some additional insights for this problem. We compared file reading of NumPy's memmap with blosc2 and observed the resulting file accesses.

Preparation:

```python
import blosc2
import numpy as np

# float32 data, matching the dtype reported in the info output below
a = np.random.randn(100, 150).astype(np.float32)

# Write the NumPy file
fp = np.memmap("test.mmap", dtype='float32', mode='w+', shape=a.shape)
fp[:] = a[:]
fp.flush()

# Write the Blosc file
blosc2.asarray(a, chunks=(1, 150), urlpath="test.b2nd", mode="w")
```

Reading the file with numpy:

```python
newfp = np.memmap("test.mmap", dtype='float32', mode='r', shape=(100, 150))
newfp[0, :]
```

Reading the file with blosc2:

```python
array2 = blosc2.open("test.b2nd")
array2.info
```

```
type    : NDArray
shape   : (100, 150)
chunks  : (1, 150)
blocks  : (1, 150)
dtype   : float32
cratio  : 1.02
cparams : {'blocksize': 600,
           'clevel': 1,
           'codec': <Codec.ZSTD: 5>,
           'codec_meta': 0,
           'filters': [<Filter.NOFILTER: 0>,
                       <Filter.NOFILTER: 0>,
                       <Filter.NOFILTER: 0>,
                       <Filter.NOFILTER: 0>,
                       <Filter.NOFILTER: 0>,
                       <Filter.SHUFFLE: 1>],
           'filters_meta': [0, 0, 0, 0, 0, 0],
           'nthreads': 16,
           'splitmode': <SplitMode.ALWAYS_SPLIT: 1>,
           'typesize': 4,
           'use_dict': 0}
dparams : {'nthreads': 16}
```

```python
array2[0, :]
```
Since blosc2 NDArray is targeted at fast partial file accesses, I think it would be very valuable if blosc2 used memory-mapped files internally instead of multiple file openings. Memory-mapped files have the additional advantage that users can indicate in advance which pages will be needed in the future. A workaround on the user side would be to open the file memory-mapped in the first place via Python's `mmap` module.
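To illustrate the workaround, here is a minimal sketch. Whether `blosc2.ndarray_from_cframe` accepts an `mmap` object directly, and whether `copy=False` really avoids duplicating the mapped buffer, are assumptions on my side:

```python
import mmap

import blosc2

# Map the serialized array (cframe) into memory once; page faults then
# replace the many small read() calls on the file.
with open("test.b2nd", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Assumption: any buffer-protocol object is accepted here.
arr = blosc2.ndarray_from_cframe(mm, copy=False)
row = arr[0, :]  # decompresses only the chunks touched by the slice
```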
Nice investigation work. Actually, the C library does have a plugin mechanism for registering user-defined I/O functions: /~https://github.com/Blosc/c-blosc2/blob/main/include/blosc2.h#L1096. You can see an example of how this works at: /~https://github.com/Blosc/c-blosc2/blob/main/tests/test_udio.c#L102. You could implement your own udio on top of mmap and try. Unfortunately, there are no wrappers for registering the udio mechanism in python-blosc2 yet, but if the C experiment works well for you, adding the register at the Python level should not be hard.
@FrancescAlted Thank you very much for your support, the API hints were really helpful :-) We could reproduce the issue in our NFS infrastructure and measured read times for an mmap-based udio prototype against the default I/O. The times are for reading one crop slice (128, 128, 128) out of one image (480, 512, 512). For some reason, the many file accesses/openings/closings of default blosc2 are extremely harmful in the NFS environment. You can find our full benchmarking code and our udio prototype here: /~https://github.com/JanSellner/blosc2_nfs_issue.

Some remarks on our implementation:

udio interface

Unfortunately, the udio mechanism, even though extremely helpful, does not perfectly fit the mmap approach. For example, the read callback has to copy the data from the mapped region into the caller's buffer:

```c
int64_t test_read(void *ptr, int64_t size, int64_t nitems, void *stream) {
  memcpy(ptr, mmap_file.addr + mmap_file.offset, size * nitems);
  return nitems;
}
```

The `memcpy` is forced by the interface even though, with mmap, the requested bytes are already addressable in memory.
Great work! This clearly demonstrates the benefits of using the mmap interface for doing I/O; it was completely unexpected to me that this would even accelerate the I/O with local filesystems, so I'm definitely +1 on a PR for supporting mmap. Also, I don't think there would be anybody using the udio mechanism yet.

Do you think that you would be able to provide support for mmap for Win and Mac too? Or would these require different code paths? At any rate, happy to discuss the details on a PR.
With the merge of Blosc/c-blosc2#604 we now have mmap support in the c-blosc2 interface. That's awesome! Thank you so much @FrancescAlted and @JanSellner !!!
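Once this is exposed in python-blosc2, I imagine usage roughly like this (the `mmap_mode` keyword is an assumption mirroring NumPy's naming, not a released interface):

```python
import blosc2

# Hypothetical Python-level usage of the new mmap support, assuming it
# is exposed as an mmap_mode argument like in numpy.memmap / np.load:
arr = blosc2.open("test.b2nd", mmap_mode="r")
crop = arr[0:64, 0:64]  # chunks are read through the memory mapping
```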
Thanks a lot @FrancescAlted & @JanSellner !
Hey there,
first and foremost I would like to express my gratitude for this amazing work! It's super fast + addresses exactly the pain points that have been plaguing us for a long time.
We are currently exploring using blosc2 as a new interface for storing and loading preprocessed data for training AI-based segmentation models. Our nnU-Net repository is quite popular and optimizing its pipeline would benefit an entire community. Especially now that datasets are getting larger, the ability to efficiently store and partially read files is becoming increasingly valuable.
Let me briefly explain what it is we are doing. During training, we need to read the training data over and over again. Due to the large images that are typical for our domain (around 500x500x500 pixels, sometimes 20000x20000 or 1000x1000x1000), we need to take smaller crops that the GPU can actually handle. These crops are usually around 192x192x192 pixels or smaller. To maximize the variability of the data we are presenting, we do not compute these crops offline, but rather read them partially from the larger images on the fly while the training runs.
Currently, this is implemented with numpy memmaps. Preprocessed images are stored on some fast file system (either local SSDs or fast network shares). We use uncompressed numpy arrays (np.save) that we open with np.load([...], mmap_mode='r'). This is fast (provided I/O keeps up) and uses basically no CPU at all. But it requires a lot of storage and ... well ... I/O no longer keeps up.

The blosc2 NDArray offers exactly what we need to alleviate our problems: fast, partial reading of compressed arrays. An initial prototype that I implemented does exactly that. CPU usage barely increases over the numpy memmap baseline, the storage requirement is drastically reduced (compression ratio ~3-4), and I/O as reported by iotop is down as well. At least when running the code on my local workstation.
On our compute cluster, when using the local SSDs of our compute nodes, blosc2 performance is as expected. The problem lies in the NFS share: if the data is located there, performance tanks. I have a hard time pinpointing what the problem is, especially because reading memmaps, which have higher I/O requirements, runs smoothly.
When using blosc2, CPUs are barely used and almost all processes report that they are waiting for I/O. But when checking the network activity I see that we are only reading ~15MB/s and (curiously) also transmitting ~15MB/s? These numbers are well below what our NFS file shares are capable of (>1.5GB/s). Do you have any idea why this could be?
Could it be that blosc2 sends a large number of small read requests? But even if each chunk were read individually, this should still be fewer reads than we currently have to do with numpy memmaps, as there are fewer chunks in a crop than there are contiguous rows to extract from an uncompressed array. Is there something I am not seeing?
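To make the chunks-vs-rows argument concrete, here is a quick back-of-the-envelope count (the chunk shape is a made-up placeholder, not our actual configuration):

```python
import math

crop = (192, 192, 192)    # typical crop shape from above
chunks = (64, 64, 64)     # hypothetical blosc2 chunk shape

# Upper bound on chunks overlapping an unaligned crop:
# per axis, at most ceil(crop/chunk) + 1 chunks are touched.
chunk_reads = math.prod(math.ceil(c / k) + 1 for c, k in zip(crop, chunks))
print(chunk_reads)        # 64

# With an uncompressed C-order memmap, only the last axis is contiguous,
# so the crop decomposes into one read per (x, y) pair:
row_reads = crop[0] * crop[1]
print(row_reads)          # 36864
```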
Here is some additional information in the hope that it helps narrow down the problem:
This is how we save arrays:
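In essence, it is a call along these lines (the file name, chunk shape, and cparams below are illustrative placeholders, not our exact settings):

```python
import blosc2
import numpy as np

# image: a preprocessed float32 array, e.g. (480, 512, 512)
image = np.zeros((480, 512, 512), dtype=np.float32)

blosc2.asarray(
    image,
    urlpath="case_0001.b2nd",   # hypothetical file name
    mode="w",
    chunks=(64, 64, 64),        # illustrative chunk shape
    cparams={"codec": blosc2.Codec.ZSTD, "clevel": 1},
)
```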
This is how we open the files to partially read/uncompress the arrays:
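Again in essence (the crop coordinates are placeholders):

```python
import blosc2

arr = blosc2.open("case_0001.b2nd")  # same hypothetical file as above

# Slicing decompresses only the chunks overlapping the crop.
x, y, z = 100, 200, 150  # illustrative crop origin
crop = arr[x:x + 192, y:y + 192, z:z + 192]
```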
Info for a representative training image
When using numpy arrays, I observed that the OS is caching read operations. For smaller datasets, this substantially reduces the amount of reads over time, reducing the I/O burden. I have not observed read caching with blosc2. Is there a specific reason for that?
Thank you for your support and for providing this wonderful package! If there is anything else you need, please let me know!
Best,
Fabian