Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IBufferProtocol to mmap #1866

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

slozier
Copy link
Contributor

@slozier slozier commented Jan 6, 2025

Add IBufferProtocol to mmap. Throws a NotImplementedException if the length does not fit in a int.

For the case of bigger files I think we'll have to rework IPythonBuffer. My initial thoughts on changes would be:

  • Use nint in place of int where applicable (probably where CPython uses Py_ssize_t).
  • Change AsSpan/AsReadOnlySpan methods to have an nint start argument.
  • Fix up everything to use nint arithmetic instead of int.

Marking as draft since I need to do more testing (and maybe write some tests). @BCSharp I'm not sure if you looked at/thought about this before so if you have any feedback would appreciate it.

Related to #1408

@BCSharp
Copy link
Member

BCSharp commented Jan 8, 2025

Yes, I have been thinking about the buffer protocol for large blobs, though not in the context of mmap. I didn't know that mmap implements the buffer protocol, but of course, why not.

I was thinking of supporting Numpy's ND arrays, which can be really big too. Also on some .NET versions, CLI arrays can be up to 4GB. The latter is exotic and the former still a little bit away, but mmap will be a good testing ground for large buffers.

The idea is to make it still easy to support by various types that implement the buffer protocol, and relatively easy to use by consumers. I am wary of using too much of nint, since the interesting .NET API is predominantly int/Span based, and I would like to avoid too much of narrowing casting by the clients to be able to do useful things with the buffer. The casts would have to be checked, or guarded with ifs, which is ugly and error prone. In the end, however, it all depends on usage. Now that we have the usage of the protocol in a number of places, it will be easier to accommodate the most common usage patterns. For instance, I've noticed that almost every client requests BufferFlags.Simple.

I see the following three major consuming patterns that should be easy for consumers to implement:

  1. Some consumers have innate limitations that make them unable to consume buffers larger than 2GB no matter what. For instance, constructors or initializers of builtin types, like bytes, bytearray, etc. For them, the current interface form (int/span-based) is the most convenient. If given a buffer larger than it they can handle, OverflowException should be automatically thrown, just as currently an exception is thrown if the requested buffer type is not supported.

  2. Some consumers can handle buffers larger than 2GB but prefer to do it in span-size chunks because they are making various .NET API calls with the data. For instance scanning for bytes, regex, reading/writing/copying data, encoding/decoding, encrypting/decrypting, etc. The interface should make it easy and convenient for them to consume a given buffer.

  3. Some consumers should be fully capable of handling memory data of any size. For instance memoryview. The interface should allow those clients to easily access the whole blob randomly.

I was thinking along these lines:

  • Add an optional parameter start, and perhaps count too, to AsSpan/AsReadOnlySpan methods, like one of your ideas. This should be easy for exporters to implement, though not always convenient to consume.
  • OR: Add Apply/ReadOnlyApply that takes a lambda or a delegate and optional start/count, which will repeatedly invoke the lambda with the appropriate and successive span, until the end of the designated data range. This should be convenient for consumers in Group 2, but cumbersome for each exporter to implement.
  • OR: Go with the optional parameter to AsSpan/AsReadOnlySpan, and provide Apply as an extension method. This will have the best of both worlds.
  • Extend IBufferProtocol with GetLongBuffer (or GetNativeBuffer?) that returns IPythonLongBuffer, which looks just like IPythonBuffer but everything is nint-based (like another of your ideas). Also AsSpan/AsReadOnlySpan return a new ref struct type that is nint-based. This will be convenient for consumers in Group 3. Provide a helper method to easily implement GetLongBuffer for exporters that never export anything bigger of 2GB. I think we still cannot use default implementations in interfaces, can we?
  • OR: A modification to the point above: drop AsSpan/AsReadOnlySpan, which are just convenience methods around Pin. Since the consumers in Group 3 are few and far between, it may be just simpler to let them fiddle with unsafe pointers.

The file descriptor work seems to be finally coming to an end, so once you merge this PR I could play with these ideas in code to get some better insights.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants