Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: better ways to use deletion vectors #215

Merged
merged 22 commits into from
Jul 26, 2024
Merged

Conversation

hntd187
Copy link
Collaborator

@hntd187 hntd187 commented May 22, 2024

This is meant to be some utils to help make DV usage easier. You can

  • Get a deletion/selection vector
  • Get a DV or SV in bits or bools depending
  • Get either DV or SV in batches or a full vector specified by the caller
  • Pass an array, slice or vec to use for the container
  • Implement the traits to provide your own container to operate on

TODO:

  1. Docs
  2. More tests
  3. FFI integration for this. I'd prefer to get the rust side right first and then I can extend for FFI usage.
  4. Correctness, is this what we want?

@hntd187 hntd187 self-assigned this May 22, 2024
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a first pass. Overall I think this is the right idea, but I'm not sold on the batch thing

kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing first pass. Seems like an interesting start, but not sure I have a "big picture" view of the idea quite yet?
How selection vs. deletion vectors would work, how engine could provide its own bit vector implementation, etc.

kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of comments but not much substance in the end... rust pointers are just messy no matter how you "slice" it :(

ffi/src/lib.rs Outdated
@@ -122,6 +126,12 @@ mod private {
len: usize,
}

#[repr(C)]
pub struct KernelRowIndexArray {
ptr: *mut u64,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this accurate?

Suggested change
ptr: *mut u64,
ptr: NonNull<u64>,

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: this mirrors the existing KernelBoolSlice, which uses a null pointer to represent the empty slice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, according to std::slice::from_raw_parts:

You can obtain a pointer that is usable as data for zero-length slices using NonNull::dangling().

... that way, we wouldn't need any null check in the as_ref() method below.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs for Vec don't specifically mention NonNull::dangling, but Vec::as_ptr does return

a dangling raw pointer valid for zero sized reads if the vector didn’t allocate.

The docs for Vec::from_raw_slice are not clear tho:

ptr must have been allocated using the global allocator, such as via the alloc::alloc function. ... These requirements are always upheld by any ptr that has been allocated via Vec<T> [which presumably includes the pointer returned by as_ptr which we already trust for the non-empty case?]. Other allocation sources are allowed if the invariants are upheld.

Further, both Unique::dangling and NonNull::dangling state that they are

useful for initializing types which lazily allocate, like Vec::new does.

That said, it's probably safest of all to just use Vec::new().as_mut_ptr() as pointer for an empty index slice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the null pointer approach is simpler if the engine might ever need to create one of these to pass into kernel? Actually, given how messy all the non-null/unique/etc constraints are, nullable pointers are probably the simplest even if engine doesn't need to create one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it's not clear to me what changes should be made? Should be switch both slice types over to a new way in a subsequent PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sorry that was a lot of rambling/exploring. Basically, we have two options for representing an empty slice:

  • Pointer is always non-null, length 0 and dangling pointer means empty
  • Null pointer means empty

Rust and its library classes don't have a really nice way to handle either case, so IMO we should choose the least-bad approach (= least code and least error prone) and use that consistently. Maybe the current code is already the correct choice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the current code does both -- the slice -> vec code assumes nullable pointer, while the vec -> slice (impl From) uses a dangling pointer. We should pick one and stick with it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any opinion on which you prefer @scovich ? I'd probably go with the always non-null length zero.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated both bool and row id slice to use non-zero length as empty.

ffi/src/lib.rs Outdated Show resolved Hide resolved
ffi/src/lib.rs Outdated
Comment on lines 214 to 215
let mut vec = value.row_indexes();
vec.shrink_to_fit();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut vec = value.row_indexes();
vec.shrink_to_fit();
let vec = value.row_indexes();

Vec::into_boxed_slice states:

Before doing the conversion, this method discards excess capacity like shrink_to_fit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: This feels like a very messy part of rust. There are so many ways to convert to and from pointers, and it's not clear that any of them is the "best" -- especially when bits don't fit together nicely. Like here -- into_raw returns a pointer that satisfies NonNull, but there's no "safe" way to actually create a NonNull from it.

ffi/src/lib.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally fine, but we need to augment the actual ffi apis. So today we have selection_vector_from_dv (in ffi/src/scan.rs), which always returns a bool slice.

we should probably add a row_indexes_from_dv to mirror that.

Alternately, we could have a single function that can return whatever the user asks for, but while that could be clean in rust (return an enum), I think over ffi it's probably easier/cleaner to just have multiple functions each with its own return type.

ffi/src/lib.rs Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
ffi/src/lib.rs Outdated Show resolved Hide resolved
ffi/src/lib.rs Outdated Show resolved Hide resolved
@hntd187 hntd187 requested a review from nicklan June 21, 2024 21:19
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly good! I think there's still some stuff ryan pointed out that needs to be fixed, and I had one more thing.

kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
@hntd187 hntd187 requested a review from nicklan June 27, 2024 12:57
kernel/src/scan/state.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
ffi/src/lib.rs Outdated Show resolved Hide resolved
ffi/src/lib.rs Outdated Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for iterating on this. just a couple of doc comments and we're good to go (i think)

ffi/src/lib.rs Show resolved Hide resolved
kernel/src/scan/state.rs Show resolved Hide resolved
@hntd187 hntd187 requested a review from scovich July 23, 2024 00:04
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few nits - and can we update the PR description to accurately represent the changes?

ffi/src/lib.rs Show resolved Hide resolved
ffi/src/lib.rs Show resolved Hide resolved
kernel/src/actions/deletion_vector.rs Outdated Show resolved Hide resolved
ffi/src/lib.rs Outdated
if let Some(ptr) = NonNull::new(ptr) {
KernelBoolSlice { ptr, len }
} else {
KernelBoolSlice::empty()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the leaked pointer can ever be null, so None case here is impossible?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I don't believe so, I am just pleasing the compiler here because internally a vec that hasn't allocated also won't produce a null here and I can't think of any situation in which it would realistically be null.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's a suitable From:

let ptr = Box::leak(boxed).into();
KernelBoolSlice { ptr, len }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping this would work, but we call .as_mut_ptr() to get the pointer to the inner buffer of the vec, NonNull::from() wants a &mut T which I am not aware of anyway to get a mutable reference to the inner buffer (only a pointer) that doesn't involve dereferencing the pointer above, which would add unsafe to the call. I would prefer to keep the check above in lieu of adding unsafe to the method, thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also keep in mind a lot of ways to refer to the vec in rust give pointers to the rust object of the vec, not the inner buffer, making this harder to get

Copy link
Collaborator

@scovich scovich Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! I had indeed forgotten the as_mut_ptr was a vec method (wrongly assumed it was just turning the leaked reference into a pointer). Agree there doesn't seem to be any good way to directly extract a NonNull from a slice, even tho it's guaranteed to exist (grr). All the examples just just expect to unwrap the option when passing a reference to NonNull::new, should we do that?

(the current code is confusing because it uses an empty slice as a fallback for an error case that can't actually arise in practice -- which could trick some future reader into thinking empty slices need some kind of special handling there)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility would be to leverage first_mut, but that's still annoying and ugly because it returns an Option<&mut T> (because it's UB to create a reference from a zero-length slice). So expect or the current code are probably the best we can do, barring a rust API change.

Copy link
Collaborator Author

@hntd187 hntd187 Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expect should be okay because if it ever hits that error, we have a legitimate bug we should investigate.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks correct, remaining ugly seems to be a wart in rust, not easily fixed.

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@hntd187 hntd187 merged commit ec934e8 into delta-io:main Jul 26, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants