Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: better ways to use deletion vectors #215

Merged
merged 22 commits into from
Jul 26, 2024
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
9cd41f2
feat: better ways to use deletion vectors
hntd187 May 22, 2024
9edcce9
Merge branch 'main' into dv-utils
hntd187 May 22, 2024
6acb5f8
Merge branch 'main' into dv-utils
hntd187 May 22, 2024
6c726f5
feat: option to pass DVs over as just array of row indexes
hntd187 Jun 8, 2024
0c20f6b
Merge branch 'main' into dv-utils
hntd187 Jun 21, 2024
213c519
Updated and added some plumbing for the row index vector creation
hntd187 Jun 21, 2024
0da2dfb
chore: fmt
hntd187 Jun 21, 2024
ea36d70
Merge branch 'main' into dv-utils
hntd187 Jun 21, 2024
1c897c6
Merge branch 'main' into dv-utils
hntd187 Jun 27, 2024
b33d621
More deleting, a few doc comments, some cleanup
hntd187 Jun 27, 2024
c621634
Merge branch 'main' into dv-utils
hntd187 Jul 5, 2024
4b8e18f
Address PR feedback
hntd187 Jul 5, 2024
63bdbc6
Address PR feedback
hntd187 Jul 7, 2024
b92fcc0
Merge branch 'main' into dv-utils
hntd187 Jul 20, 2024
bfc7486
Address PR feedback, use dangling pointer for init/empty array instea…
hntd187 Jul 20, 2024
59c150b
Merge branch 'main' into dv-utils
hntd187 Jul 23, 2024
8e19022
Address PR feedback, use dangling pointer for init/empty array instea…
hntd187 Jul 25, 2024
265d1f1
Merge remote-tracking branch 'mine/dv-utils' into dv-utils
hntd187 Jul 25, 2024
90dc89f
Update kernel/src/actions/deletion_vector.rs
hntd187 Jul 26, 2024
c97c456
Address PR feedback, use dangling pointer for init/empty array instea…
hntd187 Jul 26, 2024
e68dc95
Merge remote-tracking branch 'mine/dv-utils' into dv-utils
hntd187 Jul 26, 2024
c9e08f2
Address PR feedback, as well as resolve some lints and nightly build …
hntd187 Jul 26, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
45 changes: 45 additions & 0 deletions ffi/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,12 +4,14 @@
#[cfg(any(feature = "default-engine", feature = "sync-engine"))]
use std::collections::HashMap;
use std::default::Default;
use std::ffi::{c_long, c_longlong, c_ulonglong};
scovich marked this conversation as resolved.
Show resolved Hide resolved
use std::os::raw::{c_char, c_void};
use std::ptr::NonNull;
use std::sync::Arc;
use tracing::debug;
use url::Url;

use delta_kernel::actions::deletion_vector::DeletionVector;
scovich marked this conversation as resolved.
Show resolved Hide resolved
use delta_kernel::expressions::{BinaryOperator, Expression, Scalar};
use delta_kernel::schema::{ArrayType, DataType, MapType, PrimitiveType, StructType};
use delta_kernel::snapshot::Snapshot;
Expand Down Expand Up @@ -113,6 +115,8 @@ pub type AllocateStringFn = extern "C" fn(kernel_str: KernelStringSlice) -> Null
// Put KernelBoolSlice in a sub-module, with non-public members, so rust code cannot instantiate it
// directly. It can only be created by converting `From<Vec<bool>>`.
mod private {
use delta_kernel::actions::deletion_vector::DeletionVector;

/// Represents an owned slice of boolean values allocated by the kernel. Any time the engine
/// receives a `KernelBoolSlice` as a return value from a kernel method, engine is responsible
/// to free that slice, by calling [super::drop_bool_slice] exactly once.
Expand All @@ -122,6 +126,12 @@ mod private {
len: usize,
}

#[repr(C)]
nicklan marked this conversation as resolved.
Show resolved Hide resolved
pub struct KernelRowIndexArray {
ptr: *mut u64,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this accurate?

Suggested change
ptr: *mut u64,
ptr: NonNull<u64>,

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: this mirrors the existing KernelBoolSlice, which uses a null pointer to represent the empty slice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, according to std::slice::from_raw_parts:

You can obtain a pointer that is usable as data for zero-length slices using NonNull::dangling().

... that way, we wouldn't need any null check in the as_ref() method below.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs for Vec don't specifically mention NonNull::dangling, but Vec::as_ptr does return

a dangling raw pointer valid for zero sized reads if the vector didn’t allocate.

The docs for Vec::from_raw_slice are not clear tho:

ptr must have been allocated using the global allocator, such as via the alloc::alloc function. ... These requirements are always upheld by any ptr that has been allocated via Vec<T> [which presumably includes the pointer returned by as_ptr which we already trust for the non-empty case?]. Other allocation sources are allowed if the invariants are upheld.

Further, both Unique::dangling and NonNull::dangling state that they are

useful for initializing types which lazily allocate, like Vec::new does.

That said, it's probably safest of all to just use Vec::new().as_mut_ptr() as pointer for an empty index slice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the null pointer approach is simpler if the engine might ever need to create one of these to pass into kernel? Actually, given how messy all the non-null/unique/etc constraints are, nullable pointers are probably the simplest even if engine doesn't need to create one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it's not clear to me what changes should be made? Should be switch both slice types over to a new way in a subsequent PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sorry that was a lot of rambling/exploring. Basically, we have two options for representing an empty slice:

  • Pointer is always non-null, length 0 and dangling pointer means empty
  • Null pointer means empty

Rust and its library classes don't have a really nice way to handle either case, so IMO we should choose the least-bad approach (= least code and least error prone) and use that consistently. Maybe the current code is already the correct choice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the current code does both -- the slice -> vec code assumes nullable pointer, while the vec -> slice (impl From) uses a dangling pointer. We should pick one and stick with it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any opinion on which you prefer @scovich ? I'd probably go with the always non-null length zero.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated both bool and row id slice to use non-zero length as empty.

len: usize,
}

impl KernelBoolSlice {
/// Creates an empty slice.
pub fn empty() -> KernelBoolSlice {
Expand Down Expand Up @@ -175,13 +185,43 @@ mod private {
/// memory, but must only free it by calling [super::drop_bool_slice]. Since the global
/// allocator is threadsafe, it doesn't matter which engine thread invokes that method.
unsafe impl Send for KernelBoolSlice {}
unsafe impl Send for KernelRowIndexArray {}
hntd187 marked this conversation as resolved.
Show resolved Hide resolved

/// # Safety
///
/// If engine chooses to leverage concurrency, engine is responsible to prevent data races.
unsafe impl Sync for KernelBoolSlice {}
unsafe impl Sync for KernelRowIndexArray {}
hntd187 marked this conversation as resolved.
Show resolved Hide resolved

impl KernelRowIndexArray {
/// Converts this slice back into a `Vec<u64>`.
///
/// # Safety
///
/// The slice must have been originally created `From<Vec<u64>>`, and must not have been
/// already been consumed by a previous call to this method.
scovich marked this conversation as resolved.
Show resolved Hide resolved
pub unsafe fn into_vec(self) -> Vec<u64> {
if self.ptr.is_null() {
Default::default()
} else {
Vec::from_raw_parts(self.ptr, self.len, self.len)
}
}
}

impl From<DeletionVector> for KernelRowIndexArray {
fn from(value: DeletionVector) -> Self {
let mut vec = value.row_indexes();
vec.shrink_to_fit();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut vec = value.row_indexes();
vec.shrink_to_fit();
let vec = value.row_indexes();

Vec::into_boxed_slice states:

Before doing the conversion, this method discards excess capacity like shrink_to_fit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: This feels like a very messy part of rust. There are so many ways to convert to and from pointers, and it's not clear that any of them is the "best" -- especially when bits don't fit together nicely. Like here -- into_raw returns a pointer that satisfies NonNull, but there's no "safe" way to actually create a NonNull from it.

hntd187 marked this conversation as resolved.
Show resolved Hide resolved
let len = vec.len();
let boxed = vec.into_boxed_slice();
let ptr = Box::into_raw(boxed).cast();
hntd187 marked this conversation as resolved.
Show resolved Hide resolved
KernelRowIndexArray { ptr, len }
}
}
}
pub use private::KernelBoolSlice;
pub use private::KernelRowIndexArray;

/// # Safety
///
Expand All @@ -192,6 +232,11 @@ pub unsafe extern "C" fn drop_bool_slice(slice: KernelBoolSlice) {
debug!("Dropping bool slice. It is {vec:#?}");
}

#[no_mangle]
pub unsafe extern "C" fn drop_row_indexes(slice: KernelRowIndexArray) {
let _ = slice.into_vec();
}

#[repr(C)]
#[derive(Debug)]
pub enum KernelError {
Expand Down
49 changes: 46 additions & 3 deletions kernel/src/actions/deletion_vector.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,11 @@ use std::io::{Cursor, Read};
use std::sync::Arc;

use bytes::Bytes;
use delta_kernel_derive::Schema;
use roaring::RoaringTreemap;
use url::Url;

use delta_kernel_derive::Schema;

use crate::utils::require;
use crate::{DeltaResult, Error, FileSystemClient};

Expand Down Expand Up @@ -165,6 +166,14 @@ impl DeletionVectorDescriptor {
}
}
}

pub fn into_deletion_vector(
self,
fs_client: Arc<dyn FileSystemClient>,
parent: &Url,
) -> DeltaResult<DeletionVector> {
DeletionVector::new(fs_client, parent, self)
}
}

enum Endian {
Expand Down Expand Up @@ -223,15 +232,37 @@ pub(crate) fn treemap_to_bools(treemap: RoaringTreemap) -> Vec<bool> {
}
}

#[derive(Debug, Clone)]
#[repr(C)]
pub struct DeletionVector {
hntd187 marked this conversation as resolved.
Show resolved Hide resolved
inner: RoaringTreemap,
}

impl DeletionVector {
pub fn new(
fs_client: Arc<dyn FileSystemClient>,
parent: &Url,
descriptor: DeletionVectorDescriptor,
) -> DeltaResult<Self> {
let inner = descriptor.read(fs_client, parent)?;
Ok(Self { inner })
}

pub fn row_indexes(self) -> Vec<u64> {
self.inner.into_iter().collect::<Vec<u64>>()
scovich marked this conversation as resolved.
Show resolved Hide resolved
}
}

#[cfg(test)]
mod tests {
use roaring::RoaringTreemap;
use std::path::PathBuf;

use super::*;
use roaring::RoaringTreemap;

use crate::{engine::sync::SyncEngine, Engine};

use super::DeletionVectorDescriptor;
use super::*;

fn dv_relative() -> DeletionVectorDescriptor {
DeletionVectorDescriptor {
Expand Down Expand Up @@ -358,4 +389,16 @@ mod tests {
expected[4294967300] = false;
assert_eq!(bools, expected);
}

#[tokio::test]
async fn test_dv_wrapper() -> Result<(), Box<dyn std::error::Error>> {
let example = dv_inline();
let sync_engine = SyncEngine::new();
let fs_client = sync_engine.get_file_system_client();
let parent = Url::parse("http://not.used")?;
let dv = example.into_deletion_vector(fs_client, &parent)?;
let row_idx = dv.row_indexes();
nicklan marked this conversation as resolved.
Show resolved Hide resolved
dbg!(row_idx);
Ok(())
}
}