Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Builder for Scans on TableChanges #521

Merged
merged 14 commits into from
Nov 23, 2024

Conversation

OussamaSaoudi-db
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Nov 21, 2024

What changes are proposed in this pull request?

This PR introduces the TableChangesScanBuilder which constructs a TableChangesScan given a TableChanges, and optionally, a predicate and schema.

This introduces the following structs:

  • TableChangesScan
  • TableChangesScanBuilder

I also introduce methods to TableChanges to get a builder:

  • into_scan_builder
  • scan_builder

How was this change tested?

I ensure that schema projection works for CDF's generated columns, predicates are correctly processed, and that the ColumnTypes are correctly created for the TableChangesScan

Copy link

codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 93.22034% with 8 lines in your changes missing coverage. Please review.

Project coverage is 80.58%. Comparing base (4a0fad2) to head (53732c4).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/table_changes/scan.rs 95.53% 2 Missing and 3 partials ⚠️
kernel/src/table_changes/mod.rs 50.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #521      +/-   ##
==========================================
+ Coverage   80.43%   80.58%   +0.15%     
==========================================
  Files          62       63       +1     
  Lines       13645    13762     +117     
  Branches    13645    13762     +117     
==========================================
+ Hits        10975    11090     +115     
+ Misses       2114     2113       -1     
- Partials      556      559       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Nov 21, 2024
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick pass with some early comments

kernel/src/table_changes/mod.rs Show resolved Hide resolved
kernel/src/scan/mod.rs Outdated Show resolved Hide resolved
kernel/src/table_changes/scan.rs Show resolved Hide resolved
kernel/src/table_changes/scan.rs Show resolved Hide resolved
kernel/src/table_changes/scan.rs Show resolved Hide resolved
kernel/src/table_changes/scan.rs Outdated Show resolved Hide resolved
kernel/src/table_changes/scan.rs Outdated Show resolved Hide resolved
kernel/src/table_changes/scan.rs Outdated Show resolved Hide resolved
@@ -47,8 +48,8 @@ static LOG_SCHEMA: LazyLock<SchemaRef> = LazyLock::new(|| {
Option::<Protocol>::get_struct_field(PROTOCOL_NAME),
Option::<SetTransaction>::get_struct_field(SET_TRANSACTION_NAME),
Option::<CommitInfo>::get_struct_field(COMMIT_INFO_NAME),
Option::<Cdc>::get_struct_field(CDC_NAME),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR need to rebase? I could have sworn Cdc infra already merged?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, needs a rebase

@OussamaSaoudi-db OussamaSaoudi-db changed the title Implement Builder for Scans on TableChangs Implement Builder for Scans on TableChanges Nov 21, 2024
@OussamaSaoudi-db OussamaSaoudi-db force-pushed the scan_builder branch 2 times, most recently from 9ac7550 to 2e1fd66 Compare November 21, 2024 23:19
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, few nits

Comment on lines 54 to 58
match schema_opt {
Some(schema) => self.with_schema(schema),
None => self,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this also work?

Suggested change
match schema_opt {
Some(schema) => self.with_schema(schema),
None => self,
}
schema_opt.map_or(self, |schema| self.with_schema(schema))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it doesn't because it sees self is moving ownership to map_or's default and also to the closure. Not allowed 😔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map_or_else might delay the move enough for it to work?

schema_opt.map_or_else(|| self, |schema| self.with_schema(schema))

kernel/src/table_changes/scan.rs Outdated Show resolved Hide resolved
kernel/src/table_changes/scan.rs Outdated Show resolved Hide resolved
// Add to read schema, store field so we can build a `Column` expression later
// if needed (i.e. if we have partition columns)
let physical_field =
logical_field.make_physical(*self.table_changes.column_mapping_mode())?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this changes a lot to support column mapping and/or nested columns: #512

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm how do you think we should handle this?

Copy link
Collaborator

@scovich scovich Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an area where CDF is not different from a normal scan. We just need to leverage the same approach the other PR introduces. IMO we should NOT solve it here -- just let the other PR pick it up.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sweet, ty!

Comment on lines +156 to +188
ColumnType::Selected("_change_type".to_string()),
ColumnType::Selected("_commit_version".to_string()),
ColumnType::Selected("_commit_timestamp".to_string()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these generated columns are not file-constant values, so we have to treat them like normal columns even tho they don't come from the parquet?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I considered introducing a ColumnType::Generated. I feel like this is the "right" way, but may take some more discussion and changes to kernel/scan/mod.rs. If you think ColumnType::Generated is not controversial, we can go with that.

The current plan is to treat them as selected, and later check when transforming physical to logical:

match (column_type) {
    ColumnType::Selected(col) => {
       if CDF_FIELDS.contains(col) {
           // Treat as CDF generated column
       } else {
           // Usual path for ColumnType::Selected
       }
   }
   ...
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicklan @zachschuermann If you feel strongly that we should do Generated or keep Selected, let me know!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it as-is for now. Once we add support for row tracking there will be more generated columns to worry about and we can revisit with more context available.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and I think the expression stuff will change this as well. Rather than a "selected" field, we'll just generate an expression to say, add a column with this value.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure SGTM :)

kernel/src/table_changes/table_changes_scan.rs Outdated Show resolved Hide resolved
Comment on lines +156 to +188
ColumnType::Selected("_change_type".to_string()),
ColumnType::Selected("_commit_version".to_string()),
ColumnType::Selected("_commit_timestamp".to_string()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and I think the expression stuff will change this as well. Rather than a "selected" field, we'll just generate an expression to say, add a column with this value.

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just a couple nits!

Comment on lines +156 to +188
ColumnType::Selected("_change_type".to_string()),
ColumnType::Selected("_commit_version".to_string()),
ColumnType::Selected("_commit_timestamp".to_string()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure SGTM :)


#[test]
fn simple_table_changes_scan_builder() {
let path = "./tests/data/table-with-cdf";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: little bit confusing there's a column named 'part' but there are no partition columns. maybe just add a comment that this is a non-partitioned table?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added 👍

kernel/src/table_changes/scan.rs Show resolved Hide resolved
@zachschuermann zachschuermann removed the breaking-change Change that will require a version bump label Nov 22, 2024
@OussamaSaoudi-db OussamaSaoudi-db merged commit db29db0 into delta-io:main Nov 23, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants