Parquet: Add readers and writers for the internal object model #11904
This isn't correct. The INT96 timestamp reader is used when the underlying type is INT96, not when the Iceberg or Parquet type has nanosecond precision. Nanosecond precision timestamps were recently introduced and are not used to signal that Parquet stores timestamp as an INT96.
This should match the previous behavior, where this is used when the underlying physical type (`desc.getPrimitiveType().getPrimitiveTypeName()`) is INT96.
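To illustrate the point about INT96, here is a minimal sketch of selecting a timestamp reader from the physical type rather than from the timestamp precision. The enums, the class, and the reader labels are hypothetical stand-ins, not Iceberg or Parquet classes; real code would take the physical type from the column descriptor as in the expression quoted above.

```java
// Hypothetical stand-ins for the physical type and the timestamp precision.
enum PhysicalType { INT64, INT96 }

enum Precision { MICROS, NANOS }

class TimestampReaderChoice {
  // Returns a label for the reader that would be chosen (labels are illustrative).
  static String chooseReader(PhysicalType physical, Precision precision) {
    // INT96 is a physical encoding, so only the physical type is checked here.
    if (physical == PhysicalType.INT96) {
      return "int96-timestamp-reader";
    }
    // Precision only matters once we know the column is stored as INT64.
    return precision == Precision.NANOS ? "int64-nanos-reader" : "int64-micros-reader";
  }

  public static void main(String[] args) {
    System.out.println(chooseReader(PhysicalType.INT96, Precision.MICROS));
    System.out.println(chooseReader(PhysicalType.INT64, Precision.NANOS));
  }
}
```

In this sketch a nanosecond-precision column stored as INT64 never reaches the INT96 branch, which is the behavior the comment asks for.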
This and the following 2 methods are the only changes between the implementations of this class, so a lot of code is duplicated. In addition, this already introduces abstract factory methods for some readers -- including timestamps. I think it would be much cleaner to reuse this and call factory methods instead:
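The reviewer's own snippet is not preserved here. As a rough illustration of the pattern being described (shared logic in a base class, with subclasses supplying only factory methods), a sketch might look like the following. Every name in it is hypothetical and it is not the Iceberg class hierarchy.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical reader interface, standing in for a Parquet value reader.
interface ValueReader<T> {
  T read(long raw);
}

// Shared logic lives in the base class; only the factory method varies.
abstract class BaseReaderBuilder {
  // Factory method that each object model implements.
  protected abstract ValueReader<?> timestampReader();

  // Common code path that would otherwise be duplicated in every subclass.
  ValueReader<?> readerFor(String typeName) {
    if ("timestamp".equals(typeName)) {
      return timestampReader();
    }
    throw new UnsupportedOperationException("Unsupported type: " + typeName);
  }
}

class GenericReaderBuilder extends BaseReaderBuilder {
  @Override
  protected ValueReader<?> timestampReader() {
    // Assumed for illustration: convert microseconds since epoch to an Instant.
    return raw -> Instant.EPOCH.plus(raw, ChronoUnit.MICROS);
  }
}

class InternalReaderBuilder extends BaseReaderBuilder {
  @Override
  protected ValueReader<?> timestampReader() {
    // Assumed for illustration: keep the raw long value.
    return raw -> raw;
  }
}
```

With this shape, adding another object model means overriding the factory methods only, while `readerFor` stays in one place.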
I agree with this change, but please point these kinds of changes out for reviewers.
The old version worked because all of the supported logical type annotations had an equivalent `ConvertedType` (which is what `OriginalType` is called in the Parquet format and the logical type docs).
These methods should return `ParquetValueWriter<?>` and not a specific class.
The existing class `LogicalTypeWriterVisitor` declares its return values as `Optional<ParquetValueWriters.PrimitiveWriter<?>>`, so I would also need to modify the base class to return `ParquetValueWriter<?>`.
I expected a comment asking why the base class was modified and asking to avoid unnecessary refactoring, so I didn't do it.
Since you also want this handled, I will update it.
There is no primitive writer for UUIDs. Maybe this was a copy/paste error?
It uses the fixed length primitive writer with byte[] as input.
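For context, a UUID fits a 16-byte fixed-length value. The sketch below shows one way to convert between `java.util.UUID` and `byte[]`, assuming the big-endian layout described for Parquet's UUID logical type. The class and method names are hypothetical and are not the helpers used in this PR.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper, not part of Iceberg.
class UuidBytes {
  // Packs the two 64-bit halves into 16 bytes; ByteBuffer is big-endian by default.
  static byte[] toBytes(UUID uuid) {
    ByteBuffer buffer = ByteBuffer.allocate(16);
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
    return buffer.array();
  }

  // Reads the halves back in the same order they were written.
  static UUID fromBytes(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    long mostSignificant = buffer.getLong();
    long leastSignificant = buffer.getLong();
    return new UUID(mostSignificant, leastSignificant);
  }
}
```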
Why does this use `ofNullable`? The intent is not to expose `BaseParquetWriter` beyond its package, so there is no need to handle cases where the factory method returns `null`. Those are error cases.
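The difference between the two `Optional` factories can be shown with a small standalone sketch. `brokenFactory` is a hypothetical method that wrongly returns `null`; with `ofNullable` the bug is silently turned into an empty result, while `of` surfaces it immediately as a `NullPointerException`.

```java
import java.util.Optional;

class OptionalFactoryDemo {
  // Hypothetical factory method that returns null by mistake.
  static String brokenFactory() {
    return null;
  }

  // ofNullable converts the null into Optional.empty(), hiding the error.
  static boolean ofNullableHidesNull() {
    return !Optional.ofNullable(brokenFactory()).isPresent();
  }

  // of throws NullPointerException on null, so the error is caught at the source.
  static boolean ofFailsFast() {
    try {
      Optional.of(brokenFactory());
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("ofNullable hides null: " + ofNullableHidesNull());
    System.out.println("of fails fast: " + ofFailsFast());
  }
}
```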
I don't think there's any need to make this more flexible. It makes sense to support the new `UUIDWriter` to ensure the tests pass with all primitive types, but this doesn't need to introduce a way to override the writer and representation (same for the read side).
Generic writers handle the UUID logical type by returning the default `empty()`, so that it is handled as a primitive writer of fixed-length `byte[]` here:
https://github.com/apache/iceberg/blob/5b13760c01bdbb2ab15be072c582adb2a8792f23/parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetWriter.java#L131C1-L132C1
But internal writers need a new UUID-based writer. Hence, I added an override that returns an empty writer for generic types (the default) and a UUID-based writer for the internal type.