fix: make Map arrow type consistent with Map parquet type #299
Conversation
@roeap can you take a look at this one? IIRC you had uncovered a bunch of intricacies in the naming of these fields, and even the number of fields involved, depending on which parquet writer produced the file? Perhaps we need a way for kernel to hide those details from engine, so engine can always count on getting the same style of map schema?
In a discussion with Hannes from DuckDB, he told me about his experience implementing their Parquet reader.
I guess where I am at is that the only robust way of processing this is via ordinals, not relying on names. That said, since we are creating an arrow schema here I would personally lean more towards consistency with arrow-rs, where it is called "entries" as we do here. In practice I guess a parquet writer should make sure it's spec compliant. Then again, I don't think the various arrow implementations are consistent, as the arrow spec just mentions ordinals - IIRC. So my leaning towards "entries" is very slight and I would also be fine changing it. On my todo list is to upstream the parquet logic into delta-rs, at which point I will try to make as much of the processing ordinal based as I can.
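For illustration, a minimal sketch of what ordinal-based handling could look like with arrow-rs (the helper name and signature are made up here, not kernel or delta-rs code): the key and value fields are taken as the first and second child of the entries struct, so whether the schema says "entries", "key_value", "keys", or "key" never matters.

```rust
use arrow_schema::{DataType, Field};

/// Hypothetical helper: pull the key/value fields out of an Arrow Map type
/// by ordinal rather than by name, so the "entries"/"key_value" and
/// "key"/"keys" naming conventions all behave the same way.
fn map_key_value_fields(dt: &DataType) -> Option<(&Field, &Field)> {
    if let DataType::Map(entries, _sorted) = dt {
        if let DataType::Struct(children) = entries.data_type() {
            if children.len() == 2 {
                // First child is the key field, second is the value field,
                // regardless of what either is called.
                return Some((children[0].as_ref(), children[1].as_ref()));
            }
        }
    }
    None
}
```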
But it's an issue at the moment for delta-rs; the easiest fix would be to set these names properly in kernel-rs. Otherwise we have to manually fix these wrong schemas afterwards. Another way would be to allow users to provide the custom names they prefer in the delta->arrow schema conversion, so there are no assumptions on the spec.
While that is correct, we are not controlling how all the tables we are reading were written. In fact we are not even controlling the schema conventions of the data we are writing in delta-rs. I guess this is just a fix tailored to that one specific case, rather than solving the issue more permanently. I.e. at some point a table written by another writer comes along and we see the same issue again for another case. As long as we are relying on a specific convention there, I fear we will always just cover a subset of cases 😦. Do we have a repro somewhere for the spark issue? I'd like to investigate that a bit and see if we cannot find a more permanent solution. As to whether we should adopt this change here, I'd rely on @scovich @nicklan @zachschuermann opinions to tip the scales.
If people follow the parquet spec, then we should never receive data that's structured differently than that, though.

@roeap You can run this snippet:

```python
from deltalake import DeltaTable

dt = DeltaTable("crates/test/tests/data/delta-1.2.1-only-struct-stats")
dt.optimize.compact()
```

The issue will occur during the `optimize.compact()` call.
Wouldn't that be nice :D. But again, we are not talking about parquet here, this is an arrow schema, where the parquet spec technically does not apply. So the issue is more around how one schema maps to the other... In fact none of the arrow implementations I am aware of use this in their schemas. arrow-rs does it as we do here, and C++ as well (I think), albeit C++ and Rust differ in how they name the internal list field.
I know :), but we read parquet data into arrow RecordBatches, so the Arrow data gets structured following the parquet schema. As we are just round-tripping Parquet -> Arrow -> Parquet, it would be easier to just follow the parquet structure. Also, folks from arrow-rs pointed out that data gets written to Parquet in whatever way it's structured in Arrow, so if we don't set it right there, we potentially violate the parquet schema. We can also fix it by improving our RecordBatch casting logic and schema resolving (it's just more work :P), but we probably need to do this soon anyway as we currently do a lot of costly casts, such as Utf8View to Utf8.
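A hedged sketch of what that casting/schema-resolving fix might look like on the schema side, using arrow-rs (the helper is hypothetical, handles only a top-level Map, and is not delta-rs' actual casting code): rewrite the Map's inner field names to the parquet convention before comparing or writing.

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Fields};

/// Hypothetical helper: rebuild a Map type so its inner fields follow the
/// parquet convention ("key_value" / "key" / "value"). Nested types (maps
/// inside structs or lists) would need a recursive version of this.
fn normalize_map_names(dt: &DataType) -> DataType {
    if let DataType::Map(entries, sorted) = dt {
        if let DataType::Struct(children) = entries.data_type() {
            if children.len() == 2 {
                // Keep the key/value types and nullability, only rename.
                let key = children[0].as_ref().clone().with_name("key");
                let value = children[1].as_ref().clone().with_name("value");
                let entries = Field::new(
                    "key_value",
                    DataType::Struct(Fields::from(vec![key, value])),
                    entries.is_nullable(),
                );
                return DataType::Map(Arc::new(entries), *sorted);
            }
        }
    }
    dt.clone()
}
```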
@roeap @scovich someone else is reporting the issue now as well: delta-io/delta-rs#2731. This only started after adopting kernel in 0.18.
I'm inclined to merge this, but also, delta-rs guys, you should move to ordinal-based processing. So, why merge? Mainly:
- unblock delta-rs
- The parquet spec says: "It is required that the repeated group of key-value pairs is named key_value and that its fields are named key and value. However, these names may not be used in existing data and should not be enforced as errors when reading." So really it shouldn't be an error, but let's try and be compliant
- It's odd that arrow doesn't follow that; they also say: "In a field with Map type, the field has a child Struct field, which then has two children: key type and the second the value type. The names of the child fields may be respectively “entries”, “key”, and “value”, but this is not enforced.", which is a lot less strong than the parquet statement
- it's what the java kernel does
- it's what pyiceberg does
@roeap if you want to give the other approval we can merge this. If you still feel it's best to fix differently, we can hold off.
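For reference, a minimal sketch (not kernel's actual schema-conversion code; the field name and key/value types are made up) of the layout those quotes describe, built with arrow-rs: the entries struct is named "key_value" and its children "key" and "value", matching the parquet LogicalTypes spec.

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Fields};

/// Example of a parquet-spec-style Arrow Map field: map<string, i64>.
fn example_map_field() -> Field {
    let entries = Field::new(
        "key_value",
        DataType::Struct(Fields::from(vec![
            Field::new("key", DataType::Utf8, false),   // map keys are required
            Field::new("value", DataType::Int64, true), // map values may be null
        ])),
        false, // the repeated entries group itself is not nullable
    );
    Field::new("my_map", DataType::Map(Arc::new(entries), /* sorted */ false), true)
}
```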
@nicklan - go ahead! We still need to actually fix this, but that may take a bit longer :).
When we read parquet Map types in delta-rs, the data gets read with the root name "key_value", and the struct type has two fields, "key" and "value". We have casting logic that expects this schema; however, when we convert our snapshot.schema to an arrow schema, a mismatch happens because kernel is using other field names.

Error when we optimize tables with Map types:

```python
SchemaMismatchError: Could not find column keys
```

Theoretically arrow fields can use whatever names they want, but since we are reading parquet data and therefore getting these RecordBatches with that structure, it would be much easier to keep the structure of the datatypes consistent between the two.

See parquet docs:
/~https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps