
fix: make Map arrow type consistent with Map parquet type #299

Merged: 2 commits into delta-io:main, Aug 5, 2024

Conversation

@ion-elgreco (Contributor) commented Aug 3, 2024

When we read parquet Map types in delta-rs, the data gets read with the root name "key_value", and the struct type has two fields, "key" and "value". We have casting logic that expects this schema; however, when we convert our snapshot.schema to an arrow schema, a mismatch happens because kernel uses different field names.

This causes an error when we optimize tables with Map types:

```
SchemaMismatchError: Could not find column keys
```

Theoretically, arrow fields can use whatever names they want, but since we are reading parquet data and therefore getting RecordBatches with that structure, it is much easier to keep the structure of the data types consistent between the two.

See parquet docs:
/~https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps
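
For illustration, here is a minimal arrow-rs sketch of the Map layout the parquet spec prescribes: the entries struct named "key_value" with children "key" and "value". The key/value types (Utf8, Int64) are made up for the example; only the field names come from the spec:

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Fields};

fn main() {
    // The layout the parquet spec prescribes: a repeated group named
    // "key_value" holding a required "key" and an optional "value".
    let entries = Field::new(
        "key_value",
        DataType::Struct(Fields::from(vec![
            Field::new("key", DataType::Utf8, false),   // map keys are non-nullable
            Field::new("value", DataType::Int64, true), // map values may be null
        ])),
        false, // the entries struct itself is non-nullable
    );

    // The corresponding arrow Map type (second argument: keys_sorted).
    let map_type = DataType::Map(Arc::new(entries), false);
    println!("{map_type:?}");
}
```

arrow-rs builders default to different entry names, which is exactly the mismatch described above.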

@scovich (Collaborator) commented Aug 4, 2024

@roeap can you take a look at this one? IIRC you had uncovered a bunch of intricacies in the naming of these fields, and even the number of fields involved, depending on which parquet writer produced the file? Perhaps we need a way for kernel to hide those details from engine, so engine can always count on getting the same style of map schema?

@roeap (Collaborator) commented Aug 4, 2024

In a discussion with Hannes from DuckDB, he told me about his experience implementing their Parquet reader:

> ... you are not implementing a spec, but you are implementing all these bugs and quirks for the various implementations out there ...

I guess where I am at is that the only robust way of processing this is via ordinals, not relying on names. That said, since we are creating an arrow schema here, I would personally lean more towards consistency with arrow-rs, where the field is called "entries", as we do here. In practice, a parquet writer should make sure it is spec-compliant.

Then again, I don't think the various arrow implementations are consistent, as the arrow spec just mentions ordinals, IIRC. So my leaning towards "entries" is very slight, and I would also be fine changing it.

On my todo list is to upstream the parquet logic into delta-rs, at which point I will try to make the processing as ordinal-based as I can.
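
As a sketch of what ordinal-based processing could look like, here is a hypothetical helper (not kernel or delta-rs code) that reads a map's children by position instead of by name, using arrow-rs's MapArray and StructArray APIs:

```rust
use arrow_array::{Array, MapArray, StructArray};

// Hypothetical helper: fetch a map's key and value columns by ordinal,
// so the writer's choice of field names never matters.
fn map_children(map: &MapArray) -> (&dyn Array, &dyn Array) {
    let entries: &StructArray = map.entries();
    // Child 0 is always the keys, child 1 the values, per the arrow spec.
    (entries.column(0).as_ref(), entries.column(1).as_ref())
}
```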

@ion-elgreco (Contributor, Author) commented:

> In a discussion with Hannes from DuckDB, he told me about his experience implementing their Parquet reader: "... you are not implementing a spec, but you are implementing all these bugs and quirks for the various implementations out there ..."
>
> I guess where I am at is that the only robust way of processing this is via ordinals, not relying on names. That said, since we are creating an arrow schema here, I would personally lean more towards consistency with arrow-rs, where the field is called "entries", as we do here. In practice, a parquet writer should make sure it is spec-compliant.
>
> Then again, I don't think the various arrow implementations are consistent, as the arrow spec just mentions ordinals, IIRC. So my leaning towards "entries" is very slight, and I would also be fine changing it.
>
> On my todo list is to upstream the parquet logic into delta-rs, at which point I will try to make the processing as ordinal-based as I can.

But it's an issue at the moment for delta-rs; the easiest fix would be to set these names properly in kernel-rs. Otherwise we have to manually fix these wrong schemas afterwards.

Another way would be to allow users to provide the custom names they prefer in the delta -> arrow schema conversion, so there are no assumptions on the spec.

@roeap (Collaborator) commented Aug 4, 2024

While that is correct, we do not control how all the tables we read were written. In fact, we do not even control the schema conventions of the data we write in delta-rs. I guess this is just a fix tailored to that one specific case rather than solving the issue more permanently, i.e. at some point a table written by another writer comes along and we see the same issue again for another case.

As long as we are relying on a specific convention there, I fear we will always just cover a subset of cases 😦.

Do we have a repro somewhere for the spark issue? I'd like to investigate that a bit and see if we cannot find a more permanent solution.

As to whether we should adopt this change here, I'd rely on @scovich, @nicklan, and @zachschuermann's opinions to tip the scales.

@ion-elgreco (Contributor, Author) commented Aug 4, 2024

If people follow the parquet spec, then we should never receive data that's structured differently than that, though.

@roeap You can run this snippet:

```python
from deltalake import DeltaTable

# Compacting rewrites the table's data files, which triggers the
# schema cast that fails.
dt = DeltaTable("crates/test/tests/data/delta-1.2.1-only-struct-stats")
dt.optimize.compact()
```

The issue will occur during `cast_struct`, since the column is not found there.

@roeap (Collaborator) commented Aug 4, 2024

> If people follow the parquet spec, then we should never receive data that's structured differently than that, though.

Wouldn't that be nice :D.

But again, we are not talking about parquet here; this is an arrow schema, where the parquet spec technically does not apply. So the issue is more around how one schema maps to the other... In fact, none of the arrow implementations I am aware of use this in their schemas. arrow-rs does it as we do here, and C++ as well (I think), although C++ and Rust differ in how they name the internal list field.

@ion-elgreco (Contributor, Author) commented:

I know :), but we read parquet data into arrow RecordBatches, so the Arrow data gets structured following the parquet schema. As we are just round-tripping Parquet -> Arrow -> Parquet, it would just be easier to follow the parquet structure.

Also, folks from arrow-rs pointed out that data gets written to Parquet in whatever way it's structured in Arrow, so if we don't set it right there, we potentially violate the parquet spec.

We can also fix it by improving our RecordBatch casting logic and schema resolution (it's just more work :P), but we probably need to do this soon anyway, as we currently do a lot of costly casts, such as Utf8View to Utf8.
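
For context, a minimal sketch of the kind of cast being referred to, using arrow-rs's `cast` kernel (the example column and data are made up):

```rust
use std::sync::Arc;
use arrow_array::{Array, ArrayRef, StringViewArray};
use arrow_cast::cast;
use arrow_schema::DataType;

fn main() {
    // A Utf8View column, as a parquet reader might hand back.
    let view: ArrayRef = Arc::new(StringViewArray::from(vec!["a", "b"]));

    // Casting to plain Utf8 allocates and copies the data; doing this for
    // every batch is the kind of cost mentioned above.
    let plain = cast(view.as_ref(), &DataType::Utf8).expect("supported cast");
    assert_eq!(plain.data_type(), &DataType::Utf8);
}
```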

@ion-elgreco (Contributor, Author) commented:

@roeap @scovich someone else is reporting the issue now as well. delta-io/delta-rs#2731

This only started after adopting kernel in 0.18

@nicklan (Collaborator) left a comment


I'm inclined to merge this, but also, delta-rs folks, you should move to ordinal-based processing. So, why merge? Mainly:

  1. Unblock delta-rs.
  2. The parquet spec says: "It is required that the repeated group of key-value pairs is named key_value and that its fields are named key and value. However, these names may not be used in existing data and should not be enforced as errors when reading." So really it shouldn't be an error, but let's try to be compliant.
  3. It's odd that arrow doesn't follow that; they also say: "In a field with Map type, the field has a child Struct field, which then has two children: key type and the second the value type. The names of the child fields may be respectively “entries”, “key”, and “value”, but this is not enforced.", which is a lot less strong than the parquet statement.
  4. It's what the Java kernel does.
  5. It's what pyiceberg does.

@nicklan (Collaborator) commented Aug 5, 2024

@roeap if you want to give the other approval, we can merge this. If you still feel it's best to fix this differently, we can hold off.

@nicklan requested a review from @roeap, August 5, 2024 18:55
@roeap (Collaborator) left a comment


@nicklan - go ahead! We still need to actually fix this, but that may take a bit longer :).

@roeap merged commit 9a81b34 into delta-io:main, Aug 5, 2024
9 checks passed
@ion-elgreco deleted the fix/map_type_structure branch, August 5, 2024 20:28
ion-elgreco added a commit to ion-elgreco/delta-kernel-rs that referenced this pull request Aug 5, 2024