fix: make Map arrow type consistent with Map parquet type #299
Conversation
@roeap can you take a look at this one? IIRC you had uncovered a bunch of intricacies in the naming of these fields, and even the number of fields involved, depending on which parquet writer produced the file? Perhaps we need a way for kernel to hide those details from engine, so engine can always count on getting the same style of map schema?
In a discussion with Hannes from DuckDB, he told me about his experience implementing their Parquet reader.
I guess where I am at is that the only robust way of processing this is via ordinals, not relying on names. That said, since we are creating an arrow schema here I would personally lean more towards consistency with arrow-rs, where it is called "entries" as we do here. In practice I guess a parquet writer should make sure it's spec compliant. Then again, I don't think the various arrow implementations are consistent, as the arrow spec just mentions ordinals - IIRC. So my leaning towards "entries" is very slight and I would also be fine changing it. On my todo list is to upstream the parquet logic into delta-rs, at which point I will try to make as much of the processing ordinal based as I can.
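For illustration, a minimal sketch of what ordinal-based handling could look like with arrow-rs (the helper name and signature are made up here, not kernel or delta-rs code): the key and value fields are taken as the first and second child of the entries struct, so whether the schema says "entries", "key_value", "keys", or "key" never matters.

```rust
use arrow_schema::{DataType, Field};

/// Hypothetical helper: pull the key/value fields out of an Arrow Map type
/// by ordinal rather than by name, so the "entries"/"key_value" and
/// "key"/"keys" naming conventions all behave the same way.
fn map_key_value_fields(dt: &DataType) -> Option<(&Field, &Field)> {
    if let DataType::Map(entries, _sorted) = dt {
        if let DataType::Struct(children) = entries.data_type() {
            if children.len() == 2 {
                // First child is the key field, second is the value field,
                // regardless of what either is called.
                return Some((children[0].as_ref(), children[1].as_ref()));
            }
        }
    }
    None
}
```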
But it's an issue at the moment for delta-rs; the easiest fix would be to set these names properly in kernel-rs. Otherwise we have to manually fix these wrong schemas afterwards. Another way would be to allow users to provide the custom names they prefer in the delta->arrow schema conversion, so there are no assumptions on the spec.
While that is correct, we are not controlling how all the tables we are reading were written. In fact we are not even controlling the schema conventions of the data we are writing in delta-rs. I guess this is just a fix tailored to that one specific case, rather than solving the issue more permanently. I.e. at some point a table written by another writer comes along and we see the same issue again for another case. As long as we are relying on a specific convention there, I fear we will always just cover a subset of cases 😦. Do we have a repro somewhere for the spark issue? I'd like to investigate that a bit and see if we cannot find a more permanent solution. As to whether we should adopt this change here, I'd rely on @scovich @nicklan @zachschuermann opinions to tip the scales.
If people follow the parquet spec, then we should never receive data that's structured differently than that, though.

@roeap You can run this snippet:

```python
from deltalake import DeltaTable

dt = DeltaTable("crates/test/tests/data/delta-1.2.1-only-struct-stats")
dt.optimize.compact()
```

The issue will occur during the `optimize.compact()` call.
Wouldn't that be nice :D. But again, we are not talking about parquet here, this is an arrow schema, where the parquet spec technically does not apply. So the issue is more around how one schema maps to the other... In fact none of the arrow implementations I am aware of use this in their schemas. arrow-rs does it as we do here, and C++ as well (I think), albeit C++ and Rust differ in how they name the internal list field.
I know :), but we read parquet data into arrow RecordBatches, so the Arrow data gets structured following the parquet schema. As we are just round-tripping Parquet -> Arrow -> Parquet, it would be easier to just follow the parquet structure. Also, folks from arrow-rs pointed out that data gets written to Parquet in whatever way it's structured in Arrow, so if we don't set it right there, we potentially violate the parquet schema. We can also fix it by improving our RecordBatch casting logic and schema resolving (it's just more work :P), but we probably need to do this soon anyway as we currently do a lot of costly casts, such as Utf8View to Utf8.
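A hedged sketch of what that casting/schema-resolving fix might look like on the schema side, using arrow-rs (the helper is hypothetical, handles only a top-level Map, and is not delta-rs' actual casting code): rewrite the Map's inner field names to the parquet convention before comparing or writing.

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Fields};

/// Hypothetical helper: rebuild a Map type so its inner fields follow the
/// parquet convention ("key_value" / "key" / "value"). Nested types (maps
/// inside structs or lists) would need a recursive version of this.
fn normalize_map_names(dt: &DataType) -> DataType {
    if let DataType::Map(entries, sorted) = dt {
        if let DataType::Struct(children) = entries.data_type() {
            if children.len() == 2 {
                // Keep the key/value types and nullability, only rename.
                let key = children[0].as_ref().clone().with_name("key");
                let value = children[1].as_ref().clone().with_name("value");
                let entries = Field::new(
                    "key_value",
                    DataType::Struct(Fields::from(vec![key, value])),
                    entries.is_nullable(),
                );
                return DataType::Map(Arc::new(entries), *sorted);
            }
        }
    }
    dt.clone()
}
```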
@roeap @scovich someone else is reporting the issue now as well: delta-io/delta-rs#2731. This only started after adopting kernel in 0.18.
I'm inclined to merge this, but also, delta-rs guys, you should move to ordinal-based processing. So, why merge? Mainly:
- unblock delta-rs
- The parquet spec says: "It is required that the repeated group of key-value pairs is named key_value and that its fields are named key and value. However, these names may not be used in existing data and should not be enforced as errors when reading." So really it shouldn't be an error, but let's try and be compliant
- It's odd that arrow doesn't follow that; they also say: "In a field with Map type, the field has a child Struct field, which then has two children: key type and the second the value type. The names of the child fields may be respectively “entries”, “key”, and “value”, but this is not enforced.", which is a lot less strong than the parquet statement
- it's what the java kernel does
- it's what pyiceberg does
@roeap if you want to give the other approval we can merge this. If you still feel it's best to fix differently, we can hold off.
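For reference, a minimal sketch (not kernel's actual schema-conversion code; the field name and key/value types are made up) of the layout those quotes describe, built with arrow-rs: the entries struct is named "key_value" and its children "key" and "value", matching the parquet LogicalTypes spec.

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Fields};

/// Example of a parquet-spec-style Arrow Map field: map<string, i64>.
fn example_map_field() -> Field {
    let entries = Field::new(
        "key_value",
        DataType::Struct(Fields::from(vec![
            Field::new("key", DataType::Utf8, false),   // map keys are required
            Field::new("value", DataType::Int64, true), // map values may be null
        ])),
        false, // the repeated entries group itself is not nullable
    );
    Field::new("my_map", DataType::Map(Arc::new(entries), /* sorted */ false), true)
}
```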
@nicklan - go ahead! We still need to actually fix this, but that may take a bit longer :).
When we read parquet Map types in delta-rs, the data gets read with the root name "key_value", and the struct type has two fields, "key" and "value". We have casting logic that expects this schema; however, when we convert our snapshot.schema to an arrow schema, a mismatch happens because kernel is using other field names.

Error when we optimize tables with Map types:

```python
SchemaMismatchError: Could not find column keys
```

Theoretically arrow fields can use whatever names they want, but since we are reading parquet data and therefore getting these RecordBatches with that structure, it would be much easier to keep the structure of the datatypes consistent between the two.

See parquet docs:
/~https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps