-
Notifications
You must be signed in to change notification settings - Fork 433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: change arrow map root name to follow with parquet root name #2538
Conversation
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
@sclmn - thanks for opening this PR. Would you mind elaborating a bit on the use case / problem you are encountering? Main reason I am asking is that we have been struggling a bit to come up with a system that would be less opinionated about "internal" field names. Here we are not really configuring names in parquet, but rather arrow, which may be written to parquet. In the arrow world, there is a difference between implementations which name is used, and it actually sometimes does not even matter. In principle I would be fine merging this, but I would not want to promise, that this name stays like this indefinitely, as it conflicts with arrow, where we have a larger surface area. |
The delta lake protocol defines the type of various fields as map (eg partitionValues) and since the format of the checkpoint is parquet we expect key_value. If you look at the link above it states: The outer-most level must be a group annotated with MAP that contains a single field named key_value. The repetition of this level must be either optional or required and determines whether the list is nullable. Various code validates this to some degree (eg arrow-rs ). |
certainly, but I believe we may be mixing things up here a bit, so just making sure we are talking about the same things here.
SO the thing we want to change in this PR, only controls how data is represented in memory. The name entries is in accordance with how arrow-rs models map arrays - albeit I think arrow-rs does not require that specific name. IIRC, different arrow implementations made different choices. SO ultimately I believe it is the responsibility of the parquet writer, to make sure that the "Inner" map field gets the right name according to the spec. That said, we have seen several instances now, where the mismatch in conventions how these inner fields are named (also applied to the varies lists variants in parquet etc.) became quite bothersome. As such I'd be keen to understand a little bit better where you are hitting an error based on this name to finally find a sustainable solution how we can handle the various schema conversions. |
I'm trying to fix the on disk result (working on supporting this at BigQuery). The current code generates delta lake checkpoints that have an improper parquet schema. For instance, the tags field looks like: |
In regards to this, the Arrow specification is a lot more lenient on the the layout of maps than parquet. @alamb @tustvold are there any flags in Parquet-rs to make it transparently write map types with compliant naming? |
…rmat/blob/master/LogicalTypes.md#maps. The indication is to have a key_value name instead of entries.
I think we should keep the Parquet field names here, cause we are going back and forth between arrow and parquet it's good to keep this consistent. As @roeap has pointed out there is no real right here, but it seems that using parquet names gives less problems then using the arrow names. This already happens here as well: delta-io/delta-kernel-rs#299, because most of our code expects the arrow datatypes to be structured the way parquet datatypes are, we are unfortunately getting schema mismatches when we try to cast recordbatches |
Description
Change root field name for map to key_value
Related Issue(s)
#2182
Documentation
/~https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps