Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Improved reading nullable Avro arrays #727

Merged
merged 1 commit into from
Jan 5, 2022

Conversation

Igosuki
Copy link
Contributor

@Igosuki Igosuki commented Jan 3, 2022

This helps with deserializing the schema Union([Array, Null]) or Union([Null, Array]) which is a valid avro schema (and the default for sparks list arrays for instance).

@Igosuki Igosuki changed the title Arrays can be null Avro arrays can be null Jan 3, 2022
@codecov
Copy link

codecov bot commented Jan 3, 2022

Codecov Report

Merging #727 (8422a02) into main (ef7937d) will increase coverage by 0.25%.
The diff coverage is 40.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #727      +/-   ##
==========================================
+ Coverage   70.25%   70.51%   +0.25%     
==========================================
  Files         312      311       -1     
  Lines       17016    16925      -91     
==========================================
- Hits        11955    11934      -21     
+ Misses       5061     4991      -70     
Impacted Files Coverage Δ
src/io/avro/read/deserialize.rs 65.85% <40.00%> (-1.38%) ⬇️
src/datatypes/schema.rs 24.00% <0.00%> (-8.00%) ⬇️
src/io/csv/write/mod.rs 65.38% <0.00%> (-3.85%) ⬇️
src/io/parquet/write/record_batch.rs 82.75% <0.00%> (-3.45%) ⬇️
src/array/list/mutable.rs 74.28% <0.00%> (-2.19%) ⬇️
src/io/ipc/read/common.rs 79.64% <0.00%> (-1.71%) ⬇️
src/io/csv/read_utils.rs 62.00% <0.00%> (-1.16%) ⬇️
src/io/ipc/write/stream.rs 73.07% <0.00%> (-1.00%) ⬇️
src/io/ipc/read/reader.rs 72.46% <0.00%> (-0.78%) ⬇️
src/compute/filter.rs 52.85% <0.00%> (-0.77%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef7937d...8422a02. Read the comment docs.

@jorgecarleitao
Copy link
Owner

Looks great. Thanks! Do you have a recommendation on how we could test this? (we do not need for this PR, but I am curious). E.g. we have a procedure to test that our parquet files are readable by spark, via pyspark. Could we use pyspark to write a bunch of avro files and test against that?

@Igosuki
Copy link
Contributor Author

Igosuki commented Jan 3, 2022

@jorgecarleitao I test it using the avro files at /~https://github.com/apache/arrow-testing/tree/a8f7be380531758eb7962542a5eb020d8795aa20/data/avro (they are copy of the parquet files).
You're right that actually doing a round trip with arrow and pyspark would be a great integration test... But in the meantime, the test mod here is enough to test this PR : /~https://github.com/Igosuki/arrow-datafusion/blob/arrow22r/datafusion/src/avro_to_arrow/arrow_array_reader.rs happy to port the code here

@jorgecarleitao
Copy link
Owner

Sounds good.

I have added #733 as a follow-up. Generally I prefer to run integrations with third parties against code (instead of files), so that we can also know (and reproduce) how the file was generated in the first place.

Could you fix the fmt so we can merge it green?

@Igosuki
Copy link
Contributor Author

Igosuki commented Jan 5, 2022

Done

@jorgecarleitao jorgecarleitao merged commit 9ed6aba into jorgecarleitao:main Jan 5, 2022
@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Jan 5, 2022
@jorgecarleitao jorgecarleitao changed the title Avro arrays can be null Improved reading nullable Avro arrays Jan 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement An improvement to an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants