Fixed growable of dictionaries negative keys #582
It may indeed take a while before this is fixed in pyarrow, as pyarrow 6.0.0 was released only a few days ago.
Wow, this looks like a vulnerability to me on the C++ side. If someone accesses those keys without checking bounds (which some implementations do), they may read out of bounds. I think this should be addressed in the pyarrow implementation: dictionary keys and offset deltas should never be negative. I think that our implementation is correct: it is the user's responsibility to ensure that the keys are non-negative, even if the type is a signed integer (for JVM reasons). I will follow up on this on the Arrow side.
What do you think about a temporary patch? ^_^ I guess it will take some months before a new pyarrow is shipped to all the pandas code that uses this. 🙈 By the way, the fix for reading null-value keys in the scalar API should still be pushed.
Thanks a lot!
I was wrong, this is valid in arrow and it is a bug here: we should accept negative keys if they are nulls. Null values can have any (initialized) value.
Left two minor comments, but otherwise ready to sail.
Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
Codecov Report
@@            Coverage Diff             @@
##             main     #582      +/-   ##
==========================================
+ Coverage   78.99%   79.01%   +0.01%
==========================================
  Files         399      399
  Lines       24752    24765      +13
==========================================
+ Hits        19554    19567      +13
  Misses       5198     5198
Pyarrow creates dictionary keys with negative integers that are masked out by the validity bitmap.
This PR fixes the conversion of those negative integers to usize: they are mapped to 0, which is fine because the values are masked.
Related downstream issue: pola-rs/polars#1686
Additionally, the scalar API also read masked-out values, so it was incorrect.
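
A minimal sketch of the two ideas above, not the code from this PR: the helpers `key_to_index` and `get` are hypothetical names, and plain slices stand in for the real array and bitmap types. It shows a negative key being mapped to 0 during conversion to usize, and a lookup that consults validity before touching the key at all.

```rust
/// Hypothetical helper: convert a signed dictionary key to an index.
/// Negative keys are mapped to 0. This is only sound because such keys
/// are expected to sit in null (masked-out) slots.
fn key_to_index(key: i32) -> usize {
    if key < 0 {
        0
    } else {
        key as usize
    }
}

/// Hypothetical validity-aware lookup: a null slot returns `None`
/// without its key ever being used to index into `values`.
fn get<'a>(keys: &[i32], validity: &[bool], values: &[&'a str], i: usize) -> Option<&'a str> {
    if !validity[i] {
        return None;
    }
    Some(values[key_to_index(keys[i])])
}

fn main() {
    // The key at slot 1 is negative, but that slot is masked out.
    let keys = [0, -1, 1];
    let validity = [true, false, true];
    let values = ["a", "b"];

    let out: Vec<Option<&str>> = (0..keys.len())
        .map(|i| get(&keys, &validity, &values, i))
        .collect();
    println!("{:?}", out); // prints [Some("a"), None, Some("b")]
}
```

Checking validity first means the placeholder index 0 is never observed by a reader, which is why any initialized value is acceptable in a null slot.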