Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Added input validation for explode operation in the array namespace #19163

Merged
merged 8 commits into from
Nov 1, 2024

Conversation

barak1412
Copy link
Contributor

Partly fixes #19049.

Now it has validation, but not in the place we want -
I want in seperate PR handle validation for all namespace operations, which suffer from same problem.

@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars labels Oct 9, 2024
Copy link

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.

Project coverage is 80.00%. Comparing base (48f6e9d) to head (cd34234).
Report is 179 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-plan/src/dsl/function_expr/array.rs 33.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19163      +/-   ##
==========================================
+ Coverage   79.65%   80.00%   +0.35%     
==========================================
  Files        1532     1527       -5     
  Lines      209101   209221     +120     
  Branches     2417     2415       -2     
==========================================
+ Hits       166557   167387     +830     
+ Misses      41998    41286     -712     
- Partials      546      548       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @barak1412.

Adding an Explode to the Array methods is the good approach. We only don't have to add other physical explode implementations as we already have a working one.

Upon convert the DSL to IR we should translate Expr::Array(Explode) -> Expr::Explode and during that conversion we should validate if the input type is indeed of DataType::Array.

@@ -170,6 +170,10 @@ pub trait ArrayNameSpace: AsArray {
};
Ok(out.into_series())
}
fn array_explode(&self) -> PolarsResult<Series> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need it on the namespace.

@@ -249,3 +253,7 @@ pub(super) fn shift(s: &[Column]) -> PolarsResult<Column> {

ca.array_shift(n.as_materialized_series()).map(Column::from)
}

pub(super) fn explode(s: &Column) -> PolarsResult<Column> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed.

@barak1412
Copy link
Contributor Author

barak1412 commented Oct 11, 2024

@ritchie46 Thanks.

So correct me if I am wrong -
We want to convert FunctionExpr::ArrayExpr(ArrayFunction::Explode) to Expr::Explode inside function conversions.
However, during the conversion, we convert the inputs to AExpr`IR` form, but I can't see how to access the type here, since we doesn't have schema in this context.

What do I miss?

@@ -56,6 +57,7 @@ impl ArrayFunction {
#[cfg(feature = "array_count")]
CountMatches => mapper.with_dtype(IDX_DTYPE),
Shift => mapper.with_same_dtype(),
Explode => mapper.map_to_list_and_array_inner_dtype(),
Copy link
Collaborator

@nameexhaustion nameexhaustion Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do the dtype validation here - e.g.

Explode => mapper.try_map_to_array_inner_dtype()?,

Where try_map_to_array_inner_dtype returns an error if the dtype isn't Array

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, we could maybe even add a check before this match block to ensure the dtype is Array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's needed. As that check would be required on every get_dtype. We already have this

fn check_namespace(function: &FunctionExpr, first_dtype: &DataType) -> PolarsResult<()> {

So I think it will resolve itself. During simplification of expressions we can then rewrite Array(Explode) to Explode.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - then adding Explode under ArrayFunction should already add input dtype validation

Copy link
Contributor Author

@barak1412 barak1412 Oct 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nameexhaustion
Thanks, I was thinking about this solution, but I prefered one validation to all namespaces operations.

@ritchie46
It's not going to wok because we don't hit check_namespace due to an extra condition in the match - if options.cast_to_supertypes.is_some():
/~https://github.com/pola-rs/polars/blob/main/crates/polars-plan/src/plans/conversion/type_coercion/mod.rs#L275

In addition, if we convert to Expr:Explode, we may not hit that match at all, I need to verify it.

So I propose -

  1. Make check_namespace hit all FunctionExpr.
  2. If the Array(Explode) -> Explode conversion won't hit, we will implement @nameexhaustion 's idea.

What do you think?

Edit:
@ritchie46 I saw you added to Expr the possibility to know its output type, so I will use it during conversion.

On Separate PR I will make all AExpr::Function hit the namespace validation.

@barak1412
Copy link
Contributor Author

I think only make check_namespace hit was left, it should help all namespaces' operations.

@ritchie46
Copy link
Member

Almost there @barak1412, you need to set the branch in simplify_expr under a dtype-array feature flag.

@barak1412
Copy link
Contributor Author

@ritchie46

I am still little stuck on the type validation, and I will be glad for direction -

simplify_expr happends before type_coercion, so the check_namespace will be possibly applied only after we are in the new AExp::Explode form.
So I have to make the validation before, in simply_exp, where the ArrayExpr(ArrayFunction::Explode) -> Explode happends.
However, I don't have there yet the logical plan arena (Arena<IR>), so I still can't fetch the input type there.

@barak1412 barak1412 changed the title refactor: Make explode part of the array namespace fix: Added input validation for explode operation in the array namespace Oct 14, 2024
@github-actions github-actions bot added the fix Bug fix label Oct 14, 2024
@barak1412
Copy link
Contributor Author

@ritchie46 I went for @nameexhaustion solution, due to what I wrote in the previous comment.

How does it look like?

@barak1412 barak1412 requested a review from ritchie46 October 21, 2024 15:16
Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @barak1412. Sorry for the late reply, but it looks great!

@ritchie46 ritchie46 merged commit d2a0441 into pola-rs:main Nov 1, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.arr.explode() fails to raise error for non-Array inputs
3 participants