read_database_uri triggers ShapeError since 1.18 when querying 2 tables joined with same columns names in the 2 tables #20616

Closed
2 tasks done
Niivii opened this issue Jan 8, 2025 · 7 comments · Fixed by #20624
Assignees
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@Niivii

Niivii commented Jan 8, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.read_database_uri(
    query="select table1.id as t1id, table2.id as t2id from table1 join table2 on table1.id = table2.id;",
    uri="uri",
    engine="adbc",
)

Log output

polars.exceptions.ShapeError: X+1 column names provided for a DataFrame of width X

Issue description

Bug introduced in 1.18
Only one column is returned when the two joined tables contain a column with the same name.
Cross-referencing a similar report: duckdb/duckdb#15528

Expected behavior

2 columns returned

Installed versions

Bug exists on 1.18 and 1.19
No issue on 1.17.1

@Niivii Niivii added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 8, 2025
@alexander-beedie alexander-beedie self-assigned this Jan 8, 2025
@alexander-beedie
Collaborator

alexander-beedie commented Jan 8, 2025

Looking at the referenced DuckDB report/example it seems that we should actually be raising a DuplicateError on our side here, as the error only manifests when we are presented with duplicate columns; pyarrow.Table objects allow this, but we don't 🤔

(I was able to replicate the exact error via some local queries using connectorx).
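For readers who want to check a result for this condition before it reaches Polars, here is a minimal sketch of a duplicate-name check. The helper `find_duplicate_names` and the sample column list are hypothetical and not part of Polars; the names could come from a cursor description or from a `pyarrow.Table`'s `column_names`.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(names)
    # Counter keeps insertion order, so duplicates are reported in first-seen order.
    return [name for name, n in counts.items() if n > 1]


# Hypothetical column names, made up for illustration.
columns = ["t1id", "t2id", "created_at", "t1id"]
print(find_duplicate_names(columns))
```

An empty list means every column name is unique; anything else names the columns that would collide.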

@Niivii
Author

Niivii commented Jan 8, 2025

I am not sure this is a problem of duplicate column names, as they are renamed with an 'as' clause.
The column with the same name from the joined table is simply not fetched anymore since 1.18.

So if we join two tables that both contain 'similar_name_field' and we run:

select t1.similar_name_field as field_renamed_from_t1,
       t2.similar_name_field as field_renamed_from_t2
from t1 join t2 on t1.id = t2.id;

A dataframe of width 1 will be fetched instead of 2 since v1.18.

I can hardly believe this is the expected behavior, or that it should trigger a DuplicateError.
I would expect to get a dataframe of width 2 with columns field_renamed_from_t1 and field_renamed_from_t2 (and it has always been the case until 1.18).

@alexander-beedie
Collaborator

alexander-beedie commented Jan 9, 2025

It will only trigger a DuplicateError in the case that we really do receive duplicate colnames, which is what the linked DuckDB report was showing, and what I was able to reproduce. Note that databases I tested locally correctly returned two columns when the aliases were distinct (PostgreSQL, SQLite).

If you can clarify your example above (which isn't reproducible as the uri is just "uri", and there's nothing to connect to ;) then I can take another look, but the specific ShapeError you are reporting would only happen if there really ARE duplicate column names in the cursor/query result.

Note that if the query returns duplicate column names despite your aliasing then this is not an error on our side, it would be an error on the database side, and raising DuplicateError in this case will help diagnose that as it will report the specific duplicate column name.
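One way to see what the cursor really reports is to inspect its column names directly, without Polars in the loop. The sketch below is only an illustration under stated assumptions: an in-memory SQLite database stands in for the real PostgreSQL one, the tables reuse the `table1`/`table2` names from the report with made-up contents, and `result_column_names` is a hypothetical helper.

```python
import sqlite3


def result_column_names(conn, query):
    """Run a query and return the column names reported by the cursor."""
    cur = conn.execute(query)
    # Per the DB-API, the first item of each description entry is the column name.
    return [col[0] for col in cur.description]


# In-memory SQLite stands in for the real database; contents are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE table2 (id INTEGER);
    INSERT INTO table1 VALUES (1);
    INSERT INTO table2 VALUES (1);
    """
)

# Distinct aliases: the cursor reports two different names.
distinct = result_column_names(
    conn,
    "SELECT table1.id AS t1id, table2.id AS t2id "
    "FROM table1 JOIN table2 ON table1.id = table2.id",
)

# The same alias used twice: the cursor reports the name twice.
clashing = result_column_names(
    conn,
    "SELECT table1.id AS t1id, table2.id AS t1id "
    "FROM table1 JOIN table2 ON table1.id = table2.id",
)

print(distinct)
print(clashing)
```

If the second kind of result shows up for a query whose aliases look distinct, the duplication is already present in what the database returned.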

@Niivii
Author

Niivii commented Jan 9, 2025

Hello Alexander,

The uri points to a PostgreSQL database: f"postgresql://{user}:{password}@{host}:{port}/{database}"
When I fetch this query with polars 1.17.1 and below (this dataflow has been running continuously for months with several version combinations of polars/connectorx/adbc driver), I get the dataframe correctly, as usual.
When I test it with polars 1.18 and 1.19 I get this ShapeError.

The linked DuckDB issue is indeed not the same, as they are not using aliases and are therefore duplicating columns.
Apologies for the misleading pointer! It looked similar to me, and so did the timing :-)

@Niivii
Author

Niivii commented Jan 9, 2025

I will try to make a more reproducible case and debug with more info !

@Niivii
Author

Niivii commented Jan 9, 2025

There were indeed duplicates in my aliases 🤡
As there were 100+ columns I didn't catch it...

Thanks a lot !

@Niivii Niivii closed this as completed Jan 9, 2025
@alexander-beedie
Collaborator

> There were indeed duplicates in my aliases 🤡 As there were 100+ columns I didn't catch it...

Lol... well, in the next release we'll actually tell you which column it is in the error msg 😁

> Thanks a lot !

No problem; thanks for taking the time to make an Issue.
