innerjoin fast path where join column is allequal? #3247

Closed
anandijain opened this issue Dec 12, 2022 · 3 comments · Fixed by #3233
@anandijain
Contributor

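ts_ = map(x -> x.t, dfs)        # the :t column of every data frame
@test allequal(ts_)             # all join columns are identical
@test all(issorted, ts_)        # and each one is sorted
@test length(dfs) == 11128
@test size(dfs[1]) == (5761, 6)
j = innerjoin(dfs...; on=:t)    # this call overflows the stack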

Currently I get a StackOverflowError and a crash when attempting a join of this size (~2.5GB). What is the recommended way to do big joins of this nature?

I don't have an MWE, but it should be relatively easy to make a repro for this.

@bkamins bkamins added this to the 1.5 milestone Dec 12, 2022
@bkamins
Member

bkamins commented Dec 12, 2022

The StackOverflowError is most likely caused by innerjoin using recursion in this case; it is unrelated to the data size.
Instead of splatting, do the join iteratively:

foldl((df1, df2) -> innerjoin(df1, df2, on=:t), dfs)

However, if allequal(ts_) is true, you can instead just use insertcols! (or hcat), which should be faster.
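A minimal sketch of the insertcols! route, assuming every data frame has an identical :t column in the same row order and that the non-key column names do not clash. The toy `dfs` and the column names x1, x2, x3 are made up for illustration and are not the data from this issue. allequal requires Julia 1.8 or later.

```julia
using DataFrames

# Toy stand-ins for `dfs`: identical, sorted :t plus one distinct value column each.
dfs = [DataFrame(:t => 1:3, Symbol("x", i) => fill(Float64(i), 3)) for i in 1:3]

# The fast path is only valid when all key columns match row for row.
@assert allequal(df.t for df in dfs)

# Start from a copy of the first data frame, then append the non-key
# columns of every other data frame without doing any join.
j = copy(dfs[1])
for df in dfs[2:end], name in names(df, Not(:t))
    # insertcols! copies the column by default and errors on a duplicate name.
    insertcols!(j, Symbol(name) => df[!, name])
end
```

Because the columns are appended positionally, this only matches the result of a join when the keys really are equal element by element, which is why the allequal check comes first.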

@anandijain
Contributor Author

Is there a reason not to define innerjoin(dfs::Vector) and just have it call a fold?
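For illustration, a wrapper along these lines could look like the sketch below. `innerjoin_all` is a hypothetical name, not a function in DataFrames.jl, and the toy data is made up. It only forwards keyword arguments to the two-argument innerjoin through foldl.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): join a vector of data
# frames pairwise, left to right, instead of splatting them into one call.
function innerjoin_all(dfs::AbstractVector{<:AbstractDataFrame}; kwargs...)
    return foldl((df1, df2) -> innerjoin(df1, df2; kwargs...), dfs)
end

# Toy input: three data frames sharing the key column :t.
dfs = [DataFrame(:t => 1:3, Symbol("x", i) => fill(Float64(i), 3)) for i in 1:3]

j = innerjoin_all(dfs; on=:t)
```

Since foldl loops over the vector rather than recursing over a long argument list, the number of data frames does not affect the call depth.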

@bkamins bkamins added bug and removed question labels Dec 12, 2022
@bkamins
Member

bkamins commented Dec 12, 2022

I will fix this. We used recursion because we did not anticipate such wide inputs.
