Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert to a simpler ownership model - but always duplicate ALTREP #1151

Merged
merged 6 commits into from
Jun 24, 2020

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Jun 15, 2020

Closes tidyverse/dplyr#5327
Closes #1124
Closes #1122

Reverts back to a slightly simpler ownership model of VCTRS_OWNED_true and VCTRS_OWNED_false.

When we own the object, we only ever attempt to duplicate it if it is ALTREP. If vec_init() ever creates ALTREP objects in the future (#837), this will be required. When doing assignment, we have to duplicate ALTREP objects before dereferencing even if we own them, because we need access to the actual data that it is representing, not the ALTREP object's internals.

When we don't own the object, we use r_clone_referenced() to determine if we need to duplicate or not.

This fixed the repeated duplication that was occurring in df_assign()'s calls to vec_proxy_assign_opts(), but I then discovered that we still had repeated duplication in vec_restore(). To fix that, I had to borrow from the ideas in #1124 and pass through owned to vec_restore() as well. This allows us to avoid attempting duplication here as well.

With both of those in place, tidyverse/dplyr#5327 is fixed:

library(dplyr)

date_frame <- tibble(date = rep(lubridate::date("2020-01-01"), 100))
date_frames_to_bind <- rep(list(date_frame), 10000)
bench::mark(bind_rows(date_frames_to_bind))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 bind_rows(date_frames_to_bind)    133ms    142ms      6.87    8.39MB     25.8

It is worth noting that @lionel- and I both think that eventually we should probably have recursive proxy and restore functions (#1107). This would allow df_assign() to not have to proxy and restore the output columns repeatedly, which is where many of these duplication issues arise. That would also allow us to remove the owned argument from vec_restore() again, which I am currently viewing as a temporary fix.


This also closes #1124, because it fixes the original problem there where columns of df-cols were being duplicated.

@DavisVaughan DavisVaughan marked this pull request as ready for review June 15, 2020 15:01
@DavisVaughan DavisVaughan requested a review from lionel- June 15, 2020 15:06
Copy link
Member

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I'm glad we're starting a set of performance tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants