Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of bind_rows on frames with a Date column #5327

Closed
dkutner opened this issue Jun 11, 2020 · 3 comments · Fixed by r-lib/vctrs#1151
Closed

Performance of bind_rows on frames with a Date column #5327

dkutner opened this issue Jun 11, 2020 · 3 comments · Fixed by r-lib/vctrs#1151

Comments

@dkutner
Copy link

dkutner commented Jun 11, 2020

Beginning in dplyr 1.0.0, running bind_rows on frames with a Date column is slow and memory intensive.

library(dplyr)

date_frame <- tibble(date = rep(lubridate::date("2020-01-01"), 100))
date_frames_to_bind <- rep(list(date_frame), 10000)
bench::mark(bind_rows(date_frames_to_bind))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 bind_rows(date_frames_to_bind)      30s      30s    0.0333    74.5GB     90.6

character_frame <- tibble(date = rep("2020-01-01", 100))
character_frames_to_bind <- rep(list(character_frame), 10000)
bench::mark(bind_rows(character_frames_to_bind))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 bind_rows(character_frames_to_bind)   65ms 67.6ms      14.7     8.2MB     20.3

Created on 2020-06-11 by the reprex package (v0.3.0)

Possibly related, running tidyr::unnest on a nested frame with a Date column yields similar issues. I've reproduced this issue with dplyr 0.8.5, tibble 2.1.3, and tidyr 1.0.2.

library(dplyr)

nested_frame <-
  tibble(
    a = rep(1:10000, 100),
    date = lubridate::date("2020-01-01")
  ) %>%
  tidyr::nest(data = -a)
bench::mark(a = tidyr::unnest(nested_frame, data))
#> # A tibble: 1 x 13
#>   expression   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 a          16.9s  16.9s    0.0592    74.5GB     104.     1  1757      16.9s
@lionel-
Copy link
Member

lionel- commented Jun 12, 2020

All the time is spent duplicating vectors in df_assign(). This is linked to r-lib/vctrs#1124 but also happens on R < 4.0.

@DavisVaughan We might have to rethink the ownership assumptions, and maybe consider reverting to a simple ownership parameter.

@DavisVaughan
Copy link
Member

DavisVaughan commented Jun 15, 2020

Slightly more minimal vctrs reprex

library(vctrs)

df <- data.frame(date = rep(as.Date("2020-01-01"), 100))
lst <- rep(list(df), 1000)

lst_rbind <- function(x) {
  vec_rbind(!!!x)
}

bench::mark(lst_rbind(lst))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lst_rbind(lst)    243ms    277ms      3.61     764MB     59.6

Created on 2020-06-15 by the reprex package (v0.3.0)

@DavisVaughan
Copy link
Member

DavisVaughan commented Jun 15, 2020

It seems like what is happening here is:

  • vec_init() creates the output structure which has a Date column

  • We loop over the lst data frames to assign, calling df_assign() on each

  • In df_assign() we always:

    • vec_proxy() each column of out

    • Assign into that column with vec_proxy_assign_opts()

    • vec_restore() that column

The problem is that vec_proxy() for Dates calls back up to an R level vec_proxy.Date(). This bumps the refcnt from 1 to 2 for that "fresh" Date column even though it doesn't really do anything.

So then when vec_proxy_assign_opts() is called it makes a copy, even though we have passed vctrs_ownership_total through, because it uses r_clone_shared() which duplicates if the refcnt is > 1 as it is here.

We might be able to avoid this with the following alternative approach to vec_rbind(), which I think would be generally more performant anyways.

  • Get the output ptype

  • Perform a recursive vec_proxy() on it, proxying the columns as well

  • Then vec_init() the proxied ptype to get a proxy object that represents the output container

  • Now df_assign() into this proxied version repeatedly. This is where the benefit comes in, we don't have to repeatedly proxy/restore each column (which is what causes the duplication)

  • vec_restore() recursively once at the end to get the real out

@lionel- reminded me of r-lib/vctrs#1107, which this seems to be related to

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants