Performance of bind_rows on frames with a Date column #5327

dkutner · 2020-06-11T19:44:04Z

Beginning in dplyr 1.0.0, running bind_rows on frames with a Date column is slow and memory intensive.

library(dplyr)

date_frame <- tibble(date = rep(lubridate::date("2020-01-01"), 100))
date_frames_to_bind <- rep(list(date_frame), 10000)
bench::mark(bind_rows(date_frames_to_bind))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 bind_rows(date_frames_to_bind)      30s      30s    0.0333    74.5GB     90.6

character_frame <- tibble(date = rep("2020-01-01", 100))
character_frames_to_bind <- rep(list(character_frame), 10000)
bench::mark(bind_rows(character_frames_to_bind))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 bind_rows(character_frames_to_bind)   65ms 67.6ms      14.7     8.2MB     20.3

^{Created on 2020-06-11 by the reprex package (v0.3.0)}

Possibly related, running tidyr::unnest on a nested frame with a Date column yields similar issues. I've reproduced this issue with dplyr 0.8.5, tibble 2.1.3, and tidyr 1.0.2.

library(dplyr)

nested_frame <-
  tibble(
    a = rep(1:10000, 100),
    date = lubridate::date("2020-01-01")
  ) %>%
  tidyr::nest(data = -a)
bench::mark(a = tidyr::unnest(nested_frame, data))
#> # A tibble: 1 x 13
#>   expression   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 a          16.9s  16.9s    0.0592    74.5GB     104.     1  1757      16.9s

The text was updated successfully, but these errors were encountered:

lionel- · 2020-06-12T03:53:34Z

All the time is spent duplicating vectors in df_assign(). This is linked to r-lib/vctrs#1124 but also happens on R < 4.0.

@DavisVaughan We might have to rethink the ownership assumptions, and maybe consider reverting to a simple ownership parameter.

DavisVaughan · 2020-06-15T12:24:51Z

Slightly more minimal vctrs reprex

library(vctrs)

df <- data.frame(date = rep(as.Date("2020-01-01"), 100))
lst <- rep(list(df), 1000)

lst_rbind <- function(x) {
  vec_rbind(!!!x)
}

bench::mark(lst_rbind(lst))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lst_rbind(lst)    243ms    277ms      3.61     764MB     59.6

^{Created on 2020-06-15 by the reprex package (v0.3.0)}

DavisVaughan · 2020-06-15T12:36:21Z

It seems like what is happening here is:

vec_init() creates the output structure which has a Date column
We loop over the lst data frames to assign, calling df_assign() on each
In df_assign() we always:
- vec_proxy() each column of out
- Assign into that column with vec_proxy_assign_opts()
- vec_restore() that column

The problem is that vec_proxy() for Dates calls back up to an R level vec_proxy.Date(). This bumps the refcnt from 1 to 2 for that "fresh" Date column even though it doesn't really do anything.

So then when vec_proxy_assign_opts() is called it makes a copy, even though we have passed vctrs_ownership_total through, because it uses r_clone_shared() which duplicates if the refcnt is > 1 as it is here.

We might be able to avoid this with the following alternative approach to vec_rbind(), which I think would be generally more performant anyways.

Get the output ptype
Perform a recursive vec_proxy() on it, proxying the columns as well
Then vec_init() the proxied ptype to get a proxy object that represents the output container
Now df_assign() into this proxied version repeatedly. This is where the benefit comes in, we don't have to repeatedly proxy/restore each column (which is what causes the duplication)
vec_restore() recursively once at the end to get the real out

@lionel- reminded me of r-lib/vctrs#1107, which this seems to be related to

DavisVaughan mentioned this issue Jun 15, 2020

Revert to a simpler ownership model - but always duplicate ALTREP r-lib/vctrs#1151

Merged

hadley added performance 🚀 vctrs ↗️ wip work in progress labels Jun 22, 2020

DavisVaughan closed this as completed in r-lib/vctrs#1151 Jun 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance of bind_rows on frames with a Date column #5327

Performance of bind_rows on frames with a Date column #5327

dkutner commented Jun 11, 2020

lionel- commented Jun 12, 2020

DavisVaughan commented Jun 15, 2020 •

edited

Loading

DavisVaughan commented Jun 15, 2020 •

edited

Loading

Performance of bind_rows on frames with a Date column #5327

Performance of bind_rows on frames with a Date column #5327

Comments

dkutner commented Jun 11, 2020

lionel- commented Jun 12, 2020

DavisVaughan commented Jun 15, 2020 • edited Loading

DavisVaughan commented Jun 15, 2020 • edited Loading

DavisVaughan commented Jun 15, 2020 •

edited

Loading

DavisVaughan commented Jun 15, 2020 •

edited

Loading