02-intro-to-dtplyr.Rmd

```{r intro-to-dtplyr, include = FALSE}
eval_dtplyr <- FALSE
if(Sys.getenv("GLOBAL_EVAL") != "") eval_dtplyr <- Sys.getenv("GLOBAL_EVAL")
```


```{r, eval = eval_dtplyr, include = FALSE}
library(data.table)
library(dtplyr)
library(dplyr)
library(lobstr)
library(fs)
library(purrr)
```

# Introduction to `dtplyr`

## `dtplyr` basics
*Load data into R via `data.table`, and then wrap it with `dtplyr`*

1. Load the `data.table`, `dplyr`, `dtplyr`, `purrr` and `fs` libraries
    ```{r, eval = eval_dtplyr}
    library(data.table)
    library(dplyr)
    library(dtplyr)
    library(purrr)
    library(fs)
    ```

2. Read the **transactions.csv** file, from the **/usr/share/class/files** folder. Use the `fread()` function to load the data into a variable called `transactions`
    ```{r, eval = eval_dtplyr}
    transactions <- dir_ls("/usr/share/class/files", glob = "*.csv") %>%
       map(fread) %>%
       rbindlist()
    ```

3. Preview the data using `glimpse()`
    ```{r, eval = eval_dtplyr}
    
    ```

4. Use `lazy_dt()` to "wrap" the `transactions` variable into a new variable called `dt_transactions`
    ```{r, eval = eval_dtplyr}
    
    ```

5. View `dt_transactions` structure with `glimpse()`
    ```{r, eval = eval_dtplyr}
    
    ```

## Object sizes
*Confirm that `dtplyr` is not making copies of the original `data.table`*

1. Load the `lobstr` library
    ```{r, eval = eval_dtplyr}
    library(lobstr)
    ```

2. Use `obj_size()` to obtain `transactions`'s size in memory
    ```{r, eval = eval_dtplyr}
    
    ```

3. Use `obj_size()` to obtain `dt_transactions`'s size in memory
    ```{r, eval = eval_dtplyr}
    
    ```

4. Use `obj_size()` to obtain `dt_transactions` and `transactions` size in memory together
    ```{r, eval = eval_dtplyr}
    
    ```

## How `dtplyr` works
*Under the hood view of how `dtplyr` operates `data.table` objects*

1. Use `dplyr` verbs on top of `dt_transactions` to obtain the total sales by month
    ```{r, eval = eval_dtplyr}
    dt_transactions %>%
      group_by(date_month) %>%
      summarise(total_sales = sum(price))
    ```

2. Load the above code into a variable called `by_month`
    ```{r, eval = eval_dtplyr}
    
    ```

3. Use `show_query()` to see the `data.table` code that `by_month` actually runs
    ```{r, eval = eval_dtplyr}
    
    ```

4. Use `glimpse()` to view how `by_month`, instead of modifying the data, only adds steps that will later be executed by `data.table`
    ```{r, eval = eval_dtplyr}
    
    ```
    
5. Create a new column using `mutate()`
    ```{r, eval = eval_dtplyr}
    dt_transactions %>%
      mutate(new_field = price / 2)
    ```

6. Use `show_query()` to see the `copy()` command being used
    ```{r, eval = eval_dtplyr}
    
    ```

7. Check to confirm that the new column *did not* persist in `dt_transactions`
    ```{r, eval = eval_dtplyr}
    
    ```

8. Use `lazy_dt()` with the `immutable` argument set to `FALSE` to avoid the copy
    ```{r, eval = eval_dtplyr}
    m_transactions <- lazy_dt(copy(transactions), immutable = FALSE)
    ```

    ```{r, eval = eval_dtplyr}
    m_transactions
    ```

9. Create a `new_field` column in `m_transactions` using `mutate()`
    ```{r, eval = eval_dtplyr}
    m_transactions %>% 
      mutate(new_field = price / 2)
    ```

10. Use `show_query()` to see that `copy()` is no longer being used
    ```{r, eval = eval_dtplyr}
    
    ```

11. Inspect `m_transactions` to see that `new_field` has persisted
    ```{r, eval = eval_dtplyr}
    
    ```

## Working with `dtplyr`
*Learn data conversion and basic visualization techniques*

1. Use `as_tibble()` to convert the results of `by_month` into a `tibble`
    ```{r, eval = eval_dtplyr}
    by_month %>%
      as_tibble()
    ```

2. Load the `ggplot2` library
    ```{r, eval = eval_dtplyr}
    library(ggplot2)
    ```

3. Use `as_tibble()` to convert before creating a line plot 
    ```{r, eval = eval_dtplyr}
    by_month %>%
      
      ggplot() +
      geom_line(aes(date_month, total_sales))
    ```

## Pivot data
*Review a simple way to aggregate data faster, and then pivot it as a tibble*

1. Load the `tidyr` library
    ```{r, eval = eval_dtplyr}
    library(tidyr)
    ```

2. Group `db_transactions` by `date_month` and `date_day`, then aggregate `price` into `total_sales`
    ```{r, eval = eval_dtplyr}
    dt_transactions %>%
      group_by(date_month, date_day) %>% 
      summarise(total_sales = sum(price))
    ```

3. Copy the aggregation code above, **collect it into a `tibble`**, and then use `pivot_wider()` to make the `date_day` the column headers.
    ```{r, eval = eval_dtplyr}
    dt_transactions %>%
      group_by(date_month, date_day) %>% 
      summarise(total_sales = sum(price)) %>%
      
      pivot_wider(names_from = date_day, values_from = total_sales)
    ```