Script - Manipulation with data using data.table.Rmd

---
title: "Wrangling Data at scale with data.table"
output: html_document
date: "`r format(Sys.time(), '%d %B %Y')`"
  html_document:
    toc: TRUE
    df_print: paged
    number_sections: FALSE
    highlight: tango
    theme: lumen
    toc_depth: 3
    toc_float: true
    css: custom.css 
    self_contained: false
    includes:
      after_body: footer.html
---
  
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.align = 'center', echo = TRUE)
```

***

# Wrangling Data at scale with Data.table
  
# For this exercise we will need to install two packages. The first will be data.table and the second will be the coronavirus package, which summarizes Coronavirus (COVID-19) cases across countries/provinces/states. 

```{r}
install.packages("data.table")
install.packages("coronavirus")
```

# Afterwards, load them.

```{r}
library(data.table)
library(coronavirus)
```

# This guide will do data manipulation with datatables. A brief explainer: Datatables are in fact data frames, but it permits to perform operations by groups besides the classic operations with rows and columns. One can say that it is capable of processing operation in a third dimension besides the traditional two-dimensional rows and columns. 

### Create the data.table of the coronavirus and check its dimensions

```{r}
table_corona <- as.data.table(coronavirus)
```

### Its super extensive. It's worth to note that data tables are fast, so it can save you lots of time doing filtering, ordering, grouping, and so on when compared with other packages. This is quite relevant when operating with large datasets, such as the coronavirus one that we are using in this exercise. Check the columns and rows.

```{r}
ncol(table_corona)

dim(table_corona)
```

# Now, Let's do some filtering. 

```{r}
table_corona[3:4]

table_corona[3:4, ]

table_corona[!(1:5)]
```

# or 

```{r}
table_corona[-(1:5)]
```

# This is quite similar to the filtering operation that we are used to.

```{r}
table_corona[.N]
```

### The .N is an integer. If used during the "i" argument, it gives back total number of rows in the data table. We obtained the last row using it within the square brackets. It's analogous to nrow().  

```{r}
EU_corona <- table_corona[continent_name == "Europe"]
```

### We will continue filtering, but will rely to other alternatives that may hasten the process. We will filter using %like%, which will look for patterns in character observations. Say that you want something more specific from England, such as Channel Islands.

```{r}
Uk_corona <- EU_corona[combined_key %like% "^Channel Islands"]
```

### Therefore, %like% will subset while looking for all rows that contain the given character string in a specified column.

### We will proceed to %between%. It is extremely useful on numeric columns, allowing us to specify intervals. 

```{r}
EU_corona[cases %between% c(200, 300)]
```

# instead of doing the traditional way.

```{r}
EU_corona [cases >= 200 & cases <= 300]
```

### The %chin% is perfect for searching for multiple character strings.

```{r}
EU_corona[country %chin% c("Germany", "France", "Italy")]
```

# General form of data.table

### From the general form of data.table, we use the second argument j to select and compute columns. 
### Remember to Take DT, filter rows in "i", then compute "j", grouped by "by".

DT[i, j, by]

### One difference between data frames and data tables is that when selecting a single column from a data frame, the result will be a vector. In a data table, the output will remain a data table.

### We will check some examples by selecting columns. Remember to avoid using numbers when selecting them, as it is considered bad practice. As normal projects will normally pass through updates and modifications, the column numbering can change, thus creating a series of headache for future-me. Therefore, use columns names to avoid future problems.

```{r}
max(EU_corona[, "cases"])
```

# "j" accepts a list of columns, so we can diversify a little bit:

```{r}
cases_corona <- EU_corona[, .(date, country, cases)]
```

### and deselect what is not necessary as well. Like the uid codes (they are mostly for maps, that are going to be covered in another workshop).

```{r}
EU_corona[, -c("uid", "iso2", "iso3")]

EU_corona[, mean(cases)]
```

### We can also do some descriptive statistics. Here we perform opeations directly within "j". Compare with the data frame approach:

# the data frame
```{r}
mean(EU_corona[, "cases"])
```

#Computing on rows and columns 

```{r}
EU_corona[country == "Germany", max(cases)]
```

### We will use the examples from the "i" and combine with the "j" to perform statistic operations. It is also an alternative that saves energy and memory in comparison with data frames, as they would return all the columns only to compute some operations of rows. 

```{r}
EU_corona[type == "death", .(mean(cases))]

mean <- EU_corona[, .(mean_cases = mean(cases))]
```

### As an last example on "j", we will perform multiple operations. We will check a specific country during this year and see the mean and median of cases.  

```{r}
EU_corona[country == "Germany" & date > "2021-01-01", .(mean_cases = mean(cases), median_cases = median(cases))]
```

### Lastly, we will talk about the "by" argument. It is a grouping operation. Here we will look more specifically at the number of deaths by country.

```{r}
n_deaths <- EU_corona[type == "death", .(total = sum(cases)), by = "country"]
```

# The final example will see the total sum of cases in Germany.

```{r}
deu_cases <- EU_corona[country == "Germany", .(main_cases = sum(cases)), by = c("type")]
```

### Moving to the final step, we will talk about chaining expressions. Chaining allows us to do sucessive operations on the results that we had in the previous brackets. You can certaintly use a high number of chains to obtain high specific results. Wise reminder: Do not try multiple chainings at once, slow and steady is the way to not receive an error.

```{r}
EU_corona[type == "death", .(total = max(cases)), by = c("date", "country")][order(-total)]
```

# Ok! We covered only the surface of data.tables! If you want to delve into it more deeply, we recommend taking a look at .SDs and application of functions to columns.