-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathScript - Manipulation with data using data.table.Rmd
185 lines (126 loc) · 6.12 KB
/
Script - Manipulation with data using data.table.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
---
title: "Wrangling Data at scale with data.table"
output: html_document
date: "`r format(Sys.time(), '%d %B %Y')`"
html_document:
toc: TRUE
df_print: paged
number_sections: FALSE
highlight: tango
theme: lumen
toc_depth: 3
toc_float: true
css: custom.css
self_contained: false
includes:
after_body: footer.html
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.align = 'center', echo = TRUE)
```
***
# Wrangling Data at scale with Data.table
# For this exercise we will need to install two packages. The first will be data.table and the second will be the coronavirus package, which summarizes Coronavirus (COVID-19) cases across countries/provinces/states.
```{r}
install.packages("data.table")
install.packages("coronavirus")
```
# Afterwards, load them.
```{r}
library(data.table)
library(coronavirus)
```
# This guide will do data manipulation with datatables. A brief explainer: Datatables are in fact data frames, but it permits to perform operations by groups besides the classic operations with rows and columns. One can say that it is capable of processing operation in a third dimension besides the traditional two-dimensional rows and columns.
### Create the data.table of the coronavirus and check its dimensions
```{r}
table_corona <- as.data.table(coronavirus)
```
### Its super extensive. It's worth to note that data tables are fast, so it can save you lots of time doing filtering, ordering, grouping, and so on when compared with other packages. This is quite relevant when operating with large datasets, such as the coronavirus one that we are using in this exercise. Check the columns and rows.
```{r}
ncol(table_corona)
dim(table_corona)
```
# Now, Let's do some filtering.
```{r}
table_corona[3:4]
table_corona[3:4, ]
table_corona[!(1:5)]
```
# or
```{r}
table_corona[-(1:5)]
```
# This is quite similar to the filtering operation that we are used to.
```{r}
table_corona[.N]
```
### The .N is an integer. If used during the "i" argument, it gives back total number of rows in the data table. We obtained the last row using it within the square brackets. It's analogous to nrow().
```{r}
EU_corona <- table_corona[continent_name == "Europe"]
```
### We will continue filtering, but will rely to other alternatives that may hasten the process. We will filter using %like%, which will look for patterns in character observations. Say that you want something more specific from England, such as Channel Islands.
```{r}
Uk_corona <- EU_corona[combined_key %like% "^Channel Islands"]
```
### Therefore, %like% will subset while looking for all rows that contain the given character string in a specified column.
### We will proceed to %between%. It is extremely useful on numeric columns, allowing us to specify intervals.
```{r}
EU_corona[cases %between% c(200, 300)]
```
# instead of doing the traditional way.
```{r}
EU_corona [cases >= 200 & cases <= 300]
```
### The %chin% is perfect for searching for multiple character strings.
```{r}
EU_corona[country %chin% c("Germany", "France", "Italy")]
```
# General form of data.table
### From the general form of data.table, we use the second argument j to select and compute columns.
### Remember to Take DT, filter rows in "i", then compute "j", grouped by "by".
DT[i, j, by]
### One difference between data frames and data tables is that when selecting a single column from a data frame, the result will be a vector. In a data table, the output will remain a data table.
### We will check some examples by selecting columns. Remember to avoid using numbers when selecting them, as it is considered bad practice. As normal projects will normally pass through updates and modifications, the column numbering can change, thus creating a series of headache for future-me. Therefore, use columns names to avoid future problems.
```{r}
max(EU_corona[, "cases"])
```
# "j" accepts a list of columns, so we can diversify a little bit:
```{r}
cases_corona <- EU_corona[, .(date, country, cases)]
```
### and deselect what is not necessary as well. Like the uid codes (they are mostly for maps, that are going to be covered in another workshop).
```{r}
EU_corona[, -c("uid", "iso2", "iso3")]
EU_corona[, mean(cases)]
```
### We can also do some descriptive statistics. Here we perform opeations directly within "j". Compare with the data frame approach:
# the data frame
```{r}
mean(EU_corona[, "cases"])
```
#Computing on rows and columns
```{r}
EU_corona[country == "Germany", max(cases)]
```
### We will use the examples from the "i" and combine with the "j" to perform statistic operations. It is also an alternative that saves energy and memory in comparison with data frames, as they would return all the columns only to compute some operations of rows.
```{r}
EU_corona[type == "death", .(mean(cases))]
mean <- EU_corona[, .(mean_cases = mean(cases))]
```
### As an last example on "j", we will perform multiple operations. We will check a specific country during this year and see the mean and median of cases.
```{r}
EU_corona[country == "Germany" & date > "2021-01-01", .(mean_cases = mean(cases), median_cases = median(cases))]
```
### Lastly, we will talk about the "by" argument. It is a grouping operation. Here we will look more specifically at the number of deaths by country.
```{r}
n_deaths <- EU_corona[type == "death", .(total = sum(cases)), by = "country"]
```
# The final example will see the total sum of cases in Germany.
```{r}
deu_cases <- EU_corona[country == "Germany", .(main_cases = sum(cases)), by = c("type")]
```
### Moving to the final step, we will talk about chaining expressions. Chaining allows us to do sucessive operations on the results that we had in the previous brackets. You can certaintly use a high number of chains to obtain high specific results. Wise reminder: Do not try multiple chainings at once, slow and steady is the way to not receive an error.
```{r}
EU_corona[type == "death", .(total = max(cases)), by = c("date", "country")][order(-total)]
```
# Ok! We covered only the surface of data.tables! If you want to delve into it more deeply, we recommend taking a look at .SDs and application of functions to columns.