-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path12-Manage-data-tidyverse.Rmd
292 lines (249 loc) · 17.1 KB
/
12-Manage-data-tidyverse.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
# Manage data with tidyverse {#tidyverse-manage}
Let's start with our first data analysis (I know you were waiting for it!): in this chapter, we will use the magic world of <em>tidyverse</em> and rnotebook to explore a dataset.
<br>
## R notebook {-}
We have already seen how to store R codes into [scripts](#scripts-chapter). However, there are some disadvantages of using them: you cannot see the output of a code directly under it, you cannot see the graphs directly under the code, you cannot create multiline comments without having to comment each line etc.
<br>
Here comes <strong>notebooks</strong>!
Whithin RStudio, just click "File" > "New File" > "R notebook" to open a new notebook.
<br>
As this book is not focused on it, and there is already a wonderful book about it, I invite you to read the first part of <a href="https://bookdown.org/yihui/rmarkdown/notebook.html" target="_blank">it</a> to get used to this wonderful tool.
<br>
<p class="plist">The key feature you need to understand are: </p>
<ul>
<li>code is executed only in <em>code chunks</em>, and all what you type outside is considered as "comment"</li>
<li>after every executed code chunks you can see the output of the chunk itself (super useful for data visualization and dataframe inspection)</li>
<li>you can export the notebook as html/pdf and share it with your colleagues (I prefer html for many reasons)</li>
</ul>
<strong>Dont' worry!</strong> At the end of each chapter you can download the notebook used for the analysis, so you can explore it, see how it works, and run it. Don't be lazy, first write your own notebook, in which you can add whatever comment you want, and then download mine to compare it.
## Load packages and data {-}
At the beginning of every analysis, it is strongly suggested to load any external package you want to use, in our case <em>tidyverse</em>.
Why using tidyverse? Because it is super useful, easy to learn, intuitive and it is the main package used for data analysis with R. We have seen in the last chapter how to install and load it in our session, so let's do it:
```{r, warning=FALSE}
library(tidyverse)
```
<br>
Next, we will load the dataset to analyze. It is taken from <a href="https://www.nature.com/articles/s41514-020-00052-5" target="_blank">this</a> paper, in particular it is Supplementary Table 13 (<a href="data/Supplementary_table_13.csv" download>Download</a>).
```{r}
df <- read.csv("data/Supplementary_table_13.csv")
head(df)
```
## Dataset structure {-}
The very first thing to do is to explore the dataset, in particular the number of rows/columns and the type of columns. We can do it through `str()`, which returns the dimensions of the dataframe and some info about each column.
<br>
Here it is:
```{r}
str(df)
```
You usually know what data you have in your dataset, which variables etc. As this is a downloaded table and it is my first time looking at it, I have to admit that we have to discover few things ahahah
<p class="plist">What can we say about this dataframe?</p>
<ul>
<li>It has 1493 observation and 10 variables</li>
<li>Variable SampleID should be the unique identifier of the sample (we should check it)</li>
<li>Diagnosis, sex, Area and dataset column can be categorical data, so we should convert them to factors</li>
<li>We have NAs in some variables. I see your face asking yourself "What the hell are NAs?". Don't worry, we'll explain it in a bit</li>
</ul>
Let's now check if SampleID are unique identifiers; to do it, we evaluate whether the number of unique values in that column is equal to the number of total values of that column:
```{r}
length(unique(df$SampleID)) == length(df$SampleID)
```
Yes, they are unique identifiers.
### Change column type {-}
Next, we transform some chr columns to factor, this has 2 main advantages: first of all, factors are treated differently in some functions (expecially plotting) and they occupy less memory in you PC.
<br>
"Factors?! Why are you telling me about this data type now?" Yes, this is another data type, but don't worry, you just have to know the two advantages aforementioned and how to create it (we will see it now).
<br>
An you know what?! As I know you're doing great, let's introduce our first function of the magic tidyverse world: `mutate`:
```{r}
df <- df %>%
mutate("Diagnosis" = factor(Diagnosis),
"sex" = factor(sex),
"Area" = factor(Area),
"dataset" = factor(dataset))
str(df)
```
<p class="plist">Great! It worked! Let's dissect this code:</p>
<ol>
<li>`df <-`: we reassign the result to the same variable (overwriting it)</li>
<li>`df %>%`: this is a feature of <em>purr</em>, a package inside tidyverse, that takes what is at the left of `%>%` and pass it as the first input of the function after the symbol. We could have written this code in just one line, but in this way is more readable</li>
<li>`mutata(`: we use the function mutate, with df as first input (implicit thanks to %>%). This function is used to create <strong>new column/s</strong> in the dataframe; if the column already exists (as in our case), it is overwritten</li>
<li>`"Diagnosis" = factor(Diagnosis)`</li>: I take this as example, as this function has this structure `"new_column" = something`. In this case, we create (overwrite) column Diagnosis with `factor(Diagnosis)`. <strong>To note</strong>: in tidyverse functions, the column of the dataframe can be accessed by calling only the name of the column, and not df$name_of_the_column.
<li>`factor(Diagnosis)`: it takes as input the column Diagnosis and transform it from chr to Factor</li>
</ol>
And what can we say about the result?
<br>
As you can see, the four columns we changed inside `mutate` has now changed their output: from chr they became Factor w/ n levels and a bunch of numbers instead of the values. I'm sure you have already understood what does it mean, but to be sure: with "levels", R stores the unique values of that vector (remember, each column of a dataframe is a vector!), and then it uses, for each element, integers to refer to the corresponding levels. This ensure memory optimization (integers takes less memory/space than words).
<br>
But, don't worry, when you ask R to visualize the values of a factor, it returns the value and not the numbers:
```{r}
df$Diagnosis[1:5]
```
### Reorder levels of a factor {-}
There is a cool feature about factors: you can order the levels of it. In fact, by default levels are in alphabetical order. We can see it here, using the function `levels()`:
```{r}
levels(df$Diagnosis)
```
Sometimes, it is useful to change the order of them: think about times (20 weeks comes before 3 days alphabetically), or having a control reference as first in plots (when plotting categorical variable, the order of the levels is the order in the plot). <br>
So, let's see how we can change the order of the levels using `factor`:
```{r}
df$Diagnosis <- factor(df$Diagnosis, levels = c("Control", "AD"), ordered = T)
str(df)
```
<em>"You are not using tidyverse now, why?"</em> Right, I used the standard R syntax to highlight the syntax difference with tidyverse: in fact, here I had to put `df$` into the function to explicit which column to convert. You can use whatever syntax you want, it does not matter at all; you will see shortly that sometimes tidyverse version is more clear.
<br>
Having said that, we can now see in the result of `str()` that now Diagnosis is an <em>Ord.factor w/ 2 levels "Control"<"AD"</em>.
## Summary {-}
Ok, now that we have fixed some columns, let's evaluate the values of the dataframe, we will use `summary()`:
```{r}
summary(df)
```
<p class="plist">How does summary works? It depends on the type of data of the column:<p>
<ul>
<li><em>Character</em>: it returns the length and the class (See SampleID)</li>
<li><em>Factor</em>: it returns the number of occurrence of each levels (see Diagnosis, sex, Area and dataset). That's another reason to use factor over character/numeric when possible. You can easily see if there are imbalance in some categories/levels or not.</li>
<li><em>Numeric</em>: it returns some descriptive (See Braak and others). As you should know the data types you have inserted, you know the range of values you are expecting, so this function helps you to immediately evaluate the presence of some outliers.</li>
<li><em>Boolean</em>: it returns the occurrence of TRUE and FALSE (as a factor with 2 levels)</li>
<ul>
Again, we have these NA's... and now it's the time to explain them.
## Dealing with missing values {-}
OPS, I already spoilered it ahahah. `NA` is the way R represent <strong>missing values</strong>.
<br>
In our example, we can see that we have different missing values in different columns, in particular in CDR where 1/3 of the total observations is missing.
<br>
How to deal with missing values is up to you. First of all, if you have collected the data it is a control on whether all your data are present (if not, you can go back and check why it is missing a data and fill it with a value, if you have it).<br>
Usually, NA's comes with downloaded data or data from different sources. You can decide if you want to delete all the observation with missing values, fill the missing values with another value (imputation) or accept missing values. There is not a <em>standard</em> on it, it is really up to you.
<br>
If you want to exclude all rows that have a missing value in any of the column, you should use `na.omit` function (I won't show it because I don't want to apply this filter to our data, but it is important that you know this function).
<br>
Instead, I propose an example on how to delete rows that have missing values in column PMI. We will use the function `is.na()` that evaluates values of a vector and returns a boolean (TRUE/FALSE) vector on whether a value is NA or not:
```{r}
# 1. Example of is.na
is.na(c(1, NA, NA, 4, 9, 12))
```
You should now tell me how to use it to exclude rows from a dataframe based on whether a value of a column is NA or not.<br>
Here is you should have answered:
```{r}
df <- df[!is.na(df$PMI), ]
summary(df)
```
Great, we have no more NAs in PMI. If you don't remember what `!` stands for, go back and check the [boolean chapter](#boolean-chapter).
<br>
## Filter a dataframe {-}
We have just filtered out some rows, and guess what?! Tidyverse has a function to filter values, it is called...... `filter`. <br>
This function takes as input an expression that returns a vector of boolean. Let's try it to retrieve a dataframe of control males with CDR greater than 4:
```{r}
df_male_ctrl_cdr4 <- df %>%
filter(Diagnosis == "Control" & sex == "M" & CDR > 4)
summary(df_male_ctrl_cdr4)
```
We have just 3 observations that matches our request.<br>
We can also use functions to evaluate thresholds etc. For example, let's take the female whose RIN is above 3rd quartile (75th quantile):
```{r}
df_female_rin3q <- df %>%
filter(sex == "F" & RIN > quantile(df$RIN, p = 0.75))
summary(df_female_rin3q)
```
There are 199 female samples with a RIN above 3rd quartile. We get it thanks to `quantile` function, which accept a vector and returns the quartile values; if you want a particular quantile (as in our case), you can give as input `p`, which is the quantile you want to calculate.
## Sort for the value of a column {-}
You can also sort the entire dataframe based on the values of a column (or more columns if ties are present). We do it through `arrange` function of tidyverse. <br>
For example, let's sort the dataframe based on RIN:
```{r}
df %>%
arrange(RIN) %>%
head(n = 10)
```
<p class="plist">There are few consideration about this code:</p>
<ul>
<li>We have sorted df based on RIN in an <em>ascending</em> order. If you want to sort the values in a descending way, you have to wrap the name of the column into `desc`, as `arrange(desc(RIN))`</li>
<li>We piped (the symbol `%>%` is also called <strong>pipe</strong>) the output of `arrange` to `head` function, to get only top 10 rows to visualize</li>
<li>Row 6 and 7 have the same RIN value, in this case we can add another column to `arrange` and use it to sort ties, as `arrange(RIN, PMI)`</li>
</ul>
## Merge two columns {-}
Sometimes, you want to merge two columns, and guess what?! There is a function for it ahahah
<br>
Let's merge Area and dataset columns using `unite()`:
```{r}
df <- df %>%
unite(col = "Area_dataset", Area, dataset,
sep = "_", remove = T)
head(df)
```
Tadaaa! Instead of two columns we now have just one. <br>
<p class="plist">Looking at the code:</p>
<ul>
<li>`col =` was used to set the new column name</li>
<li>`Area, dataset` were the two columns to merge. BUT, you can merge more than two columns, simply continue adding column names separated by a comma</li>
<li>`sep = "_"` was used to set the separator between the values of the columns (in our case, the underscore)</li>
<li>`remove = T` was used to tell R to remove the columns that we have merged. If set to FALSE, the original columns are kept in the dataframe and the merged column is added</li>
</ul>
## Split two columns {-}
Obviously, there is also a function to split a column based on a patter: `separate`. <br>
We use it to separate the newly created Area_dataset column into the original ones:
```{r}
df <- df %>%
separate(col = Area_dataset, into = c("Area", "dataset"), sep = "_", remove = F) %>%
mutate("Area" = factor(Area),
"dataset" = factor(dataset))
head(df)
```
Here it is! We have regained the original columns as well as the merged one.
<br>
As you may have guessed, `into =` is used to set the names of the columns that are created by this function.
## Merge two dataframes {-}
To conclude this chapter, we will see how to merge two dataframes based on the values of a common column. <br>
This is useful when, for example, you have the output of a differential gene expression with ensemble gene id and another dataframe which associates the ensemble gene ids with gene names, chromosome position etc etc. In this way, you can bring the information of the second dataframe into the first one.
<br>
In our case, I found in the paper the extended names of the Area abbreviations, and I want to add this info to our table. To do so, let's first create the dataframe of abbreviations and full names:
```{r}
# 1. Create vectors
abbreviations <- c("BM10", "BM22", "BM36", "BM44", "DLPFC", "TL")
full_name <- c("Broadmann area 10", "Broadmann area 22", "Broadmann area 36", "Broadmann area 44", "Dorsolateral prefronal cortex", "Temporal lobe")
# 2. Create dataframe
area_df <- data.frame(abbreviations, full_name)
area_df
```
Next, to insert full_name info into our original dataframe we have to use one of the `join` functions from `dplyr` package of the tidyverse universe:
```{r}
df <- df %>%
left_join(y = area_df, by = c("Area" = "abbreviations"))
head(df)
```
<p class="plist">Just a couple of comments on the code:</p>
<ul>
<li>`y =` is used to set the dataframe that has to be joined</li>
<li>`by =` is used to declare which variable of the first dataframe is to match in the second dataframe (in our case Area from the first with abbreviations in the second)</li>
<li>There are many <em>join</em> functions:
<ul>
<li>`left_join` (the one we used): keep all the rows of the first dataframe, even if there is not the correspondence in the second on, and only the rows of the second one that have a match in the first</li>
<li>`right_join`: keep all the rows of the second dataframe, even if there is not the correspondence in the first one, and only the rows of the first one that have a match in the second</li>
<li>`inner_join`: keep only the rows that have common values between the two dataframe in the column used to merge</li>
<li>`full_join`: keep every rows of both dataframes</li>
</ul>
</li>
</ul>
This concept is easier to explain with this image:
<img src="images/png/join_functions.png" height=30%>
## Exercises {-}
Woah... this was tough. Let's do some exercises to strengthen these concepts.
::: {.exercise #read-spreadheet}
The first exercise is not a practical exercise sorry... I want you to read this very short <a href="https://www.biostat.wisc.edu/~kbroman/publications/dataorg.pdf" target="_blank">paper</a> on how to structure tabular data. I is mainly on Excel (which you know... NEVER use it anymore to do analysis), but the key concepts must be applied to every tabular data you are creating.
<br>
Next, take one of your data file from an experiment and try to be consistent with what you have just read.
:::
::: {.exercise #filter-or}
Take our dataset and filter for males samples of Temporal lobe and female samples of Dorsolateral prefronal cortex.<br>
Change full_name to a factor
:::
<details>
<summary>Solution</summary>
```{r}
df_filtered <- df %>%
filter((sex == "M" & full_name == "Temporal lobe") |
(sex == "F" & full_name == "Dorsolateral prefronal cortex")) %>%
mutate("full_name" = factor(full_name))
summary(df_filtered)
```
It is important to note that, since we have one condition OR another, and both have multiple conditions in them, we must use parenthesis to separate the conditions to evaluate. In fact, it works as in math: first parenthesis and then from left to right.
</details>
<br>
This was just an introduction with some cool stuff you can do with tidyverse. In the next chapter we will see more functions that can help us in doing statistics and other useful operations.