---
title: "Exercise 4.3: Exploring the diabetes dataset - solution"
author: "Lieven Clement, Jeroen Gilis and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
# Aims of this exercise
In this exercise, you will acquire the skills to
- recognize paired data
- conduct a data exploration in R for data from
paired experimental designs.
- interpret the results of a data exploration for paired experimental designs
# The diabetes dataset
The `diabetes` dataset holds information on a small experiment with
8 patients who were subjected to a glucose tolerance test.
Patients had to fast for eight hours before the test.
When the patients entered the hospital their baseline glucose level was measured (mmol/l).
Patients then had to drink 250 ml of a syrupy glucose solution containing 100 grams of sugar.
Two hours later, their blood glucose level was measured again.
The data consist of three variables:
- before: glucose concentration after 8 hours of fasting (mmol/l)
- after: glucose concentration 2 hours after drinking the glucose solution (mmol/l)
- patient: identifier for the patient
# Import the data
Data path:
`https://raw.githubusercontent.com/statOmics/PSLSData/main/diabetes.txt`
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
```
```{r}
diabetes <- read_delim("https://raw.githubusercontent.com/statOmics/PSLSData/main/diabetes.txt", delim = " ")
```
Have a first look at the data:
```{r}
glimpse(diabetes)
head(diabetes)
```
# Data visualization
Note that the dataset is not in tidy format. The glucose concentration
variable is spread across two columns, `before` and `after`, while the "time"
variable is encoded in the column names instead of in a dedicated column. Data
in this form are also called *wide* data. Instead, we want to transform the data
to a *long* format.

To tidy the data, we can use the `gather()` function to
[pivot](https://r4ds.had.co.nz/tidy-data.html#pivoting) the data. In this case,
we want to "gather" the `time` variable (encoded in the column names `before`
and `after`) and the `concentration` variable (encoded in the actual values).
The `patient` column should stay the same. We can specify this with the
following syntax.
```{r}
diabetes_tidy <- diabetes %>%
  gather(time, concentration, -patient)
diabetes_tidy
```
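
As a side note, `gather()` still works but has been superseded by `pivot_longer()` since tidyr 1.0.0. The chunk below is a minimal sketch of the equivalent call. It uses a small toy tibble (`toy_wide`, with made-up values that are not the measurements from the diabetes dataset) so that it can be run on its own.

```{r}
library(tidyverse)

## Toy data in the same wide layout as `diabetes`.
## The values are made up for illustration only.
toy_wide <- tibble(
  patient = c(1, 2, 3),
  before = c(5.0, 4.5, 6.1),
  after = c(7.2, 6.0, 6.4)
)

## pivot_longer() counterpart of gather(time, concentration, -patient)
toy_long <- toy_wide %>%
  pivot_longer(
    cols = -patient,
    names_to = "time",
    values_to = "concentration"
  )
toy_long
```

Note that `pivot_longer()` keeps the rows of each patient together, whereas `gather()` stacks all `before` rows ahead of all `after` rows. The content of the resulting table is the same.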
## Barplot
Not all visualization types will be equally informative.
A barplot is a plot that you will
commonly find in scientific publications.
The code for generating such a barplot
is provided below:
```{r}
diabetes_tidy %>%
  ## Calculate summary statistics for the "concentration" variable for each "time"
  group_by(time) %>%
  summarize(
    mean = mean(concentration, na.rm = TRUE),
    sd = sd(concentration, na.rm = TRUE),
    n = n()
  ) %>%
  ## Compute the standard errors for the means
  mutate(se = sd / sqrt(n)) %>%
  ggplot(aes(x = time, y = mean, fill = time)) +
  theme_bw() +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.2) +
  ggtitle("Barplot of glucose measurements") +
  ylab("concentration (mmol/l)")
```
A barplot, however, is not very informative.
The height of the bars only provides us with information on the mean glucose concentration.
We don't see the actual underlying values, so we have no information on, for instance, the spread of the data.
It is usually more informative to represent the underlying values as _raw_ as possible.
Note that it is possible to add the raw data to the barplot, but we still would not see any measures
of the spread, such as the interquartile range.
Another crucial aspect of the data is also not displayed:
the data are paired!

**Based on these criticisms, can you think of a better**
**visualization strategy for the diabetes data?**

**Add your proposed visualization strategy here**

A first improvement is to show the underlying values as _raw_ as possible.
Boxplots with the individual observations overlaid are well suited for this.
```{r}
## Recode the `time` variable so "before" is the first level
diabetes_tidy <- diabetes_tidy %>%
  mutate(time = as.factor(time)) %>%
  mutate(time = relevel(time, "before"))

diabetes_tidy %>%
  ggplot(aes(x = time, y = concentration, fill = time)) +
  theme_bw() +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2) +
  stat_summary(
    fun = mean, geom = "point",
    shape = 5, size = 3,
    color = "black"
  ) +
  ggtitle("Naive boxplot of glucose concentration\nNot a good representation for paired data!") +
  ylab("Concentration (mmol/l)")
```
Note, however, that for these data the boxplots are missing a crucial part of the information: the data are paired!
So the boxplots of the glucose measurements before and after glucose intake do not tell the full story. The boxplot is thus not a good plot for exploring the diabetes dataset!
## Paired data
A line plot is a better plot for paired data!
Note that `time` was already converted to a factor and releveled in the previous chunk, so that the concentration before glucose intake is plotted first.
```{r}
diabetes_tidy %>%
  ggplot(aes(x = time, y = concentration)) +
  geom_point() +
  geom_line(aes(group = patient)) +
  ylab("Concentration (mmol/l)") +
  ggtitle("Informative plot on glucose level for paired data")
```
We observe that the glucose concentrations increase for almost all patients!
This is a plot that shows all features in the data and clearly indicates that the data are paired.
Alternatively, we can first calculate the difference in glucose level (after minus before) for each patient and then make a boxplot of these differences!
```{r}
diabetes_diff <- diabetes_tidy %>%
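  group_by(patient) %>%
  ## diff() returns after - before here, because gather() stacked the
  ## `before` rows ahead of the `after` rows
  summarize(difference = diff(concentration))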
diabetes_diff
```
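
The `diff()` approach above relies on the row order within each patient. As an alternative sketch, the differences can also be computed directly from the wide table, where the direction of the subtraction is explicit. The chunk again uses a toy tibble (`toy_wide`, made-up values, not the actual measurements) so that it runs on its own. With the real data, `toy_wide` would be replaced by `diabetes`.

```{r}
library(tidyverse)

## Toy data in the same wide layout as `diabetes`.
## The values are made up for illustration only.
toy_wide <- tibble(
  patient = c(1, 2, 3),
  before = c(5.0, 4.5, 6.1),
  after = c(7.2, 6.0, 6.4)
)

## One difference per patient: after minus before
toy_diff <- toy_wide %>%
  mutate(difference = after - before) %>%
  select(patient, difference)
toy_diff
```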
```{r}
diabetes_diff %>%
  ggplot(aes(x = "", y = difference)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2) +
  xlab("") +
  ylab("Difference (mmol/l)") +
  stat_summary(
    fun = mean, geom = "point",
    shape = 5, size = 3,
    color = "black"
  ) +
  ggtitle("Difference in glucose concentration\nbefore and after sugar intake")
```
The mean difference in glucose concentration is greater than zero. It seems that dosing glucose has increased the glucose concentration by around 2 mmol/l on average.
In the next exercises, we will
learn how we can test whether this increase is statistically significant.
## QQ-plot
We can now assess if the differences are normally distributed.
```{r}
diabetes_diff %>%
  ggplot(aes(sample = difference)) +
  geom_qq() +
  geom_qq_line()
```
The differences in glucose concentration appear to be approximately normally distributed.
# Descriptive statistics
- Generate a code chunk to calculate useful summary statistics for
the diabetes data
```{r}
diabetes_tidy %>%
  group_by(time) %>%
  summarize(
    mean = mean(concentration, na.rm = TRUE),
    sd = sd(concentration, na.rm = TRUE),
    n = n()
  ) %>%
  ## Compute the standard errors for the means
  mutate(se = sd / sqrt(n))
```
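
Because the data are paired, it can also be useful to summarize the per-patient differences rather than the two time points separately. The chunk below is a minimal sketch of the same mean, standard deviation and standard error calculation applied to a `difference` column. The tibble `toy_diff` and its values are made up for illustration. With the real data, `diabetes_diff` from above would take its place.

```{r}
library(tidyverse)

## Toy per-patient differences (after - before).
## The values are made up for illustration only.
toy_diff <- tibble(
  patient = 1:4,
  difference = c(2.0, 1.0, 3.0, 2.0)
)

toy_summary <- toy_diff %>%
  summarize(
    mean = mean(difference, na.rm = TRUE),
    sd = sd(difference, na.rm = TRUE),
    n = n()
  ) %>%
  ## Standard error of the mean difference
  mutate(se = sd / sqrt(n))
toy_summary
```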