generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path08_1_poison1_sol.Rmd
269 lines (203 loc) · 8.77 KB
/
08_1_poison1_sol.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
---
title: "Exercise 8.1: Additive linear model on the poison dataset - solution"
author: "Lieven Clement, Jeroen Gilis and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
# The poison dataset
In this experiment, 96 fish (dojofish, goldfish and zebrafish)
were placed separately in a tank with two liters of water and
a certain dose (in mg) of the poison EI-43,064. The resistance
of the fish against the poison was measured as the amount of
minutes the fish survived after being exposed to the poison (`Surv_time`, in
minutes). Additionally, the weight of each fish was measured.
# Goal
In this tutorial session we will focus on **Dojofish**, and we will model the
survival time in function of the poison dose while correcting for the weight of
the fish.
1. We will first analyse the survival data by only considering the dose as an
explanatory variable for survival time
2. Next we will model the survival data with and additive model for dose and
weight
Load libraries
```{r, message=FALSE, warning=FALSE}
# install.packages("GGally")
library(GGally)
library(car)
library(multcomp)
library(tidyverse)
theme_set(theme_bw())
```
# Import the data
```{r, message=FALSE}
poison <- read_csv("https://raw.githubusercontent.com/statOmics/PSLSData/main/poison.csv")
```
# Data tidying
We can see a couple of things in the data that can be improved:
1. Capitalise the fist column name
3. Set the Species column as a factor
4. Change the species factor levels from "0" to Dojofish.
*Hint*: use the `fct_recode` function.
4. In the previous analysis on this dataset
(`Simple linear regression session`), we performed a log-transformation on the
response variable `Surv_time` to meet the normality and homoscedasticity
assumptions of the linear model. Here, we will immediately work with
log-transformed survival times; store these in the new variable `log2Surv_time`
and remove the non-transformed values.
5. Subset the data to only retain **Dojofish**.
```{r}
poison <- poison %>%
rename("Species" = "species") %>%
mutate(Species = as.factor(Species)) %>%
mutate(Species = fct_recode(Species,
Dojofish = "0", Goldfish = "1", Zebrafish = "2"
)) %>%
mutate(log2Surv_time = log2(Surv_time)) %>%
select(-Surv_time) %>%
filter(Species == "Dojofish")
poison
```
# Data exploration
Prior to the analysis, we should explore our data. To start our data
exploration, we will make use of the `ggpairs` function of the
`GGally` R package. This function will generate a visualization containing
multiple panels, which display (1) univariate plots of the different variables
in our dataset, (2) bivariate plots and (3) correlation coefficients between
the different variables.
```{r, message=FALSE, fig.width=12}
ggpairs(poison, columns = 2:4)
```
Based on these plots, we observe that:
- The survival time seems to be associated with dose and fish weight.
From the tutorial of H6 we have seen that the fish weights were not nicely
uniform across the different poison dosages due to the randomisation.
```{r, message=FALSE}
poison %>%
ggplot(aes(x = Dose, y = Weight)) +
geom_point() +
ggtitle("Association between dose and weight") +
theme_bw() +
stat_summary(
fun = mean, geom = "point",
col = "black", size = 4,
shape = 24, fill = "red"
)
```
# Simple linear regression
This is the same regression model that we have already fit in the exercise
session on simple linear regression, with `Dose` as the only explanatory
variable for `log2Surv_time`.
```{r}
# fit a linear regression model with 'Surv_time' as response variable and
# 'Dose' as predictor variabele
lm_simple <- lm(log2Surv_time ~ Dose, data = poison)
## display the diagnostic plots of the model
par(mfrow = c(2, 2))
plot(lm_simple)
```
1. The independence assumption was **met** because the fish were randomized to
the dose.
2. The linearity assumption is **met**.
3. The normality assumption is **met**.
4. The homoscedasticity assumption is **met**.
Finally, we look at the output of the model.
```{r}
summary(lm_simple)
```
Or for an interpretation at the original scale (minutes in stead of log2
minutes):
```{r}
2^(coef(lm_simple))["Dose"]
2^(confint(lm_simple))[2, ]
```
There is a very significant effect of the poison on survival of Dojofish
(p< 0.001). Dojofish that are exposed to a higher dose of the poison will
have a survival time that decrease on average with a factor
`r lm_simple$coef[2] %>% 2^. %>% round(.,2)` per gram of poison that
is added (95% CI [`r confint(lm_simple)[2,] %>% 2^. %>% round(.,2)`]).
# Analysis with additive effect for weight
## Model specification
Here, we will estimate the effect of the poison while correcting for `weight`
and we will add it as an additional covariate to our linear regression model,
such that
$$
y_i=\beta_0+\beta_d x_d + \beta_g x_g + \epsilon_i,
$$
with $\epsilon_i \text{ i.i.d. } N(0,\sigma^2)$.
## Assumptions
The model will again be fit to allow for assessing the model assumptions
```{r}
lm_additive <- lm(log2Surv_time ~ Dose + Weight, data = poison)
par(mfrow = c(2, 2))
plot(lm_additive)
```
The assumption of independence, linearity and homoscedasticity are met.
The QQ-plot suggest that there might be some deviation from normality in the
left tail of the,distribution. However, when we would simulate data under the
normality assumption, it seems that deviations of this size may be expected when
normality is met. We will use simulation to assess if we can observe similar
residual plots if all assumptions for the linear model hold.
```{r}
set.seed(1031)
sigma <- sigma(lm_additive)
dataHlp <- poison
par(mfrow = c(3, 3))
for (i in 1:9) {
nobs <- nrow(poison)
dataHlp$ySim <- fitted.values(lm_additive) + rnorm(nobs, sd = sigma)
simModel <- lm(ySim ~ Dose + Weight, dataHlp)
plot(simModel, which = 2)
}
```
It seems that deviations of the size that we see in our real data may be
expected even when normality is met. As such, all assumptions for linear
regression seem to be valid.
## Inference
We then inspect the results.
```{r}
summary(lm_additive)
```
## Interpretation of model parameters
We see that the effect of dose on survival time remains similar, however, it
has become more significant after we have incorporated weight in our model.
Indeed, from the data exploration, we learned that weight is associated with
survival.
1. As such, by incorporating weight in our model, we are able to
explain a larger part of the variability in the response variable survival time.
As a consequence, the variability in the residuals of the model will decrease,
which in turn will lead to smaller standard error estimates for the different
parameter estimates in the model.
2. From the data exploration, we additionally found that the dojo-fish weights
were not uniform across the different poison dosages due to the randomisation.
Therefore, we can estimate the effect between dose and survival time better
while accounting for the weight.
In this model, the effect of `Dose` can be interpreted as the average change in
the log2 survival time between two groups of dojofish **with the same weight**
that are exposed to a poison dosage that differs 1 mg/L. In symbols:
\begin{eqnarray}
\hat{\mu_1}&=& \beta_0 + \beta_d x_{1d} + \beta_g x_g \text{ (average log2-survival time for dose 1 for a certain weight)}\\
\hat{\mu_2}&=& \beta_0 + \beta_d x_{2d} + \beta_g x_g \text{ (average log2-survival time for dose 2 for that same weight)}\\
\hat \mu_2- \hat \mu_1&=&\beta_0 + \beta_d x_{2d} + \beta_g x_g - (\beta_0 + \beta_d x_{1d} + \beta_g x_g) \text{ (difference in average log2-survival time between dose 2 en dose 1)}\\
\hat \mu_2-\hat \mu_1&=&\beta_d (x_{2d}-x_{1d})
\end{eqnarray}
## Conclusion
```{r}
2^(coef(lm_additive))
2^(confint(lm_additive))
```
The dose of the poison has an extremely significant effect on the log2
transformed survival time of dojofish
(p-value = `r format(summary(lm_additive)$coefficients[2,4],digits=1)`). The
geometric average of the survival time for dojofish that are exposed to a
poison dose that is 1mg/L larger is approximately halved (factor = $2^{\beta_d}=$ `r format(2^(lm_additive$coef["Dose"]),digits=2)`) .
The effect of weight on the survival time of dojofish is also extremely
significant
(p-value = `r format(summary(lm_additive)$coefficients[3,4],digits=1)`). The
geometric average of the survival time of a dojofish that weighs 1 gram more
than another dojofish is approximately twice as long
(factor = $2^{ \beta_g}$= `r format(2^(lm_additive$coef["Weight"]),digits=2)`,
95% BI [`r paste(format(2^(confint(lm_additive)["Weight",]),digits=2),collapse=",")`]).
## Remarks
In the `lm_additive` model, we included only a *main effect* for weight.
However, there could also be an *interaction effect* between weight and
dose. An interaction between weight and dose implies that the dose effect on
the survival time changes according to the weight of the fish.