-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathmcmc.Rmd
393 lines (332 loc) · 10.7 KB
/
mcmc.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
---
title: "MCMC pipelines"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{MCMC pipelines}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
# With the root.dir option below,
# this vignette runs the R code in a temporary directory
# so new files are written to temporary storage
# and not the user's file space.
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
if (identical(Sys.getenv("NOT_CRAN", unset = "false"), "false")) {
knitr::opts_chunk$set(eval = FALSE)
}
library(cmdstanr)
library(dplyr)
library(targets)
library(stantargets)
if (identical(Sys.getenv("IN_PKGDOWN"), "true")) {
cmdstanr::install_cmdstan()
}
```
The `stantargets` package makes it easy to run a single Stan model and keep track of the results. [`cmdstanr`](/~https://github.com/stan-dev/cmdstanr) fits the models, and [`targets`](https://docs.ropensci.org/targets/) manages the workflow and helps avoid unnecessary computation.
First, write a Stan model file.
```{r}
lines <- "data {
int <lower = 1> n;
vector[n] x;
vector[n] y;
real true_beta;
}
parameters {
real beta;
}
model {
y ~ normal(x * beta, 1);
beta ~ normal(0, 1);
}"
writeLines(lines, "x.stan")
```
Next, create a `_targets.R` file to define the [`targets`](https://docs.ropensci.org/targets/) pipeline.
```{r, echo = FALSE}
library(targets)
tar_script({
library(stantargets)
options(crayon.enabled = FALSE)
tar_option_set(memory = "transient", garbage_collection = TRUE)
list(
tar_stan_mcmc(
example,
"x.stan",
tar_stan_example_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
)
)
})
```
```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)
generate_data <- function() {
true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- stats::rnorm(n, x * true_beta, 1)
list(n = n, x = x, y = y, true_beta = true_beta)
}
list(
tar_stan_mcmc(
example,
"x.stan",
tar_stan_example_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
)
)
```
`tar_stan_mcmc(example, ...)` defines several targets:
* `example_file_x`: Reproducibly track changes to the Stan model file.
* `example_data`: Run the code you supplied to `data` and return a dataset compatible with Stan.
* `example_mcmc_x`: Run the MCMC and return an object of class `CmdStanMCMC`.
* `example_draws_X`: Return a friendly `tibble` of the posterior draws from `example`. Uses the `$draws()` method. Suppress with `draws = FALSE` in `tar_stan_mcmc()`.
* `example_summaries_x`: Return a friendly `tibble` of the posterior summaries from `example`. Uses the `$summary()` method. Suppress with `summary = FALSE` in `tar_stan_mcmc()`.
* `example_diagnostics_x`: Return a friendly `tibble` of the sampler diagnostics from `example`. Uses the `$sampler_diagnostics()` method. Suppress with `diagnostics = FALSE` in `tar_stan_mcmc()`.
The suffix `_x` comes from the base name of the model file, in this case `x.stan`. If you supply multiple model files to the `stan_files` argument, all the models share the same dataset, and the suffixes distinguish among the various targets.
Run `tar_manifest()` to see a tidy data frame of commands and settings for the targets.
```{r}
tar_manifest()
```
[`targets`](https://docs.ropensci.org/targets/) produces a graph to show the dependency relationships among the targets. Below, the MCMC depends on the model file and the data, and the draws, summary, and diagnostics depend on the MCMC.
```{r, output = FALSE, message = FALSE}
tar_visnetwork(targets_only = TRUE)
```
Run the computation with `tar_make()`.
```{r, output = FALSE}
tar_make()
```
The output lives in a special folder called `_targets/` and you can retrieve it with functions `tar_load()` and `tar_read()` (from [`targets`](https://docs.ropensci.org/targets/)).
```{r}
tar_read(example_summary_x)
```
At this point, all our results are up to date because their dependencies did not change.
```{r}
tar_make()
```
But if we change the underlying code or data, some of the targets will no longer be valid, and they will rerun during the next `tar_make()`. Below, we change the Stan model file, so the MCMC reruns while the data is skipped. This behavior saves time and enhances reproducibility.
```{r}
write(" ", file = "x.stan", append = TRUE)
```
```{r}
tar_outdated()
```
```{r}
tar_visnetwork(targets_only = TRUE)
```
```{r, output = FALSE}
tar_make()
```
At this point, we can add more targets and custom functions for additional post-processing.
```{r, echo = FALSE}
library(targets)
tar_script({
library(stantargets)
options(crayon.enabled = FALSE)
tar_option_set(memory = "transient", garbage_collection = TRUE)
list(
tar_stan_mcmc(
example,
"x.stan",
tar_stan_example_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_stan_summary(
custom_summary,
fit = example_mcmc_x,
summaries = list(~posterior::quantile2(.x, probs = c(0.25, 0.75)))
)
)
})
```
```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)
generate_data <- function() {
true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- stats::rnorm(n, x * true_beta, 1)
list(n = n, x = x, y = y, true_beta = true_beta)
}
list(
tar_stan_mcmc(
example,
"x.stan",
generate_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_stan_summary(
custom_summary,
fit = example_mcmc_x,
summaries = list(~posterior::quantile2(.x, probs = c(0.25, 0.75)))
)
)
```
In the graph, our new `custom_summary` target should be connected to the upstream `example` target, and only `custom_summary` should be out of date.
```{r}
tar_visnetwork(targets_only = TRUE)
```
In the next `tar_make()`, we skip the expensive MCMC and just run the custom summary.
```{r, output = FALSE, warning = FALSE}
tar_make()
```
```{r}
tar_read(custom_summary)
```
## Multiple models
`tar_stan_mcmc()` and related functions allow you to supply multiple models to `stan_files`. If you do, each model will run on the same dataset. Below, we add another `y.stan` model to the `stan_files` argument of `tar_stan_mcmc()`.
```{r, echo = FALSE}
library(targets)
file.copy("x.stan", "y.stan")
tar_script({
library(stantargets)
options(crayon.enabled = FALSE)
tar_option_set(memory = "transient", garbage_collection = TRUE)
list(
tar_stan_mcmc(
example,
c("x.stan", "y.stan"), # another model
tar_stan_example_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_stan_summary(
custom_summary,
fit = example_mcmc_x,
summaries = list(~posterior::quantile2(.x, probs = c(0.25, 0.75)))
)
)
})
```
```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)
generate_data <- function() {
true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- stats::rnorm(n, x * true_beta, 1)
list(n = n, x = x, y = y, true_beta = true_beta)
}
list(
tar_stan_mcmc(
example,
c("x.stan", "y.stan"), # another model
generate_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_stan_summary(
custom_summary,
fit = example_mcmc_x,
summaries = list(~posterior::quantile2(.x, probs = c(0.25, 0.75)))
)
)
```
In the graph below, notice how the `*_x` targets and `*_y` targets are both connected to `example_data` upstream.
```{r}
tar_visnetwork(targets_only = TRUE)
```
## Generated quantities
It is possible to use the `CmdStanMCMC` object from one run to simulate generated quantities downstream. For example, the `tar_stan_gq_rep_summaries()` function takes a single `CmdStanMCMC` object, produces multiple replications of generated quantities from multiple models, and aggregates the summaries from each. The following pipeline uses this technique to repeatedly draw from the posterior predictive distribution.
```{r}
lines <- "data {
int <lower = 1> n;
vector[n] x;
vector[n] y;
}
parameters {
real beta;
}
model {
y ~ normal(x * beta, 1);
beta ~ normal(0, 1);
}
generated quantities {
real y_rep[n] = normal_rng(x * beta, 1); // posterior predictive draws
}"
writeLines(lines, "gen.stan")
```
```{r, echo = FALSE}
library(targets)
tar_script({
library(stantargets)
options(crayon.enabled = FALSE)
tar_option_set(memory = "transient", garbage_collection = TRUE)
list(
tar_stan_mcmc(
example,
"x.stan",
tar_stan_example_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_stan_gq_rep_summary(
postpred,
stan_files = "gen.stan",
fitted_params = example_mcmc_x, # one CmdStanFit object
data = tar_stan_example_data(), # multiple simulated datasets
batches = 2, # 2 dynamic branches
reps = 5, # 5 replications per branch
quiet = TRUE,
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
)
)
})
```
```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)
generate_data <- function() {
true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- stats::rnorm(n, x * true_beta, 1)
list(n = n, x = x, y = y, true_beta = true_beta)
}
list(
tar_stan_mcmc(
example,
"x.stan",
tar_stan_example_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_stan_gq_rep_summary(
postpred,
stan_files = "gen.stan",
fitted_params = example_mcmc_x, # one CmdStanFit object
data = generate_data(), # Function runs once per rep.
batches = 2, # 2 dynamic branches
reps = 5, # 5 replications per branch
quiet = TRUE,
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
)
)
```
Since we have defined many objects in the pipeline, it is extra important to check the graph to be sure everything is connected.
```{r}
tar_visnetwork(targets_only = TRUE)
```
Then, we run the computation. The original MCMC is already up to date, so we only run the targets relevant to the generated quantities.
```{r}
tar_make()
```
Finally, we read the summaries of posterior predictive samples.
```{r}
tar_read(postpred)
```
## More information
For more on [`targets`](https://docs.ropensci.org/targets/), please visit the reference website <https://docs.ropensci.org/targets/> or the user manual <https://books.ropensci.org/targets/>. The manual walks though advanced features of `targets` such as [high-performance computing](https://books.ropensci.org/targets/hpc.html) and [cloud storage support](https://books.ropensci.org/targets/cloud.html).