vignettes/mcmc.Rmd

---
title: "MCMC pipelines"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{MCMC pipelines}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
# With the root.dir option below,
# this vignette runs the R code in a temporary directory
# so new files are written to temporary storage
# and not the user's file space.
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
if (identical(Sys.getenv("NOT_CRAN", unset = "false"), "false")) {
  knitr::opts_chunk$set(eval = FALSE)
}
library(cmdstanr)
library(dplyr)
library(targets)
library(stantargets)
if (identical(Sys.getenv("IN_PKGDOWN"), "true")) {
  cmdstanr::install_cmdstan()
}
```

The `stantargets` package makes it easy to run a single Stan model and keep track of the results. [`cmdstanr`](/~https://github.com/stan-dev/cmdstanr) fits the models, and [`targets`](https://docs.ropensci.org/targets/) manages the workflow and helps avoid unnecessary computation.

First, write a Stan model file.

```{r}
lines <- "data {
  int <lower = 1> n;
  vector[n] x;
  vector[n] y;
  real true_beta;
}
parameters {
  real beta;
}
model {
  y ~ normal(x * beta, 1);
  beta ~ normal(0, 1);
}"
writeLines(lines, "x.stan")
```


Next, create a `_targets.R` file to define the [`targets`](https://docs.ropensci.org/targets/) pipeline.

```{r, echo = FALSE}
library(targets)
tar_script({
  library(stantargets)
  options(crayon.enabled = FALSE)
  tar_option_set(memory = "transient", garbage_collection = TRUE)
  list(
    tar_stan_mcmc(
      example,
      "x.stan",
      tar_stan_example_data(),
      stdout = R.utils::nullfile(),
      stderr = R.utils::nullfile()
    )
  )
})
```

```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)

generate_data <- function() {
  true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- stats::rnorm(n, x * true_beta, 1)
  list(n = n, x = x, y = y, true_beta = true_beta)
}

list(
  tar_stan_mcmc(
    example,
    "x.stan",
    tar_stan_example_data(),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  )
)
```

`tar_stan_mcmc(example, ...)` defines several targets:

* `example_file_x`: Reproducibly track changes to the Stan model file.
* `example_data`: Run the code you supplied to `data` and return a dataset compatible with Stan.
* `example_mcmc_x`: Run the MCMC and return an object of class `CmdStanMCMC`.
* `example_draws_X`: Return a friendly `tibble` of the posterior draws from `example`. Uses the `$draws()` method. Suppress with `draws = FALSE` in `tar_stan_mcmc()`.
* `example_summaries_x`: Return a friendly `tibble` of the posterior summaries from `example`. Uses the `$summary()` method. Suppress with `summary = FALSE` in `tar_stan_mcmc()`.
* `example_diagnostics_x`: Return a friendly `tibble` of the sampler diagnostics from `example`. Uses the `$sampler_diagnostics()` method. Suppress with `diagnostics = FALSE` in `tar_stan_mcmc()`.

The suffix `_x` comes from the base name of the model file, in this case `x.stan`. If you supply multiple model files to the `stan_files` argument, all the models share the same dataset, and the suffixes distinguish among the various targets.

Run `tar_manifest()` to see a tidy data frame of commands and settings for the targets.

```{r}
tar_manifest()
```

[`targets`](https://docs.ropensci.org/targets/) produces a graph to show the dependency relationships among the targets. Below, the MCMC depends on the model file and the data, and the draws, summary, and diagnostics depend on the MCMC.

```{r, output = FALSE, message = FALSE}
tar_visnetwork(targets_only = TRUE)
```

Run the computation with `tar_make()`.

```{r, output = FALSE}
tar_make()
```

The output lives in a special folder called `_targets/` and you can retrieve it with functions `tar_load()` and `tar_read()` (from [`targets`](https://docs.ropensci.org/targets/)).

```{r}
tar_read(example_summary_x)
```

At this point, all our results are up to date because their dependencies did not change.

```{r}
tar_make()
```

But if we change the underlying code or data, some of the targets will no longer be valid, and they will rerun during the next `tar_make()`. Below, we change the Stan model file, so the MCMC reruns while the data is skipped. This behavior saves time and enhances reproducibility.

```{r}
write(" ", file = "x.stan", append = TRUE)
```

```{r}
tar_outdated()
```

```{r}
tar_visnetwork(targets_only = TRUE)
```

```{r, output = FALSE}
tar_make()
```

At this point, we can add more targets and custom functions for additional post-processing.

```{r, echo = FALSE}
library(targets)
tar_script({
  library(stantargets)
  options(crayon.enabled = FALSE)
  tar_option_set(memory = "transient", garbage_collection = TRUE)
  list(
    tar_stan_mcmc(
      example,
      "x.stan",
      tar_stan_example_data(),
      stdout = R.utils::nullfile(),
      stderr = R.utils::nullfile()
    ),
    tar_stan_summary(
      custom_summary,
      fit = example_mcmc_x,
      summaries = list(~posterior::quantile2(.x, probs = c(0.25, 0.75)))
    )
  )
})
```

```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)

generate_data <- function() {
  true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- stats::rnorm(n, x * true_beta, 1)
  list(n = n, x = x, y = y, true_beta = true_beta)
}

list(
  tar_stan_mcmc(
    example,
    "x.stan",
    generate_data(),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  ),
  tar_stan_summary(
    custom_summary,
    fit = example_mcmc_x,
    summaries = list(~posterior::quantile2(.x, probs = c(0.25, 0.75)))
  )
)
```

In the graph, our new `custom_summary` target should be connected to the upstream `example` target, and only `custom_summary` should be out of date.

```{r}
tar_visnetwork(targets_only = TRUE)
```

In the next `tar_make()`, we skip the expensive MCMC and just run the custom summary.

```{r, output = FALSE, warning = FALSE}
tar_make()
```

```{r}
tar_read(custom_summary)
```

## Multiple models

`tar_stan_mcmc()` and related functions allow you to supply multiple models to `stan_files`. If you do, each model will run on the same dataset. Below, we add another `y.stan` model to the `stan_files` argument of `tar_stan_mcmc()`.

```{r, echo = FALSE}
library(targets)
file.copy("x.stan", "y.stan")
tar_script({
  library(stantargets)
  options(crayon.enabled = FALSE)
  tar_option_set(memory = "transient", garbage_collection = TRUE)
  list(
    tar_stan_mcmc(
      example,
      c("x.stan", "y.stan"), # another model
      tar_stan_example_data(),
      stdout = R.utils::nullfile(),
      stderr = R.utils::nullfile()
    ),
    tar_stan_summary(
      custom_summary,
      fit = example_mcmc_x,
      summaries = list(~posterior::quantile2(.x, probs = c(0.25, 0.75)))
    )
  )
})
```

```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)

generate_data <- function() {
  true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- stats::rnorm(n, x * true_beta, 1)
  list(n = n, x = x, y = y, true_beta = true_beta)
}

list(
  tar_stan_mcmc(
    example,
    c("x.stan", "y.stan"), # another model
    generate_data(),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  ),
  tar_stan_summary(
    custom_summary,
    fit = example_mcmc_x,
    summaries = list(~posterior::quantile2(.x, probs = c(0.25, 0.75)))
  )
)
```

In the graph below, notice how the `*_x` targets and `*_y` targets are both connected to `example_data` upstream.

```{r}
tar_visnetwork(targets_only = TRUE)
```

## Generated quantities

It is possible to use the `CmdStanMCMC` object from one run to simulate generated quantities downstream. For example, the `tar_stan_gq_rep_summaries()` function takes a single `CmdStanMCMC` object, produces multiple replications of generated quantities from multiple models, and aggregates the summaries from each. The following pipeline uses this technique to repeatedly draw from the posterior predictive distribution.

```{r}
lines <- "data {
  int <lower = 1> n;
  vector[n] x;
  vector[n] y;
}
parameters {
  real beta;
}
model {
  y ~ normal(x * beta, 1);
  beta ~ normal(0, 1);
}
generated quantities {
  real y_rep[n] = normal_rng(x * beta, 1); // posterior predictive draws
}"
writeLines(lines, "gen.stan")
```

```{r, echo = FALSE}
library(targets)
tar_script({
  library(stantargets)
  options(crayon.enabled = FALSE)
  tar_option_set(memory = "transient", garbage_collection = TRUE)
  list(
    tar_stan_mcmc(
      example,
      "x.stan",
      tar_stan_example_data(),
      stdout = R.utils::nullfile(),
      stderr = R.utils::nullfile()
    ),
    tar_stan_gq_rep_summary(
      postpred,
      stan_files = "gen.stan",
      fitted_params = example_mcmc_x, # one CmdStanFit object
      data = tar_stan_example_data(), # multiple simulated datasets
      batches = 2, # 2 dynamic branches
      reps = 5, # 5 replications per branch
      quiet = TRUE,
      stdout = R.utils::nullfile(),
      stderr = R.utils::nullfile()
    )
  )
})
```

```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)

generate_data <- function() {
  true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- stats::rnorm(n, x * true_beta, 1)
  list(n = n, x = x, y = y, true_beta = true_beta)
}

list(
  tar_stan_mcmc(
    example,
    "x.stan",
    tar_stan_example_data(),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  ),
  tar_stan_gq_rep_summary(
    postpred,
    stan_files = "gen.stan",
    fitted_params = example_mcmc_x, # one CmdStanFit object
    data = generate_data(), # Function runs once per rep.
    batches = 2, # 2 dynamic branches
    reps = 5, # 5 replications per branch
    quiet = TRUE,
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  )
)
```

Since we have defined many objects in the pipeline, it is extra important to check the graph to be sure everything is connected.

```{r}
tar_visnetwork(targets_only = TRUE)
```

Then, we run the computation. The original MCMC is already up to date, so we only run the targets relevant to the generated quantities.

```{r}
tar_make()
```

Finally, we read the summaries of posterior predictive samples.

```{r}
tar_read(postpred)
```

## More information

For more on [`targets`](https://docs.ropensci.org/targets/), please visit the reference website <https://docs.ropensci.org/targets/> or the user manual <https://books.ropensci.org/targets/>. The manual walks though advanced features of `targets` such as [high-performance computing](https://books.ropensci.org/targets/hpc.html) and [cloud storage support](https://books.ropensci.org/targets/cloud.html).