How to make your simulation study reproducible

This post is brought to you partially by my lovely colleagues Javier and Marvin, who were far more excited about something I did than I was myself. Maybe that is due to the circumstances that led me to design this feature of bayesim, a simulation framework for studies on Bayesian models that I wrote for a paper and that will probably stay close to the center of my PhD thesis.

Picture the scene: it is March 4 and I have been working on bayesim for a little over a month. I had thrown together a prototype before as a proof of concept and felt ready to actually start working on an R package. The core simulation loop was mostly done and I had the first brms custom families implemented. Just add support for some more metrics, hack my way into loo to make it work with rmse, and we would be done. But after the first test runs with what we had, some of the results didn’t make sense. Now, we couldn’t just save all the model objects; that would take too much storage. All we had was a result dataframe that told us one of our models had an \(\widehat{R}\) of \(10^{13}\) (this later prompted a discourse thread, another good reason to be able to replicate a model).

The question at that point was whether I had messed up my custom family’s density function (which happened a lot more often than I would like to admit), or whether there was a bug in the data-generating code, the sampling hyperparameters, or any number of other things. But remember, all I had was a list of parameters for data generation and model construction. I didn’t have the actual dataset, the initial values, or anything else that is decided randomly during the simulation loop. Of course, I could have simply started the R debugger and stepped through the code. But trust me when I tell you that this wasn’t a solution (I tried). If, however, I could take all the parameters, run a single simulation step, and somehow have all random things do the exact same thing as during the original simulation loop, then stepping through the code would be feasible. Luckily, there is a way to achieve exactly that, and I’ll show you how I made it possible to replicate every single dataset, model, and resulting metric from the large study individually by shoving a dataframe row into a function.

Before we delve in, a small note: I wrote my code in R, but this approach should work for any language that allows you to set a random seed.

Random Seeds

R has this tiny little function called set.seed(). It seeds R’s random number generator. Below you can see that after calling set.seed(1), the random number directly following that statement is the same both times. However, calling rnorm twice after setting the seed produces two different results. This is important to remember: every RNG call advances the generator’s internal state, so what you get next depends on everything that was drawn since the last set.seed() call. One way to reproduce anything that includes random numbers is therefore to set seeds everywhere and save them together with the results. If you want to replicate a result, you can simply run the original code with the specific seed stored next to it.

set.seed(1)
rnorm(1)
## [1] -0.6264538
rnorm(1)
## [1] 0.1836433
set.seed(1)
rnorm(1)
## [1] -0.6264538
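
In practice, that save-and-replay pattern can be as small as the sketch below (seed, result, and value are just illustrative names):

seed <- 42
set.seed(seed)
result <- list(value = rnorm(1), seed = seed)

# Later: replay the draw with the seed stored next to the result
set.seed(result$seed)
rnorm(1) == result$value
## [1] TRUE

The rest of this post is essentially this pattern, scaled up to a whole simulation study.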

Toy Simulation Study

Instead of throwing bayesim at you, I will build a toy simulation study here to showcase the idea that I used.

Let’s say for some reason we are really interested in the mean of datasets generated by beta distributions with different shape parameters.

sample <- rbeta(100, shape1 = 2, shape2 = 2)
mean(sample)
## [1] 0.5197291

We also have a list of combinations of shape1 and shape2 that we want to test in a fully crossed design as shown below:

shape1_list <- c(1, 2)
shape2_list <- c(1, 2)

simulate <- function(s1_list, s2_list) {
  # Preallocates the results list (one slot per parameter combination)
  results <- vector(
    mode = "list",
    length = length(s1_list) * length(s2_list)
  )
  index <- 1
  for (s1 in s1_list) {
    for (s2 in s2_list) {
      results[[index]] <- list(
        "mean" = mean(rbeta(100, shape1 = s1, shape2 = s2)),
        "s1" = s1,
        "s2" = s2
      )
      index <- index + 1
    }
  }
  # Combines the list of individual results into a nice data frame
  return(do.call(rbind, lapply(results, data.frame)))
}

simulate(shape1_list, shape2_list)
##        mean s1 s2
## 1 0.5097277  1  1
## 2 0.3178730  1  2
## 3 0.6817081  2  1
## 4 0.4746368  2  2

Great! Our simulation study works, and we should have all the information we need to reproduce it, right? After all, we have everything that went into creating the results. But try it for yourself and you will see that you get a (slightly) different set of means every time, because each run draws fresh random numbers. We can use seed setting to make everything reproducible:

shape1_list <- c(1, 2)
shape2_list <- c(1, 2)

simulate_with_replication <- function(s1_list, s2_list, seed) {
  set.seed(seed)
  results <- vector(
    mode = "list",
    length = length(s1_list) * length(s2_list)
  )
  index <- 1
  for (s1 in s1_list) {
    for (s2 in s2_list) {
      results[[index]] <- list(
        "mean" = mean(rbeta(100, shape1 = s1, shape2 = s2)),
        "s1" = s1,
        "s2" = s2
      )
      index <- index + 1
    }
  }
  return(do.call(rbind, lapply(results, data.frame)))
}

results <- simulate_with_replication(shape1_list, shape2_list, 1)
results[1, ]
##        mean s1 s2
## 1 0.4655728  1  1
results <- simulate_with_replication(shape1_list, shape2_list, 1)
results[1, ]
##        mean s1 s2
## 1 0.4655728  1  1
simulate_with_replication(results$s1[2], results$s2[2], 1)
##        mean s1 s2
## 1 0.3031687  1  2

Now, every time you run the whole study with the same seed, you get the same overall results. Great! However, you still can’t replicate individual results (besides the first one), because each row depends on the RNG state left behind by all the draws before it. For this, we need to explicitly set a seed before each individual simulation step.

shape1_list <- c(1, 2)
shape2_list <- c(1, 2)

simulate_with_replication <- function(s1_list, s2_list, seed) {
  set.seed(seed)
  # Here we generate a list of all the individual seeds from our initial
  # seed, so we don't have to write the long list out manually.
  # We use those large integers because someone once told me they contain
  # more entropy or something (mostly, a large range just makes accidental
  # seed collisions unlikely).
  seed_list <- sample(1000000000:.Machine$integer.max,
    size = length(s1_list) * length(s2_list)
  )
  results <- vector(
    mode = "list",
    length = length(s1_list) * length(s2_list)
  )
  index <- 1
  for (s1 in s1_list) {
    for (s2 in s2_list) {
      set.seed(seed_list[[index]])
      results[[index]] <- list(
        "mean" = mean(rbeta(100, shape1 = s1, shape2 = s2)),
        "s1" = s1,
        "s2" = s2,
        "seed" = seed_list[[index]]
      )
      index <- index + 1
    }
  }
  return(do.call(rbind, lapply(results, data.frame)))
}

replicate_result <- function(result_row) {
  set.seed(result_row$seed)
  data <- rbeta(100, shape1 = result_row$s1, shape2 = result_row$s2)
  replica <- list(
    "data" = data,
    "mean" = mean(data),
    "s1" = result_row$s1,
    "s2" = result_row$s2,
    "seed" = result_row$seed
  )
  return(replica)
}
results <- simulate_with_replication(shape1_list, shape2_list, 1)
results[2, ]
##        mean s1 s2       seed
## 2 0.3277834  1  2 1312928384
replica <- replicate_result(results[2, ])
replica$mean
## [1] 0.3277834

And this is the whole magic. You can now take any row from the result data frame, put it into the replicate_result function, and not only get the same mean as before, but also recover any additional information from the process you might be interested in, like, in this case, the generated sample itself.
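
As a quick sanity check (reusing the objects from above), the replayed mean matches the stored one exactly, and the replica now also carries the full dataset:

replica$mean == results$mean[2]
## [1] TRUE
length(replica$data)
## [1] 100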

If you are doing something with nested loops and functions, like I did in bayesim, you can pass a single seed to the next layer, generate a new seed list from it there, and repeat that all the way down. Save all those individual seeds, and when you write your replication function, set the respective layer seeds exactly where you would have in the original code. Just make sure to follow the original flow of your code, so that the functions depending on the random seed run in the same order, as the sketch below illustrates. If you want to see this working on a larger scale, this file contains the core of the simulation loop for bayesim.
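
To make that concrete, here is a minimal sketch of the layered idea built on the toy study (outer_layer, inner_layer, and the seed column names are hypothetical, not bayesim’s actual code): each layer derives its own seed list from the single seed it was handed, and every seed it uses is saved into the result row.

outer_layer <- function(s1_list, s2_list, seed) {
  set.seed(seed)
  # One child seed per inner layer, derived from this layer's seed
  inner_seeds <- sample(1000000000:.Machine$integer.max,
    size = length(s1_list)
  )
  rows <- vector(mode = "list", length = length(s1_list))
  for (i in seq_along(s1_list)) {
    rows[[i]] <- inner_layer(s1_list[[i]], s2_list, inner_seeds[[i]])
  }
  return(do.call(rbind, rows))
}

inner_layer <- function(s1, s2_list, seed) {
  set.seed(seed)
  # One seed per individual simulation step within this layer
  step_seeds <- sample(1000000000:.Machine$integer.max,
    size = length(s2_list)
  )
  rows <- vector(mode = "list", length = length(s2_list))
  for (j in seq_along(s2_list)) {
    set.seed(step_seeds[[j]])
    rows[[j]] <- data.frame(
      "mean" = mean(rbeta(100, shape1 = s1, shape2 = s2_list[[j]])),
      "s1" = s1,
      "s2" = s2_list[[j]],
      # inner_seed replays this whole layer, step_seed just this step
      "inner_seed" = seed,
      "step_seed" = step_seeds[[j]]
    )
  }
  return(do.call(rbind, rows))
}

A replication function for a single row then only needs step_seed, exactly like replicate_result above.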
