Outline

In a broad range of scientific disciplines, Bayesian statistics is gaining popularity. Yet the basic training of most researchers introduces only the frequentist framework for statistical inference. This workshop aims to introduce Bayesian statistics in a practical way, laying a foundation for the crucial concepts in the Bayesian realm.

Participants will be introduced to a workflow for Bayesian analyses in the open-source environment R. The starting point will be the linear model (aka the regression model), which we will extend to linear mixed models. The central piece of software will be the brms package in R, which bridges the typical R modelling language with Stan, a probabilistic programming language built to estimate models within the Bayesian framework. Given that a workflow in R will be introduced, it is advised that participants have some experience working with R. Moreover, packages from the tidyverse (e.g., dplyr, ggplot2, …) will be used. Therefore, we advise participants to familiarize themselves with these packages (see the tab Preparations).

Rooms
When?                                     Where?
Friday June 7 (9:00 am - 11:00 am)        Room Pub209
Monday June 10 (10:00 am - 12:00 pm)      Room Pub209
Wednesday June 12 (10:00 am - 12:00 pm)   Room Pub209
Friday June 14 (10:00 am - 12:00 pm)      Room Pub209
Intro

For the workshop we will be using R. To be able to participate properly, it is advisable to install a number of packages on your own PC beforehand, and to become familiar with the way we usually code in R. This document is a brief guide to help you prepare.

Necessary Software & Packages

CmdStan & CmdStanR

Bayesian estimation and analyses require dedicated samplers that need to be installed on your machine. The workflow that we will cover in the workshop relies on the Stan software. Stan is a stand-alone program, but it can be called from R in two different ways. We will cover this in more detail in the workshop.

For the workshop we will use the CmdStan toolchain and the cmdstanr package. Therefore, we encourage you to download and install a working CmdStan and cmdstanr beforehand. The following vignette can help you accomplish this: https://mc-stan.org/cmdstanr/articles/cmdstanr.html

Normally, if you follow the steps described in that vignette, you will succeed in installing everything and testing your installation.
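In short, the installation boils down to something like this (see the vignette for details, including the required C++ toolchain):

# install the cmdstanr package (from the Stan R-universe repository, as in the vignette)
install.packages("cmdstanr", repos = c("https://stan-dev.r-universe.dev", getOption("repos")))
# check the toolchain, then let cmdstanr download and compile CmdStan itself
cmdstanr::check_cmdstan_toolchain()
cmdstanr::install_cmdstan()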

brms

For the analyses we will rely on the amazing brms package. Installing this package is not difficult, but do install it before the workshop and start exploring ;-)

tidyverse

When we code, we use the functional programming approach and a lot of the tidy principles (see below). The tidyverse package bundles a set of packages that are needed for this approach to coding. If you want to learn more about the whole tidyverse universe, you can explore Hadley Wickham's book R for Data Science: https://r4ds.had.co.nz/

ggplot2

The ggplot2 package is part of the tidyverse bundle, so normally it will be installed along with tidyverse. Nevertheless, we find it worth mentioning separately, because we think knowing how to make graphs with ggplot2 is very handy even if you do not dive into Bayesian analyses. A great resource for learning to visualize your data is the tutorial written by Cédric Scherer: https://www.cedricscherer.com/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/

He also blogs about creating great visualizations and has given some nice talks that you can find on YouTube!

patchwork

We mention this package separately as well. patchwork allows you to combine different plots created with ggplot2. This package has to be installed on its own.

tidybayes

tidybayes is a dedicated package for applying the tidy principles to (the results of) Bayesian analyses.

bayesplot

We will use visualizations a lot in Bayesian analyses to summarize the information of models and parameter estimates. bayesplot bundles a number of functions that make this visualization much easier (although it uses ggplot2 under the hood, so we could accomplish similar results with ggplot2 alone).
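All the R packages mentioned above can be installed in one go:

install.packages(c("brms", "tidyverse", "patchwork", "tidybayes", "bayesplot"))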

Coding in the tidy universe…

There are a great number of roads that lead to Rome! Coding in R can be accomplished in several ways, which means that code can get messy as well. Therefore, we try to stick as much as possible to a set of rules described as the tidy way of coding. Some basic ideas are key to this approach. We briefly describe them here so you will recognize this way of working during the workshop.

The pipe

When coding, we make use of what is called the pipe operator (note that you can only use this pipe after loading the tidyverse package).

The pipe in R code is the following: %>%

What is nice about using the pipe operator is that you can read a piece of code from left to right, focusing on the verbs (as Hadley Wickham sometimes calls them).

A short example.

Imagine you have a vector of numbers for which you want to calculate a mean: c(1,2,3,4,5).

There are different ways to do this in R.

A first one is doing it in separate lines of code:
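x <- c(1, 2, 3, 4, 5)  # store the vector in an object
mean(x)                # calculate the mean of that object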

[1] 3

For this easy example this would work fine. But a typical side-effect of working this way is that we create a lot of objects along the way. Here we created the object x that stored the values of the vector.

To avoid this we could combine all the code in one line by embedding functions:
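mean(c(1, 2, 3, 4, 5))  # wrap the vector directly in mean()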

[1] 3

One line of code and no new object stored. But this can get complicated if you want to combine a set of functions, resulting in almost unreadable code!

Then there is the pipe operator:
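c(1, 2, 3, 4, 5) %>% mean()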

[1] 3

This is how to read this code:
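c(1, 2, 3, 4, 5) %>%  # take the vector AND THEN
  mean()              # calculate the mean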

[1] 3

So the pipe can be read as: take (or create) A, and then do B with the result. We could create a whole pipeline of functions to be applied in this way, making use of the pipe operator (this will become clearer in the next section, where we briefly describe the dplyr verbs).

A tweet that we encountered gives a great analogy. Using the pipe you could write:

I %>% woke up %>% took a shower %>% got breakfast %>% took the metro %>% arrived at work %>% …

A recent update of R also introduced this idea of a pipe as a "native pipe". It is written as |> and functions in a similar way as %>%, with the advantage that it also works outside the tidyverse. We are slowly transforming all our material and code to make use of this native pipe. But, chances are high that you will encounter some old-fashioned tidyverse pipes in the code during the workshop…

I |> learned about the native pipe |> transformed my code |> potentially failed somewhere …
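In code, both pipes do the same thing for simple cases like our example; the native pipe just works without loading any package (R >= 4.1):

c(1, 2, 3, 4, 5) |> mean()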

dplyr verbs

Within the tidyverse we can make use of a great package called dplyr for a broad spectrum of data-management functions. The functions provided in that package are sometimes also called verbs:

  • mutate() to recode or compute new variables;
  • select() to select a subset of columns/variables;
  • filter() to filter cases/rows;
  • group_by() to apply what comes afterwards to specific groups in the data, defined by a grouping variable;
  • arrange() to sort the data;
  • summarize() to apply summarizing functions (like taking the mean);
  • rename() to rename columns.

These functions can be combined in a single statement making use of the pipe operator.

Here are some examples of the use of these verbs, applied to the built-in starwars dataset that is part of the tidyverse package.

Calculate the mean birth_year for all the characters:
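starwars %>%
  summarize(mean_birth_year = mean(birth_year, na.rm = TRUE))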

# A tibble: 1 × 1
  mean_birth_year
            <dbl>
1            87.6

Calculate the average height and the average mass by gender:
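starwars %>%
  group_by(gender) %>%
  summarize(mean_height = mean(height, na.rm = TRUE),
            mean_mass   = mean(mass, na.rm = TRUE))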

# A tibble: 3 × 3
  gender    mean_height mean_mass
  <chr>           <dbl>     <dbl>
1 feminine         167.      54.7
2 masculine        177.     107. 
3 <NA>             175       81  

Now, let's use the verbs to create a graph. We want to create a scatterplot with mass on the x-axis and height on the y-axis. We also want to add a third variable to the mix: whether or not the character's birth year is higher than 50 (just for fun). Finally, we only want to show characters that have a mass lower than 200 and whose birth_year is known.
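One way to write this (the name of the helper variable birth_year_above_50 is just illustrative):

starwars %>%
  filter(mass < 200, !is.na(birth_year)) %>%         # only known birth years and mass below 200
  mutate(birth_year_above_50 = birth_year > 50) %>%  # our just-for-fun third variable
  ggplot(aes(x = mass, y = height, colour = birth_year_above_50)) +
  geom_point()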

Outline

In the first part, we will introduce the basic rationale behind Bayesian statistics. We start by explaining the three key components of a Bayesian model: the prior, the likelihood, and the posterior.

Then, we switch to the estimation of parameters by first introducing grid approximation and then outlining the basic idea of MCMC sampling.

At the end of the session, we introduce brms and learn how to estimate a simple regression model with brms, using just the summary() and plot() functions to get insight into the model results.
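As a minimal sketch of what this looks like in code (here with R's built-in mtcars data for illustration; in the session we will use the marathon data):

library(brms)

fit <- brm(mpg ~ wt, data = mtcars)  # a simple linear regression
summary(fit)  # posterior summaries of the parameters
plot(fit)     # trace plots and posterior densities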

Materials

Slides

The html version of the slides for this first part can be found here

Data

For this first part, we use a straightforward dataset on predicting racetimes for a marathon. The data can be downloaded here (right-click to save as).

References and resources

The data come from Kaggle

Paul Bürkner’s presentation available on YouTube: click here

Interactive tool demonstrating MCMC sampling: click here

brms homepage: click here

Outline

The second part of this workshop taps into the concept of priors. Prior probability distributions are of crucial importance in Bayesian models. brms will set default priors for all the parameters in your model. Nevertheless, as a researcher it is important to understand what your priors actually mean; just using the default priors without a critical understanding of them is sub-optimal. We will discuss and practice how to think about priors and how to use them in brms.
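As a minimal sketch (again with R's built-in mtcars data for illustration): you can inspect the default priors with get_prior() and replace them with your own in the brm() call:

library(brms)

get_prior(mpg ~ wt, data = mtcars)  # show the default priors for this model
fit <- brm(mpg ~ wt, data = mtcars,
           prior = prior(normal(0, 1), class = "b"))  # our own prior for the slope(s)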

Slides

The html version of the slides for this part can be found here

Data

For this part, we again use the straightforward dataset on predicting racetimes for a marathon. The data can be downloaded here (right-click to save as).

Outline

In this part of the workshop we go one step further. We first briefly introduce the idea of leave-one-out cross-validation, which can be used to compare alternative models. Next, we estimate some Bayesian mixed effects models and think carefully about the different priors. We consider alternative priors for both the fixed and random effect parameters in the model, and use prior predictive checking to assess the impact of these priors. Finally, we also introduce the concept of posterior predictive checking to evaluate our model.
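A minimal sketch of the tools involved (the models and data are illustrative; the functions are from brms):

library(brms)

fit1 <- brm(mpg ~ wt, data = mtcars)
fit2 <- brm(mpg ~ wt + am, data = mtcars)
loo_compare(loo(fit1), loo(fit2))  # compare the models with leave-one-out cross-validation

prior_fit <- brm(mpg ~ wt, data = mtcars,
                 prior = prior(normal(0, 10), class = "b"),
                 sample_prior = "only")  # sample from the priors only, ignoring the data
pp_check(prior_fit)  # prior predictive check
pp_check(fit1)       # posterior predictive check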

Slides

The html version of the slides for this part can be found here

Data

For this part, we use a dataset on the effect of a writing intervention. The data can be downloaded here (right-click to save as).

Outline

This part of the workshop is about interpreting and reporting the results of our models. We demonstrate different ways to make sense of the information in the posterior probability distributions of our parameters. First, we show how different packages in R can be used to visually explore the posterior probability distributions. Next, we focus on how to numerically summarize the information in the posterior probability distribution to support our reporting of the results.
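A minimal sketch of both routes (fit is an illustrative brmsfit object, as before):

library(brms)

fit <- brm(mpg ~ wt, data = mtcars)
mcmc_plot(fit, type = "areas")  # visual: posterior densities with credible intervals
posterior_summary(fit)          # numeric: estimates, errors, and interval bounds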

Slides

The html version of the slides for this part can be found here

Data

For this part, we use a dataset on the effect of a writing intervention. The data can be downloaded here (right-click to save as).

R scripts

A powerful way to visualize the effects of (mixed effects) regression models is to plot Hypothetical Outcome Plots. These types of plots require a considerable amount of coding. Therefore, we provide two R scripts that can be used as a starting point for plotting your own results:
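To give a flavour of the idea, here is a minimal sketch that draws a handful of plausible regression lines from the posterior (fit is an illustrative brmsfit on R's built-in mtcars data; the provided scripts are more elaborate):

library(brms)
library(tidybayes)
library(ggplot2)

fit <- brm(mpg ~ wt, data = mtcars)
mtcars |>
  add_epred_draws(fit, ndraws = 50) |>  # 50 posterior draws of the expected outcome
  ggplot(aes(x = wt, y = .epred, group = .draw)) +
  geom_line(alpha = 0.2) +                               # one line per posterior draw
  geom_point(aes(y = mpg, group = NULL), data = mtcars)  # observed data on top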

Outline

In this final part, it is your turn! We have created an integrated exercise.

On the following page you can find an html file with the tasks for this exercise: click here

The dataset for this exercise can be downloaded here: Subtitles.RData

If you get stuck while making this exercise, you can consult an example response here

References and resources

The data for the exercise is a simulated dataset based on the following study:

Frumuselu, Anca Daniela, Sven De Maeyer, Vincent Donche, and María del Mar Gutiérrez Colon Plana. 2015. “Television Series Inside the EFL Classroom: Bridging the Gap Between Teaching and Learning Informal Language Through Subtitles.” Linguistics and Education 32 (December): 107–17. https://doi.org/10.1016/j.linged.2015.10.001.

Info

During the course we briefly introduced the WAMBS (When to Worry and how to Avoid the Misuse of Bayesian Statistics) checklist as a tool to structure your thinking about, and checking of, your Bayesian model(s).

To see an example of this WAMBS checklist applied to the final model for the marathon data, click here.

The Quarto document to make this report can be found here (right-click to save).

If you want to render the Quarto document and have the references in there, you also need to download the references.bib file and store it in the same folder as the Quarto document. The references file can be downloaded here (right-click to save).

References and resources

Depaoli, S., & Van de Schoot, R. (2017). Improving transparency and replication in Bayesian statistics: The WAMBS-Checklist. Psychological Methods, 22(2), 240.

Van de Schoot, R., Veen, D., Smeets, L., Winter, S. D., & Depaoli, S. (2020). A tutorial on using the WAMBS checklist to avoid the misuse of Bayesian statistics. In R. van de Schoot & M. Miocevic (Eds.), Small Sample Size Solutions: A Guide for Applied Researchers and Practitioners (pp. 30–49).