While not as popular as AI or machine learning, Bayesian modeling has grown in popularity over the past few years. From a theoretical viewpoint, Bayesian modeling is attractive because it lets the user build domain-specific prior information into a model. The downside is that Bayesian modeling is computationally intensive, so anything other than a small model can take a long time to run. This used to make Bayesian methods somewhat undesirable, but with modern computing power it is less of an issue. Bayesian models can get complex, but in today’s “how-to” post I’ll demonstrate how to specify a simple but effective model using the programming language R.

Before we dive in, just a reminder that all of our “How To” posts can be found here.


The model that we’ll be building today is called the “beta-binomial” model. Without getting into the technical details of conjugate priors, this is a simple model that illustrates quite nicely the power of Bayesian modeling. There are three steps involved: establishing a prior, finding the likelihood, and then combining the two to get the posterior distribution. Since we focus on minor league baseball here at Down on the Farm, we’ll use this model to estimate a hypothetical pitcher’s strikeout rate at a new level after he makes his first start. This is something that we do all the time, so it is a practical example as well as an illustrative one.
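To make the three steps concrete, here is a minimal sketch of how the prior and the likelihood combine in the beta-binomial model. It relies on the standard conjugacy result: a Beta(alpha, beta) prior on the strikeout rate, combined with k strikeouts in n batters faced, gives a Beta(alpha + k, beta + n - k) posterior. The function name `update_beta_binomial` and all of the numbers are made up for illustration; they are not real data for any pitcher.

```r
# Conjugate beta-binomial update (a sketch; the function name is hypothetical).
# Prior:      p ~ Beta(alpha, beta)
# Likelihood: k strikeouts in n batters faced ~ Binomial(n, p)
# Posterior:  p | k ~ Beta(alpha + k, beta + n - k)
update_beta_binomial <- function(alpha, beta, k, n) {
  stopifnot(alpha > 0, beta > 0, n >= 0, k >= 0, k <= n)
  c(alpha = alpha + k, beta = beta + n - k)
}

# Made-up numbers purely for illustration:
# a prior centered on 25% and a first start with 7 strikeouts in 20 batters faced.
posterior <- update_beta_binomial(alpha = 5, beta = 15, k = 7, n = 20)

# The posterior mean is alpha / (alpha + beta), evaluated at the updated parameters.
posterior_mean <- posterior[["alpha"]] / (posterior[["alpha"]] + posterior[["beta"]])

print(posterior)
print(posterior_mean)
```

The update is just addition: strikeouts are added to alpha and non-strikeouts to beta, which is why this model is so cheap to compute.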

Choosing a model and establishing a prior

We’ve already established that we are using the beta-binomial model, but why are we using this model, aside from it just being simple? The binomial distribution is used to model repeated binary events, which is what a plate appearance is from a strikeout perspective (either the pitcher strikes out the hitter, or he doesn’t). It has two parameters: n, the number of batters faced, and p, the probability of striking out a given batter. For a given n and p, the binomial distribution gives the probability of each possible number of strikeouts, from 0 to n.
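As a quick illustration, base R’s `dbinom()` returns these binomial probabilities. The values of n and p below are assumptions chosen only to show the shape of the distribution.

```r
# Binomial distribution of strikeouts for assumed values of n and p.
n <- 20    # hypothetical number of batters faced
p <- 0.25  # assumed probability of striking out any given batter

# Probability of each possible strikeout total, from 0 to n
strikeouts <- 0:n
prob <- dbinom(strikeouts, size = n, prob = p)

binom_table <- data.frame(strikeouts = strikeouts, probability = round(prob, 4))
print(binom_table)

# The expected number of strikeouts is n * p
expected_strikeouts <- n * p
print(expected_strikeouts)
```

The probabilities sum to 1 across all possible strikeout totals, and the distribution is centered near n * p.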

The number of batters faced doesn’t need to be estimated, but p does. If we have a large sample, we can get a good estimate of p, the strikeout rate, just by looking at the player’s FanGraphs or Baseball Reference page. But what if a player just got called up to a new level, or was just drafted? There isn’t any level-specific data available for him yet, so we have to make some assumptions. Being able to use domain knowledge is why we are interested in Bayesian modeling in the first place, and this is where we start to establish our prior distribution.

As our example, we are going to use Ethan Pecko, a sixth-round pick out of Towson in the 2023 draft. Since we don’t know much about him other than that he’s a college draftee with a middling draft status, a good starting point is to assume that he is roughly a league-average pitcher, without being too confident in that assessment. In 2023, the average Single-A pitcher had a 25% strikeout rate, so we will assume Pecko will also have roughly a 25% strikeout rate, again keeping in mind that we don’t have much information and don’t want to be too confident. Next, we need to choose our prior distribution for p.

Since we are using the beta-binomial model, our prior distribution is the beta distribution. Like the binomial distribution, the beta distribution has two parameters, alpha and beta. In this context, alpha can be thought of as the number of strikeouts and beta as the number of batters faced who did not strike out. The average, or expected value, of the beta distribution given alpha and beta is alpha / (alpha + beta). In our context, the average strikeout rate is 25%, so we want to choose alpha and beta such that alpha / (alpha + beta) = 0.25.
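Many pairs of alpha and beta satisfy that equation, and the sum alpha + beta controls how confident the prior is. The sketch below compares two candidate priors with the same 25% mean. The two “prior strength” values are hypothetical, picked only to show the contrast, and are not necessarily the values used for Pecko.

```r
# Two beta priors with the same mean but different levels of confidence.
prior_mean <- 0.25

# Hypothetical values of alpha + beta: roughly, how many batters faced
# the prior is "worth".
prior_strength <- c(weak = 20, strong = 200)

alpha <- prior_mean * prior_strength        # prior "strikeouts"
beta  <- (1 - prior_mean) * prior_strength  # prior "non-strikeouts"

priors <- data.frame(
  alpha = alpha,
  beta  = beta,
  mean  = alpha / (alpha + beta),
  lower = qbeta(0.025, alpha, beta),  # 2.5% quantile of the prior
  upper = qbeta(0.975, alpha, beta)   # 97.5% quantile of the prior
)
print(priors)
```

Both rows have a mean of 0.25, but the weaker prior spreads its probability over a much wider range of strikeout rates, which is the behavior we want when we don’t know much about a pitcher.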
