6.4 Bayesian Multiple Regression

Needed libraries

To model and visualize Bayesian multi-variate regressions in R, we first need to load some libraries.

  • For making models:
library(rstanarm)
  • For isolating coefficients using the tidy function:
library(broom)
  • For plotting model uncertainty using the spread_draws function:
library(tidybayes)
  • For assigning colors to lines:
library(scales)
  • And, of course:
library(tidyverse)

6.4.1 Term Review

In Section 5.4, we learned all about single-variate Bayesian modeling in R (i.e. modeling with only one explanatory variable): we learned about the rstanarm library and its stan_glm function, about the outputs of the summary function when called on a Bayesian model, and about using the posterior_interval function to generate credible intervals (as well as how credible intervals differ from frequentist confidence intervals).

We know that, in the summary output, instead of singular point estimates and fit statistics, we get a distribution of values for those estimates—the mean, the standard deviation, and the 10th, 50th, and 90th percentile values. We also know about a few other variables:

In Estimates and MCMC diagnostics, we get:

  • sigma, which is the standard deviation of errors

In Fit Diagnostics, we get (again, the mean, standard deviation, and 10th, 50th, and 90th percentile values of):

  • mean_PPD, which is the sample mean of “the posterior predictive distribution of the outcome variable”

In MCMC diagnostics, we get:

  • log-posterior, which is the log of the combined posterior distributions, similar to the log-likelihood of a frequentist regression
  • mcse, which is (and stands for) the Monte Carlo standard error
  • Rhat, which indicates whether or not the model converges, and, in turn, whether or not our results are reliable. Posterior distributions are sampled in chains, and the Rhat statistic compares the variance across chains to the variance within each chain. Rhat values less than 1.1 indicate model convergence, and, according to the “How to Use the rstanarm Package” document on the CRAN website, if any Rhat values exceed 1.1, rstanarm will give you the following warning message: “Markov chains did not converge! Do not analyze results!”
  • n_eff, which is “a crude measure” of the effective sample size

And we know that a 90% credible interval tells us that there is a 90% probability that the true value falls within the interval, while a frequentist 90% confidence interval tells us that, in repeated sampling, 90% of the intervals constructed this way would contain the true value.
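Recall, too, that posterior_interval returns 90% credible intervals by default; its prob argument lets you request a different width. A minimal sketch (here, fit is a placeholder for any fitted rstanarm model, such as the ones we make below):

# Request a 95% credible interval instead of the default 90% one; "fit" is a
# placeholder for a fitted model object like score_model_interaction_bayes
posterior_interval(fit, prob = 0.95)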

6.4.2 Performing Bayesian Multi-Variate Modeling in R

We now move from single-variate to multi-variate Bayesian modeling (i.e. modeling with more than one explanatory variable). Recall that, in this chapter, we have covered two types of multiple regression models: interaction models and parallel slopes models.

Interaction model - Bayesian

Recall our use of the lm function to make score_model_interaction, a frequentist multi-variate interaction model where \(y = score\), \(x_1 = age\), \(x_2 = gender\), and \(age\) and \(gender\) are interacting, as denoted by the * separating them:

score_model_interaction <- lm(score ~ age * gender, data = evals_ch7)

The syntax for making Bayesian multi-variate interaction models is quite similar to the above code. Below, we make score_model_interaction_bayes, get summary information about the model, and get the credible interval for the model:

# Fit regression model, now setting refresh equal to 0 so as to suppress all of
# the "SAMPLING FOR MODEL 'continuous' NOW (CHAIN #)" outputs

score_model_interaction_bayes <- stan_glm(score ~ age * gender, data = evals_ch7, refresh = 0)
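One caveat: because stan_glm estimates the posterior via random MCMC sampling, your numbers will differ slightly from ours every time the model is refit. We did not do so here, but if you want reproducible draws, one option is to pass a seed to stan_glm (the value 123 below is arbitrary):

# Optional: fixing a seed makes repeated runs return identical posterior draws
score_model_interaction_bayes <- stan_glm(score ~ age * gender, data = evals_ch7,
                                          refresh = 0, seed = 123)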

# Get important summary information

summary(score_model_interaction_bayes)
## 
## Model Info:
##  function:     stan_glm
##  family:       gaussian [identity]
##  formula:      score ~ age * gender
##  algorithm:    sampling
##  sample:       4000 (posterior sample size)
##  priors:       see help('prior_summary')
##  observations: 463
##  predictors:   4
## 
## Estimates:
##                  mean   sd   10%   50%   90%
## (Intercept)     4.9    0.2  4.6   4.9   5.1 
## age             0.0    0.0  0.0   0.0   0.0 
## gendermale     -0.4    0.3 -0.8  -0.4  -0.1 
## age:gendermale  0.0    0.0  0.0   0.0   0.0 
## sigma           0.5    0.0  0.5   0.5   0.6 
## 
## Fit Diagnostics:
##            mean   sd   10%   50%   90%
## mean_PPD 4.2    0.0  4.1   4.2   4.2  
## 
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
## 
## MCMC diagnostics
##                mcse Rhat n_eff
## (Intercept)    0.0  1.0  1564 
## age            0.0  1.0  1523 
## gendermale     0.0  1.0  1324 
## age:gendermale 0.0  1.0  1263 
## sigma          0.0  1.0  2756 
## mean_PPD       0.0  1.0  3284 
## log-posterior  0.0  1.0  1714 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
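If you would rather check convergence programmatically than by scanning the MCMC diagnostics table, one possible sketch is below. It assumes (as is the case at the time of writing) that summary called on a stan_glm model returns a matrix containing the Rhat column printed above:

# Flag the model if any parameter's Rhat exceeds the 1.1 threshold from
# Section 6.4.1; given the all-1.0 Rhat values above, this returns FALSE
fit_summary <- summary(score_model_interaction_bayes)
any(fit_summary[, "Rhat"] > 1.1)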
# Get credible intervals

posterior_interval(score_model_interaction_bayes)
##                          5%          95%
## (Intercept)     4.551495824  5.210501282
## age            -0.024668060 -0.010399308
## gendermale     -0.856906264 -0.004390902
## age:gendermale  0.004291578  0.022085146
## sigma           0.504760775  0.560817404
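These credible intervals also showcase what makes Bayesian models special: because we have the full set of 4,000 posterior draws, we can compute direct probabilities about parameters. As one hedged sketch, here is a way to estimate the posterior probability that the age:gendermale interaction coefficient is positive, using as.data.frame to extract the draws:

# Extract the posterior draws as a data frame (one column per parameter)
draws <- as.data.frame(score_model_interaction_bayes)

# The proportion of draws in which the interaction coefficient is positive
# approximates the probability that it is greater than zero
mean(draws[["age:gendermale"]] > 0)

Since the 90% credible interval for age:gendermale above lies entirely above zero, this proportion should come out above 0.95.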

Parallel slopes model - Bayesian

Now, recall our use of the lm function to make score_model_parallel_slopes, a frequentist multi-variate parallel slopes model where \(y = score\), \(x_1 = age\), \(x_2 = gender\), and \(age\) and \(gender\) are not interacting, as denoted by the + separating them:

score_model_parallel_slopes <- lm(score ~ age + gender, data = evals_ch7)

Once again, the syntax for making Bayesian multi-variate parallel slopes models is quite similar to the above code. Below, we make score_model_parallel_slopes_bayes, get summary information about the model, and get the credible interval for the model:

# Fit regression model, now setting refresh equal to 0 so as to suppress all of
# the "SAMPLING FOR MODEL 'continuous' NOW (CHAIN #)" outputs

score_model_parallel_slopes_bayes <- stan_glm(score ~ age + gender, data = evals_ch7, refresh = 0)

# Get important summary information

summary(score_model_parallel_slopes_bayes)
## 
## Model Info:
##  function:     stan_glm
##  family:       gaussian [identity]
##  formula:      score ~ age + gender
##  algorithm:    sampling
##  sample:       4000 (posterior sample size)
##  priors:       see help('prior_summary')
##  observations: 463
##  predictors:   3
## 
## Estimates:
##               mean   sd   10%   50%   90%
## (Intercept) 4.5    0.1  4.3   4.5   4.6  
## age         0.0    0.0  0.0   0.0   0.0  
## gendermale  0.2    0.1  0.1   0.2   0.3  
## sigma       0.5    0.0  0.5   0.5   0.6  
## 
## Fit Diagnostics:
##            mean   sd   10%   50%   90%
## mean_PPD 4.2    0.0  4.1   4.2   4.2  
## 
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
## 
## MCMC diagnostics
##               mcse Rhat n_eff
## (Intercept)   0.0  1.0  4323 
## age           0.0  1.0  3948 
## gendermale    0.0  1.0  4471 
## sigma         0.0  1.0  4394 
## mean_PPD      0.0  1.0  4403 
## log-posterior 0.0  1.0  2077 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
# Get credible intervals

posterior_interval(score_model_parallel_slopes_bayes)
##                      5%          95%
## (Intercept)  4.27786429  4.692453379
## age         -0.01307291 -0.004222844
## gendermale   0.09914391  0.277101307
## sigma        0.50731788  0.565009957
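One more note before moving on: the Model Info section of both summaries says priors: see help('prior_summary'). We never specified any priors ourselves, so rstanarm quietly used its default, weakly informative ones; you can inspect them with the prior_summary function:

# Display the default priors rstanarm placed on the intercept, the
# coefficients, and sigma
prior_summary(score_model_parallel_slopes_bayes)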

6.4.3 Visualizing Bayesian Multi-Variate Regression Models

So far in this chapter, we have made Bayesian multi-variate regression models and examined their regression tables; now, it is time to visualize them.

Recall first our visualization of our frequentist multi-variate interaction model, score_model_interaction:

ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
  geom_point() +
  labs(x = "Age", y = "Teaching Score", color = "Gender") +
  geom_smooth(method = "lm", se = FALSE)

Now recall our visualization of our frequentist multi-variate parallel slopes model, score_model_parallel_slopes, using ModernDive’s special function gg_parallel_slopes:

gg_parallel_slopes(y = "score", num_x = "age", cat_x = "gender", 
                   data = evals_ch7)

Visualizing Bayesian multi-variate regression models is certainly a more involved process: we do not automatically get singular point estimates for our intercepts and slopes with Bayes, and we do not have a special method = "lm" argument or gg_parallel_slopes function with Bayes (although the absence of such things presents an opportunity for the development of a new package!). And, as you will see, all of our work in visualizing these Bayesian models will result in graphs very similar to the frequentist ones above. However, it is still very useful to know how to visualize Bayesian models, as only Bayesian models allow us to make probabilistic inferences about their parameters.

Bayesian interaction model - visualization

And so, let’s learn how to visualize our Bayesian multi-variate interaction model, score_model_interaction_bayes.

Because, as previously mentioned, we do not automatically get singular point estimates for our Bayesian models’ intercepts and slopes, our first step is to retrieve those singular parameter estimates. We can do this using the tidy function, which can be found in the broom library:

# Call "tidy" on "score_model_interaction_bayes" in order to retrieve the
# model's singular parameter estimates; store this tibble as "tidy_coefs"

tidy_coefs <- tidy(score_model_interaction_bayes)

# Print "tidy_coefs" so as to view the model's parameter estimates

tidy_coefs
## # A tibble: 4 x 3
##   term           estimate std.error
##   <chr>             <dbl>     <dbl>
## 1 (Intercept)      4.88     0.198  
## 2 age             -0.0175   0.00433
## 3 gendermale      -0.441    0.261  
## 4 age:gendermale   0.0135   0.00535

If we compare these parameter estimates to those of the frequentist version of this model, score_model_interaction, we can see that they differ slightly. This is expected: the Bayesian estimates blend the data with rstanarm’s default weakly informative priors, and they are summarized from a finite number of random posterior draws.

get_regression_table(score_model_interaction)
## # A tibble: 4 x 7
##   term           estimate std_error statistic p_value lower_ci upper_ci
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         4.88      0.205     23.8    0        4.48     5.29 
## 2 age              -0.018     0.004     -3.92   0       -0.026   -0.009
## 3 gendermale       -0.446     0.265     -1.68   0.094   -0.968    0.076
## 4 age:gendermale    0.014     0.006      2.45   0.015    0.003    0.024

Our next step is to isolate each parameter estimate so that we can use them in making our geom_ablines: we store the first row of the estimate column of tidy_coefs, which corresponds to the term (Intercept), as intercept; the second row, which corresponds to the term age, as age_slope; and so on:

# Isolate and store each parameter estimate from the "estimate" column of
# "tidy_coefs"

intercept <- tidy_coefs$estimate[1]
age_slope <- tidy_coefs$estimate[2]
gendermale_slope <- tidy_coefs$estimate[3]
age_gendermale_slope <- tidy_coefs$estimate[4]
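A quick aside: indexing by row position works, but it silently breaks if the row order of tidy_coefs ever changes. A slightly more defensive alternative (a sketch using dplyr’s filter and pull functions, which come with the tidyverse) extracts each estimate by its term name instead:

# Alternative: extract each estimate by matching its term name, rather than
# relying on row order
intercept <- tidy_coefs %>% filter(term == "(Intercept)") %>% pull(estimate)
age_slope <- tidy_coefs %>% filter(term == "age") %>% pull(estimate)

# ...and so on for "gendermale" and "age:gendermale"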

Now that we have extracted the estimated values of each coefficient, we can plug them into the following equation in order to find the equation of the line if male and the equation of the line if female:

\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_{12}x_1x_2\)

If \(y = score\), \(x_1 = age\), \(x_2 = gender\), and \(x_1x_2 = age * gender\), then the equations are as follows:

If female:

  • \(y = \beta_0 + \beta_1(x_1)\)

Which equates to, in y = b + mx form:

  • y = intercept + age_slope(age)

If male:

  • \(y = (\beta_0 + \beta_2) + (\beta_1 + \beta_{12})(x_1)\)

Which equates to, in y = b + mx form:

  • y = (intercept + gendermale_slope) + (age_slope + age_gendermale_slope)(age)
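Plugging in the estimates printed in tidy_coefs above, these equations work out to (roughly, after rounding):

  • Female: \(\widehat{score} = 4.88 - 0.018(age)\)
  • Male: \(\widehat{score} = (4.88 - 0.44) + (-0.018 + 0.014)(age) = 4.44 - 0.004(age)\)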

Thus, we can define the following terms: female_intercept and female_slope, and male_intercept and male_slope:

# Define "female_intercept" and "female_slope" using the above calculations

female_intercept <- intercept
female_slope <- age_slope

# Define "male_intercept" and "male_slope" using the above calculations

male_intercept <- intercept + gendermale_slope
male_slope <- age_slope + age_gendermale_slope

For the sake of aesthetics and professionalism, we now recode the gender variable in the evals_ch7 dataset so that “female” appears as “Female” and “male” appears as “Male” in the legends of our graphics.

# Recode "gender" in "evals_ch7"

evals_ch7 <- evals_ch7 %>%
  mutate(gender = recode(gender, `female` = "Female", `male` = "Male"))
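If you would like to verify that the recode behaved as expected, one quick optional check is to list the distinct values of gender:

# Confirm that "gender" now contains only "Female" and "Male"
evals_ch7 %>% distinct(gender)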

Finally, we can make our ggplot! Read the in-line code-comments for a step-by-step explanation of what’s going on in the code below!

# Plot the modified "evals_ch7" dataset using ggplot, setting "age" as the
# x-axis variable, setting "score" as the y-axis variable, and coloring points
# by "gender"

ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
  
  # Make the plot a scatterplot
  
  geom_point() + 
  
  # Add the estimated regression line that corresponds to "female" data, setting
  # "female_intercept" (defined above according to our calculations) as its
  # intercept and setting "female_slope" (defined above according to our
  # calculations) as its slope.  We use the "hue_pal" (hue palette) function to
  # set the color of the line to match the color of the corresponding (coral)
  # points on the scatterplot.  The syntax works like this: hue_pal() returns a
  # palette function, and hue_pal()(2) calls that function to spit out two
  # colors: the two default colors used in this scatterplot, coral and teal, or
  # #F8766D and #00BFC4.  (The number 2 in that call tells the palette function
  # how many colors to spit out: change the call to hue_pal()(3), and it will
  # spit out three colors.)  hue_pal()(2)[1] selects the first of these two
  # colors, coral or #F8766D.
  # Finally, we set the size, or the thickness, of the line to 1.
  
  geom_abline(intercept = female_intercept, slope = female_slope, 
              color = hue_pal()(2)[1], 
              size = 1) +
  
  # Do the same to add the estimated regression line that corresponds to "male"
  # data, setting "male_intercept" (defined above according to our calculations)
  # as its intercept and setting "male_slope" (defined above according to our
  # calculations) as its slope.  Again, use the "hue_pal" function to set the
  # color of the line to match the color of the corresponding (teal) points on
  # the scatterplot.  This time, specify the color to be hue_pal()(2)[2] to
  # select the second of the two colors, teal or #00BFC4.  Again, set the size
  # of the line to 1.
  
  geom_abline(intercept = male_intercept, slope = male_slope,
              color = hue_pal()(2)[2], 
              size = 1) +
  
  # Give the x-axis the label "Age", give the y-axis the label "Teaching Score",
  # and give the legend the title "Gender"
  
  labs(x = "Age",
       y = "Teaching Score",
       color = "Gender")