To model and visualize Bayesian multivariate regressions in R, we need to load some libraries.

- For making models:

`library(rstanarm)`

- For isolating coefficients using the `tidy` function:

`library(broom)`

- For plotting model uncertainty using the `spread_draws` function:

`library(tidybayes)`

- For assigning colors to lines:

`library(scales)`

- And, of course:

`library(tidyverse)`

In Chapter 5.4, we learned all about single-variate Bayesian modeling in R (i.e. modeling with only one explanatory variable). We learned about the `rstanarm` library and the `stan_glm` function, we learned about the outputs of the `summary` function when called on a Bayesian model, and we learned about using the `posterior_interval` function to generate credible intervals and how credible intervals differ from frequentist confidence intervals.

We know that, in the `summary` output, instead of singular point estimates and fit statistics, we get a distribution of values for those estimates: the mean, the standard deviation, and the 10th, 50th, and 90th percentile values. We also know about a few other variables:

In `Estimates` and `MCMC diagnostics`, we get:

- `sigma`, which is the standard deviation of the errors

In `Fit Diagnostics`, we get (again, as the mean, standard deviation, and 10th, 50th, and 90th percentile values):

- `mean_PPD`, which is the sample mean of “the posterior predictive distribution of the outcome variable”

In `MCMC diagnostics`, we get:

- `log-posterior`, which is the log of the combined posterior distributions, similar to the likelihood of a frequentist regression
- `mcse`, which is (and stands for) the Monte Carlo standard error
- `Rhat`, which indicates whether or not the model converges and, in turn, whether or not our results are reliable. Posterior distributions are sampled in chains, and the Rhat statistic is a measure of within-chain variance compared to across-chain variance. Rhat values less than 1.1 indicate model convergence, and, according to the “How to Use the rstanarm Package” document on the CRAN website, if any Rhat values exceed 1.1, `rstanarm` will give you the following warning message: “Markov chains did not converge! Do not analyze results!”
- `n_eff`, which is “a crude measure” of the effective sample size
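The Rhat rule is easy to check programmatically, since `summary` called on a fitted `stan_glm` model returns a matrix whose columns include `Rhat`. Below is a minimal sketch; `converged` is our own hypothetical helper, not part of `rstanarm`, and with a real model you would call it as `converged(summary(fit))`:

```r
# Minimal sketch of checking the Rhat < 1.1 convergence rule for every
# parameter at once. "converged" is our own hypothetical helper; with a real
# model, the diagnostics matrix would come from summary() on a stan_glm fit.
converged <- function(diagnostics) {
  all(diagnostics[, "Rhat"] < 1.1, na.rm = TRUE)
}

# Toy diagnostics matrix standing in for a summary() result
toy <- matrix(c(1.0, 1.0, 1.2), ncol = 1,
              dimnames = list(c("(Intercept)", "age", "sigma"), "Rhat"))
converged(toy)  # FALSE: one parameter's Rhat exceeds 1.1
```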

And we know that a 90% credible interval tells us that there is a 90% probability that the true value falls within the interval, while a frequentist 90% confidence interval tells us that, in repeated sampling, the true value falls within the confidence interval 90% of the time.
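Under the hood, a credible interval is nothing more than quantiles of the posterior draws. Here is a self-contained sketch: synthetic draws stand in for a real posterior, while with a fitted model the draws would come from something like `as.matrix(fit)[, "age"]`:

```r
# A 90% credible interval is just the 5th and 95th percentiles of the
# posterior draws. We use synthetic draws here so the snippet stands alone;
# with a fitted stan_glm model, the draws would come from, e.g.,
# as.matrix(fit)[, "age"].
set.seed(42)
draws <- rnorm(4000, mean = -0.017, sd = 0.004)  # stand-in posterior draws
quantile(draws, probs = c(0.05, 0.95))
```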

Moving on from single-variate to multi-variate Bayesian modeling (i.e. modeling with more than one explanatory variable), in this chapter, we have covered two types of multiple regression models: interaction models and parallel slopes models.

Recall our use of the `lm` function to make `score_model_interaction`, a frequentist multi-variate interaction model where \(y = score\), \(x_1 = age\), \(x_2 = gender\), and \(age\) and \(gender\) are interacting, as denoted by the `*` separating them:

`score_model_interaction <- lm(score ~ age * gender, data = evals_ch7)`

The syntax for making Bayesian multi-variate interaction models is quite similar to the above code. Below, we make `score_model_interaction_bayes`, get summary information about the model, and get the credible interval for the model:

```
# Fit regression model, now setting refresh equal to 0 so as to suppress all of
# the "SAMPLING FOR MODEL 'continuous' NOW (CHAIN #)" outputs
score_model_interaction_bayes <- stan_glm(score ~ age * gender,
                                          data = evals_ch7, refresh = 0)
# Get important summary information
summary(score_model_interaction_bayes)
```

```
## 
## Model Info:
##  function:     stan_glm
##  family:       gaussian [identity]
##  formula:      score ~ age * gender
##  algorithm:    sampling
##  sample:       4000 (posterior sample size)
##  priors:       see help('prior_summary')
##  observations: 463
##  predictors:   4
## 
## Estimates:
##                  mean   sd   10%   50%   90%
## (Intercept)     4.9    0.2  4.6   4.9   5.1 
## age             0.0    0.0  0.0   0.0   0.0 
## gendermale     -0.4    0.3 -0.8  -0.4  -0.1 
## age:gendermale  0.0    0.0  0.0   0.0   0.0 
## sigma           0.5    0.0  0.5   0.5   0.6 
## 
## Fit Diagnostics:
##            mean   sd   10%   50%   90%
## mean_PPD 4.2    0.0  4.1   4.2   4.2  
## 
## The mean_ppd is the sample average posterior predictive distribution of
## the outcome variable (for details see help('summary.stanreg')).
## 
## MCMC diagnostics
##                mcse Rhat n_eff
## (Intercept)    0.0  1.0  1564 
## age            0.0  1.0  1523 
## gendermale     0.0  1.0  1324 
## age:gendermale 0.0  1.0  1263 
## sigma          0.0  1.0  2756 
## mean_PPD       0.0  1.0  3284 
## log-posterior  0.0  1.0  1714 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude
## measure of effective sample size, and Rhat is the potential scale reduction
## factor on split chains (at convergence Rhat=1).
```

```
# Get credible intervals
posterior_interval(score_model_interaction_bayes)
```

```
##                          5%           95%
## (Intercept)     4.551495824  5.210501282
## age            -0.024668060 -0.010399308
## gendermale     -0.856906264 -0.004390902
## age:gendermale  0.004291578  0.022085146
## sigma           0.504760775  0.560817404
```
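These intervals hint at the real payoff of the Bayesian approach: with the posterior draws in hand, we can compute direct probabilities about parameters. Here is a self-contained sketch; synthetic draws stand in for the real posterior, which, with the fitted model, would come from `as.matrix(score_model_interaction_bayes)[, "age:gendermale"]`:

```r
# The proportion of posterior draws above zero estimates the probability
# that a coefficient is positive. Synthetic draws stand in for the real
# posterior here; with the fitted model they would come from
# as.matrix(score_model_interaction_bayes)[, "age:gendermale"].
set.seed(76)
draws <- rnorm(4000, mean = 0.0135, sd = 0.0055)  # stand-in posterior draws
mean(draws > 0)  # estimated P(interaction coefficient > 0)
```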

Now, recall our use of the `lm` function to make `score_model_parallel_slopes`, a frequentist multi-variate parallel slopes model where \(y = score\), \(x_1 = age\), \(x_2 = gender\), and \(age\) and \(gender\) are **not** interacting, as denoted by the `+` separating them:

`score_model_parallel_slopes <- lm(score ~ age + gender, data = evals_ch7)`

Once again, the syntax for making Bayesian multi-variate parallel slopes models is quite similar to the above code. Below, we make `score_model_parallel_slopes_bayes`, get summary information about the model, and get the credible interval for the model:

```
# Fit regression model, now setting refresh equal to 0 so as to suppress all of
# the "SAMPLING FOR MODEL 'continuous' NOW (CHAIN #)" outputs
score_model_parallel_slopes_bayes <- stan_glm(score ~ age + gender,
                                              data = evals_ch7, refresh = 0)
# Get important summary information
summary(score_model_parallel_slopes_bayes)
```

```
## 
## Model Info:
##  function:     stan_glm
##  family:       gaussian [identity]
##  formula:      score ~ age + gender
##  algorithm:    sampling
##  sample:       4000 (posterior sample size)
##  priors:       see help('prior_summary')
##  observations: 463
##  predictors:   3
## 
## Estimates:
##              mean   sd   10%   50%   90%
## (Intercept) 4.5    0.1  4.3   4.5   4.6 
## age         0.0    0.0  0.0   0.0   0.0 
## gendermale  0.2    0.1  0.1   0.2   0.3 
## sigma       0.5    0.0  0.5   0.5   0.6 
## 
## Fit Diagnostics:
##            mean   sd   10%   50%   90%
## mean_PPD 4.2    0.0  4.1   4.2   4.2  
## 
## The mean_ppd is the sample average posterior predictive distribution of
## the outcome variable (for details see help('summary.stanreg')).
## 
## MCMC diagnostics
##               mcse Rhat n_eff
## (Intercept)   0.0  1.0  4323 
## age           0.0  1.0  3948 
## gendermale    0.0  1.0  4471 
## sigma         0.0  1.0  4394 
## mean_PPD      0.0  1.0  4403 
## log-posterior 0.0  1.0  2077 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude
## measure of effective sample size, and Rhat is the potential scale reduction
## factor on split chains (at convergence Rhat=1).
```

```
# Get credible intervals
posterior_interval(score_model_parallel_slopes_bayes)
```

```
##                      5%          95%
## (Intercept)  4.27786429  4.692453379
## age         -0.01307291 -0.004222844
## gendermale   0.09914391  0.277101307
## sigma        0.50731788  0.565009957
```

In this chapter, we have made Bayesian multi-variate regression models, examined their regression tables, and visualized them.

Recall first our visualization of our frequentist multi-variate interaction model, `score_model_interaction`:

```
ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
  geom_point() +
  labs(x = "Age", y = "Teaching Score", color = "Gender") +
  geom_smooth(method = "lm", se = FALSE)
```

Now recall our visualization of our frequentist multi-variate parallel slopes model, `score_model_parallel_slopes`, using Modern Dive’s special function `gg_parallel_slopes`:

```
gg_parallel_slopes(y = "score", num_x = "age", cat_x = "gender",
                   data = evals_ch7)
```

Visualizing Bayesian multi-variate regression models is certainly a more involved process, as we do not automatically get singular point estimates for our intercepts and slopes with Bayes, and as we do not have a special `method = "lm"` argument or `gg_parallel_slopes` function with Bayes (although the absence of such things for Bayesian visualization presents an opportunity for the development of a new package!). And, as you will see later, all of our work in visualizing these Bayesian models will result in graphs very similar to the frequentist ones above. However, it is still very, very useful to know how to visualize Bayesian models, as only Bayesian models allow us to make probabilistic inferences about their parameters.

And so, let’s learn how to visualize our Bayesian multi-variate interaction model, `score_model_interaction_bayes`.

Because, as previously mentioned, we do not automatically get singular point estimates for our Bayesian models’ intercepts and slopes, our first step is to retrieve those singular parameter estimates. We can do this using the `tidy` function, which can be found in the `broom` library:

```
# Call "tidy" on "score_model_interaction_bayes" in order to retrieve the
# model's singular parameter estimates; store this tibble as "tidy_coefs"
tidy_coefs <- tidy(score_model_interaction_bayes)
# Print "tidy_coefs" so as to view the model's parameter estimates
tidy_coefs
```

```
## # A tibble: 4 x 3
##   term           estimate std.error
##   <chr>             <dbl>     <dbl>
## 1 (Intercept)     4.88      0.198  
## 2 age            -0.0175    0.00433
## 3 gendermale     -0.441     0.261  
## 4 age:gendermale  0.0135    0.00535
```

If we compare these parameter estimates to those of the frequentist version of this model, `score_model_interaction`, we can see that they differ slightly:

`get_regression_table(score_model_interaction)`

```
## # A tibble: 4 x 7
##   term           estimate std_error statistic p_value lower_ci upper_ci
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         4.88      0.205     23.8    0        4.48     5.29 
## 2 age              -0.018     0.004     -3.92   0       -0.026   -0.009
## 3 gendermale       -0.446     0.265     -1.68   0.094   -0.968    0.076
## 4 age:gendermale    0.014     0.006      2.45   0.015    0.003    0.024
```

Our next step is to isolate each parameter estimate so that we can use them in making our `geom_abline`s: storing the first row of `tidy_coefs`’s `estimate` column, which corresponds to the term `(Intercept)`, as `intercept`; storing the second row, which corresponds to the term `age`, as `age_slope`; and so on:

```
# Isolate and store each parameter estimate from the "estimate" column of
# "tidy_coefs"
intercept <- tidy_coefs$estimate[1]
age_slope <- tidy_coefs$estimate[2]
gendermale_slope <- tidy_coefs$estimate[3]
age_gendermale_slope <- tidy_coefs$estimate[4]
```

Now that we have extracted the estimated values of each coefficient, we can plug them into the following equation in order to find the equation of the line if male and the equation of the line if female:

\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_{12}x_1x_2\)

If \(y = score\), \(x_1 = age\), \(x_2 = gender\), and \(x_1x_2 = age * gender\), then the equations are as follows:

If female:

- \(y = \beta_0 + \beta_1(x_1)\)

Which equates to, in `y = b + mx` form:

**y = intercept + age_slope(age)**

If male:

- \(y = (\beta_0 + \beta_2) + (\beta_1 + \beta_{12})(x_1)\)

Which equates to, in `y = b + mx` form:

**y = (intercept + gendermale_slope) + (age_slope + age_gendermale_slope)(age)**

Thus, we can define the following terms: `female_intercept` and `female_slope`, and `male_intercept` and `male_slope`:

```
# Define "female_intercept" and "female_slope" using the above calculations
female_intercept <- intercept
female_slope <- age_slope
# Define "male_intercept" and "male_slope" using the above calculations
male_intercept <- intercept + gendermale_slope
male_slope <- age_slope + age_gendermale_slope
```
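As a quick sanity check on this algebra, we can plug a sample age into both line equations using the rounded estimates from the `tidy` output above. The names `b0` through `b12` below are ours, chosen so as not to overwrite the unrounded objects defined from `tidy_coefs`:

```r
# Plug age = 40 into both lines, using the rounded estimates from the tidy
# output: 4.88, -0.0175, -0.441, 0.0135. Fresh names (b0, b1, b2, b12) are
# used so the unrounded objects defined above are left untouched.
b0  <- 4.88    # intercept
b1  <- -0.0175 # age slope
b2  <- -0.441  # gendermale offset
b12 <- 0.0135  # age:gendermale interaction slope

b0 + b1 * 40                 # female line at age 40: 4.18
(b0 + b2) + (b1 + b12) * 40  # male line at age 40:   4.279
```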

For the sake of aesthetics and professionalism, we here recode the `gender` variable in the `evals_ch7` dataset so that “female” appears as “Female” and “male” appears as “Male” in the legends of our graphics.

```
# Recode "gender" in "evals_ch7"
evals_ch7 <- evals_ch7 %>%
  mutate(gender = recode(gender, `female` = "Female", `male` = "Male"))
```

Finally, we can make our `ggplot`! Read the in-line code comments for a step-by-step explanation of what’s going on in the code below!

```
# Plot the modified "evals_ch7" dataset using ggplot, setting "age" as the
# x-axis variable, setting "score" as the y-axis variable, and coloring points
# by "gender"
ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
  # Make the plot a scatterplot
  geom_point() +
  # Add the estimated regression line that corresponds to "female" data,
  # setting "female_intercept" (defined above according to our calculations)
  # as its intercept and setting "female_slope" (defined above according to
  # our calculations) as its slope. We use the "hue_pal" (hue palette)
  # function to set the color of the line to match the color of the
  # corresponding (coral) points on the scatterplot. The syntax works like
  # this: hue_pal() spits out the palette function, and hue_pal()(2) spits out
  # two colors: the two default colors used in this scatterplot--coral and
  # teal, or #F8766D and #00BFC4. (The number 2 in this statement tells
  # hue_pal how many colors to spit out: change the statement to hue_pal()(3),
  # and it will spit out three colors.) hue_pal()(2)[1] selects the first of
  # these two colors, coral or #F8766D. Finally, we set the size, or the
  # thickness, of the line to 1.
  geom_abline(intercept = female_intercept, slope = female_slope,
              color = hue_pal()(2)[1],
              size = 1) +
  # Do the same to add the estimated regression line that corresponds to
  # "male" data, setting "male_intercept" (defined above according to our
  # calculations) as its intercept and setting "male_slope" (defined above
  # according to our calculations) as its slope. Again, use the "hue_pal"
  # function to set the color of the line to match the color of the
  # corresponding (teal) points on the scatterplot. This time, specify the
  # color to be hue_pal()(2)[2] to select the second of the two colors, teal
  # or #00BFC4. Again, set the size of the line to 1.
  geom_abline(intercept = male_intercept, slope = male_slope,
              color = hue_pal()(2)[2],
              size = 1) +
  # Give the x-axis the label "Age", give the y-axis the label
  # "Teaching Score", and give the legend the title "Gender"
  labs(x = "Age",
       y = "Teaching Score",
       color = "Gender")
```
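The `hue_pal` trick from the comments above can be verified on its own (a quick sketch, assuming `scales` is loaded as at the start of the chapter):

```r
library(scales)

# hue_pal()(n) returns the first n default ggplot2 hues; with two groups,
# these are the coral and teal used for the points in our scatterplot
hue_pal()(2)     # "#F8766D" "#00BFC4"
hue_pal()(2)[1]  # coral, used for the "Female" line
hue_pal()(2)[2]  # teal, used for the "Male" line
```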