library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
df <- readr::read_rds("df.rds")
step_1_df <- readr::read_rds("step_1_df.rds")
step_2_a_df <- readr::read_rds("step_2_a_df.rds")
step_2_b_df <- readr::read_rds("step_2_b_df.rds")
This investigation is divided into 5 main parts, of which we have completed the first, on exploratory data analysis. Although our ultimate goal is to fit more comprehensive Bayesian models to the data, we will first focus on getting a feel for the behavior of response_1 as a function of the step 1 inputs, using lm().
Therefore, part ii is split into three sub-parts, as such:

iiA. Regression models - lm()
iiB. Regression models - Bayesian
iiC. Regression models - Models with Resampling

This R Markdown file tackles part iiA, specifically using lm() to fit several linear models to the dataset.
Relating response_1 to discrete inputs with lm()

Let's fit a linear model for response_1, using a linear relationship with the discrete inputs, xA and xB, as additive terms. That is, we want to build a model for estimating response_1 based on the discrete inputs xA and xB:
\[ \mathrm{response\_1} = b_0 + b_1\mathrm{xA} + b_2\mathrm{xB} \]
With lm(), the linear model is easy to fit using the formula interface:
mod_lm_discrete <- lm(response_1 ~ xA + xB, step_1_df)
Summarize the model results with the summary() function.
mod_lm_discrete %>% summary()
##
## Call:
## lm(formula = response_1 ~ xA + xB, data = step_1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.2049 -2.6076 -0.2137 2.7004 26.1931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3763 0.2426 13.915 < 2e-16 ***
## xAA2 0.3426 0.2051 1.670 0.095 .
## xBB2 -2.4703 0.2826 -8.743 < 2e-16 ***
## xBB3 -7.7089 0.2881 -26.759 < 2e-16 ***
## xBB4 -1.4109 0.3046 -4.632 3.85e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.543 on 2008 degrees of freedom
## Multiple R-squared: 0.2954, Adjusted R-squared: 0.294
## F-statistic: 210.5 on 4 and 2008 DF, p-value: < 2.2e-16
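The coefficient names xAA2, xBB2, xBB3, and xBB4 come from the way lm() treatment-codes the factors: each non-reference level of xA and xB gets its own indicator column. As a minimal sketch (assuming xA and xB are stored as factors in step_1_df, as those names suggest), the dummy-coded design matrix can be inspected directly:
# peek at the design matrix lm() builds from the formula
model.matrix(response_1 ~ xA + xB, data = step_1_df) %>% head()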
From the summary output, the p-value of the F-statistic is < 2.2e-16, which is highly significant; it is likely that at least one of the discrete inputs is related to response_1.
From the coefficients table, xA appears to have the weakest effect on response_1: its xAA2 term has the smallest absolute estimate, the smallest absolute t-value, and the highest p-value of the t-tests. This suggests that xA is the least significant input.
Compute and store the confidence intervals of the model coefficients.
mod_lm_discrete_confint <- mod_lm_discrete %>% confint()
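If only the xB coefficients are of interest, confint() can be restricted with its parm argument. A minimal sketch, using the coefficient names shown in the summary above:
# confidence intervals for just the xB dummy coefficients
mod_lm_discrete %>% confint(parm = c("xBB2", "xBB3", "xBB4"))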
Relating response_1 to continuous inputs

Next, we fit a linear model for response_1 using a linear relationship with the continuous step 1 inputs, x01 to x06:
mod_lm_cont <- lm(response_1 ~ . -xA -xB, step_1_df)
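The . shorthand minus the two factors is equivalent to listing the six continuous inputs explicitly; a minimal sketch of the same fit written out in full:
# equivalent explicit formula for the continuous-only model
mod_lm_cont_explicit <- lm(response_1 ~ x01 + x02 + x03 + x04 + x05 + x06, data = step_1_df)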
Summarize the model results with the summary() function.
mod_lm_cont %>% summary()
##
## Call:
## lm(formula = response_1 ~ . - xA - xB, data = step_1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.9007 -2.1221 0.0251 2.2287 22.5315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.14788 32.36197 1.148 0.2512
## x01 1.01973 0.06305 16.174 < 2e-16 ***
## x02 -3.36156 0.22277 -15.090 < 2e-16 ***
## x03 0.15331 0.03640 4.212 2.64e-05 ***
## x04 0.07727 0.03852 2.006 0.0450 *
## x05 0.04013 0.01610 2.493 0.0127 *
## x06 0.32723 0.29464 1.111 0.2669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.842 on 2006 degrees of freedom
## Multiple R-squared: 0.2003, Adjusted R-squared: 0.1979
## F-statistic: 83.75 on 6 and 2006 DF, p-value: < 2.2e-16
From the result above, x01, x02, and x03 are the most significant continuous inputs: they have the lowest p-values and the largest absolute t-statistics, as indicated by the *** flags on their rows of the coefficient table.
Compute the confidence intervals of the continuous-input coefficients.
mod_lm_cont %>% confint()
## 2.5 % 97.5 %
## (Intercept) -26.318713740 100.61446692
## x01 0.896079310 1.14337588
## x02 -3.798443295 -2.92468149
## x03 0.081931903 0.22469779
## x04 0.001730871 0.15280775
## x05 0.008562060 0.07169289
## x06 -0.250608248 0.90505900
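The same estimates and intervals can also be collected in a single tidy table, which is convenient when comparing models later. A minimal sketch, assuming the broom package (installed alongside the tidyverse) is available:
# coefficient estimates with 95% confidence intervals in one tibble
broom::tidy(mod_lm_cont, conf.int = TRUE, conf.level = 0.95)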
Relating response_1 to all step 1 inputs

Now, we fit a linear model for response_1 using a linear relationship with all of the step 1 inputs (xA, xB, and x01 to x06) as additive terms:
mod_lm_step1 <- lm(response_1 ~ ., step_1_df)
Summarize the model results with the summary() function.
mod_lm_step1 %>% summary()
##
## Call:
## lm(formula = response_1 ~ ., data = step_1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.8852 -2.1475 -0.1666 2.0917 19.4622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180.491682 27.664490 6.524 8.63e-11 ***
## xAA2 0.432349 0.173763 2.488 0.0129 *
## xBB2 -2.407999 0.253956 -9.482 < 2e-16 ***
## xBB3 -7.868357 0.244465 -32.186 < 2e-16 ***
## xBB4 -1.273069 0.274611 -4.636 3.78e-06 ***
## x01 0.861475 0.050990 16.895 < 2e-16 ***
## x02 -4.136833 0.181353 -22.811 < 2e-16 ***
## x03 0.064684 0.029526 2.191 0.0286 *
## x04 0.037112 0.031071 1.194 0.2324
## x05 -0.003172 0.013071 -0.243 0.8083
## x06 -0.157523 0.239080 -0.659 0.5101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.843 on 2002 degrees of freedom
## Multiple R-squared: 0.4973, Adjusted R-squared: 0.4948
## F-statistic: 198 on 10 and 2002 DF, p-value: < 2.2e-16
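Because the discrete-only and continuous-only models are each nested within this full additive model, a partial F-test with anova() is one way to confirm that the added terms genuinely improve the fit. A minimal sketch:
# does adding the continuous inputs improve on the discrete-only model?
anova(mod_lm_discrete, mod_lm_step1)
# does adding the discrete inputs improve on the continuous-only model?
anova(mod_lm_cont, mod_lm_step1)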
Relating response_1 to a basis function

Finally, let's define a basis function of the inputs on which to fit a linear model using lm(). Starting from the most significant inputs identified when fitting all step 1 inputs, we keep xA as an additive term, define natural spline functions (df = 2) for the continuous inputs x01, x02, and x03, and interact each spline with xB:
mod_lm_basis <- lm(response_1 ~ xA + xB*(splines::ns(x01, df = 2) + splines::ns(x02, df = 2) + splines::ns(x03, df = 2)), step_1_df)
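Each ns(x, df = 2) term expands a continuous input into two spline basis columns, which is why the summary below shows two coefficients per spline (plus their interactions with xB). A quick look at one of those bases:
# the natural spline basis for x01: two columns for df = 2
splines::ns(step_1_df$x01, df = 2) %>% head()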
Summarize the model results with the summary() function.
mod_lm_basis %>% summary()
##
## Call:
## lm(formula = response_1 ~ xA + xB * (splines::ns(x01, df = 2) +
## splines::ns(x02, df = 2) + splines::ns(x03, df = 2)), data = step_1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1094 -1.1645 0.0061 1.1703 6.0096
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.06430 0.93666 47.044 < 2e-16 ***
## xAA2 0.32502 0.08148 3.989 6.88e-05 ***
## xBB2 -20.77093 1.45378 -14.288 < 2e-16 ***
## xBB3 -96.08513 1.29596 -74.142 < 2e-16 ***
## xBB4 -21.36615 1.57853 -13.535 < 2e-16 ***
## splines::ns(x01, df = 2)1 -25.01097 1.03390 -24.191 < 2e-16 ***
## splines::ns(x01, df = 2)2 15.46004 0.55859 27.677 < 2e-16 ***
## splines::ns(x02, df = 2)1 -24.71942 0.92389 -26.756 < 2e-16 ***
## splines::ns(x02, df = 2)2 -4.96759 0.58559 -8.483 < 2e-16 ***
## splines::ns(x03, df = 2)1 -23.92220 1.12445 -21.275 < 2e-16 ***
## splines::ns(x03, df = 2)2 7.48919 0.75993 9.855 < 2e-16 ***
## xBB2:splines::ns(x01, df = 2)1 25.21496 1.64141 15.362 < 2e-16 ***
## xBB3:splines::ns(x01, df = 2)1 68.74650 1.35603 50.697 < 2e-16 ***
## xBB4:splines::ns(x01, df = 2)1 7.31292 1.72922 4.229 2.45e-05 ***
## xBB2:splines::ns(x01, df = 2)2 -9.31338 0.68195 -13.657 < 2e-16 ***
## xBB3:splines::ns(x01, df = 2)2 -20.17660 0.86424 -23.346 < 2e-16 ***
## xBB4:splines::ns(x01, df = 2)2 -3.28783 0.71287 -4.612 4.24e-06 ***
## xBB2:splines::ns(x02, df = 2)1 0.50466 1.43248 0.352 0.72465
## xBB3:splines::ns(x02, df = 2)1 24.25162 1.22995 19.718 < 2e-16 ***
## xBB4:splines::ns(x02, df = 2)1 22.15519 1.47612 15.009 < 2e-16 ***
## xBB2:splines::ns(x02, df = 2)2 -0.70131 0.73326 -0.956 0.33897
## xBB3:splines::ns(x02, df = 2)2 -8.78045 0.89353 -9.827 < 2e-16 ***
## xBB4:splines::ns(x02, df = 2)2 -9.63658 0.75418 -12.778 < 2e-16 ***
## xBB2:splines::ns(x03, df = 2)1 6.86556 1.69934 4.040 5.55e-05 ***
## xBB3:splines::ns(x03, df = 2)1 62.25154 1.54186 40.374 < 2e-16 ***
## xBB4:splines::ns(x03, df = 2)1 6.23329 1.94525 3.204 0.00138 **
## xBB2:splines::ns(x03, df = 2)2 -2.67150 0.88862 -3.006 0.00268 **
## xBB3:splines::ns(x03, df = 2)2 -22.96057 1.00569 -22.831 < 2e-16 ***
## xBB4:splines::ns(x03, df = 2)2 -0.89516 0.93233 -0.960 0.33711
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.796 on 1984 degrees of freedom
## Multiple R-squared: 0.8911, Adjusted R-squared: 0.8896
## F-statistic: 579.9 on 28 and 1984 DF, p-value: < 2.2e-16
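Before declaring a winner, it is useful to put the headline fit statistics of all four models side by side. A minimal sketch, again assuming broom is available:
# collect adjusted R-squared, residual standard error, AIC and BIC per model
list(discrete = mod_lm_discrete, continuous = mod_lm_cont, step1 = mod_lm_step1, basis = mod_lm_basis) %>%
  purrr::map_dfr(broom::glance, .id = "model") %>%
  dplyr::select(model, adj.r.squared, sigma, AIC, BIC)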
Comparing the lm() models

The linear model fit using the basis function is the best of the four: it has the highest adjusted R-squared (0.8896, closest to 1 of all the models) and the lowest residual standard error (1.796).
The second best model is the linear model fit using all step 1 inputs (adjusted R-squared of 0.4948).
Load the coefplot package:
library(coefplot)
Compute confidence intervals for the best and second-best models:
mod_lm_basis_confint <- mod_lm_basis %>% confint()
mod_lm_basis_confint
## 2.5 % 97.5 %
## (Intercept) 42.2273462 45.9012456
## xAA2 0.1652203 0.4848259
## xBB2 -23.6220220 -17.9198418
## xBB3 -98.6267100 -93.5435468
## xBB4 -24.4619019 -18.2703967
## splines::ns(x01, df = 2)1 -27.0386074 -22.9833389
## splines::ns(x01, df = 2)2 14.3645420 16.5555293
## splines::ns(x02, df = 2)1 -26.5313185 -22.9075272
## splines::ns(x02, df = 2)2 -6.1160268 -3.8191522
## splines::ns(x03, df = 2)1 -26.1274276 -21.7169800
## splines::ns(x03, df = 2)2 5.9988492 8.9795390
## xBB2:splines::ns(x01, df = 2)1 21.9958858 28.4340253
## xBB3:splines::ns(x01, df = 2)1 66.0871028 71.4059006
## xBB4:splines::ns(x01, df = 2)1 3.9216464 10.7041935
## xBB2:splines::ns(x01, df = 2)2 -10.6508058 -7.9759638
## xBB3:splines::ns(x01, df = 2)2 -21.8715080 -18.4816822
## xBB4:splines::ns(x01, df = 2)2 -4.6858750 -1.8897790
## xBB2:splines::ns(x02, df = 2)1 -2.3046653 3.3139781
## xBB3:splines::ns(x02, df = 2)1 21.8395034 26.6637438
## xBB4:splines::ns(x02, df = 2)1 19.2602715 25.0501064
## xBB2:splines::ns(x02, df = 2)2 -2.1393394 0.7367230
## xBB3:splines::ns(x02, df = 2)2 -10.5328027 -7.0281022
## xBB4:splines::ns(x02, df = 2)2 -11.1156458 -8.1575067
## xBB2:splines::ns(x03, df = 2)1 3.5328891 10.1982350
## xBB3:splines::ns(x03, df = 2)1 59.2277077 65.2753713
## xBB4:splines::ns(x03, df = 2)1 2.4183397 10.0482474
## xBB2:splines::ns(x03, df = 2)2 -4.4142243 -0.9287775
## xBB3:splines::ns(x03, df = 2)2 -24.9328969 -20.9882481
## xBB4:splines::ns(x03, df = 2)2 -2.7236170 0.9332961
mod_lm_step1_confint <- mod_lm_step1 %>% confint()
mod_lm_step1_confint
## 2.5 % 97.5 %
## (Intercept) 126.237476952 234.7458861
## xAA2 0.091573496 0.7731248
## xBB2 -2.906045380 -1.9099529
## xBB3 -8.347789337 -7.3889237
## xBB4 -1.811621774 -0.7345152
## x01 0.761476901 0.9614731
## x02 -4.492493292 -3.7811717
## x03 0.006779131 0.1225894
## x04 -0.023822432 0.0980472
## x05 -0.028806049 0.0224628
## x06 -0.626394808 0.3113489
Plot model coefficients for best model:
coefplot(mod_lm_basis, innerCI = 2, pointSize = 2)
Plot model coefficients for second best model:
coefplot(mod_lm_step1, innerCI = 2, pointSize = 2)
In general, more of the basis model's coefficients are far from 0 than in the all step 1 inputs model. For example, the coefficients of terms involving x01 in the basis model are large in both the negative and positive directions, while the x01 term in the all step 1 inputs model is positive but much closer to 0. This is consistent with the much higher R-squared of the basis model. For both models, the 95% confidence intervals of the coefficients are relatively narrow.
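To make this comparison easier to see on one set of axes, the two models can also be drawn together. A minimal sketch using coefplot's multiplot() (note that the very different coefficient scales mean the step 1 terms will cluster near 0):
# overlay the coefficients of both models in a single plot
multiplot(mod_lm_step1, mod_lm_basis)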
mod_lm_basis %>% readr::write_rds("mod_lm_basis.rds")
mod_lm_step1 %>% readr::write_rds("mod_lm_step1.rds")