Prediction of Wins in the MLB Based off Pitching Statistics
This project examines how pitching performance contributes to overall team wins in Major League Baseball (MLB). Using data from the Lahman baseball database, we created a linear regression model after conducting exploratory data analysis and model analysis to arrive at our final model. Using HA, HRA, SHO, SV, and ERA as predictors, we obtained an adjusted \(R^2\) value of 55.73% with all predictors significant. These results highlight the key pitching statistics that best explain the variation in team wins since 1980.
A team’s success in Major League Baseball (MLB) is influenced by many factors, including offense, defense, and pitching. Of these, pitching is arguably the most crucial: preventing opposing teams from scoring runs increases a team’s likelihood of winning games. A variety of pitching statistics can influence the game as a whole, and understanding these relationships provides insight into how pitching affects the number of wins a team earns.
This project examines how pitching performance contributes to a team’s success by using a multiple linear regression model to identify which pitching statistics are most strongly associated with team success. The Lahman database, created by Sean Lahman, contains complete MLB statistics dating back to 1871. It offers a long list of potential predictors, but we are only interested in the pitching statistics for each team, so we have selected the predictors stated in the “List of Variables”.
These statistics capture different aspects of pitching effectiveness. We look at seasons from 1980 to the present to limit the effect of changing rules and regulations on our data, and we remove the years 1981, 1994, 1995, and 2020 because the MLB did not play its regular 162-game season. Using the selected predictors, this project aims to predict the number of wins in an MLB season based on pitching statistics.
W - Number of Wins
ERA - Earned Run Average, the number of earned runs (runs scored without the aid of errors or passed balls) a pitching staff allows per nine innings pitched: (earned runs ÷ innings pitched) × 9
HA - Hits Allowed
BBA - Walks Allowed
SOA - Strikeouts by Pitchers
HRA - Homeruns Allowed
RA - Opponents Runs Allowed
SHO - Shutouts by Pitchers
SV - Saves. A save is awarded to the relief pitcher who finishes a game for the winning team, under certain circumstances.
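For reference, the ERA entry in the list above follows the standard per-nine-innings formula; as a worked check, a staff that allows 50 earned runs over 180 innings pitched has an ERA of 2.50:

\[\text{ERA} = 9 \times \frac{\text{earned runs}}{\text{innings pitched}} = 9 \times \frac{50}{180} = 2.50\]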
| W | HA | BBA | SOA | HRA | RA | SHO | SV | ERA |
|---|---|---|---|---|---|---|---|---|
| 65 | 1548 | 529 | 725 | 141 | 797 | 6 | 30 | 4.52 |
| 93 | 1436 | 482 | 728 | 124 | 670 | 10 | 27 | 3.82 |
| 70 | 1636 | 496 | 668 | 130 | 779 | 7 | 23 | 4.31 |
| 81 | 1526 | 474 | 754 | 143 | 697 | 12 | 26 | 3.96 |
| 90 | 1453 | 514 | 767 | 171 | 703 | 8 | 41 | 3.91 |
| 92 | 1356 | 478 | 955 | 153 | 684 | 12 | 40 | 3.84 |
| 75 | 1481 | 504 | 941 | 212 | 803 | 7 | 36 | 4.38 |
| 75 | 1503 | 568 | 817 | 135 | 771 | 9 | 33 | 4.32 |
| 91 | 1384 | 465 | 897 | 113 | 578 | 20 | 38 | 3.28 |
| 80 | 1482 | 544 | 944 | 106 | 706 | 13 | 42 | 3.79 |
W HA BBA SOA HRA
Min. : 41.00 Min. :1107 Min. :348.0 Min. : 575.0 Min. : 69
1st Qu.: 73.00 1st Qu.:1364 1st Qu.:481.0 1st Qu.: 946.8 1st Qu.:139
Median : 81.00 Median :1429 Median :524.0 Median :1070.5 Median :164
Mean : 80.95 Mean :1430 Mean :526.9 Mean :1096.9 Mean :163
3rd Qu.: 90.00 3rd Qu.:1497 3rd Qu.:569.2 3rd Qu.:1233.2 3rd Qu.:185
Max. :116.00 Max. :1734 Max. :784.0 Max. :1687.0 Max. :305
RA SHO SV ERA
Min. : 513.0 Min. : 0.000 Min. :13.00 Min. :2.800
1st Qu.: 669.0 1st Qu.: 7.000 1st Qu.:35.00 1st Qu.:3.770
Median : 725.0 Median : 9.000 Median :41.00 Median :4.110
Mean : 732.4 Mean : 9.426 Mean :40.42 Mean :4.165
3rd Qu.: 792.0 3rd Qu.:12.000 3rd Qu.:45.00 3rd Qu.:4.532
Max. :1103.0 Max. :24.000 Max. :68.00 Max. :6.380
W - This plot seems slightly skewed to the left with a median of 81 wins; the middle 50 percent of the data lies between 73 and 90 wins. There are no noticeable outliers.
ERA - This plot is slightly skewed to the right but looks relatively normal, with a median of 4.11; the middle 50 percent of the data lies between 3.77 and 4.53. There are no noticeable outliers.
HA - This plot is relatively normal with a median of 1429 hits allowed; the middle 50 percent of the data lies between 1364 and 1497 hits. There are no clear outliers, though there may be some on the left tail.
BBA - This plot is slightly skewed to the right with a median of 524 walks allowed and no noticeable outliers; the middle 50 percent of the data lies between 481 and 569 walks.
SOA - This plot is slightly skewed to the right with a median of 1070 strikeouts; the middle 50 percent of the data lies between 946 and 1233 strikeouts. There are no noticeable outliers.
HRA - This plot is relatively normal with a median of 164 home runs; the middle 50 percent of the data lies between 139 and 185. There could be one outlier above 300.
RA - This plot is slightly skewed to the right with a median of 725 runs and the middle 50 percent of the data between 669 and 792 runs. There could be one outlier above 1100 runs.
SHO - This plot is skewed to the right with a median of 9 shutouts; the middle 50 percent of the data lies between 7 and 12 shutouts. There are no noticeable outliers.
SV - This plot is very close to normally distributed with a median of 41 saves; the middle 50 percent of the data lies between 35 and 45 saves. There are no noticeable outliers.
The correlation plot reveals potential multicollinearity among the predictors. Three predictors in particular, hits allowed, runs allowed, and ERA, have concerningly high correlations with each other and with multiple other predictors, which would weaken the model.
\[y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_8 X_8 + \epsilon \]
We are using a multiple linear regression model because wins are influenced by many aspects of team performance rather than a single metric. A multiple regression model lets us consider all predictors simultaneously while controlling for each of the others. Using an equation like the one above, where each \(\beta\) is the coefficient on a predictor, we can build a model from the 8 variables that predicts the number of wins in an MLB season from pitching statistics.
Best subsets selection compares all possible models built from a given set of predictors and reports the best-fitting one. Among all predictor combinations of up to eight predictors, the selected model achieves the highest adjusted \(R^2\): it explains the most variation in team wins while dropping unnecessary predictors, and its retained predictors are highly significant with very low p-values. Low p-values together with a high adjusted \(R^2\) indicate a well-developed model, so the selected model is the statistically strongest and most efficient among all models tested.
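The idea behind best subsets can be illustrated on a small synthetic example (the variables `x1`, `x2`, `x3` below are hypothetical, not the Lahman data): enumerate every subset of candidate predictors, fit each model, and keep the one with the highest adjusted \(R^2\). Our actual analysis uses `leaps::regsubsets`, which performs this search efficiently; the base-R sketch below just shows the mechanism.

```r
# Best-subsets sketch on synthetic data (not the Lahman data).
# Only x1 and x2 truly drive y; x3 is pure noise, so the subset
# with the highest adjusted R^2 should contain x1 and x2.
set.seed(1)
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 5 + 2 * df$x1 - 3 * df$x2 + rnorm(n)

preds <- c("x1", "x2", "x3")
best  <- list(adjr2 = -Inf, terms = NULL)
for (k in seq_along(preds)) {
  for (vars in combn(preds, k, simplify = FALSE)) {
    fit <- lm(reformulate(vars, response = "y"), data = df)
    a   <- summary(fit)$adj.r.squared
    if (a > best$adjr2) best <- list(adjr2 = a, terms = vars)
  }
}
best$terms  # should include "x1" and "x2"
```

With eight candidate predictors this loop would fit \(2^8 - 1 = 255\) models, which is why `regsubsets` is the practical choice.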
Linear Assumption The Residuals vs Fitted plot scatters evenly around the red line, which stays roughly straight along 0. This suggests the model captures a linear relationship between predicted wins and the regressors used. Most of the data falls between 60 and 100 on the fitted values.
Normality Assumption In the Q-Q Residuals plot, the data closely follows the 45-degree line, with only a light tail below −2 on the theoretical quantiles. Together with the roughly normally distributed predictors, this lets us assume normality.
Equal Variance The Scale-Location plot has a red trend line that is relatively flat, so we can assume equal variance. As in the Residuals vs Fitted plot, most of the data falls between 60 and 100 on the fitted values.
Influential Points Looking at the Cook’s distance of all the points, none exceeds a value of 1, so we can assume there are no influential points in our data.
Multicollinearity Looking at the variance inflation factors (VIF), two predictors, ERA and RA, exceed the value of 10, indicating high correlation between these predictors.
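The VIF for a predictor is \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others; values above 10 are the usual rule of thumb for problematic collinearity. Our diagnostics use `car::vif`, but the definition can be checked by hand on a synthetic example (the variables below are hypothetical, not the Lahman data):

```r
# VIF by hand on synthetic data: x2 is nearly a copy of x1,
# so x1's VIF should blow up far past the rule-of-thumb cutoff of 10.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly collinear with x1
x3 <- rnorm(n)                 # independent of the others

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the rest
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # far greater than 10
```

This is exactly the pattern we see with ERA and RA, which measure nearly the same thing (runs conceded).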
Model In this model, 2 predictors, walks allowed and strikeouts by pitchers, were dropped by best subsets because they did not contribute significantly to the model. Another predictor, runs allowed, has a p-value of 0.0177, close to the 0.05 cutoff. To improve the model, we remove runs allowed because its p-value is near the significance threshold and it has a very high VIF, contributing to multicollinearity. We keep ERA because it is a significant predictor; it likely had a high VIF only because ERA and RA are extremely similar statistics, so removing one should reduce the other’s VIF.
HA HRA RA SHO SV ERA
3.803501 2.879370 39.265891 1.893144 1.277560 39.251709
Call:
lm(formula = best_formula, data = wt)
Residuals:
Min 1Q Median 3Q Max
-22.6347 -5.6025 -0.2737 5.1026 22.7107
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 69.434357 5.351809 12.974 < 2e-16 ***
HA 0.016652 0.004548 3.662 0.000262 ***
HRA 0.073904 0.011081 6.670 3.95e-11 ***
RA -0.037505 0.015786 -2.376 0.017668 *
SHO 0.425875 0.078371 5.434 6.70e-08 ***
SV 0.633586 0.033165 19.104 < 2e-16 ***
ERA -6.362470 2.554313 -2.491 0.012881 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.67 on 1169 degrees of freedom
Multiple R-squared: 0.5613, Adjusted R-squared: 0.5591
F-statistic: 249.3 on 6 and 1169 DF, p-value: < 2.2e-16
Instead of the original 8 predictors, we now run best subsets on 5 predictors to reduce multicollinearity in the model and improve the significance of the remaining predictors.
Linear Assumption The Residuals vs Fitted plot scatters around the red line a bit more than in the original model but stays roughly straight along 0, suggesting the refined model still captures a linear relationship between predicted wins and the predictors used.
Normality Assumption The data in the Q-Q Residuals plot, like the previous plot, closely follows the 45-degree line, so we can assume normality.
Equal Variance The Scale-Location plot has a red trend line that is relatively flat, similar to the first model, so we can assume equal variance.
Influential Points Looking at the Cook’s distance of all the points, none exceeds a value of 1, so we can assume there are no influential points in our data.
Multicollinearity Now no variance inflation factor exceeds 10, indicating only moderate to low correlation among the predictors.
Conclusion Even though the diagnostic plots are similar, the deciding factor was multicollinearity. After removing the 3 predictors, all of our assumptions hold, and we conclude that this multiple linear regression model of MLB pitching statistics is suitable for predicting the number of wins in a season.
HA HRA SHO SV ERA
3.530562 2.835924 1.827278 1.248706 7.134164
Call:
lm(formula = best_formula1, data = wt)
Residuals:
Min 1Q Median 3Q Max
-22.7876 -5.6012 -0.2292 5.0790 23.1541
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.63393 5.30839 12.741 < 2e-16 ***
HA 0.01376 0.00439 3.134 0.00177 **
HRA 0.07714 0.01102 7.001 4.28e-12 ***
SHO 0.46061 0.07715 5.970 3.13e-09 ***
SV 0.64543 0.03285 19.646 < 2e-16 ***
ERA -11.85205 1.09113 -10.862 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.685 on 1170 degrees of freedom
Multiple R-squared: 0.5592, Adjusted R-squared: 0.5573
F-statistic: 296.8 on 5 and 1170 DF, p-value: < 2.2e-16
The final model predicts the number of wins an MLB team has in a season using hits allowed, home runs allowed, shutouts by pitchers, saves, and earned run average (ERA), with an adjusted \(R^2\) value of 0.5573. That is, 55.73% of the variability in wins is explained by these predictors after taking model complexity into account.
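As a consistency check on the reported values, the adjusted \(R^2\) follows from the multiple \(R^2\) of 0.5592, the sample size \(n = 1176\) team-seasons (1170 residual degrees of freedom plus 6 estimated coefficients), and \(p = 5\) predictors:

\[\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} = 1 - (1 - 0.5592) \times \frac{1175}{1170} \approx 0.5573\]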
The final equation to predict the number of wins of an MLB team is:
\(\widehat{\text{Wins}}\) = 67.634 + 0.0138(HA) + 0.0771(HRA) + 0.461(SHO) + 0.645(SV) - 11.852(ERA)
For each additional hit allowed by pitchers, the expected number of wins increases by 0.0138, holding all other predictors constant.
For each additional home run allowed by pitchers, the expected number of wins increases by 0.0771, holding all other predictors constant.
For each additional shutout thrown by pitchers, the expected number of wins increases by 0.461, holding all other predictors constant.
For each additional save recorded by pitchers, the expected number of wins increases by 0.645, holding all other predictors constant.
For each one-unit increase in ERA, the expected number of wins decreases by 11.852, holding all other predictors constant.
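Plugging the league-average values from the summary statistics (HA ≈ 1430, HRA ≈ 163, SHO ≈ 9.426, SV ≈ 40.42, ERA ≈ 4.165) into the fitted equation is a useful sanity check: the model should return roughly the mean win total of about 81.

```r
# Sanity check: the fitted equation evaluated at the predictor means
# should land near the mean win total (~80.95).
coefs <- c(intercept = 67.634, HA = 0.0138, HRA = 0.0771,
           SHO = 0.461, SV = 0.645, ERA = -11.852)
means <- c(intercept = 1, HA = 1430, HRA = 163,
           SHO = 9.426, SV = 40.42, ERA = 4.165)
pred_wins <- sum(coefs * means)
pred_wins  # approximately 81
```

This is expected behavior for a least-squares fit, which passes through the mean of the data, so it confirms the coefficients were transcribed correctly.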
The goal of this model was to accurately predict the number of wins of MLB teams based on pitching metrics. One predictor to note is ERA, which dominates the model. ERA is a very important pitching statistic that is commonly discussed by professionals, and its large negative coefficient reinforces the idea that run prevention is essential to winning games in the MLB. Also notable is that hits allowed and home runs allowed both receive positive coefficients, even though these statistics are normally seen as hurting a team's chances. Indeed, scatter plots of the two statistics against wins show that as home runs allowed and hits allowed increase, the number of wins decreases. In the model, however, both are statistically significant with positive signs once the other predictors are controlled for, and removing them would introduce omitted-variable bias, much like removing related pitching data.
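The sign flip for hits and home runs allowed is a classic suppression effect: a variable can correlate negatively with the response on its own yet receive a positive coefficient once a stronger, overlapping predictor (here, ERA) is held fixed. A small simulation with hypothetical variables illustrates the pattern:

```r
# Suppression sketch: 'hra' is bad for wins on its own (negative
# marginal correlation) but gets a positive coefficient once the
# dominant predictor 'era' is held fixed.
set.seed(7)
n   <- 500
era <- rnorm(n)
hra <- era + rnorm(n, sd = 0.5)  # hra overlaps heavily with era
w   <- -2 * era + 0.5 * hra + rnorm(n)

cor(w, hra)                      # negative: marginally, hra hurts
coef(lm(w ~ era + hra))["hra"]   # positive once era is controlled
```

This is a sketch of the statistical phenomenon, not a claim about the true data-generating process for MLB wins.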
The following are some of the limitations of this model:
The adjusted \(R^2\) value of 0.5573 is only moderate, accounting for 55.73% of the variability after adjusting for model complexity; it could be higher.
There may be additional pitching statistics, or interactions between them, that the model does not capture.
The data set spans 1980 to the present, so the way pitching affects wins may have shifted over the years.
There are many other factors that could impact winning games in the MLB, such as each team's batting and fielding.
It would be interesting to examine the effect of all baseball metrics on winning, not just pitching, to see whether a more accurate model exists.
The final regression model provides an explanation of how the pitching performance of MLB teams from 1980 to the present relates to the number of regular-season wins. Hits allowed, home runs allowed, shutouts, saves, and earned run average are all significant predictors in this model, showing that run prevention is an important driver of winning games. However, with hits allowed and home runs allowed carrying positive coefficients, good batting and fielding clearly matter as well. Overall, the model explains over half of the variation, with an adjusted \(R^2\) of 55.73% after accounting for model complexity, demonstrating that pitching success helps MLB teams win more games. In conclusion, this model does help predict the number of wins in an MLB season using pitching statistics alone.
Lahman, Sean, et al. Lahman: Sean Lahman Baseball Database. R package version 14.0, 2024.
OpenAI. ChatGPT, version 5.1, OpenAI, 2025, https://chat.openai.com/
ChatGPT assisted with code writing and formatting.
Some of the packages used to create this project include:
flexdashboard: to build an interactive dashboard
plotly: to create interactive graphs
Lahman: provides the full Sean Lahman Baseball Dataset in R
tidyverse: a collection of packages such as dplyr and ggplot2, etc. to help with data cleaning and visualization
pacman: a package that loads and installs packages automatically
car: a package used for regression diagnostic tools
MASS: a package used for model building
leaps: a package used for model building
corrplot: a package that helps create correlation matrices
knitr: a package used to create tables
---
title: "Prediction of Wins in the MLB"
output:
flexdashboard::flex_dashboard:
theme:
version: 4
bootswatch: minty
orientation: columns
vertical_layout: fill
source_code: embed
target: blank
---
```{r setup}
library(flexdashboard)
library(plotly)
library("Lahman")
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("remove", "dplyr")
conflicts_prefer(dplyr::filter)
library(tidyverse)
library(pacman)
library(car)
library(MASS)
library(leaps)
library(corrplot)
library(knitr)
data("Teams")
wt <- Teams %>%
filter(yearID >= 1980, yearID != 2020, yearID != 1981, yearID != 1994, yearID != 1995)
```
Introduction
===
Column {data-width=400}
---
### Title
**Prediction of Wins in the MLB Based off Pitching Statistics**
##### Abstract
This project examines how pitching performance contributes to overall team wins in Major League Baseball (MLB). Using data from the Lahman baseball database, we created a linear regression model after conducting exploratory data analysis and model analysis to arrive at our final model. Using HA, HRA, SHO, SV, and ERA as predictors, we obtained an adjusted $R^2$ value of 55.73% with all predictors significant. These results highlight the key pitching statistics that best explain the variation in team wins since 1980.
### Introduction
A team's success in Major League Baseball (MLB) is influenced by many factors, including offense, defense, and pitching. Of these, pitching is arguably the most crucial: preventing opposing teams from scoring runs increases a team's likelihood of winning games. A variety of pitching statistics can influence the game as a whole, and understanding these relationships provides insight into how pitching affects the number of wins a team earns.
This project examines how pitching performance contributes to a team's success by using a multiple linear regression model to identify which pitching statistics are most strongly associated with team success. The Lahman database, created by Sean Lahman, contains complete MLB statistics dating back to 1871. It offers a long list of potential predictors, but we are only interested in the pitching statistics for each team, so we have selected the predictors stated in the "List of Variables".
These statistics capture different aspects of pitching effectiveness. We look at seasons from 1980 to the present to limit the effect of changing rules and regulations on our data, and we remove the years 1981, 1994, 1995, and 2020 because the MLB did not play its regular 162-game season. Using the selected predictors, this project aims to predict the number of wins in an MLB season based on pitching statistics.
Column {.tabset data-width=600}
---
### List of Variables
**W** - Number of Wins
**ERA** - Earned Run Average, the number of earned runs (runs scored without the aid of errors or passed balls) a pitching staff allows per nine innings pitched: (earned runs ÷ innings pitched) × 9
**HA** - Hits Allowed
**BBA** - Walks Allowed
**SOA** - Strikeouts by Pitchers
**HRA** - Homeruns Allowed
**RA** - Opponents Runs Allowed
**SHO** - Shutouts by Pitchers
**SV** - Saves. A save is awarded to the relief pitcher who finishes a game for the winning team, under certain circumstances.
### Relation to Each Other
```{r pairs}
# sample 50 team-seasons so the pairs plot stays readable
index <- sample(1:nrow(wt), 50)
wt_cor <- wt[index, ]
pairs(~W + HA + BBA + SOA + HRA + RA + SHO + SV + ERA, data = wt_cor, col = "darkgreen")
```
### Glimpse of Dataset
```{r glimpse}
wt <- wt %>%
select(W, HA, BBA, SOA, HRA, RA, SHO, SV, ERA)
knitr::kable(wt[1:10, ])
```
Exploratory Data Analysis
===
Column {.tabset data-width=600}
---
### W
```{r W}
wt %>%
ggplot(aes(x=W)) + geom_histogram(fill = "#A8E6CF", color = "black",binwidth = 5)
```
### ERA
```{r ERA}
wt %>%
ggplot(aes(x=ERA)) + geom_histogram(fill = "#DCE775", color = "black", binwidth = .2)
```
### HA
```{r HA}
wt %>%
ggplot(aes(x=HA)) + geom_histogram(fill = "#A8E6CF", color = "black")
```
### BBA
```{r BBA}
wt %>%
ggplot(aes(x=BBA)) + geom_histogram(fill = "#DCE775", color = "black")
```
### SOA
```{r SOA}
wt %>%
ggplot(aes(x=SOA)) + geom_histogram(fill = "#A8E6CF", color = "black")
```
### HRA
```{r HRA}
wt %>%
ggplot(aes(x=HRA)) + geom_histogram(fill = "#DCE775", color = "black")
```
### RA
```{r RA}
wt %>%
ggplot(aes(x=RA)) + geom_histogram(fill = "#A8E6CF", color = "black")
```
### SHO
```{r SHO}
wt %>%
ggplot(aes(x=SHO)) + geom_histogram(fill = "#DCE775", color = "black", binwidth = 1)
```
### SV
```{r SV}
wt %>%
ggplot(aes(x=SV)) + geom_histogram(fill = "#A8E6CF", color = "black", binwidth = 2)
```
### Correlation
```{r cor}
ccc <- cor(wt)
corrplot.mixed(ccc, number.cex = .5)
```
### Summary Statistics
```{r stats}
summary(wt)
```
Column {.tabset data-width=400}
---
### Analysis
**W** - This plot seems slightly skewed to the left with a median of 81 wins; the middle 50 percent of the data lies between 73 and 90 wins. There are no noticeable outliers.
**ERA** - This plot is slightly skewed to the right but looks relatively normal, with a median of 4.11; the middle 50 percent of the data lies between 3.77 and 4.53. There are no noticeable outliers.
**HA** - This plot is relatively normal with a median of 1429 hits allowed; the middle 50 percent of the data lies between 1364 and 1497 hits. There are no clear outliers, though there may be some on the left tail.
**BBA** - This plot is slightly skewed to the right with a median of 524 walks allowed and no noticeable outliers; the middle 50 percent of the data lies between 481 and 569 walks.
**SOA** - This plot is slightly skewed to the right with a median of 1070 strikeouts; the middle 50 percent of the data lies between 946 and 1233 strikeouts. There are no noticeable outliers.
**HRA** - This plot is relatively normal with a median of 164 home runs; the middle 50 percent of the data lies between 139 and 185. There could be one outlier above 300.
**RA** - This plot is slightly skewed to the right with a median of 725 runs and the middle 50 percent of the data between 669 and 792 runs. There could be one outlier above 1100 runs.
**SHO** - This plot is skewed to the right with a median of 9 shutouts; the middle 50 percent of the data lies between 7 and 12 shutouts. There are no noticeable outliers.
**SV** - This plot is very close to normally distributed with a median of 41 saves; the middle 50 percent of the data lies between 35 and 45 saves. There are no noticeable outliers.
The correlation plot reveals potential multicollinearity among the predictors. Three predictors in particular, hits allowed, runs allowed, and ERA, have concerningly high correlations with each other and with multiple other predictors, which would weaken the model.
Model Analysis
===
Column {data-width=400}
---
### Methods
\[y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_8 X_8 + \epsilon \]
We are using a multiple linear regression model because wins are influenced by many aspects of team performance rather than a single metric. A multiple regression model lets us consider all predictors simultaneously while controlling for each of the others. Using an equation like the one above, where each $\beta$ is the coefficient on a predictor, we can build a model from the 8 variables that predicts the number of wins in an MLB season from pitching statistics.
Best subsets selection compares all possible models built from a given set of predictors and reports the best-fitting one. Among all predictor combinations of up to eight predictors, the selected model achieves the highest adjusted $R^2$: it explains the most variation in team wins while dropping unnecessary predictors, and its retained predictors are highly significant with very low p-values. Low p-values together with a high adjusted $R^2$ indicate a well-developed model, so the selected model is the statistically strongest and most efficient among all models tested.
### Assumptions
**Linear Assumption**
The Residuals vs Fitted plot scatters evenly around the red line, which stays roughly straight along 0. This suggests the model captures a linear relationship between predicted wins and the regressors used. Most of the data falls between 60 and 100 on the fitted values.
**Normality Assumption**
In the Q-Q Residuals plot, the data closely follows the 45-degree line, with only a light tail below −2 on the theoretical quantiles. Together with the roughly normally distributed predictors, this lets us assume normality.
**Equal Variance**
The Scale-Location plot has a red trend line that is relatively flat, so we can assume equal variance. As in the Residuals vs Fitted plot, most of the data falls between 60 and 100 on the fitted values.
**Influential Points**
Looking at the Cook's distance of all the points, none exceeds a value of 1, so we can assume there are no influential points in our data.
**Multicollinearity**
Looking at the variance inflation factors (VIF), two predictors, ERA and RA, exceed the value of 10, indicating high correlation between these predictors.
**Model**
In this model, 2 predictors, walks allowed and strikeouts by pitchers, were dropped by best subsets because they did not contribute significantly to the model. Another predictor, runs allowed, has a p-value of 0.0177, close to the 0.05 cutoff. To improve the model, we remove runs allowed because its p-value is near the significance threshold and it has a very high VIF, contributing to multicollinearity. We keep ERA because it is a significant predictor; it likely had a high VIF only because ERA and RA are extremely similar statistics, so removing one should reduce the other's VIF.
Column {.tabset data-width=600}
---
```{r model}
fit.subsets <- regsubsets(W ~ ., data = wt, nvmax = 8)
subs_sum <- summary(fit.subsets)
best_size <- which.max(subs_sum$adjr2)
best_terms <- names(coef(fit.subsets, best_size))[-1]
best_formula <- as.formula(paste("W ~", paste(best_terms, collapse = " + ")))
fit.best <- lm(best_formula, data = wt)
```
### Linear Assumption
```{r la}
plot(fit.best,1, col = "#19C2BD")
```
### Normality Assumption
```{r nor}
plot(fit.best,2, col = "#3EB489")
```
### Equal Variance
```{r ev}
plot(fit.best,3, col = "#19C2BD")
```
### Influential Points
```{r ip}
plot(fit.best,4, col ="#3EB489")
```
### Multicollinearity
```{r mult}
vif(fit.best)
```
### Model
```{r mod1}
summary(fit.best)
```
Model Refinement
===
Column {data-width=400}
---
Instead of the original 8 predictors, we now run best subsets on 5 predictors to reduce multicollinearity in the model and improve the significance of the remaining predictors.
**Linear Assumption**
The Residuals vs Fitted plot scatters around the red line a bit more than in the original model but stays roughly straight along 0, suggesting the refined model still captures a linear relationship between predicted wins and the predictors used.
**Normality Assumption**
The data in the Q-Q Residuals plot, like the previous plot, closely follows the 45-degree line, so we can assume normality.
**Equal Variance**
The Scale-Location plot has a red trend line that is relatively flat, similar to the first model, so we can assume equal variance.
**Influential Points**
Looking at the Cook's distance of all the points, none exceeds a value of 1, so we can assume there are no influential points in our data.
**Multicollinearity**
Now no variance inflation factor exceeds 10, indicating only moderate to low correlation among the predictors.
**Conclusion**
Even though the diagnostic plots are similar, the deciding factor was multicollinearity. After removing the 3 predictors, all of our assumptions hold, and we conclude that this multiple linear regression model of MLB pitching statistics is suitable for predicting the number of wins in a season.
```{r model1}
wt <- wt %>%
select(W, HA, HRA, SHO, SV, ERA)
fit.subsets1 <- regsubsets(W ~ ., data = wt, nvmax = 5)  # only 5 candidate predictors remain
subs_sum1 <- summary(fit.subsets1)
best_size1 <- which.max(subs_sum1$adjr2)
best_terms1 <- names(coef(fit.subsets1, best_size1))[-1]
best_formula1 <- as.formula(paste("W ~", paste(best_terms1, collapse = " + ")))
fit.best1 <- lm(best_formula1, data = wt)
```
Column {.tabset data-width=600}
---
### Linear Assumption
```{r la1}
plot(fit.best1,1, col ="#19C2BD")
```
### Normality Assumption
```{r nor1}
plot(fit.best1,2, col = "#3EB489")
```
### Equal Variance
```{r ev1}
plot(fit.best1,3, col = "#19C2BD")
```
### Influential Points
```{r ip1}
plot(fit.best1,4, col = "#3EB489")
```
### Multicollinearity
```{r mult1}
vif(fit.best1)
```
### Model 2
```{r sum}
summary(fit.best1)
```
Results
===
Column {.tabset data-width=400}
---
### Summary of Model
```{r sum1}
summary(fit.best1)
```
### Hits Allowed vs Wins
```{r havw}
wt %>%
ggplot(aes(x=HA, y=W)) + geom_point(col = "darkgreen") + geom_smooth(method = "lm", se = FALSE, color = "green")
```
### Homeruns Allowed vs Wins
```{r hravw}
wt %>%
ggplot(aes(x=HRA, y=W)) + geom_point(col = "darkgreen") + geom_smooth(method = "lm", se = FALSE, color = "green")
```
Column {.tabset data-width=600}
---
### Discussion
The final model predicts the number of wins an MLB team has in a season using hits allowed, home runs allowed, shutouts by pitchers, saves, and earned run average (ERA), with an adjusted $R^2$ value of 0.5573. That is, 55.73% of the variability in wins is explained by these predictors after taking model complexity into account.
The final equation to predict the number of wins of a MLB baseball team is:
$\widehat{\text{Wins}}$ = 67.634 + 0.0138(HA) + 0.0771(HRA) + 0.461(SHO) + 0.645(SV) - 11.852(ERA)
For each additional hit allowed by pitchers, the expected number of wins increases by 0.0138, holding all other predictors constant.
For each additional home run allowed by pitchers, the expected number of wins increases by 0.0771, holding all other predictors constant.
For each additional shutout thrown by pitchers, the expected number of wins increases by 0.461, holding all other predictors constant.
For each additional save recorded by pitchers, the expected number of wins increases by 0.645, holding all other predictors constant.
For each one-unit increase in ERA, the expected number of wins decreases by 11.852, holding all other predictors constant.
The goal of this model was to accurately predict the number of wins of MLB teams based on pitching metrics. One predictor to note is ERA, which dominates the model. ERA is a very important pitching statistic that is commonly discussed by professionals, and its large negative coefficient reinforces the idea that run prevention is essential to winning games in the MLB. Also notable is that hits allowed and home runs allowed both receive positive coefficients, even though these statistics are normally seen as hurting a team's chances. Indeed, scatter plots of the two statistics against wins show that as home runs allowed and hits allowed increase, the number of wins decreases. In the model, however, both are statistically significant with positive signs once the other predictors are controlled for, and removing them would introduce omitted-variable bias, much like removing related pitching data.
### Limitations
The following are some of the limitations of this model:
* The adjusted $R^2$ value of 0.5573 is only moderate, accounting for 55.73% of the variability after adjusting for model complexity; it could be higher.
* There may be additional pitching statistics, or interactions between them, that the model does not capture.
* The data set spans 1980 to the present, so the way pitching affects wins may have shifted over the years.
* There are many other factors that could impact winning games in the MLB, such as each team's batting and fielding.
It would be interesting to examine the effect of all baseball metrics on winning, not just pitching, to see whether a more accurate model exists.
### Conclusion
The final regression model provides an explanation of how the pitching performance of MLB teams from 1980 to the present relates to the number of regular-season wins. Hits allowed, home runs allowed, shutouts, saves, and earned run average are all significant predictors in this model, showing that run prevention is an important driver of winning games. However, with hits allowed and home runs allowed carrying positive coefficients, good batting and fielding clearly matter as well. Overall, the model explains over half of the variation, with an adjusted $R^2$ of 55.73% after accounting for model complexity, demonstrating that pitching success helps MLB teams win more games. In conclusion, this model does help predict the number of wins in an MLB season using pitching statistics alone.
### About the Author/Reasoning
My name is Evan McClelland and I am a senior Mechanical Engineering student at the University of Dayton with a minor in Data Analytics.
I am doing this project because I love sports and looking at the statistics of all the players and teams. I have the opportunity to do statistics for the University of Dayton baseball team, and this is good practice for working with baseball data and getting familiar with it.
Connect with me on [LinkedIn](https://www.linkedin.com/in/evanmcclelland3/)
### Citations
Lahman, Sean, et al. Lahman: Sean Lahman Baseball Database. R package version 14.0, 2024.
OpenAI. ChatGPT, version 5.1, OpenAI, 2025, https://chat.openai.com/
ChatGPT assisted with code writing and formatting.
Some of the packages used to create this project include:
* flexdashboard: to build an interactive dashboard
* plotly: to create interactive graphs
* Lahman: provides the full Sean Lahman Baseball Dataset in R
* tidyverse: a collection of packages such as dplyr and ggplot2, etc. to help with data cleaning and visualization
* pacman: a package that loads and installs packages automatically
* car: a package used for regression diagnostic tools
* MASS: a package used for model building
* leaps: a package used for model building
* corrplot: a package that helps create correlation matrices
* knitr: a package used to create tables