This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20210907) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 2ef046e. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/2021_07_27_sliced.Rmd) and HTML (docs/2021_07_27_sliced.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
Season 1 Episode 9 of #SLICED features a Major League Baseball challenge to predict whether a batter’s hit results in a home run. Each row represents a unique pitch and ball in play. The evaluation metric for submissions in this competition is classification mean logloss.
SLICED is like the TV Show Chopped but for data science. The four competitors get a never-before-seen dataset and two-hours to code a solution to a prediction challenge. Contestants get points for the best model plus bonus points for data visualization, votes from the audience, and more.
The audience is invited to participate as well. This file consists of my submissions with cleanup and commentary added.
To make the best use of the resources that we have, we will explore the data set for features to select those with the most predictive power, build a random forest to confirm the recipe, and then build one or more ensemble models. If there is time, we will craft some visuals for model explainability.
Let’s load up packages:
suppressPackageStartupMessages({
library(tidyverse) # clean and transform rectangular data
library(hrbrthemes) # plot theming
library(tidymodels) # machine learning tools
library(finetune) # racing methods for accelerating hyperparameter tuning
library(themis) # ml prep tools for handling unbalanced datasets
library(baguette) # ml tools for bagged decision tree models
library(vip) # interpret model performance
library(DALEXtra)
})
source(here::here("code","_common.R"),
verbose = FALSE,
local = knitr::knit_global())
ggplot2::theme_set(theme_jim(base_size = 12))
#create a data directory
data_dir <- here::here("data",Sys.Date())
if (!file.exists(data_dir)) dir.create(data_dir)
# set a competition metric
mset <- metric_set(mn_log_loss)
# set the competition name from the web address
competition_name <- "sliced-s01e09-playoffs-1"
zipfile <- paste0(data_dir,"/", competition_name, ".zip")
path_export <- here::here("data",Sys.Date(),paste0(competition_name,".csv"))
Get the Data
A quick reminder before downloading the dataset: Go to the web site and accept the competition terms!!!
We have basic shell commands available to interact with Kaggle here:
# from the Kaggle api https://github.com/Kaggle/kaggle-api
# the leaderboard
shell(glue::glue("kaggle competitions leaderboard { competition_name } -s"))
# the files to download
shell(glue::glue("kaggle competitions files -c { competition_name }"))
# the command to download files
shell(glue::glue("kaggle competitions download -c { competition_name } -p { data_dir }"))
# unzip the files received
shell(glue::glue("unzip { zipfile } -d { data_dir }"))
We are reading in the contents of the three datafiles here, unnesting the id_artists column, joining the artists table to each id of the artists, cleaning the genres text, and finally collapsing the genres back.
Some questions to answer here: What features have missing data, and imputations may be required? What does the outcome variable look like, in terms of imbalance?
skimr::skim(train_df)
Outcome variable is_home_run is a binary class. bb_type, launch_speed, and launch_angle are missing some data. We will take a closer look at what missingness means in this context.
Outcome Variable Distribution
summarize_is_home_run <- function(tbl) {
ret <- tbl %>%
summarize(
n_is_home_run = sum(is_home_run == "yes"),
n = n(),
.groups = "drop"
) %>%
arrange(desc(n)) %>%
mutate(
pct_is_home_run = n_is_home_run / n,
low = qbeta(.025, n_is_home_run + 5, n - n_is_home_run + .5),
high = qbeta(.975, n_is_home_run + 5, n - n_is_home_run + .5)
) %>%
mutate(pct = n / sum(n))
ret
}
train_df %>%
group_by(batter_team) %>%
summarize_is_home_run() %>%
mutate(batter_team = fct_reorder(batter_team, pct_is_home_run)) %>%
ggplot(aes(pct_is_home_run, batter_team)) +
geom_point(aes(size = pct)) +
geom_errorbarh(aes(xmin = low, xmax = high), height = .3) +
scale_size_continuous(
labels = percent,
guide = "none",
range = c(.5, 4)
) +
scale_x_continuous(labels = percent) +
labs(
x = "Proportion of at bats",
y = "",
title = "What teams get the most home runs?",
subtitle = "Including 95% intervals. Size of points is proportional to at-bat frequency in the dataset"
)
train_df %>%
group_by(name) %>%
summarize_is_home_run() %>%
mutate(name = fct_reorder(name, pct_is_home_run)) %>%
ggplot(aes(pct_is_home_run, name)) +
geom_point(aes(size = pct)) +
geom_errorbarh(aes(xmin = low, xmax = high), height = .3) +
scale_size_continuous(
labels = percent,
guide = "none",
range = c(.5, 4)
) +
scale_x_continuous(labels = percent) +
labs(
x = "Proportion of at bats",
y = "",
title = "What ballparks get the most home runs?",
subtitle = "Including 95% intervals. Size of points is proportional to at-bat frequency in the dataset"
)
train_df %>%
group_by(inning = pmin(inning, 10)) %>%
summarize_is_home_run() %>%
arrange(inning) %>%
# mutate(inning = fct_reorder(inning, -as.numeric(inning))) %>%
ggplot(aes(pct_is_home_run, inning)) +
geom_point(aes(size = pct), show.legend = FALSE) +
geom_line(orientation = "y") +
geom_ribbon(aes(xmin = low, xmax = high), alpha = .2) +
scale_x_continuous(labels = percent) +
scale_y_continuous(breaks = 1:10, labels = c(1:9, "10+")) +
labs(
x = "Proportion of at bats that are home runs",
y = "",
title = "What innings get the most home runs?",
subtitle = "Including 95% intervals. Size of points is proportional to at-bat frequency in the dataset"
) +
theme(
legend.position = c(0.8, 0.8),
legend.background = element_rect(color = "white")
)
train_df %>%
group_by(balls, strikes) %>%
summarize_is_home_run() %>%
mutate(pitch_count = paste0(strikes, "-", balls)) %>%
ggplot(aes(pct_is_home_run, pitch_count)) +
geom_point(aes(size = pct)) +
geom_errorbarh(aes(xmin = low, xmax = high), height = .3) +
scale_size_continuous(
labels = percent,
guide = "none",
range = c(1, 7)
) +
scale_x_continuous(labels = percent) +
geom_text(
data = . %>% filter(pitch_count == "2-3"),
label = "Home runs are likely with a full count",
check_overlap = TRUE,
nudge_y = -0.3
) +
geom_text(
data = . %>% filter(pitch_count == "0-3"),
label = "Home runs are likely with the batter ahead",
check_overlap = TRUE,
nudge_y = -0.3
) +
labs(
x = "Proportion of at bats",
y = "Strikes - Balls",
title = "At what levels of pitch count are there more home runs?",
subtitle = "Including 95% intervals. Size of points is proportional to at-bat frequency in the dataset"
)
train_df %>%
group_by(balls, strikes) %>%
summarize_is_home_run() %>%
ggplot(aes(balls, strikes, fill = pct_is_home_run)) +
geom_tile() +
labs(
x = "# of balls",
y = "# of strikes",
title = "Home runs are more likely with many balls, fewer strikes",
fill = "% HR"
)
train_df %>%
group_by(balls, strikes) %>%
summarize(
pct_hr = mean(is_home_run == "yes"),
avg_height = mean((plate_z), na.rm = TRUE),
avg_abs_distance_center = mean(abs(plate_x), na.rm = TRUE),
.groups = "drop"
) %>%
mutate(count = paste0(balls, "-", strikes)) %>%
ggplot(aes(avg_abs_distance_center,
avg_height,
color = pct_hr
)) +
geom_point(size = 5, shape = 20) +
scale_color_viridis_b(option = "H") +
ggrepel::geom_text_repel(aes(label = count)) +
labs(
x = "Average distance from center plate (feet)",
y = "Average height (feet)",
fill = "% home run",
subtitle = "The count affects where a pitcher throws the ball, & therefore probability of HR"
)
train_df %>%
group_by(bb_type) %>%
summarize_is_home_run() %>%
filter(!is.na(bb_type)) %>%
ggplot(aes(bb_type, pct_is_home_run)) +
geom_col() +
scale_y_continuous(labels = percent) +
labs(
y = "% home run",
subtitle = "Ground balls and pop-ups are (literally) *never* home runs. Fly balls often are"
) +
theme(panel.grid.major.x = element_blank())
train_df %>%
group_by(bearing) %>%
summarize_is_home_run() %>%
mutate(bearing = fct_relevel(bearing, "right", "center", "left")) %>%
ggplot(aes(pct_is_home_run, bearing)) +
geom_point(aes(size = pct)) +
geom_errorbarh(aes(xmin = low, xmax = high), height = .3) +
scale_size_continuous(
labels = percent,
guide = "none",
range = c(.5, 4)
) +
scale_x_continuous(labels = percent) +
labs(
x = "Proportion of at bats",
y = "",
title = "What bearings get the most home runs?",
subtitle = "Including 95% intervals. Size of points is proportional to at-bat frequency in the dataset"
) +
theme(panel.grid.major.y = element_blank())
train_df %>%
group_by(pitch_name) %>%
summarize_is_home_run() %>%
filter(n > 10) %>%
mutate(pitch_name = fct_reorder(pitch_name, pct_is_home_run)) %>%
ggplot(aes(pct_is_home_run, pitch_name)) +
geom_point(aes(size = pct)) +
geom_errorbarh(aes(xmin = low, xmax = high), height = .3) +
scale_size_continuous(
labels = percent,
guide = "none",
range = c(.5, 4)
) +
scale_x_continuous(labels = percent) +
labs(
x = "Proportion of at bats",
y = "",
title = "What pitch names get the most home runs?",
subtitle = "Including 95% intervals. Size of points is proportional to at-bat frequency in the dataset"
) +
theme(panel.grid.major.y = element_blank())
train_df %>%
count(is_home_run) %>%
ggplot(aes(n, is_home_run, fill = is_home_run)) +
geom_col(show.legend = FALSE) +
scale_fill_viridis_d(option = "H") +
labs(subtitle = "There are a lot more non-home runs in this dataset than home runs.
", fill = NULL, y = NULL)
Time series
train_df %>%
group_by(week = as.Date("2020-01-01") + lubridate::week(game_date) * 7) %>%
summarize_is_home_run() %>%
ggplot(aes(week, pct_is_home_run)) +
geom_point(aes(size = n)) +
geom_line() +
geom_ribbon(aes(ymin = low, ymax = high), alpha = .2) +
expand_limits(y = 0) +
scale_x_date(
date_labels = "%b",
date_breaks = "month",
minor_breaks = NULL
) +
scale_y_continuous(labels = percent) +
scale_size_continuous(guide = "none") +
labs(
x = NULL,
y = "% home runs",
title = "Home Runs are more common later in the season!",
subtitle = glue::glue("Ribbon shows 95% confidence bound by week for dataset spanning { min(train_df$game_date) } thru { max(train_df$game_date) }.")
)
train_df %>%
select(is_home_run, plate_x:launch_angle) %>%
pivot_longer(cols = -is_home_run, names_to = "feature", values_to = "value") %>%
ggplot(aes(value, fill = is_home_run)) +
geom_density(alpha = .5) +
scale_fill_viridis_d(option = "H") +
facet_wrap(~feature, scales = "free") +
labs(
subtitle = "There's a sweet spot of launch angle & speed where home runs happen",
fill = "Home Run"
) +
theme(legend.position = c(0.8, 0.3))
train_df %>%
group_by(
launch_angle_bucket = round(launch_angle * 2, -1) / 2,
launch_speed_bucket = round(launch_speed * 2, -1) / 2
) %>%
summarize_is_home_run() %>%
filter(n >= 30) %>%
filter(complete.cases(.)) %>%
ggplot(aes(launch_speed_bucket, launch_angle_bucket, fill = pct_is_home_run)) +
geom_tile() +
scale_fill_viridis_c(option = "H", labels = scales::percent) +
labs(
x = "Launch Speed",
y = "Launch Angle",
title = "There is a sweet spot of high speed + moderate angle",
subtitle = "Rounded to the nearest 5 on each scale; no buckets shown with <30 data points",
fill = "% HR"
)
train_df %>%
group_by(
plate_x = round(plate_x, 1),
plate_z = round(plate_z, 1)
) %>%
summarize_is_home_run() %>%
filter(n >= 30) %>%
filter(complete.cases(.)) %>%
ggplot(aes(plate_x, plate_z, z = pct_is_home_run)) +
stat_summary_hex(alpha = 0.9, bins = 10) +
scale_fill_viridis_c(option = "H", labels = scales::percent) +
geom_vline(xintercept = 0, lty = 2) +
labs(
x = "Relative position from center plate (in feet)",
y = "Distance above plate (in feet)",
title = "The best place is center plate, about 2.5-3.5 feet up",
subtitle = "Rounded to the nearest 5 on each scale; no buckets shown with <30 data points",
fill = "% HR"
)
Machine Learning: Random Forest
Let’s run models in two steps. The first is a simple, fast shallow random forest, to confirm that the model will run and observe feature importance scores. The second will use xgboost. Both use the basic recipe preprocessor for now.
We will use 5-fold cross validation and stratify on the outcome to build models that are less likely to over-fit the training data.
Proper business modeling practice would holdout a sample from training entirely for assessing model performance. I’ve made an exception here for Kaggle.
set.seed(2021)
(folds <- vfold_cv(train_df, v = 5, strata = is_home_run))
Model Specification
This first model is a bagged tree, where the number of predictors to consider for each split of a tree (i.e., mtry) equals the number of all available predictors. The min_n of 10 means that each tree branch of the 50 decision trees built have at least 10 observations. As a result, the decision trees in the ensemble all are relatively shallow.
future::plan("multisession", workers = all_cores) # on Windows
Fit and Variable Importance
Lets make a cursory check of the recipe and variable importance, which comes out of rpart for free. This workflow also handles factors without dummies.
bag_wf <-
workflow() %>%
add_recipe(basic_rec) %>%
add_model(bag_spec)
bag_fit <- parsnip::fit(bag_wf, data = train_df)
extract_fit_parsnip(bag_fit)$fit$imp %>%
mutate(term = fct_reorder(term, value)) %>%
ggplot(aes(value, term)) +
geom_point() +
geom_errorbarh(aes(
xmin = value - `std.error` / 2,
xmax = value + `std.error` / 2
),
height = .3
) +
labs(
title = "Feature Importance",
x = NULL, y = NULL
)
Wow, that’s not too shabby. Of course, this may have overfitted. Let’s bank this first submission to Kaggle as-is, and work more with xgboost to do better.
[21:03:07] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
On training data, this log loss figure is not an improvement. I am going to attempt to post this second submission to Kaggle anyway, and work more with xgboost and a more advanced recipe to do better.
Let’s use what we learned above to set a more advanced recipe. This time, let’s also try thetune_race_anova technique for skipping the parts of the grid search that do not perform well.
We can visualize how the possible parameter combinations we tried did during the “race.” Notice how we saved a TON of time by not evaluating the parameter combinations that were clearly doing poorly on all the resamples; we only kept going with the good parameter combinations.
plot_race(cv_res_xgboost)
And we can look at the top results
autoplot(cv_res_xgboost)
show_best(cv_res_xgboost)
The best here is still discouraging. This figure is likely more robust and a better estimate of performance on holdout data. Let’s fit on the entire training set at these hyperparameters to get a single performance estimate on the best model so far.
[21:03:32] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Preparation of a new explainer is initiated
-> model label : workflow ( [33m default [39m )
-> data : 46244 rows 32 cols
-> data : tibble converted into a data.frame
-> target variable : 46244 values
-> predict function : yhat.workflow will be used ( [33m default [39m )
-> predicted values : No value for predict function target column. ( [33m default [39m )
-> model_info : package tidymodels , ver. 0.1.4 , task classification ( [33m default [39m )
-> predicted values : numerical, min = 6.556511e-06 , mean = 0.05292456 , max = 0.9867029
-> residual function : difference between y and yhat ( [33m default [39m )
-> residuals : numerical, min = 0.06649031 , mean = 0.9999904 , max = 1.985391
[32m A new explainer has been created! [39m
pdp_angle <- model_profile(explainer_xgb,
N = 500,
variables = "launch_angle"
)
as_tibble(pdp_angle$agr_profiles) %>%
ggplot(aes(`_x_`, `_yhat_`)) +
geom_line(
data = as_tibble(
pdp_angle$cp_profiles
),
aes(launch_angle, group = `_ids_`),
size = 0.5, alpha = 0.1, color = "gray30"
) +
geom_line(size = 1.2, alpha = 0.8, color = "orange") +
labs(x = "Launch Angle", y = "Predicted Home Runs")
What is the aggregated effect of the launch_speed feature over 500 examples?
pdp_speed <- model_profile(explainer_xgb,
N = 500,
variables = "launch_speed"
)
as_tibble(pdp_speed$agr_profiles) %>%
ggplot(aes(`_x_`, `_yhat_`)) +
geom_line(
data = as_tibble(
pdp_speed$cp_profiles
),
aes(launch_speed, group = `_ids_`),
size = 0.5, alpha = 0.1, color = "gray30"
) +
geom_line(size = 1.2, alpha = 0.8, color = "darkblue") +
labs(x = "Launch Speed", y = "Predicted Home Runs")
We’re out of time. This will be as good as it gets. Our final submission: