Last updated: 2021-09-14

Checks: 7 0

Knit directory: myTidyTuesday/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20210907)

The command set.seed(20210907) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 71e3ad7

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 71e3ad7. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    catboost_info/
    Ignored:    data/2021-09-08/
    Ignored:    data/acs_poverty.rds
    Ignored:    data/fmhpi.rds
    Ignored:    data/grainstocks.rds
    Ignored:    data/hike_data.rds
    Ignored:    data/us_states.rds
    Ignored:    data/us_states_hexgrid.geojson
    Ignored:    data/weatherstats_toronto_daily.csv

Untracked files:
    Untracked:  code/work list batch targets.R
    Untracked:  figure/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/HimalayanClimbers.Rmd) and HTML (docs/HimalayanClimbers.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	71e3ad7	opus1993	2021-09-14	swap element_markdown subtitle colors in top20 peaks chart
html	e164a8f	opus1993	2021-09-14	Build site.
Rmd	d9bc00c	opus1993	2021-09-14	clean the confusion matrix graphics
html	0947ff3	opus1993	2021-09-13	Build site.
Rmd	a7d52ca	opus1993	2021-09-13	update to new autoplot vis for confusion matrices

This week’s data comes from the Himalayan Database, a compilation of records for all expeditions that have climbed in the Nepal Himalaya mountain chain. The data covers all expeditions from 1905 thru the spring of 2019 for more than 465 significant peaks in Nepal.

Let’s first explore the details of the mountain peaks

peaks %>%
  arrange(desc(height_meters)) %>%
  head(20) %>%
  mutate(peak_name = fct_reorder(peak_name, height_meters)) %>%
  ggplot(aes(height_meters, peak_name, fill = climbing_status)) +
  geom_col(show.legend = FALSE) +
  labs(
    x = "Height (meters)",
    y = NULL,
    title = "Top 20 Himalayan Peaks",
    subtitle = "With both <span style = 'color:#7A0403FF;'>Climbed</span>
and <span style = 'color:#F1CA3AFF;'>Unclimbed</span> summits.",
    fill = ""
  ) +
  theme(
    plot.subtitle = ggtext::element_markdown(
      size = 11, lineheight = 1.2,
      face = "bold"
    ),
    panel.grid.major.y = element_blank()
  )

Let’s tidy some of the expeditions dataset

na_reasons <- c(
  "Unknown",
  "Attempt rumoured",
  "Did not attempt climb",
  "Did not reach base camp"
)

expeditions <- tt$expeditions %>%
  mutate(
    time_to_highpoint = highpoint_date -
      basecamp_date,
    success = case_when(
      str_detect(termination_reason, "Success") ~ "Success",
      termination_reason %in% na_reasons ~ "Other",
      TRUE ~ "Failure"
    ),
    days_to_highpoint = as.integer(highpoint_date - basecamp_date)
  )

Other questions that could be posed and possible answered with this dataset:

Fraction of successful climbs per year
Rate of death over time, deadliest peaks, by mountain
Death causes and rate of injury
Distribution of duration of climbs versus the height or time
Correlation between frequency of expeditions and death rate
Which are the most dangerous peaks that have been climbed?

summarize_expeditions <- function(tbl) {
  tbl %>%
    summarize(
      first_climb = min(year),
      success_rate = mean(success == "Success"),
      n_climbs = n(),
      across(members:hired_staff_deaths, sum),
      .groups = "drop"
    ) %>%
    mutate(
      pct_death = member_deaths / members,
      pct_hired_staff_deaths = hired_staff_deaths / hired_staff
    )
}


peak_summarized <- expeditions %>%
  group_by(peak_id, peak_name) %>%
  summarize_expeditions() %>%
  ungroup() %>%
  arrange(desc(n_climbs)) %>%
  inner_join(peaks %>% select(peak_id, height_meters), by = "peak_id")

What are the deadliest mountains?

Answering the question directly from the proportions may be misleading because some mountains have so few climbers.

Given the limited amount of data available on some mountains, the death rate is adjusted by adding the available priors from all climbing. The model could be further refined by regressing on information about oxygen use or other features.

For further reading: Introduction to Empirical Bayes: Examples from Baseball Statistics

peaks_eb <- peak_summarized %>%
  filter(members >= 20) %>%
  arrange(desc(pct_death)) %>%
  add_ebb_estimate(member_deaths, members)

peaks_eb %>%
  ggplot(aes(pct_death, .fitted)) +
  geom_point(aes(size = members)) +
  geom_abline(color = "red") +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_color_viridis_c(trans = "log10") +
  labs(
    x = "Death Rate (raw)",
    y = "Adjusted Estimate of Death Rate using empirical Bayes",
    title = "How much adjustment compression using priors?"
  ) +
  theme(legend.position = c(0.8, 0.2))

peaks_eb %>%
  filter(members >= 500) %>%
  arrange(desc(.fitted)) %>%
  mutate(peak_name = fct_reorder(peak_name, .fitted)) %>%
  ggplot(aes(.fitted, peak_name)) +
  geom_point(aes(size = members)) +
  geom_errorbarh(aes(xmin = .low, xmax = .high), color = "gray") +
  expand_limits(x = 0) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    subtitle = "Estimate of Death Rate using empirical Bayes with 95% credible interval",
    y = "", x = "",
    title = "Deadliest Himalayan Peaks with more than 500 climbers",
    caption = "Visual: @jim_gruman, Data: the Himalayan Database"
  ) +
  theme(
    panel.grid.major.y = element_blank(),
    legend.position = c(0.8, 0.2)
  )

How does death rate relate to elevation?

peaks_eb %>%
  filter(members >= 500) %>%
  ggplot(aes(height_meters, .fitted)) +
  geom_point(aes(size = members)) +
  labs(title = "There does not appear to be a relationship between death rate and height")

What is the distance between the base camp and the summit?

expeditions %>%
  filter(
    success == "Success",
    !is.na(days_to_highpoint),
    !is.na(peak_name)
  ) %>%
  mutate(
    peak_name = fct_lump(peak_name, 10),
    peak_name = fct_reorder(peak_name, days_to_highpoint, mean)
  ) %>%
  ggplot(aes(days_to_highpoint, peak_name)) +
  geom_boxplot() +
  labs(
    title = "Duration to successfully summit the most popular Himalayan peaks",
    caption = "Visual: @jim_gruman, Data:  the Himalayan Database",
    x = "Days from basecamp to summit",
    y = ""
  ) +
  theme(panel.grid.major.y = element_blank())

Explore Mount Everest more closely

expeditions %>%
  filter(peak_name == "Everest") %>%
  ggplot(aes(days_to_highpoint, fill = success)) +
  geom_density(alpha = 0.3) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    title = "Duration to successfully summit Everest",
    caption = "Visual: @jim_gruman, Data:  the Himalayan Database",
    x = "Days from basecamp to summit",
    y = "Percentage",
    fill = ""
  ) +
  theme(
    panel.grid.major.y = element_blank(),
    legend.position = c(0.6, 0.5),
    legend.background = element_rect(color = "white")
  )

When have the Everest expeditions been successful?

expeditions %>%
  ggplot(aes(year, fill = success)) +
  geom_histogram(alpha = 0.3, bins = 40) +
  labs(
    title = "Year of Everest climbs",
    caption = "Visual: @jim_gruman, Data:  the Himalayan Database",
    x = "Year",
    y = "Climbers",
    fill = ""
  ) +
  theme(
    panel.grid.major.y = element_blank(),
    legend.position = c(0.2, 0.5),
    legend.background = element_rect(color = "white")
  )

What is the median amount of time to climb?

everest_by_decade <- expeditions %>%
  filter(peak_name == "Everest") %>%
  mutate(decade = pmax(10 * (year %/% 10), 1970)) %>%
  # limp prior to 1970
  group_by(decade, peak_name) %>%
  summarize_expeditions()

everest_by_decade %>%
  ggplot(aes(decade, pct_death)) +
  geom_line(aes(color = "All climbers")) +
  geom_line(aes(y = pct_hired_staff_deaths, color = "Hired Staff")) +
  geom_point(aes(color = "All climbers", size = members)) +
  geom_point(aes(y = pct_hired_staff_deaths, color = "Hired Staff", size = hired_staff)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = .1)) +
  scale_x_continuous(
    breaks = seq(1970, 2010, 10),
    labels = c("<= 1980", seq(1980, 2010, 10))
  ) +
  expand_limits(y = 0) +
  labs(
    x = "Decade", color = "",
    y = "Death Rate",
    title = "Everest has been getting less deadly over time",
    subtitle = "Though trends have reversed for hired staff",
    size = "Number of climbers", caption = "Visual: @jim_gruman, Data:  the Himalayan Database"
  ) +
  theme(
    legend.position = c(0.2, 0.4),
    legend.background = element_rect(color = "white")
  )

everest_by_decade %>%
  ggplot(aes(decade, success_rate)) +
  geom_line() +
  geom_point(aes(size = n_climbs)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(
    breaks = seq(1970, 2010, 10),
    labels = c("<= 1980", seq(1980, 2010, 10))
  ) +
  expand_limits(y = 0) +
  labs(
    x = "Decade",
    y = "Success Rate",
    title = "The Everest summit success rate has been increasing",
    size = "Number of climbs", caption = "Visual: @jim_gruman, Data:  the Himalayan Database"
  ) +
  theme(
    legend.position = c(0.7, 0.3),
    legend.background = element_rect(color = "white")
  )

Examine the death probability per expedition member. What causes climber fatalities?

members <- tt$members

members %>%
  count(died, death_cause) %>%
  knitr::kable()

died	death_cause	n
FALSE	NA	75413
TRUE	AMS	102
TRUE	Avalanche	369
TRUE	Crevasse	27
TRUE	Disappearance (unexplained)	49
TRUE	Exhaustion	41
TRUE	Exposure / frostbite	42
TRUE	Fall	331
TRUE	Falling rock / ice	26
TRUE	Icefall collapse	16
TRUE	Illness (non-AMS)	60
TRUE	Other	33
TRUE	Unknown	10

members %>%
  count(sex) %>%
  knitr::kable()

sex	n
F	7044
M	69473
NA	2

For Everest, what are the significant features in predicting death?

members %>%
  filter(peak_name == "Everest") %>%
  mutate(expedition_role = fct_lump(expedition_role, 5)) %>%
  glm(died ~ year + age + sex + expedition_role,
    data = .,
    family = "binomial"
  ) %>%
  tidy() %>%
  mutate(p.value = format.pval(p.value)) %>%
  knitr::kable()

term	estimate	std.error	statistic	p.value
(Intercept)	49.8368629	7.0523777	7.0666752	1.5869e-12
year	-0.0276627	0.0035435	-7.8066459	5.8730e-15
age	0.0182624	0.0066097	2.7629740	0.0057277
sexM	0.3353777	0.2907061	1.1536656	0.2486373
expedition_roleDeputy Leader	-0.0639574	0.5124297	-0.1248120	0.9006724
expedition_roleExp Doctor	-0.1399287	0.4634403	-0.3019346	0.7627019
expedition_roleH-A Worker	0.3469201	0.1625223	2.1346007	0.0327936
expedition_roleLeader	0.6033960	0.1852958	3.2563929	0.0011284
expedition_roleOther	0.0700387	0.2232777	0.3136842	0.7537609

The significant features include the year (lowers likelihood), the leader role (more likely to die), age (more likely to die), and the H-Aworker (much more likely to die)

members %>%
  filter(peak_name == "Everest", !is.na(age)) %>%
  group_by(age = 10 * (age %/% 10)) %>%
  summarize(
    n_climbers = n(),
    pct_death = mean(died)
  ) %>%
  knitr::kable()

age	n_climbers	pct_death
10	263	0.0076046
20	5258	0.0125523
30	8079	0.0120064
40	5001	0.0115977
50	1897	0.0147601
60	446	0.0269058
70	48	0.0000000
80	5	0.4000000

members %>%
  filter(peak_name == "Everest", !is.na(age)) %>%
  group_by(hired) %>%
  summarize(
    n_climbers = n(),
    pct_death = mean(died)
  ) %>%
  knitr::kable()

hired	n_climbers	pct_death
FALSE	14624	0.0121718
TRUE	6373	0.0136513

It seems that expedition role H-A worker may be confounded with hired in some funky way

What happens when oxygen is included?

model <- members %>%
  filter(peak_name == "Everest", !is.na(age)) %>%
  mutate(leader = expedition_role == "Leader") %>%
  glm(died ~ year + age + sex + leader + hired + oxygen_used,
    data = .,
    family = "binomial"
  )

model %>%
  tidy(conf.int = TRUE, exponentiate = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(estimate, term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), color = "gray") +
  geom_vline(xintercept = 1, color = "red") +
  labs(
    title = "Relative risk factors for death on Everest summits",
    caption = "Visual: @jim_gruman, Data:  the Himalayan Database",
    x = "Exponentiated estimate for log odds"
  )

There may be an interaction, where added risk to hired workers is associated with climbing expeditions that use oxygen (maybe due to carrying the tanks). Maybe explore it further next time.

Women on Himalayan climbing expeditions @dm_ferrero

Calculate proportion of female climbers per peak per year

top_peaks <- members %>%
  count(peak_name, sort = TRUE) %>%
  slice(1:20) %>%
  pull(peak_name)

prop_women <- members %>%
  filter(peak_name %in% top_peaks) %>%
  count(peak_name, year, sex) %>%
  pivot_wider(names_from = sex, values_from = n, names_prefix = "num_") %>%
  replace_na(list(num_F = 0)) %>%
  mutate(
    prop_F = num_F / (num_M + num_F),
    total_peeps = num_M + num_F
  ) %>%
  arrange(year)

Calculate proportion of female climbers per peak per year

prop_women <- members %>%
  filter(peak_name %in% top_peaks) %>%
  count(peak_name, year, sex) %>%
  pivot_wider(names_from = sex, values_from = n, names_prefix = "num_") %>%
  replace_na(list(num_F = 0)) %>%
  mutate(
    prop_F = num_F / (num_M + num_F),
    total_peeps = num_M + num_F
  ) %>%
  arrange(year)

Inspired by Danielle Ferraro @dm_ferraro

# Plot
ggplot(prop_women, aes(
  x = year,
  y = factor(peak_name, levels = rev(top_peaks))
)) +
  geom_point(aes(size = total_peeps, color = prop_F), alpha = 0.7) +
  scale_color_viridis_b(
    n.breaks = 4,
    labels = scales::percent_format(accuracy = 1),
    option = "H"
  ) +
  scale_x_continuous(breaks = seq(1900, 2010, by = 10)) +
  labs(
    x = NULL,
    y = NULL,
    size = "Total expedition\nmembers",
    color = "Percentage of\nwomen",
    title = "Women on Himalayan climbing expeditions",
    subtitle = "Each point represents the number of expedition members and percentage of women on that peak for that year",
    caption = "Data: The Himalayan Database\nFigure by @jim_gruman for #TidyTuesday"
  ) +
  theme(
    panel.background = element_blank(),
    panel.grid.major = element_line(color = "grey30", size = 0.2),
    panel.grid.minor = element_line(color = "grey30", size = 0.2),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    legend.background = element_blank(),
    axis.ticks = element_blank(),
    legend.key = element_blank(),
    plot.title = element_text(size = 22)
  )

More Machine Learning Modeling

Strategies for handling class imbalances

Let’s make one last exploratory plot and look at seasons. How much difference is there in survival across the four seasons?

members %>%
  filter(season != "Unknown") %>%
  count(season, died) %>%
  group_by(season) %>%
  mutate(
    percent = n / sum(n),
    died = case_when(
      died ~ "Died",
      TRUE ~ "Did not die"
    )
  ) %>%
  ggplot(aes(season, percent, fill = season)) +
  geom_col(position = "dodge", show.legend = FALSE) +
  scale_fill_viridis_d(option = "H") +
  scale_y_continuous(labels = scales::percent_format()) +
  facet_wrap(~died, scales = "free") +
  labs(x = NULL, y = "% of expedition members")

Let’s now create the dataset that we’ll use for modeling by filtering on some of the variables and transforming some variables to a be factors. There are still lots of NA values for age but we are going to impute those.

members_df <- members %>%
  filter(season != "Unknown", !is.na(sex), !is.na(citizenship)) %>%
  select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>%
  mutate(died = case_when(
    died ~ "died",
    TRUE ~ "survived"
  )) %>%
  mutate_if(is.character, factor) %>%
  mutate_if(is.logical, as.integer)

Build a Model

We can start by loading the tidymodels metapackage, and splitting our data into training and testing sets.

members_split <- initial_split(members_df, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)

We are going to use resampling to evaluate model performance.

members_folds <- vfold_cv(members_train, strata = died)

Next we build a recipe for data preprocessing.

First, we must tell the recipe() what our model is going to be (using a formula here) and what our training data is.
Next, we impute the missing values for age using the median age in the training data set. There are more complex steps available for imputation, but we’ll stick with a straightforward option here.
Next, we use step_other() to collapse categorical levels for peak and citizenship. Before this step, there were hundreds of values in each variable.
After this, we can create indicator variables for the non-numeric, categorical values, except for the outcome died which we need to keep as a factor.
Finally, there are many more people who survived their expedition than who died (thankfully) so we will use step_smote() to balance the classes.

members_rec <- recipe(died ~ ., data = members_train) %>%
  step_impute_median(age) %>%
  step_other(peak_id, citizenship, threshold = 0.01) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_smote(died)

members_rec

Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor          8

Operations:

Median Imputation for age
Collapsing factor levels for peak_id, citizenship
Dummy variables from all_nominal_predictors()
SMOTE based on died

We’re going to use this recipe in a workflow() so we don’t need to stress a lot about whether to prep() or not. If you want to explore the what the recipe is doing to your data, you can first prep() the recipe to estimate the parameters needed for each step and then bake(new_data = NULL) to pull out the training data with those steps applied.

Let’s compare two different models, a logistic regression model and a random forest model. We start by creating the model specifications.

glm_spec <- logistic_reg() %>%
  set_engine("glm")

glm_spec

Logistic Regression Model Specification (classification)

Computational engine: glm

rf_spec <- rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_spec

Random Forest Model Specification (classification)

Main Arguments:
  trees = 1000

Computational engine: ranger

Next let’s start putting together a tidymodelsworkflow(), a helper object to help manage modeling pipelines with pieces that fit together like Lego blocks. Notice that there is no model yet: Model: None.

members_wf <- workflow() %>%
  add_recipe(members_rec)

members_wf

== Workflow ====================================================================
Preprocessor: Recipe
Model: None

-- Preprocessor ----------------------------------------------------------------
4 Recipe Steps

* step_impute_median()
* step_other()
* step_dummy()
* step_smote()

Now we can add a model, and the fit to each of the resamples. First, we can fit the logistic regression model. Let’s set a non-default metric set so we can add sensitivity and specificity.

all_cores <- parallelly::availableCores(omit = 1)
all_cores

system 
    11

future::plan("multisession", workers = all_cores) # on Windows

members_metrics <- metric_set(roc_auc, accuracy, sensitivity, specificity)

# doParallel::registerDoParallel()
glm_rs <- members_wf %>%
  add_model(glm_spec) %>%
  fit_resamples(
    resamples = members_folds,
    metrics = members_metrics,
    control = control_resamples(save_pred = TRUE)
  )

Second, we can fit the random forest model.

rf_rs <- members_wf %>%
  add_model(rf_spec) %>%
  fit_resamples(
    resamples = members_folds,
    metrics = members_metrics,
    control = control_resamples(save_pred = TRUE)
  )

We have fit each of our candidate models to our resampled training set!

Evaluate Models

Let’s check out how we did

collect_metrics(glm_rs) %>% knitr::kable()

.metric	.estimator	mean	n	std_err	.config
accuracy	binary	0.6676891	10	0.0015514	Preprocessor1_Model1
roc_auc	binary	0.7202278	10	0.0085022	Preprocessor1_Model1
sens	binary	0.6798315	10	0.0148981	Preprocessor1_Model1
spec	binary	0.6674792	10	0.0016366	Preprocessor1_Model1

Well, this is middling but at least mostly consistent for the positive and negative classes.

collect_metrics(rf_rs) %>% knitr::kable()

.metric	.estimator	mean	n	std_err	.config
accuracy	binary	0.9780237	10	0.0003557	Preprocessor1_Model1
roc_auc	binary	0.7594277	10	0.0104190	Preprocessor1_Model1
sens	binary	0.1355960	10	0.0075674	Preprocessor1_Model1
spec	binary	0.9903824	10	0.0004311	Preprocessor1_Model1

The accuracy is great but that sensitivity is not what we’d like to see. The random forest model has not done a great job of learning how to recognize both classes, even with our oversampling strategy. Let’s dig deeper into how these models are doing to see this more. For example, how are they predicting the two classes?

glm_rs %>%
  conf_mat_resampled(tidy = FALSE) %>%
  autoplot() +
  labs(title = "GLM Confidence Matrix across resamples") +
  theme_jim(base_size = 10)

rf_rs %>%
  conf_mat_resampled(tidy = FALSE) %>%
  autoplot() +
  labs(title = "Random Forest Confidence Matrix across resamples") +
  theme_jim(base_size = 10)

The random forest model is quite bad at identifying which expedition members died, while the logistic regression model does about the same for both classes.

We can also make an ROC curve.

glm_rs %>%
  collect_predictions() %>%
  group_by(id) %>%
  roc_curve(died, .pred_died) %>%
  ggplot(aes(1 - specificity, sensitivity, color = id)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_path(show.legend = FALSE, alpha = 0.6, size = 1.2) +
  coord_equal()

It is finally time for us to return to the testing set. Notice that we have not used the testing set yet during this whole analysis; to compare and assess models we used resamples of the training set. Let’s fit one more time to the training data and evaluate on the testing data using the function last_fit().

members_final <- members_wf %>%
  add_model(glm_spec) %>%
  last_fit(members_split)

members_final

# Resampling results
# Manual resampling 
# A tibble: 1 x 6
  splits                id               .metrics  .notes .predictions .workflow
  <list>                <chr>            <list>    <list> <list>       <list>   
1 <split [57380/19127]> train/test split <tibble ~ <tibb~ <tibble [19~ <workflo~

The metrics and predictions here are on the testing data.

collect_metrics(members_final) %>% knitr::kable()

.metric	.estimator	.estimate	.config
accuracy	binary	0.6715115	Preprocessor1_Model1
roc_auc	binary	0.7243124	Preprocessor1_Model1

collect_predictions(members_final) %>%
  conf_mat(died, .pred_class) %>%
  autoplot() +
  labs(title = "GLM Model Performance on Testing Data") +
  theme_jim(base_size = 10)

Well, this is not so great. The GLM model predicts that all climbers survive.

The coefficients (which we can get out using tidy()) have been estimated using the training data. If we use exponentiate = TRUE, we have odds ratios.

members_final %>%
  extract_workflow() %>%
  tidy(exponentiate = TRUE) %>%
  arrange(estimate) %>%
  knitr::kable(digits = 4)

term	estimate	std.error	statistic	p.value
(Intercept)	0.0000	1.0071	-50.7877	0.0000
peak_id_ANN1	0.0814	0.0529	-47.4281	0.0000
peak_id_DHA1	0.0983	0.0480	-48.3149	0.0000
peak_id_KANG	0.1000	0.0570	-40.3997	0.0000
peak_id_PUMO	0.1652	0.0537	-33.5158	0.0000
peak_id_MAKA	0.1952	0.0504	-32.4175	0.0000
peak_id_MANA	0.1973	0.0433	-37.5068	0.0000
peak_id_EVER	0.2294	0.0377	-39.0889	0.0000
peak_id_other	0.2550	0.0375	-36.4663	0.0000
hired	0.4030	0.0575	-15.8060	0.0000
sex_M	0.4672	0.0317	-24.0209	0.0000
peak_id_LHOT	0.5757	0.0587	-9.4069	0.0000
peak_id_CHOY	0.6400	0.0427	-10.4613	0.0000
citizenship_Russia	0.8491	0.0698	-2.3435	0.0191
peak_id_HIML	0.8916	0.0832	-1.3781	0.1682
peak_id_BARU	0.9108	0.0656	-1.4230	0.1547
peak_id_ANN4	0.9647	0.0917	-0.3919	0.6952
age	0.9904	0.0008	-12.7651	0.0000
year	1.0267	0.0005	52.1618	0.0000
citizenship_India	1.0906	0.0635	1.3655	0.1721
season_Spring	1.1001	0.0168	5.6950	0.0000
season_Winter	1.1219	0.0444	2.5881	0.0096
citizenship_Germany	1.3270	0.0623	4.5435	0.0000
citizenship_S.Korea	1.4019	0.0591	5.7147	0.0000
citizenship_Poland	1.7852	0.0686	8.4455	0.0000
citizenship_other	1.8164	0.0521	11.4518	0.0000
success	2.3182	0.0166	50.7364	0.0000
citizenship_Japan	2.4260	0.0542	16.3549	0.0000
citizenship_Spain	2.8580	0.0603	17.4191	0.0000
citizenship_Switzerland	2.9370	0.0660	16.3193	0.0000
citizenship_France	2.9518	0.0578	18.7304	0.0000
citizenship_Nepal	2.9763	0.0747	14.5924	0.0000
citizenship_Austria	3.4269	0.0680	18.1080	0.0000
citizenship_China	4.0187	0.0714	19.4695	0.0000
citizenship_New.Zealand	4.4991	0.0960	15.6614	0.0000
citizenship_Canada	4.9039	0.0905	17.5721	0.0000
citizenship_UK	5.0950	0.0587	27.7364	0.0000
citizenship_Italy	5.3972	0.0662	25.4856	0.0000
citizenship_USA	5.6072	0.0581	29.6837	0.0000
season_Summer	6.4447	0.0974	19.1391	0.0000

we can also visualize the variable coefficients of the GLM model.

members_final %>%
  extract_workflow() %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(estimate, fct_reorder(term, estimate))) +
  geom_vline(xintercept = 0, color = "gray50", lty = 2, size = 1.2) +
  geom_errorbar(aes(
    xmin = estimate - std.error,
    xmax = estimate + std.error
  ),
  width = .2, color = "gray50", alpha = 0.7
  ) +
  geom_point(size = 2) +
  labs(y = NULL, x = "Coefficent from logistic regression")

The features with coefficients on the positive side (like climbing in summer, being on a successful expedition, or being from the UK or US) are associated with surviving.
The features with coefficients on the negative side (like climbing specific peaks including Everest, being one of the hired members of a expedition, or being a man) are associated with dying.

Remember that we have to interpret model coefficients like this in light of the predictive accuracy of our model, which was somewhat middling; there are more factors at play in who survives these expeditions than what we have accounted for in the model directly. Also note that we see evidence in this model for how dangerous it is to be a native Sherpa climber in Nepal, hired as an expedition member.

sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ranger_0.13.1      vctrs_0.3.8        rlang_0.4.11       themis_0.1.4      
 [5] yardstick_0.0.8    workflowsets_0.1.0 workflows_0.2.3    tune_0.1.6        
 [9] rsample_0.1.0      recipes_0.1.16     parsnip_0.1.7.900  modeldata_0.1.1   
[13] infer_1.0.0        dials_0.0.9.9000   scales_1.1.1       broom_0.7.9       
[17] tidymodels_0.1.3   gt_0.3.1           ebbr_0.1           forcats_0.5.1     
[21] stringr_1.4.0      dplyr_1.0.7        purrr_0.3.4        readr_2.0.1       
[25] tidyr_1.1.3        tibble_3.1.4       ggplot2_3.3.5      tidyverse_1.3.1   
[29] workflowr_1.6.2   

loaded via a namespace (and not attached):
  [1] utf8_1.2.2         R.utils_2.10.1     tidyselect_1.1.1  
  [4] grid_4.1.1         pROC_1.18.0        munsell_0.5.0     
  [7] codetools_0.2-18   ragg_1.1.3         future_1.22.1     
 [10] withr_2.4.2        colorspace_2.0-2   highr_0.9         
 [13] knitr_1.34         rstudioapi_0.13    stats4_4.1.1      
 [16] Rttf2pt1_1.3.9     listenv_0.8.0      labeling_0.4.2    
 [19] git2r_0.28.0       bit64_4.0.5        DiceDesign_1.9    
 [22] farver_2.1.0       rprojroot_2.0.2    mlr_2.19.0        
 [25] parallelly_1.28.1  generics_0.1.0     ipred_0.9-11      
 [28] xfun_0.25          R6_2.5.1           markdown_1.1      
 [31] doParallel_1.0.16  VGAM_1.1-5         lhs_1.1.3         
 [34] cachem_1.0.6       assertthat_0.2.1   promises_1.2.0.1  
 [37] vroom_1.5.4        nnet_7.3-16        gtable_0.3.0      
 [40] globals_0.14.0     timeDate_3043.102  BBmisc_1.11       
 [43] systemfonts_1.0.2  splines_4.1.1      extrafontdb_1.0   
 [46] lazyeval_0.2.2     selectr_0.4-2      prismatic_1.0.0   
 [49] checkmate_2.0.0    yaml_2.2.1         modelr_0.1.8      
 [52] backports_1.2.1    httpuv_1.6.2       gridtext_0.1.4    
 [55] extrafont_0.17     tools_4.1.1        lava_1.6.10       
 [58] usethis_2.0.1      ellipsis_0.3.2     jquerylib_0.1.4   
 [61] Rcpp_1.0.7         plyr_1.8.6         parallelMap_1.5.1 
 [64] rpart_4.1-15       ParamHelpers_1.14  viridis_0.6.1     
 [67] haven_2.4.3        hrbrthemes_0.8.0   fs_1.5.0          
 [70] here_1.0.1         furrr_0.2.3        unbalanced_2.0    
 [73] magrittr_2.0.1     data.table_1.14.0  reprex_2.0.1      
 [76] RANN_2.6.1         GPfit_1.0-8        whisker_0.4       
 [79] ROSE_0.0-4         R.cache_0.15.0     hms_1.1.0         
 [82] evaluate_0.14      readxl_1.3.1       gridExtra_2.3     
 [85] compiler_4.1.1     crayon_1.4.1       R.oo_1.24.0       
 [88] htmltools_0.5.2    later_1.3.0        tzdb_0.1.2        
 [91] ggtext_0.1.1       lubridate_1.7.10   DBI_1.1.1         
 [94] dbplyr_2.1.1       MASS_7.3-54        Matrix_1.3-4      
 [97] cli_3.0.1          R.methodsS3_1.8.1  parallel_4.1.1    
[100] gower_0.2.2        pkgconfig_2.0.3    xml2_1.3.2        
[103] foreach_1.5.1      bslib_0.3.0        hardhat_0.1.6     
[106] tidytuesdayR_1.0.1 prodlim_2019.11.13 rvest_1.0.1       
[109] digest_0.6.27      rmarkdown_2.10     cellranger_1.1.0  
[112] fastmatch_1.1-3    gdtools_0.2.3      curl_4.3.2        
[115] lifecycle_1.0.0    jsonlite_1.7.2     viridisLite_0.4.0 
[118] fansi_0.5.0        pillar_1.6.2       lattice_0.20-44   
[121] fastmap_1.1.0      httr_1.4.2         survival_3.2-11   
[124] glue_1.4.2         conflicted_1.0.4   FNN_1.1.3         
[127] iterators_1.0.13   bit_4.0.4          class_7.3-19      
[130] stringi_1.7.4      sass_0.4.0         rematch2_2.1.2    
[133] textshaping_0.3.5  styler_1.5.1       future.apply_1.8.1

Himalayan Climbers

Jim Gruman

September 22, 2020

Explore Mount Everest more closely

Women on Himalayan climbing expeditions @dm_ferrero

More Machine Learning Modeling

Build a Model

Evaluate Models