You are reading the work-in-progress first edition of Causal Inference in R. This chapter has its foundations written but is still undergoing changes.
7 Preparing data to answer causal questions
7.1 Introduction to the data
Throughout this book we will be using data obtained from Touring Plans. Touring Plans is a company that helps folks plan their trips to Disney and Universal theme parks. One of their goals is to accurately predict attraction wait times at these theme parks by leveraging data and statistical modeling. The touringplans R package includes several datasets containing information about Disney theme park attractions. A summary of the attractions included in the package can be found by running the following:
library(touringplans)
attractions_metadata
# A tibble: 14 × 8
dataset_name name short_name park land opened_on
<chr> <chr> <chr> <chr> <chr> <date>
1 alien_sauce… Alie… Alien Sau… Disn… Toy … 2018-06-30
2 dinosaur DINO… DINOSAUR Disn… Dino… 1998-04-22
3 expedition_… Expe… Expeditio… Disn… Asia 2006-04-07
4 flight_of_p… Avat… Flight of… Disn… Pand… 2017-05-27
5 kilimanjaro… Kili… Kilimanja… Disn… Afri… 1998-04-22
6 navi_river Na'v… Na'vi Riv… Disn… Pand… 2017-05-27
7 pirates_of_… Pira… Pirates o… Magi… Adve… 1973-12-17
8 rock_n_roll… Rock… Rock Coas… Disn… Suns… 1999-07-29
9 seven_dwarf… Seve… 7 Dwarfs … Magi… Fant… 2014-05-28
10 slinky_dog Slin… Slinky Dog Disn… Toy … 2018-06-30
11 soarin Soar… Soarin' Epcot Worl… 2005-05-05
12 spaceship_e… Spac… Spaceship… Epcot Worl… 1982-10-01
13 splash_moun… Spla… Splash Mo… Magi… Fron… 1992-07-17
14 toy_story_m… Toy … Toy Story… Disn… Toy … 2008-05-31
# ℹ 2 more variables: duration <dbl>,
# average_wait_per_hundred <dbl>
Additionally, this package contains a dataset with raw metadata about the parks, with observations recorded daily. This metadata includes information like the Walt Disney World ticket season on the particular day (was it high season – think Christmas – or low season – think right when school started), what the historic temperatures were in the park on that day, and whether there was a special event, such as “extra magic hours” in the park on that day (did the park open early to guests staying in the Walt Disney World resorts?).
parks_metadata_raw
# A tibble: 2,079 × 181
date wdw_ticket_season dayofweek dayofyear
<date> <chr> <dbl> <dbl>
1 2015-01-01 <NA> 5 0
2 2015-01-02 <NA> 6 1
3 2015-01-03 <NA> 7 2
4 2015-01-04 <NA> 1 3
5 2015-01-05 <NA> 2 4
6 2015-01-06 <NA> 3 5
7 2015-01-07 <NA> 4 6
8 2015-01-08 <NA> 5 7
9 2015-01-09 <NA> 6 8
10 2015-01-10 <NA> 7 9
# ℹ 2,069 more rows
# ℹ 177 more variables: weekofyear <dbl>,
# monthofyear <dbl>, year <dbl>, season <chr>,
# holidaypx <dbl>, holidaym <dbl>, holidayn <chr>,
# holiday <dbl>, wdwticketseason <chr>,
# wdwracen <chr>, wdweventn <chr>, wdwevent <dbl>,
# wdwrace <dbl>, wdwseason <chr>, …
Suppose the causal question of interest is:
Is there a relationship between whether there were “Extra Magic Hours” in the morning at Magic Kingdom and the average wait time for an attraction called the “Seven Dwarfs Mine Train” the same day between 9am and 10am in 2018?
Let’s begin by diagramming this causal question (Figure 7.1).
Historically, guests who stayed in a Walt Disney World resort hotel could access the park during “Extra Magic Hours,” during which the park was closed to all other guests. These extra hours could be in the morning or evening. The Seven Dwarfs Mine Train is a ride at Walt Disney World’s Magic Kingdom. Magic Kingdom may or may not be selected each day to have these “Extra Magic Hours.” We are interested in examining the relationship between whether there were “Extra Magic Hours” in the morning and the average wait time for the Seven Dwarfs Mine Train on the same day between 9 am and 10 am. Below is a proposed DAG for this question.
Since we are not in charge of Walt Disney World’s operations, we cannot randomize dates to have (or not) “Extra Magic Hours”, therefore, we need to rely on previously collected observational data and do our best to emulate the target trial that we would have created, should it have been possible. Here, our observations are days. Looking at the diagram above, we can map each element of the causal question to elements of our target trial protocol:
- Eligibility criteria: Days must be from 2018
- Exposure definition: Magic kingdom had “Extra Magic Hours” in the morning
- Assignment procedures: Observed – if the historic data suggests there were “Extra Magic Hours” in the morning on a particular day, that day is classified as “exposed” otherwise it is “unexposed”
- Follow-up period: From park open to 10am.
- Outcome definition: The average posted wait time between 9am and 10am
- Causal contrast of interest: Average treatment effect (we will discuss this in Chapter 10)
- Analysis plan: We use inverse probability weighting after fitting a propensity score model to estimate the average treatment effect of the exposure on the outcome of interest. We will adjust for variables as determined by our DAG (Figure 7.2)
7.2 Data wrangling and recipes
Most of our data manipulation tools come from the dplyr package (Table 7.1). We will also use lubridate to help us manipulate dates.
Target trial protocol element | {dplyr} functions |
---|---|
Eligibility criteria | filter() |
Exposure definition | mutate() |
Assignment procedures | mutate() |
Follow-up period |
mutate() pivot_longer() pivot_wider()
|
Outcome definition | mutate() |
Analysis plan |
select() mutate()
|
To answer this question, we are going to need to manipulate both the seven_dwarfs_train
dataset as well as the parks_metadata_raw
dataset. Let’s start with the seven_dwarfs_train
data set. The Seven Dwarfs Mine Train ride is an attraction at Walt Disney World’s Magic Kingdom. The seven_dwarfs_train
dataset in the {touringplans} package contains information about the date a particular wait time was recorded (park_date
), the time of the wait time (wait_datetime
), the actual wait time (wait_minutes_actual
), and the posted wait time (wait_minutes_posted
). Let’s take a look at this dataset. The {skimr} package is great for getting a quick glimpse at a new dataset.
Name | seven_dwarfs_train |
Number of rows | 321631 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
Date | 1 |
numeric | 2 |
POSIXct | 1 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
park_date | 0 | 1 | 2015-01-01 | 2021-12-28 | 2018-04-07 | 2334 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
wait_minutes_actual | 313996 | 0.02 | 23.99 | 1064.06 | -92918 | 21 | 31 | 46 | 217 | ▁▁▁▁▇ |
wait_minutes_posted | 30697 | 0.90 | 76.96 | 33.99 | 0 | 50 | 70 | 95 | 300 | ▆▇▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
wait_datetime | 0 | 1 | 2015-01-01 07:51:12 | 2021-12-28 22:57:34 | 2018-04-07 23:14:06 | 321586 |
Examining the output above, we learn that this dataset contains four columns and 321,631 rows. We also learn that the dates span from 2015 to 2021. We can also examine the distribution of each of the variables to detect any potential anomalies. Notice anything strange? Look at the p0
(that is the minimum value) for wait_minutes_actual
. It is -92918
! We are not using this variable for this analysis, but we will for future analyses, so this is good to keep in mind.
We need this dataset to calculate our outcome. Recall from above that our outcome is defined as the average posted wait time between 9am and 10am. Additionally, recall our eligibility criteria states that we need to restrict our analysis to days in 2018.
library(dplyr)
library(lubridate)
seven_dwarfs_train_2018 <- seven_dwarfs_train |>
filter(year(park_date) == 2018) |> # eligibility criteria
mutate(hour = hour(wait_datetime)) |> # get hour from wait
group_by(park_date, hour) |> # group by date
summarise(
wait_minutes_posted_avg = mean(wait_minutes_posted, na.rm = TRUE),
.groups = "drop"
) |> # get average wait time
mutate(
wait_minutes_posted_avg =
case_when(
is.nan(wait_minutes_posted_avg) ~ NA,
TRUE ~ wait_minutes_posted_avg
)
) |> # if it is NAN make it NA
filter(hour == 9) # only keep the average wait time between 9 and 10
seven_dwarfs_train_2018
# A tibble: 362 × 3
park_date hour wait_minutes_posted_avg
<date> <int> <dbl>
1 2018-01-01 9 60
2 2018-01-02 9 60
3 2018-01-03 9 60
4 2018-01-04 9 68.9
5 2018-01-05 9 70.6
6 2018-01-06 9 33.3
7 2018-01-07 9 46.4
8 2018-01-08 9 69.5
9 2018-01-09 9 64.3
10 2018-01-10 9 74.3
# ℹ 352 more rows
Now that we have our outcome settled, we need to get our exposure variable, as well as any other park-specific variables about the day in question that may be used as variables that we adjust for. Examining Figure 7.2, we see that we need data for three proposed confounders: the ticket season, the time the park closed, and the historic high temperature. These are in the parks_metadata_raw
dataset. This data will require extra cleaning, since the names are in the original format.
We like to have our variable names follow a clean convention – one way to do this is to follow Emily Riederer’s “Column Names as Contracts” format (Riederer 2020). The basic idea is to predefine a set of words, phrases, or stubs with clear meanings to index information, and use these consistently when naming variables. For example, in these data, variables that are specific to a particular wait time are prepended with the term wait
(e.g. wait_datetime
and wait_minutes_actual
), variables that are specific to the park on a particular day, acquired from parks metadata, are prepended with the term park
(e.g. park_date
or park_temperature_high
).
Let’s first decide what variables we will need. In practice, this decision may involve an iterative process. For example, after drawing our DAG or after conducting diagnostic, we may determine that we need more variables than what we originally cleaned. Let’s start by skimming this dataframe.
skim(parks_metadata_raw)
Name | parks_metadata_raw |
Number of rows | 2079 |
Number of columns | 181 |
_______________________ | |
Column type frequency: | |
character | 42 |
Date | 1 |
difftime | 46 |
logical | 6 |
numeric | 86 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
wdw_ticket_season | 861 | 0.59 | 4 | 7 | 0 | 3 | 0 |
season | 253 | 0.88 | 4 | 29 | 0 | 17 | 0 |
holidayn | 1865 | 0.10 | 3 | 7 | 0 | 43 | 0 |
wdwticketseason | 761 | 0.63 | 4 | 7 | 0 | 3 | 0 |
wdwracen | 1992 | 0.04 | 4 | 6 | 0 | 5 | 0 |
wdweventn | 1832 | 0.12 | 3 | 12 | 0 | 8 | 0 |
wdwseason | 87 | 0.96 | 4 | 29 | 0 | 17 | 0 |
mkeventn | 1546 | 0.26 | 3 | 11 | 0 | 10 | 0 |
epeventn | 868 | 0.58 | 4 | 5 | 0 | 4 | 0 |
hseventn | 1877 | 0.10 | 4 | 7 | 0 | 5 | 0 |
akeventn | 2010 | 0.03 | 4 | 4 | 0 | 2 | 0 |
holidayj | 2037 | 0.02 | 5 | 15 | 0 | 8 | 0 |
insession | 105 | 0.95 | 2 | 3 | 0 | 95 | 0 |
insession_enrollment | 105 | 0.95 | 2 | 4 | 0 | 100 | 0 |
insession_wdw | 105 | 0.95 | 2 | 4 | 0 | 94 | 0 |
insession_dlr | 105 | 0.95 | 2 | 4 | 0 | 94 | 0 |
insession_sqrt_wdw | 105 | 0.95 | 2 | 4 | 0 | 97 | 0 |
insession_sqrt_dlr | 105 | 0.95 | 2 | 4 | 0 | 97 | 0 |
insession_california | 105 | 0.95 | 2 | 4 | 0 | 89 | 0 |
insession_dc | 105 | 0.95 | 2 | 4 | 0 | 86 | 0 |
insession_central_fl | 105 | 0.95 | 2 | 4 | 0 | 71 | 0 |
insession_drive1_fl | 105 | 0.95 | 2 | 4 | 0 | 85 | 0 |
insession_drive2_fl | 105 | 0.95 | 2 | 4 | 0 | 95 | 0 |
insession_drive_ca | 105 | 0.95 | 2 | 4 | 0 | 91 | 0 |
insession_florida | 105 | 0.95 | 2 | 4 | 0 | 86 | 0 |
insession_mardi_gras | 105 | 0.95 | 2 | 4 | 0 | 82 | 0 |
insession_midwest | 105 | 0.95 | 2 | 4 | 0 | 75 | 0 |
insession_ny_nj | 105 | 0.95 | 2 | 4 | 0 | 8 | 0 |
insession_ny_nj_pa | 105 | 0.95 | 2 | 4 | 0 | 19 | 0 |
insession_new_england | 105 | 0.95 | 2 | 4 | 0 | 45 | 0 |
insession_new_jersey | 105 | 0.95 | 2 | 4 | 0 | 2 | 0 |
insession_nothwest | 105 | 0.95 | 2 | 4 | 0 | 17 | 0 |
insession_planes | 105 | 0.95 | 2 | 4 | 0 | 81 | 0 |
insession_socal | 105 | 0.95 | 2 | 4 | 0 | 80 | 0 |
insession_southwest | 105 | 0.95 | 2 | 4 | 0 | 86 | 0 |
mkprddn | 183 | 0.91 | 33 | 41 | 0 | 2 | 0 |
mkprdnn | 1358 | 0.35 | 29 | 38 | 0 | 2 | 0 |
mkfiren | 134 | 0.94 | 18 | 65 | 0 | 8 | 0 |
epfiren | 126 | 0.94 | 13 | 35 | 0 | 2 | 0 |
hsfiren | 485 | 0.77 | 24 | 66 | 0 | 6 | 0 |
hsshwnn | 164 | 0.92 | 10 | 28 | 0 | 2 | 0 |
akshwnn | 883 | 0.58 | 15 | 33 | 0 | 2 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2015-01-01 | 2021-08-31 | 2017-11-05 | 2079 |
Variable type: difftime
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
mkopen | 0 | 1.00 | 21600 secs | 32400 secs | 09:00:00 | 4 |
mkclose | 0 | 1.00 | 54000 secs | 107940 secs | 22:00:00 | 13 |
mkemhopen | 0 | 1.00 | 21600 secs | 32400 secs | 09:00:00 | 5 |
mkemhclose | 0 | 1.00 | 54000 secs | 107940 secs | 23:00:00 | 14 |
mkopenyest | 0 | 1.00 | 21600 secs | 32400 secs | 09:00:00 | 4 |
mkcloseyest | 0 | 1.00 | 54000 secs | 107940 secs | 22:00:00 | 13 |
mkopentom | 0 | 1.00 | 21600 secs | 32400 secs | 09:00:00 | 4 |
mkclosetom | 0 | 1.00 | 54000 secs | 107940 secs | 22:00:00 | 13 |
epopen | 0 | 1.00 | 25200 secs | 43200 secs | 09:00:00 | 6 |
epclose | 0 | 1.00 | 61200 secs | 90000 secs | 21:00:00 | 9 |
epemhopen | 0 | 1.00 | 25200 secs | 43200 secs | 09:00:00 | 6 |
epemhclose | 0 | 1.00 | 61200 secs | 90000 secs | 21:00:00 | 12 |
epopenyest | 0 | 1.00 | 25200 secs | 43200 secs | 09:00:00 | 6 |
epcloseyest | 0 | 1.00 | 61200 secs | 90000 secs | 21:00:00 | 9 |
epopentom | 0 | 1.00 | 25200 secs | 43200 secs | 09:00:00 | 6 |
epclosetom | 0 | 1.00 | 61200 secs | 90000 secs | 21:00:00 | 9 |
hsopen | 0 | 1.00 | 21600 secs | 36000 secs | 09:00:00 | 6 |
hsclose | 0 | 1.00 | 50400 secs | 86400 secs | 21:00:00 | 14 |
hsemhopen | 0 | 1.00 | 21600 secs | 36000 secs | 09:00:00 | 7 |
hsemhclose | 0 | 1.00 | 50400 secs | 93600 secs | 21:00:00 | 18 |
hsopenyest | 0 | 1.00 | 21600 secs | 36000 secs | 09:00:00 | 6 |
hscloseyest | 0 | 1.00 | 50400 secs | 86400 secs | 21:00:00 | 14 |
hsopentom | 0 | 1.00 | 21600 secs | 36000 secs | 09:00:00 | 6 |
hsclosetom | 0 | 1.00 | 50400 secs | 86400 secs | 21:00:00 | 14 |
akopen | 0 | 1.00 | 25200 secs | 32400 secs | 09:00:00 | 3 |
akclose | 0 | 1.00 | 50400 secs | 86400 secs | 20:00:00 | 16 |
akemhopen | 0 | 1.00 | 25200 secs | 32400 secs | 09:00:00 | 3 |
akemhclose | 0 | 1.00 | 50400 secs | 90000 secs | 20:00:00 | 17 |
akopenyest | 0 | 1.00 | 25200 secs | 32400 secs | 09:00:00 | 3 |
akcloseyest | 0 | 1.00 | 50400 secs | 86400 secs | 20:00:00 | 16 |
akopentom | 0 | 1.00 | 25200 secs | 32400 secs | 09:00:00 | 3 |
akclosetom | 0 | 1.00 | 50400 secs | 86400 secs | 20:00:00 | 16 |
mkprddt1 | 183 | 0.91 | 39600 secs | 61200 secs | 15:00:00 | 5 |
mkprddt2 | 1851 | 0.11 | 50400 secs | 73800 secs | 15:30:00 | 5 |
mkprdnt1 | 1358 | 0.35 | 68400 secs | 82800 secs | 21:00:00 | 11 |
mkprdnt2 | 1480 | 0.29 | 0 secs | 84600 secs | 23:00:00 | 8 |
mkfiret1 | 134 | 0.94 | 66600 secs | 80100 secs | 21:15:00 | 12 |
mkfiret2 | 2069 | 0.00 | 85800 secs | 85800 secs | 23:50:00 | 1 |
epfiret1 | 126 | 0.94 | 64800 secs | 81000 secs | 21:00:00 | 6 |
epfiret2 | 2074 | 0.00 | 85200 secs | 85200 secs | 23:40:00 | 1 |
hsfiret1 | 485 | 0.77 | 0 secs | 82800 secs | 21:00:00 | 17 |
hsfiret2 | 2045 | 0.02 | 0 secs | 81000 secs | 21:00:00 | 5 |
hsshwnt1 | 164 | 0.92 | 65100 secs | 79200 secs | 20:30:00 | 10 |
hsshwnt2 | 1369 | 0.34 | 72000 secs | 82800 secs | 21:30:00 | 11 |
akshwnt1 | 883 | 0.58 | 65700 secs | 76500 secs | 20:30:00 | 13 |
akshwnt2 | 1149 | 0.45 | 70200 secs | 81000 secs | 21:45:00 | 13 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
hsprddt1 | 2079 | 0 | NaN | : |
hsprddn | 2079 | 0 | NaN | : |
akprddt1 | 2079 | 0 | NaN | : |
akprddt2 | 2079 | 0 | NaN | : |
akprddn | 2079 | 0 | NaN | : |
akfiren | 2079 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
dayofweek | 0 | 1 | 4.000e+00 | 2.000e+00 | 1.000e+00 | 2.000e+00 | 4.000e+00 | 6.000e+00 | 7.000e+00 | ▇▃▃▃▇ |
dayofyear | 0 | 1 | 1.818e+02 | 1.063e+02 | 0.000e+00 | 8.900e+01 | 1.840e+02 | 2.730e+02 | 3.650e+02 | ▇▇▇▇▇ |
weekofyear | 0 | 1 | 2.609e+01 | 1.520e+01 | 0.000e+00 | 1.300e+01 | 2.600e+01 | 3.900e+01 | 5.300e+01 | ▇▇▇▇▇ |
monthofyear | 0 | 1 | 6.510e+00 | 3.480e+00 | 1.000e+00 | 3.000e+00 | 7.000e+00 | 1.000e+01 | 1.200e+01 | ▇▅▆▅▇ |
year | 0 | 1 | 2.017e+03 | 1.740e+00 | 2.015e+03 | 2.016e+03 | 2.017e+03 | 2.019e+03 | 2.021e+03 | ▇▃▃▃▃ |
holidaypx | 0 | 1 | 7.850e+00 | 6.890e+00 | 0.000e+00 | 3.000e+00 | 6.000e+00 | 1.000e+01 | 3.300e+01 | ▇▅▂▁▁ |
holidaym | 0 | 1 | 5.400e-01 | 1.350e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 5.000e+00 | ▇▁▁▁▁ |
holiday | 0 | 1 | 1.000e-01 | 3.000e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
wdwevent | 0 | 1 | 1.200e-01 | 3.200e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
wdwrace | 0 | 1 | 4.000e-02 | 2.000e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
wdwmaxtemp | 5 | 1 | 8.280e+01 | 8.530e+00 | 5.111e+01 | 7.829e+01 | 8.450e+01 | 8.954e+01 | 9.772e+01 | ▁▁▃▇▆ |
wdwmintemp | 6 | 1 | 6.550e+01 | 1.018e+01 | 2.748e+01 | 5.903e+01 | 6.835e+01 | 7.410e+01 | 8.128e+01 | ▁▂▃▆▇ |
wdwmeantemp | 6 | 1 | 7.415e+01 | 9.060e+00 | 3.975e+01 | 6.876e+01 | 7.637e+01 | 8.161e+01 | 8.776e+01 | ▁▂▃▆▇ |
mkevent | 0 | 1 | 2.600e-01 | 4.400e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | 1.000e+00 | ▇▁▁▁▃ |
epevent | 0 | 1 | 5.800e-01 | 4.900e-01 | 0.000e+00 | 0.000e+00 | 1.000e+00 | 1.000e+00 | 1.000e+00 | ▆▁▁▁▇ |
hsevent | 0 | 1 | 1.000e-01 | 3.000e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
akevent | 0 | 1 | 3.000e-02 | 1.800e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
mkemhmorn | 0 | 1 | 1.900e-01 | 4.000e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▂ |
mkemhmyest | 0 | 1 | 1.900e-01 | 4.000e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▂ |
mkemhmtom | 0 | 1 | 1.900e-01 | 4.000e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▂ |
mkemheve | 0 | 1 | 1.300e-01 | 3.300e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
mkhoursemh | 0 | 1 | 1.364e+01 | 1.980e+00 | 7.500e+00 | 1.300e+01 | 1.400e+01 | 1.500e+01 | 2.398e+01 | ▁▇▅▁▁ |
mkhoursemhyest | 0 | 1 | 1.365e+01 | 1.980e+00 | 7.500e+00 | 1.300e+01 | 1.400e+01 | 1.500e+01 | 2.398e+01 | ▁▇▅▁▁ |
mkhoursemhtom | 0 | 1 | 1.364e+01 | 1.980e+00 | 7.500e+00 | 1.300e+01 | 1.400e+01 | 1.500e+01 | 2.398e+01 | ▁▇▅▁▁ |
mkemheyest | 0 | 1 | 1.300e-01 | 3.300e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
mkemhetom | 0 | 1 | 1.300e-01 | 3.300e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
epemhmorn | 0 | 1 | 1.300e-01 | 3.400e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
epemhmyest | 0 | 1 | 1.300e-01 | 3.400e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
epemhmtom | 0 | 1 | 1.300e-01 | 3.400e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
epemheve | 0 | 1 | 1.300e-01 | 3.400e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
epemheyest | 0 | 1 | 1.300e-01 | 3.400e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
epemhetom | 0 | 1 | 1.300e-01 | 3.400e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
ephoursemh | 0 | 1 | 1.241e+01 | 9.600e-01 | 9.000e+00 | 1.200e+01 | 1.200e+01 | 1.300e+01 | 1.700e+01 | ▁▇▃▂▁ |
ephoursemhyest | 0 | 1 | 1.241e+01 | 9.600e-01 | 9.000e+00 | 1.200e+01 | 1.200e+01 | 1.300e+01 | 1.700e+01 | ▁▇▃▂▁ |
ephoursemhtom | 0 | 1 | 1.241e+01 | 9.600e-01 | 9.000e+00 | 1.200e+01 | 1.200e+01 | 1.300e+01 | 1.700e+01 | ▁▇▃▂▁ |
hsemhmorn | 0 | 1 | 1.800e-01 | 3.800e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▂ |
hsemhmyest | 0 | 1 | 1.800e-01 | 3.800e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▂ |
hsemhmtom | 0 | 1 | 1.800e-01 | 3.800e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▂ |
hsemheve | 0 | 1 | 6.000e-02 | 2.500e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
hsemheyest | 0 | 1 | 6.000e-02 | 2.500e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
hsemhetom | 0 | 1 | 6.000e-02 | 2.500e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
hshoursemh | 0 | 1 | 1.232e+01 | 1.490e+00 | 8.000e+00 | 1.100e+01 | 1.200e+01 | 1.300e+01 | 1.800e+01 | ▁▇▇▂▁ |
hshoursemhyest | 0 | 1 | 1.232e+01 | 1.490e+00 | 8.000e+00 | 1.100e+01 | 1.200e+01 | 1.300e+01 | 1.800e+01 | ▁▇▇▂▁ |
hshoursemhtom | 0 | 1 | 1.232e+01 | 1.490e+00 | 8.000e+00 | 1.100e+01 | 1.200e+01 | 1.300e+01 | 1.800e+01 | ▁▇▇▂▁ |
akemhmorn | 0 | 1 | 3.000e-01 | 4.600e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | 1.000e+00 | ▇▁▁▁▃ |
akemhmyest | 0 | 1 | 3.000e-01 | 4.600e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | 1.000e+00 | ▇▁▁▁▃ |
akemhmtom | 0 | 1 | 3.000e-01 | 4.600e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | 1.000e+00 | ▇▁▁▁▃ |
akemheve | 0 | 1 | 4.000e-02 | 2.000e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
akemheyest | 0 | 1 | 4.000e-02 | 2.000e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
akemhetom | 0 | 1 | 4.000e-02 | 2.000e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 1.000e+00 | ▇▁▁▁▁ |
akhoursemh | 0 | 1 | 1.177e+01 | 1.800e+00 | 7.000e+00 | 1.100e+01 | 1.200e+01 | 1.300e+01 | 1.700e+01 | ▂▇▇▅▁ |
akhoursemhyest | 0 | 1 | 1.177e+01 | 1.800e+00 | 7.000e+00 | 1.100e+01 | 1.200e+01 | 1.300e+01 | 1.700e+01 | ▂▇▇▅▁ |
akhoursemhtom | 0 | 1 | 1.176e+01 | 1.800e+00 | 7.000e+00 | 1.100e+01 | 1.200e+01 | 1.300e+01 | 1.700e+01 | ▂▇▇▅▁ |
mkhours | 0 | 1 | 1.326e+01 | 2.010e+00 | 7.000e+00 | 1.200e+01 | 1.300e+01 | 1.500e+01 | 2.398e+01 | ▂▇▇▁▁ |
mkhoursyest | 0 | 1 | 1.326e+01 | 2.010e+00 | 7.000e+00 | 1.200e+01 | 1.300e+01 | 1.500e+01 | 2.398e+01 | ▂▇▇▁▁ |
mkhourstom | 0 | 1 | 1.326e+01 | 2.000e+00 | 7.000e+00 | 1.200e+01 | 1.300e+01 | 1.500e+01 | 2.398e+01 | ▂▇▇▁▁ |
ephours | 0 | 1 | 1.202e+01 | 6.400e-01 | 8.000e+00 | 1.200e+01 | 1.200e+01 | 1.200e+01 | 1.700e+01 | ▁▁▇▁▁ |
ephoursyest | 0 | 1 | 1.203e+01 | 6.400e-01 | 8.000e+00 | 1.200e+01 | 1.200e+01 | 1.200e+01 | 1.700e+01 | ▁▁▇▁▁ |
ephourstom | 0 | 1 | 1.202e+01 | 6.400e-01 | 8.000e+00 | 1.200e+01 | 1.200e+01 | 1.200e+01 | 1.700e+01 | ▁▁▇▁▁ |
hshours | 0 | 1 | 1.192e+01 | 1.190e+00 | 5.000e+00 | 1.100e+01 | 1.200e+01 | 1.250e+01 | 1.800e+01 | ▁▁▇▂▁ |
hshoursyest | 0 | 1 | 1.193e+01 | 1.200e+00 | 5.000e+00 | 1.100e+01 | 1.200e+01 | 1.250e+01 | 1.800e+01 | ▁▁▇▂▁ |
hshourstom | 0 | 1 | 1.192e+01 | 1.190e+00 | 5.000e+00 | 1.100e+01 | 1.200e+01 | 1.250e+01 | 1.800e+01 | ▁▁▇▂▁ |
akhours | 0 | 1 | 1.146e+01 | 1.680e+00 | 6.000e+00 | 1.050e+01 | 1.100e+01 | 1.250e+01 | 1.700e+01 | ▁▃▇▃▁ |
akhoursyest | 0 | 1 | 1.147e+01 | 1.680e+00 | 6.000e+00 | 1.050e+01 | 1.100e+01 | 1.250e+01 | 1.700e+01 | ▁▃▇▃▁ |
akhourstom | 0 | 1 | 1.146e+01 | 1.680e+00 | 6.000e+00 | 1.050e+01 | 1.100e+01 | 1.250e+01 | 1.700e+01 | ▁▃▇▃▁ |
weather_wdwhigh | 0 | 1 | 8.235e+01 | 7.860e+00 | 7.020e+01 | 7.460e+01 | 8.280e+01 | 9.060e+01 | 9.230e+01 | ▅▃▂▂▇ |
weather_wdwlow | 0 | 1 | 6.410e+01 | 9.260e+00 | 4.920e+01 | 5.580e+01 | 6.360e+01 | 7.400e+01 | 7.610e+01 | ▅▅▃▂▇ |
weather_wdwprecip | 0 | 1 | 1.500e-01 | 8.000e-02 | 3.000e-02 | 8.000e-02 | 1.200e-01 | 2.300e-01 | 3.500e-01 | ▇▆▃▅▁ |
capacitylost_mk | 0 | 1 | 4.221e+05 | 3.646e+04 | 3.529e+05 | 3.858e+05 | 4.339e+05 | 4.561e+05 | 4.736e+05 | ▅▂▁▇▆ |
capacitylost_ep | 0 | 1 | 3.669e+05 | 2.402e+04 | 3.252e+05 | 3.384e+05 | 3.808e+05 | 3.820e+05 | 3.947e+05 | ▅▁▁▂▇ |
capacitylost_hs | 0 | 1 | 2.875e+05 | 3.320e+04 | 2.038e+05 | 2.796e+05 | 3.019e+05 | 3.119e+05 | 3.219e+05 | ▂▁▁▁▇ |
capacitylost_ak | 0 | 1 | 2.282e+05 | 1.497e+04 | 2.108e+05 | 2.208e+05 | 2.232e+05 | 2.328e+05 | 2.739e+05 | ▇▅▁▁▂ |
capacitylostwgt_mk | 0 | 1 | 4.137e+07 | 3.621e+06 | 3.466e+07 | 3.764e+07 | 4.259e+07 | 4.458e+07 | 4.655e+07 | ▅▂▂▇▆ |
capacitylostwgt_ep | 0 | 1 | 3.534e+07 | 2.201e+06 | 3.169e+07 | 3.269e+07 | 3.667e+07 | 3.667e+07 | 3.768e+07 | ▅▁▁▁▇ |
capacitylostwgt_hs | 0 | 1 | 2.753e+07 | 3.049e+06 | 1.981e+07 | 2.677e+07 | 2.877e+07 | 2.976e+07 | 3.075e+07 | ▂▁▁▁▇ |
capacitylostwgt_ak | 0 | 1 | 2.239e+07 | 1.398e+06 | 2.079e+07 | 2.178e+07 | 2.180e+07 | 2.278e+07 | 2.674e+07 | ▇▅▁▁▂ |
mkprdday | 0 | 1 | 1.070e+00 | 5.900e-01 | 0.000e+00 | 1.000e+00 | 1.000e+00 | 1.000e+00 | 3.000e+00 | ▁▇▁▁▁ |
mkprdngt | 0 | 1 | 6.400e-01 | 9.000e-01 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 2.000e+00 | 3.000e+00 | ▇▁▁▃▁ |
mkfirewk | 0 | 1 | 9.400e-01 | 2.600e-01 | 0.000e+00 | 1.000e+00 | 1.000e+00 | 1.000e+00 | 2.000e+00 | ▁▁▇▁▁ |
epfirewk | 0 | 1 | 9.400e-01 | 2.500e-01 | 0.000e+00 | 1.000e+00 | 1.000e+00 | 1.000e+00 | 3.000e+00 | ▁▇▁▁▁ |
hsprdday | 0 | 1 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | ▁▁▇▁▁ |
hsfirewk | 0 | 1 | 7.800e-01 | 4.500e-01 | 0.000e+00 | 1.000e+00 | 1.000e+00 | 1.000e+00 | 2.000e+00 | ▂▁▇▁▁ |
hsshwngt | 0 | 1 | 1.280e+00 | 6.200e-01 | 0.000e+00 | 1.000e+00 | 1.000e+00 | 2.000e+00 | 3.000e+00 | ▁▇▁▅▁ |
hsfirewks | 0 | 1 | 1.000e+00 | 0.000e+00 | 1.000e+00 | 1.000e+00 | 1.000e+00 | 1.000e+00 | 1.000e+00 | ▁▁▇▁▁ |
akprdday | 0 | 1 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | 0.000e+00 | ▁▁▇▁▁ |
akshwngt | 0 | 1 | 1.040e+00 | 9.700e-01 | 0.000e+00 | 0.000e+00 | 1.000e+00 | 2.000e+00 | 3.000e+00 | ▇▂▁▇▁ |
This dataset contains many more variables than the one we worked with previously. For this analysis, we are going to select date
(the observation date), wdw_ticket_season
(the ticket season for the observation), wdwmaxtemp
(the maximum temperature), mkclose
(the time Magic Kingdom closed), mkemhmorn
(whether Magic Kingdom had an “Extra Magic Hour” in the morning).
parks_metadata_clean <- parks_metadata_raw |>
## based on our analysis plan, we will select the following variables
select(date, wdw_ticket_season, wdwmaxtemp, mkclose, mkemhmorn) |>
## based on eligibility criteria, limit to 2018
filter(year(date) == 2018) |>
## rename variables
rename(
park_date = date,
park_ticket_season = wdw_ticket_season,
park_temperature_high = wdwmaxtemp,
park_close = mkclose,
park_extra_magic_morning = mkemhmorn
)
7.3 Working with multiple data sources
Frequently we find ourselves merging data from multiple sources when attempting to answer causal questions in order to ensure that all of the necessary factors are accounted for. The way we can combine datasets is via joins – joining two or more datasets based on a set or sets of common variables. We can think of three main types of joins: left, right, and inner. A left join combines data from two datasets based on a common variable and includes all records from the left dataset along with matching records from the right dataset (in dplyr, left_join()
), while a right join includes all records from the right dataset and their corresponding matches from the left dataset (in dplyr right_join()
); an inner join, on the other hand, includes only the records with matching values in both datasets, excluding non-matching records (in dplyr inner_join()
. For this analysis, we need to use a left join to pull in the cleaned parks metadata.
seven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>
left_join(parks_metadata_clean, by = "park_date")
7.4 Recognizing missing data
It is important to recognize whether we have any missing data in our variables. The visdat package is great for getting a quick sense of whether we have any missing data.
It looks like we only have a few observations (2%) missing our outcome of interest. This is not too bad. For this first analysis we will ignore the missing values. We can explicitly drop them using the drop_na()
function from dplyr.
seven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>
drop_na()
7.5 Exploring and visualizing data and assumptions
The positivity assumption requires that within each level and combination of the study variables used to achieve exchangeability, there are exposed and unexposed subjects (Section 3.3). We can explore this by visualizing the distribution of each of our proposed confounders stratified by the exposure.
7.5.1 Single variable checks for positivity violations
Figure 7.3 shows the distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning. There is not clear evidence of a lack of positivity here as both exposure levels span the majority of the covariate space.
ggplot(
seven_dwarfs_train_2018,
aes(
x = factor(park_close),
group = factor(park_extra_magic_morning),
fill = factor(park_extra_magic_morning)
)
) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
labs(
fill = "Extra Magic Morning",
x = "Time of Park Close"
)
To examine the distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning we can use a mirrored histogram. We’ll use the {halfmoon} package’s geom_mirror_histogram()
to create one. Examining Figure 7.4, it does look like there are very few days in the exposed group with maximum temperatures less than 60 degrees, while not necessarily a positivity violation it is worth keeping an eye on, particularly because the dataset is not very large, so this could make it difficult to estimate an average exposure effect across this whole space. If we found this to be particularly difficult, we could posit changing our causal question to instead restrict the analysis to warmer days. This of course would also restrict which days we could draw conclusions about for the future.
library(halfmoon)
ggplot(
seven_dwarfs_train_2018,
aes(
x = park_temperature_high,
group = factor(park_extra_magic_morning),
fill = factor(park_extra_magic_morning)
)
) +
geom_mirror_histogram(bins = 20) +
labs(
fill = "Extra Magic Morning",
x = "Historic maximum temperature (degrees F)"
)
Finally, let’s look at the distribution of ticket season by whether there were extra magic hours in the morning. Examining Figure 7.5, we do not see any positivity violations.
7.5.2 Multiple variable checks for positivity violations
We have confirmed that for each of the three confounders, we do not see strong evidence of positivity violations. Because we have so few variables here, we can examine this a bit more closely. Let’s start by discretizing the park_temperature_high
variable a bit (we will cut it into tertiles).
seven_dwarfs_train_2018 |>
## cut park_temperature_high into tertiles
mutate(park_temperature_high_bin = cut(park_temperature_high, breaks = 3)) |>
## bin park close time
mutate(park_close_bin = case_when(
hour(park_close) < 19 & hour(park_close) > 12 ~ "(1) early",
hour(park_close) >= 19 & hour(park_close) < 24 ~ "(2) standard",
hour(park_close) >= 24 | hour(park_close) < 12 ~ "(3) late"
)) |>
group_by(park_close_bin, park_temperature_high_bin, park_ticket_season) |>
## calculate the proportion exposed in each bin
summarise(prop_exposed = mean(park_extra_magic_morning), .groups = "drop") |>
ggplot(aes(x = park_close_bin, y = park_temperature_high_bin, fill = prop_exposed)) +
geom_tile() +
scale_fill_gradient2(midpoint = 0.5) +
facet_wrap(~park_ticket_season) +
labs(
y = "Historic Maximum Temperature (F)",
x = "Magic Kingdom Park Close Time",
fill = "Proportion of Days Exposed"
)
Interesting! Figure 7.6 shows an interesting potential violation. It looks like 100% of days with lower temperatures (historic highs between 51 and 65 degrees) that are in the peak ticket season have extra magic hours in the morning. This actually makes sense if we think a bit about this data set. The only days with cold temperatures in Florida that would also be considered a “peak” time to visit Walt Disney World would be over Christmas / New Years. During this time there historically were always extra magic hours.
We are going to proceed with the analysis, but we will keep these observations in mind.
7.6 Presenting descriptive statistics
Let’s examine a table of the variables of interest in this data frame. To do so, we are going to use the tbl_summary()
function from the gtsummary package. (We’ll also use the labelled package to clean up the variable names for the table.)
library(gtsummary)
library(labelled)
seven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>
mutate(park_close = as.character(park_close)) |>
set_variable_labels(
park_ticket_season = "Ticket Season",
park_close = "Close Time",
park_temperature_high = "Historic High Temperature"
)
tbl_summary(
seven_dwarfs_train_2018,
by = park_extra_magic_morning,
include = c(park_ticket_season, park_close, park_temperature_high)
) |>
# add an overall column to the table
add_overall(last = TRUE)
Characteristic |
0 N = 2941 |
1 N = 601 |
Overall N = 3541 |
---|---|---|---|
Ticket Season | |||
peak | 60 (20%) | 18 (30%) | 78 (22%) |
regular | 158 (54%) | 35 (58%) | 193 (55%) |
value | 76 (26%) | 7 (12%) | 83 (23%) |
Close Time | |||
16:30:00 | 1 (0.3%) | 0 (0%) | 1 (0.3%) |
18:00:00 | 37 (13%) | 18 (30%) | 55 (16%) |
20:00:00 | 18 (6.1%) | 2 (3.3%) | 20 (5.6%) |
21:00:00 | 28 (9.5%) | 0 (0%) | 28 (7.9%) |
22:00:00 | 91 (31%) | 11 (18%) | 102 (29%) |
23:00:00 | 78 (27%) | 11 (18%) | 89 (25%) |
24:00:00 | 40 (14%) | 17 (28%) | 57 (16%) |
25:00:00 | 1 (0.3%) | 1 (1.7%) | 2 (0.6%) |
Historic High Temperature | 84 (78, 89) | 83 (76, 87) | 84 (78, 89) |
1 n (%); Median (Q1, Q3) |