Illustration of Bayesian segmented regression analysis of interrupted time series data with a testing hypothesis about the impact of the COVID-19 pandemic on increase in people’s search interest in work-life balance and well-being.
The turn of the year, which is full of all sorts of resolutions to change for the better in our private lives and in our organizations, is a good time to remind ourselves that analytic tools can be very helpful in our efforts to make these resolutions come true. One way they can help us is by verifying that we have really achieved our stated goals and that we are not just fooling ourselves into believing so. We need to keep in mind Richard Feynman’s famous principle of critical thinking…
One of the tools that can help us with that is segmented regression analysis of interrupted time series data (thanks to Masatake Hirono for pointing me to its existence). It allows us to model changes in various processes and outcomes that follow interventions, while controlling for other types of changes (e.g. trends and seasonality) that may have occurred regardless of the interventions. It is thus very useful for data analysis conducted within studies with a quasi experimental study design that are often in the organizational context the best alternative to the “gold standard” of randomized controlled trials (RCTs) that are not always realizable or politically acceptable.
For illustration, let’s use this tool for testing hypothesis about people’s increased interest in topics related to work-life balance and well-being due to the COVID-19 pandemic and subsequent changes in the way people work. As a proxy measure of this interest we will use worldwide search interest data over the last 10 years from Google Trends using search terms work-life balance and well-being (see Fig. 1 and 2 below).
Fig. 1: Interest in “work-life balance” topic over the last 10 years measured as a search interest by Google Trends. The numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means that there was not enough data for this term.
Fig. 2: Interest in “well-being” topic over the last 10 years measured as a search interest by Google Trends. The numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means that there was not enough data for this term.
Based solely on the visual inspection of the graphs, it is pretty difficult to tell whether there was some effect of the COVID-19 pandemic or not, especially in the case of work-life balance (for the purpose of this analysis, the beginning of the pandemic is assumed to have started in March 2020). For sure it’s not a job for “inter-ocular trauma test” when the existence of the effect hits you directly between the eyes. We need to rely here on inferential statistics and its ability to help us with distinguishing signal from noise.
Before conducting the analysis itself, we need to wrangle the data from Google Trends a little bit using the recipe presented in the Wagner, Zhang, and Ross-Degnan’s paper. Specifically, we need the following five variables (or six, given that we have two dependent variables):
# uploading library for data manipulation
library(tidyverse)
# uploading data
dfWorkLifeBalance <- readr::read_csv("./workLifeBalanceGoogleTrendData.csv")
dfWellBeing <- readr::read_csv("./wellBeingGoogleTrendData.csv")
dfAll <- dfWorkLifeBalance %>%
# joining both datasets
dplyr::left_join(
dfWellBeing, by = "Month"
) %>%
# changing the format and name of Month variable
dplyr::mutate(
Month = stringr::str_glue("{Month}-01"),
Month = lubridate::ymd(Month)
) %>%
dplyr::rename(
date = Month
) %>%
# creating new variable month
dplyr::mutate(
month = lubridate::month(date,label = TRUE, abbr = TRUE),
month = factor(month,
levels = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),
labels = c("Jan","Feb","Mar","Apr","May", "Jun","Jul","Aug","Sep","Oct","Nov","Dec"),
ordered = FALSE)
) %>%
# arranging data in ascending order by date
dplyr::arrange(
date
) %>%
# creating new variables
dplyr::mutate(
elapsedTime = row_number(),
pandemic = case_when(
date >= "2020-03-01" ~ 1,
TRUE ~ 0
),
elapsedTimeAfterPandemic = cumsum(pandemic)
) %>%
dplyr::mutate(
pandemic = as.factor(case_when(
pandemic == 1 ~ "After the pandemic outbreak",
TRUE ~ "Before the pandemic outbreak"
))
) %>%
# changing order of variables in df
dplyr::select(
date, workLifeBalance, wellBeing, elapsedTime, month, pandemic, elapsedTimeAfterPandemic
)
Here is a table with the resulting data we will use for testing our hypothesis.
# uploading library for making user-friendly data table
library(DT)
DT::datatable(
dfAll,
class = 'cell-border stripe',
filter = 'top',
extensions = 'Buttons',
fillContainer = FALSE,
rownames= FALSE,
options = list(
pageLength = 5,
autoWidth = TRUE,
dom = 'Bfrtip',
buttons = c('copy'),
scrollX = TRUE,
selection="multiple"
)
)
Table 1: Final dataset used for testing hypothesis about impact of the COVID-19 pandemic on people’s interest in work-life balance and well-being.
We will model our data using common segmented regression models that have following general structure:
\[Y_{t} = β_{0} + β_{1}*time_{t} + β_{2}*intervention_{t} + β_{3}*time after intervention_{t} + e_{t}\]
The β0 coefficient estimates the baseline level of the outcome variable at time zero; β1 coefficient estimates the change in the mean of the outcome variable that occurs with each unit of time before the intervention (i.e. the baseline trend); β2 coefficient estimates the level change in the mean of the outcome variable immediately after the intervention (i.e. from the end of the preceding segment); and β3 estimates the change in the trend in the mean of the outcome variable per unit of time after the intervention, compared with the trend before the intervention (thus, the sum of β1 and β3 equals to the post-intervention slope). For a better understanding of the model, take a look at the illustrative chart below.
Since we are dealing with correlated and truncated data, we should also include two additional terms in our model, an autocorrelation term and a truncation term, to handle these specific properties of our data.
Now let’s fit the models to the data and check what they tell us about the effect of pandemic on people’s search interest in work-life balance and well-being. We will use brms r package that enables making inferences about statistical models’ parameters within Bayesian inferential framework. Because of that, we also need to specify some additional parameters (e.g. chains
, iter
or warmup
) of the Markov Chain Monte Carlo (MCMC) algorithm that will generate posterior samples of our models’ parameters.
Bayesian framework also enables us to specify priors for estimated parameter and through them include our domain knowledge in the analysis. The specified priors are important for both parameter estimation and hypothesis testing as they define our starting information state before we take into account our data. Here we will use rather wide, uninformative, and only mildly regularizing priors (it means that the results of the inference will be very close to the results of standard, frequentist parameter estimation/hypothesis testing).
# uploading library for Bayesian statistical inference
library(brms)
# checking available priors for the models
brms::get_prior(
workLifeBalance | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1),
data = dfAll,
family = gaussian())
brms::get_prior(
wellBeing | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1),
data = dfAll,
family = gaussian())
# uploading library for Bayesian statistical inference
library(brms)
# specifying wide, uninformative, and only mildly regularizing priors for predictors in both models
priors <- c(set_prior("normal(0,50)", class = "b", coef = "elapsedTime"),
set_prior("normal(0,50)", class = "b", coef = "elapsedTimeAfterPandemic"),
set_prior("normal(0,50)", class = "b", coef = "pandemicBeforethepandemicoutbreak"),
set_prior("normal(0,50)", class = "b", coef = "monthApr"),
set_prior("normal(0,50)", class = "b", coef = "monthAug"),
set_prior("normal(0,50)", class = "b", coef = "monthDec"),
set_prior("normal(0,50)", class = "b", coef = "monthFeb"),
set_prior("normal(0,50)", class = "b", coef = "monthJul"),
set_prior("normal(0,50)", class = "b", coef = "monthJun"),
set_prior("normal(0,50)", class = "b", coef = "monthMar"),
set_prior("normal(0,50)", class = "b", coef = "monthMay"),
set_prior("normal(0,50)", class = "b", coef = "monthNov"),
set_prior("normal(0,50)", class = "b", coef = "monthOct"),
set_prior("normal(0,50)", class = "b", coef = "monthSep"))
# defining the statistical model for work-life balance
modelWorkLifeBalance <- brms::brm(
workLifeBalance | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1),
data = dfAll,
family = gaussian(),
prior = priors,
chains = 4,
iter = 3000,
warmup = 1000,
seed = 12345,
sample_prior = TRUE
)
# defining the statistical model for well-being
modelWellBeing <- brms::brm(
wellBeing | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1),
data = dfAll,
family = gaussian(),
prior = priors,
chains = 4,
iter = 3000,
warmup = 1000,
seed = 678910,
sample_prior = TRUE
)
Before making any inferences, we should make some sanity checks to be sure that the mechanics of the MCMC algorithm worked well and that we can use generated posterior samples for making inferences about our models’ parameters. There are many ways for doing that, but here we will use only visual check of the MCMC chains. We want plots of these chains look like hairy caterpillar which would indicate convergence of the underlying Markov chain to stationarity and convergence of Monte Carlo estimators to population quantities, respectively. As can be seen in Graph 1 and 2 below, in case of both models we can observe wanted characteristics of the MCMC chains described above. (For additional MCMC diagnostics procedures, see for example Bayesian Notes from Jeffrey B. Arnold.)
# uploading library for plotting Bayesian models
library(bayesplot)
# plotting the MCMC chains for the modelWorkLifeBalance
bayesplot::mcmc_trace(
modelWorkLifeBalance,
facet_args = list(nrow = 6)
) +
ggplot2::labs(
title = "Plots of the MCMC chains used for estimation of the modelWorkLifeBalance's parameters"
)
Graph 1: Trace plots of Markov chains for individual parameters of the modelWorkLifeBalance.
# plotting the MCMC chains for the modelWellBeing
bayesplot::mcmc_trace(
modelWellBeing,
facet_args = list(nrow = 6)
) +
ggplot2::labs(
title = "Plots of the MCMC chains used for estimation of the modelWellBeing's parameters"
)
Graph 2: Trace plots of Markov chains for individual parameters of the modelWellBeing.
It is also important to check how well the models fit the data. We can use for this purpose posterior predictive checks that use specified number of sampled posterior values of models’ parameters and show how well the fitted models predict observed data. We can see in Graphs 3 and 4 that both models fit the observed data reasonably well.
Graph 3: Posterior predictive checks comparing simulated/replicated data under the fitted modelWorkLifeBalance with the observed data.
Graph 4: Posterior predictive checks comparing simulated/replicated data under the fitted modelWellBeing with the observed data.
Now, after having sufficient confidence that - using terminology from the Richard McElreath’s book Statistical Rethinking - our “small worlds” can pretty accurately mimic the data coming from our real,“big world”, we can use our models’ parameters to learn something about our research questions. Our primary interest is in the coefficient value of the pandemicBeforethepandemicoutbreak
and elapsedTimeAfterPandemic
terms in our models. It expresses how much and in what direction people’s search interest in work-life balance and well-being changed immediately after the outbreak of pandemic, and how slope of the trend changed after the pandemic, respectively.
In Graph 5 and 6 we can see posterior distribution of the pandemicBeforethepandemicoutbreak
parameter in our two models. In both cases the posterior distribution of the pandemic term is (predominantly or completely) on the left side of the zero value, which supports the claim about existence of the effect of pandemic on people’s increased search interest in work-life balance and well-being immediately after the outbreak of pandemic. As is apparent from the graphs, for well-being (Graph 6) this evidence is much stronger than for work-life balance (Graph 5), which corresponds to impression we might have when looking at the original Google Trends charts shown in Fig. 1 and 2.
# uploading library for
library(tidybayes)
# visualizing posterior distribution of the pandemicBeforethepandemicoutbreak parameter in the modelWorkLifeBalance
modelWorkLifeBalance %>%
tidybayes::gather_draws(
b_pandemicBeforethepandemicoutbreak
) %>%
dplyr::mutate(
.variable = factor(
.variable,
levels = c("b_pandemicBeforethepandemicoutbreak"),
ordered = TRUE
)
) %>%
dplyr::rename(value = .value) %>%
ggplot2::ggplot(
aes(x = value)
) +
ggplot2::geom_density(
fill = "lightblue"
) +
ggplot2::labs(
title = "Posterior distribution of the pandemicBeforethepandemicoutbreak parameter\nin the modelWorkLifeBalance"
)
Graph 5: Visualization of the posterior distribution of the pandemicBeforethepandemicoutbreak
parameter in the modelWorkLifeBalance.
# visualizing posterior distribution of the pandemicBeforethepandemicoutbreak parameter in the modelWellBeing
modelWellBeing %>%
tidybayes::gather_draws(
b_pandemicBeforethepandemicoutbreak
) %>%
dplyr::mutate(
.variable = factor(
.variable,
levels = c("b_pandemicBeforethepandemicoutbreak"),
ordered = TRUE
)
) %>%
dplyr::rename(value = .value) %>%
ggplot2::ggplot(
aes(x = value)
) +
ggplot2::geom_density(
fill = "lightblue"
) +
ggplot2::labs(
title = "Posterior distribution of the pandemicBeforethepandemicoutbreak parameter\nin the modelWellBeing"
)
Graph 6: Visualization of the posterior distribution of the pandemicBeforethepandemicoutbreak
parameter in the modelWellBeing.
To generate more summary statistics about posterior distributions (and also some diagnostic information like Rhat
or ESS
), we can use summary()
function.
# generating a summary of the results for modelWorkLifeBalance
summary(modelWorkLifeBalance)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: workLifeBalance | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1)
Data: dfAll (Number of observations: 132)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Correlation Structures:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
ar[1] 0.21 0.09 0.03 0.40 1.00 7232 5932
Population-Level Effects:
Estimate Est.Error l-95% CI
Intercept 69.59 11.51 47.51
elapsedTime -0.05 0.04 -0.13
pandemicBeforethepandemicoutbreak -13.90 10.15 -34.52
elapsedTimeAfterPandemic 0.40 1.59 -2.77
monthFeb 3.50 4.47 -5.29
monthMar 4.49 4.99 -5.30
monthApr 6.45 5.10 -3.44
monthMay 8.81 5.10 -1.18
monthJun -2.91 5.05 -12.82
monthJul -7.40 5.02 -17.18
monthAug -2.51 5.01 -12.52
monthSep -2.20 5.01 -11.87
monthOct 5.35 5.01 -4.68
monthNov 12.31 4.91 2.56
monthDec -0.64 4.51 -9.37
u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 92.59 1.00 4930 4892
elapsedTime 0.03 1.00 8111 5233
pandemicBeforethepandemicoutbreak 5.76 1.00 5569 5338
elapsedTimeAfterPandemic 3.63 1.00 5567 5543
monthFeb 12.29 1.00 3805 4101
monthMar 14.36 1.00 3316 4648
monthApr 16.44 1.00 3225 4565
monthMay 18.82 1.00 3148 4520
monthJun 7.14 1.00 3008 4287
monthJul 2.42 1.00 3148 4652
monthAug 7.42 1.00 3206 4379
monthSep 7.58 1.00 3234 4922
monthOct 14.99 1.00 3173 4504
monthNov 22.21 1.00 2945 4709
monthDec 8.24 1.00 3662 5175
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 11.54 0.77 10.17 13.16 1.00 6603 5274
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# generating a summary of the results for modelWellBeing
summary(modelWellBeing)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: wellBeing | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1)
Data: dfAll (Number of observations: 132)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Correlation Structures:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
ar[1] 0.24 0.10 0.05 0.44 1.00 6461 5967
Population-Level Effects:
Estimate Est.Error l-95% CI
Intercept 56.14 4.63 46.81
elapsedTime 0.11 0.02 0.08
pandemicBeforethepandemicoutbreak -21.29 4.11 -29.26
elapsedTimeAfterPandemic 0.50 0.64 -0.78
monthFeb 8.61 1.80 5.09
monthMar 10.63 2.02 6.70
monthApr 9.57 2.02 5.66
monthMay 3.52 2.07 -0.52
monthJun -4.35 2.05 -8.43
monthJul -8.05 2.06 -12.10
monthAug -5.76 2.05 -9.68
monthSep 4.56 2.05 0.57
monthOct 8.13 2.01 4.24
monthNov 6.74 2.01 2.83
monthDec -4.95 1.80 -8.39
u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 65.13 1.00 4316 5151
elapsedTime 0.14 1.00 8987 5597
pandemicBeforethepandemicoutbreak -13.07 1.00 5825 5557
elapsedTimeAfterPandemic 1.78 1.00 6106 5784
monthFeb 12.19 1.00 3387 4626
monthMar 14.59 1.00 2868 4601
monthApr 13.59 1.00 2724 4233
monthMay 7.60 1.00 2938 4027
monthJun -0.34 1.00 2815 4633
monthJul -4.01 1.00 2749 4036
monthAug -1.77 1.00 2863 4364
monthSep 8.55 1.00 2969 3748
monthOct 12.22 1.00 2907 3914
monthNov 10.73 1.00 2962 4142
monthDec -1.35 1.00 3525 4630
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 4.63 0.31 4.07 5.28 1.00 7355 6173
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Given that for work-life balance model the posterior distribution of pandemic term crosses the zero value, it would be useful to know how strong is the evidence in the favor of hypothesis that pandemic term is lower than zero. For that purpose we can extract posterior samples and use them for calculation of the proportion of values that are larger/smaller than zero. The resulting proportions show that the vast majority (around 92%) of posterior distribution lies below zero.
# extracting posterior samples
samples <- brms::posterior_samples(modelWorkLifeBalance, seed = 12345)
# probability of b_pandemicBeforethepandemicoutbreak coefficient being lower than 0
sum(samples$b_pandemicBeforethepandemicoutbreak < 0) / nrow(samples)
[1] 0.9145
Now let’s check the parameter elapsedTimeAfterPandemic
. Its posterior distribution in both models “safely” includes zero value, which indicates that there is not huge support for positive change in trend after the outbreak of pandemic.
# visualizing posterior distribution of the elapsedTimeAfterPandemic parameter in the modelWorkLifeBalance
modelWorkLifeBalance %>%
tidybayes::gather_draws(
b_elapsedTimeAfterPandemic
) %>%
dplyr::mutate(
.variable = factor(
.variable,
levels = c("b_elapsedTimeAfterPandemic"),
ordered = TRUE
)
) %>%
dplyr::rename(value = .value) %>%
ggplot2::ggplot(
aes(x = value)
) +
ggplot2::geom_density(
fill = "lightblue"
) +
ggplot2::labs(
title = "Posterior distribution of the elapsedTimeAfterPandemic parameter\nin the modelWorkLifeBalance"
)
Graph 7: Visualization of the posterior distribution of the elapsedTimeAfterPandemic
parameter in the modelWorkLifeBalance.
# visualizing posterior distribution of the elapsedTimeAfterPandemic parameter in the modelWellBeing
modelWellBeing %>%
tidybayes::gather_draws(
b_elapsedTimeAfterPandemic
) %>%
dplyr::mutate(
.variable = factor(
.variable,
levels = c("b_elapsedTimeAfterPandemic"),
ordered = TRUE
)
) %>%
dplyr::rename(value = .value) %>%
ggplot2::ggplot(
aes(x = value)
) +
ggplot2::geom_density(
fill = "lightblue"
) +
ggplot2::labs(
title = "Posterior distribution of the elapsedTimeAfterPandemic parameter\nin the modelWellBeing"
)
Graph 8: Visualization of the posterior distribution of the elapsedTimeAfterPandemic
parameter in the modelWellBeing.
In conclusion, we can say that there is some evidence that the COVID-19 pandemic has prompted people to be more interested in topics related to work-life balance and well-being. I wish us all to be able to transform our increased interest in these topics into truly increased quality of our personal and professional lives. It would be a shame not to use that extra incentive many of us have now for making significant change in our lives.
For attribution, please cite this work as
Stehlík (2020, Dec. 31). Ludek's Blog About People Analytics: Modeling impact of the COVID-19 pandemic on people’s interest in work-life balance and well-being. Retrieved from https://blog-about-people-analytics.netlify.app/posts/2020-12-31-segmentedregression/
BibTeX citation
@misc{stehlík2020modeling, author = {Stehlík, Luděk}, title = {Ludek's Blog About People Analytics: Modeling impact of the COVID-19 pandemic on people’s interest in work-life balance and well-being}, url = {https://blog-about-people-analytics.netlify.app/posts/2020-12-31-segmentedregression/}, year = {2020} }