Ludek's Blog About People Analytics: Modeling impact of the COVID-19 pandemic on people’s interest in work-life balance and well-being

Luděk Stehlík

The turn of the year, which is full of all sorts of resolutions to change for the better in our private lives and in our organizations, is a good time to remind ourselves that analytic tools can be very helpful in our efforts to make these resolutions come true. One way they can help us is by verifying that we have really achieved our stated goals and that we are not just fooling ourselves into believing so. We need to keep in mind Richard Feynman’s famous principle of critical thinking…

One of the tools that can help us with that is segmented regression analysis of interrupted time series data (thanks to Masatake Hirono for pointing me to its existence). It allows us to model changes in various processes and outcomes that follow interventions, while controlling for other types of changes (e.g. trends and seasonality) that may have occurred regardless of the interventions. It is thus very useful for data analysis conducted within studies with a quasi experimental study design that are often in the organizational context the best alternative to the “gold standard” of randomized controlled trials (RCTs) that are not always realizable or politically acceptable.

Search interest in work-life balance and well-being

For illustration, let’s use this tool for testing hypothesis about people’s increased interest in topics related to work-life balance and well-being due to the COVID-19 pandemic and subsequent changes in the way people work. As a proxy measure of this interest we will use worldwide search interest data over the last 10 years from Google Trends using search terms work-life balance and well-being (see Fig. 1 and 2 below).

Fig. 1: Interest in “work-life balance” topic over the last 10 years measured as a search interest by Google Trends. The numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means that there was not enough data for this term.

Fig. 2: Interest in “well-being” topic over the last 10 years measured as a search interest by Google Trends. The numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means that there was not enough data for this term.

Based solely on the visual inspection of the graphs, it is pretty difficult to tell whether there was some effect of the COVID-19 pandemic or not, especially in the case of work-life balance (for the purpose of this analysis, the beginning of the pandemic is assumed to have started in March 2020). For sure it’s not a job for “inter-ocular trauma test” when the existence of the effect hits you directly between the eyes. We need to rely here on inferential statistics and its ability to help us with distinguishing signal from noise.

Before conducting the analysis itself, we need to wrangle the data from Google Trends a little bit using the recipe presented in the Wagner, Zhang, and Ross-Degnan’s paper. Specifically, we need the following five variables (or six, given that we have two dependent variables):

search interest – numerical variable representing search interest relative to the highest point on the chart for the given region and time; this variable is truncated within the interval between values of 0 and 100; a value of 100 is the peak popularity for the term; a value of 50 means that the term is half as popular; a score of 0 means that there was not enough data for this term; this variable serves as a dependent (criterion) variable;
elapsed time – numerical variable representing the number of months that elapsed from the beginning of the time series; this variable enables estimation of the size and direction of the overall trend in the data;
pandemic – dichotomic variable indicating the presence/absence of pandemic; as already mentioned above, for the purpose of this analysis, the beginning of the pandemic is assumed to have started in March 2020; this variable enables estimation of the level change in the interest in work-life balance and well-being immediately after the pandemic outbreak;
elapsed time after pandemic outbreak – numerical variable representing the number of months that elapsed from the beginning of pandemic; this variable enables estimation of the change in the trend in the interest in work-life balance and well-being after the outbreak of pandemic;
month – categorical variable representing specific month within a year; this variable enables controlling for the effect of seasonality.

Show code

# uploading library for data manipulation
library(tidyverse)

# uploading data
dfWorkLifeBalance <- readr::read_csv("./workLifeBalanceGoogleTrendData.csv")
dfWellBeing <- readr::read_csv("./wellBeingGoogleTrendData.csv")

dfAll <- dfWorkLifeBalance %>%
  # joining both datasets
  dplyr::left_join(
    dfWellBeing, by = "Month"
    ) %>%
  # changing the format and name of Month variable
  dplyr::mutate(
    Month = stringr::str_glue("{Month}-01"),
    Month = lubridate::ymd(Month)
    ) %>%
  dplyr::rename(
    date = Month
    ) %>%
  # creating new variable month
  dplyr::mutate(
    month = lubridate::month(date,label = TRUE, abbr = TRUE),
    month = factor(month, 
                   levels = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"), 
                   labels = c("Jan","Feb","Mar","Apr","May", "Jun","Jul","Aug","Sep","Oct","Nov","Dec"), 
                   ordered = FALSE)
    ) %>%
  # arranging data in ascending order by date
  dplyr::arrange(
    date
    ) %>%
  # creating new variables
  dplyr::mutate(
    elapsedTime = row_number(),
    pandemic = case_when(
      date >= "2020-03-01" ~ 1,
      TRUE ~ 0
      ),
    elapsedTimeAfterPandemic = cumsum(pandemic)
  ) %>%
  dplyr::mutate(
    pandemic = as.factor(case_when(
        pandemic == 1 ~ "After the pandemic outbreak",
        TRUE ~ "Before the pandemic outbreak"
        ))
  ) %>%
  # changing order of variables in df
  dplyr::select(
    date, workLifeBalance, wellBeing, elapsedTime, month, pandemic, elapsedTimeAfterPandemic
    )

Here is a table with the resulting data we will use for testing our hypothesis.

Show code

# uploading library for making user-friendly data table
library(DT)

DT::datatable(
  dfAll,
  class = 'cell-border stripe', 
  filter = 'top',
  extensions = 'Buttons',
  fillContainer = FALSE,
  rownames= FALSE,
  options = list(
    pageLength = 5, 
    autoWidth = TRUE,
    dom = 'Bfrtip',
    buttons = c('copy'), 
    scrollX = TRUE, 
    selection="multiple"
    )
  )

Table 1: Final dataset used for testing hypothesis about impact of the COVID-19 pandemic on people’s interest in work-life balance and well-being.

Bayesian segmented regression

We will model our data using common segmented regression models that have following general structure:

\[Y_{t} = β_{0} + β_{1}*time_{t} + β_{2}*intervention_{t} + β_{3}*time after intervention_{t} + e_{t}\]

The β₀ coefficient estimates the baseline level of the outcome variable at time zero; β₁ coefficient estimates the change in the mean of the outcome variable that occurs with each unit of time before the intervention (i.e. the baseline trend); β₂ coefficient estimates the level change in the mean of the outcome variable immediately after the intervention (i.e. from the end of the preceding segment); and β₃ estimates the change in the trend in the mean of the outcome variable per unit of time after the intervention, compared with the trend before the intervention (thus, the sum of β₁ and β₃ equals to the post-intervention slope). For a better understanding of the model, take a look at the illustrative chart below.

Since we are dealing with correlated and truncated data, we should also include two additional terms in our model, an autocorrelation term and a truncation term, to handle these specific properties of our data.

Now let’s fit the models to the data and check what they tell us about the effect of pandemic on people’s search interest in work-life balance and well-being. We will use brms r package that enables making inferences about statistical models’ parameters within Bayesian inferential framework. Because of that, we also need to specify some additional parameters (e.g. chains, iter or warmup) of the Markov Chain Monte Carlo (MCMC) algorithm that will generate posterior samples of our models’ parameters.

Bayesian framework also enables us to specify priors for estimated parameter and through them include our domain knowledge in the analysis. The specified priors are important for both parameter estimation and hypothesis testing as they define our starting information state before we take into account our data. Here we will use rather wide, uninformative, and only mildly regularizing priors (it means that the results of the inference will be very close to the results of standard, frequentist parameter estimation/hypothesis testing).

Show code

# uploading library for Bayesian statistical inference
library(brms)

# checking available priors for the models 
brms::get_prior(
  workLifeBalance | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1),
  data = dfAll,
  family = gaussian())

brms::get_prior(
  wellBeing | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1),
  data = dfAll,
  family = gaussian())

Show code

# uploading library for Bayesian statistical inference
library(brms)

# specifying wide, uninformative, and only mildly regularizing priors for predictors in both models 
priors <- c(set_prior("normal(0,50)", class = "b", coef = "elapsedTime"),
            set_prior("normal(0,50)", class = "b", coef = "elapsedTimeAfterPandemic"),
            set_prior("normal(0,50)", class = "b", coef = "pandemicBeforethepandemicoutbreak"),
            set_prior("normal(0,50)", class = "b", coef = "monthApr"),
            set_prior("normal(0,50)", class = "b", coef = "monthAug"),
            set_prior("normal(0,50)", class = "b", coef = "monthDec"),
            set_prior("normal(0,50)", class = "b", coef = "monthFeb"),
            set_prior("normal(0,50)", class = "b", coef = "monthJul"),
            set_prior("normal(0,50)", class = "b", coef = "monthJun"),
            set_prior("normal(0,50)", class = "b", coef = "monthMar"),
            set_prior("normal(0,50)", class = "b", coef = "monthMay"),
            set_prior("normal(0,50)", class = "b", coef = "monthNov"),
            set_prior("normal(0,50)", class = "b", coef = "monthOct"),
            set_prior("normal(0,50)", class = "b", coef = "monthSep"))

# defining the statistical model for work-life balance
modelWorkLifeBalance <- brms::brm(
  workLifeBalance | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1),
  data = dfAll,
  family = gaussian(),
  prior = priors,
  chains = 4,
  iter = 3000,
  warmup = 1000,
  seed = 12345,
  sample_prior = TRUE
  )

# defining the statistical model for well-being
modelWellBeing <- brms::brm(
  wellBeing | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1),
  data = dfAll,
  family = gaussian(),
  prior = priors,
  chains = 4,
  iter = 3000,
  warmup = 1000,
  seed = 678910,
  sample_prior = TRUE
  )

Some necessary sanity checks

Before making any inferences, we should make some sanity checks to be sure that the mechanics of the MCMC algorithm worked well and that we can use generated posterior samples for making inferences about our models’ parameters. There are many ways for doing that, but here we will use only visual check of the MCMC chains. We want plots of these chains look like hairy caterpillar which would indicate convergence of the underlying Markov chain to stationarity and convergence of Monte Carlo estimators to population quantities, respectively. As can be seen in Graph 1 and 2 below, in case of both models we can observe wanted characteristics of the MCMC chains described above. (For additional MCMC diagnostics procedures, see for example Bayesian Notes from Jeffrey B. Arnold.)

Show code

# uploading library for plotting Bayesian models
library(bayesplot)

# plotting the MCMC chains for the modelWorkLifeBalance 
bayesplot::mcmc_trace(
  modelWorkLifeBalance,
  facet_args = list(nrow = 6)
  ) +
  ggplot2::labs(
    title = "Plots of the MCMC chains used for estimation of the modelWorkLifeBalance's parameters"
    )

Graph 1: Trace plots of Markov chains for individual parameters of the modelWorkLifeBalance.

Show code

# plotting the MCMC chains for the modelWellBeing 
bayesplot::mcmc_trace(
  modelWellBeing,
  facet_args = list(nrow = 6)
  ) +
  ggplot2::labs(
    title = "Plots of the MCMC chains used for estimation of the modelWellBeing's parameters"
    )

Graph 2: Trace plots of Markov chains for individual parameters of the modelWellBeing.

It is also important to check how well the models fit the data. We can use for this purpose posterior predictive checks that use specified number of sampled posterior values of models’ parameters and show how well the fitted models predict observed data. We can see in Graphs 3 and 4 that both models fit the observed data reasonably well.

Show code

# investigating modelWorkLifeBalance fit

# specifying the number of samples
nsamples = 1000

brms::pp_check(
  modelWorkLifeBalance, 
  nsamples = nsamples
  ) + 
  ggplot2::labs(
    title = stringr::str_glue("Posterior predictive checks for modelWorkLifeBalance (using {nsamples} samples)")
    )

Graph 3: Posterior predictive checks comparing simulated/replicated data under the fitted modelWorkLifeBalance with the observed data.

Show code

# investigating modelWellBeing fit

# specifying the number of samples
nsamples = 1000

brms::pp_check(
  modelWellBeing, 
  nsamples = nsamples
  ) + 
  ggplot2::labs(
    title = stringr::str_glue("Posterior predictive checks for modelWellBeing (using {nsamples} samples)")
    )

Graph 4: Posterior predictive checks comparing simulated/replicated data under the fitted modelWellBeing with the observed data.

Results of the analysis

Now, after having sufficient confidence that - using terminology from the Richard McElreath’s book Statistical Rethinking - our “small worlds” can pretty accurately mimic the data coming from our real,“big world”, we can use our models’ parameters to learn something about our research questions. Our primary interest is in the coefficient value of the pandemicBeforethepandemicoutbreak and elapsedTimeAfterPandemic terms in our models. It expresses how much and in what direction people’s search interest in work-life balance and well-being changed immediately after the outbreak of pandemic, and how slope of the trend changed after the pandemic, respectively.

In Graph 5 and 6 we can see posterior distribution of the pandemicBeforethepandemicoutbreak parameter in our two models. In both cases the posterior distribution of the pandemic term is (predominantly or completely) on the left side of the zero value, which supports the claim about existence of the effect of pandemic on people’s increased search interest in work-life balance and well-being immediately after the outbreak of pandemic. As is apparent from the graphs, for well-being (Graph 6) this evidence is much stronger than for work-life balance (Graph 5), which corresponds to impression we might have when looking at the original Google Trends charts shown in Fig. 1 and 2.

Show code

# uploading library for 
library(tidybayes)

# visualizing posterior distribution of the pandemicBeforethepandemicoutbreak parameter in the modelWorkLifeBalance
modelWorkLifeBalance %>%
  tidybayes::gather_draws(
    b_pandemicBeforethepandemicoutbreak
    ) %>%
  dplyr::mutate(
    .variable = factor(
      .variable, 
      levels = c("b_pandemicBeforethepandemicoutbreak"), 
      ordered = TRUE
      )
    ) %>%
  dplyr::rename(value = .value) %>%
  ggplot2::ggplot(
    aes(x = value)
    ) +
  ggplot2::geom_density(
    fill = "lightblue"
  ) +
  ggplot2::labs(
    title = "Posterior distribution of the pandemicBeforethepandemicoutbreak parameter\nin the modelWorkLifeBalance"
    )

Graph 5: Visualization of the posterior distribution of the pandemicBeforethepandemicoutbreak parameter in the modelWorkLifeBalance.

Show code

# visualizing posterior distribution of the pandemicBeforethepandemicoutbreak parameter in the modelWellBeing
modelWellBeing %>%
  tidybayes::gather_draws(
    b_pandemicBeforethepandemicoutbreak
    ) %>%
  dplyr::mutate(
    .variable = factor(
      .variable, 
      levels = c("b_pandemicBeforethepandemicoutbreak"), 
      ordered = TRUE
      )
    ) %>%
  dplyr::rename(value = .value) %>%
  ggplot2::ggplot(
    aes(x = value)
    ) +
  ggplot2::geom_density(
    fill = "lightblue"
  ) +
  ggplot2::labs(
    title = "Posterior distribution of the pandemicBeforethepandemicoutbreak parameter\nin the modelWellBeing"
    )

Graph 6: Visualization of the posterior distribution of the pandemicBeforethepandemicoutbreak parameter in the modelWellBeing.

To generate more summary statistics about posterior distributions (and also some diagnostic information like Rhat or ESS), we can use summary() function.

Show code

# generating a summary of the results for modelWorkLifeBalance 
summary(modelWorkLifeBalance)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: workLifeBalance | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1) 
   Data: dfAll (Number of observations: 132) 
  Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
         total post-warmup draws = 8000

Correlation Structures:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
ar[1]     0.21      0.09     0.03     0.40 1.00     7232     5932

Population-Level Effects: 
                                  Estimate Est.Error l-95% CI
Intercept                            69.59     11.51    47.51
elapsedTime                          -0.05      0.04    -0.13
pandemicBeforethepandemicoutbreak   -13.90     10.15   -34.52
elapsedTimeAfterPandemic              0.40      1.59    -2.77
monthFeb                              3.50      4.47    -5.29
monthMar                              4.49      4.99    -5.30
monthApr                              6.45      5.10    -3.44
monthMay                              8.81      5.10    -1.18
monthJun                             -2.91      5.05   -12.82
monthJul                             -7.40      5.02   -17.18
monthAug                             -2.51      5.01   -12.52
monthSep                             -2.20      5.01   -11.87
monthOct                              5.35      5.01    -4.68
monthNov                             12.31      4.91     2.56
monthDec                             -0.64      4.51    -9.37
                                  u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept                            92.59 1.00     4930     4892
elapsedTime                           0.03 1.00     8111     5233
pandemicBeforethepandemicoutbreak     5.76 1.00     5569     5338
elapsedTimeAfterPandemic              3.63 1.00     5567     5543
monthFeb                             12.29 1.00     3805     4101
monthMar                             14.36 1.00     3316     4648
monthApr                             16.44 1.00     3225     4565
monthMay                             18.82 1.00     3148     4520
monthJun                              7.14 1.00     3008     4287
monthJul                              2.42 1.00     3148     4652
monthAug                              7.42 1.00     3206     4379
monthSep                              7.58 1.00     3234     4922
monthOct                             14.99 1.00     3173     4504
monthNov                             22.21 1.00     2945     4709
monthDec                              8.24 1.00     3662     5175

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    11.54      0.77    10.17    13.16 1.00     6603     5274

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Show code

# generating a summary of the results for modelWellBeing 
summary(modelWellBeing)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: wellBeing | trunc(lb = 0, ub = 100) ~ elapsedTime + pandemic + elapsedTimeAfterPandemic + month + ar(p = 1) 
   Data: dfAll (Number of observations: 132) 
  Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
         total post-warmup draws = 8000

Correlation Structures:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
ar[1]     0.24      0.10     0.05     0.44 1.00     6461     5967

Population-Level Effects: 
                                  Estimate Est.Error l-95% CI
Intercept                            56.14      4.63    46.81
elapsedTime                           0.11      0.02     0.08
pandemicBeforethepandemicoutbreak   -21.29      4.11   -29.26
elapsedTimeAfterPandemic              0.50      0.64    -0.78
monthFeb                              8.61      1.80     5.09
monthMar                             10.63      2.02     6.70
monthApr                              9.57      2.02     5.66
monthMay                              3.52      2.07    -0.52
monthJun                             -4.35      2.05    -8.43
monthJul                             -8.05      2.06   -12.10
monthAug                             -5.76      2.05    -9.68
monthSep                              4.56      2.05     0.57
monthOct                              8.13      2.01     4.24
monthNov                              6.74      2.01     2.83
monthDec                             -4.95      1.80    -8.39
                                  u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept                            65.13 1.00     4316     5151
elapsedTime                           0.14 1.00     8987     5597
pandemicBeforethepandemicoutbreak   -13.07 1.00     5825     5557
elapsedTimeAfterPandemic              1.78 1.00     6106     5784
monthFeb                             12.19 1.00     3387     4626
monthMar                             14.59 1.00     2868     4601
monthApr                             13.59 1.00     2724     4233
monthMay                              7.60 1.00     2938     4027
monthJun                             -0.34 1.00     2815     4633
monthJul                             -4.01 1.00     2749     4036
monthAug                             -1.77 1.00     2863     4364
monthSep                              8.55 1.00     2969     3748
monthOct                             12.22 1.00     2907     3914
monthNov                             10.73 1.00     2962     4142
monthDec                             -1.35 1.00     3525     4630

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     4.63      0.31     4.07     5.28 1.00     7355     6173

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Given that for work-life balance model the posterior distribution of pandemic term crosses the zero value, it would be useful to know how strong is the evidence in the favor of hypothesis that pandemic term is lower than zero. For that purpose we can extract posterior samples and use them for calculation of the proportion of values that are larger/smaller than zero. The resulting proportions show that the vast majority (around 92%) of posterior distribution lies below zero.

Show code

# extracting posterior samples
samples <- brms::posterior_samples(modelWorkLifeBalance, seed = 12345)

# probability of b_pandemicBeforethepandemicoutbreak coefficient being lower than 0
sum(samples$b_pandemicBeforethepandemicoutbreak < 0) / nrow(samples)

[1] 0.9145

Now let’s check the parameter elapsedTimeAfterPandemic. Its posterior distribution in both models “safely” includes zero value, which indicates that there is not huge support for positive change in trend after the outbreak of pandemic.

Show code

# visualizing posterior distribution of the elapsedTimeAfterPandemic parameter in the modelWorkLifeBalance
modelWorkLifeBalance %>%
  tidybayes::gather_draws(
    b_elapsedTimeAfterPandemic
    ) %>%
  dplyr::mutate(
    .variable = factor(
      .variable, 
      levels = c("b_elapsedTimeAfterPandemic"), 
      ordered = TRUE
      )
    ) %>%
  dplyr::rename(value = .value) %>%
  ggplot2::ggplot(
    aes(x = value)
    ) +
  ggplot2::geom_density(
    fill = "lightblue"
  ) +
  ggplot2::labs(
    title = "Posterior distribution of the elapsedTimeAfterPandemic parameter\nin the modelWorkLifeBalance"
    )

Graph 7: Visualization of the posterior distribution of the elapsedTimeAfterPandemic parameter in the modelWorkLifeBalance.

Show code

# visualizing posterior distribution of the elapsedTimeAfterPandemic parameter in the modelWellBeing
modelWellBeing %>%
  tidybayes::gather_draws(
    b_elapsedTimeAfterPandemic
    ) %>%
  dplyr::mutate(
    .variable = factor(
      .variable, 
      levels = c("b_elapsedTimeAfterPandemic"), 
      ordered = TRUE
      )
    ) %>%
  dplyr::rename(value = .value) %>%
  ggplot2::ggplot(
    aes(x = value)
    ) +
  ggplot2::geom_density(
    fill = "lightblue"
  ) +
  ggplot2::labs(
    title = "Posterior distribution of the elapsedTimeAfterPandemic parameter\nin the modelWellBeing"
    )

Graph 8: Visualization of the posterior distribution of the elapsedTimeAfterPandemic parameter in the modelWellBeing.

In conclusion, we can say that there is some evidence that the COVID-19 pandemic has prompted people to be more interested in topics related to work-life balance and well-being. I wish us all to be able to transform our increased interest in these topics into truly increased quality of our personal and professional lives. It would be a shame not to use that extra incentive many of us have now for making significant change in our lives.

Comment on this article Share:

Modeling impact of the COVID-19 pandemic on people’s interest in work-life balance and well-being

Search interest in work-life balance and well-being

Bayesian segmented regression

Some necessary sanity checks

Results of the analysis

Citation