Sharing one learning from the awesome book Probably Overthinking It by Allen B. Downey.
For those interested in data analytics, I highly recommend the book Probably Overthinking It by Allen B. Downey. It offers insightful details on a range of more or less well-known analytical situations that you might encounter when using data to understand the world better or to make more informed decisions.
Even if you are not a beginner in data analytics, there is a good chance that you will come across some new and valuable insights, as I did.
For example, as a people analytics practitioner, I found it particularly enlightening to examine the survival function - commonly used in employee attrition modeling - from the perspective of average remaining time and the related concepts of new better than used in expectation (NBUE) vs. new worse than used in expectation (NWUE). From now on, for me, org units exhibiting NBUE and NWUE characteristics are “light bulbs” and “wines”, respectively 😉
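To make the two labels a bit more concrete, here is a minimal sketch based on the textbook definition of mean residual life, m(t) = E[T - t | T > t], which equals the area under the survival curve to the right of t divided by S(t). A distribution is NBUE when m(t) never exceeds m(0) and NWUE when m(t) never falls below it. The helper name `mean_residual_life` and the two Weibull-shaped "teams" are assumptions made purely for illustration; they are not taken from the book or from the data used in this post.

```python
import numpy as np

def mean_residual_life(times, surv):
    """Mean residual life from a step survival curve, restricted to the grid.

    times: increasing time points; surv: S(t) evaluated at those points.
    Returns m(t_i) = (area under S between t_i and the last point) / S(t_i).
    """
    times = np.asarray(times, dtype=float)
    surv = np.asarray(surv, dtype=float)
    # area of each step, S(t_i) * (t_{i+1} - t_i)
    areas = surv[:-1] * np.diff(times)
    # tail[i] = area under the curve from t_i to the last time point
    tail = np.concatenate([np.cumsum(areas[::-1])[::-1], [0.0]])
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(surv > 0, tail / surv, np.nan)

# hypothetical teams with Weibull-shaped tenure: S(t) = exp(-(t / scale) ** shape)
grid = np.linspace(0, 200, 20001)            # step of 0.01 years
scale = 2.0
surv_bulb = np.exp(-(grid / scale) ** 2.0)   # shape > 1: risk of leaving grows with tenure
surv_wine = np.exp(-(grid / scale) ** 0.5)   # shape < 1: risk of leaving falls with tenure

mrl_bulb = mean_residual_life(grid, surv_bulb)
mrl_wine = mean_residual_life(grid, surv_wine)

i0, i2 = 0, 200  # grid positions of t = 0 and t = 2 years
print(f"light bulb: m(0) = {mrl_bulb[i0]:.2f}, m(2) = {mrl_bulb[i2]:.2f}")
print(f"wine:       m(0) = {mrl_wine[i0]:.2f}, m(2) = {mrl_wine[i2]:.2f}")
```

In this toy setup the "light bulb" curve has a lower expected remaining time after two years than at the start, while the "wine" curve has a higher one.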
If you’re curious about how to get from the survival function to average remaining time, check out the short snippets of Python code below that do the trick. It’s not a big deal, but it can still save you some time 👇
First, let’s load the dummy data and create functions that allow us to estimate and display the survival curve and the corresponding expected remaining time curve.
# libraries used
import pandas as pd
from plotnine import *
from sksurv.nonparametric import kaplan_meier_estimator

# loading the data
data_bulb = pd.read_csv('./attrition_data_bulb.csv')
data_wine = pd.read_csv('./attrition_data_wine.csv')

# function for estimating and plotting the survival function
def survival_function_estimation_plotting(data, plot_name='chart'):
    # estimating the survival function using the Kaplan-Meier estimator
    time, survival_prob, conf_int = kaplan_meier_estimator(data["event"], data["tenure_years"], conf_level=0.95, conf_type="log-log")

    # supp df for dataviz
    supp_df = pd.DataFrame({
        'time': time,
        'survival_prob': survival_prob,
        'lower_ci': conf_int[0],
        'upper_ci': conf_int[1]
    })

    # dataviz
    plot = (
        ggplot(supp_df, aes(x='time')) +
        geom_step(aes(y='survival_prob'), color='#23004C', size=1) +
        geom_ribbon(aes(ymin='lower_ci', ymax='upper_ci'), alpha=0.25) +
        scale_y_continuous(limits=[0, 1]) +
        labs(
            title="Probability of staying over time since joining the company",
            x='TIME (IN YEARS)',
            y='ESTIMATED PROBABILITY OF STAYING'
        ) +
        theme_bw() +
        theme(
            plot_title=element_text(size=20, margin={'b': 12}, ha='left'),
            axis_title_x=element_text(size=15, margin={'t': 15}),
            axis_title_y=element_text(size=15, margin={'r': 15}),
            axis_text_x=element_text(size=10),
            axis_text_y=element_text(size=10),
            strip_text_x=element_text(size=14, weight='bold'),
            panel_grid_major=element_blank(),
            panel_grid_minor=element_blank(),
            figure_size=(12, 6.5)
        )
    )

    # saving the plot
    # ggsave(plot=plot, filename=f'survival_curve_{plot_name}.png', width=12, height=6, dpi=500)

    # printing the plot
    print(plot)
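
For readers who want to see what sits behind the survival curve, the following is a minimal hand-rolled sketch of the Kaplan-Meier product-limit formula: at each observed time, the running survival probability is multiplied by one minus the share of people still at risk who leave at that time. The function name `kaplan_meier_by_hand` and the four-person toy data are assumptions for illustration only. It is meant to convey the idea, not to reproduce the internals or the confidence intervals of `kaplan_meier_estimator`.

```python
import numpy as np

def kaplan_meier_by_hand(event, time):
    """Product-limit estimate of S(t) at each unique observed time.

    event: True/1 if the person left, False/0 if still employed (censored).
    time:  tenure at leaving or at the end of observation.
    """
    event = np.asarray(event, dtype=bool)
    time = np.asarray(time, dtype=float)
    unique_times = np.unique(time)
    surv = []
    s = 1.0
    for t in unique_times:
        at_risk = np.sum(time >= t)              # people whose tenure reached t
        leavers = np.sum((time == t) & event)    # people who left exactly at t
        s *= 1.0 - leavers / at_risk
        surv.append(s)
    return unique_times, np.array(surv)

# toy data (made up): the second person is censored after 2 years
toy_time = [1, 2, 3, 4]
toy_event = [1, 0, 1, 1]
km_time, km_surv = kaplan_meier_by_hand(toy_event, toy_time)
print(dict(zip(km_time, km_surv)))
```

Note that the censored person still counts as "at risk" up to year 2 but never triggers a drop in the curve.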

# function for computing the average remaining time in the company
def remaining_time(data):
    results = []
    # iterating over each time point in the data's index
    for t in data.index:
        if data.loc[t, "survival_prob"] > 0:
            # calculating the conditional survival probabilities from time t onwards
            # by dividing the survival probabilities by the survival probability at time t
            conditional_df = data.loc[t:, "survival_prob"] / data.loc[t, "survival_prob"]

            # removing the survival probability at time t from the calculations
            # as it's not needed for the expected additional time calculation
            conditional_df = conditional_df.iloc[1:]

            # calculating the expected additional time by taking the weighted average
            # of the time points, using the conditional survival probabilities as weights
            expected_additional_time = sum(conditional_df.values * (conditional_df.index - t)) / conditional_df.sum()
            result = {
                "time": t,
                "expected_additional_time": expected_additional_time
            }
            # adding the result to the results list
            results.append(result)
        else:
            pass
    # converting the results list to a DataFrame
    return pd.DataFrame(results)

# function for estimating the uncertainty using the bootstrapping technique and for plotting the results
def remaining_time_bootstraping_plotting(data, n_bootstrap=100, plot_name='chart'):
    # calculating the expected remaining time using all the data
    time, survival_prob = kaplan_meier_estimator(data["event"], data["tenure_years"])
    supp_df_all = pd.DataFrame({'time': time, 'survival_prob': survival_prob})
    supp_df_all = supp_df_all.set_index('time')
    all_results_df = remaining_time(supp_df_all)

    # estimating the uncertainty using the bootstrapping technique
    all_curves = []

    for i in range(n_bootstrap):
        # resampling the data with replacement
        bootstrap_sample = data.sample(n=len(data), replace=True)
        # estimating the survival function using the Kaplan-Meier estimator
        time, survival_prob = kaplan_meier_estimator(bootstrap_sample["event"], bootstrap_sample["tenure_years"])
        # supp df for further calculations
        supp_df = pd.DataFrame({'time': time, 'survival_prob': survival_prob})
        supp_df = supp_df.set_index('time')
        # calculating the expected remaining time
        remaining_time_bootstrap = remaining_time(supp_df)
        # adding a column to identify the bootstrap iteration
        remaining_time_bootstrap['bootstrap_id'] = i
        # adding the result to the all_curves list
        all_curves.append(remaining_time_bootstrap)

    # concatenating all bootstrap results into a single df
    bootstrap_results_df = pd.concat(all_curves)

    # dataviz
    plot = (
        ggplot() +
        geom_step(bootstrap_results_df, aes(x='time', y='expected_additional_time', group='bootstrap_id'), color='grey', alpha=0.1) +
        geom_step(all_results_df, aes(x='time', y='expected_additional_time'), color='#23004C', size=1) +
        labs(
            title="Average remaining time in the company",
            x='TIME SINCE JOINING THE COMPANY (IN YEARS)',
            y='AVERAGE REMAINING TIME (IN YEARS)'
        ) +
        theme_bw() +
        theme(
            plot_title=element_text(size=20, margin={'b': 12}, ha='left'),
            axis_title_x=element_text(size=15, margin={'t': 15}),
            axis_title_y=element_text(size=15, margin={'r': 15}),
            axis_text_x=element_text(size=10),
            axis_text_y=element_text(size=10),
            strip_text_x=element_text(size=14, weight='bold'),
            panel_grid_major=element_blank(),
            panel_grid_minor=element_blank(),
            figure_size=(12, 6.5)
        )
    )

    # saving the plot
    # ggsave(plot=plot, filename=f'remaining_time_curve_{plot_name}.png', width=12, height=6, dpi=500)

    # printing the plot
    print(plot)

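
The grey bootstrap curves show the uncertainty visually. If a numeric band is preferred, one possible approach is to evaluate every bootstrap curve on a common time grid and take pointwise percentiles. The sketch below assumes a data frame with the columns `time`, `expected_additional_time` and `bootstrap_id`, like the one built inside the bootstrapping function; the helper name `percentile_band`, the grid and the two toy curves are hypothetical.

```python
import numpy as np
import pandas as pd

def percentile_band(bootstrap_results_df, grid, lower=2.5, upper=97.5):
    """Pointwise percentile band across bootstrap step curves on a common grid."""
    grid = np.asarray(grid, dtype=float)
    rows = []
    for _, curve in bootstrap_results_df.groupby("bootstrap_id"):
        curve = curve.dropna(subset=["expected_additional_time"]).sort_values("time")
        t = curve["time"].to_numpy(dtype=float)
        y = curve["expected_additional_time"].to_numpy(dtype=float)
        # step lookup: take the value at the last time point <= grid point
        idx = np.searchsorted(t, grid, side="right") - 1
        vals = np.where(idx >= 0, y[np.clip(idx, 0, len(y) - 1)], np.nan)
        # no extrapolation beyond the last time point of this resample
        vals[grid > t[-1]] = np.nan
        rows.append(vals)
    stacked = np.vstack(rows)
    return pd.DataFrame({
        "time": grid,
        "lower": np.nanpercentile(stacked, lower, axis=0),
        "upper": np.nanpercentile(stacked, upper, axis=0),
    })

# two made-up bootstrap curves, just to show the shape of the input and output
toy_curves = pd.DataFrame({
    "time": [0, 1, 2, 0, 1, 2],
    "expected_additional_time": [3.0, 2.0, 1.0, 5.0, 4.0, 3.0],
    "bootstrap_id": [0, 0, 0, 1, 1, 1],
})
toy_band = percentile_band(toy_curves, grid=[0.0, 0.5, 1.5], lower=0, upper=100)
print(toy_band)
```

With only a handful of resamples contributing at long tenures, the band there will be unstable, so it is worth reading the right-hand end of the chart with caution.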
Now let’s estimate and visualize these two curves for a “light bulb” team (i.e., a team exhibiting the NBUE characteristic)…
survival_function_estimation_plotting(data=data_bulb)
remaining_time_bootstraping_plotting(data=data_bulb)
… and now for a “wine” team (i.e., a team exhibiting, at least partially, the NWUE characteristic).
survival_function_estimation_plotting(data=data_wine)
remaining_time_bootstraping_plotting(data=data_wine)
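
If you would rather have a quick programmatic label than eyeball the charts, a rough check is to compare the expected remaining time at each later time point with its value at the start. The sketch below assumes a data frame with `time` and `expected_additional_time` columns, like the one returned by `remaining_time`; the helper name `classify_remaining_time_curve` and the toy curves are made up for illustration. It works on point estimates only and ignores the bootstrap uncertainty.

```python
import pandas as pd

def classify_remaining_time_curve(curve, tol=1e-9):
    """Label a curve 'NBUE', 'NWUE' or 'mixed' by comparing m(t) with m(0).

    A perfectly flat curve satisfies both definitions and is reported as 'NBUE'.
    """
    curve = curve.dropna(subset=["expected_additional_time"]).sort_values("time")
    m0 = curve["expected_additional_time"].iloc[0]
    later = curve["expected_additional_time"].iloc[1:]
    if (later <= m0 + tol).all():
        return "NBUE"   # "light bulb": nobody is expected to stay longer than a newcomer
    if (later >= m0 - tol).all():
        return "NWUE"   # "wine": nobody is expected to stay shorter than a newcomer
    return "mixed"

# toy curves (invented numbers)
bulb_curve = pd.DataFrame({"time": [0, 1, 2, 3], "expected_additional_time": [4.0, 3.2, 2.5, 1.9]})
wine_curve = pd.DataFrame({"time": [0, 1, 2, 3], "expected_additional_time": [2.0, 2.6, 3.1, 3.3]})
mixed_curve = pd.DataFrame({"time": [0, 1, 2, 3], "expected_additional_time": [3.0, 3.5, 2.0, 1.0]})

print(classify_remaining_time_curve(bulb_curve))
print(classify_remaining_time_curve(wine_curve))
print(classify_remaining_time_curve(mixed_curve))
```

A "mixed" result corresponds to a team that shows the NWUE characteristic only partially, for example over the first few years of tenure.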
For attribution, please cite this work as
Stehlík (2024, Feb. 26). Ludek's Blog About People Analytics: Does your team belong among “light bulbs” or “wines”?. Retrieved from https://blog-about-people-analytics.netlify.app/posts/2024-02-26-expected-remaining-time/
BibTeX citation
@misc{stehlík2024does, author = {Stehlík, Luděk}, title = {Ludek's Blog About People Analytics: Does your team belong among “light bulbs” or “wines”?}, url = {https://blog-about-people-analytics.netlify.app/posts/2024-02-26-expected-remaining-time/}, year = {2024} }