Is it satisfaction or dissatisfaction that drives comments? Or perhaps it’s the extremes on both ends of the satisfaction spectrum?
I’ve been curious about this for a while, but until recently, I only had my personal pet theories to rely on. Luckily, one of my recent projects gave me the chance to explore this question with real-world data and satisfy my curiosity a bit.
Why bother? Well, knowing who tends to comment gives you a better feel for how representative the comments are, and comments can be mined for a lot of useful insights into the employee experience.
Btw, what's your guess? Try to make a prediction before reading on and/or checking the charts; with hindsight, the results may seem too obvious 😉
To test my ideas, I fitted a random forest (RF) classifier, which can capture non-linear relationships, using the item score and some common controls as predictors of whether an employee would leave a comment on a given item. I then applied a Partial Dependence Plot (PDP), a global ML interpretation tool, to the fitted model to examine the relationship between item scores and the likelihood of leaving a comment.
What were the results? Well, as usual, it depends. However, across the sample of items shown, we can observe a common pattern: a non-linear, reversed-S-shaped relationship. Dissatisfied employees tend to comment more, except for the most extremely dissatisfied ones. As satisfaction increases, the probability of commenting decreases, only to rise slightly again as scores approach 10. Generally speaking, less satisfied employees comment more on average. Given that we collect feedback from employees in order to improve things, that kind of makes sense, right? 🤓
Does this match your expectations, or are you surprised? Would you expect different patterns for different items? Have you conducted a similar exercise with your own data? If so, what were the results? Perhaps you also know of some relevant research on this topic. Feel free to share.
P.S. If you would like to replicate this analysis using your own data, you can use the following Python script as inspiration.
# required libraries
# data manipulation
import pandas as pd
import numpy as np
import copy
# dataviz
from plotnine import *
# ML
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
# ML explanation & interpretation
from sklearn.inspection import partial_dependence
# function for changing snake case names to titles (to beautify titles in generated charts)
def snake_to_title(snake_str):
    # split the string by underscores
    words = snake_str.split('_')
    # capitalize each word
    capitalized_words = [word.capitalize() for word in words]
    # join the words with spaces
    title_case_string = ' '.join(capitalized_words)
    return title_case_string
# the analysis assumes wide-format data with individual-level records of employees'
# responses to employee survey questions on a scale 0-10 and their comments to these questions
mydata = pd.read_csv('your_data.csv')
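# for illustration only: given the items and predictors used below, the file is
# expected to contain columns along the lines of 'autonomy', 'comment_autonomy', ...,
# plus 'age', 'gender', 'country', 'job_family_group', 'is_manager',
# 'management_level', 'org_unit' and 'tenure'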
# list of survey items of interest
items = [
    'autonomy',
    'engagement',
    'workload',
    'recognition',
    'reward',
    'strategy',
    'growth',
    'management_support',
    'peer_relationship',
    'diversity_inclusion',
    'health_wellbeing_balance'
]
# looping over individual items
for item in items:

    print(item)

    # dataset to be used for ML task
    ml_data = copy.deepcopy(mydata)

    # name of the field with comments to a specific question
    item_comment = f'comment_{item}'

    # creating a flag indicating presence/absence of a comment
    ml_data[item_comment] = np.where(
        ml_data[item_comment].isna(), False, True
    )

    # keeping only those employees who replied to the question on a scale 0-10
    ml_data.dropna(subset=[item], inplace=True)

    # defining the predictors and target variable
    predictors = [item, 'age', 'gender', 'country', 'job_family_group', 'is_manager', 'management_level', 'org_unit', 'tenure']
    target = item_comment

    # stratified split of the data into training and testing sets
    X = ml_data[predictors]
    y = ml_data[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1979, stratify=y)

    # defining the column transformer for data pre-processing
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), ['age', 'tenure']),
            ('cat', OneHotEncoder(drop='first'), [item, 'gender', 'country', 'job_family_group', 'is_manager', 'management_level', 'org_unit'])
        ]
    )

    # Random Forest Classifier
    # skipping hyper-parameter fine-tuning for the sake of brevity
    rf = RandomForestClassifier(min_samples_leaf=5, min_samples_split=30, n_estimators=500, random_state=1979)
# creating a pipeline
= Pipeline(steps=[
pipeline 'preprocessor', preprocessor),
('classifier', rf)
(
])
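
    # hyper-parameter tuning was skipped above for brevity; if you do want to tune
    # the forest, a rough sketch using the GridSearchCV imported above could look
    # like this (the grid values are purely illustrative):
    # param_grid = {
    #     'classifier__n_estimators': [250, 500],
    #     'classifier__min_samples_leaf': [5, 10],
    #     'classifier__min_samples_split': [10, 30]
    # }
    # search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
    # pipeline = search.fit(X_train, y_train).best_estimator_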

    # fitting the pipeline
    pipeline.fit(X_train, y_train)

    # skipping assessment of the quality of the model for the sake of brevity
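    # a quick sanity check on the hold-out set could look like this, for example:
    # from sklearn.metrics import classification_report
    # print(classification_report(y_test, pipeline.predict(X_test)))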

    # PDP (Partial Dependence) plots
    # size of the sample of individual conditional expectation (ICE) curves
    n = 500
    feature_names = X.columns
    fIndex = np.where(feature_names == item)[0][0]
    pdp_results = partial_dependence(pipeline, X_train, [fIndex], grid_resolution=50, kind="both")

    # extracting the data
    values = pdp_results['values'][0]
    average = pdp_results['average'][0]
    individual = pdp_results['individual'][0]
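    # note: newer scikit-learn versions renamed the 'values' key in the
    # partial_dependence output; if the extraction above fails, try
    # pdp_results['grid_values'][0] instead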

    # df for the average line
    pdp_data_avg = pd.DataFrame({
        'Score': values,
        'Partial Dependence': average
    })

    # df for the ICE curves
    pdp_data_ind = pd.DataFrame(individual, columns=values)
    pdp_data_ind['ID'] = pdp_data_ind.index
    pdp_data_ind = pdp_data_ind.melt(id_vars='ID', var_name='Score', value_name='Partial Dependence')

    # sampling 500 unique IDs
    sampled_ids = pdp_data_ind['ID'].unique()
    if len(sampled_ids) > n:
        np.random.seed(1979)
        sampled_ids = np.random.choice(sampled_ids, n, replace=False)

    pdp_data_ind = pdp_data_ind[pdp_data_ind['ID'].isin(sampled_ids)]
    pdp_data_ind['Score'] = pdp_data_ind['Score'].astype(float)

    # plotting the results
    item_title = snake_to_title(item)
    plot = (
        ggplot() +
        geom_line(aes(x='Score', y='Partial Dependence', group='ID'), size=0.1, alpha=0.2, data=pdp_data_ind) +
        geom_line(aes(x='Score', y='Partial Dependence', group=1), size=1.5, data=pdp_data_avg) +
        scale_x_continuous(breaks=range(0, 11)) +
        labs(
            title=f'PDP plot for score on "{item_title}" survey item',
            x=f'Score on "{item_title}" survey item',
            y='Probability of commenting'
        ) +
        theme_bw() +
        theme(
            plot_title=element_text(size=18, margin={'t': 0, 'r': 0, 'b': 10, 'l': 0}),
            axis_text=element_text(size=12),
            axis_title=element_text(size=14),
            axis_title_x=element_text(margin={'t': 10, 'r': 0, 'b': 0, 'l': 0}),
            axis_title_y=element_text(margin={'t': 0, 'r': 10, 'b': 0, 'l': 0}),
            strip_text_x=element_text(size=13),
            panel_grid_major=element_blank(),
            panel_grid_minor=element_blank(),
            figure_size=(11, 6)
        )
    )

    # print(plot)

    # saving the plot
    plot.save(filename=f"{item}_item_pdp.png", width=11, height=6, dpi=500)