A demonstration of a method useful for sharing insights from fitted ML models.
There is one very useful, albeit relatively underused, method of sharing insights from fitted ML models, at least in my professional bubble.
It’s a method of identifying personas based on the outputs of ML local interpretation algorithms, which provide information about the specific drivers of predictions for individual observations.
Its implementation is pretty straightforward:

1. Fit an ML model to your data.
2. Generate local explanations (e.g., LIME values) for individual observations, along with their predicted values.
3. Project the explanations, together with the predictions, into two dimensions using a dimensionality-reduction technique such as UMAP.
4. Identify clusters of observations with similar predictions and similar prediction drivers, for example with HDBSCAN.
5. Profile the resulting clusters, i.e. the personas, using both the explanations and the original feature values.
Let’s see this in action using the Python code below. First, we need to create an artificial dataset on which to demonstrate the method described above. We’ll create a classification dataset (imagine, for example, that we’re trying to predict sales performance based on some collaboration metrics, but feel free to imagine any scenario you like) and prepare training and testing sets for fitting our ML model.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import pandas as pd
# defining the number of samples and features for the dataset
n_samples = 10000
n_features = 10

# creating the dataset
X, y = make_classification(n_samples=n_samples, n_features=n_features, n_informative=6, n_redundant=4, n_clusters_per_class=3, flip_y=0.27, class_sep=1, random_state=1979)

# creating a df from X and y
df = pd.DataFrame(X, columns=['feature_{}'.format(i) for i in range(n_features)])
df['criterion'] = y

# splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1979)
library(DT)
library(tidyverse)
library(reticulate)
# table dataviz
DT::datatable(
py$df,
class = 'cell-border stripe',
filter = 'top',
extensions = 'Buttons',
fillContainer = FALSE,
rownames= FALSE,
options = list(
pageLength = 5,
autoWidth = TRUE,
dom = 'Bfrtip',
buttons = c('copy'),
scrollX = TRUE,
selection="multiple"
)
)
Now we can fit and fine-tune our ML (XGBoost) model.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
# fitting an XGBoost model with hyperparameter tuning and 10-fold cross-validation
# initializing the XGBoost classifier
xgb_model = XGBClassifier(random_state=1979)
# defining the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2]
}

# setting up the grid search with 10-fold cross-validation
grid_search = GridSearchCV(xgb_model, param_grid, cv=10, scoring='f1')
# fitting the model with hyperparameter tuning
grid_search.fit(X_train, y_train)
# getting the best estimator
best_xgb_model = grid_search.best_estimator_
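If you are curious which hyperparameter combination won the grid search, you can inspect it directly. A minimal check (the exact values will depend on your environment and run):

# inspecting the winning hyperparameter combination and its cross-validated F1 score
print('Best parameters: ', grid_search.best_params_)
print('Best CV F1 score: ', round(grid_search.best_score_, 3))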
The classification performance metrics below show that the fitted model performs well on the test data, so we can safely proceed further.
from sklearn.metrics import classification_report, roc_auc_score
import numpy as np
# predictions on the test set
y_pred = best_xgb_model.predict(X_test)
y_pred_proba = best_xgb_model.predict_proba(X_test)[:, 1]

# classification report
report = classification_report(y_test, y_pred)
print(report)
              precision    recall  f1-score   support

           0       0.81      0.78      0.80       977
           1       0.80      0.83      0.81      1023

    accuracy                           0.81      2000
   macro avg       0.81      0.80      0.80      2000
weighted avg       0.81      0.81      0.80      2000
# ROC AUC score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print('ROC AUC score: ', np.round(roc_auc, 2))
ROC AUC score:  0.85
Now we will generate LIME explanations for each observation in the testing dataset and standardize them, together with the predicted probabilities of the positive class, for later analysis and visualization.
import lime.lime_tabular
from sklearn.preprocessing import StandardScaler
# using LIME for local interpretation
# initializing the LIME explainer
explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train,
    feature_names=['Feature_{}'.format(i) for i in range(X_train.shape[1])],
    class_names=['Low Performance', 'High Performance'],
    mode='classification',
    random_state=1234
)
# df for storing the LIME explanations for all observations
explanations_df = pd.DataFrame()
feature_names = df.columns[:-1].tolist()

# generating LIME explanations for each observation in the test set
for i in range(X_test.shape[0]):

    # predicted probability for the positive class
    predicted_class_proba = y_pred_proba[i]

    # generating the LIME explanation
    exp = explainer.explain_instance(X_test[i], best_xgb_model.predict_proba, num_features=X_train.shape[1])
    exp_list = exp.as_list()

    feature_values = {name: 0 for name in feature_names}

    # looping through the observation's conditions and updating feature_values accordingly
    for condition, value in exp_list:
        for feature_name in feature_names:
            if feature_name in condition.lower():
                feature_values[feature_name] = value
                break

    # adding the predicted probability for the positive class
    feature_values['predicted_class_proba'] = predicted_class_proba

    supp_df = pd.DataFrame(feature_values, index=[0])
    explanations_df = pd.concat([explanations_df, supp_df], ignore_index=True)

# standardizing all features (including the probability for the positive class)
scaler = StandardScaler()
explanations_scaled = scaler.fit_transform(explanations_df)
explanations_scaled_df = pd.DataFrame(explanations_scaled)
explanations_scaled_df.columns = explanations_df.columns
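Before projecting the explanations, a quick sanity check of the resulting frame doesn’t hurt; each test observation should contribute one row of signed LIME contributions plus the predicted probability (a minimal check, assuming the dataset dimensions used above):

# one row per test observation: 10 feature contributions + predicted_class_proba
print(explanations_df.shape)  # expected: (2000, 11)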
Using the UMAP 2D projection of the prediction explanations and predicted probabilities, we can see that there are several clusters of observations with similar predicted probabilities and similar prediction drivers.
import umap
import matplotlib.pyplot as plt
import seaborn as sns
="white")
sns.set_theme(style
# visualizing the personas using UMAP
# initializing and fitting UMAP
reducer = umap.UMAP(n_components=2, n_neighbors=50, min_dist=0.01, metric='euclidean', random_state=1979, n_jobs=1)
embedding = reducer.fit_transform(explanations_scaled)

# plotting the explanations and predicted probability in a 2D scatterplot
plt.close()
plt.figure(figsize=(12, 8))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c='lightblue', s=50, alpha=0.5)
plt.title('People with similar predictions and similar prediction drivers\n', fontsize=24)
plt.figtext(0.05, 0.05, "UMAP projection of the LIME prediction explanations and predicted probabilities.", wrap=True, horizontalalignment='left', fontsize=12)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
plt.tick_params(axis='y', which='both', left=False, right=False, labelleft=False)
plt.show()
A clustering algorithm, such as HDBSCAN, can help us identify the clusters.
import hdbscan
import matplotlib.patches as mpatches
# HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(min_cluster_size=25, min_samples=10, cluster_selection_epsilon=0.3, prediction_data=True)
clusterer.fit(embedding)
clusters = clusterer.labels_
# adding the cluster labels to the dataframe
explanations_df['cluster'] = clusters
# plotting the clusters
plt.close()
plt.figure(figsize=(12, 8))
cmap = plt.cm.get_cmap('tab20')
norm = plt.Normalize(clusters.min(), clusters.max())
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=clusters, s=50, cmap=cmap, norm=norm, alpha=0.5)
patches = [mpatches.Patch(color=cmap(norm(i)), label=f'Cluster {i}') for i in np.unique(clusters)]
plt.legend(handles=patches, fontsize=12)
plt.title('People with similar predictions and similar prediction drivers\n', fontsize=24)
plt.figtext(0.05, 0.05, "UMAP projection of the LIME prediction explanations and predicted probabilities. The clusters were identified using HDBSCAN.", wrap=True, horizontalalignment='left', fontsize=12)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
plt.tick_params(axis='y', which='both', left=False, right=False, labelleft=False)
plt.show()
There seem to be about eight clusters and some outliers (cluster -1). Let’s look at how they differ in terms of predicted probabilities. According to the chart below, there appear to be four clusters with increased predicted probabilities (clusters 3, 5, 6, and 7), three with decreased predicted probabilities (clusters 0, 2, and 4), and one with more mixed predictions (cluster 1).
import matplotlib.colorbar as colorbar
# plotting the distribution of the predicted probability of the positive class
plt.close()
plt.figure(figsize=(12, 8))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=y_pred_proba, cmap='viridis', s=50, alpha=0.5)
plt.title('People with similar predictions and similar prediction drivers\n', fontsize=24)
plt.figtext(0.05, 0.05, "UMAP projection of the LIME prediction explanations and predicted probabilities.", wrap=True, horizontalalignment='left', fontsize=12)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
plt.tick_params(axis='y', which='both', left=False, right=False, labelleft=False)
cbar = plt.colorbar(scatter)
cbar.set_label('Predicted class probability', rotation=270, labelpad=15)
plt.show()
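To back up the visual impression with numbers, we can also summarize the predicted probabilities per cluster; a minimal sketch using the frames defined above (recall that cluster -1 collects the HDBSCAN outliers):

# size and median predicted probability of each cluster
cluster_summary = explanations_df.groupby('cluster')['predicted_class_proba'].agg(['size', 'median'])
print(cluster_summary)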
Now we can check, for selected clusters, which features most affect their predicted probabilities and in which direction. For example, we can see from the table below that the clusters with lower predicted probabilities (clusters 0, 2, and 4) are driven either by the values of features 6 and 9 (cluster 0), by the values of features 0, 3, 7, and 9 (cluster 2), or by the value of feature 0 (cluster 4).
# creating a summary for each cluster across all feature drivers of predicted probabilities of the positive class
tab1 <- py$explanations_df %>%
dplyr::group_by(cluster) %>%
dplyr::summarise_all(~median(., na.rm = TRUE))
# tab dataviz
DT::datatable(
round(tab1,2),
class = 'cell-border stripe',
filter = 'top',
extensions = 'Buttons',
fillContainer = FALSE,
rownames= FALSE,
options = list(
pageLength = 10,
autoWidth = TRUE,
dom = 'Bfrtip',
buttons = c('copy'),
scrollX = TRUE,
selection="multiple"
)
) %>%
formatStyle(
names(tab1 %>% dplyr::select(-cluster, -predicted_class_proba)),
background = styleColorBar(range(tab1 %>% dplyr::select(-cluster, -predicted_class_proba)), 'lightblue'),
backgroundSize = '98% 90%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center'
)
Combined with information on the median values of these specific features, we can get a good idea of the people who tend to under-perform and “why”. For example, people in cluster 0 score too high on feature 6 and too low on feature 9; people in cluster 2 score too low on features 0, 3, 7, and 9; and people in cluster 4 score too low on feature 0. Given these differences, it would be worth tailoring the approach to improving people’s sales performance to the persona they belong to. We could also repeat a similar analysis for the clusters with higher predicted probabilities to see which combinations of features tend to be associated with higher performance.
# creating a df with X and y from the testing part of the dataset
test_df = pd.DataFrame(X_test, columns=['feature_{}'.format(i) for i in range(n_features)])
test_df['predicted_class_proba'] = y_pred_proba
test_df['cluster'] = clusters
# creating a summary for each cluster across all raw data and predicted probability of the positive class
tab2 <- py$test_df %>%
dplyr::group_by(cluster) %>%
dplyr::summarise_all(~median(., na.rm = TRUE))
# tab dataviz
DT::datatable(
round(tab2,2),
class = 'cell-border stripe',
filter = 'top',
extensions = 'Buttons',
fillContainer = FALSE,
rownames= FALSE,
options = list(
pageLength = 10,
autoWidth = TRUE,
dom = 'Bfrtip',
buttons = c('copy'),
scrollX = TRUE,
selection="multiple"
)
) %>%
formatStyle(
names(tab2 %>% dplyr::select(-cluster, -predicted_class_proba)),
background = styleColorBar(range(tab2 %>% dplyr::select(-cluster, -predicted_class_proba)), 'lightblue'),
backgroundSize = '98% 90%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center'
)
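If you prefer to stay in Python, a rough pandas equivalent of the two summary tables above might look like this (a sketch using the frames defined earlier, without the table styling):

# median LIME contribution per cluster (analogue of the first summary table)
lime_summary = explanations_df.groupby('cluster').median().round(2)

# median raw feature value and predicted probability per cluster (analogue of the second summary table)
raw_summary = test_df.groupby('cluster').median().round(2)

print(lime_summary)
print(raw_summary)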
Maybe you’ll find the method described here useful in one of your ML projects. Happy data sleuthing 🙂