In this followup to my earlier post on modeling workflows in Python, I demonstrate how to integrate sample splitting, parallel processing, exception handling and caching into many-models workflows. I also discuss some differences between exploration/inference-centric workflows and tuning-centric workflows.

## Motivating example

We will work with the Palmer Penguin dataset, which contains various biological measures on three species of penguins. Our goal will be to cluster the penguins into groups that correspond to their species using their bill length, bill depth, flipper length and body mass. We’ll use a Gaussian Mixture Model for clustering. Caveat: my goal here is to demonstrate a workflow, not good science.

We’ll start by pulling the data into Python and taking a look.

```
import pandas as pd
penguins = pd.read_csv('https://tinyurl.com/palmerpenguincsv').dropna()
clustering_features = [
'bill_length_mm',
'bill_depth_mm',
'flipper_length_mm',
'body_mass_g'
]
penguins
```

```
## species island bill_length_mm ... body_mass_g sex year
## 0 Adelie Torgersen 39.1 ... 3750.0 male 2007
## 1 Adelie Torgersen 39.5 ... 3800.0 female 2007
## 2 Adelie Torgersen 40.3 ... 3250.0 female 2007
## 4 Adelie Torgersen 36.7 ... 3450.0 female 2007
## 5 Adelie Torgersen 39.3 ... 3650.0 male 2007
## .. ... ... ... ... ... ... ...
## 339 Chinstrap Dream 55.8 ... 4000.0 male 2009
## 340 Chinstrap Dream 43.5 ... 3400.0 female 2009
## 341 Chinstrap Dream 49.6 ... 3775.0 male 2009
## 342 Chinstrap Dream 50.8 ... 4100.0 male 2009
## 343 Chinstrap Dream 50.2 ... 3775.0 female 2009
##
## [333 rows x 8 columns]
```

Our next step is to define a helper function to fit GMM estimators for us. I’m going to use `sklearn`

’s GMM implementation, which uses the EM algorithm for estimation. The estimator has two key hyperparameters that we’ll want to explore:

`n_components`

: the number of clusters to find`covariance_type`

: the assumed structure of the covariances of each cluster

Nothing in the code turns on these hyperparameters dramatically, but it probably worth reading the documentation linked above if haven’t seen GMMs before.

```
from typing import List
from sklearn.mixture import GaussianMixture
# some comments:
#
# 1. I highly recommend using type hints, since it'll help you keep
# keep track of what's happening in each column of the eventual
# data frame of models
#
# 2. Passing hyperparameters as keyword arguments is a generally usefu
# pattern that make it easy to swap out estimators without having
# to modify a lot of code that relies on estimators having
# particular hyperparameters.
#
# 3. Here I specify the columns of the data to use for modeling by
# passing a list of column names. This is extremely limited and
# will result in technical debt if you need to do any sort of
# interesting work to build a design matrix. Alternatives here
# might be a list of sklearn Transformers or a Patsy formula.
# In R you would use a formula object or perhaps a recipe
# from the recipes packages.
#
def fit_gmm(
train : pd.DataFrame,
features : List[chr],
**hyperparameters
) -> GaussianMixture:
# the hyperparameters are: n_components: int, covariance_type: str
gmm = GaussianMixture(
**hyperparameters,
n_init=5,
random_state=27
)
gmm.fit(train[features])
return gmm
# sanity check
fit_gmm(
penguins,
features=clustering_features,
n_components=3,
covariance_type='full'
)
```

`## GaussianMixture(n_components=3, n_init=5, random_state=27)`

Now let’s build some infrastructure so that we can use sample splitting to evaluate how well our models perform out of sample. My approach below is heavily inspired by rsample. At the end of this post I have some longer commentary about this choice and why I haven’t used `sklearn`

built-in resampling infrastructure.

```
import numpy as np
class VFoldCV:
def __init__(self, data : pd.DataFrame, num_folds : int = 10):
self.data = data
self.num_folds = num_folds
permuted_indices = np.random.permutation(len(data))
self.indices = np.array_split(permuted_indices, num_folds)
def __iter__(self):
for test_indices in self.indices:
test = self.data.iloc[test_indices]
train = self.data[~self.data.index.isin(test_indices)]
yield train, test
resamples = VFoldCV(penguins, num_folds=5)
for fold_index, (train, test) in enumerate(resamples):
print(f"""
In fold {fold_index} there are {len(train)}
training observations and {len(test)} test observations
""")
```

```
##
## In fold 0 there are 268
## training observations and 67 test observations
##
##
## In fold 1 there are 269
## training observations and 67 test observations
##
##
## In fold 2 there are 268
## training observations and 67 test observations
##
##
## In fold 3 there are 270
## training observations and 66 test observations
##
##
## In fold 4 there are 268
## training observations and 66 test observations
##
```

Now that we have cross validation folds, we want to create a grid of parameters to explore over. Here we leverage `sklearn`

’s ParameterGrid object, which basically takes a Cartesian product that we can iterate over, as well as coerce to a data frame.

```
from sklearn.model_selection import ParameterGrid
param_grid = ParameterGrid({
'n_components': range(2, 11),
'covariance_type': ['full', 'tied', 'diag', 'spherical']
})
model_grid = pd.DataFrame(param_grid)
model_grid.head()
```

```
## covariance_type n_components
## 0 full 2
## 1 full 3
## 2 full 4
## 3 full 5
## 4 full 6
```

Now we are going to do something a little tricky. We are going to create a new column of `model_grid`

that contains the results from fitting a GMM to each split from the CV object. In particular, each element in this column will itself be a list with `num_folds`

elements. That is, we will create a list-column `cv_fits`

were each element of `model_grid.cv_fits`

is itself a list. To work with these structures it is easiest to define a helper function that fits a GMM to each CV split for a single combination of hyperparameters.

```
def fit_gmm_vfold(
folds: VFoldCV,
**hyperparameters
) -> List[GaussianMixture]:
fits = [
fit_gmm(train, **hyperparameters)
for train, test in folds
]
return fits
model_grid['cv_fits'] = [
fit_gmm_vfold(
resamples,
features=clustering_features,
**hyperparameters
)
for hyperparameters in param_grid
]
model_grid.head()
```

```
## covariance_type ... cv_fits
## 0 full ... [GaussianMixture(n_components=2, n_init=5, ran...
## 1 full ... [GaussianMixture(n_components=3, n_init=5, ran...
## 2 full ... [GaussianMixture(n_components=4, n_init=5, ran...
## 3 full ... [GaussianMixture(n_components=5, n_init=5, ran...
## 4 full ... [GaussianMixture(n_components=6, n_init=5, ran...
##
## [5 rows x 3 columns]
```

**Note**: To get our cross validated fits, we iterated over `param_grid`

and stored results in `model_grid`

. It’s essential that the row indices match up between these objects. Here they match by construction, but be careful if you start to play with this structure. For example, one alternative (and frequently convenient approach) here is to create a `model_grid`

object based on `itertools.product(param_grid, resamples)`

rather than just `param_grid`

. This avoids nesting lists in list-columns, at the cost of inefficiency in terms of storage. This route is more fiddly than it looks.

Anyway, now that we have training fits, we want to compute out of sample performance estimates. In our case, we’ll use a measure of clustering quality known at the Adjusted Rand Score. Again we use nested list comprehensions to get out of sample estimates for all CV splits.

```
from sklearn.metrics import adjusted_rand_score
def oos_vfold_ars(
folds: VFoldCV,
fits: List[GaussianMixture],
features : List[chr]
) -> List[float]:
ars = [
adjusted_rand_score(
test.species.astype('category').values.codes,
fit.predict(test[features])
) for (_, test), fit in zip(folds, fits)
]
return ars
model_grid['cv_ars'] = [
oos_vfold_ars(
resamples,
fits,
features=clustering_features
)
for fits in model_grid.cv_fits
]
model_grid.head()
```

```
## covariance_type ... cv_ars
## 0 full ... [0.704269194758452, 0.6393736747675746, 0.6868...
## 1 full ... [0.7636755414797662, 0.9203075242889042, 0.556...
## 2 full ... [0.8070440985726826, 0.6689267916052503, 0.865...
## 3 full ... [0.7427371407981262, 0.6771584099431122, 0.384...
## 4 full ... [0.631159776690568, 0.6329954381285452, 0.6960...
##
## [5 rows x 4 columns]
```

Now we can compare models by expanding the `cv_ars`

column and comparing out of sample performance measures. Here I’m just going to visualize the results, but you could also fit a mixed model of the form `adjusted_rand_index ~ hyperparameter_combination + (1|cv_fold)`

or something along those lines if you want to be fancier.

It’s worth pausing a moment to comment on the `cv_ars`

column. In my previous post, I introduced a helper function `unnest()`

that you can use to expand a list-column that contains data frames. That `unnest()`

function does not work with list-columns of lists, and instead we use the `pandas.Series.explode()`

method, which is like an extremely limited version of `tidyr::unnest()`

. Importantly, `pandas.Series.explode()`

is not very clever about types, so you may need to coerce types after expanding list columns, as I do below.

```
cross_validated_ars = (
model_grid
.explode('cv_ars')
)
print(cross_validated_ars.dtypes, '\n')
```

```
## covariance_type object
## n_components int64
## cv_fits object
## cv_ars object
## dtype: object
```

```
cross_validated_ars["cv_ars"] = pd.to_numeric(cross_validated_ars["cv_ars"])
print(cross_validated_ars.dtypes)
```

```
## covariance_type object
## n_components int64
## cv_fits object
## cv_ars float64
## dtype: object
```

Now that we have numeric data we can plot it.

```
import seaborn as sns
import matplotlib.pyplot as plt
plot = sns.lineplot(
data=cross_validated_ars,
x='n_components', y='cv_ars',
hue='covariance_type', ci=None
)
plt.show()
```

Higher Adjusted Rand Scores are better, so this initial pass of modeling suggests that we should use three clusters. This is reassuring, since we know there are three species of penguin in the dataset!

Note that I’m taking a very informal approach to model selection here and a real analysis should be more careful (in slightly more detail: the difference between CV for model selection and CV for risk estimation is germane here).

There are other approaches to model selection that we could take. For example, we could compare across hyperparameters with in-sample BIC, which is the approach taken in the R package `mclust`

. We’ll do this briefly for illustrative purposes, and start incorporating some fancier tools along the way:

**Parallelization**(via`joblib`

): Modeling fitting across hyperparameter is embarassingly parallel, so this will make things faster.**Exception handling**(via a function decorator): In practice, lots of estimators will run into numerical or data issues that you can safely ignore. In particular, when model fitting fails, it’s okay fine to just return a`np.nan`

(in R, an`NA`

-like object) and use the results from whatever estimators did succeed, propagating the`np.nan`

forward.**Caching**(via`joblib`

): In interactive workflows it’s easy to kill your Jupyter kernel or crash your computer or do something dumb that forces you to restart your Python session. It is really frustrating to have to refit your models everytime this happens. Caching models to disk (also known as memoization) prevents you from having to re-run all your models everytime you break Jupyter, etc, etc. If you use caching, you will have to be careful about cache invalidation, which is one of the most notoriously difficult problems in computing.

Full disclosure: I did not get parallelization and caching to work together for this post (bug report), but it has worked for me in the past. I’m going to include some commented out caching code in the following for you to play with as you see fit. See the `joblib`

documentation for more examples of how to combine parallel mapping with caching.

```
from joblib import Parallel, delayed # parallelism
from functools import wraps # nicer exception handling
from joblib import Memory # caching
# setup caching
cache_dir = "./model_cache"
memory = Memory(cache_dir, verbose=0)
# NOTE: here `n_components` will exceed the number
# of observations for some hyperparameter combinations
# which will cause errors. here i'm artificially introducing
# errors; in real life you'll have to supply your own
fancy_param_grid = ParameterGrid({
'n_components': range(2, 400),
'covariance_type': ['full', 'tied', 'diag', 'spherical']
})
fancy_model_grid = pd.DataFrame(fancy_param_grid)
fancy_model_grid.tail()
```

```
## covariance_type n_components
## 1587 spherical 395
## 1588 spherical 396
## 1589 spherical 397
## 1590 spherical 398
## 1591 spherical 399
```

Now we define our function decorator for handling exceptions in list comprehensions. I’m basically copying `purrr::safely()`

from R here. Exception handling is essential because it’s painful when you fit 9 out of 10 models using a list-comprehension, and then the final model fails and you have to re-run the first 9 models (caching also helps with this when you can get it to work).

```
def safely(func, otherwise=np.nan):
# traditionally you'd use @wraps here,
# but this seems to interact poorly with parallelism
# for the time being
def wrapper(*args, **kwargs):
try:
return func(*args, **kwargs)
except:
return otherwise
return wrapper
# if bad models become np.nan, downstream calls that use models
# need to handle np.nan input. any easy way to do this is use
# @safely again to just continue propagate np.nan
@safely
def get_bic(fit):
return fit.bic(penguins[clustering_features])
fit_safely = safely(fit_gmm)
# to combine parallelism and caching, use something like:
# fit_cached = memory.cache(fit_safely)
# create persistent pool of workers and use them for fitting
with Parallel(n_jobs=20) as parallel:
fancy_model_grid['fit'] = parallel(
delayed(fit_safely)(penguins, clustering_features, **hyperparameters)
for hyperparameters in fancy_param_grid
)
fancy_model_grid['bic'] = parallel(
delayed(get_bic)(fit) for fit in fancy_model_grid.fit
)
```

Now we can compare BIC across our absurdly large number of models. `seaborn`

will automatically drop `np.nan()`

values, so our `@safely`

approach plays very nice here.

```
plot = sns.lineplot(
data=fancy_model_grid,
x='n_components', y='bic', hue='covariance_type'
)
plt.show()
```

In this plot lower BIC is better. We see than this in-sample approach to model selection prefers models with as many clusters as observations. This would clearly correspond to overfitting in our case.

## Returning to the big picture

At this point, you might be starting to wonderful why you would want to use a many-models workflow at all. All I’ve done in this blog post is some grid searching over hyperparameters, and we could easily recreate everything in this blog post with `GridSearchCV`

and some custom scoring functions.

The problem is that `GridSeachCV`

(and other related implementations) are not that flexible. Last summer, I had (1) custom data splitting, (2) custom estimators, and (3) wanted to compute high dimensional summaries for each model I fit. And I want to save my fits so that I could investigate them individually, rather than throwing them out as soon as I know about their predictive performance. `GridSearchCV`

just can’t handle this, and, by and large, the data science ecosystem in Python doesn’t either.

We can imagine that there are two contrasting modeling workflows. First, there’s a many-models workflow, which is especially appropriate for research, inference and sensitivity analysis. It’s interactive, and not particularly focused on compute efficiency. Then, there’s a hyperparameter tuning workflow, which has a simple goal: predict well. Tools for tuning workflows are typically developed by machine learners who want to train models as computationally efficiently as possible. Because these practitioners emphasize prediction accuracy over all else, it can be hard to re-purpose tools for tuning workflows to learn about models beyond their predictive accuracy^{1}.

Hopefully this post highlights some design patterns you can use when existing infrastructure isn’t a good fit for your Python modeling needs. I’m very curious to hear about other approaches people take this problem.

**Aside about**: Early work in the`tidymodels`

`tidymodels`

ecosystem focused on low level infrastructure that facilitated many-models workflows. I still use this infrastructure a lot, especially combined with`targets`

, which is a`make`

variant for R that plays nicely with modeling workflows, and`tidymodels`

inspired much of the approach I took in the post above. However, current work in the`tidymodels`

ecosystem focuses on high level infrastructure for tuning hyperparameters in predictive modeling scenarios. This mixture of general purpose low level infrastructure and more prediction specific high level infrastructure leads to some interesting discussions like this one, where someone asks how to use`tidymodels`

to analyze designed experiments, which`tidymodels`

doesn’t really provide any tools for.↩︎