
Better config files for python ML experiments

ML experiments essentially boil down to a single script, for example:

# ml_experiment.py
#
# Executed with `python ml_experiment.py`
#

def main():
    dataloader = ...
    model = ...
    loss_function = ...
    optimizer = ...
    n_epochs = ...

    losses = []
    for epoch in range(n_epochs):
        epoch_loss = 0.0
        for x, y in dataloader:
            y_hat = model(x)
            loss = loss_function(y_hat, y)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            epoch_loss += float(loss)
        losses.append(epoch_loss)

    ...

if __name__ == "__main__":
    main()

This works for a single experiment, but in development we always run multiple experiments. Copying and pasting all the common iteration and optimisation code across every experiment is extremely error prone. Instead we should extract all the common training logic into a single script and specify the experiment-specific settings (i.e. dataloader, model, loss function, etc.) in a config file that we can load:

# train.py
#
# Executed with `python train.py <CONFIG_PATH>`
#
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("config_path", type=str)
    return parser.parse_args()

def main():
    args = parse_args()
    config_path = args.config_path

    dataloader = load_dataloader_from_config(config_path)
    model = load_model_from_config(config_path)
    loss_function = load_loss_function_from_config(config_path)
    ...

if __name__ == "__main__":
    main()

We could use a format such as JSON or YAML, but this introduces new problems: every value now has to be parsed, validated, and mapped back onto the Python objects it describes.

Who seriously wants to write a bunch of convoluted parsing code?! I’ve written a few of these interpreters and seen a couple of implementations out in the wild, but it’s never sat well with me. I would much rather focus on the machine learning code that actually matters!
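
To make the pain concrete, here’s a minimal sketch of the interpreter layer a YAML config tends to demand. Everything in it is hypothetical (the registry, the config keys, the MLP stand-in class); the point is that every new setting needs matching parsing and validation code:

# yaml_config.py
#
# Hypothetical sketch of a YAML config interpreter: a string-to-class
# registry, manual field extraction, and hand-rolled validation.
# And this only covers the model; the dataloader, loss function, and
# optimizer each need the same treatment.
import yaml  # PyYAML


class MLP:  # stand-in for a real model class
    def __init__(self, hidden_size: int = 128):
        self.hidden_size = hidden_size


MODEL_REGISTRY = {"mlp": MLP}


def load_model_from_config(config_path: str) -> MLP:
    with open(config_path) as f:
        raw = yaml.safe_load(f)

    model_cfg = raw["model"]
    name = model_cfg["name"]
    if name not in MODEL_REGISTRY:
        raise ValueError(f"unknown model {name!r}")
    return MODEL_REGISTRY[name](**model_cfg.get("kwargs", {}))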

Why don’t we skip the config interpreters, and just write our configs as python scripts? A couple of upsides:

- There is no parsing or validation layer to write or maintain: the config is executed, not interpreted.
- The config builds real Python objects directly, with the full language (imports, functions, comments) at its disposal.

But the downside is that we’d have to hardcode the imports at the top of the file:

# train.py
#
# Executed with `python train.py`
#

# Question: do I have to hardcode this?
# What if I want to select a different config?
# Also python imports can be a bit annoying sometimes.
from ml_experiment.config import (
    dataloader,
    model,
    loss_function,
    optimizer,
    n_epochs,
)

def main():
    ...

if __name__ == "__main__":
    main()

But there’s a way around hardcoding the imports, using the built-in machinery of importlib. I learned this snippet at one of my last jobs (thanks to Tim Esler and David Norrish):

# src/utils.py
#
from importlib.util import module_from_spec, spec_from_file_location
from pathlib import Path
from types import ModuleType


def import_script_as_module(script_path: str | Path) -> ModuleType:
    """Execute a python script and return a module whose attributes are the script's top-level symbols.

    Args:
        script_path: Path to a python script.
    """
    config_path = Path(script_path)
    name = config_path.stem
    spec = spec_from_file_location(name, config_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Could not load a module spec from {config_path}")
    module = module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

This is effectively a way to dynamically load a .py file as if it were a normal module, without needing it in sys.path.

How it works:

- spec_from_file_location builds a module "spec" (the import-system metadata) directly from a file path, instead of searching sys.path for a package.
- module_from_spec creates a fresh, empty module object from that spec.
- spec.loader.exec_module runs the script's code inside the new module's namespace, so every top-level name the script defines becomes an attribute of the returned module.
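
As a quick usage example (paths and names hypothetical), the config can live anywhere on disk; it doesn’t need to be part of a package or importable from sys.path:

# Hypothetical path; the config is just a file on disk.
config = import_script_as_module("experiments/baseline/config.py")
print(config.n_epochs)  # any top-level name in the script is an attribute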

So now I can write my experiment runner as:

# train.py
#
# Executed with `python train.py <CONFIG_PATH>`
#
import argparse

from src.utils import import_script_as_module

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("config_path", type=str)
    return parser.parse_args()

def main():
    args = parse_args()
    config = import_script_as_module(args.config_path)

    dataloader = config.dataloader
    model = config.model
    loss_function = config.loss_function
    ...

if __name__ == "__main__":
    main()

And my config file can be written as a simple, modular Python script:

# config.py
#
# Imports for MyModel, Dataset, DataLoader, etc. go at the top as usual.

model = MyModel(arg1, arg2, ...)

transforms = [ ... ]
dataset = Dataset(transforms, ...)
dataloader = DataLoader(dataset, ...)

...
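
And because a config is just Python, experiments can share settings through ordinary imports. A hypothetical sketch, assuming train.py is run from the project root and the experiments live in an importable configs/ package:

# configs/small_model.py
#
# Hypothetical experiment config, executed with:
#   python train.py configs/small_model.py
#
# Shared pieces come from a plain helper module (configs/common.py),
# and the model class from wherever it lives in the project.
from configs.common import dataloader, loss_function, n_epochs
from my_project.models import MyModel

model = MyModel(hidden_size=64)  # the only thing this experiment changes

Each new experiment is then just another small file under configs/, and train.py never changes.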

Taking this approach, we end up with an experiment setup that is much more interpretable and maintainable.

And I don’t have to write a single line of YAML, which is always a win.
