Better config files for Python ML experiments
ML experiments essentially boil down to a single script, for example:
```python
# ml_experiment.py
#
# Executed with `python ml_experiment.py`
#

def main():
    dataloader = ...
    model = ...
    loss_function = ...
    optimizer = ...
    n_epochs = ...

    losses = []
    for epoch in range(n_epochs):
        epoch_loss = 0.0
        for x, y in dataloader:
            y_hat = model(x)
            loss = loss_function(y_hat, y)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            epoch_loss += float(loss)
        losses.append(epoch_loss)

    ...

if __name__ == "__main__":
    main()
```
This works for a single experiment, but in development we almost always run multiple experiments. We could copy and paste all the common iteration and optimisation code across experiments, but that is extremely error prone. Instead, we should extract the common training logic into a single script and specify the experiment-specific settings (i.e. dataloader, model, loss function, etc.) in a config file that we can load:
```python
# train.py
#
# Executed with `python train.py <CONFIG_PATH>`
#
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("config_path", type=str)
    return parser.parse_args()


def main():
    args = parse_args()
    config_path = args.config_path

    dataloader = load_dataloader_from_config(config_path)
    model = load_model_from_config(config_path)
    loss_function = load_loss_function_from_config(config_path)
    ...


if __name__ == "__main__":
    main()
```
We could use a format such as JSON or YAML, but this introduces new problems:
- How can we effectively select between different Python objects (i.e. different models with different args and kwargs) without writing horrifically ugly and error-prone `load_from_*` functions?
Who seriously wants to write a bunch of convoluted parsing code?! I’ve written a few of these and seen a couple of implementations out in the wild, but it’s never sat well with me — I would much rather focus on the actually important machine learning code!
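To make the pain concrete, here is a sketch of the kind of loader a JSON config forces on you. Everything here is hypothetical: the registry, the model classes, and the config layout are illustrative stand-ins, not code from a real project.

```python
import json

# Illustrative stand-in model classes; a real project would import these.
class MLP:
    def __init__(self, hidden_size: int = 128): ...

class ConvNet:
    def __init__(self, channels: int = 32): ...

# Every selectable object needs its own registry entry...
MODEL_REGISTRY = {"mlp": MLP, "conv_net": ConvNet}

def load_model_from_config(config_path: str):
    with open(config_path) as f:
        config = json.load(f)
    model_config = config["model"]
    model_cls = MODEL_REGISTRY[model_config["name"]]
    # ...and every argument has to round-trip through plain JSON types.
    return model_cls(**model_config.get("kwargs", {}))
```

And that is just the model; the dataloader, loss function, and optimizer would each need the same treatment.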
Why don’t we skip the config interpreters and just write our configs as Python scripts? A few upsides:
- No convoluted parsers to write
- Natural and interpretable parameter and object definition
- Ability to write more bespoke object initialisation for certain experiments (see the sketch below)
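For instance, because the config is just Python, it can compute derived values and contain experiment-specific logic directly. A hypothetical sketch (all names and values here are illustrative):

```python
# config_big_batch.py (hypothetical)
from pathlib import Path

batch_size = 256
learning_rate = 3e-4

# Derived values are plain expressions, not strings for a parser to interpret:
n_epochs = 30 if batch_size <= 64 else 15
checkpoint_dir = Path("checkpoints") / f"bs{batch_size}_lr{learning_rate}"
```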
But the downside is that we’d have to hardcode the imports at the top of the training script:
```python
# train.py
#
# Executed with `python train.py`
#

# Question: do I have to hardcode this?
# What if I want to select a different config?
# Also python imports can be a bit annoying sometimes.
from ml_experiment.config import (
    dataloader,
    model,
    loss_function,
    optimizer,
    n_epochs,
)


def main():
    ...


if __name__ == "__main__":
    main()
```
But there’s a way around hardcoding the imports, using the built-in machinery of `importlib`. I learned this snippet at a previous job (thanks to Tim Esler and David Norrish):
```python
# src/utils.py
#
from importlib.util import module_from_spec, spec_from_file_location
from pathlib import Path
from types import ModuleType


def import_script_as_module(script_path: str | Path) -> ModuleType:
    """Execute a python script and return a module whose attributes are the script's symbols.

    Args:
        script_path: Path to a python script.
    """
    script_path = Path(script_path)
    spec = spec_from_file_location(script_path.stem, script_path)
    module = module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```
This is effectively a way to dynamically load a `.py` file as if it were a normal module, without needing it in `sys.path`.
How it works:
- `spec_from_file_location(name, location)`
  - Creates a `ModuleSpec` object that describes how to import a module from the given file path.
  - Tells Python where the module is and what its name should be.
- `module_from_spec(spec)`
  - Creates a new, empty module object based on the `ModuleSpec`.
  - Prepares a module object that can later be populated by executing the actual code.
- `spec.loader.exec_module(module)`
  - Loads and executes the code from the given file into the module object created above.
  - Actually runs the script, populating the module’s namespace with the variables, functions, classes, etc. defined in it.
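Put together, the three steps look like this. A self-contained sketch: `my_config.py` and `n_epochs` are made up for the demonstration, and the file is written out first so the snippet runs as-is.

```python
from importlib.util import module_from_spec, spec_from_file_location

# Create a trivial config file so the demo is runnable.
with open("my_config.py", "w") as f:
    f.write("n_epochs = 10\n")

spec = spec_from_file_location("my_config", "my_config.py")  # 1. describe the module
module = module_from_spec(spec)                              # 2. create an empty module object
spec.loader.exec_module(module)                              # 3. execute the file into it

print(module.n_epochs)  # -> 10
```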
So now I can write my experiment runner as:
```python
# train.py
#
# Executed with `python train.py <CONFIG_PATH>`
#
import argparse

from src.utils import import_script_as_module


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("config_path", type=str)
    return parser.parse_args()


def main():
    args = parse_args()
    config = import_script_as_module(args.config_path)

    dataloader = config.dataloader
    model = config.model
    loss_function = config.loss_function
    ...


if __name__ == "__main__":
    main()
```
And my config file can be written as a simple, modular Python script:
```python
# config.py
#
model = MyModel(arg1, arg2, ...)

transforms = [ ... ]
dataset = Dataset(transforms, ...)
dataloader = DataLoader(dataset, ...)

...
```
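Running a new experiment is now just a matter of pointing the runner at a different config, e.g. `python train.py configs/baseline.py` versus `python train.py configs/bigger_model.py` (file names illustrative).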
Taking this approach, we end up with an experiment setup that is much more interpretable and maintainable.
And I don’t have to write a single line of YAML, which is always a win.