Better config files for python ML experiments
ML experiments essentially boil down to a single script, for example:
# ml_experiment.py
#
# Executed with `python ml_experiment.py`
#
def main():
    dataloader = ...
    model = ...
    loss_function = ...
    optimizer = ...
    n_epochs = ...

    losses = []
    for epoch in range(n_epochs):
        epoch_error = 0.0
        for x, y in dataloader:
            y_hat = model(x)
            loss = loss_function(y_hat, y)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            epoch_error += float(loss)
        losses.append(epoch_error)
    ...

if __name__ == "__main__":
    main()
This works for a single experiment, but during development we almost always run multiple experiments. We could copy and paste all the common iteration and optimisation code across every experiment, but this is extremely error prone. Instead, we should extract all the common training logic into a single script and specify the experiment-specific settings (i.e. dataloader, model, loss function, etc.) in a config file that we can load:
# train.py
#
# Executed with `python train.py <CONFIG_PATH>`
#
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("config_path", type=str)
    return parser.parse_args()

def main():
    args = parse_args()
    config_path = args.config_path
    dataloader = load_dataloader_from_config(config_path)
    model = load_model_from_config(config_path)
    loss_function = load_loss_function_from_config(config_path)
    ...

if __name__ == "__main__":
    main()
We could use a format such as JSON or YAML, but this introduces new problems:
- How can we effectively select between different python objects (i.e. different models with different args and kwargs) without accidentally writing horrifically ugly and error-prone load_from_* functions?
Who seriously wants to write a bunch of convoluted parsing code?! I’ve written a few of these and seen a couple of implementations out in the wild, but it’s never sat well with me — I would much rather focus on the actually important machine learning code!
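For a taste of what that parsing code looks like, here’s a minimal sketch of one such loader, assuming PyTorch and PyYAML (the config schema and LOSS_REGISTRY are hypothetical, just for illustration):

# config_loaders.py
#
# A sketch of one of the load_from_* functions a YAML config forces on us.
# Assumes PyTorch and PyYAML; the config keys and registry are made up.
import yaml
from torch import nn

# Every selectable object needs a registry like this (or an ugly if/elif chain).
LOSS_REGISTRY = {
    "mse": nn.MSELoss,
    "cross_entropy": nn.CrossEntropyLoss,
}

def load_loss_function_from_config(config_path: str) -> nn.Module:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    loss_config = config["loss_function"]
    loss_class = LOSS_REGISTRY[loss_config["name"]]
    # New losses or new keyword arguments mean touching this file again,
    # and we still need similar loaders for the model, dataloader, optimizer, ...
    return loss_class(**loss_config.get("kwargs", {}))

And that’s the well-behaved case: nested objects (a model that takes a backbone, a dataloader that takes a dataset that takes transforms) make these loaders considerably worse.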
Why don’t we skip the config interpreters, and just write our configs as python scripts? A couple of upsides:
- No convoluted parsers to write
- Natural and interpretable parameter and object definition
- Ability to write more bespoke object initialisation for certain experiments
But the downside is that we’d have to hardcode the imports at the top of the file:
# train.py
#
# Executed with `python train.py`
#
# Question: do I have to hardcode this?
# What if I want to select a different config?
# Also python imports can be a bit annoying sometimes.
from ml_experiment.config import (
    dataloader,
    model,
    loss_function,
    optimizer,
    n_epochs,
)

def main():
    ...

if __name__ == "__main__":
    main()
But there’s a way around hardcoding the imports by using the built-in methods of importlib. I learned this snippet at one of my last jobs (thanks to Tim Esler and David Norrish):
# src/utils.py
#
from importlib.util import module_from_spec, spec_from_file_location
from pathlib import Path
from types import ModuleType

def import_script_as_module(script_path: str | Path) -> ModuleType:
    """Execute a python script and return a module object whose attributes
    are the script's top-level symbols.

    Args:
        script_path: Path to a python script.
    """
    script_path = Path(script_path)
    spec = spec_from_file_location(script_path.stem, script_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Could not load a module spec from {script_path}")
    module = module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
This is effectively a way to dynamically load a .py file as if it were a normal module, without needing it in sys.path.
How it works:

spec_from_file_location(name, location)
- Creates a ModuleSpec object that describes how to import a module from the given file path.
- Tells Python where the module is and what its name should be.

module_from_spec(spec)
- Creates a new, empty module object based on the ModuleSpec.
- Prepares a module object that can later be populated by executing the actual code.

spec.loader.exec_module(module)
- Loads and executes the code from the given file into the module object created above.
- Actually runs the script, populating the module’s namespace with the variables, functions, classes, etc. defined in it.
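As a quick sanity check, here’s what loading a trivial script looks like (the file name and contents are made up for illustration); every top-level name in the script becomes an attribute on the returned module:

# configs/toy.py
#
greeting = "hello"
n_epochs = 10

>>> from src.utils import import_script_as_module
>>> config = import_script_as_module("configs/toy.py")
>>> config.greeting
'hello'
>>> config.n_epochs
10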
So now I can write my experiment runner as:
# train.py
#
# Executed with `python train.py <CONFIG_PATH>`
#
import argparse

from src.utils import import_script_as_module

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("config_path", type=str)
    return parser.parse_args()

def main():
    args = parse_args()
    config = import_script_as_module(args.config_path)
    dataloader = config.dataloader
    model = config.model
    loss_function = config.loss_function
    ...

if __name__ == "__main__":
    main()
And my config file can be written as a simple modularisable Python script:
# config.py
#
model = MyModel(arg1, arg2, ...)
transforms = [ ... ]
dataset = Dataset(transforms, ...)
dataloader = DataLoader(dataset, ...)
...
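For instance, a complete config for a toy experiment might look something like this. This is a sketch assuming PyTorch; TinyModel and the random dataset are stand-ins for real components:

# configs/baseline.py
#
# A sketch of a concrete config, assuming PyTorch.
# TinyModel and the random TensorDataset are stand-ins for real components.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class TinyModel(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

dataset = TensorDataset(torch.randn(128, 4), torch.randn(128, 1))
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
model = TinyModel(in_features=4, out_features=1)
loss_function = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
n_epochs = 10

Since it’s ordinary Python, the bespoke parts (custom transforms, weight initialisation, and so on) can live right next to the plain parameters, and running python train.py configs/baseline.py trains with exactly these objects.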
Taking this approach, we end up with an experiment setup that is much more interpretable and maintainable.
And I don’t have to write a single line of YAML, which is always a win.