ProbableOdyssey

Navigating Metric Trade-offs in Machine Learning

When building effective machine learning systems, it’s rarely as straightforward as “pick the model that maximises a metric”. Model performance has multiple aspects, and thus several metrics to track; trade-offs and compromises are unavoidable.

For example, say we have trained a classifier model and we’re now trying to decide the threshold at which the model predicts positive samples. If we set the threshold too high, we will have poor recall because almost everything would be predicted as negative. Set the threshold too low, and the precision tanks instead because it labels almost everything as positive.
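To make the trade-off concrete, here is a minimal sketch that sweeps the decision threshold over some made-up scores and labels and reports precision and recall at each setting:

```python
# Toy data (made up for illustration): model scores and true labels.
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

def precision_recall(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum(not p and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.9, 0.5, 0.2):
    p, r = precision_recall(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

At the strict threshold of 0.9 precision is perfect but half the positives are missed; at 0.2 every positive is caught but precision drops, exactly the tension described above.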

Much like Heisenberg’s uncertainty principle, a classifier can often achieve good precision, or good recall — but not both at once.

As Andrew Ng eloquently points out in Machine Learning Yearning, successful ML often involves a strategic balancing act. His approach suggests defining:

- a single optimizing metric that we try to maximise, and
- one or more satisficing metrics that only need to clear an acceptable threshold.

This multi-metric approach ensures our models are performant, as well as responsible and practical.
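One way this balancing act can be encoded in code is to maximise a primary metric subject to floors on secondary ones. A sketch, with hypothetical candidate models and made-up numbers:

```python
# Hypothetical candidate models, each summarised by validation metrics.
candidates = {
    "model_a": {"precision": 0.95, "recall": 0.60},
    "model_b": {"precision": 0.91, "recall": 0.72},
    "model_c": {"precision": 0.85, "recall": 0.88},  # fails a 0.9 precision floor
}

def pick_model(candidates, min_precision=0.9):
    """Maximise recall, subject to precision clearing a minimum threshold."""
    feasible = {name: m for name, m in candidates.items()
                if m["precision"] >= min_precision}
    return max(feasible, key=lambda name: feasible[name]["recall"])

print(pick_model(candidates))  # model_b
```

Note that model_c has the best recall overall, but it never enters the comparison: the precision floor filters it out before recall is even considered.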

Consider a medical diagnostic classifier designed to detect a rare but critical disease. Our primary metric is maximizing recall — ensuring we identify as many true positive cases as possible to avoid missed diagnoses.

A crucial secondary constraint might be maintaining precision above 0.9, meaning fewer than 10% of our positive predictions are actually false alarms, thereby reducing unnecessary patient anxiety and follow-up procedures. Now, imagine we find that by slightly loosening our precision constraint to 0.85, our recall jumps by an impressive 15%.
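A quick sketch of how such a discovery might surface, again with made-up validation scores: sweep the decision threshold and record the best recall achievable under each precision floor. (The exact numbers here are illustrative, not the 15% figure above.)

```python
# Made-up validation scores and true labels for illustration.
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55]
labels = [1,    1,    1,    0,    1,    1,    1,    1,    0]

def best_recall(min_precision):
    """Best recall over all thresholds whose precision clears the floor."""
    best = 0.0
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum(not p and y for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision >= min_precision:
            best = max(best, recall)
    return best

print(f"precision >= 0.90 -> best recall {best_recall(0.90):.2f}")
print(f"precision >= 0.85 -> best recall {best_recall(0.85):.2f}")
```

On this toy data, the small relaxation of the precision floor unlocks a large recall gain, which is exactly the kind of finding worth surfacing before locking in a constraint.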

This isn’t a failure; it’s a strategic opportunity. The key is to intentionally and thoughtfully make these compromises, understanding the implications of each shift.

The goal isn’t to optimize every individual metric to its absolute peak — this is often an impossible task. Instead, we focus on crafting a system that delivers the most value within defined boundaries.
