Extreme Programming for ML Teams: Faster Delivery, Reliable Results
Despite the overlap between software engineering and data science, the best practices of one often falter when applied directly to the other. Agile rituals that work for feature-driven software projects can feel clumsy when applied to ML. In software, progress can be sliced into neat features shipped every week or two. In data science, progress is murkier: projects take longer, discovery work is essential, and failed experiments aren’t setbacks; they’re the process itself.
Agile frameworks were developed for software because traditional “waterfall” workflows borrowed from engineering could not keep pace with the field’s uncertainty. Data science teams face the same problem, but with different constraints. The lesson carries over: collaboration and knowledge sharing are the countermeasures to uncertainty, enabling teams to learn quickly, embrace creativity, and move toward faster, more reliable progress.
That’s why a recent post by Jacob Clark from Hyperact (Should we revisit XP in the age of AI?) caught my attention. Jacob makes a compelling argument for bringing “extreme programming” (XP) back to the forefront of agile software development today, and reading it I noticed that the tenets of XP map exactly onto what I’ve observed makes ML teams successful.
What is XP?
XP is an agile software development methodology focused on improving software quality and responsiveness to changing requirements. Introduced by Kent Beck in 1996, its core idea is to take good development practices to the “extreme”. It is defined by five core values:
- Communication: constant, clear communication between team members and stakeholders.
- Simplicity: build the simplest solution that works; avoid over-engineering.
- Feedback: use frequent testing and customer involvement to get rapid feedback.
- Courage: be willing to change direction, refactor, or throw away code when needed.
- Respect: everyone’s contributions are valued; sustainable teamwork.
XP prescribes a set of engineering practices, many of which are now mainstream agile techniques:
- Collaboration & Team Practices
  - Pair Programming – two developers work together at one workstation.
  - Collective Code Ownership – anyone can change any part of the code.
  - Coding Standards – shared style to keep the codebase consistent.
- Code Quality & Engineering Discipline
  - Test-Driven Development (TDD) – write automated tests before writing code.
  - Refactoring – improve code structure continuously without changing behavior.
  - Simple Design – design for current needs, not speculative future requirements.
- Process & Delivery Practices
  - Continuous Integration (CI) – integrate and test code frequently (multiple times a day).
  - Small Releases – deliver working software frequently (days to weeks).
  - Sustainable Pace (“40-hour week”) – avoid burnout; maintain long-term productivity.
  - On-Site Customer – a customer representative works closely with the team to clarify requirements.
In practice, development happens in short one- to two-week iterations, and tasks are derived from user stories, which keep requirements and user needs in full focus. Importantly, planning for each task is adaptive: scope can change, but time boxes are fixed.
This workflow is very responsive to changing requirements, increases collaboration and knowledge sharing, and encourages high code quality through TDD, CI, and refactoring.
However, XP is not well suited to large, distributed teams. It assumes that regular time with a customer representative is easy to arrange, and it appears to carry a high up-front cost to productivity. These are valid concerns, but I push back strongly on the perceived buy-in cost: the heart of this workflow is not only sufficient for fast and reliable development, it’s absolutely necessary!
Adapting XP for Data Science
Some of these principles work well in data science without modification, but others need re-interpretation to be applied effectively in ML teams.
Where XP Fits Well:
- Continuous Integration (CI)
  - Automatically run evaluation suites and ensure reproducibility
  - Automatically re-train models
  - Combine code and data pipelines into CI/CD
- Test-Driven Development (TDD)
  - Hard to apply directly, but you can do data-driven TDD (see the evaluation-test sketch after this list):
    - Write evaluation metrics/tests before building the model.
    - Define expected performance thresholds (e.g., “F1 ≥ 0.8 on holdout data”).
- Refactoring & Simple Design
  - Keep pipelines and feature engineering code clean and modular
  - Use dependency injection patterns when writing code for models and pipelines (see the dependency-injection sketch after this list)
  - Simpler code is more interpretable code, and thus much less of a liability!
  - Use comments judiciously to help with knowledge transfer
    - “code tells you how, comments tell you why”
  - Avoid premature scaling
    - (i.e., don’t over-engineer a data pipeline for 1TB when you have 10GB)
- Pair Programming and Collective Code Ownership
  - Pair on model debugging, feature engineering, or reproducibility reviews
  - Helps avoid “hero research code” that no one else understands
  - Antidote for “silo-ing” of individual team members on long-running projects
  - Significantly de-risks a project by effortlessly distributing knowledge
  - Document datasets and model assumptions, not just code!
- Sustainable Pace
  - Especially relevant: ML/AI projects often push people into long “experiment binge” cycles
  - XP’s principle of sustainable pace keeps teams fresh and effective
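To make data-driven TDD concrete (and to show the kind of evaluation suite CI can run on every change), here is a minimal sketch of a test written before the model exists. The helper modules (`project.data`, `project.model`) and the 0.8 threshold are hypothetical placeholders, not an existing API; the point is that the success criterion becomes an executable artifact.

```python
# test_model_eval.py: written before the model is built (data-driven TDD).
# `project.data.load_holdout` and `project.model.load_latest` are hypothetical
# helpers assumed to return (features, labels) and a fitted scikit-learn-style
# model; swap in whatever your project actually exposes.
from sklearn.metrics import f1_score

from project.data import load_holdout
from project.model import load_latest

F1_THRESHOLD = 0.8  # agreed with stakeholders before modelling starts


def test_model_meets_f1_threshold():
    X_holdout, y_holdout = load_holdout()
    model = load_latest()
    predictions = model.predict(X_holdout)
    f1 = f1_score(y_holdout, predictions)
    assert f1 >= F1_THRESHOLD, f"F1 {f1:.3f} is below the agreed threshold {F1_THRESHOLD}"
```

The test fails until a model clears the bar, which is exactly what we want: CI can enforce the agreed threshold from day one.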
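Similarly, “use dependency injection” can feel abstract, so here is a minimal sketch of what it might look like for a training pipeline, assuming scikit-learn-style components. The `TrainingPipeline` class is illustrative, not a library API.

```python
# A minimal dependency-injection sketch: the pipeline receives its preprocessor
# and estimator instead of constructing them internally, so experiments can
# swap components without editing the pipeline code.
from dataclasses import dataclass

from sklearn.base import BaseEstimator, TransformerMixin


@dataclass
class TrainingPipeline:
    preprocessor: TransformerMixin
    estimator: BaseEstimator

    def fit(self, X, y):
        X_transformed = self.preprocessor.fit_transform(X, y)
        self.estimator.fit(X_transformed, y)
        return self

    def predict(self, X):
        return self.estimator.predict(self.preprocessor.transform(X))


# Usage: inject whatever this week's experiment calls for, e.g.
# TrainingPipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
```

Because the pipeline never constructs its own components, swapping a baseline for a more complex model (or a mock in a test) is a one-line change.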
Where XP Needs Adaptation:
- Small Releases
  - Software: a new feature every week
  - ML: a new model version every week may not be realistic. Instead:
    - Release dashboards, experiments, and prototypes in small increments
    - Treat notebooks, reports, or trained baselines as “releases”
- On-Site Customer
  - In ML, “the customer” often doesn’t know what’s feasible until we try
  - Adaptation: keep close contact with domain experts and stakeholders, but frame it as joint discovery.
Where XP Breaks Down:
- TDD on research code
  - Writing tests before experimentation doesn’t make sense when the science is unknown
  - Better: validate afterwards by writing reproducibility tests and data quality checks.
  - These checks are easy to add in notebooks after ingesting and transforming data (see the sketch after this list):
    - Verify column names in DataFrames
    - Check that the summary statistics of columns match what’s expected
    - Validate that data types and schemas match what’s expected
- Simplicity principle vs. experimental complexity
  - Sometimes you do need to try a more complex model just to test a hypothesis
  - Simplicity can’t always be enforced upfront; it comes later, after pruning ideas
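Here is a minimal sketch of what those after-the-fact notebook checks might look like, assuming the freshly transformed data is loaded into a pandas DataFrame; the file path, column names, and value ranges are hypothetical stand-ins for whatever your pipeline actually promises.

```python
import pandas as pd

# Hypothetical path and columns; substitute the ones your pipeline produces.
df = pd.read_parquet("data/transactions_clean.parquet")

expected_columns = {"user_id", "signup_date", "purchase_amount"}
expected_dtypes = {"user_id": "int64", "purchase_amount": "float64"}

# Column names and schema
missing = expected_columns - set(df.columns)
assert not missing, f"missing columns: {missing}"
for col, dtype in expected_dtypes.items():
    assert df[col].dtype == dtype, f"{col} is {df[col].dtype}, expected {dtype}"

# Summary statistics within expected ranges (values are illustrative)
assert df["purchase_amount"].between(0, 10_000).all(), "purchase_amount out of range"
assert df["user_id"].is_unique, "duplicate user_id values found"
```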
An XP Workflow for Data Science
Adapting XP to data science means treating experiments, pipelines, and models as the deliverables of your workflow, rather than traditional software features. A typical workflow might look like this:
Iteration Cycle (1–2 weeks)
- Define experiment stories: Each iteration begins with a clear objective, e.g., “train model X with features Y and Z” or “clean and validate dataset A.”
- Run experiments and document results: Capture metrics, observations, and insights. Sharing results early ensures the team learns collectively and avoids duplicated effort.
- Add engineering stories: Include tasks like automating data pipelines, writing tests, or deploying a model service.
- Deliver a “release”: In ML, a release isn’t always a production model. It could be a prototype, an evaluation report, a cleaned dataset, or a baseline model ready for further experimentation.
Daily Rhythm
- Quick standups: Share what worked, what failed, and what the next step is.
- Pair programming and reviews: Collaborate on experiments, code, and reproducibility checks.
- Continuous integration (CI): Run automated tests on both code and data pipelines. Validate unit tests, schema checks, and model evaluation scripts frequently.
Testing and Validation
- Code tests: Ensure unit and integration tests catch regressions in preprocessing or pipeline logic.
- Data tests: Check schemas, missing values, statistical ranges, and data drift (see the drift-check sketch below).
- Model tests: Monitor performance thresholds, fairness metrics, and latency for deployed models.
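As one way to make the data-drift check concrete, here is a minimal sketch that compares a numeric feature in the latest batch against a stored reference sample using SciPy’s two-sample Kolmogorov-Smirnov test; the file paths, column name, and 0.01 cutoff are illustrative choices, not prescriptions.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical files: a reference sample captured at training time and the
# latest ingested batch.
reference = pd.read_parquet("data/reference_sample.parquet")
incoming = pd.read_parquet("data/latest_batch.parquet")

statistic, p_value = ks_2samp(reference["purchase_amount"], incoming["purchase_amount"])

# A small p-value suggests the two distributions differ; surface it loudly so
# the team can investigate before retraining or deploying.
if p_value < 0.01:
    raise ValueError(
        f"Possible drift in purchase_amount (KS statistic={statistic:.3f}, p={p_value:.4f})"
    )
```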
By combining these practices, the workflow maintains XP’s core values: frequent feedback, simplicity, collaboration, and the courage to pivot. Experiments are treated as releases, knowledge is shared continuously, and data quality and reproducibility are built into the process. The result is a team that can explore boldly while still delivering reliable, high-quality outcomes.
Key takeaways
XP brings robust development practices to ML teams by taking the elements of good collaboration to the extreme, resulting in:
- Cleaner pipelines, fewer hidden assumptions.
- Faster feedback, fewer wasted experiments.
- Reproducibility by design.
- A sustainable way to balance research curiosity with engineering rigor.
Something I think is worth highlighting: frequent pairing may feel like a bottleneck, but I’ve experienced the complete opposite whenever I engage with it. Simply put, it’s more fun. Some of the most insightful tips I’ve learned didn’t come from textbooks; they came from the people I paired with over the years. The shared enjoyment of learning motivates sustainable progress and keeps momentum high over long stretches of time.
XP wasn’t written with ML in mind, but its core philosophy — rapid feedback, simplicity, discipline — fits the challenges of ML/AI engineering. By treating experiments as releases, data as code, and evaluation as testing, XP gives ML teams a methodology that supports both exploration and reliable delivery.