Hard-Learned Lessons in Shipping Software (AI/ML) Projects

A Guide for Engineers and Product Managers

Product Management
Machine Learning
Author

Imad Dabbura

Published

January 5, 2025

Modified

February 11, 2026

Why ML Projects Fail to Ship

Some ML projects I’ve worked on shipped six months late. Others shipped and quietly died in production. A few never shipped at all — and those are the ones I keep coming back to. I’ve been through this as an individual contributor and as the person leading the team. For a long time I blamed the usual suspects: unclear requirements, technical debt, underestimating complexity. The real cause, I eventually realized, was more structural — and it looked the same from both seats.

A web feature has a clear definition of done: the button appears, the form submits, the data is saved. An ML feature doesn’t. “Improve recommendation accuracy” could mean another week of training runs, another architecture experiment, another round of feature engineering — indefinitely. Unlike traditional software, where the solution space is bounded by the spec, ML projects have an effectively unbounded search space. Every model can be made larger, every feature set more complete, every training run longer.

flowchart TD
    subgraph trad ["Traditional Software"]
        direction LR
        t1["Define Feature"] --> t2["Build"] --> t3["Ship ✓"]
    end
    subgraph ml ["ML Without Constraints"]
        direction LR
        m1["Vague Goal"] --> m2["Experiment #1"]
        m2 -->|"+0.5% accuracy"| m3["Experiment #2"]
        m3 -->|"+0.2% accuracy"| m4["Experiment #3"]
        m4 -.->|"one more try..."| m2
    end

Traditional software has a bounded end state. ML projects without defined constraints loop indefinitely — each experiment looks like progress.

This produces a predictable failure mode: a project that looks like it’s making progress — models training, experiments running, metrics moving — but never ships.

The root cause, I’ve come to believe, is a category error. ML projects sit uncomfortably between research and engineering. Research is unbounded by design — you keep going until you understand something. Engineering is bounded by design — you keep going until it ships. The teams that deliver consistently have made a deliberate choice about which one they’re doing. The ones that don’t, haven’t — and so they run a research process inside an engineering context, indefinitely.

What follows is what I’ve learned — sometimes from my own failures, sometimes from watching teams I was leading repeat patterns I’d already lived through — about how to actually make that choice.

Define the Target Before Writing a Line of Code

The most consistent mistake I’ve made — and watched others make — is starting before the goal is properly defined. It doesn’t feel like a mistake at the time. There’s energy, there’s a general direction, there’s a team ready to move. But you can’t constrain an undefined goal. The first structural requirement for shipping is a precise definition of version one — not the roadmap, not the vision, but version one.

Write it as a falsifiable criterion: “A model that identifies churn risk 14 days in advance with precision ≥ 70% on the holdout set, deployable via the existing prediction service.” That’s a definition. “Improve churn prediction” is not — it’s a direction, and directions don’t ship.

Before writing code, three questions force the definition:

Who is this for, specifically? A customer-facing recommendation system for mobile users has different input distributions, latency constraints, and acceptable error modes than an internal analyst tool. “Users in general” means nobody in particular, and a system designed for nobody in particular gets spec’d indefinitely.

What does version one accomplish — and what does it explicitly not do? The second half matters as much as the first. Scope creep in ML is insidious because experiments feel like progress: an additional feature, a new architecture variant, a cleaned edge case — each looks like forward motion. The out-of-scope list is what makes the in-scope list real.

What are the success criteria, written down and falsifiable? Precision ≥ 0.70. Latency ≤ 100ms at p99. Deployable on the existing serving infrastructure. Criteria that can be verified make it possible to call the project done. Criteria that can’t — “good enough,” “production-ready,” “performs well” — guarantee the project never ends.
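Criteria like these can literally be written as code. Below is a minimal sketch of a "ship gate" that encodes the article's example thresholds as falsifiable checks; the metric names, values, and the `deployable` flag are illustrative placeholders, not a real evaluation harness.

```python
# Hypothetical ship gate: version-one criteria written as falsifiable checks.
# Metric names and thresholds are illustrative placeholders.

def meets_v1_criteria(metrics: dict) -> bool:
    """Return True only if every written-down criterion is verifiably met."""
    return (
        metrics["precision"] >= 0.70          # precision >= 0.70 on the holdout
        and metrics["latency_p99_ms"] <= 100  # p99 latency <= 100 ms
        and metrics["deployable"]             # runs on existing serving infra
    )

print(meets_v1_criteria({"precision": 0.73, "latency_p99_ms": 85, "deployable": True}))   # True
print(meets_v1_criteria({"precision": 0.68, "latency_p99_ms": 85, "deployable": True}))   # False
```

The point isn't the three lines of boolean logic; it's that "good enough" cannot be written this way, and anything that can't be is not a shipping criterion.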

Working backwards from these answers also produces the project structure. Once you know what version one must accomplish, you can enumerate prerequisite questions: What training data do we need? What does the evaluation harness look like? How does it plug into production? Each answer either gets scheduled or gets cut. Vague goals don’t allow this decomposition — they keep the surface area perpetually open.

Make Time the Constraint, Not Scope

The natural instinct is to treat scope as fixed and deadline as flexible. This is exactly backwards.

Scope in an ML project is not fixed — it’s infinitely expandable. There’s always a better architecture to try, a cleaner way to handle edge cases, a feature that might help. Teams treat scope as the constraint because it feels owned: the team wrote the requirements, agreed on them, and changing them feels like abandoning a commitment. Deadlines, by contrast, feel externally imposed and therefore more negotiable — something to push when the work “isn’t ready yet.”

I’ve been in this meeting many times — sometimes as the engineer watching the deadline move, more often as the person responsible for it. The deadline slips, then slips again, then becomes a standing item on the weekly status call.

Flip the constraint. Treat the deadline as fixed and scope as the variable. This changes the question from “when will we be done with everything we planned?” to “what’s the most important thing we can ship by this date?” The second question forces real prioritization. The model that trains in four hours ships; the one that takes 24 hours doesn’t. The feature built on existing infrastructure stays; the one that requires a new data pipeline gets cut.

Deadlines as a Design Tool

A deadline doesn’t dictate quality — it dictates scope. The discipline is specifically about protecting the deadline from scope expansion, not accelerating the work. When a new requirement surfaces mid-project, the question isn’t “can we fit it in?” — it’s “what does it displace?”

Version One Is Supposed to Be Small

Version one of most production models is smaller, faster, and more constrained than anything the team initially imagined. Good. The goal of version one isn’t to build the best possible system — it’s to establish the deployment path, validate production integration, and generate real usage data. The best possible system comes later, built on what version one teaches you.

Decompose Until Done Is Unambiguous

Once you have a target and a deadline, break the project into tasks — not work items, not epics, tasks. The distinction matters: a task has an unambiguous definition of done. A project doesn’t.

“Train a production NLP model” is a project. Tasks are:

- “Label 500 training examples from the January logs” — done or not done.
- “Achieve F1 ≥ 0.82 on the validation split” — done or not done.
- “Write the endpoint that accepts raw text and returns a classification” — done or not done.

If you can’t tell whether a piece of work is finished without discussion, break it down further.
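The task/project distinction can be made concrete: a task is a description paired with a check that returns true or false with no discussion. A minimal sketch, where all names, counts, and values are placeholders:

```python
# Illustrative sketch: a task paired with a falsifiable "done" check. If you
# can't write the predicate, the item is still a project, not a task.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    description: str
    is_done: Callable[[], bool]  # returns True or False; no discussion needed

labeled_count = 500   # in practice, read from the labeling tool
validation_f1 = 0.84  # in practice, read from the eval harness

tasks = [
    Task("Label 500 training examples from the January logs",
         lambda: labeled_count >= 500),
    Task("Achieve F1 >= 0.82 on the validation split",
         lambda: validation_f1 >= 0.82),
    Task("Write the endpoint that accepts raw text and returns a classification",
         lambda: False),  # not built yet
]

for t in tasks:
    print(f"[{'x' if t.is_done() else ' '}] {t.description}")
```

If a "task" needs a qualitative judgment to fill in its predicate, that's the signal to break it down further.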

With tasks in hand, ruthlessly prioritize against the version one criteria. Not everything is equally important, and pretending otherwise is how projects stall:

| Category | Description | Rule |
| --- | --- | --- |
| Must-have | System cannot ship without this | Do first, never cut |
| Should-have | Meaningfully improves the product | Include if time allows |
| Nice-to-have | Incremental gain, no blocker | Version two |
| Gold-plating | No clear user benefit | Cut immediately |

The failure mode is treating “should-haves” as “must-haves.” It happens because, deep down, the team doesn’t believe version two is coming. If this feels like the only shot, every improvement feels essential. But that’s exactly backwards: version two only exists because version one shipped. Holding version one hostage to version two’s requirements is how you guarantee neither does.

Validate the Core Assumption Before Building the System

Every ML project rests on a single load-bearing assumption: “Does a model trained on this data actually produce useful predictions for this problem?” Everything else — the serving infrastructure, the feature pipeline, the retraining loop, the monitoring dashboard — only matters if the answer is yes.

I fell into this trap early in my career, and led teams into it later. The pattern is always the same: stand up a feature store, design a training pipeline, architect a serving layer — then train the model and discover the data doesn’t support the prediction task, or the signal is too weak, or the problem is better solved without ML at all. Months of infrastructure work, none of it applicable to the revised approach. The infrastructure trap is just as easy to fall into when you’re the one setting the direction as when you’re the one doing the building.

flowchart TD
    subgraph wrong ["❌ Common Mistake"]
        direction LR
        w1["Feature Store"] --> w2["Training Pipeline"] --> w3["Model Registry"] --> w4["Model"] --> w5["Works?"]
    end
    subgraph right ["✓ Correct Order"]
        direction LR
        r1["Validate\nApproach"] --> r2["Establish\nDeploy Path"] --> r3["Build\nInfrastructure"] --> r4["Ship ✓"]
    end

The infrastructure trap: building the full system before validating the approach. The correct order validates cheaply first, then invests in infrastructure.

The minimum viable experiment: train a simple baseline on a slice of the data, evaluate it against a manually labeled holdout, and show the results to at least one person who’d actually use the output. Logistic regression, a small neural net, a fine-tuned pretrained model — whatever takes days, not months. If the results are promising, the infrastructure investment is justified. If not, you’ve learned the most important thing about the project for the cost of two weeks.
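The whole experiment can fit in a page of scikit-learn. A minimal sketch — the synthetic data stands in for your "slice of the data", and the precision metric stands in for whatever version-one criterion you wrote down:

```python
# A minimum viable experiment, sketched with scikit-learn on synthetic data.
# In practice the data loading is yours; everything besides the
# LogisticRegression baseline is a placeholder.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Placeholder for "a slice of the data"
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The cheap baseline: minutes to train, not months
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
precision = precision_score(y_holdout, baseline.predict(X_holdout))

# The go/no-go decision: invest in infrastructure only if the cheap
# baseline suggests the data supports the prediction task at all.
print(f"baseline precision on holdout: {precision:.2f}")
```

If this number is nowhere near the bar, no feature store or serving layer will fix it — and you've learned that for the price of an afternoon.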

This also determines tool choice during validation. scikit-learn, PyTorch, pre-trained transformers from HuggingFace — these represent thousands of engineering hours and are battle-tested at scale. Custom architectures and bespoke training loops are justified when profiling data shows standard tools can’t meet your requirements. That data doesn’t exist before validation. Building custom infrastructure before validating the approach is the fastest way to spend six months on something nobody uses.

Ship, Then Compound

Once version one meets the criteria, ship it — even if it’s not perfect.

Every model I’ve shipped has surprised me in production. Not because the evaluation was wrong, but because it was measuring the wrong things. Your holdout set measures what you measured. Real users do things you didn’t anticipate — edge cases you didn’t label, inputs from distributions you didn’t sample, and above all, they surface which errors actually matter. A model that’s 92% accurate on the evaluation set might be systematically wrong on the 8% of inputs that are disproportionately important to users. You won’t know that until the model is deployed.
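One way to catch this offline, once production data exists, is per-slice evaluation rather than a single aggregate number. A minimal sketch with synthetic data — the "power_user" segment, the 8% share, and the simulated predictions are all illustrative:

```python
# Sketch: per-slice evaluation. A single aggregate accuracy can hide a
# subgroup the model gets systematically wrong. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# ~8% of traffic comes from a hypothetical "power_user" segment
segment = np.where(rng.random(n) < 0.08, "power_user", "casual")
y_true = rng.integers(0, 2, n)
# Simulated model: perfect on casual users, systematically wrong on power users
y_pred = np.where(segment == "casual", y_true, 1 - y_true)

print(f"overall accuracy: {(y_true == y_pred).mean():.2f}")  # ~0.92
for seg in ("casual", "power_user"):
    mask = segment == seg
    acc = (y_true[mask] == y_pred[mask]).mean()
    print(f"{seg}: accuracy={acc:.2f}, n={mask.sum()}")
```

The aggregate looks healthy; the slice breakdown shows one segment where the model is worse than useless. The catch — and the article's point — is that you usually don't know which slices matter until real users tell you.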

flowchart LR
    A["Ship\nImperfect v1"] --> B["Real\nUsage Data"]
    B --> C["Discover\nActual Failures"]
    C --> D["Targeted\nFixes"]
    D --> E["Ship\nBetter v2"]
    E --> B

The iteration flywheel: each shipped version surfaces real failures that targeted improvements address, compounding over time.

Ship versions that meet the bar, not versions that approach some imagined ceiling. Version one will be wrong in ways you didn’t anticipate — I’ve never shipped one that wasn’t, and I’ve never led a team that did. One of the harder things about leading engineers through this is convincing them that shipping something imperfect isn’t a compromise — it’s the whole point. The failures you discover in production are the ones that matter. Find them early, when fixing them is fast, not late, when the system is load-bearing and everything is entangled.

Each Version Enables the Next

Real usage reveals the failures that matter — not the ones you hypothesized in the design doc, but the ones users actually encounter. Their feedback tells you which improvements are worth making. Infrastructure built for version one scales to version two. The teams that ship consistently aren’t the ones with better planning processes — they’re the ones who’ve completed more cycles of this loop.

Key Takeaways

  1. Decide whether you’re doing research or engineering before you start. ML projects that don’t make this distinction run research processes in engineering contexts — indefinitely.

  2. Define version one as a falsifiable criterion. Precision ≥ X. Latency ≤ Y. Deployable on Z. Criteria that can’t be verified guarantee the project never ends.

  3. Treat deadline as fixed, scope as variable. The question is always: “What’s the most important thing we can ship by this date?”

  4. Decompose until done is unambiguous. If you can’t tell whether a task is finished without discussion, it’s not a task — it’s a project.

  5. Validate the core assumption before building infrastructure. Does the model work on this data? Answer that first, with the simplest possible tools. Everything else comes after.

  6. Ship the imperfect version. Offline evaluation measures what you measured. Real usage reveals what you missed. Each shipped version enables the next.
