A strong model does not automatically become a strong product

One thing I've learned from working on ML products: a strong model does not automatically become a strong product.

A model can look great in a notebook. It can do well on offline benchmarks. It can even feel impressive in a demo. And still fail in production. Not because the research was bad, but because the model was only one part of the system.

For an ML system to create real value, it has to solve the right problem, for the right users, with the right data, under constraints that are very real: latency, cost, privacy, safety, reliability, scale, and user trust.

Sometimes the hardest part is not building the model. It is figuring out which problem is worth solving in the first place.

A PM is not there to show up at the end and write a PRD after the research is done. In the best ML teams, product work starts much earlier. The PM helps define the use case, prioritize the problem, shape the golden dataset, clarify the evals, align on tradeoffs, define the failure bar, and connect model capability back to user and business value.

The way I think about it is simple. Product owns whether the ML system solves the right problem and is worth shipping, which in turn helps set the research direction. Research owns which model can work, where existing models are sufficient, and where new innovation is required. ML Engineering owns whether the model can work reliably in production.

Roles and Responsibilities

Area	Product owns	Research owns	ML Engineering owns
Problem and use case	Which user problem matters, why it matters, and what should be prioritized. This may show up in a PRD, prototype, user journey, or launch hypothesis.	Whether the problem can be modeled and which technical paths are promising.	What it would take to productionize the solution.
Model selection	The criteria for what "best" means in the product: quality, latency, cost, safety, privacy, reliability, and user value.	The technical model choice: architecture, training method, ranking approach, embeddings, loss function, or fine-tuning strategy.	The serving complexity, infra fit, scaling risks, monitoring needs, and operational cost.
Tradeoffs	What tradeoffs are acceptable for users and the business. For example: quality vs. latency, accuracy vs. cost, automation vs. control.	How different model choices impact quality, robustness, and failure modes.	How to hit latency, cost, reliability, and scale targets in production.
Golden dataset	Whether the dataset reflects real users, priority scenarios, edge cases, and the launch bar.	What examples, labels, negatives, and edge cases are technically useful for model improvement.	Data pipelines, validation, versioning, freshness, access controls, and repeatability.
Evals	Product evals and experience KPIs: what "good enough" means for users and what needs to be true before launch.	Technical metrics: accuracy, precision, recall, calibration, robustness, benchmarks, diagnostics, and error analysis.	Eval pipelines, dashboards, regression tests, monitoring, and production quality tracking.
Failure modes	Which failures are acceptable, which are launch-blocking, and what the user experience should be when the model is wrong.	Where the model fails, why it fails, and how to improve it.	Detection, fallback, retries, alerts, mitigations, and operational playbooks.
Product integration	How the model fits into the end-to-end experience, policy, operations, support, trust model, and rollout.	Model behavior, limitations, expected outputs, and edge cases.	Serving APIs, logging, fallbacks, alerts, infra, scaling, and integration with existing systems.
Success criteria	User impact, business impact, adoption, engagement, satisfaction, trust, and launch-stage goals.	How success translates into model behavior and measurable quality gains.	How success translates into system reliability, observability, uptime, cost, and production health.
Launch readiness	End-to-end readiness: UX, safety, evals, monitoring, GTM, rollout, rollback, comms, and support.	Model readiness.	Operational readiness.

Research should absolutely own technical model selection. They are closest to the architecture, training approach, loss functions, ranking strategy, embeddings, and fine-tuning methods.

But Product owns the decision criteria. Do we need the highest-quality model, or the fastest model that is good enough? Is a 4% offline improvement meaningful if users do not notice it? Is the quality gain worth the compute cost? Can the experience tolerate 200ms, 500ms, or 2 seconds? What behaviors are unacceptable, even if the average metric improves? What should happen when the model is uncertain or wrong?

Those are product questions.
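
One way to make those criteria concrete is to write them down as an explicit launch gate that every candidate model is checked against. The sketch below is only an illustration: the class names, field names, thresholds, and the `leaks_personal_data` behavior label are all hypothetical, and a real team would pick its own metrics and numbers.

```python
from dataclasses import dataclass, field


@dataclass
class LaunchCriteria:
    """Product-owned bar a candidate model must clear (all values hypothetical)."""
    min_quality_gain: float          # smallest offline gain users are expected to notice
    max_p95_latency_ms: float        # latency the experience can tolerate
    max_cost_per_1k_requests: float  # compute budget, in dollars
    blocking_behaviors: set[str] = field(default_factory=set)


@dataclass
class CandidateReport:
    """Numbers Research and ML Engineering report for one candidate model."""
    quality_gain: float
    p95_latency_ms: float
    cost_per_1k_requests: float
    observed_behaviors: set[str] = field(default_factory=set)


def launch_blockers(criteria: LaunchCriteria, report: CandidateReport) -> list[str]:
    """Return every reason the candidate misses the bar; an empty list means it passes."""
    blockers = []
    if report.quality_gain < criteria.min_quality_gain:
        blockers.append("quality gain below the bar users would notice")
    if report.p95_latency_ms > criteria.max_p95_latency_ms:
        blockers.append("p95 latency over budget")
    if report.cost_per_1k_requests > criteria.max_cost_per_1k_requests:
        blockers.append("serving cost over budget")
    # An unacceptable behavior blocks launch even when the average metrics improve.
    for behavior in sorted(report.observed_behaviors & criteria.blocking_behaviors):
        blockers.append(f"unacceptable behavior observed: {behavior}")
    return blockers


# Hypothetical example: a 4% offline gain that still fails on latency and safety.
criteria = LaunchCriteria(0.02, 500, 1.50, {"leaks_personal_data"})
report = CandidateReport(0.04, 620, 1.10, {"leaks_personal_data"})
blockers = launch_blockers(criteria, report)
```

In this made-up case the candidate clears the quality and cost bars but is still blocked on latency and on an unacceptable behavior. Product sets the bar, Research and ML Engineering supply the numbers, and the gate makes the tradeoff visible to all three.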

ML Engineering brings another essential lens: what it will take to make the model real. Serving path. Infra cost. Scaling limits. Observability. Fallbacks. Reliability risks. Operational complexity. Product does not own all of those details — but Product does need to understand them well enough to make the right tradeoffs.
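
The question of what should happen when the model is uncertain or wrong is a good example of where the two lenses meet: Product decides the threshold and the fallback experience, and ML Engineering builds the mechanism. A minimal sketch, assuming a hypothetical `predict` callable that returns a label and a confidence score, and a hypothetical `fallback` that returns a safe default:

```python
from typing import Callable, Tuple


def answer_with_fallback(
    predict: Callable[[str], Tuple[str, float]],  # hypothetical model call
    fallback: Callable[[str], str],               # hypothetical safe default
    request: str,
    min_confidence: float = 0.8,                  # illustrative, product-chosen threshold
) -> Tuple[str, str]:
    """Return (answer, source) so logs show how often the fallback is used."""
    try:
        label, confidence = predict(request)
    except Exception:
        # The model call failed: degrade to the fallback instead of erroring out.
        return fallback(request), "fallback:error"
    if confidence < min_confidence:
        # The model is unsure: prefer the predictable fallback experience.
        return fallback(request), "fallback:low_confidence"
    return label, "model"


# Hypothetical usage with stand-in callables.
unsure = answer_with_fallback(lambda r: ("refund", 0.55), lambda r: "ask_human", "hi")
sure = answer_with_fallback(lambda r: ("refund", 0.95), lambda r: "ask_human", "hi")
```

Returning the source alongside the answer is a deliberate choice here: it lets monitoring track how often users get the fallback, which feeds back into whether the threshold is set in the right place.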

The PM's job is not just to ask, "Can we build this?" It is to ask: Should we build this? For whom? What does good look like? What should the model never do? What tradeoffs are we willing to make? What happens when the model is wrong? How do we evaluate both model behavior and end-to-end product behavior? How do we know this is creating real value?

In strong ML teams, like many I have worked on, each function brings its own strengths, skills, and perspective to the problem.

A strong model does not automatically become a strong product. It takes a team.