I started working in AI/ML when we were building Amazon Alexa.
Across Alexa, Providence, and Codec Avatars, I have worked on AI systems at different stages of the field's evolution. A few examples stand out.
At Alexa, one part of the work was building narrow AI systems that could understand user intent and reliably execute actions. That meant defining intents, collecting training data, mapping utterances, maintaining content and device catalogs, and building orchestration logic across speech recognition, NLU, entity extraction, APIs, and smart-home/device graphs.
At Providence, the work moved from intent recognition to clinical decision support. One challenge was not just predicting risk, but turning patient data, medical best practices, governance models, clinical context, and human feedback into trusted next-best actions for clinicians at the point of care.
At Codec Avatars, the work is multimodal and generative: using deep learning to understand and generate human presence across face, expression, movement, embodiment, voice, visual appearance, and perception.
Together, these experiences shaped my view of foundation models: they do not replace traditional AI systems wholesale. They change where intelligence lives inside the system.
The better question is not whether foundation models will replace traditional AI. It is: which layers should become general-purpose, and which must remain specialized, deterministic, optimized, governed, or domain-specific?
Trip down memory lane: what could have been different with foundation models?
If foundation models had existed back then…
Alexa
Problem to solve: take action based on what the user asks Alexa.
Take a simple request: "Alexa, play The Crown."
In the older architecture, that request moved through many separate layers. The system had to recognize the speech, classify the intent as media playback, extract The Crown as the title, resolve it against a content catalog, check available streaming services, understand the target device, and route the request to the right playback API.
If the user said, "I want to watch that queen show upstairs," the system still had to map that variation back to the same underlying goal.
The intelligence was real, but the architecture was tedious to build and maintain. Every variation in user language had to be mapped through narrow systems: speech recognition, intent classification, slot filling, entity resolution, content catalogs, device graphs, rules, APIs, and explicit orchestration logic. Much of the reasoning lived outside the model: if intent equals X, call service Y with parameters Z.
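To make the "if intent equals X, call service Y with parameters Z" pattern concrete, here is a deliberately tiny sketch. It is an illustration only, not how Alexa was implemented: the regular expression, the `CATALOG` contents, the service names, and the default device are all hypothetical, and it assumes the wake word has already been stripped from the utterance.

```python
import re

# Hypothetical utterance patterns; a production system would use trained models.
INTENT_PATTERNS = {
    "PlayMedia": re.compile(r"^(?:play|watch) (?P<title>.+)$"),
}

# Hypothetical content catalog: normalized title -> catalog ID.
CATALOG = {"the crown": "title-001"}


def classify(utterance):
    """Return (intent, slots) for the first matching pattern, else (None, {})."""
    text = utterance.lower().strip()
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.match(text)
        if match:
            return intent, match.groupdict()
    return None, {}


def route(utterance, device="living-room-tv"):
    """Explicit orchestration: if intent equals X, call service Y with parameters Z."""
    intent, slots = classify(utterance)
    if intent == "PlayMedia":
        title_id = CATALOG.get(slots["title"])
        if title_id is None:
            return {"service": "clarify", "params": {"reason": "unknown_title"}}
        return {"service": "playback",
                "params": {"title_id": title_id, "device": device}}
    return {"service": "fallback", "params": {}}
```

In this sketch, `route("play The Crown")` reaches the playback branch, while `route("I want to watch that queen show upstairs")` falls through to the fallback. Every such variation needs another pattern, another catalog alias, or another rule, which is the maintenance burden described above.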
If foundation models had existed then, we would not have had to hand-build as much of that brittle interpretation layer. More of the work around intent understanding, ambiguity resolution, entity extraction, contextual reasoning, clarification, and structured tool-call generation could have moved into the model.
What would still remain are the parts that make the system real: the content catalog, device state, user permissions, APIs, latency budgets, safety checks, execution reliability, and product judgment.
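One way to picture that split: the model proposes a structured tool call, and deterministic code checks it against the catalog, device state, and user entitlements before anything executes. The sketch below assumes a hypothetical `ToolCall` shape and made-up catalog, device, and entitlement tables; none of these names come from a real API.

```python
from dataclasses import dataclass

# Hypothetical deterministic state that the model does not own.
CATALOG = {"title-001": {"name": "The Crown", "services": {"service-a"}}}
DEVICES = {"bedroom-tv": {"online": True}}
USER_SERVICES = {"user-1": {"service-a"}}


@dataclass(frozen=True)
class ToolCall:
    """Structured call a model might emit (hypothetical shape)."""
    tool: str
    title_id: str
    device_id: str


def validate(call, user_id):
    """Return the reasons a call cannot run; an empty list means it may execute."""
    problems = []
    title = CATALOG.get(call.title_id)
    device = DEVICES.get(call.device_id)
    if call.tool != "play_media":
        problems.append("unsupported_tool")
    if title is None:
        problems.append("unknown_title")
    elif not (title["services"] & USER_SERVICES.get(user_id, set())):
        problems.append("no_entitled_service")
    if device is None:
        problems.append("unknown_device")
    elif not device["online"]:
        problems.append("device_offline")
    return problems
```

The design choice here is that the model can be wrong about a title or a device without the system acting on it. A non-empty result can be turned into a clarifying question instead of a failed playback.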
Providence
Problem to solve: recommend the best clinical course of action based on the discussion with the patient and their clinical history.
Providence taught me the same lesson in a higher-stakes domain.
In clinical settings, prediction is only the starting point. The harder problem is contextual judgment: turning patient history, lab trends, medications, guidelines, governance rules, workflow constraints, and expert feedback into trusted next-best actions.
Take a patient with multiple recent visits, abnormal lab trends, a medication change, and a missed follow-up. A traditional clinical AI model might predict elevated risk from structured features such as diagnoses, labs, medications, utilization history, and prior outcomes.
Useful — but not enough.
The real product challenge is turning that risk signal into a safe, explainable, clinically trusted next step.
Before foundation models, much of that workflow had to be explicitly designed. The system might flag the patient as high risk or place them in a work queue. But the product still had to decide what to show: which lab trend mattered, which medication change was relevant, which guideline applied, which clinician should see it, and whether the next step was outreach, follow-up, lab work, medication review, escalation, or no action.
That synthesis layer was labor-intensive to build and maintain because so much of the context had to be manually assembled, mapped to care pathways, validated with clinical experts, and translated into rule-heavy workflow logic. A risk score did not automatically become clinical judgment.
If foundation models had existed then — especially models trained or adapted on high-quality medical data — more of the workflow we were manually designing into the product could have moved into the model. A strong model arrives pretrained with broad language, reasoning, and pattern-understanding capabilities. With medical grounding, retrieval, and clinical validation, the model could help assemble the middle layer: summarizing the patient's recent history, highlighting the abnormal lab trend, connecting it to the medication change and missed follow-up, retrieving the relevant guideline, explaining why the patient was flagged, and drafting a recommended next step for clinician review.
But again, the foundation model does not replace the clinical system.
It should not replace governance. It should not replace medical validation. It should not replace clinical guidelines, audit trails, human accountability, or systems of record.
In high-stakes domains, foundation models can absorb parts of the cognitive and language interface. But they must be wrapped in retrieval, evidence, policy, escalation paths, and human review.
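A minimal sketch of what "wrapped" could mean in code, under stated assumptions: the `Draft` shape, the `ALLOWED_ACTIONS` set, and the outcome labels are hypothetical and are not taken from any clinical product. The point is only that a model draft is checked against policy and retrieved evidence, and that even a passing draft goes to a clinician and is never executed directly.

```python
from dataclasses import dataclass, field

# Hypothetical policy: the only actions a model is allowed to draft.
ALLOWED_ACTIONS = {"outreach", "follow_up", "lab_work", "medication_review"}


@dataclass
class Draft:
    """A model-drafted next step (hypothetical shape)."""
    action: str
    rationale: str
    evidence_ids: list = field(default_factory=list)


def gate(draft, retrieved_ids):
    """Decide how a draft is handled; no branch auto-executes the action."""
    if draft.action not in ALLOWED_ACTIONS:
        return "escalate"          # outside policy: send to an escalation path
    if not draft.evidence_ids:
        return "reject"            # no evidence cited
    if not set(draft.evidence_ids) <= set(retrieved_ids):
        return "reject"            # cites a source that retrieval did not return
    return "clinician_review"      # grounded and in policy, still needs a human
```

For example, a `medication_review` draft that cites a retrieved lab record is queued for clinician review, while a draft with no citations, or one that cites a record retrieval never returned, is rejected before a clinician sees it.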
The Replacement Pattern Across Traditional AI Systems
Foundation models rarely replace an entire AI system. They usually replace or absorb specific layers.
1. Chatbots: replacing the scripted conversation layer
Traditional chatbots were often built from decision trees, FAQ retrievers, intent classifiers, and scripted flows.
Foundation models replace much of that brittle conversation layer because they can understand open-ended language, maintain context, retrieve knowledge, generate responses, and escalate more intelligently.
What remains is the enterprise system around the chatbot: knowledge sources, permissions, CRM records, escalation paths, human review, policy controls, and workflow execution.
2. Document OCR and understanding: replacing the template-extraction layer
Traditional document systems often separated OCR from understanding: first read the characters, then use templates or rules to extract invoice numbers, dates, totals, clauses, signatures, or medical fields.
Multimodal foundation models can replace much of that rule-heavy extraction layer. They can understand layout, tables, screenshots, handwriting, charts, and context.
The OCR layer may still remain as low-cost infrastructure, but the intelligence shifts from "read this text" to "understand this document."
What remains is verification, source grounding, auditability, domain policy, and integration into systems of record.
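As one hedged example of source grounding, the cheap OCR text can be used to check what a model extracted: a field counts as grounded only if its value appears in the OCR output. The function name and the verbatim-match rule are assumptions for illustration. Real systems would likely need fuzzier matching for OCR noise and formatting differences.

```python
def _normalize(text):
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return " ".join(str(text).split()).lower()


def verify_fields(extracted, ocr_text):
    """Map each field name to True if its value appears verbatim in the OCR text."""
    haystack = _normalize(ocr_text)
    results = {}
    for name, value in extracted.items():
        needle = _normalize(value)
        # Empty values are never treated as grounded.
        results[name] = bool(needle) and needle in haystack
    return results
```

Fields that come back `False` are candidates for human verification rather than automatic entry into a system of record.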
3. Recommendations: replacing content understanding, augmenting ranking
In a traditional recommendation system, much of the intelligence comes from behavior: what someone clicked, watched, skipped, saved, bought, shared, or abandoned. Those signals drive candidate generation, ranking, exploration, diversity, freshness, and long-term value.
Foundation models can replace parts of the semantic understanding layer. They can understand what a song, video, product, article, or podcast is about before there is much engagement data. They can support requests like: "Recommend something for a rainy Sunday when I want to relax but not feel sleepy."
But they do not replace the full recommendation engine. Real-time behavioral feedback, ranking, calibration, experimentation, diversity constraints, inventory constraints, latency budgets, and business optimization still matter.
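A small sketch of "replace content understanding, augment ranking": semantic similarity carries a new item until behavioral data accumulates, and then the behavioral signal takes over. The blending rule and the constant `k` are illustrative assumptions, not a described production formula, and the vectors are assumed to be embeddings produced elsewhere by a foundation model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def blended_score(query_vec, item_vec, behavioral_score, impressions, k=100):
    """Lean on semantics for new items and on behavior as impressions accumulate."""
    # trust is 0 for a brand-new item and approaches 1 with more impressions.
    trust = impressions / (impressions + k)
    return (1 - trust) * cosine(query_vec, item_vec) + trust * behavioral_score
```

With zero impressions the score is purely semantic, which is the cold-start case. With many impressions the behavioral score dominates, which is where the existing ranking stack keeps doing its job.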
4. Search: replacing query understanding, augmenting retrieval and ranking
Foundation models will transform search, but search will not become "just an LLM."
Foundation models can replace parts of query understanding, such as intent detection and query expansion, and add summarization, result synthesis, and conversational refinement on top. They can turn search from "return documents" into "answer the question and show the evidence."
But search still needs indexing, retrieval, ranking, freshness, authority, permissions, latency control, and citation grounding. The model improves the interface and reasoning layer. It does not eliminate the search infrastructure.
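Two of those responsibilities, permissions and citation grounding, can be sketched in a few lines. The in-memory `INDEX`, its sample documents, and the group names are hypothetical stand-ins for a real search index. The sketch only shows that permission filtering happens before a model sees any text, and that an answer is accepted only if it cites documents that were actually retrieved.

```python
# Hypothetical index: document ID -> text and the groups allowed to read it.
INDEX = {
    "doc-1": {"text": "Refund policy: 30 days.", "groups": {"support", "sales"}},
    "doc-2": {"text": "Internal pricing floor.", "groups": {"finance"}},
}


def retrieve(query_terms, user_groups):
    """Return IDs of matching documents that the user is permitted to read."""
    hits = []
    for doc_id, doc in INDEX.items():
        text = doc["text"].lower()
        permitted = bool(doc["groups"] & user_groups)
        if permitted and any(term.lower() in text for term in query_terms):
            hits.append(doc_id)
    return hits


def grounded(answer_citations, retrieved_ids):
    """True only if the answer cites something, and only retrieved documents."""
    return bool(answer_citations) and set(answer_citations) <= set(retrieved_ids)
```

A support user searching for "pricing" gets no hits here, so the model never has the restricted text to summarize, and an answer that cites `doc-2` against a result set containing only `doc-1` fails the grounding check.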
5. Computer vision: replacing generic image understanding, augmenting specialized perception
Traditional image-recognition systems were often trained for narrow tasks: classify this object, detect this defect, segment this region, identify this product, or moderate this image.
Vision-language foundation models replace many generic image-understanding workflows because they can describe images, answer questions, compare images, interpret charts, understand screenshots, and reason over visual context. The shift is from "what label should this image get?" to "what does this visual input mean?"
But high-precision computer vision remains specialized. Autonomous driving perception, medical imaging, manufacturing inspection, AR/VR tracking, robotics, biometric verification, and safety-critical vision still require precision, latency, calibration, sensor fusion, and domain-specific validation.
My Practical Framework
Foundation models replace the brittle understanding layer. They augment the specialized optimization layer. They transform the interface layer. They still depend on the execution layer.
That is why both extremes miss the point.
Foundation models will not replace everything. But they are also not "just another model."
The next generation of AI products will not be built by choosing between foundation models and traditional AI. They will be built by knowing how to compose models, tools, data, policy, infrastructure, and workflow into systems users can trust.
The real product questions are: Which layer should the foundation model own? Where do product constraints — cost, latency, accuracy, quality, reliability, privacy, and safety — require choosing the right model for the job? Which layer needs a smaller task-specific model because it is faster, cheaper, more predictable, or easier to evaluate? Which layer must remain deterministic, auditable, fast, or governed? Where should the model reason, retrieve, call tools, defer, or stay out of the loop entirely?
The durable advantage will not come from using a foundation model everywhere. It will come from getting that composition right, with human judgment in the loop: for the right user, in the right workflow, with the right guardrails, given the product constraints.
These are my personal views, shaped by building AI systems across consumer voice assistants, clinical decision support, and digital presence. They do not represent the views of my current or past employers.