<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Hypothetical AI]]></title><description><![CDATA[Living Models for continually updated decision intelligence. Causal. Counterfactual. Actionable. Because correlation was never enough.]]></description><link>https://www.hypothetical.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!2dRK!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9bade22-3468-4b85-a9ce-3eff9a85b005_1280x1280.png</url><title>Hypothetical AI</title><link>https://www.hypothetical.ai</link></image><generator>Substack</generator><lastBuildDate>Thu, 30 Apr 2026 14:06:34 GMT</lastBuildDate><atom:link href="https://www.hypothetical.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Hypothetical]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[hypotheticalai@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[hypotheticalai@substack.com]]></itunes:email><itunes:name><![CDATA[Hypothetical]]></itunes:name></itunes:owner><itunes:author><![CDATA[Hypothetical]]></itunes:author><googleplay:owner><![CDATA[hypotheticalai@substack.com]]></googleplay:owner><googleplay:email><![CDATA[hypotheticalai@substack.com]]></googleplay:email><googleplay:author><![CDATA[Hypothetical]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Can't We Do This Ourselves?]]></title><description><![CDATA[Can You Handle the Truth?]]></description><link>https://www.hypothetical.ai/p/cant-we-do-this-ourselves</link><guid isPermaLink="false">https://www.hypothetical.ai/p/cant-we-do-this-ourselves</guid><dc:creator><![CDATA[Chris 
Selland]]></dc:creator><pubDate>Sun, 29 Mar 2026 10:46:35 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the final act of <em>A Few Good Men</em>, Colonel Jessep stops shouting about honor and starts shouting about reality: <em>&#8220;You can&#8217;t handle the truth!&#8221;</em> In the modern C-suite, the problem isn&#8217;t usually that you <em>can&#8217;t</em> handle the truth; it&#8217;s that your organization is meticulously designed to make sure you never actually <em>hear</em> it. Every deck, every quarterly review, and every &#8220;strategic research&#8221; report arrives at your desk pre-filtered and pre-vetted to ensure no one&#8217;s career&#8212;or budget&#8212;is at risk.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" width="7360" height="4912" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:4912,&quot;width&quot;:7360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;white and black i am a good day print card&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="white and black i am a good day print card" title="white and black i am a good day print card" srcset="https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, 
https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1596927553760-00df5b1dfb4f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx0cnV0aHxlbnwwfHx8fDE3NzQ3Mjc5Mzd8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@michaelcarruth">Michael Carruth</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p>I recently sat down with a friend and colleague to explain the AI-native research model we&#8217;ve been building. He&#8217;s a veteran exec who has seen every &#8220;game-changing&#8221; tool from CRM to Big Data. He asked me the only question that matters:</p><p>&#8220;This is compelling. But can&#8217;t we just do this ourselves with an LLM?&#8221;</p><p>The short answer is: <strong>Yes.</strong> Technically, you can. You&#8217;ve probably already bought the seats. But the technical &#8220;can&#8221; is not the most important question - the strategic &#8220;should&#8221; is where you might set yourself up to get burned.</p><h4><strong>The Career Preservation Machine</strong></h4><p>When you ask your internal team to use AI to &#8220;research&#8221; a new market entry, develop a GTM strategy, or build out a product roadmap, you may think you&#8217;re commissioning an analysis - but you&#8217;re actually commissioning a mirror.</p><p>An AI assistant is &#8220;ultra-thick-skinned,&#8221; crazy-responsive, and obsessed with delivering exactly what the user wants. It can&#8217;t judge you&#8212;at least not yet. While this is a benefit for individual productivity, it can be a liability for organizational truth.</p><p>Your team operates under massive &#8220;conversion pressure&#8221;&#8212;the constant expectation that every interaction or strategy must advance a deal or prove a win. If you ask a department head to use an LLM to &#8220;analyze&#8221; their current strategy, there&#8217;s a high likelihood they will&#8212;subtly or overtly&#8212;prompt that machine to <strong>validate</strong> the choices they&#8217;ve already made.</p><p>The result is a high-speed echo chamber. 
The AI will <strong>mirror</strong> back a version of reality where the current strategy is the best - or only - logical path forward. It will provide the &#8220;marketing-optimized&#8221; language that sounds credible but lacks the &#8220;authentic expert voice&#8221; required for a hard reality check. It will tell the department head exactly what they need to hear to keep their budget - and their job - intact.</p><h4><strong>Analysis vs. Affirmation</strong></h4><p>There is a massive gap in most organizations between &#8220;analysis&#8221; and &#8220;action&#8221;. We&#8217;ve spent 30+ years trying to get a &#8220;360-degree view of the customer&#8221; only to find that most teams still operate in silos, using data to justify their own existence.</p><p>When you DIY your research with AI, you risk automating your existing biases. You are taking the <a href="https://www.christenseninstitute.org/book/the-innovators-dilemma/">Innovator&#8217;s Dilemma</a>&#8212;where management practices that made you successful eventually lead to your downfall&#8212;and putting it on high-speed rails.</p><ul><li><p><strong>The department head</strong> responsible for the roadmap will prompt the AI to make that roadmap look like a stroke of genius, even if the market has already moved.</p></li><li><p><strong>The GTM leader</strong> will use it to &#8220;prove&#8221; that their outreach isn&#8217;t &#8220;AI slop,&#8221; even when customers can smell the transactional &#8220;commission breath&#8221; from miles away.</p></li><li><p><strong>The project lead</strong> will give a round or two of feedback at most, call the job done, and hand you a report that makes their team look perfect.</p></li></ul><h4><strong>The Value of the &#8220;Tough Crowd&#8221;</strong></h4><p>Real strategic decisions require a &#8220;tough crowd.&#8221; They require an AI model that is prompted to be a &#8220;critic&#8221; that finds the &#8220;dead spots&#8221; and &#8220;toothless language&#8221; in 
your strategy - and that incorporates <em>all</em> the perspectives in the market, not just those inside your walls.</p><p>The reason an internal team struggles with this isn&#8217;t a lack of skill; it&#8217;s a lack of objectivity. Most AI beginners feel underwhelmed because they treat the AI like a human direct report&#8212;they don&#8217;t give enough feedback and they call the job done too early. The &#8220;best&#8221; answer becomes the one that confirms what they already believe. They aren&#8217;t &#8220;bossing the robot around&#8221; enough because they are afraid of what a 14th revision at 11:30 PM might reveal about their <em>own</em> performance.</p><p>AI should be used to compress your cycles&#8212;to take a process that used to take weeks of meetings and turn it into an hour of effective, unvarnished truth.</p><h4><strong>The Bottom Line</strong></h4><p>If you want an analysis that validates your path and tells you the plan is perfect, keep it in-house. Your team can give you exactly what you want to see&#8212;and thanks to AI, they can do it faster and with better grammar than ever before.</p><p>But if you want a reality check&#8212;if you want to know what the data looks like when it isn&#8217;t being massaged for a performance review&#8212;it&#8217;s time to step outside the echo chamber. You don&#8217;t need a static analysis that sits on a shelf. You need a <strong><a href="https://www.livingmodels.org/">living model</a></strong> that isn&#8217;t afraid to find the &#8220;burn&#8221; and iterate until the strategy is artistically honest. <br><br>Because at the end of the day, you don&#8217;t need an intern with an LLM. 
You need a partner who can handle the truth.</p><p>This is the problem we&#8217;re working on&#8212;stay tuned.</p>]]></content:encoded></item><item><title><![CDATA[The Causal Standard Workflow Is Backwards]]></title><description><![CDATA[Who builds the causal model of a system controls the decisions the system supports. The standard workflow gives that power to the wrong people.]]></description><link>https://www.hypothetical.ai/p/the-causal-standard-workflow-is-backwards</link><guid isPermaLink="false">https://www.hypothetical.ai/p/the-causal-standard-workflow-is-backwards</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Sat, 21 Mar 2026 20:00:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!drqk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!drqk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!drqk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!drqk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png 848w, 
https://substackcdn.com/image/fetch/$s_!drqk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!drqk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!drqk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:669812,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.hypothetical.ai/i/191702790?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!drqk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png 424w, 
https://substackcdn.com/image/fetch/$s_!drqk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!drqk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!drqk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48d33c97-3793-4c3a-bd71-59546f1c38e9_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Every causal model of an organization is also a political document. It decides which variables appear in the decision analysis, which interventions are worth simulating, which levers are visible to the executives who act on it. In virtually every organization deploying causal AI today, that document is written by statisticians &#8212; and the domain expert is brought in at the end to say whether it looks right.</p><p>This is the wrong order. And the reason it&#8217;s wrong isn&#8217;t procedural. It&#8217;s mathematical.</p><div><hr></div><h2>The Statistician Who Doesn&#8217;t Know What They&#8217;re Doing</h2><p>Human domain knowledge is not a convenience that speeds up causal discovery. It is mathematically required to orient the edges that data cannot touch.</p><p>Here is the structural fact behind that claim. Multiple distinct causal structures can be perfectly consistent with the same statistical data. Three graphs &#8212; a chain where A causes B causes C, the reversed chain where C causes B causes A, and a fork where B causes both A and C &#8212; are statistically indistinguishable from observational evidence alone. (The collider, where A and C both cause B, is the exception: it implies a different independence pattern, which is exactly what discovery algorithms exploit.) The classical discovery algorithms &#8212; PC, GES &#8212; return this ambiguity honestly. They output what is called a Completed Partially Directed Acyclic Graph: directed edges where the data can say something, undirected edges where it cannot. For a graph with ten variables, the number of structures consistent with any given observational dataset can run into the thousands.</p><p>What resolves the undirected edges is domain knowledge. The expert who says &#8220;I know from operating this system for fifteen years that Y causes Z, not the other way around&#8221; is providing something no dataset contains. Not because the dataset is too small. Not because the algorithm is insufficiently sophisticated. 
Because the data cannot answer the question being asked.</p><p>The standard workflow takes this mathematical fact and systematically ignores it. A data science team builds a causal model. They identify variables, make structural assumptions, run the algorithms. Then they bring the model to a domain expert and ask them to validate it. Does this look right? Does this match your understanding of how the system works?</p><p>The expert does not build the model. They review it. This is not a small distinction.</p><p>A causal model is a political document because models are not neutral &#8212; a causal model of a business is a formal statement about what causes what in that business. It determines which interventions are worth simulating, which variables appear in the decision analysis, which levers are visible to the executives who act on the model&#8217;s outputs. The choice of causal structure is, in the most literal sense, the choice of what the organization can see and what it cannot.</p><p>The statistician who builds the model sets the boundaries of the organization&#8217;s causal imagination. Most of them don&#8217;t know that&#8217;s what they&#8217;re doing.</p><p>In the standard workflow, that choice &#8212; by default, unremarked, structurally embedded in the order of operations &#8212; belongs to the person who is furthest from the system being modeled. The domain expert who has spent twenty years inside a specific business knows which edges are real and which are correlation artifacts, the feedback loops the literature doesn&#8217;t capture, the confounders that never appear in any dataset because they were never measured, the structural features that shift when you intervene rather than observe. Asking them to validate a model the statistician built is asking them to check someone else&#8217;s map of their own territory.</p><p>A model that excludes a specific variable cannot tell you anything about that variable&#8217;s effect. 
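</p><p>That equivalence is easy to check numerically. The sketch below is a toy illustration, not anything from the product: assuming only numpy and arbitrary coefficients, it simulates a chain A&#8594;B&#8594;C and a fork A&#8592;B&#8594;C, then measures the A&#8211;C correlation before and after conditioning on B. Both graphs leave the same statistical fingerprint: correlated marginally, uncorrelated once B is held fixed.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def partial_corr_ac_given_b(x):
    """Correlation of A and C with B partialled out (standard recursion formula)."""
    r = np.corrcoef(x.T)
    r_ab, r_ac, r_bc = r[0, 1], r[0, 2], r[1, 2]
    return (r_ac - r_ab * r_bc) / np.sqrt((1 - r_ab**2) * (1 - r_bc**2))

# Chain: A -> B -> C
a = rng.normal(size=n)
b = 0.8 * a + rng.normal(size=n)
c = 0.8 * b + rng.normal(size=n)
chain = np.column_stack([a, b, c])

# Fork: A <- B -> C  (B is a common cause of A and C)
b2 = rng.normal(size=n)
a2 = 0.8 * b2 + rng.normal(size=n)
c2 = 0.8 * b2 + rng.normal(size=n)
fork = np.column_stack([a2, b2, c2])

for name, x in [("chain", chain), ("fork", fork)]:
    marginal = np.corrcoef(x[:, 0], x[:, 2])[0, 1]
    conditional = partial_corr_ac_given_b(x)
    # Same signature for both graphs: A and C correlated marginally,
    # (near-)uncorrelated once B is held fixed.
    print(f"{name}: corr(A,C)={marginal:.3f}  corr(A,C|B)={conditional:.3f}")
```

<p>No amount of observational data of this kind separates the two structures; the expert who knows which way the arrow points is supplying information the dataset does not contain.</p><p>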
A model that misspecifies an edge direction cannot correctly simulate the intervention that reverses it. Wrong model confidence &#8212; the kind Double Machine Learning produces when the causal structure is misspecified &#8212; is more dangerous than no model. It gives you precise wrongness with tight confidence intervals.</p><p>The standard workflow is backwards. The question is what the correct workflow looks like, and why it has taken this long to build.</p><div><hr></div><h2>Here Is the Thing That Should Bother You: The Components Exist</h2><p>The Knowledge Acquisition Tool is the name for the workflow that doesn&#8217;t yet exist as an integrated product &#8212; the conversational system designed to extract the causal knowledge an expert holds implicitly and convert it into a first-draft directed acyclic graph that a statistician can refine but does not need to originate.</p><p>The components required to build it are all present. What has not happened is the assembly.</p><p>CausalChat-class interfaces demonstrate that LLMs can maintain structured conversational context across a long interview without losing the thread. The Sheffield Elicitation Framework and the IDEA protocol document how expert beliefs can be made explicit, quantified, and calibrated without requiring the expert to learn probability theory. Analysis of Competing Hypotheses shows how confirmation bias can be intercepted before it propagates through a structure &#8212; not by correcting experts after the fact, but by changing the procedure so that disconfirmation is built into the interview sequence.</p><p>The FinCARE result closes the empirical case: integrating a hybrid of LLMs, a knowledge graph derived from SEC filings, and NOTEARS improved graph recovery accuracy from an F1 score of 0.163 to 0.759. The human knowledge is not decorating the algorithm. 
It is doing the structural work the algorithm cannot do alone.</p><p>The components have not been assembled because assembling them requires solving a problem that spans two communities that rarely talk to each other. Knowledge Engineering has spent two decades developing elicitation protocols &#8212; structured questioning, calibrated linguistic probability mapping, spiral development, bias interception. Causal inference has developed the algorithmic machinery &#8212; NOTEARS, PC, FCI, Double Machine Learning. Neither community has built the pipeline because neither community is primarily interested in the strategy executive who needs to originate a causal model in forty-five minutes without learning what a CPDAG is.</p><p>That executive is the tool&#8217;s actual user. The gap is not in the theory. It is in the design of the workflow for the person who holds the knowledge but not the mathematical language to express it.</p><div><hr></div><h2>What the Tool Actually Does</h2><p>The Knowledge Acquisition Tool runs in four phases. Understanding them matters not because the technical architecture is what&#8217;s interesting, but because each phase resolves a specific failure mode in the standard workflow &#8212; the failure mode of asking the wrong person to make the structural decisions.</p><p>Variable confirmation comes first. The system presents a candidate list derived from domain literature, and the expert curates it. Confirm, reject, rename. Ten minutes. The expert is not generating variables from scratch; they are editing a draft. This is by design &#8212; current LLMs are reliable at literature synthesis, surfacing plausible candidates the expert can evaluate without pressure to recall.</p><p>Edge elicitation follows. 
The system presents candidate relationships in temporal language rather than causal language, because experts orient edges more reliably when asked &#8220;which comes first?&#8221; than when asked &#8220;which causes which?&#8221; Cycles are flagged not by telling the expert their model is mathematically invalid, but by asking a temporal resolution question: &#8220;Over a single quarter, which tends to move first?&#8221; The expert&#8217;s intuition is preserved. The graph&#8217;s formal requirements are satisfied.</p><p>Interventional disambiguation targets the highest-stakes undirected edges &#8212; the ones the data cannot orient. The question that resolves Markov equivalence is interventional: &#8220;If you held B constant through external intervention &#8212; locked it &#8212; would changes in A still be associated with changes in C?&#8221; The expert does not need to understand why this question resolves the structural ambiguity. They only need to be able to answer it. That is the design principle throughout: the expert&#8217;s job is to contribute knowledge, not to learn mathematics.</p><p>Confidence calibration closes the session. Reference-class questions, correction function applied to subsequent estimates, rough probability distributions attached to the oriented edges. The output is not a publication-ready causal model. It is a first-draft graph sufficient to run basic counterfactual scenarios and identify which additional data collection would most reduce uncertainty.</p><p>Forty-five minutes. A first breath for the living model. 
Built by the person who understands the system &#8212; not the person who can specify the algorithm.</p><div><hr></div><h2>Who Builds the Model Builds the World</h2><p>A living model &#8212; causal, counterfactual, continuously updated, treatment-oriented &#8212; is an analytical system that can answer not just &#8220;what happened?&#8221; but &#8220;what would happen if we did this?&#8221; and &#8220;what would have happened if we had done that differently?&#8221; These are not refinements of the same question. They are categorically different operations, requiring machinery that the standard analytics stack does not contain.</p><p>The causal structure the Knowledge Acquisition Tool produces is the prerequisite for all of it. Without a correctly specified causal graph, the entire downstream apparatus &#8212; the interventional simulations, the counterfactual analysis, the ranked treatment recommendations &#8212; operates on a foundation built by people who are furthest from the system being modeled.</p><p>This is not a technical argument about algorithm performance. It is an argument about organizational epistemology: what an enterprise is capable of knowing, and who decides. The boundary of the causal model is the boundary of the organization&#8217;s decision intelligence. Whoever draws that boundary controls what the organization can see &#8212; not through manipulation, but through the structural fact that unasked questions produce no answers.</p><p>The standard workflow gives that authorship to statisticians. It does so not through malice but through the default order of operations: build, then validate. The Knowledge Acquisition Tool inverts that order: extract, then formalize. The statistician&#8217;s role shifts from author to editor. The domain expert moves from reviewer to originator.</p><p>The bridge between expert knowledge and formal causal structure has needed building for two decades. The components are here. 
The assembly is the project.</p><div><hr></div><p><em>This is part of the Living Models series on causal intelligence for organizational decision-making. If you&#8217;re building a causal modeling project &#8212; or evaluating one &#8212; and want to talk about where the workflow breaks down in practice, reply or leave a comment. The case studies are being built now. If this is the problem your organization is sitting on, this is the project to watch.</em></p><div><hr></div><p><strong>Tags:</strong> causal AI, organizational decision-making, knowledge elicitation, causal inference, Living Models</p>]]></content:encoded></item><item><title><![CDATA[The New Navigator: Building a Next-Gen Industry Analyst]]></title><description><![CDATA[Finding the connections humans may miss, and charting the path forward as you take it]]></description><link>https://www.hypothetical.ai/p/the-new-navigator-building-a-next</link><guid isPermaLink="false">https://www.hypothetical.ai/p/the-new-navigator-building-a-next</guid><dc:creator><![CDATA[Chris Selland]]></dc:creator><pubDate>Thu, 19 Mar 2026 09:30:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2dRK!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9bade22-3468-4b85-a9ce-3eff9a85b005_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The traditional research reports you&#8217;ve relied on for decades are essentially history books.</p><p>When you buy a report from a legacy firm, you aren&#8217;t buying a look at the present. You&#8217;re buying a polished snapshot of the world as it existed months ago. By the time that report is formatted, legal-cleared, and hitches a ride to your inbox, the market it describes has already evolved.</p><p>Nik and I have been talking about this gap for some time. 
If you&#8217;re a tech investor or executive, a static PDF that tells you what happened last quarter&#8212;or last year&#8212;is of limited usefulness. What you actually need is a living map&#8212;and more importantly, an engine that can find the hidden connections between data points that a human analyst, no matter how seasoned, might simply miss.</p><p>That&#8217;s why we&#8217;re building a new kind of industry analyst. It isn&#8217;t a person, and it isn&#8217;t a chatbot. It&#8217;s an <strong>autonomous synthesis engine</strong>.</p><h3>The Latency Problem</h3><p>To understand why we&#8217;re doing this, you have to look at how traditional decision support is manufactured.</p><p>A human analyst team conducts interviews, looks at earnings calls, takes briefings, and aggregates survey data. This is &#8220;batch processing.&#8221; It&#8217;s slow, it&#8217;s prone to individual bias, and it creates a massive &#8220;insight debt.&#8221; You end up making $10M platform decisions based on data that has already been priced into the market, and that doesn&#8217;t consider the actions other market players may take.</p><p>Our approach shifts the model from batch to <strong>streaming</strong>.</p><p>We&#8217;re building a system that ingests everything from GitHub commit velocity and developer sentiment to obscure regulatory filings and supply chain shifts. But the feed itself is just table stakes. The real value is in the <strong>architecture of the connections.</strong></p><h3>Beyond Keywords: The Connection Engine</h3><p>Think of a traditional analyst firm like a filing cabinet. There&#8217;s a folder for &#8220;Cloud Infrastructure,&#8221; a folder for &#8220;Cybersecurity,&#8221; and a folder for &#8220;AI Talent.&#8221; In that world, the folders rarely cross paths.</p><p>But the market doesn&#8217;t live in folders. The engine we are building operates like a neural network.
It doesn&#8217;t just see that a company is hiring AI engineers; it connects that hiring surge to a specific change in open-source library usage, a shift in regional energy costs, and a subtle pivot in a competitor&#8217;s patent filings.</p><p>It finds the &#8220;connective tissue&#8221; between disparate datasets that seem unrelated on the surface. When you ask it a question about market positioning, it doesn&#8217;t give you a canned summary. It builds a unique answer by tracing those threads across the entire ecosystem. It&#8217;s the difference between looking at a photograph of a forest and actually walking through the trees.</p><h3>Analogy: The Global Positioning System</h3><p>If a traditional industry analyst is a paper atlas&#8212;beautifully drawn and authoritative, but fundamentally static&#8212;what we are building is GPS.</p><p>A paper map can tell you where the roads are, but it can&#8217;t tell you that there&#8217;s a pile-up two miles ahead or that a new shortcut just opened up because of a change in local traffic patterns. GPS works because it is <strong>constantly triangulating</strong> between multiple satellites to give you a &#8220;you are here&#8221; marker that moves with you&#8212;and a view of the clearest path forward to where you need to go.</p><p>We are triangulating between data streams to give you a market position that updates as fast as the industry moves.</p><h3>What This Means for Hypothetical.ai</h3><p>Since Nik invited me to contribute here, this is the lens we&#8217;ll be applying. We aren&#8217;t here to give you more &#8220;content&#8221; to scroll through. We&#8217;re here to provide the synthesis that helps you make sense of the noise.</p><p>In the coming weeks, we&#8217;ll be showing you exactly how this engine thinks, and opening it up for use and comment. 
We&#8217;ll be surfacing insights that don&#8217;t fit into the standard boxes of industry analysis because, frankly, the old boxes are broken.</p><p>We&#8217;re moving away from the era of the &#8220;Expert Opinion&#8221; and into the era of <strong>Autonomous Insight</strong>. I&#8217;m excited to be here to help build the map, and to guide and advise your path forward.</p>]]></content:encoded></item><item><title><![CDATA[Chapter 1: The Dashboard That Lied]]></title><description><![CDATA[The Anatomy of a Descriptive Analytics Failure]]></description><link>https://www.hypothetical.ai/p/chapter-1-the-dashboard-that-lied</link><guid isPermaLink="false">https://www.hypothetical.ai/p/chapter-1-the-dashboard-that-lied</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Tue, 17 Mar 2026 00:18:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BSzH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92e6d2e-dd45-493c-a171-2842f5966bd5_1920x1080.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><em>Note: This chapter, images and video were created by little tools that I wrote to generate a textbook chapter and little video from lecture notes.</em></p><p><em>FOR DETAILS ON LIVING MODELS: </em><a href="https://www.hypothetical.ai">Hypothetical AI</a></p><p>On a Tuesday morning in the third quarter, a senior data team at a major digital platform gathered around a conference room screen to review the weekly metrics. The dashboard showed exactly what everyone had hoped to see: a clean, upward-sloping line, Weekly Active Users climbing from 2.1 million to 2.5 million &#8212; a 19 percent increase that the visualization rendered in a satisfying shade of green. Leadership left the meeting energized. Growth strategies were reaffirmed. A hiring plan was accelerated.
The chart was screenshot and dropped into an investor deck.</p><p>None of it was real.</p><p>A junior analyst, running a routine data quality check four days later, discovered that the European user dimension table had experienced a partial refresh failure the previous Thursday. Approximately 400,000 user profile records had quietly vanished from the reporting pipeline. The 400,000 users did not appear as absent &#8212; they did not generate an error message or a null value or a red flag on the visualization. They simply ceased to exist, as far as the reporting system was concerned. The denominator shrank. The ratio climbed. The dashboard had not lied in the way a fraudster lies. It had lied the way a measuring instrument lies when its reference point drifts: precisely, consistently, and in a direction that felt like good news.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!BSzH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92e6d2e-dd45-493c-a171-2842f5966bd5_1920x1080.png" width="1456" height="819" alt=""></figure></div><p>The team&#8217;s immediate instinct, once the failure was identified, was to classify it as a technical problem: a data
pipeline issue with a data pipeline fix. And they were right, as far as they went. The fix was implemented. The alert was added. The architecture was made more robust. But the senior analyst who led the investigation noticed something that troubled her more than the pipeline failure itself. In the four days between the bad Thursday and the good Wednesday, no one had asked whether the data was reliable. The number had looked right, so it had been treated as right. The dashboard&#8217;s authority had been borrowed from its appearance of precision, not from any demonstrated correspondence to the world it claimed to describe.</p><p>This is not a story about a database query. It is a story about what an organization believed it was entitled to know &#8212; and how that belief, left unexamined, became the mechanism of its own deception.</p><div><hr></div><h2>The Question That Changes Everything</h2><p>In 2012, J.C. Penney&#8217;s incoming CEO Ron Johnson faced a different problem with the same underlying structure. 
Where the digital platform team had trusted a number that was technically false, Johnson trusted a number that was technically true &#8212; and drew from it an inference the data was structurally incapable of supporting.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!JMC5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677faaef-073b-40d1-abe2-3807806fc3e3_1920x1080.png" width="1456" height="819" alt=""></figure></div><p>The observation was real: J.C. Penney&#8217;s promotional pricing events were correlated with revenue spikes, followed by sluggish baseline sales between events. The inference Johnson drew was that the promotions were suppressing customers&#8217; willingness to pay at full price &#8212; that eliminating them would lift the baseline and simplify the customer experience. Within eighteen months, the company had lost $4.3 billion in annual revenue, and Johnson had been fired.</p><p>The number was correct. The inference was wrong. These are not the same failure, and understanding the difference between them is the reason this book exists.</p><p>Johnson had observed what statisticians call a conditional distribution: the pattern of revenue given that promotional events were present in the historical record.
He used it to predict what would happen if he eliminated those events by decision. The first is an observation. The second is the result of an intervention. The gap between them is not a matter of analytical sophistication or sample size or model refinement. It is a categorical distinction at the foundation of causal reasoning &#8212; and it is a gap that no amount of additional historical data can close.</p><p>The distinction has a precise mathematical form. P(Y | X) is a conditional probability: the probability of outcome Y given that we observe condition X in the data. It describes what tends to co-occur in the historical record. P(Y | do(X)) is an interventional probability, and the do(&#183;) operator &#8212; introduced by the mathematician and computer scientist Judea Pearl &#8212; is doing precise conceptual work. The do operator represents deliberate manipulation: not observing that X is present in the world, but actively setting X to a value by action. When Johnson eliminated promotions, he was not observing a world in which promotions happened to be absent. He was making them absent. That is a do. And the historical data, which recorded only worlds in which J.C. 
Penney had always run promotions, had nothing to say about what a do would produce.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!8NVf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8922f9ed-3314-4a56-b888-9ecea8c74dbb_1906x1080.png" width="1456" height="825" alt=""></figure></div><p>What the historical data could not reveal &#8212; what it structurally could not reveal &#8212; was that J.C. Penney&#8217;s customers did not experience promotional pricing as a distortion of their true preference. For a significant portion of the customer base, the promotional event was the experience. The hunt for the deal, the satisfaction of the markdown, the social performance of having paid less than full price: these were not friction in the system. They were the system. Eliminating the promotions did not reveal latent demand for everyday low prices. It destroyed the mechanism through which customers had been choosing to shop at all. The causal structure of customer behavior was simply not visible in the observational record.
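</p><p><em>A minimal sketch of the observe-versus-do gap (a toy model of my own construction, for illustration; it is not from the chapter): a hidden confounder Z drives both X and Y, so estimating P(Y | X=1) by conditioning and estimating P(Y | do(X=1)) by forcing X give different answers.</em></p>

```python
import random

random.seed(0)

def simulate(intervene_x=None, n=100_000):
    """Estimate P(Y=1 among samples with X=1) in a toy confounded world.

    Z is a hidden factor that raises both X and Y. With intervene_x=None
    we observe X as Z generates it (conditioning). With intervene_x=True
    we force X on regardless of Z: Pearl's do-operator, which cuts the
    Z -> X edge.
    """
    ys_with_x1 = []
    for _ in range(n):
        z = random.random() < 0.5                      # hidden confounder
        if intervene_x is None:
            x = random.random() < (0.9 if z else 0.1)  # Z drives X
        else:
            x = intervene_x                            # do(X): ignore Z
        p_y = 0.1 + 0.6 * z + 0.1 * x                  # Y: mostly Z, a little X
        y = random.random() < p_y
        if x:
            ys_with_x1.append(y)
    return sum(ys_with_x1) / len(ys_with_x1)

p_obs = simulate()                  # estimates P(Y=1 | X=1)
p_do = simulate(intervene_x=True)   # estimates P(Y=1 | do(X=1))
print(f"P(Y|X=1)     ~ {p_obs:.2f}")
print(f"P(Y|do(X=1)) ~ {p_do:.2f}")
```

<p>In this toy world the conditional estimate lands near 0.74 while the interventional one lands near 0.50: conditioning on X=1 inherits Z&#8217;s influence on Y, while the do-operation severs it. That inherited influence is exactly what the promotional-pricing record could not disclose.</p><p>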
And because Johnson&#8217;s analytical framework had no language for the distinction between observing a world and making a world, he could not have known what he was missing.</p><p>This is the central epistemological divide that this book is built to cross. Every business decision of consequence is, at its core, a do question: not &#8220;what tends to happen when X is present in the data?&#8221; but &#8220;what would happen if we made X happen?&#8221; Descriptive and correlational methods can answer the first question. The architecture this book calls a Living Model is built to answer the second.</p><div><hr></div><h2>Pearl&#8217;s Ladder and the Structure of Causal Reasoning</h2><p>The P(Y | X) versus P(Y | do(X)) distinction is not an isolated concept. It is the first step on a three-rung hierarchy that Judea Pearl calls the Ladder of Causation &#8212; a framework that describes three qualitatively different classes of question, each requiring more powerful analytical machinery than the last, and none of which can be reached by accumulating more data at the rung below.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!hoKF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a3820-9218-42cb-bcd2-a8f6c72f75b3_1906x1080.png" width="1456" height="825" alt=""></figure></div><p>The first rung is association: what does the data show? Questions at this level take the form &#8220;what tends to happen when X is observed?&#8221; They are answerable by correlation, regression, and the full toolkit of descriptive statistics. Every dashboard ever built operates at this level. The Weekly Active Users (WAU) dashboard was a Rung One instrument. So was J.C. Penney&#8217;s pricing analysis. So is every A/B test result that reports lift without accounting for the causal structure it assumes. Association is indispensable. It is also, alone, insufficient.</p><p>The second rung is intervention: what would happen if we acted? These questions require the do-operator and the causal inference methods that give it operational meaning &#8212; directed graphs, structural equations, the identification criteria that tell us when an interventional effect can be estimated from observational data and when it cannot. This is the level at which Johnson&#8217;s decision should have been analyzed. It is the level at which most consequential organizational decisions live, and the level at which most organizational analytics cannot operate.</p><p>The third rung is counterfactual: what would have happened if things had been different? Counterfactuals require not just a causal model but a structural causal model &#8212; a mathematical object that encodes the mechanisms of the world with enough precision to reason about individual-level outcomes in worlds that never existed. &#8220;Would J.C. Penney have retained customers if it had phased out promotions more gradually?&#8221; is a counterfactual. &#8220;Would this patient have survived if we had given the other treatment?&#8221; is a counterfactual.
These are the hardest and most valuable questions in decision analytics.</p><p>The hierarchy has one property that makes it unlike a progression of technical skills: no rung is reachable by accumulating more data, more compute, or more analytical sophistication at the rung below. This bears repeating because it runs against the grain of how most data organizations have been built. A team with a thousand-row dataset and a structural causal model can answer questions that a team with a billion-row dataset and a correlation engine cannot. The Ladder describes not a gradient of difficulty but a series of categorical shifts in what kind of question is even being asked. J.C. Penney did not need more historical transaction data. It needed a different kind of instrument.</p><p>This book is a sustained ascent of that Ladder. The current chapter locates the problem at the first rung: it documents what association-level analytics can do, what it cannot do, and what organizational damage results from conflating the two. The chapters in Part One map the broader failure modes of analytics that never leaves the first rung. Part Two builds the mathematical foundations of the second and third rungs. Part Three describes the Living Model &#8212; the analytical architecture that this book is building toward, designed from the ground up to operate at the interventional and counterfactual levels, continuously updated as new data arrives, oriented toward decisions rather than descriptions. The Ladder is the book&#8217;s spine. You will encounter it again.</p><div><hr></div><h2>The Anatomy of a Silent Failure</h2><p>Both the WAU dashboard and J.C. Penney&#8217;s pricing decision share a structural feature worth naming precisely: the failure was invisible at the surface level. The visualization worked. The SQL was valid. The transaction data was real. In neither case did any component of the analytical system announce that something had gone wrong. 
The failure was not in any single instrument; it was in what each instrument was incapable of seeing about itself.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ymkx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ymkx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png 424w, https://substackcdn.com/image/fetch/$s_!ymkx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png 848w, https://substackcdn.com/image/fetch/$s_!ymkx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!ymkx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ymkx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png" width="1456" height="825" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1883883,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.hypothetical.ai/i/191160581?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ymkx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png 424w, https://substackcdn.com/image/fetch/$s_!ymkx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png 848w, https://substackcdn.com/image/fetch/$s_!ymkx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!ymkx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725cbd33-9098-4fd5-9d13-89a36ebf7a0a_1906x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This defines what might be called the silent failure mode of first-rung analytics. A system that crashes announces itself. A system that quietly changes what it measures &#8212; or that was never measuring what its users believed it was measuring &#8212; does not. The WAU dashboard&#8217;s silence was mechanical: the missing records generated no error state because the pipeline&#8217;s architecture treated absence as the absence of absence. J.C. Penney&#8217;s analytical silence was epistemic: no component of the inference chain was wrong, and yet the inference itself was catastrophically in error, because the framework contained no mechanism for distinguishing observation from intervention.</p><p>The organizational cost of silent failures extends beyond the immediate decision. 
The WAU audit could establish what had gone wrong in the current reporting cycle, but it could not retroactively certify the integrity of the historical record. No one could answer the question of how many prior decisions had been made on data that was silently incomplete. J.C. Penney&#8217;s postmortem could reconstruct the inferential error, but it could not recover $4.3 billion or the institutional trust of a customer base that had been told, by a year of pricing policy, that their relationship with the brand was being renegotiated without their consent. Trust broken by a silent failure does not snap cleanly back &#8212; not because the failure was malicious, but because it was invisible for so long that the recovery itself becomes evidence that the visibility problem persists.</p><p>The corrective posture &#8212; for both the technical and the epistemic versions of the same failure &#8212; is what the field of data engineering calls observability: not a monitoring system bolted onto an analytics stack, but a measurement architecture that treats its own integrity as a first-class output. An observable analytics system does not just tell you what is happening to your users. It tells you what is happening to itself. An observable causal inference framework does not just tell you what the data shows. It tells you which questions the data is structurally capable of answering, and which it is not.</p><p>This is the orientation this book attempts to install. The remainder of Part One documents the failure modes in detail. Part Two builds the mathematical apparatus that makes the distinction between association and intervention computable rather than merely philosophical. 
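<p>A sketch of what that apparatus computes, using a toy version of the loyalty-program case discussed later in this chapter (every quantity here is invented for illustration): a hidden confounder makes the observational gap several times larger than the interventional one.</p>

```python
import random
from statistics import mean

random.seed(0)

def draw(intervene=None):
    """One unit from a toy confounded system (all numbers hypothetical)."""
    propensity = random.random()          # hidden confounder: taste for the product
    joins = (random.random() < propensity) if intervene is None else intervene
    spend = 50 + 100 * propensity + 10 * joins + random.gauss(0, 5)
    return joins, spend

# Rung one, association: condition on what we happened to observe.
obs = [draw() for _ in range(100_000)]
obs_gap = mean(s for j, s in obs if j) - mean(s for j, s in obs if not j)

# Rung two, intervention: set membership by fiat, severing self-selection.
do_gap = (mean(draw(True)[1] for _ in range(100_000))
          - mean(draw(False)[1] for _ in range(100_000)))

print(f"P(Y|X):     gap ~ {obs_gap:.1f}")   # ~43: mostly confounding
print(f"P(Y|do(X)): gap ~ {do_gap:.1f}")    # ~10: the true program effect
```

<p>Conditioning on observed membership inherits the self-selection of high spenders; intervening severs it. The difference between the two printed numbers is the difference between the first and second rungs, computed rather than merely asserted.</p>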
Part Three shows how to build a system that carries that distinction forward into organizational decision-making, continuously, as a structural property rather than an occasional analytical exercise.</p><div><hr></div><h2>The Four Rungs of Organizational Analytics Maturity</h2><p>The theoretical Ladder of Causation has a practical counterpart in the analytics capabilities of organizations. Most enterprises move through four recognizable stages of analytical maturity, and understanding where an organization currently sits is the prerequisite for understanding what its data can and cannot tell it.</p><p>The foundational stage is descriptive analytics, which answers the question: what happened? The tools are dashboards, aggregation queries, and visualization platforms. The mindset is archival. An organization at this stage may have beautiful, interactive, real-time visualizations &#8212; and yet it is looking backward, at association-level data. The specific vulnerability of this stage is that it has no mechanism for distinguishing between a true signal and an artifact of its own measurement process. A dashboard cannot ask whether its own output is reliable. It can only display what the pipeline returns.</p><p>The diagnostic stage adds the question: why did it happen? This requires moving from correlation to causal mapping &#8212; tracing the structural drivers of observed patterns rather than simply documenting the patterns themselves. A diagnostic analyst looking at a margin squeeze does not just record that margins fell in Q3. She asks whether the compression came from procurement price increases, a shift in product mix, labor cost inflation, or pricing decisions made in response to competitive pressure. Each of these explanations implies a different intervention. Treating them as interchangeable is the descriptive stage error applied to management decisions. 
Diagnostic maturity requires that data not sit in silos &#8212; when the cost data, the pricing data, the headcount data, and the procurement data live in separate systems with no shared ontology, the causal mapping that this stage requires is simply not possible. Organizations fail here not because they lack analytical talent but because their data architecture was built for record-keeping, and no one has built the bridges.</p><p>The predictive stage asks: what will happen? This is where machine learning, time-series modeling, and statistical forecasting live. It is also where the relationship between models and the world becomes most consequential and most fragile. A predictive model is an assumption that the statistical relationships present in historical data will persist into the future. When that assumption holds, predictive models are extraordinarily powerful. When the world changes &#8212; when a pandemic restructures consumer behavior, when a competitor&#8217;s collapse redistributes market share overnight &#8212; models trained on the old world continue to predict the old world&#8217;s future. They are not aware of their own staleness. This phenomenon, known as concept drift, is the predictive stage&#8217;s version of the silent failure: the model continues to produce outputs; those outputs no longer reflect reality; the model does not announce the change.</p><p>The prescriptive stage asks: what should we do? At this level, analytics is integrated directly into operational decision-making. A prescriptive procurement system does not generate a vendor risk score for a human to review next week &#8212; it monitors vendor performance in real time, evaluates risk against a continuously updated threshold, and triggers a specific downstream action within a defined response window. But prescriptive analytics carries a specific and underappreciated danger: speed without a governor. 
On August 1, 2012, Knight Capital Group deployed new automated trading software that began executing erroneous orders at algorithmic speed. In 45 minutes, the firm lost $440 million because no human decision node existed between the system&#8217;s execution loop and the market. The system was working exactly as designed. The design had no provision for stopping.</p><p>The governance lesson is structural, not attitudinal. A prescriptive system can pause a purchase order automatically, but only a human can permanently terminate a supplier relationship. It can flag an anomaly and suppress a transaction, but only a human can authorize the broader strategic response. The boundary between what the system decides and what the human decides must be explicit, documented, and enforced &#8212; not as a bureaucratic checkpoint, but as the architectural feature that distinguishes a decision-support system from an autonomous agent with no accountability surface.</p><p>Critically, even a fully realized prescriptive system falls short of what consequential decisions actually require if its recommendations are derived from observed correlations rather than estimated causal effects. A prescriptive engine that recommends the highest-ranked action from a historical association model will fail in deployment for the same reason J.C. Penney&#8217;s pricing strategy failed: it is measuring what tends to co-occur, not what would happen under deliberate intervention. The organizational maturity ladder and Pearl&#8217;s Ladder are parallel climbs. An organization can reach the prescriptive stage and still be operating entirely on the first rung of causation. Recognizing that possibility is the beginning of building something better.</p><div><hr></div><h2>Why Most Enterprises Stay at the First Stage</h2><p>The distribution of organizations across these four stages is not primarily a function of technical capability or budget. Advanced modeling tools are widely available.
Data engineering talent, while expensive, can be hired. The barriers that keep most enterprises at the descriptive stage are organizational and cultural, and they are more difficult to dismantle than any technical debt.</p><p>The first barrier is incentive structure. Descriptive analytics produces reports. Reports are legible, shareable, and defensible. A dashboard screenshot can accompany an executive presentation without requiring anyone to commit to an interpretation. A diagnostic finding &#8212; the margin compression is driven by a pricing decision made in response to a competitor move, not by raw material cost inflation, which means the standard cost-reduction response will fail &#8212; is harder to communicate and harder to act on. It implicates decisions made by specific people. It requires those people to change course. The descriptive stage is politically convenient in a way that causal analysis is not, and organizational incentive structures tend to reward the politically convenient.</p><p>The second barrier is data architecture. Organizations accumulate data systems the way cities accumulate infrastructure: opportunistically, incrementally, without overall design. An ERP system purchased in 2009, a CRM platform added in 2014, a marketing automation tool licensed in 2018 &#8212; each was selected to solve a specific operational problem and stores its data in a format optimized for its own purposes. The causal maps that diagnostic work requires, the feature stores that predictive modeling requires, the automated decision pipelines that prescriptive work requires: all of these demand a unified data model that most organizations have never built because they were never designed for integrated analytics.</p><p>The third barrier is the comfort of the lagging indicator. A dashboard showing last week&#8217;s revenue, last month&#8217;s churn rate, last quarter&#8217;s customer acquisition cost is a record of what has already happened. 
It cannot be wrong in the way a forecast can be wrong. It does not require anyone to commit to a prediction and be held accountable if that prediction fails. For organizations whose reporting culture is built around the safety of the historical record, the move toward predictive and prescriptive analytics represents an acceptance of the risk of being visibly, attributably wrong &#8212; a fundamentally different relationship with uncertainty than the backward-looking dashboard affords.</p><p>These three barriers interact. An organization that lacks diagnostic capability cannot validate the features its predictive models require. An organization that cannot predict cannot optimize. An organization whose leadership is rewarded for reporting last quarter&#8217;s results has no structural incentive to invest in the capabilities that would allow it to influence next quarter&#8217;s. The four stages are not just a technical progression &#8212; they describe a theory of organizational epistemology: what an enterprise believes it is entitled to know, and how much risk it is willing to accept in the pursuit of knowing it.</p><div><hr></div><h2>Living Models: The Destination</h2><p>The analytics maturity stages describe the organizational capability required to move from description to prescription. But even a fully realized prescriptive system, as defined above, falls short of what consequential decisions actually require. A prescriptive system that derives its recommendations from observed correlations rather than estimated causal effects will recommend interventions that look effective in historical data and fail in deployment &#8212; for the same reason that J.C. Penney&#8217;s pricing strategy looked defensible in the cross-sectional data and catastrophic in execution. The system was measuring association. The decision required intervention.</p><p>A Living Model is the analytical architecture this book is building toward. 
The term has a precise meaning.</p><p>A Living Model is causal: its structure encodes mechanisms, not correlations, and its recommendations are expressed as estimated interventional effects &#8212; P(Y | do(X)) &#8212; not as associations observed in the historical record. It is counterfactual: it can reason about what would have happened under conditions that did not occur, which means it can evaluate the cost of a decision not taken as rigorously as the benefit of a decision that was. It is continually updated: it maintains a live connection between new incoming data and the parameters of its causal model, so that the estimates it produces reflect current conditions rather than the statistical properties of a training set assembled at some prior date. And it is treatment-oriented: its output is not a description or a prediction but a ranked list of interventions, evaluated by their expected causal effect under the organization&#8217;s current constraints.</p><p>A dashboard is none of these things. A predictive model is the third of these things and none of the other three. A prescriptive system may approximate the fourth while still lacking the first two. A Living Model is not an upgrade to existing analytics infrastructure. It is a different kind of analytical object, built from different foundations, asking different questions. Those foundations are the subject of this book.</p><p>Return, briefly, to the conference room and the upward-sloping green line. The team that celebrated the 18 percent WAU increase did not make an unreasonable inference from the data they had. Given what the dashboard showed, growth was the sensible interpretation. The failure was not one of analytical incompetence. It was architectural: the team trusted that what the dashboard showed was what the data contained, and trusted that what the data contained was what the world held. Neither was verified.</p><p>Ron Johnson, working from J.C. 
Penney&#8217;s historical transaction data, made an inference that was equally defensible at the association level and catastrophically wrong at the interventional level. The difference between the two failures is one of scale &#8212; one cost four days of misdirected strategy, the other cost $4.3 billion in annual revenue and tens of thousands of jobs. But the mechanism is the same. In both cases, an organization used a tool designed to answer one class of question to answer a different class of question, and the gap between what the tool could see and what the decision required was invisible precisely because the tool&#8217;s output looked authoritative.</p><p>The remainder of this book is about closing that gap. Not by discarding descriptive analytics &#8212; it is foundational and irreplaceable &#8212; but by building, above it, the causal and counterfactual machinery that allows an organization to know, with rigor and specificity, what it can change and what it will get if it tries.</p><div><hr></div><h2>Student Activities</h2><p><strong>Problem 1.1 &#8212; The Measurement Integrity Audit.</strong> A retail analytics team reports that its primary dashboard showed a 12 percent increase in monthly active buyers over the previous quarter. Three months later, the team discovers that a supplier reclassification had quietly changed which customer accounts were included in the &#8220;active&#8221; definition partway through the measurement window. The reclassification was undocumented. Using the frameworks introduced in this chapter, (a) classify this failure by type &#8212; is it more analogous to the WAU dashboard failure or the J.C. Penney inferential failure, and why? (b) Describe the observability properties a measurement system would need to have detected this failure at the moment of occurrence rather than three months later. 
(c) Identify at least one prior organizational decision that might have been made on the basis of the distorted data, and describe the reversibility problem that would face the team attempting to audit that decision retroactively.</p><p><strong>Problem 1.2 &#8212; The Loyalty Program Case.</strong> A retail analytics team reports a strong positive correlation (r = 0.72) between loyalty program membership and average order value. Leadership proposes expanding the loyalty program to all customer segments. Construct two plausible causal stories consistent with the observed correlation &#8212; one in which the program causes higher spending, and one in which the observed correlation reflects a pre-existing difference between customers who join and those who do not. For each story, identify the causal graph structure that would generate it and describe two specific pieces of data that would allow you to distinguish between them empirically. Express your answer using the P(Y | X) versus P(Y | do(X)) distinction. What does your analysis imply about the proposed budget decision?</p><p><strong>Problem 1.3 &#8212; The J.C. Penney Forensic.</strong> Research the J.C. Penney pricing strategy collapse of 2012&#8211;2013. Identify the specific inferential error that the decision rested on, using the vocabulary introduced in this chapter. Then construct the counterfactual: under what conditions &#8212; what data, what analytical method, what organizational process &#8212; might the company have detected the error before implementation? What would a second-stage diagnostic analysis have required that first-stage descriptive analytics could not provide? Your answer should distinguish between what additional data would have helped and what different analytical framework was required &#8212; these are not the same thing.</p><p><strong>Problem 1.4 &#8212; Stage Placement.</strong> Select an organization you are familiar with &#8212; a company, institution, or team. 
Based on the four-stage framework, diagnose where the organization currently sits. Identify the primary barrier (incentive structure, data architecture, or risk aversion) preventing it from advancing to the next stage. Propose one specific, implementable change &#8212; technical or organizational &#8212; that would address that barrier. Justify your recommendation with reference to the structural arguments in this chapter.</p><p><strong>Problem 1.5 &#8212; The Observable System (Design Challenge).</strong> You are the analytics lead at a media streaming company. Your team&#8217;s primary dashboard tracks Daily Active Users, content consumption hours, and subscriber churn. Redesign the measurement architecture to be observable in the sense described in this chapter: it should detect and surface failures in its own measurement process before those failures reach decision-makers. Specify the monitoring logic you would implement, the alert thresholds you would set, and the &#8220;data integrity&#8221; panel you would add alongside the existing metrics view. Describe what a healthy state looks like, and describe three distinct failure signatures your architecture would catch. For each failure signature, identify the organizational decision it would protect.</p><p><strong>Problem 1.6 &#8212; Open-Ended Design.</strong> The chapter argues that descriptive analytics is politically convenient in ways that causal analysis is not. Design an organizational incentive structure &#8212; including performance metrics, reporting cadences, and accountability mechanisms &#8212; that would create genuine institutional motivation to advance from the descriptive to the diagnostic stage. Your design should address the specific political barriers described in this chapter. Identify at least one unintended consequence your design might produce, and explain how you would mitigate it. 
As a check, evaluate your proposed structure against the four properties of a Living Model: which of those properties would your incentive structure make more or less achievable, and why?</p><div><hr></div><h2>Key Terms</h2><p><strong>Association.</strong> The first rung of Pearl&#8217;s Ladder of Causation. An association is a statistical relationship between two variables observable in data &#8212; a correlation, a conditional probability, a regression coefficient. Association answers the question <em>what tends to co-occur?</em> It does not answer <em>what would happen under deliberate intervention.</em></p><p><strong>Concept Drift.</strong> The gradual or abrupt invalidation of a predictive model&#8217;s learned relationships, caused by a shift in the underlying data-generating process. A model experiencing concept drift continues to produce outputs; those outputs no longer reflect reality. The model does not announce the drift.</p><p><strong>Confounding Variable (Confounder).</strong> A variable that influences both the apparent cause and the apparent effect in an observed association, creating a spurious or distorted correlation between them. In the loyalty program example, pre-existing spending propensity drives both membership and order value, generating a positive correlation that does not reflect a causal mechanism.</p><p><strong>Counterfactual.</strong> The third rung of Pearl&#8217;s Ladder. A counterfactual question asks: given that Y occurred under condition X, what would Y have been if X had been different? Counterfactual reasoning requires a structural causal model capable of reasoning about individual-level outcomes in worlds that did not occur.</p><p><strong>Do-Operator (do(&#183;)).</strong> A mathematical notation introduced by Judea Pearl to represent deliberate intervention: setting a variable to a specific value by action, as opposed to observing that it takes that value in data. 
P(Y | do(X = x)) is the probability of outcome Y if we intervene to set X to x &#8212; structurally different from P(Y | X = x), the probability of Y given that we observe X equal to x in the data.</p><p><strong>Interventional Distribution.</strong> The probability distribution over outcomes that results from a deliberate do-intervention. Estimating the interventional distribution from observational data &#8212; without a randomized experiment &#8212; requires causal inference methods. The observational distribution and the interventional distribution will differ whenever unmeasured confounders, selection effects, or mediators are present in the causal structure.</p><p><strong>Living Model.</strong> An analytical system with four defining properties: causal (structured around interventional effects, not correlations), counterfactual (capable of reasoning about outcomes in worlds that did not occur), continually updated (live connection between incoming data and model parameters), and treatment-oriented (output is a ranked list of interventions evaluated by expected causal effect). Distinguished from a dashboard, a predictive model, and a prescriptive system by the first two properties.</p><p><strong>Observational Distribution.</strong> The probability distribution over outcomes observable in historical data, without intervention. Denoted P(Y | X). All standard dashboards, regression models, and machine learning systems trained on historical data operate on the observational distribution. The J.C. Penney pricing decision was made from the observational distribution.</p><p><strong>Observability.</strong> The property of a measurement or analytical system that makes its own integrity visible as a first-class output &#8212; not just what is happening to the thing being measured, but what is happening to the measurement itself. 
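<p>A minimal illustration of the idea, with invented volumes and thresholds: publish a verdict on the measurement alongside the measurement itself, and treat an empty day as a failure rather than a zero.</p>

```python
from statistics import mean, stdev

def integrity_check(daily_counts, today, z_threshold=3.0, min_history=7):
    """Flag silent pipeline failures by monitoring the measurement itself.

    daily_counts: recent per-day record counts from the same pipeline stage.
    Returns (status, detail) to publish next to the metric, not behind it.
    """
    if len(daily_counts) < min_history:
        return "UNKNOWN", "insufficient history to certify this metric"
    if today == 0:
        return "FAILED", "no records arrived: absence is a signal, not a zero"
    mu, sigma = mean(daily_counts), stdev(daily_counts)
    z = (today - mu) / sigma if sigma else float("inf")
    if abs(z) > z_threshold:
        return "SUSPECT", f"volume {today} is {z:+.1f} sigma from recent baseline"
    return "HEALTHY", f"volume within {z_threshold} sigma of baseline"

history = [10_210, 9_980, 10_105, 10_330, 9_875, 10_040, 10_150]
print(integrity_check(history, today=6_400))   # SUSPECT: likely partial ingestion
```

<p>The design choice worth noticing is the "UNKNOWN" state: a metric the system cannot certify is reported as uncertified, rather than displayed with the same visual authority as a verified one.</p>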
An observable system detects failures in its own process and surfaces them before they reach decision-makers.</p><p><strong>Pearl&#8217;s Ladder of Causation.</strong> A three-rung hierarchy of causal reasoning introduced by Judea Pearl. Rung one: association (seeing &#8212; what does the data show?). Rung two: intervention (doing &#8212; what would happen if we acted?). Rung three: counterfactual (imagining &#8212; what would have happened if things had been different?). Each rung requires qualitatively more powerful analytical machinery than the one below it. No rung is reachable by accumulating more data at the rung below.</p><p><strong>Silent Failure.</strong> A failure mode in which an analytical system produces outputs that are indistinguishable from accurate reporting while measuring something systematically different from what users believe it is measuring. Defined by the absence of any error signal at the surface level. Both the WAU dashboard failure and J.C. Penney&#8217;s pricing inference are instances of the same underlying mode.</p><p><strong>Structural Causal Model (SCM).</strong> A mathematical object that encodes the mechanisms of a causal system as a set of structural equations, each expressing how one variable is determined by its direct causes plus an independent error term. SCMs are the formal foundation of counterfactual reasoning. 
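</p><p>The preceding entries lend themselves to a compact demonstration. The sketch below (a hypothetical model with assumed numbers, echoing the loyalty-program confounder described above, not an example from the text) simulates a small structural causal model in which the observational contrast P(Y | X) and the interventional contrast P(Y | do(X)) disagree, exactly as the Do-Operator and Interventional Distribution entries predict:</p>

```python
import random

random.seed(0)

# Hypothetical SCM (assumed numbers, for illustration only):
#   U ~ Bernoulli(0.5)             latent spending propensity (the confounder)
#   X := U                         loyalty membership, driven by propensity
#   Y := 10 + 20*U + 5*X + noise   order value, driven by propensity AND membership

def sample(do_x=None):
    u = random.random() < 0.5
    x = u if do_x is None else do_x             # do(X=x) severs the U -> X edge
    y = 10 + 20 * u + 5 * x + random.gauss(0, 1)
    return u, x, y

N = 100_000
obs = [sample() for _ in range(N)]

# Observational contrast E[Y | X=1] - E[Y | X=0]: the confounder rides along.
y1 = [y for _, x, y in obs if x]
y0 = [y for _, x, y in obs if not x]
obs_diff = sum(y1) / len(y1) - sum(y0) / len(y0)

# Interventional contrast E[Y | do(X=1)] - E[Y | do(X=0)]: membership alone.
do1 = sum(sample(do_x=True)[2] for _ in range(N)) / N
do0 = sum(sample(do_x=False)[2] for _ in range(N)) / N
int_diff = do1 - do0

print(f"observational difference:  {obs_diff:.1f}")   # near 25 (5 causal + 20 confounded)
print(f"interventional difference: {int_diff:.1f}")   # near 5 (the causal effect)

# Rung three, abduction-action-prediction: a unit observed with X=1, Y=35.2
# implies U = (35.2 - 10 - 5) / 20, about 1 (abduction); under do(X=0) the
# same mechanism predicts Y near 10 + 20 = 30 for that unit (action, prediction).
```

<p>The closing comment sketches the third rung: abduction recovers the latent propensity for one observed unit, action sets X by intervention, and the structural equations then predict the outcome in a world that did not occur.</p><p>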
They encode not just the correlational structure of a system but the mechanism &#8212; the how and why &#8212; that would produce a specific outcome under any hypothetical intervention.</p>]]></content:encoded></item><item><title><![CDATA[Living Models - Table of Contents]]></title><description><![CDATA[Causal Intelligence for the Decisions That Actually Matter]]></description><link>https://www.hypothetical.ai/p/living-models-table-of-contents</link><guid isPermaLink="false">https://www.hypothetical.ai/p/living-models-table-of-contents</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Mon, 16 Mar 2026 18:17:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2dRK!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9bade22-3468-4b85-a9ce-3eff9a85b005_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Living Models framework didn&#8217;t start as a book proposal or a research agenda. It started with an email.</p><p>Someone reached out with a concrete problem: how do you build intelligence that actually keeps pace with decisions that have to be made in fast-moving environments? The existing options &#8212; legacy analyst reports, dashboards, predictive models &#8212; all share the same fundamental flaw. They&#8217;re built to describe the past or extrapolate from it. They have no mechanism for reasoning about what happens when you deliberately change something.</p><p>That email prompted a first-principles question: what would it actually mean to build something genuinely different? Not a faster dashboard. Not a more frequently updated report. Something categorically different in how it reasons.</p><p>The answer that emerged is what I&#8217;ve been calling a Living Model &#8212; causal, counterfactual, continually updated, and organized around actionable interventions rather than descriptions. 
The core idea is that the gap between what analytics systems are built to do and what strategic intelligence actually requires is large, consequential, and now closeable. Closing it requires moving up Pearl&#8217;s ladder from association to intervention &#8212; and that&#8217;s not an incremental improvement. It&#8217;s a different kind of reasoning entirely.</p><p>I write about nascent ideas in public because that&#8217;s how I make them better. Private thinking produces private blind spots. Writing forces precision. Readers push back. The argument gets stronger or it gets abandoned. The hypothetical.ai Substack exists for exactly that reason &#8212; to work the math and the architecture out loud before anything gets built.</p><p>The data was never the problem. It was always the question.</p><h1>Living Models</h1><h2>Causal Intelligence for the Decisions That Actually Matter</h2><p><strong>Nik Bear Brown</strong></p><p><em>The data was never the problem. It was always the question.</em></p><div><hr></div><h2>Preface</h2><p><strong>The Monday Morning Meeting</strong> Why dashboards are reliable for understanding the past &#8212; what happened, what was spent, what was sold &#8212; but unreliable for understanding what is going to happen, particularly as lifecycles rapidly compress. Using dashboards to navigate the road ahead is driving via the rear-view mirror. What this book is, who it is for, and what it asks of the reader. A note on the relationship between theory and practice &#8212; and why this book refuses to separate them.</p><div><hr></div><h2>Part One: The Problem</h2><p><em>Why three decades of analytics have produced sophisticated hindsight and very little foresight</em></p><p><strong>Chapter 1: The Dashboard That Lied</strong> The anatomy of a descriptive analytics failure. What correlation can tell you, what it cannot, and the organizational cost of not knowing the difference. 
The four rungs of analytics maturity &#8212; and why most enterprises are stuck on rung one.</p><p><strong>Chapter 2: The Map That Doesn&#8217;t Move</strong> Why predictive models fail under intervention. The observational distribution versus the interventional distribution. What happens when strategy &#8212; the deliberate act of making the future different from the past &#8212; meets a model trained on the past. The Innovator&#8217;s Dilemma as a failure of organizational incentives and time horizons, not merely a strategic one: executives often see the disruptive threat but are incentivized to downplay it, protecting the status quo while the long-term horizon &#8212; itself compressing &#8212; goes unmodeled.</p><p><strong>Chapter 3: What We Mean When We Say &#8220;Real-Time&#8221;</strong> The abuse of &#8220;real-time&#8221; in enterprise software marketing. The difference between data latency and model latency. Why <em>continually updated</em> is the honest term. What it actually requires to keep a model current &#8212; technically, organizationally, and epistemically.</p><p><strong>Chapter 4: Risk Is Two Numbers, Not One</strong> The collapse of probability and impact into a single risk score &#8212; and what it costs. Expected Value of Intervention as the foundational metric. Why a five percent chance of catastrophe and a fifty percent chance of inconvenience require different decisions. How almost every risk framework in commercial use gets this wrong.</p><div><hr></div><h2>Part Two: The Theory</h2><p><em>The mathematical foundations of causal intelligence &#8212; made readable</em></p><p><strong>Chapter 5: Pearl&#8217;s Ladder</strong> Judea Pearl and the three levels of causal reasoning: association, intervention, counterfactual. The do-operator and what it represents. Why the move from observational to interventional reasoning is not an incremental improvement but a categorical one. 
A working guide to the Ladder for non-mathematicians.</p><p><strong>Chapter 6: Graphs That Think</strong> Directed Acyclic Graphs as maps of mechanism. Structural Causal Models and what they encode. The difference between a regression equation and a causal graph. How the same data can be consistent with multiple causal structures &#8212; and why this is a mathematical result, not an algorithmic limitation.</p><p><strong>Chapter 7: The Equivalence Problem</strong> Markov equivalence classes and why data alone cannot orient every edge in a causal graph. The Completed Partially Directed Acyclic Graph. What this means for automated causal discovery &#8212; and why human domain knowledge is not a convenience but a mathematical necessity. How to resolve equivalence through interventional reasoning rather than more data.</p><p><strong>Chapter 8: Estimating Effects</strong> From graph structure to quantified causal effects. The backdoor criterion and why standard regression estimates are frequently biased. Double Machine Learning and what it corrects for. Susan Athey&#8217;s causal forests and the estimation of heterogeneous treatment effects. Why the average effect is rarely the decision-relevant fact.</p><p><strong>Chapter 9: The Counterfactual</strong> Pearl&#8217;s abduction-action-prediction procedure. Pre-factual simulation versus retrospective counterfactual analysis. How to reason about a world that never happened. The individual-level counterfactual and why it is the hardest and most valuable form of causal reasoning. Clinical precedents and their organizational analogs.</p><p><strong>Chapter 10: Confounders, Colliders, and the Limits of Observational Data</strong> The unconfoundedness assumption and when it fails. Latent confounders &#8212; the competitor&#8217;s internal meeting, the macro-sentiment shift, the organizational change that preceded the attrition spike. Collider bias and Berkson&#8217;s paradox. 
What sensitivity analysis can tell you about how wrong your model might be. The honest account of what causal inference cannot do.</p><p><strong>Chapter 11: Treatments</strong> Randomized controlled trials and why they remain the gold standard. Esther Duflo&#8217;s experimental design at scale. The translation of clinical &#8220;treatment&#8221; into organizational &#8220;intervention.&#8221; Stable Unit Treatment Value Assumption and when network interference breaks it. Spillover effects, herd immunity, and the social systems that violate SUTVA by design.</p><p><strong>Chapter 12: The Plumber&#8217;s Objection</strong> Duflo&#8217;s &#8220;Economist as Plumber&#8221; and what it means for causal AI. The distance between a correct causal estimate and an effective organizational change &#8212; and what fills it. Why models provide very little guidance on which implementation details will matter. The Ne-FMS case and what fixing the plumbing actually looks like.</p><div><hr></div><h2>Part Three: The Architecture</h2><p><em>How to build systems that actually use this theory</em></p><p><strong>Chapter 13: The Living Model Defined</strong> The four properties that define a Living Model: causal, counterfactual, continually updated, treatment-oriented. How these properties distinguish a Living Model from a dashboard, a predictive model, a digital twin, and an ontological system. The analytics maturity table revisited. What &#8220;orchestrated outcomes&#8221; actually means.</p><p><strong>Chapter 14: The Expert in the Room</strong> Why causal graphs cannot be built from data alone &#8212; and why this is a mathematical result, not a limitation of current tools. The knowledge bottleneck: the gap between what a domain expert knows implicitly and what a causal model requires explicitly. 
The specific discipline that has been working on this problem for two decades is Knowledge Engineering with Bayesian Networks &#8212; a field with rigorous methods for structured elicitation, conditional probability assessment, and graph construction from expert judgment. It has produced validated protocols used in medical diagnosis, military intelligence, and environmental risk assessment. It has not reached the corporate boardroom. This chapter explains what the field knows &#8212; variable identification, edge elicitation, consistency checking, confidence calibration &#8212; and why the gap between research and practice persists: the methods are slow, require trained facilitators, and produce outputs that don&#8217;t fit neatly into existing strategy or analytics workflows. The Living Model architecture in Part Three is, in part, an answer to that gap.</p><p><strong>Chapter 15: How Experts Get Causation Wrong</strong> The systematic cognitive biases in expert causal reasoning: collider blindness, feedback loop simplification, domain-matching heuristics. Berkson&#8217;s bias as the canonical illustration. Why &#8220;more covariates&#8221; is not always better. What the research says about when to trust expert causal judgment and when to interrogate it.</p><p><strong>Chapter 16: The Machine That Interviews the Expert</strong> The model for this chapter already exists &#8212; not in causal AI research, but in brand strategy. LLM-guided interview systems like the Nina framework demonstrate that a well-designed system prompt can do what a skilled human interviewer does: ask one question at a time, refuse to proceed until the answer is sufficient, hold prior answers in context, surface contradictions, and progressively build a structured output from unstructured expert knowledge. Nina does this for brand identity &#8212; moving from intake through archetype to creative brief through a disciplined sequence that cannot be shortcut. 
The causal elicitation system described in this chapter applies the same architecture to a harder problem: building a first-pass causal graph from a domain expert who has never heard of a DAG.</p><p>The chapter covers the full architecture: variable confirmation (what are the things that matter in this system, and are they measurable?), edge elicitation (does X cause Y, or does Y cause X, or does something else cause both?), interventional disambiguation (if we changed X deliberately, what would happen to Y &#8212; and is that different from what happens when X changes on its own?), and confidence calibration (how certain are you, and what would change your mind?). Each stage maps directly to a phase in the Nina intake sequence &#8212; the chapter makes this analogy explicit, using it to show executives a system they can already picture before introducing the causal machinery underneath.</p><p>The minimum viable interview: forty-five minutes to a first-pass causal graph, suitable for handoff to automated discovery algorithms. Multi-agent design and how different reasoning modes &#8212; one agent eliciting, one checking consistency, one flagging equivalence ambiguities &#8212; divide the work that a single interviewer cannot reliably do alone. What CausalChat-class implementations have demonstrated in practice, where they stop short, and what the Nina parallel reveals about the design principles that make the difference between an interview that extracts knowledge and one that merely confirms what the expert already planned to say.</p><p><strong>Chapter 17: Resolving the Graph</strong> From expert-provided skeleton to fully oriented DAG. How automated discovery algorithms &#8212; PC, GES, NOTEARS, FCI &#8212; refine expert-provided structure. The CPDAG handoff and what it requires from the expert interview output. When to run more data collection and when to run more expert sessions. 
The validated graph as a living artifact, not a finished product.</p><p><strong>Chapter 18: From Graph to Decision</strong> Parameterizing the graph &#8212; estimating conditional distributions from data once structure is fixed. Running the counterfactual: abduction, action, prediction in a business context. Ranking interventions by Expected Value. The constrained knapsack &#8212; translating ranked interventions into portfolio decisions under resource constraints. What the output looks like to a strategy executive.</p><p><strong>Chapter 19: The Causal Brain Executive Report</strong> A Living Model produces one output that matters to an executive: a ranked recommendation with the evidence that supports it, the assumptions that could break it, and the counterfactual that justifies acting now rather than waiting. This chapter defines what that report contains and why each element is there &#8212; not to explain the model, but to make the recommendation auditable by someone who will never see the model at all.</p><p>The report structure follows a deliberate logic. The recommendation comes first &#8212; specific, ranked, owned. Not &#8220;consider these options&#8221; but &#8220;do this, because the model estimates this intervention produces the highest Expected Value under current conditions.&#8221; The evidence section follows: which causal variables drove the recommendation, how confident the model is in each edge, and where the graph is thin &#8212; the nodes where expert elicitation was the only data source and observational data has not yet confirmed the structure. The assumptions section names what would have to be true for the recommendation to be wrong &#8212; not as a hedge, but as an audit trail. 
And the counterfactual closes it: what the trajectory looks like if the recommendation is not taken, and at what point the next-best intervention becomes more valuable than the recommended one.</p><p>LLM narration is the mechanism that produces this report from model outputs a board cannot read directly. The chapter covers what that narration must do &#8212; translate intervention rankings into plain-language recommendations, surface confidence levels without false precision, flag structural uncertainty without undermining the recommendation &#8212; and what it must never do: explain the model, show the graph, or present options as equally valid when the model has ranked them.</p><p>Visualization serves the evidence layer, not the recommendation. The chapter covers the specific tradeoff every causal visualization faces: a full causal graph is structurally honest but illegible to most executives, and most simplifications that make it legible destroy the structural honesty. The resolution is not a better visualization of the graph &#8212; it is a visualization of the decision, with the graph available as an appendix for those who want to interrogate it.</p><p>The chapter closes with the accountability question this project raises directly: if the model recommends and the executive decides, where does responsibility sit when the decision is wrong? The answer is not in the model. It is in the report &#8212; the auditable record of what the model said, what evidence supported it, and what the decision-maker chose to do with it.</p><p><strong>Chapter 20: Keeping the Model Alive</strong> Bayesian updating of edge parameters as new data arrives. Structural change detection &#8212; when a shift in the data requires revisiting the graph, not just the parameters. Model drift in causal systems and how to detect it. DecisionOps versus MLOps &#8212; tracking decision ROI, not model accuracy. 
The minimum viable feedback loop for an organization without a data science team.</p><div><hr></div><h2>Part Four: The Frameworks</h2><p><em>Christensen, Damodaran, and the theory-guided Living Model</em></p><p>[CS: These are two of the better-known business academics with established frameworks. Christensen is widely recognized; the Christensen Institute is a potential collaborator. Damodaran is a finance professor (not strategy) known personally from NYU &#8212; generous with his IP, potentially open to working with us. These two are a starting point, not the limit. Other academic frameworks should be incorporated, including from NEU faculty. To be discussed.]</p><p><strong>Chapter 21: Frameworks Are Not Models</strong> Why Christensen and Damodaran are inputs to causal models, not causal models themselves. The correct role of theoretical frameworks: feature engineering, Bayesian priors, anomaly flags, DAG scaffolding. What &#8220;theory-guided AI&#8221; means in practice &#8212; and why it is categorically different from either pure data-driven discovery or pure framework application. A preview of the structure that follows: for each framework, the book moves in three steps &#8212; what the framework actually argues, where its causal structure is hidden or incomplete, and what a Living Model built on that foundation can do that the framework alone cannot.</p><h3>Section A: Christensen</h3><p><strong>Chapter 22: What Christensen Actually Argued</strong> The precise claim of disruption theory &#8212; not the popular misreading of it. Low-end and non-consumption entry. Why the entrant&#8217;s inferiority is a structural feature, not a weakness. The performance trajectory dynamic and why it is nonlinear. The rational trap: why the incumbent&#8217;s best response accelerates its own displacement. What the framework explains well, what it explains poorly, and the three things it cannot tell you at all &#8212; timing, mechanism, and counterfactual. 
Why most companies that have &#8220;read Christensen&#8221; still get disrupted: the gap between pattern recognition and causal understanding.</p><p><strong>Chapter 23: The Disruptive Innovation DAG</strong> Building a causal graph from Christensen&#8217;s disruption theory. The structural equations behind low-end entry, performance trajectory, and mainstream displacement. Upstream drivers, midstream mediators, downstream outcomes &#8212; and the feedback loop that makes disruption self-reinforcing once started. What real-time signals &#8212; competitor pricing, job postings, user reviews, patent filings &#8212; look like as inputs to a disruption-theoretic causal model. The signal integration problem: why each signal is ambiguous alone and how the DAG resolves the ambiguity. Where Christensen&#8217;s framework breaks &#8212; the incumbent response problem, the platform shift confounder, the hindsight bias embedded in the theory&#8217;s most famous cases.</p><p><strong>Chapter 24: The Disruption Audit &#8212; A Case Study</strong> A single incumbent/entrant pair carried from first signal to displacement confirmation. What the leading indicators showed, when they showed it, and what the incumbent said at the same moment. The DAG populated with real data. Intervention ranking at three points in the timeline: when the model would have recommended a separate unit, when it would have recommended acquisition, when it would have recommended managed retreat. The counterfactual: what would the trajectory have looked like under the highest-ranked intervention at the earliest detection point. What this case demonstrates that Christensen&#8217;s framework, applied conventionally, cannot.</p><h3>Section B: Damodaran</h3><p><strong>Chapter 25: What Damodaran Actually Argued</strong> Damodaran is a finance professor, not a strategy theorist &#8212; and that distinction matters for how his frameworks should be used. 
His corporate life cycle framework: the five stages, the financial signatures of each, the logic of value creation when returns exceed cost of capital and destruction when they don&#8217;t. His equity risk premium work: why historical premiums are backward-looking and biased, what the implied ERP measures instead, and what it still gets wrong &#8212; risk as a single number, the efficiency assumption doing load-bearing work in the background, the collapse of probability and impact that Chapter 4 identified as the foundational error. What Damodaran gives the practitioner that almost no one else does: rigorous, freely available, annually updated tools. What those tools cannot do: model the mechanism before the market sees it.</p><p><strong>Chapter 26: The Life Cycle as Causal Structure</strong> Damodaran&#8217;s corporate life cycle encoded as a DAG. The financial signatures of each stage &#8212; reinvestment rate, margin trajectory, capital allocation &#8212; as candidate causal variables. The transition problem: why late Mature Growth and early Mature Stable are nearly identical in levels but structurally different in second derivatives. The slow edges hardest to detect: organizational capability atrophy, the narrowing ROIC/WACC spread, the reinvestment rate inflection. How a Living Model monitors the transition from Mature Growth to Mature Stable. Predictive intervention before the decline phase begins. The ERP connection: why the implied equity risk premium will not move until the market sees what the causal model already shows.</p><p><strong>Chapter 27: The Decline Inflection &#8212; A Case Study</strong> A single public company carried through the Mature Growth to Decline transition. What the dashboard showed at the transition point &#8212; and why a reasonable analyst would have seen nothing alarming. 
What the causal signals showed: the second derivatives, the organizational atrophy indicators, the management narrative gap between stated strategy and actual capital allocation. The DAG populated with eight quarters of financial trajectory data. The counterfactual: what the model would have recommended at the highest-leverage decision point, and what the trajectory would have looked like under that intervention. The lag between earliest causal signal and market price response &#8212; and what that lag means for the implied ERP as a forward-looking tool.</p><h3>Section C: The Collision</h3><p><strong>Chapter 28: The Collision Model</strong> The SaaS Margin Collision as a worked example of what happens when Christensen&#8217;s disruption dynamic and Damodaran&#8217;s life cycle transition operate simultaneously. Compute cost, labor elasticity, and pricing architecture as a three-variable causal system. Non-linear ripple effects versus additive forecasting. The Life Cycle Compression Index as a Living Model output. What the model tells a CEO that a static forecast cannot &#8212; and what it tells a board that neither Christensen nor Damodaran, applied separately, would have surfaced.</p><div><hr></div><h2>Part Five: The Cases</h2><p><em>Living Models applied to real strategic decisions</em></p><p><strong>Chapter 29: The Pricing Reset</strong> A seat-based SaaS company facing the agent disruption scenario. The causal model of pricing architecture, customer retention, and competitive entry. Intervention ranking under a resource constraint. The counterfactual: what would revenue look like if the pricing model had shifted two years earlier?</p><p><strong>Chapter 30: The Supply Chain That Broke</strong> A manufacturing firm, a tariff shock, and a causal model built on publicly available data. SUTVA violations in supplier networks. How the Living Model would have ranked contingency interventions before the disruption. 
What actually happened and what the counterfactual suggests.</p><div><hr></div><h2>Part Six: The Frontier</h2><p><em>What Living Models cannot yet do &#8212; and what comes next</em></p><p><strong>Chapter 31: The Latent Confounder Problem</strong> The hardest unsolved problem in applied causal inference. What sensitivity analysis can and cannot tell you. Methods for partial identification under unmeasured confounding. The honest account of where Living Models fail and why that failure is informative rather than disqualifying.</p><p><strong>Chapter 32: Networks and Interference</strong> SUTVA violations in social and market systems. Bipartite experiments, network unconfoundedness, and spillover effect estimation. What Living Models look like in markets where every unit&#8217;s outcome depends on every other unit&#8217;s treatment. The herd immunity analogy and its organizational equivalents.</p><p><strong>Chapter 33: The Agentic Living Model</strong> From decision support to autonomous decision execution. The architecture of systems that not only recommend interventions but implement them. The governance problem &#8212; what &#8220;Urban Reasonableness&#8221; means for autonomous causal agents. The EU AI Act and what regulatory compliance requires from Living Model architecture.</p><p><strong>Chapter 34: Causal Digital Twins</strong> The difference between Palantir&#8217;s ontological modeling and genuine causal simulation. What a Causal Digital Twin actually is &#8212; SCMs with real-time sensor fusion, automated discovery, and counterfactual generation at scale. 
Where this technology stands, what it requires, and what it will make possible when it arrives.</p><div><hr></div><h2>Appendices</h2><p><strong>Appendix A: A Glossary of Causal Terms for Strategy Executives</strong> DAG, SCM, do-operator, CPDAG, CATE, ATE, backdoor criterion, collider, confounder, Markov equivalence, counterfactual, pre-factual, interventional distribution &#8212; defined in plain language with organizational examples.</p><p><strong>Appendix B: Pearl&#8217;s Do-Calculus &#8212; The Three Rules</strong> The mathematical foundation, made readable. What each rule permits, what each rule requires, and a worked example in a business context.</p><p><strong>Appendix C: The Minimum Viable Interview Protocol</strong> The forty-five-minute structured elicitation session &#8212; question by question. What each question is designed to extract. How to handle an expert who collapses levels. The output format for handoff to automated discovery.</p><p><strong>Appendix D: Software and Tools</strong> Current state of the causal AI ecosystem: causaLens/decisionOS, DoWhy, EconML, CausalML, NOTEARS, CausalNex, Bayesia, Netica. What each does well, what each requires, where each stops short. A practical guide for the organization starting to build.</p><p><strong>Appendix E: The Living Model Reading List</strong> Pearl&#8217;s <em>The Book of Why</em> and <em>Causality</em>. Athey and Wager on causal forests. Duflo&#8217;s <em>Poor Economics</em> and the Economist as Plumber lecture. Christensen&#8217;s <em>The Innovator&#8217;s Dilemma</em>. Damodaran&#8217;s <em>The Corporate Life Cycle</em>. The academic papers behind the commercial platforms. 
Annotated for the reader who wants to go deeper.</p><div><hr></div><p><em>hypothetical.ai &#8212; causal intelligence for executives who actually have to make decisions</em></p>]]></content:encoded></item><item><title><![CDATA[The Architecture of the Expert Interview]]></title><description><![CDATA[Building the Machine That Interviews the Expert]]></description><link>https://www.hypothetical.ai/p/the-architecture-of-the-expert-interview</link><guid isPermaLink="false">https://www.hypothetical.ai/p/the-architecture-of-the-expert-interview</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Mon, 16 Mar 2026 02:48:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HgPO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HgPO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HgPO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!HgPO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png 848w, 
https://substackcdn.com/image/fetch/$s_!HgPO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!HgPO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HgPO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1213209,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hypotheticalai.substack.com/i/191088599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HgPO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png 424w, 
https://substackcdn.com/image/fetch/$s_!HgPO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!HgPO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!HgPO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06255c1b-5aa2-4d12-9017-3c3ad7ee0768_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Part Three of the Living Model Series</em></p><p><em>Part One: <a href="https://open.substack.com/pub/hypotheticalai/p/the-causal-brain-living-models-and">The Causal Brain: Living Models and the End of Backward-Looking Analytics</a></em></p><div><hr></div><p>The interview begins before the first question is asked.</p><p>By the time a strategy executive sits down with the system &#8212; whether they call it a decision tool, a knowledge capture platform, or simply the thing their CIO installed last quarter &#8212; the machine has already read the literature, loaded the domain ontology, and constructed a skeleton of what the causal graph probably looks like. It has hypotheses. It has priors. It is waiting, not passively, but the way a prepared interviewer waits: with a plan, a set of probes for when the plan fails, and a practiced ability to follow an unexpected answer into territory the script did not anticipate.</p><p>This is what makes building such a system genuinely difficult. The problem is not conversational AI &#8212; that technology exists, is mature, and can maintain context across a long structured interview without losing the thread. The problem is that a causal elicitation session is not a conversation. It is a measurement. Every answer the expert gives must be converted into a structural constraint on a Directed Acyclic Graph &#8212; a formal mathematical object with strict requirements for consistency, acyclicity, and identifiability. The machine must translate between two vocabularies simultaneously: the expert&#8217;s language of mechanism and intuition, and the graph&#8217;s language of nodes, directed edges, and conditional independence statements.</p><p>No production system does this end-to-end today. 
That gap is what this piece is about.</p><div><hr></div><h2>What the Machine Must Actually Do</h2><p>Start with the architecture&#8217;s requirements, because the requirements reveal why the gap exists.</p><p>The system needs to conduct a structured interview that moves an expert up what Judea Pearl calls the Ladder of Causation &#8212; from association (what tends to co-occur) to intervention (what would change if we acted) to counterfactual (what would have happened if we had acted differently). These are not simply harder versions of the same question. They are logically distinct operations, and LLMs trained on text have a documented tendency to collapse them. Ask a model what would happen if we doubled the marketing budget, and it will give you a confident answer that is actually a probabilistic interpolation from training data &#8212; not a causal claim, despite sounding like one. A causal elicitation system must detect when this collapse has happened in the expert&#8217;s own reasoning, not just in the model&#8217;s output.</p><p>The system simultaneously needs to construct a valid DAG in real time, which means enforcing acyclicity &#8212; no loops &#8212; while the expert is still talking. Experts describe feedback loops constantly, because feedback loops are how systems actually work. &#8220;Customer satisfaction drives retention, which drives revenue, which drives our ability to invest in customer satisfaction&#8221; is not wrong as a description of a business. It is wrong as a DAG. The system must resolve this without telling the executive that their understanding of their own business is mathematically invalid.</p><p>And the system must do all of this while managing cognitive bias. The expert who has spent twenty years in a domain has strong prior beliefs, selective memory, and a systematic tendency to underweight alternatives they have already dismissed.
The machine must surface those alternatives without inducing defensiveness, detect collider traps before they propagate through the structure, and calibrate confidence intervals around claims that the expert will inevitably express in natural language rather than probability notation.</p><p>These requirements, taken together, describe a system that does not currently exist as a unified product. What exists are components &#8212; good components, in some cases excellent ones &#8212; that have not been assembled into a workflow designed for the person who actually holds the causal knowledge: the domain expert who is not a statistician.</p><div><hr></div><h2>The Conversational Layer: How the Interview Actually Works</h2><p>The closest implemented systems to what this architecture requires are CausalChat-class interfaces, where a human and an LLM collaborate to build a causal graph through structured dialogue. The interaction protocol is more constrained than it appears.</p><p>The system does not ask open-ended questions about causality. Open-ended questions produce open-ended answers, which are difficult to convert to structural constraints. Instead, the interview follows a sequence that maps to specific graph operations. Variable identification comes first: the system presents candidate nodes derived from domain literature and asks the expert to confirm, reject, or rename them. This is the phase most amenable to LLM assistance, because literature synthesis is something current models do well. The expert&#8217;s role here is curatorial &#8212; they are not generating the variable list from scratch, they are editing a draft.</p><p>Edge elicitation is where the interview becomes genuinely difficult. The system presents candidate relationships &#8212; &#8220;Does pricing pressure tend to precede customer churn, or does customer churn tend to precede pricing pressure?&#8221; &#8212; and the expert must orient the edge. 
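</p><p>The core graph operation behind edge elicitation is small enough to sketch directly. The following is a minimal illustration, not code from any production system: the node names are hypothetical, and the only formal requirement it enforces is acyclicity, checked with a depth-first search before an edge is accepted.</p>

```python
# Minimal sketch of edge orientation with live acyclicity enforcement.
# Node names are illustrative; no real product's API is implied.

def has_path(graph, start, goal):
    """Depth-first search: is `goal` reachable from `start`?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False

def orient_edge(graph, cause, effect):
    """Add cause -> effect unless it would close a cycle.

    Returns False when the expert's answer implies a feedback loop,
    which is the cue to fall back to a temporal probe instead."""
    if has_path(graph, effect, cause):
        return False  # effect already reaches cause: the edge would make a cycle
    graph.setdefault(cause, set()).add(effect)
    return True

graph = {}
assert orient_edge(graph, "pricing_pressure", "customer_churn")
assert orient_edge(graph, "customer_churn", "revenue")
# The executive's feedback loop is intercepted rather than silently accepted:
assert not orient_edge(graph, "revenue", "pricing_pressure")
```

<p>A rejected edge is not a dead end: it flags exactly the relationship that a temporal resolution question should unroll into a lagged, acyclic form, preserving the expert&#8217;s intuition about the loop.</p><p>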
The key design insight from existing implementations is that experts orient edges more reliably when presented with temporal language rather than causal language. &#8220;Which comes first?&#8221; is easier to answer than &#8220;Which causes which?&#8221; &#8212; and for the purposes of initial graph construction, temporal precedence is often a sufficient proxy for edge direction.</p><p>The level-detection problem &#8212; distinguishing when an expert is making an associational claim versus an interventional one &#8212; is addressed through explicit prompting patterns. The system maintains a running classification of each statement: is this an observation about correlation, a prediction about what would happen under an action, or a claim about a counterfactual world? When an expert shifts levels without flagging it, the system generates a clarifying probe. &#8220;You mentioned that higher prices tend to follow higher demand. Are you describing what you&#8217;ve observed in the data, or predicting what would happen if we actively raised prices?&#8221; This is not sophisticated by itself. The sophistication is in the architecture that makes this probe available at the right moment in a long conversation, without interrupting the expert&#8217;s flow every thirty seconds.</p><p>The multi-agent design that makes this tractable assigns different probing responsibilities to different reasoning modes. A temporal agent validates that proposed causes precede proposed effects. A physical plausibility agent checks that claimed mechanisms do not violate domain constraints. A dependence agent cross-references the emerging graph against available data to flag when a proposed edge is inconsistent with observed conditional independencies. These are not separate models in most implementations &#8212; they are system prompts that instantiate different reasoning personas within a single LLM call, evaluated against the same emerging graph.
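</p><p>The persona pattern can be sketched as a thin control loop over prompts. Everything below is illustrative: <code>ask_model</code> is a stub standing in for a real LLM call, and the prompt texts merely paraphrase the three probing responsibilities named above.</p>

```python
# Sketch: several reasoning personas, one underlying model, one shared edge.
# `ask_model` is a placeholder; a real system would issue one model call per persona.

PERSONAS = {
    "temporal": "Does the proposed cause plausibly precede the effect in time? Answer PASS or VETO.",
    "plausibility": "Is there a credible mechanism by which the cause could produce the effect? Answer PASS or VETO.",
    "dependence": "Is this edge consistent with the conditional independencies observed in the data? Answer PASS or VETO.",
}

def ask_model(persona_prompt, edge):
    # Stub so the control flow can run; always passes.
    return "PASS"

def review_edge(edge):
    """Collect one verdict per persona; the edge survives only if no persona vetoes."""
    verdicts = {name: ask_model(prompt, edge) for name, prompt in PERSONAS.items()}
    return all(v == "PASS" for v in verdicts.values()), verdicts

accepted, verdicts = review_edge(("pricing_pressure", "customer_churn"))
assert accepted and set(verdicts) == set(PERSONAS)
```

<p>The cost shows up in exactly this loop: every candidate edge multiplies the number of model calls by the number of personas, and disagreements between personas need an arbitration step this sketch omits.</p><p>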
The coordination overhead is real, and it is one of the reasons this class of system remains research-adjacent rather than production-ready.</p><div><hr></div><h2>The Markov Problem: What Data Cannot Settle</h2><p>Here is the problem that no amount of data can solve without expert input, and that most strategy executives have never heard of.</p><p>Multiple Directed Acyclic Graphs can encode exactly the same statistical relationships. If you observe that A and B are correlated, and B and C are correlated, and A and C turn out to be independent once you condition on B &#8212; these observations are consistent with A &#8594; B &#8594; C, with C &#8594; B &#8594; A, and with the fork A &#8592; B &#8594; C (a common cause in the third case). No statistical test on observational data distinguishes between them. The graphs are Markov equivalent: they make identical predictions about any observational dataset.</p><p>This is not an edge case. It is the general condition. For a graph with ten variables, the number of Markov-equivalent structures that are consistent with any given observational dataset can be in the thousands. Constraint-based and score-based discovery algorithms &#8212; PC, GES &#8212; return these equivalence classes as their output; even a method like NOTEARS, which returns a single DAG, cannot certify its orientations against the equivalent alternatives. The Completed Partially Directed Acyclic Graph, or CPDAG, represents this honestly: directed edges where the equivalence class agrees on direction, undirected edges everywhere else.</p><p>Resolving undirected edges requires expert knowledge. The elicitation system&#8217;s job, in this phase, is to identify which undirected edges matter most for the intended decision analysis, and then design questions that force orientation without requiring the expert to understand what a CPDAG is.</p><p>The question design here is specific. Markov equivalence is broken by interventional reasoning, not observational reasoning.
&#8220;If we held B constant through external intervention &#8212; locked the variable &#8212; would changes in A still be associated with changes in C?&#8221; This question, if the expert can answer it, narrows the candidate structures in the A-B-C subgraph. If A&#8217;s relationship to C disappears when B is locked, B screens it off, and the surviving candidates are structures like A &#8594; B &#8594; C or A &#8592; B &#8592; C. If it persists, the structure is different: some connection bypasses B. The expert does not need to know why this question discriminates among the candidates &#8212; they only need to be able to answer it.</p><p>The gap in current systems is that constructing these interventional probes automatically, from an arbitrary CPDAG, is an unsolved engineering problem. It requires the system to identify which undirected edges are strategically important, formulate an interventional question about them in domain language, and interpret the expert&#8217;s answer as a structural constraint. Pieces of this pipeline exist. The integrated version does not.</p><div><hr></div><h2>Bias Interception in Real Time</h2><p>Cognitive bias in expert elicitation is not a calibration problem. It is an architecture problem.</p><p>The ACH protocol &#8212; Analysis of Competing Hypotheses &#8212; is the most rigorous structured approach to preventing selection bias in human expert judgment. Its core insight is counterintuitive: instead of asking experts to build the case for their preferred hypothesis, ask them to identify evidence that would be inconsistent with each hypothesis on the table. Experts are systematically better at finding disconfirming evidence for alternatives than they are at generating alternatives in the first place. The matrix that results &#8212; hypotheses across the top, evidence down the side, consistency markers in the cells &#8212; is a discipline device.
It forces the expert to maintain multiple live explanations simultaneously.</p><p>An automated ACH layer in the elicitation system generates the initial hypothesis set from domain literature, ensures it is sufficiently saturated (covering the space of plausible explanations, not just the obvious ones), and tracks the consistency matrix as the interview proceeds. When the emerging graph begins to converge strongly on a single structure, the system actively generates probes for the neglected alternatives. This is bias interception, not bias correction &#8212; it operates before the expert has committed to a conclusion, not after.</p><p>Collider bias requires a different kind of interception. A collider is a node that has two incoming arrows: X &#8594; Z &#8592; Y. When X and Y are causally independent, they are statistically independent in an unselected sample. But when you condition on Z &#8212; when you stratify by Z&#8217;s value, or control for it in a regression, or select your data based on Z &#8212; X and Y become spuriously associated. This is not a subtle effect. It is a major source of published research errors, and it occurs routinely in business contexts: conditioning on customer retention to analyze the relationship between marketing spend and product quality, for example, creates a spurious negative correlation between two variables that may be genuinely independent.</p><p>The live backdoor criterion checker &#8212; which the system runs continuously as the expert elaborates the graph &#8212; identifies collider topologies as they form and flags them immediately. The prompt it generates does not say &#8220;you have a collider.&#8221; It says: &#8220;You mentioned controlling for customer retention in your analysis. If retention is influenced by both marketing spend and product quality, including it as a control variable might actually create a misleading relationship between those two inputs. 
Would you like me to show you what the analysis looks like without that control?&#8221; The expert corrects the structure. The bias is intercepted without the expert ever needing to understand d-separation.</p><p>Cycle detection operates on similar principles. When the expert&#8217;s narrative implies a feedback loop &#8212; and most executives&#8217; mental models of their businesses are feedback-loop-laden &#8212; the system does not tell them the loop is invalid. It asks a temporal resolution question: &#8220;In the relationship you&#8217;re describing between customer satisfaction and revenue &#8212; over a single quarter, which tends to move first?&#8221; Temporal precedence breaks the cycle into a lagged structure that can be represented as a DAG across time steps. The expert&#8217;s intuition is preserved. The graph&#8217;s formal requirements are satisfied.</p><div><hr></div><h2>Translating Language to Probability</h2><p>The minimum viable product of a knowledge elicitation session is a causal graph with probability distributions attached to the edges. Without the distributions, you have a qualitative map. With them, you have a model you can actually run.</p><p>The linguistic probability mapping problem is this: experts express uncertainty in natural language. &#8220;It&#8217;s likely that the relationship is positive.&#8221; &#8220;I&#8217;d be surprised if the effect were larger than twenty percent.&#8221; &#8220;I&#8217;m quite confident this matters, but I&#8217;m not sure of the direction.&#8221; Each of these statements contains a probability judgment, embedded in ordinary language, that the system needs to extract and represent formally.</p><p>LLMs are reasonably good at this mapping task. Studies using benchmarks like QUITE show that models like GPT-4 can convert uncertainty expressions to numerical ranges with reasonable alignment to population-level human agreement. &#8220;Likely&#8221; maps to roughly 65&#8211;80% probability. 
&#8220;Highly improbable&#8221; maps to below 10%. The variance is significant &#8212; different people mean different things by &#8220;likely&#8221; &#8212; and the system should therefore treat the LLM&#8217;s initial mapping as a prior that the calibration process refines, not as a final answer.</p><p>Calibration in short sessions is possible but requires deliberate design. The most accessible approach is reference class calibration: the system presents the expert with questions in their domain for which ground-truth outcomes are known, observes whether their expressed confidence levels align with actual accuracy, and applies a correction function to subsequent elicitation. This can be compressed into ten to fifteen meaningful calibration pairs. The resulting correction is crude by the standards of formal elicitation methods like SHELF, which was designed for multi-day workshops with trained facilitators. It is not crude relative to uncalibrated expert judgment, which is what organizations currently use for most strategic decisions.</p><p>The minimum viable prior &#8212; the question of how much quantification is actually necessary &#8212; is a design decision that the architecture should make explicit rather than leave implicit. For first-pass scenario modeling, edge signs (positive or negative) combined with rough magnitude buckets (weak, moderate, strong) are sufficient to generate useful sensitivity analyses. Full parametric distributions are necessary for formal decision analysis with explicit uncertainty bounds. 
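</p><p>The two mechanics just described, the phrase-to-range mapping and the reference-class correction, fit in a short sketch. The word-to-range table echoes the rough figures above, but the exact bounds and the additive correction rule are illustrative choices, not a published method.</p>

```python
# Sketch: linguistic probability mapping plus a crude calibration offset.
# Ranges and the correction rule are illustrative assumptions.

PRIOR_RANGES = {                # (low, high) probability bounds per phrase
    "almost certain":    (0.90, 0.99),
    "likely":            (0.65, 0.80),
    "toss-up":           (0.45, 0.55),
    "unlikely":          (0.10, 0.30),
    "highly improbable": (0.01, 0.10),
}

def calibration_offset(stated_confidences, outcomes):
    """Mean gap between realized accuracy and stated confidence over
    the 10-15 reference questions with known answers."""
    hit_rate = sum(outcomes) / len(outcomes)
    stated = sum(stated_confidences) / len(stated_confidences)
    return hit_rate - stated    # negative means the expert is overconfident

def map_phrase(phrase, offset):
    """Turn an uncertainty phrase into a corrected probability range."""
    low, high = PRIOR_RANGES[phrase]
    clamp = lambda p: min(max(p + offset, 0.0), 1.0)
    return clamp(low), clamp(high)

# An expert who said "80% sure" on four reference questions and got three right:
offset = calibration_offset([0.8, 0.8, 0.8, 0.8], [1, 1, 1, 0])
low, high = map_phrase("likely", offset)
assert offset < 0                    # overconfident, so ranges shift down
assert low < 0.65 and high < 0.80    # "likely" now means a bit less for this expert
```

<p>This is deliberately cruder than SHELF-style elicitation; its only claim is to be better than applying the population-level mapping to an uncalibrated individual.</p><p>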
The system should present the expert with this tradeoff clearly: here is what the model can do with what you&#8217;ve given it so far, here is what becomes possible if you provide more precise estimates.</p><div><hr></div><h2>The System That Doesn&#8217;t Exist Yet</h2><p>What the research confirms is the expected finding: no production system integrates this full workflow for strategy executives.</p><p>The closest attempts are instructive in their failures. Bayesia and Netica are mature Bayesian network builders with sophisticated inference engines and accessible interfaces &#8212; but their elicitation workflow is essentially a spreadsheet. The expert manually enters conditional probability tables. There is no conversational layer, no bias detection, no Markov equivalence resolution through natural language. They are tools for modelers who already understand Bayesian networks, not tools for eliciting knowledge from people who don&#8217;t.</p><p>CausalNex and DoWhy are Python libraries with strong algorithmic foundations &#8212; NOTEARS integration, do-calculus support, counterfactual reasoning &#8212; but zero user interface. They are infrastructure for data scientists building applications, not applications themselves.</p><p>The LLM-native tools &#8212; various GPT-4-based assistants configured for causal reasoning &#8212; have the conversational fluency and the literature synthesis capability, but they lack structural enforcement. They will help an expert think through a causal diagram in natural language, and they will produce a description of a graph. They will not enforce acyclicity, detect colliders, resolve Markov equivalence, or hand off a validated CPDAG to a downstream discovery algorithm. They are thinking partners, not elicitation machines.</p><p>The gap is not any single missing piece. 
It is the integration: the pipeline that takes a natural language conversation, enforces formal graph constraints in real time, manages cognitive bias without disrupting expert flow, and produces an output that a causal discovery algorithm can refine and a decision analysis engine can run. The components exist. The assembly has not happened.</p><div><hr></div><h2>The Minimum Viable Interview</h2><p>What is the shortest structured conversation that produces a causal graph sufficient for first-pass scenario modeling?</p><p>The answer, based on what implemented systems have demonstrated, is approximately forty-five minutes with a domain expert who has been briefed on the format. The structure is:</p><p>Variable confirmation (ten minutes): The system presents a draft variable list derived from domain literature. The expert confirms, rejects, and renames. No graph structure is discussed yet.</p><p>Edge elicitation (twenty minutes): The system presents candidate relationships in temporal language. The expert orients edges. The system flags cycles and resolves them through temporal probing. Collider topologies are flagged as they form.</p><p>Interventional disambiguation (ten minutes): The system identifies the highest-stakes undirected edges in the emerging CPDAG and presents interventional probes to orient them. Three to five questions, each targeted at a specific structural ambiguity.</p><p>Confidence calibration (five minutes): The system presents reference-class calibration questions, adjusts the expert&#8217;s probability mappings, and applies corrections to the distributions already assigned to edges.</p><p>The output is a partially directed graph with rough probability distributions on the oriented edges &#8212; not a publication-ready causal model, but a structure sufficient to run basic counterfactual scenarios and identify which additional data collection would most reduce uncertainty. 
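</p><p>One plausible shape for that hand-off, sketched as plain data. The schema and field names here are hypothetical; the point is that oriented edges carry only signs and magnitude buckets at this stage, while still-undirected edges travel forward as explicit open questions for discovery algorithms or follow-up probes.</p>

```python
# Hypothetical hand-off from a 45-minute session: a partially directed
# graph with rough edge annotations. Illustrative schema, not a real format.

session_output = {
    "oriented_edges": [
        # (cause, effect, sign, magnitude bucket)
        ("marketing_spend", "awareness", "+", "strong"),
        ("awareness", "pipeline", "+", "moderate"),
        ("pricing_pressure", "churn", "+", "weak"),
    ],
    "undirected_edges": [
        # left for interventional follow-up or a discovery algorithm
        ("churn", "support_load"),
    ],
}

def open_questions(output):
    """Edges that still need orientation before full counterfactual analysis."""
    return list(output["undirected_edges"])

assert open_questions(session_output) == [("churn", "support_load")]
```

<p>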
This is the living model&#8217;s first breath.</p><p>The interview is where the model is born. What comes next &#8212; the feedback loops, the data integration, the iterative refinement as outcomes arrive &#8212; is how it stays alive.</p><div><hr></div><p><em>The Living Model series continues in Part Four: From Graph to Decision &#8212; Running Counterfactuals Against Causal Structure</em></p><div><hr></div><p><strong>Tags:</strong> causal knowledge elicitation architecture, LLM-guided DAG construction, Markov equivalence resolution expert interview, Bayesian network prior elicitation, NOTEARS PC algorithm expert integration</p>]]></content:encoded></item><item><title><![CDATA[Can a Machine Interview an Expert?]]></title><description><![CDATA[Two decades of Knowledge Engineering research keep arriving at the same answer &#8212; and it has nothing to do with algorithms.]]></description><link>https://www.hypothetical.ai/p/can-a-machine-interview-an-expert</link><guid isPermaLink="false">https://www.hypothetical.ai/p/can-a-machine-interview-an-expert</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Sun, 15 Mar 2026 22:46:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nNVI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nNVI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!nNVI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png 424w, https://substackcdn.com/image/fetch/$s_!nNVI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png 848w, https://substackcdn.com/image/fetch/$s_!nNVI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png 1272w, https://substackcdn.com/image/fetch/$s_!nNVI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nNVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4514062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hypotheticalai.substack.com/i/191073087?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nNVI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png 424w, https://substackcdn.com/image/fetch/$s_!nNVI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png 848w, https://substackcdn.com/image/fetch/$s_!nNVI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png 1272w, https://substackcdn.com/image/fetch/$s_!nNVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce14b18-70c6-4470-93d3-9a956f5e2095_2912x1632.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>This is the second piece in the Living Model series. The first established why causal AI matters for organizational decision-making. This one examines the structural barrier that keeps it confined to data science teams.</em></p><div><hr></div><p>The gap is not where most people think it is.</p><p>When organizations fail to deploy causal AI at the executive level, the instinct is to diagnose a technical failure &#8212; the models are too complex, the data is too messy, the algorithms are too opaque. These are real problems. But they are downstream of a more fundamental one, and it is not technical at all.</p><p>It is conversational.</p><p>Building a Living Model &#8212; a Bayesian network or directed acyclic graph capable of supporting strategic intervention &#8212; requires a causal graph: a precise map of what causes what. That map does not live in any dataset. It lives in the mind of the person who knows the business, the clinic, or the engineering system. Getting it out requires asking that person the right questions, in the right order, while navigating the systematic biases that distort expert causal reasoning. This is the knowledge bottleneck. It is why causal AI remains siloed inside technical teams even when the people who need it most sit in boardrooms.</p><p>The field of Knowledge Engineering with Bayesian Networks has been working on this problem for two decades. 
The solution it keeps arriving at is structured conversation.</p><div><hr></div><h2>What the field actually knows about eliciting causal structure</h2><p>The modeler&#8217;s job is not to build the causal structure &#8212; it is to extract it from someone who already holds it implicitly, then formalize what they know.</p><p>That single reorientation explains everything that follows. Knowledge Engineering with Bayesian Networks (KEBN) is the discipline of doing exactly this: taking implicit expert knowledge and transforming it into formal probabilistic models. It exists because data-driven discovery fails in the domains where causal AI matters most &#8212; rare events, novel interventions, strategic decisions where there is no historical data for the action being contemplated.</p><p>The KEBN process is iterative. Expert knowledge is extracted in cycles, each pass producing a more refined model. The multi-phase architecture breaks into four stages: variable definition (identifying what the model contains), structure elicitation (building the DAG that encodes what causes what), likelihood estimation (assigning conditional probabilities to each causal relationship), and model review (validating the output against expert intuition through scenario testing).</p><p>Structure elicitation is the most critical phase. Methods like CausalNex&#8217;s &#8220;Structure Review&#8221; ask domain experts to validate learned edges by grouping variables into &#8220;themes&#8221; and checking whether the causal relationships between themes match their understanding of the system. When experts lack time for intensive sessions, online Delphi variants allow asynchronous, questionnaire-based elicitation.</p><p>Every one of these methods is sophisticated. None of them solves the problem of getting a CFO to sit still for a structured interview.</p><div><hr></div><h2>The systematic ways experts get causation wrong</h2><p>This would be straightforward if expert causal reasoning were reliable. 
It is not.</p><p>The research on expert cognitive biases in causal contexts documents a consistent pattern. Experts excel at certain tasks &#8212; they identify direct causal mechanisms quickly, generate plausible hypotheses, and draw on accumulated domain pattern recognition. What they consistently fail at is causal structure: specifically, the topological properties of their own mental models.</p><p>The most dangerous failure mode is collider bias. A collider is a variable caused by two other variables &#8212; the arrows &#8220;collide&#8221; at that node. Conditioning on a collider induces a spurious association between its two parent causes that does not exist in the underlying system.</p><p>Berkson&#8217;s bias is the clinical illustration: among hospitalized patients, obesity appears protective against certain conditions. Being hospitalized is the collider &#8212; it is jointly caused by the disease and by other risk factors. Among hospitalized patients, knowing that obesity is absent makes other causes more likely. The protective signal is an artifact of the study sample. In the general population, obesity is still a risk factor.</p><p>Experts miss this consistently. The reason is intuitive: human reasoning rewards &#8220;more is better&#8221; covariate selection. Controlling for hospital admission seems like scientific rigor. The collider trap is invisible until it reverses your sign.</p><p>Feedback loops are the second major failure zone. Complex business systems &#8212; customer retention, supply chain dynamics, pricing strategy &#8212; involve bidirectional influences where the effect cycles back to influence the cause. Experts simplify these into linear sequences. The practical consequence: interventions designed for a static model get neutralized by feedback mechanisms the model didn&#8217;t capture.</p><p>The domain-matching heuristic produces a third category of errors. 
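</p><p>Berkson&#8217;s scenario is easy to reproduce in simulation. In the toy model below (all rates invented for illustration), disease and obesity are independent by construction, and both raise the probability of hospitalization; conditioning on the collider manufactures a strong negative association between them:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two independent causes of hospitalization (all rates invented).
disease = rng.binomial(1, 0.10, n)
obesity = rng.binomial(1, 0.30, n)

# Hospitalization is the collider: jointly caused by both parents.
p_hosp = np.minimum(0.02 + 0.50 * disease + 0.30 * obesity, 1.0)
hospitalized = rng.binomial(1, p_hosp).astype(bool)

# Unconditioned, the two causes are independent: correlation near zero.
r_all = np.corrcoef(disease, obesity)[0, 1]

# Conditioned on the collider, a spurious negative association appears:
# among the hospitalized, absence of obesity makes disease more likely.
r_hosp = np.corrcoef(disease[hospitalized], obesity[hospitalized])[0, 1]

print(round(r_all, 3), round(r_hosp, 3))  # near zero, then clearly negative
```

<p>Nothing in the generating process links the two causes. The association exists only inside the conditioned sample, which is exactly why the trap survives expert scrutiny.</p><p>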
When experts lack specific mechanistic knowledge, they assume that cause and effect must come from the same domain &#8212; a mechanical failure has a mechanical cause, a financial outcome has a financial driver. This systematically blinds expert models to cross-domain influences, which is precisely where the most strategically interesting causal effects tend to live.</p><div><hr></div><h2>Why human domain knowledge is mathematically necessary</h2><p>This is the point that gets lost in discussions of automated causal discovery. Multiple distinct causal structures can be perfectly consistent with the same statistical data. This is not a limitation of current algorithms &#8212; it is a mathematical result.</p><p>Two DAGs belong to the same Markov equivalence class if they imply identical conditional independence relationships in the data. The chain X&#8594;Y&#8594;Z, the fork X&#8592;Y&#8594;Z, and the reverse chain X&#8592;Y&#8592;Z are all statistically indistinguishable from observational data alone. Without experimental intervention &#8212; without actually doing something and observing the result &#8212; the data cannot tell you which structure is correct.</p><p><strong>The expert who says &#8220;I know from operating this system for fifteen years that Y causes Z, not the other way around&#8221; is providing information that no dataset contains.</strong></p><p>Human domain knowledge is not a convenience that speeds up causal discovery. It is mathematically required to orient edges in the graph where data cannot. Functional Causal Models &#8212; methods like LiNGAM that assume specific distributional properties &#8212; can theoretically resolve some of these equivalences by exploiting non-Gaussianity. In practice, business data is sparse, noisy, and frequently contains hidden confounders that make these methods brittle.
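</p><p>The equivalence is easy to exhibit numerically. In the sketch below (coefficients invented for illustration, chosen so that the two structures share the same covariance matrix), a chain and a fork generate data with the same statistical signature: X and Z correlate in both, and the correlation vanishes once Y is held fixed, so no test run on observational samples can separate them:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def partial_corr(a, b, given):
    # Correlation of a and b after linearly regressing out a third variable.
    ra = a - np.polyfit(given, a, 1)[0] * given
    rb = b - np.polyfit(given, b, 1)[0] * given
    return corr(ra, rb)

# Chain: X causes Y causes Z.
x1 = rng.standard_normal(n)
y1 = 0.8 * x1 + 0.6 * rng.standard_normal(n)
z1 = 0.8 * y1 + 0.6 * rng.standard_normal(n)

# Fork: Y causes both X and Z.
y2 = rng.standard_normal(n)
x2 = 0.8 * y2 + 0.6 * rng.standard_normal(n)
z2 = 0.8 * y2 + 0.6 * rng.standard_normal(n)

# Same observational signature in both structures: X and Z correlate,
# and the correlation vanishes once Y is held fixed.
print(corr(x1, z1), partial_corr(x1, z1, y1))
print(corr(x2, z2), partial_corr(x2, z2, y2))
```

<p>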
The expert remains the final arbiter of structural directionality.</p><p>This is why the knowledge bottleneck is not a product development inconvenience. It is a mathematical constraint on what causal AI can do without structured expert input.</p><div><hr></div><h2>What adjacent fields have learned about structured elicitation</h2><p>Clinical medicine, intelligence analysis, and engineering risk assessment have each developed formal elicitation protocols because they face the same underlying problem: high-stakes decisions require beliefs to be made explicit and quantified, but the people who hold those beliefs are prone to the same cognitive biases that undermine all expert judgment.</p><p>The Sheffield Elicitation Framework (SHELF) and the IDEA protocol &#8212; Investigate, Discuss, Estimate, Aggregate &#8212; represent the clinical state of the art. Both require individual expert judgments before any group interaction, preventing anchoring effects where a single confident voice shapes everyone else&#8217;s estimates. Both include calibration phases where experts are trained in probabilistic reasoning before being asked to provide priors. The key disciplinary insight: collect individual beliefs first, then aggregate &#8212; never let group dynamics produce the prior.</p><p>The intelligence community&#8217;s Analysis of Competing Hypotheses (ACH) addresses confirmation bias directly. Rather than building the case for a favored hypothesis, ACH requires generating an exhaustive set of competing hypotheses and evaluating how each piece of evidence affects the likelihood of each. The discipline is disconfirmation: try to disprove your leading theory, not prove it. The diagnostic value of evidence &#8212; what makes one hypothesis more likely relative to others &#8212; is the coin of the realm.</p><p>Both protocols work for the same structural reason: they do not ask experts to be better reasoners. 
They change the procedure so that better reasoning emerges.</p><div><hr></div><h2>The anatomy of a causal interview</h2><p>A causal modeling interview is structurally different from a requirements gathering session. Requirements gathering asks: what do you need the system to do? Causal elicitation asks: why does the world behave as it does?</p><p>The operative framework is Pearl&#8217;s Ladder of Causation &#8212; a three-level taxonomy of causal reasoning that the interviewer uses to scaffold the expert&#8217;s mental model upward.</p><p>Level one is association: &#8220;What patterns have you observed between marketing spend and churn?&#8221; This establishes correlations &#8212; what co-moves with what. It is the level where most analytics operates and where the expert feels most comfortable.</p><p>Level two is intervention: &#8220;If we doubled pricing tomorrow, what would be the true incremental impact on demand?&#8221; This is the causal level &#8212; not what correlates, but what would happen if you actually did something. Answering well requires the expert to distinguish the effects of the action itself from the selection effects that made the action happen. It requires thinking about heterogeneous treatment effects: not just &#8220;what happens&#8221; but &#8220;for whom, under what conditions.&#8221;</p><p>Level three is counterfactual: &#8220;Given that we increased spend and sales declined, would the decline have been worse if we had not increased spend?&#8221; This is the hardest cognitive level. It requires holding the actual world and an imagined world simultaneously and comparing them. 
It is also the level that reveals the most about underlying causal structure.</p><p>The modeler&#8217;s formal procedure moves through three steps: abduction (ask the expert to account for the current state of the system), action (simulate an intervention in the expert&#8217;s mental model), and prediction (ask the expert to predict the new outcome while holding background conditions constant). The critical discipline is keeping these steps separate &#8212; experts will naturally collapse them, leaping from &#8220;we would do X&#8221; to &#8220;therefore Y&#8221; without tracing the mechanism.</p><p>The central question that separates causal elicitation from requirements gathering is this: &#8220;What factors genuinely create sustainable advantage &#8212; rather than merely predict it?&#8221; This shifts the conversation from symptomatic manifestations to underlying generative mechanisms.</p><div><hr></div><h2>What happens when LLMs try to help</h2><p>Recent research on LLM-assisted causal graph construction has produced a precise picture of where the technology helps and where it fails &#8212; a reliability gap that maps almost exactly onto the distinction between semantic pattern matching and genuine structural reasoning.</p><p>On tasks where node metadata is available &#8212; where variables have names and descriptions that convey their domain and relationships &#8212; LLMs perform well. The Causal-LLM framework showed that LLMs outperform symbolic graph learning methods by 40% in edge accuracy on medical datasets with clear semantic content. They are particularly good at capturing global dependencies and avoiding the spurious cycles that pairwise iterative methods introduce.</p><p>On tasks requiring actual causal reasoning from text, the picture is different. The ReCITE benchmark &#8212; which requires extracting causal graphs from lengthy academic papers with implicit relationships &#8212; returns F1 scores of 0.535 even from the best available models. 
Accuracy drops sharply as relationships become less explicit and the network grows more complex.</p><p>The deeper failure is what researchers have called the &#8220;causal parrot&#8221; effect. LLMs learn associations between concepts from training data &#8212; smoking and cancer, interest rates and inflation, marketing spend and revenue &#8212; and reproduce these associations fluently. When tested on &#8220;pure reasoning&#8221; tasks like Corr2Cause, where the model must determine whether causation is validly inferred from a given correlation, performance approaches chance. The memorized associations are doing almost all the work.</p><p>This is not a knock on LLMs for causal work &#8212; it is a job description. They surface plausible structures. The expert adjudicates them. The emerging CausalChat framework makes this division of labor explicit: the LLM generates candidate causal relationships, the human expert evaluates and selects. This human-LLM collaborative workflow is more effective than either working alone precisely because it allocates tasks to where each excels.</p><div><hr></div><h2>Turning &#8220;very likely&#8221; into a probability</h2><p>The goal of prior quantification is not precision. It&#8217;s honesty.</p><p>Even after the expert has provided a causal structure &#8212; a DAG that both expert and modeler believe reflects the underlying system &#8212; the Bayesian network requires numerical conditional probability distributions at every node. The expert who says &#8220;customer satisfaction strongly affects retention&#8221; must eventually produce a number.</p><p>The solution is linguistic probability theory. Words of Estimative Probability &#8212; &#8220;highly likely,&#8221; &#8220;probable,&#8221; &#8220;unlikely,&#8221; &#8220;remote&#8221; &#8212; can be mapped to fuzzy membership functions through calibration procedures. 
The trapezoidal functions are defined by empirically sampling from relevant populations: fifty construction site managers produce the thresholds between &#8220;low,&#8221; &#8220;medium,&#8221; and &#8220;high&#8221; risk; medical expert panels produce the translation from &#8220;clinically significant&#8221; to a probability range. The critical constraint is that the membership functions must satisfy Kolmogorov&#8217;s axioms &#8212; they must partition unity and preserve continuity at transition points.</p><p>When experts cannot provide percentage estimates at all, modelers use Laplace&#8217;s rule of succession and reference class reasoning. The approach asks the expert to estimate, based on everything they knew before observing outcomes: how easy did they believe this problem to be? Given subsequent observations, a beta distribution over the true first-trial probability updates. The prior is the expert&#8217;s initial assessment; the posterior is what the evidence has revised it toward.</p><p>Neither approach produces perfect priors. Both produce honest ones &#8212; distributions that reflect what the expert actually believes, quantified in a way that can be updated as evidence accumulates. The alternative is the false precision of a number someone agreed on in a conference room.</p><div><hr></div><h2>The architecture that follows</h2><p>The gap between causal theory and organizational practice is not primarily an algorithm problem or a data problem. It is an agency problem. Strategy executives possess causal intent &#8212; they know what matters, how the business works, what interventions have been tried and why they succeeded or failed. Data science teams possess causal implementation &#8212; they can build the models, run the inference, validate the structure. The bottleneck is the conversation between them.</p><p>The machine that interviews the expert is the natural architecture for this bridge. 
Not a general-purpose chatbot, and not a causal inference algorithm, but a system designed specifically to guide non-technical domain experts through the Ladder of Causation &#8212; surfacing candidate structures, detecting the characteristic patterns of collider bias and feedback loop omission, translating linguistic probability expressions into quantitative priors, and resolving Markov equivalence through directed questioning.</p><p>The components exist. Structured elicitation protocols from SHELF and ACH. Conversational interfaces from CausalChat research. Linguistic probability mapping from fuzzy membership theory. What has not existed is a system that integrates them into a single coherent workflow designed for the strategy executive rather than the data scientist.</p><p>That is the architectural problem the third piece in this series will address directly.</p><div><hr></div><p><em>If you&#8217;re building a decision support system &#8212; or trying to get causal AI out of the data science team and into the boardroom &#8212; I&#8217;d like to hear what the bottleneck looks like from your side. 
Reply or comment below.</em></p><p><em>hypothetical.ai &#8212; causal intelligence for executives who actually have to make decisions.</em></p><p><strong>Tags:</strong> knowledge elicitation Bayesian networks, Markov equivalence causal inference, expert cognitive bias collider, causal interview protocol, LLM causal graph construction</p>]]></content:encoded></item><item><title><![CDATA[The Causal Brain: Living Models and the End of Backward-Looking Analytics]]></title><description><![CDATA[hypothetical.ai explores the creation of realistic real-time hypothetical scenarios]]></description><link>https://www.hypothetical.ai/p/the-causal-brain-living-models-and</link><guid isPermaLink="false">https://www.hypothetical.ai/p/the-causal-brain-living-models-and</guid><dc:creator><![CDATA[Hypothetical]]></dc:creator><pubDate>Sun, 15 Mar 2026 21:37:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MI9r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d76b63-c8d2-41e7-9aa2-ffca69e15d35_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><img src="https://substackcdn.com/image/fetch/$s_!MI9r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d76b63-c8d2-41e7-9aa2-ffca69e15d35_1456x816.png" width="1456" height="816" alt=""></p><p>There is a particular kind of organizational suffering that does not announce itself as suffering. It looks like a Monday morning meeting. Someone opens a dashboard. The numbers from last quarter are arrayed with precision &#8212; revenue by region, churn by segment, customer acquisition cost trending upward in a way that everyone in the room has already begun to explain without anyone having explained it yet. The meeting lasts ninety minutes. Three action items are logged. Nothing changes. Six weeks later, the same dashboard. The same meeting. The same explanation that is not an explanation.</p><p>This is the cost of operating inside a world your analytics system cannot actually see. The dashboard told you what happened. It did not tell you why. It certainly did not tell you what would happen if you did something about it. The data was accurate. The model was useless.</p><p>The Living Model is the attempt to build something that does not suffer from this problem &#8212; a decision support architecture defined by four properties that together represent a fundamental break from the analytics paradigm that has governed organizational intelligence for three decades. It is causal, meaning it maps structural cause-and-effect rather than correlation. It is counterfactual, meaning it can answer &#8220;what would have happened if&#8221; for scenarios that never occurred in historical data. It is real-time, meaning it continuously ingests live data streams and updates its outputs accordingly. And it is treatment-oriented, meaning it organizes itself around actionable interventions ranked by expected causal impact rather than passive prediction.</p><p>That last property is the one that tends to get lost in the marketing copy. Every enterprise software vendor in 2025 claims to offer &#8220;real-time AI insights.&#8221; Almost none of them mean what the Living Model means. The difference is the difference between a weather report and a climate simulator &#8212; between a system that tells you it is raining and a system that tells you what happens to the river if you open the dam.</p><div><hr></div><h2>What Correlation Cannot Do</h2><p>Judea Pearl&#8217;s &#8220;Ladder of Causation&#8221; provides the clearest map of the territory. At the first rung: association. This is where nearly all commercial AI currently lives. The system observes that X and Y tend to move together and tells you so. The observation is often useful.
It is never sufficient.</p><p>At the second rung: intervention. Here the system can answer not &#8220;how are X and Y related?&#8221; but &#8220;what happens to Y if I force X to a specific value?&#8221; This requires what Pearl calls the do-operator &#8212; a formal representation of deliberate manipulation &#8212; and it requires the system to have learned not just statistical patterns but the mechanisms that generate them.</p><p>At the third rung: counterfactuals. Here the system can answer &#8220;what would Y have been if X had been different, in this specific case, at this specific time, given everything that actually happened?&#8221; This is the level at which genuine strategic intelligence becomes possible, because it is the level at which you can evaluate decisions you did not make.</p><p>The failure of traditional predictive machine learning is not a failure of sophistication. A well-trained XGBoost model can be remarkably accurate on historical data. The failure is structural. When a company changes its pricing strategy, the historical data that trained the pricing model no longer describes the world the company now inhabits. The intervention changed the system. The model, trained on the pre-intervention world, is now a map of a country that has been reorganized. It does not know this. It keeps giving directions.</p><p>Statisticians call this the difference between the observational distribution and the interventional distribution. In plain language: the pattern you learned from watching the system is not the same as the pattern the system produces when you act on it. Prediction assumes the future resembles the past. Strategy is the act of making the future different from the past. 
These two activities require different tools.</p><div><hr></div><h2>The Architecture of the Living Model</h2><p>The technical implementation of a Living Model begins with a Directed Acyclic Graph &#8212; a DAG &#8212; which is a visual map of the system&#8217;s causal structure. Every node is a variable. Every arrow is a direct causal relationship. The resulting Structural Causal Model converts those arrows into mathematical functions: each variable is expressed as a function of its direct causes plus an exogenous noise term that captures everything unmeasured.</p><p>This architecture does something that a regression equation cannot do. It separates the question &#8220;what do we observe happening?&#8221; from the question &#8220;what happens when we act?&#8221; The graph encodes the mechanisms of the system, not just its correlations. When you ask the system &#8220;what happens if I reduce price by fifteen percent?&#8221;, it does not look up the historical relationship between price and sales. It propagates the intervention through the causal structure &#8212; accounting for competitive response, customer segment heterogeneity, inventory constraints &#8212; and produces a distribution of outcomes across simulated scenarios.</p><p>The scale at which this simulation runs is not incidental. The platform literature refers to &#8220;thousands of what-if scenarios&#8221; as a standard capability, and this is not hyperbole. The computational advance that made this practical is NOTEARS &#8212; Non-combinatorial Optimization via Trace Exponential and Augmented Lagrangian &#8212; which reframes the problem of learning a causal graph from data as a continuous optimization problem rather than a combinatorial search. Before NOTEARS, causal discovery across high-dimensional datasets was computationally prohibitive. The number of possible causal graphs grows exponentially with the number of variables. NOTEARS makes it tractable. 
It is, in the unglamorous way of genuine scientific progress, the thing that made the rest possible.</p><p>Real-time ingestion is the second architectural requirement, and it is where most enterprise implementations currently fail. A causal model is only as current as the data that updates it. The technical stack required for genuine real-time operation &#8212; event capture through systems like Apache Kafka or Redpanda, stream processing through Flink or Spark, real-time query through ClickHouse or Pinot &#8212; is mature and available. The organizational barriers to deploying it are not technical. They are the accumulated weight of data architectures built for batch processing, reporting systems designed for the rhythm of the quarterly review, and a decision culture that has never been asked to operate at the speed the data can now support.</p><div><hr></div><h2>Risk Is Probability Times Impact Magnitude</h2><p>Here the Living Model makes an intervention into the practice of analytics that is underappreciated in its importance.</p><p>Traditional risk assessment collapses the problem. It asks: how likely is this bad thing to happen? The result is a probability, and the probability is treated as the risk. This is not wrong exactly. It is incomplete in a way that produces systematically bad decisions.</p><p>A ten percent probability of losing one million dollars is not the same as a ten percent probability of losing one billion dollars. Any decision framework that treats these identically has abandoned the purpose of decision-making. Risk is probability times impact magnitude &#8212; and collapsing these two dimensions into one loses precisely the information that a decision-maker actually needs.</p><p>The Living Model formalizes this through the Expected Value of Intervention. 
For any proposed strategic action, the EVI is calculated as the product of reliability &#8212; the frequency with which the intervention produces positive outcomes &#8212; and effect size &#8212; the magnitude of the improvement when it does. This is not a novel mathematical insight. It is the formalization of what every experienced strategist already knows and almost no analytics system has been designed to calculate.</p><p>What the Living Model adds to this calculation is the counterfactual dimension. The question is not merely &#8220;what is the expected value of this intervention?&#8221; but &#8220;what is the expected value of this intervention compared to what would have happened without it?&#8221; Susan Athey&#8217;s work on Conditional Average Treatment Effects provides the computational machinery for this distinction. Causal forests &#8212; the method she developed with Stefan Wager &#8212; allow the estimation of how an intervention&#8217;s effect varies across different units, different contexts, different moments in time. This is the difference between knowing that a pricing change increases revenue on average and knowing which customers respond to a pricing change, by how much, and under what conditions.</p><p>This heterogeneity is where strategy lives. The average effect is rarely the decision-relevant fact. The decision-relevant fact is the effect on the specific segment, in the specific market, at the specific moment when you are deciding whether to act.</p><div><hr></div><h2>The Plumber&#8217;s Objection</h2><p>Esther Duflo&#8217;s &#8220;Economist as Plumber&#8221; lecture is an underappreciated corrective to the enthusiasm that tends to accompany the announcement of causal AI. Her argument is not against causal inference. It is against the assumption that having the right model is the same as making the right decision.</p><p>The plumber&#8217;s observation is this: models provide very little guidance on which implementation details will matter. 
A causal model might correctly identify that fund transfer delays are reducing program participation. What it cannot tell you, without additional investigation, is whether the delay is caused by administrative bottlenecks, verification requirements, banking infrastructure, or the timing of the month relative to harvest cycles. The mechanism matters. The mechanism determines which wrench to use.</p><p>This is the limitation that the commercial Living Model literature tends to understate. The platforms are not wrong about what their systems can do. They are often imprecise about what those systems require from the humans who operate them. Automated causal discovery can learn the structure of a system from data. It cannot learn the structure of an implementation failure from data, because the implementation failure is often the reason certain data was never collected.</p><p>The practical implication is that Living Models require a different kind of organizational competence than traditional analytics. The skill is not data science in the conventional sense. It is the ability to think structurally about mechanisms &#8212; to ask not &#8220;what correlates with our churn rate?&#8221; but &#8220;what are the three or four processes that actually determine whether a customer renews, and which of those processes can we change?&#8221; This is domain expertise operating as causal reasoning. It is the thing that turns a sophisticated simulation engine into an organizational asset rather than an expensive dashboard.</p><div><hr></div><h2>The Unconfoundedness Problem</h2><p>The functional validity of every Living Model rests on an assumption that is almost never perfectly satisfied: unconfoundedness, sometimes called selection-on-observables. This assumption requires that all variables influencing both the decision to intervene and the outcome of the intervention are measured and included in the model.</p><p>In a clinical trial, unconfoundedness is achieved by randomization. 
The coin flip breaks the connection between a patient&#8217;s background characteristics and their treatment assignment. No background characteristic can confound the effect estimate because no background characteristic predicts who gets the treatment.</p><p>In organizational data, you rarely have a coin flip. You have observational records of what your company decided to do, which were not random. You promoted the sales regions that were already performing. You raised prices in markets where demand was inelastic. You invested in the products customers were already buying. The decisions were intelligent. That intelligence is the problem. Every intelligent decision creates a confounding structure that makes it difficult to estimate the effect of having made a different decision.</p><p>The methods developed to address this &#8212; Double Machine Learning, Invariant Causal Prediction, instrumental variable estimation &#8212; are mathematically sophisticated and organizationally demanding. Double Machine Learning, the method at the core of causaLens&#8217;s decisionOS platform, uses orthogonal moment conditions to separate the causal effect of interest from the influence of measured confounders. It requires that you can predict both the treatment and the outcome from observed covariates, that you can do so well, and that the residual variation in treatment &#8212; the part that cannot be predicted by background characteristics &#8212; is sufficient to identify the causal effect.</p><p>What none of these methods can do is measure the unmeasured. The latent confounder &#8212; the competitor&#8217;s internal pricing meeting, the macro-sentiment shift that preceded the customer&#8217;s decision, the organizational change that happened six months before the attrition spike &#8212; remains the frontier problem of causal inference. The sophistication of the Living Model does not eliminate it. 
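The orthogonalization at the heart of Double Machine Learning can be sketched as a two-stage residual-on-residual regression. This is a deliberately stripped-down illustration on synthetic data: plain least squares stands in for the flexible nuisance learners a production platform would use, and the cross-fitting the full method requires is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
true_effect = 1.5

# Observed confounders X drive both the treatment (say, a price change)
# and the outcome (say, revenue).
X = rng.normal(size=(n, 3))
treatment = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)
outcome = true_effect * treatment + X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

def residualize(target, X):
    """Return target minus its least-squares projection onto X (plus intercept)."""
    Z = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ coef

# Stage 1: predict treatment and outcome from the confounders; keep residuals.
t_res = residualize(treatment, X)
y_res = residualize(outcome, X)

# Stage 2: regress residual on residual. The slope estimates the causal
# effect, orthogonal to everything the stage-1 models explained.
effect = float(t_res @ y_res / (t_res @ t_res))
```

The residual variation in treatment, `t_res`, is exactly the part that cannot be predicted from background characteristics; if it shrinks toward zero, the effect is unidentified, and no amount of second-stage modeling recovers it.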
It makes the model honest about where the uncertainty lives.</p><div><hr></div><h2>From Simulation to Intervention</h2><p>The clinical trial literature provides the clearest precedent for what the Living Model attempts in organizational settings. In drug development, the simulation comes before the trial: you have a model of the disease mechanism, a model of the drug&#8217;s action, and a simulation of the treatment effect across a population of virtual patients. The trial then tests whether the simulation was right.</p><p>The Living Model inverts part of this sequence. The simulation happens after &#8212; or alongside &#8212; the observational data. The model learns the mechanism from historical records, builds a causal structure, and then simulates the counterfactual: what would have happened if we had done something different?</p><p>The commercial implementation of this logic &#8212; platforms like causaLens, Vedrai&#8217;s WhAI, and PrescientIQ &#8212; represents the attempt to make this process accessible to decision-makers who are not statisticians. The &#8220;no-code causal ML&#8221; category is real and growing. What it offers is the ability to ask the causal question without writing the causal code. What it requires, still, is the ability to think causally about the system you are modeling. You cannot outsource the question. You can only outsource the calculation.</p><p>The treatment-ranked output &#8212; the list of potential interventions ordered by expected causal impact &#8212; is the Living Model&#8217;s most practically important deliverable. It answers the question that every strategy meeting is implicitly trying to answer: given the resources we have, which action produces the most actual change in the outcome we care about? Not the most correlated action. The most causal one.</p><div><hr></div><h2>What Is Actually Being Built</h2><p>The honest account of where this technology stands in 2025 is this: the theoretical foundations are mature. 
The commercial implementations are promising and uneven. The organizational conditions required to deploy them well are rare.</p><p>Pearl gave us the mathematical language of causality. Athey gave us the computational tools to estimate causal effects at scale. Duflo gave us the reminder that the model is never the intervention &#8212; that the distance between a correct causal estimate and an effective organizational change is filled with plumbing, and the plumbing is usually what fails.</p><p>The Living Model is the attempt to build a decision support architecture that does what decades of business intelligence have promised but never delivered: to tell you not just what happened and what is likely to happen, but what you should do about it, and why that action and not another, and how confident you should be, and what the expected value of doing nothing is.</p><p>That last question &#8212; what is the cost of inaction? &#8212; is the counterfactual that traditional analytics cannot ask. It requires knowing what would have happened in the absence of an intervention, which requires having a model of the causal mechanism, which requires having built the thing the Living Model is.</p><p>The Monday morning meeting that starts with a dashboard is not going to disappear immediately. The dashboards are good at what they do. But the questions they cannot answer &#8212; not the question of what happened, but the question of what to do, and the question of what would have happened if you had done it differently last quarter, and the question of which of your possible futures is worth building &#8212; these questions are now answerable, in principle, by systems that exist, for organizations willing to do the work of building the causal model of themselves.</p><p>The data was never the problem. It was always the question.</p><div><hr></div><p><em>I&#8217;ve been writing about computational doubt at <a href="https://skepticism.ai">Skepticism.ai</a>. 
But this argument &#8212; the specific argument about the mismatch between what analytics systems are built to do and what strategic intelligence actually requires &#8212; felt large enough to deserve its own space. That&#8217;s why I started <a href="https://theorist.ai">Theorist.ai</a>: a dedicated home for the question of what organizational intelligence owes the next generation of decision-makers, at the precise moment when machines have become genuinely good at answering questions and genuinely poor at knowing which questions are worth asking.</em></p><p><em>The Living Model is one answer to that question. <a href="https://hypothetical.ai">Hypothetical.ai</a> is where I&#8217;m building another &#8212; an exploration of realistic real-time hypothetical scenario generation that puts a causal brain directly in the hands of the people running the Monday morning meeting. That work is large enough to warrant its own space too. More there soon.</em></p><div><hr></div><p><strong>Tags:</strong> Living Model causal AI, Judea Pearl ladder of causation, counterfactual simulation enterprise analytics, structural causal models organizational strategy, real-time causal inference decision support</p>]]></content:encoded></item></channel></rss>