Can a Machine Interview an Expert?
Two decades of Knowledge Engineering research keeps arriving at the same answer — and it has nothing to do with algorithms.
This is the second piece in the Living Model series. The first established why causal AI matters for organizational decision-making. This one examines the structural barrier that keeps it confined to data science teams.
The gap is not where most people think it is.
When organizations fail to deploy causal AI at the executive level, the instinct is to diagnose a technical failure — the models are too complex, the data is too messy, the algorithms are too opaque. These are real problems. But they are downstream of a more fundamental one, and it is not technical at all.
It is conversational.
Building a Living Model — a Bayesian network or directed acyclic graph capable of supporting strategic intervention — requires a causal graph: a precise map of what causes what. That map does not live in any dataset. It lives in the mind of the person who knows the business, the clinic, or the engineering system. Getting it out requires asking that person the right questions, in the right order, while navigating the systematic biases that distort expert causal reasoning. This is the knowledge bottleneck. It is why causal AI remains siloed inside technical teams even when the people who need it most sit in boardrooms.
The field of Knowledge Engineering with Bayesian Networks has been working on this problem for two decades. The solution it keeps arriving at is structured conversation.
What the field actually knows about eliciting causal structure
The modeler’s job is not to build the causal structure — it is to extract it from someone who already holds it implicitly, then formalize what they know.
That single reorientation explains everything that follows. Knowledge Engineering with Bayesian Networks (KEBN) is the discipline of doing exactly this: taking implicit expert knowledge and transforming it into formal probabilistic models. It exists because data-driven discovery fails in the domains where causal AI matters most — rare events, novel interventions, strategic decisions where there is no historical data for the action being contemplated.
The KEBN process is iterative. Expert knowledge is extracted in cycles, each pass producing a more refined model. The multi-phase architecture breaks into four stages: variable definition (identifying what the model contains), structure elicitation (building the DAG that encodes what causes what), likelihood estimation (assigning conditional probabilities to each causal relationship), and model review (validating the output against expert intuition through scenario testing).
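To make the four phases concrete, here is a minimal sketch of one pass through the cycle, in plain Python with invented churn-model variables. Each phase leaves behind an artifact the expert can inspect, and a failed review sends the loop back to an earlier phase.

```python
# One pass of a KEBN-style cycle, sketched with invented churn-model names.
# Each phase produces an artifact the expert can inspect and revise.

# Phase 1 - variable definition: what the model contains.
variables = {"Price": ["low", "high"],
             "Satisfaction": ["low", "high"],
             "Churn": ["no", "yes"]}

# Phase 2 - structure elicitation: the DAG, as (cause, effect) edges.
edges = [("Price", "Satisfaction"), ("Satisfaction", "Churn")]

# Phase 3 - likelihood estimation: a conditional probability table per node.
# P(Churn | Satisfaction), elicited from the expert and refined each cycle.
p_churn_given_satisfaction = {"low": {"no": 0.55, "yes": 0.45},
                              "high": {"no": 0.90, "yes": 0.10}}

# Phase 4 - model review: scenario tests the expert signs off on.
# "If satisfaction is low, churn should be materially more likely."
assert (p_churn_given_satisfaction["low"]["yes"]
        > p_churn_given_satisfaction["high"]["yes"])

# A failed assertion sends the cycle back to an earlier phase.
print("Review passed; next iteration refines structure or probabilities.")
```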
Structure elicitation is the most critical phase. Methods like CausalNex’s “Structure Review” ask domain experts to validate learned edges by grouping variables into “themes” and checking whether the causal relationships between themes match their understanding of the system. When experts lack time for intensive sessions, online Delphi variants allow asynchronous, questionnaire-based elicitation.
Every one of these methods is sophisticated. None of them solves the problem of getting a CFO to sit still for a structured interview.
The systematic ways experts get causation wrong
This would be straightforward if expert causal reasoning were reliable. It is not.
The research on expert cognitive biases in causal contexts documents a consistent pattern. Experts excel at certain tasks — they identify direct causal mechanisms quickly, generate plausible hypotheses, and draw on accumulated domain pattern recognition. What they consistently fail at is causal structure: specifically, the topological properties of their own mental models.
The most dangerous failure mode is collider bias. A collider is a variable caused by two other variables — the arrows “collide” at that node. Conditioning on a collider induces a spurious association between its two parent causes that does not exist in the underlying system.
Berkson’s bias is the clinical illustration: among hospitalized patients, obesity appears protective against certain conditions. Being hospitalized is the collider — it is jointly caused by obesity and by the other conditions that put people in hospital. Among hospitalized patients, knowing that obesity is absent makes those other causes more likely, so obesity looks negatively associated with them. The protective signal is an artifact of the study sample. In the general population, obesity is still a risk factor.
Experts miss this consistently. The reason is intuitive: human reasoning rewards “more is better” covariate selection. Controlling for hospital admission seems like scientific rigor. The collider trap is invisible until it flips the sign of your estimated effect.
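The reversal is easy to reproduce. The simulation below (numpy, with invented effect sizes) builds a world where obesity and a second risk factor are independent and both raise the chance of admission; conditioning on admission manufactures a negative association between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two independent causes of hospital admission (invented effect sizes).
obesity = rng.random(n) < 0.3
other_risk = rng.random(n) < 0.2

# Admission is the collider: raised by either risk factor, plus a baseline.
p_admit = 0.05 + 0.30 * obesity + 0.40 * other_risk
admitted = rng.random(n) < p_admit

def association(a, b):
    """Phi coefficient: correlation between two binary (0/1) variables."""
    return np.corrcoef(a.astype(float), b.astype(float))[0, 1]

print("Whole population:      ", round(association(obesity, other_risk), 3))
print("Hospitalized patients: ",
      round(association(obesity[admitted], other_risk[admitted]), 3))
# Whole population: ~0.00. Hospitalized subset: clearly negative --
# the spurious "protective" signal induced by conditioning on the collider.
```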
Feedback loops are the second major failure zone. Complex business systems — customer retention, supply chain dynamics, pricing strategy — involve bidirectional influences where the effect cycles back to influence the cause. Experts simplify these into linear sequences. The practical consequence: interventions designed for a static model get neutralized by feedback mechanisms the model didn’t capture.
The domain-matching heuristic produces a third category of errors. When experts lack specific mechanistic knowledge, they assume that cause and effect must come from the same domain — a mechanical failure has a mechanical cause, a financial outcome has a financial driver. This systematically blinds expert models to cross-domain influences, which is precisely where the most strategically interesting causal effects tend to live.
Why human domain knowledge is mathematically necessary
This is the point that gets lost in discussions of automated causal discovery. Multiple distinct causal structures can be perfectly consistent with the same statistical data. This is not a limitation of current algorithms — it is a mathematical result.
Two DAGs belong to the same Markov equivalence class if they imply identical conditional independence relationships in the data. The chain X → Y → Z, the fork X ← Y → Z, and the reverse chain X ← Y ← Z are all statistically indistinguishable from observational data alone: each implies that X and Z are dependent marginally and independent given Y. Without experimental intervention — without actually doing something and observing the result — the data cannot tell you which structure is correct.
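A small simulation makes the point tangible. In the linear-Gaussian toy below, the chain, the fork, and the reverse chain all leave the same statistical fingerprint: X and Z correlate marginally and become independent once you control for Y, so an algorithm that only tests independencies cannot tell the three graphs apart.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def partial_corr_xz_given_y(x, y, z):
    """Correlation of the residuals of X and Z after regressing each on Y."""
    rx = x - np.polyval(np.polyfit(y, x, 1), y)
    rz = z - np.polyval(np.polyfit(y, z, 1), y)
    return np.corrcoef(rx, rz)[0, 1]

def summarize(name, x, y, z):
    print(f"{name:13s} corr(X,Z)={np.corrcoef(x, z)[0, 1]:+.3f}  "
          f"corr(X,Z|Y)={partial_corr_xz_given_y(x, y, z):+.3f}")

# Chain: X -> Y -> Z
x = rng.normal(size=n); y = x + rng.normal(size=n); z = y + rng.normal(size=n)
summarize("chain", x, y, z)

# Fork: X <- Y -> Z
y = rng.normal(size=n); x = y + rng.normal(size=n); z = y + rng.normal(size=n)
summarize("fork", x, y, z)

# Reverse chain: X <- Y <- Z
z = rng.normal(size=n); y = z + rng.normal(size=n); x = y + rng.normal(size=n)
summarize("reverse chain", x, y, z)
# All three structures print a clearly nonzero corr(X,Z)
# and a near-zero corr(X,Z|Y): the same independence signature.
```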
The expert who says “I know from operating this system for fifteen years that Y causes Z, not the other way around” is providing information that no dataset contains.
Human domain knowledge is not a convenience that speeds up causal discovery. It is mathematically required to orient edges in the graph where data cannot. Functional Causal Models — methods like LiNGAM that assume specific distributional properties — can theoretically resolve some Markov equivalence by exploiting non-Gaussianity. In practice, business data is sparse, noisy, and frequently contains hidden confounders that make these methods brittle. The expert remains the final arbiter of structural directionality.
This is why the knowledge bottleneck is not a product development inconvenience. It is a mathematical constraint on what causal AI can do without structured expert input.
What adjacent fields have learned about structured elicitation
Clinical medicine, intelligence analysis, and engineering risk assessment have each developed formal elicitation protocols because they face the same underlying problem: high-stakes decisions require beliefs to be made explicit and quantified, but the people who hold those beliefs are prone to the same cognitive biases that undermine all expert judgment.
The Sheffield Elicitation Framework (SHELF) and the IDEA protocol — Investigate, Discuss, Estimate, Aggregate — represent the clinical state of the art. Both require individual expert judgments before any group interaction, preventing anchoring effects where a single confident voice shapes everyone else’s estimates. Both include calibration phases where experts are trained in probabilistic reasoning before being asked to provide priors. The key disciplinary insight: collect individual beliefs first, then aggregate — never let group dynamics produce the prior.
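A minimal sketch of the individuals-first discipline, assuming each expert has already converted their belief into a private probability estimate: collect the numbers before anyone speaks, then combine them with a simple equal-weight opinion pool.

```python
# Hypothetical private estimates of P(churn rises if price increases 10%),
# collected before any group discussion (the SHELF/IDEA discipline).
individual_estimates = {
    "expert_A": 0.70,
    "expert_B": 0.55,
    "expert_C": 0.80,
    "expert_D": 0.60,
}

# Linear opinion pool: equal-weight average of the private judgments.
weights = {name: 1 / len(individual_estimates) for name in individual_estimates}
pooled = sum(weights[name] * p for name, p in individual_estimates.items())

spread = max(individual_estimates.values()) - min(individual_estimates.values())
print(f"Pooled prior: {pooled:.2f}  (spread {spread:.2f})")
# A wide spread flags genuine disagreement worth discussing -- after pooling,
# not before, so no single confident voice anchors the rest.
```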
The intelligence community’s Analysis of Competing Hypotheses (ACH) addresses confirmation bias directly. Rather than building the case for a favored hypothesis, ACH requires generating an exhaustive set of competing hypotheses and evaluating how each piece of evidence affects the likelihood of each. The discipline is disconfirmation: try to disprove your leading theory, not prove it. The diagnostic value of evidence — what makes one hypothesis more likely relative to others — is the coin of the realm.
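The same logic can be sketched as an evidence-by-hypothesis matrix. The entries below are invented; the point is the procedure, where each piece of evidence is scored against every hypothesis and the diagnostic items are the ones whose scores differ.

```python
# Analysis of Competing Hypotheses, sketched with invented entries.
# Scores: +1 consistent, 0 neutral, -1 inconsistent with the hypothesis.
hypotheses = ["H1: churn driven by price", "H2: churn driven by support quality"]
evidence = {
    "churn spiked after the price change":    [+1, 0],
    "NPS fell before the price change":       [0, +1],
    "competitor undercut us the same month":  [+1, -1],
}

for item, scores in evidence.items():
    diagnostic = len(set(scores)) > 1  # discriminates between hypotheses
    print(f"{item:40s} scores={scores}  diagnostic={diagnostic}")

# ACH ranks hypotheses by how little inconsistent evidence they carry,
# pushing the analyst to disconfirm the favorite rather than confirm it.
totals = [sum(scores[i] for scores in evidence.values()) for i in range(2)]
for h, t in zip(hypotheses, totals):
    print(f"{h}: net consistency {t:+d}")
```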
Both protocols work for the same structural reason: they do not ask experts to be better reasoners. They change the procedure so that better reasoning emerges.
The anatomy of a causal interview
A causal modeling interview is structurally different from a requirements gathering session. Requirements gathering asks: what do you need the system to do? Causal elicitation asks: why does the world behave as it does?
The operative framework is Pearl’s Ladder of Causation — a three-level taxonomy of causal reasoning that the interviewer uses to scaffold the expert’s mental model upward.
Level one is association: “What patterns have you observed between marketing spend and churn?” This establishes correlations — what co-moves with what. It is the level where most analytics operates and where the expert feels most comfortable.
Level two is intervention: “If we doubled pricing tomorrow, what would be the true incremental impact on demand?” This is the causal level — not what correlates, but what would happen if you actually did something. Answering well requires the expert to distinguish the effects of the action itself from the selection effects that made the action happen. It requires thinking about heterogeneous treatment effects: not just “what happens” but “for whom, under what conditions.”
Level three is counterfactual: “Given that we increased spend and sales declined, would the decline have been worse if we had not increased spend?” This is the hardest cognitive level. It requires holding the actual world and an imagined world simultaneously and comparing them. It is also the level that reveals the most about underlying causal structure.
The modeler’s formal procedure moves through three steps: abduction (ask the expert to account for the current state of the system), action (simulate an intervention in the expert’s mental model), and prediction (ask the expert to predict the new outcome while holding background conditions constant). The critical discipline is keeping these steps separate — experts will naturally collapse them, leaping from “we would do X” to “therefore Y” without tracing the mechanism.
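The three steps have a direct formal counterpart in a structural causal model. The toy below (one equation, an invented coefficient) walks through them in order and answers the level-three question from above: abduction recovers the background conditions consistent with what was observed, action swaps in the intervention, and prediction re-runs the model with those background conditions held fixed.

```python
# Toy structural causal model, invented coefficient:
#   sales_change = 2.0 * spend + u    (u = background market conditions)
# Observed world: we spent 1.0 units and sales still fell by 1.0.

observed_spend, observed_sales_change = 1.0, -1.0

# Abduction: recover the background term consistent with what we observed.
u = observed_sales_change - 2.0 * observed_spend      # u = -3.0 (weak market)

# Action: replace the spend equation with the intervention "no spend".
counterfactual_spend = 0.0

# Prediction: re-run the model with the recovered background term held fixed.
counterfactual_sales_change = 2.0 * counterfactual_spend + u

print(f"Actual sales change:            {observed_sales_change}")
print(f"Sales change had we not spent:  {counterfactual_sales_change}")
# -1.0 vs -3.0: sales declined, but without the spend the decline would
# have been worse. Keeping abduction, action, and prediction separate is
# what lets the expert's mental model answer the level-three question.
```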
The central question that separates causal elicitation from requirements gathering is this: “What factors genuinely create sustainable advantage — rather than merely predict it?” This shifts the conversation from symptomatic manifestations to underlying generative mechanisms.
What happens when LLMs try to help
Recent research on LLM-assisted causal graph construction has produced a precise picture of where the technology helps and where it fails — a reliability gap that maps almost exactly onto the distinction between semantic pattern matching and genuine structural reasoning.
On tasks where node metadata is available — where variables have names and descriptions that convey their domain and relationships — LLMs perform well. The Causal-LLM framework showed that LLMs outperform symbolic graph learning methods by 40% in edge accuracy on medical datasets with clear semantic content. They are particularly good at capturing global dependencies and avoiding the spurious cycles that pairwise iterative methods introduce.
On tasks requiring actual causal reasoning from text, the picture is different. The ReCITE benchmark — which requires extracting causal graphs from lengthy academic papers with implicit relationships — yields an F1 score of just 0.535 from even the best available models. Accuracy drops sharply as relationships become less explicit and the network grows more complex.
The deeper failure is what researchers have called the “causal parrot” effect. LLMs learn associations between concepts from training data — smoking and cancer, interest rates and inflation, marketing spend and revenue — and reproduce these associations fluently. When tested on “pure reasoning” tasks like Corr2Cause, where the model must determine whether causation is validly inferred from a given correlation, performance approaches chance. The memorized associations are doing almost all the work.
This is not a knock on LLMs for causal work — it is a job description. They surface plausible structures. The expert adjudicates them. The emerging CausalChat framework makes this division of labor explicit: the LLM generates candidate causal relationships, the human expert evaluates and selects. This human-LLM collaborative workflow is more effective than either working alone precisely because it allocates tasks to where each excels.
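A minimal sketch of that division of labor, with a stub standing in for the LLM call and invented edges and rationales: the model proposes, the expert adjudicates, and nothing enters the graph without an explicit verdict.

```python
# Human-in-the-loop edge review, sketched with a stubbed LLM proposer.
# propose_candidate_edges() stands in for a real LLM call; the edges,
# rationales, and variable names here are invented for illustration.

def propose_candidate_edges(variables):
    """Stub for an LLM that suggests (cause, effect, rationale) triples."""
    return [
        ("Price", "Churn", "price increases are a common churn trigger"),
        ("Churn", "Price", "reverse direction is statistically indistinguishable"),
        ("SupportQuality", "Churn", "poor support plausibly drives cancellations"),
    ]

def expert_review(candidates):
    """The expert, not the model, adjudicates each proposed edge."""
    accepted = []
    for cause, effect, rationale in candidates:
        answer = input(f"Accept {cause} -> {effect}? ({rationale}) [y/n] ")
        if answer.strip().lower().startswith("y"):
            accepted.append((cause, effect))
    return accepted

if __name__ == "__main__":
    candidates = propose_candidate_edges(["Price", "SupportQuality", "Churn"])
    print("Edges entering the model:", expert_review(candidates))
```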
Turning “very likely” into a probability
The goal of prior quantification is not precision. It’s honesty.
Even after the expert has provided a causal structure — a DAG that both expert and modeler believe reflects the underlying system — the Bayesian network requires numerical conditional probability distributions at every node. The expert who says “customer satisfaction strongly affects retention” must eventually produce a number.
The solution is linguistic probability theory. Words of Estimative Probability — “highly likely,” “probable,” “unlikely,” “remote” — can be mapped to fuzzy membership functions through calibration procedures. The trapezoidal functions are defined by empirically sampling from relevant populations: fifty construction site managers produce the thresholds between “low,” “medium,” and “high” risk; medical expert panels produce the translation from “clinically significant” to a probability range. The critical constraint is that the membership functions must satisfy Kolmogorov’s axioms — they must partition unity and preserve continuity at transition points.
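Mechanically, the mapping looks like this. The breakpoints below are invented rather than empirically calibrated, but they show how a phrase like “highly likely” becomes both a membership function over the probability axis and a usable point value.

```python
# Trapezoidal membership functions for Words of Estimative Probability.
# The breakpoints (a, b, c, d) are invented; in practice they come from
# calibration studies on the relevant expert population.

def trapezoid(p, a, b, c, d):
    """Membership of p in a trapezoid rising a->b, flat b->c, falling c->d."""
    if p <= a or p >= d:
        return 0.0
    if b <= p <= c:
        return 1.0
    return (p - a) / (b - a) if p < b else (d - p) / (d - c)

phrases = {
    "unlikely":      (0.05, 0.15, 0.25, 0.35),
    "probable":      (0.45, 0.55, 0.70, 0.80),
    "highly likely": (0.75, 0.85, 0.95, 1.00),
}

p = 0.9
for phrase, params in phrases.items():
    print(f"membership of p={p} in '{phrase}': {trapezoid(p, *params):.2f}")

# A representative point value: midpoint of the trapezoid's plateau.
a, b, c, d = phrases["highly likely"]
print(f"'highly likely' ~ {(b + c) / 2:.2f} as a working prior")
```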
When experts cannot provide percentage estimates at all, modelers use Laplace’s rule of succession and reference class reasoning. The expert is asked to judge, based on everything they knew before observing any outcomes, how tractable this class of problem seemed; each subsequent observation then updates a beta distribution over the true success probability. The prior is the expert’s initial assessment; the posterior is what the evidence has revised it toward.
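The arithmetic behind the update is short. With a uniform Beta(1, 1) prior, the classical rule-of-succession choice, and s successes in n comparable trials from the reference class, the posterior is Beta(1 + s, 1 + n − s) and the probability of success on the next trial is (s + 1) / (n + 2); an expert’s initial assessment can replace the uniform prior with a more informative Beta. The numbers below are invented.

```python
# Laplace's rule of succession with a reference class of comparable trials.
# Invented numbers: suppose 3 of 8 similar initiatives succeeded.

prior_alpha, prior_beta = 1, 1          # uniform Beta(1, 1) prior
successes, trials = 3, 8                # reference-class observations

post_alpha = prior_alpha + successes
post_beta = prior_beta + (trials - successes)

posterior_mean = post_alpha / (post_alpha + post_beta)
next_trial_probability = (successes + 1) / (trials + 2)   # rule of succession

print(f"Posterior: Beta({post_alpha}, {post_beta}), mean {posterior_mean:.3f}")
print(f"P(success on the next trial) = {next_trial_probability:.3f}")
# Both lines print 0.400: a vague verbal judgment becomes a distribution
# that the next observation can legitimately shift.
```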
Neither approach produces perfect priors. Both produce honest ones — distributions that reflect what the expert actually believes, quantified in a way that can be updated as evidence accumulates. The alternative is the false precision of a number someone agreed on in a conference room.
The architecture that follows
The gap between causal theory and organizational practice is not primarily an algorithm problem or a data problem. It is an agency problem. Strategy executives possess causal intent — they know what matters, how the business works, what interventions have been tried and why they succeeded or failed. Data science teams possess causal implementation — they can build the models, run the inference, validate the structure. The bottleneck is the conversation between them.
The machine that interviews the expert is the natural architecture for this bridge. Not a general-purpose chatbot, and not a causal inference algorithm, but a system designed specifically to guide non-technical domain experts through the Ladder of Causation — surfacing candidate structures, detecting the characteristic patterns of collider bias and feedback loop omission, translating linguistic probability expressions into quantitative priors, and resolving Markov equivalence through directed questioning.
The components exist. Structured elicitation protocols from SHELF and ACH. Conversational interfaces from CausalChat research. Linguistic probability mapping from fuzzy membership theory. What has not existed is a system that integrates them into a single coherent workflow designed for the strategy executive rather than the data scientist.
That is the architectural problem the third piece in this series will address directly.
If you’re building a decision support system — or trying to get causal AI out of the data science team and into the boardroom — I’d like to hear what the bottleneck looks like from your side. Reply or comment below.
hypothetical.ai — causal intelligence for executives who actually have to make decisions.
Tags: knowledge elicitation Bayesian networks, Markov equivalence causal inference, expert cognitive bias collider, causal interview protocol, LLM causal graph construction


