The Architecture of the Expert Interview
Building the Machine That Interviews the Expert
Part Three of the Living Model Series
Part One: The Causal Brain: Living Models and the End of Backward-Looking Analytics
The interview begins before the first question is asked.
By the time a strategy executive sits down with the system — whether they call it a decision tool, a knowledge capture platform, or simply the thing their CIO installed last quarter — the machine has already read the literature, loaded the domain ontology, and constructed a skeleton of what the causal graph probably looks like. It has hypotheses. It has priors. It is waiting, not passively, but the way a prepared interviewer waits: with a plan, a set of probes for when the plan fails, and a practiced ability to follow an unexpected answer into territory the script did not anticipate.
This is what makes building such a system genuinely difficult. The problem is not conversational AI — that technology exists, is mature, and can maintain context across a long structured interview without losing the thread. The problem is that a causal elicitation session is not a conversation. It is a measurement. Every answer the expert gives must be converted into a structural constraint on a Directed Acyclic Graph — a formal mathematical object with strict requirements for consistency, acyclicity, and identifiability. The machine must translate between two vocabularies simultaneously: the expert’s language of mechanism and intuition, and the graph’s language of nodes, directed edges, and conditional independence statements.
No production system does this end-to-end today. That gap is what this piece is about.
What the Machine Must Actually Do
Start with the architecture’s requirements, because the requirements reveal why the gap exists.
The system needs to conduct a structured interview that moves an expert up what Judea Pearl calls the Ladder of Causation — from association (what tends to co-occur) to intervention (what would change if we acted) to counterfactual (what would have happened if we had acted differently). These are not simply harder versions of the same question. They are logically distinct operations, and LLMs trained on text have a documented tendency to collapse them. Ask a model what would happen if we doubled the marketing budget, and it will give you a confident answer that is actually a probabilistic interpolation from training data — not a causal claim, despite sounding like one. A causal elicitation system must detect when this collapse has happened in the expert’s own reasoning, not just in the model’s output.
The system simultaneously needs to construct a valid DAG in real time, which means enforcing acyclicity — no loops — while the expert is still talking. Experts describe feedback loops constantly, because feedback loops are how systems actually work. “Customer satisfaction drives retention, which drives revenue, which drives our ability to invest in customer satisfaction” is not wrong as a description of a business. It is wrong as a DAG. The system must resolve this without telling the executive that their understanding of their own business is mathematically invalid.
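The formal side of that resolution is mechanical. A minimal sketch, using Python's standard-library graphlib and a hypothetical edge list: detect the cycle, then unroll it into a lagged structure in which every edge crosses one time step, which is acyclic by construction.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical edge list from the expert's narrative (cause -> effect).
edges = [
    ("satisfaction", "retention"),
    ("retention", "revenue"),
    ("revenue", "satisfaction"),  # closes the feedback loop
]

def has_cycle(edges):
    """True if the directed graph given as (cause, effect) pairs is cyclic."""
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
    try:
        list(TopologicalSorter(preds).static_order())
        return False
    except CycleError:
        return True

def unroll(edges, steps=3):
    """Resolve a cycle into a lagged structure: each edge u -> v becomes
    u at time t -> v at time t+1, so every edge points forward in time."""
    return [(f"{u}@t{t}", f"{v}@t{t + 1}")
            for t in range(steps - 1) for u, v in edges]

assert has_cycle(edges)          # the narrative, taken literally, loops
assert not has_cycle(unroll(edges))  # the lagged version is a valid DAG
```

The temporal resolution question described later in this piece ("which tends to move first?") is what licenses the unrolling: it tells the system which lag structure matches the expert's intuition.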
And the system must do all of this while managing cognitive bias. The expert who has spent twenty years in a domain has strong prior beliefs, selective memory, and a systematic tendency to underweight alternatives they have already dismissed. The machine must surface those alternatives without inducing defensiveness, detect collider traps before they propagate through the structure, and calibrate confidence intervals around claims that the expert will inevitably express in natural language rather than probability notation.
These requirements, taken together, describe a system that does not currently exist as a unified product. What exists are components — good components, in some cases excellent ones — that have not been assembled into a workflow designed for the person who actually holds the causal knowledge: the domain expert who is not a statistician.
The Conversational Layer: How the Interview Actually Works
The closest implemented systems to what this architecture requires are CausalChat-class interfaces, where a human and an LLM collaborate to build a causal graph through structured dialogue. The interaction protocol is more constrained than it appears.
The system does not ask open-ended questions about causality. Open-ended questions produce open-ended answers, which are difficult to convert to structural constraints. Instead, the interview follows a sequence that maps to specific graph operations. Variable identification comes first: the system presents candidate nodes derived from domain literature and asks the expert to confirm, reject, or rename them. This is the phase most amenable to LLM assistance, because literature synthesis is something current models do well. The expert’s role here is curatorial — they are not generating the variable list from scratch, they are editing a draft.
Edge elicitation is where the interview becomes genuinely difficult. The system presents candidate relationships — “Does pricing pressure tend to precede customer churn, or does customer churn tend to precede pricing pressure?” — and the expert must orient the edge. The key design insight from existing implementations is that experts orient edges more reliably when presented with temporal language rather than causal language. “Which comes first?” is easier to answer than “Which causes which?” — and for the purposes of initial graph construction, temporal precedence is often a sufficient proxy for edge direction.
The level-detection problem — distinguishing when an expert is making an associational claim versus an interventional one — is addressed through explicit prompting patterns. The system maintains a running classification of each statement: is this an observation about correlation, a prediction about what would happen under an action, or a claim about a counterfactual world? When an expert shifts levels without flagging it, the system generates a clarifying probe. “You mentioned that higher prices tend to follow higher demand. Are you describing what you’ve observed in the data, or predicting what would happen if we actively raised prices?” This is not sophisticated by itself. The sophistication is in the architecture that makes this probe available at the right moment in a long conversation, without interrupting the expert’s flow every thirty seconds.
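A deliberately crude sketch of that running classification, with hypothetical cue phrases standing in for the LLM classifier a real system would use:

```python
from enum import Enum

class Rung(Enum):
    ASSOCIATION = 1     # observations about what co-occurs
    INTERVENTION = 2    # predictions about what an action would do
    COUNTERFACTUAL = 3  # claims about a world where we acted differently

# Hypothetical cue lists; string matching is a stand-in for a real classifier.
CUES = {
    Rung.COUNTERFACTUAL: ("would have", "had we", "if we had"),
    Rung.INTERVENTION: ("if we", "were we to", "suppose we"),
    Rung.ASSOCIATION: ("tend to", "tends to", "correlat", "usually"),
}

def classify(statement):
    """Assign a statement to a rung, checking the highest rung first so
    counterfactual phrasing is not swallowed by the 'if we' cue."""
    s = statement.lower()
    for rung in (Rung.COUNTERFACTUAL, Rung.INTERVENTION, Rung.ASSOCIATION):
        if any(cue in s for cue in CUES[rung]):
            return rung
    return Rung.ASSOCIATION  # default: treat unmarked claims as observational

def probe(previous, current):
    """Generate a clarifying probe only when the expert shifts rungs."""
    if previous is Rung.ASSOCIATION and current is Rung.INTERVENTION:
        return ("Are you describing what you've observed, or predicting "
                "what would happen if we actively changed it?")
    return None
```

The point of the `probe` gate is the flow-control concern in the paragraph above: the system classifies every statement but only speaks up on a level shift.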
The multi-agent design that makes this tractable assigns different probing responsibilities to different reasoning modes. A temporal agent validates that proposed causes precede proposed effects. A physical plausibility agent checks that claimed mechanisms do not violate domain constraints. A dependence agent cross-references the emerging graph against available data to flag when a proposed edge is inconsistent with observed conditional independencies. These are not separate models in most implementations — they are system prompts that instantiate different reasoning personas within a single LLM call, evaluated against the same emerging graph. The coordination overhead is real, and it is one of the reasons this class of system remains research-adjacent rather than production-ready.
The Markov Problem: What Data Cannot Settle
Here is the problem that no amount of data can solve without expert input, and that most strategy executives have never heard of.
Multiple Directed Acyclic Graphs can encode exactly the same statistical relationships. Suppose you observe that A and B are correlated, that B and C are correlated, and that A and C become independent once you condition on B. These observations are consistent with A → B → C, with A ← B ← C, and with A ← B → C. No statistical test distinguishes between them. The graphs are Markov equivalent: they make identical predictions about any observational dataset. (The collider structure A → B ← C, by contrast, is not in this class: it implies the opposite pattern, with A and C independent until you condition on B.)
This is not an edge case. It is the general condition. For a graph with ten variables, the number of Markov-equivalent structures consistent with any given observational dataset can run into the thousands. Constraint- and score-based discovery algorithms such as PC and GES return these equivalence classes as their output; NOTEARS, which optimizes over single graphs, commits to one member under additional assumptions. The Completed Partially Directed Acyclic Graph, or CPDAG, represents the ambiguity honestly: directed edges where the equivalence class agrees on direction, undirected edges everywhere else.
Resolving undirected edges requires expert knowledge. The elicitation system’s job, in this phase, is to identify which undirected edges matter most for the intended decision analysis, and then design questions that force orientation without requiring the expert to understand what a CPDAG is.
The question design here is specific. Markov equivalence is broken by interventional reasoning, not observational reasoning. “If we forced A to a different value through external intervention, rather than just observing it change, would B move?” This question, if the expert can answer it, orients the A-B edge: an intervention on A propagates to B only if A is a cause of B, never the reverse. A companion probe, “if we locked B through intervention, would changes in A still be associated with changes in C?”, then separates a mediated chain through B from a direct A → C pathway. The expert does not need to know why these questions resolve the equivalence. They only need to be able to answer them.
The gap in current systems is that constructing these interventional probes automatically, from an arbitrary CPDAG, is an unsolved engineering problem. It requires the system to identify which undirected edges are strategically important, formulate an interventional question about them in domain language, and interpret the expert’s answer as a structural constraint. Pieces of this pipeline exist. The integrated version does not.
Bias Interception in Real Time
Cognitive bias in expert elicitation is not a calibration problem. It is an architecture problem.
The ACH protocol — Analysis of Competing Hypotheses — is the most rigorous structured approach to preventing selection bias in human expert judgment. Its core insight is counterintuitive: instead of asking experts to build the case for their preferred hypothesis, ask them to identify evidence that would be inconsistent with each hypothesis on the table. Experts are systematically better at finding disconfirming evidence for alternatives than they are at generating alternatives in the first place. The matrix that results — hypotheses across the top, evidence down the side, consistency markers in the cells — is a discipline device. It forces the expert to maintain multiple live explanations simultaneously.
An automated ACH layer in the elicitation system generates the initial hypothesis set from domain literature, ensures it is sufficiently saturated (covering the space of plausible explanations, not just the obvious ones), and tracks the consistency matrix as the interview proceeds. When the emerging graph begins to converge strongly on a single structure, the system actively generates probes for the neglected alternatives. This is bias interception, not bias correction — it operates before the expert has committed to a conclusion, not after.
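As a data structure, the consistency matrix is simple; the value is in the discipline it enforces. A minimal sketch with hypothetical class and method names, showing the two queries the elicitation loop needs: rank hypotheses by disconfirming evidence, and find the neglected ones to probe next.

```python
from dataclasses import dataclass, field

CONSISTENT, NEUTRAL, INCONSISTENT = 1, 0, -1

@dataclass
class ACHMatrix:
    """Hypotheses across the top, evidence down the side, markers in cells."""
    hypotheses: list
    cells: dict = field(default_factory=dict)  # (evidence, hypothesis) -> marker

    def record(self, evidence, hypothesis, marker):
        self.cells[(evidence, hypothesis)] = marker

    def inconsistency(self, hypothesis):
        """ACH ranks by disconfirming evidence, not by support."""
        return sum(1 for (_, h), m in self.cells.items()
                   if h == hypothesis and m == INCONSISTENT)

    def neglected(self, min_evidence=1):
        """Hypotheses with too little recorded evidence: probe these next."""
        counts = {h: 0 for h in self.hypotheses}
        for (_, h) in self.cells:
            counts[h] += 1
        return [h for h, n in counts.items() if n < min_evidence]

m = ACHMatrix(["price-driven churn", "quality-driven churn", "competitor entry"])
m.record("churn spiked after the price hike", "price-driven churn", CONSISTENT)
m.record("churn spiked after the price hike", "quality-driven churn", INCONSISTENT)
print(m.neglected())  # the alternative the interview has not yet touched
```

When `neglected()` is non-empty while the graph is converging, that is the trigger for the probes described above.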
Collider bias requires a different kind of interception. A collider is a node that has two incoming arrows: X → Z ← Y. When X and Y are causally independent, they are statistically independent in an unselected sample. But when you condition on Z — when you stratify by Z’s value, or control for it in a regression, or select your data based on Z — X and Y become spuriously associated. This is not a subtle effect. It is a major source of published research errors, and it occurs routinely in business contexts: conditioning on customer retention to analyze the relationship between marketing spend and product quality, for example, creates a spurious negative correlation between two variables that may be genuinely independent.
The live backdoor criterion checker — which the system runs continuously as the expert elaborates the graph — identifies collider topologies as they form and flags them immediately. The prompt it generates does not say “you have a collider.” It says: “You mentioned controlling for customer retention in your analysis. If retention is influenced by both marketing spend and product quality, including it as a control variable might actually create a misleading relationship between those two inputs. Would you like me to show you what the analysis looks like without that control?” The expert corrects the structure. The bias is intercepted without the expert ever needing to understand d-separation.
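The structural half of that checker, spotting collider topologies before a proposed control variable does damage, reduces to counting parents. A minimal sketch with hypothetical function names; the full backdoor criterion requires more machinery than this, but this is the trigger for the prompt above.

```python
def colliders(edges):
    """Find collider topologies X -> Z <- Y in a (cause, effect) edge list:
    any node with two or more parents."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    return {z: ps for z, ps in parents.items() if len(ps) >= 2}

def flag_controls(edges, proposed_controls):
    """Return the proposed control variables that are colliders in the
    emerging graph, so the system can generate a clarifying probe."""
    found = colliders(edges)
    return [z for z in proposed_controls if z in found]

graph = [
    ("marketing_spend", "retention"),
    ("product_quality", "retention"),
    ("marketing_spend", "awareness"),
]
print(flag_controls(graph, ["retention", "awareness"]))
```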
Cycle detection operates on similar principles. When the expert’s narrative implies a feedback loop — and most executives’ mental models of their businesses are feedback-loop-laden — the system does not tell them the loop is invalid. It asks a temporal resolution question: “In the relationship you’re describing between customer satisfaction and revenue — over a single quarter, which tends to move first?” Temporal precedence breaks the cycle into a lagged structure that can be represented as a DAG across time steps. The expert’s intuition is preserved. The graph’s formal requirements are satisfied.
Translating Language to Probability
The minimum viable product of a knowledge elicitation session is a causal graph with probability distributions attached to the edges. Without the distributions, you have a qualitative map. With them, you have a model you can actually run.
The linguistic probability mapping problem is this: experts express uncertainty in natural language. “It’s likely that the relationship is positive.” “I’d be surprised if the effect were larger than twenty percent.” “I’m quite confident this matters, but I’m not sure of the direction.” Each of these statements contains a probability judgment, embedded in ordinary language, that the system needs to extract and represent formally.
LLMs are reasonably good at this mapping task. Studies using benchmarks like QUITE show that models like GPT-4 can convert uncertainty expressions to numerical ranges with reasonable alignment to population-level human agreement. “Likely” maps to roughly 65–80% probability. “Highly improbable” maps to below 10%. The variance is significant — different people mean different things by “likely” — and the system should therefore treat the LLM’s initial mapping as a prior that the calibration process refines, not as a final answer.
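A sketch of the initial mapping layer, with hypothetical phrase-to-interval values in the spirit of the figures above. The intervals are the point: different speakers mean different things by the same word, so the system stores a range to refine, not a number to trust.

```python
# Hypothetical starting-point mapping: the LLM's draft prior, to be
# refined by per-expert calibration rather than used as a final answer.
VERBAL_PRIORS = {
    "almost certain":    (0.90, 0.99),
    "likely":            (0.65, 0.80),
    "about even":        (0.45, 0.55),
    "unlikely":          (0.15, 0.35),
    "highly improbable": (0.01, 0.10),
}

def to_interval(phrase):
    """Map a verbal uncertainty expression to a probability interval.
    Unknown phrases fall back to maximal uncertainty."""
    return VERBAL_PRIORS.get(phrase.lower().strip(), (0.0, 1.0))

print(to_interval("Likely"))
print(to_interval("no earthly idea"))  # falls back to (0.0, 1.0)
```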
Calibration in short sessions is possible but requires deliberate design. The most accessible approach is reference class calibration: the system presents the expert with questions in their domain for which ground-truth outcomes are known, observes whether their expressed confidence levels align with actual accuracy, and applies a correction function to subsequent elicitation. This can be compressed into ten to fifteen meaningful calibration pairs. The resulting correction is crude by the standards of formal elicitation methods like SHELF, which was designed for multi-day workshops with trained facilitators. It is not crude relative to uncalibrated expert judgment, which is what organizations currently use for most strategic decisions.
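A deliberately crude correction in this spirit, with hypothetical function names: estimate average overconfidence from the calibration pairs, then shift subsequent stated probabilities by that amount. Real elicitation protocols fit richer curves; this is the ten-pair version.

```python
import statistics

def calibration_correction(pairs):
    """Fit a correction from a handful of calibration pairs, each a
    (stated_confidence, was_correct) tuple. Returns a function that
    shifts stated probabilities by the observed overconfidence."""
    stated = [p for p, _ in pairs]
    actual = [1.0 if ok else 0.0 for _, ok in pairs]
    overconfidence = statistics.mean(stated) - statistics.mean(actual)
    def correct(p):
        return min(max(p - overconfidence, 0.0), 1.0)
    return correct

# An expert who says "90%" but is right 70% of the time gets shifted down.
pairs = [(0.9, True)] * 7 + [(0.9, False)] * 3
correct = calibration_correction(pairs)
print(round(correct(0.85), 2))  # subsequent claims shrink accordingly
```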
The minimum viable prior — the question of how much quantification is actually necessary — is a design decision that the architecture should make explicit rather than leave implicit. For first-pass scenario modeling, edge signs (positive or negative) combined with rough magnitude buckets (weak, moderate, strong) are sufficient to generate useful sensitivity analyses. Full parametric distributions are necessary for formal decision analysis with explicit uncertainty bounds. The system should present the expert with this tradeoff clearly: here is what the model can do with what you’ve given it so far, here is what becomes possible if you provide more precise estimates.
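A sketch of what signs plus buckets buy you, with hypothetical bucket values: enough structure to propagate a rough effect along a path and rank scenarios, nothing more.

```python
# Hypothetical magnitude buckets; the numbers are placeholders, not
# estimates, and exist only to make sensitivity ordering possible.
MAGNITUDE = {"weak": 0.1, "moderate": 0.3, "strong": 0.6}

def edge_weight(sign, bucket):
    """Convert an elicited (sign, bucket) pair to a rough edge weight."""
    return (1 if sign == "+" else -1) * MAGNITUDE[bucket]

def propagate(path_edges):
    """Rough effect of a path's source on its sink: the product of
    edge weights, under a crude linear approximation."""
    w = 1.0
    for sign, bucket in path_edges:
        w *= edge_weight(sign, bucket)
    return w

# Price -> satisfaction (strong, negative) -> retention (moderate, positive):
print(propagate([("-", "strong"), ("+", "moderate")]))
```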
The System That Doesn’t Exist Yet
The research confirms the expected finding: no production system integrates this full workflow for strategy executives.
The closest attempts are instructive in their failures. Bayesia and Netica are mature Bayesian network builders with sophisticated inference engines and accessible interfaces — but their elicitation workflow is essentially a spreadsheet. The expert manually enters conditional probability tables. There is no conversational layer, no bias detection, no Markov equivalence resolution through natural language. They are tools for modelers who already understand Bayesian networks, not tools for eliciting knowledge from people who don’t.
CausalNex and DoWhy are Python libraries with strong algorithmic foundations — NOTEARS integration, do-calculus support, counterfactual reasoning — but zero user interface. They are infrastructure for data scientists building applications, not applications themselves.
The LLM-native tools — various GPT-4-based assistants configured for causal reasoning — have the conversational fluency and the literature synthesis capability, but they lack structural enforcement. They will help an expert think through a causal diagram in natural language, and they will produce a description of a graph. They will not enforce acyclicity, detect colliders, resolve Markov equivalence, or hand off a validated CPDAG to a downstream discovery algorithm. They are thinking partners, not elicitation machines.
The gap is not any single missing piece. It is the integration: the pipeline that takes a natural language conversation, enforces formal graph constraints in real time, manages cognitive bias without disrupting expert flow, and produces an output that a causal discovery algorithm can refine and a decision analysis engine can run. The components exist. The assembly has not happened.
The Minimum Viable Interview
What is the shortest structured conversation that produces a causal graph sufficient for first-pass scenario modeling?
The answer, based on what implemented systems have demonstrated, is approximately forty-five minutes with a domain expert who has been briefed on the format. The structure is:
Variable confirmation (ten minutes): The system presents a draft variable list derived from domain literature. The expert confirms, rejects, and renames. No graph structure is discussed yet.
Edge elicitation (twenty minutes): The system presents candidate relationships in temporal language. The expert orients edges. The system flags cycles and resolves them through temporal probing. Collider topologies are flagged as they form.
Interventional disambiguation (ten minutes): The system identifies the highest-stakes undirected edges in the emerging CPDAG and presents interventional probes to orient them. Three to five questions, each targeted at a specific structural ambiguity.
Confidence calibration (five minutes): The system presents reference-class calibration questions, adjusts the expert’s probability mappings, and applies corrections to the distributions already assigned to edges.
The output is a partially directed graph with rough probability distributions on the oriented edges — not a publication-ready causal model, but a structure sufficient to run basic counterfactual scenarios and identify which additional data collection would most reduce uncertainty. This is the living model’s first breath.
The interview is where the model is born. What comes next — the feedback loops, the data integration, the iterative refinement as outcomes arrive — is how it stays alive.
The Living Model series continues in Part Four: From Graph to Decision — Running Counterfactuals Against Causal Structure
Tags: causal knowledge elicitation architecture, LLM-guided DAG construction, Markov equivalence resolution expert interview, Bayesian network prior elicitation, NOTEARS PC algorithm expert integration


