In healthcare settings where patients use LLMs as medical assistants, LLM performance differs between evaluation and deployment. (a) Bean et al. (2025) find a 61 percentage point drop in accuracy between evaluation and deployment. (b) We argue this gap arises not from poorly designed benchmarks, but from implicit assumptions embedded in evaluation protocols that fail to hold at deployment. (c) We propose a taxonomy that categorizes assumptions into two types, task and outcome, to diagnose where the gap arises and what is required to close it. Closing the gap requires making assumptions explicit, testing which assumptions hold, and updating evaluation protocols accordingly.

Healthcare LLM benchmarks are one of the main paradigms by which LLMs are evaluated prior to deployment in clinical settings. Benchmarks provide a stable goalpost that allows researchers to iterate quickly and measure progress consistently. However, in high-stakes domains like healthcare, the abstraction that makes benchmarks stable becomes a liability. For example, a recent study (Bean et al., 2025) found a 61 percentage point drop in accuracy when going from evaluation to deployment (see Figure). In the setting studied, patients use LLMs as medical assistants to better understand their symptoms, identify the underlying condition, and take appropriate actions.

Moreover, the results showed that patients given access to a highly capable model as a medical assistant did no better at self-diagnosis than those without any model. That is, access to an LLM had no significant impact on patient understanding. The implication isn’t that the model underperformed. Rather, it’s that the way we evaluate is separate from what matters in deployment. For example, during evaluation we ask “does the model get the right answer?” while during deployment we ask “does the patient act correctly on what the model tells them?”

We argue that this gap arises because evaluation embeds implicit assumptions that don’t hold in the real world: the scenario a benchmark intends to capture and the scenario encountered in deployment differ, which in turn challenges the validity of the evaluation. We classify these assumptions into two types: task assumptions, which concern the conversation data, and outcome assumptions, which concern human behavior and outcomes. To address this gap, we propose a framework called BenchmarkCards that makes these assumptions explicit so practitioners can identify when benchmark results transfer to deployment.

Understanding the Evaluation–Deployment Gap through Assumptions

As an example, Figure 1 illustrates our position in a healthcare setting where the performance of an LLM used as a medical assistant differs between evaluation and deployment, with accuracy falling from 95% to 34% (Bean et al., 2025). During evaluation, the model was given doctor-written, single-turn scenarios (one question, one answer, no follow-up) and asked to produce a diagnosis. During deployment, patients interacted with the model in a back-and-forth manner, and success was measured by whether they could correctly identify their diagnosis afterward.

In this setting, three assumptions underlie the gap: 

  1. Query Distribution – Evaluation uses doctor-written queries, while real patients produce queries that may be incomplete or imprecise.
  2. Interaction Type – Evaluation features single-turn interactions, while real deployments involve back-and-forth dialogue.
  3. Decision Mediation – Evaluation measures whether the LLM produces the correct diagnosis, while deployment measures whether the patient acts on it correctly.

We note that these are broad categories of assumptions which are present across evaluation settings, and return to these when introducing BenchmarkCards. 

Stating benchmark assumptions explicitly allows us to estimate how much each assumption contributes to the evaluation-deployment gap. For example, we can measure how the same LLM performs on multi-turn interactions versus single-turn ones. Doing so in our running example reveals that the 61 percentage point gap between evaluation and deployment can be broken down into 12 points due to query distribution, 19 points due to interaction type, and 30 points due to decision mediation.
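To make the arithmetic concrete, here is a minimal sketch that checks the per-assumption contributions add up to the reported gap and computes each assumption's share. The numbers come from the running example; the function name `decompose_gap` and the purely additive treatment are our own illustrative assumptions, not the study's measurement procedure.

```python
# Accuracy in the running example (Bean et al., 2025), in percent.
EVAL_ACCURACY = 95
DEPLOY_ACCURACY = 34

# Points attributed to each assumption, as reported in the text above.
CONTRIBUTIONS = {
    "query_distribution": 12,
    "interaction_type": 19,
    "decision_mediation": 30,
}


def decompose_gap(eval_acc, deploy_acc, contributions):
    """Return the total gap, the unexplained remainder, and each assumption's share.

    Hypothetical helper: assumes contributions are additive, as the text reports.
    """
    total_gap = eval_acc - deploy_acc
    unexplained = total_gap - sum(contributions.values())
    shares = {name: pts / total_gap for name, pts in contributions.items()}
    return total_gap, unexplained, shares


total_gap, unexplained, shares = decompose_gap(
    EVAL_ACCURACY, DEPLOY_ACCURACY, CONTRIBUTIONS
)
print(f"total gap: {total_gap} points, unexplained: {unexplained} points")
for name, share in shares.items():
    print(f"{name}: {share:.0%} of the gap")
```

Under these numbers the three contributions account for the full 61 points, and decision mediation alone makes up roughly half of the gap (30 of 61 points).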

That last number reflects something no benchmark can observe: whether patients actually follow what the model tells them. Unlike the first two assumptions, which concern how the task is structured, decision mediation depends entirely on human behavior. A model could correctly diagnose appendicitis, but if the patient dismisses the recommendation, the outcome is the same as a wrong answer. Even a perfectly designed benchmark cannot capture this failure mode, which suggests model evaluators, deployers, and users need a different way of thinking about assumptions altogether. 

When assumptions go unstated, the very purpose of benchmark evaluation, quantifying and comparing model performance to guide deployment decisions, is defeated: practitioners have no way to assess whether benchmark results hold in their setting, or whether any available benchmark provides reliable guidance at all.

Closing the Gap through Benchmark Cards and Staged Evaluation

Assumptions fall into two categories, task and outcome, which differ in whether they can be tested with conversation data alone. For example, assumptions about whether conversations are single-turn or multi-turn are task assumptions, while assumptions about proxy versus clinical metrics are outcome assumptions.

More generally, we can view assumptions as clustering into two types: task and outcome. Task assumptions concern whether the benchmark faithfully represents the conditions of deployment. For example, if real-world conversations are multi-turn, does the benchmark reflect this? Outcome assumptions concern whether the benchmark’s evaluation criterion matches what actually matters in the real world. For example, a benchmark might measure LLM decision-making, while real-world performance depends on what the user does afterward.

Critically, we note that tackling outcome assumptions requires running real-world behavioral experiments. Task assumptions can be addressed by building benchmarks that more closely resemble real-world conversations, but outcome assumptions depend on human behavior that no benchmark can simulate. Understanding whether users act on LLM recommendations, for instance, requires actually observing them do so. 

Closing the gap requires two pieces of knowledge: what assumptions a benchmark makes, and whether these assumptions hold in a particular deployment context. To address the first point, we propose BenchmarkCards, structured documentation that benchmark designers fill out alongside their benchmark datasets to answer questions about their evaluation protocol without anticipating any particular downstream use (see Table). A practitioner facing a deployment decision then uses the cards to assess which assumptions hold in their setting and identify which benchmarks most closely match their use case. When no existing benchmark matches well, the card makes that gap visible, and signals to the community where new benchmarks are needed.

A BenchmarkCard is filled out once by benchmark designers, explicitly documenting the assumptions built into their evaluation. A practitioner then uses it to assess which assumptions hold in their specific deployment context. The left columns document what the benchmark assumed; the right column shows where those assumptions broke down in this deployment.
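One way to picture a card in code is as a small record of assumptions that a practitioner compares against their own setting. The sketch below is illustrative only: the class names, field names, and string values are hypothetical and do not reproduce the actual BenchmarkCard schema. The three assumptions are the ones from the running example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Assumption:
    """One documented assumption (hypothetical schema, for illustration)."""
    name: str
    kind: str  # "task" or "outcome"
    benchmark_setting: str  # what the benchmark assumed


@dataclass
class BenchmarkCard:
    """Filled out once by the benchmark designer (hypothetical schema)."""
    benchmark_name: str
    assumptions: List[Assumption] = field(default_factory=list)

    def violated(self, deployment: Dict[str, str]) -> List[Assumption]:
        """Return assumptions whose benchmark setting differs from the deployment.

        An assumption the practitioner has not described is treated as
        violated, since there is no evidence that it holds.
        """
        return [
            a for a in self.assumptions
            if deployment.get(a.name) != a.benchmark_setting
        ]


# Card for the running example; the benchmark name is a placeholder.
card = BenchmarkCard(
    benchmark_name="example-diagnosis-benchmark",
    assumptions=[
        Assumption("query_distribution", "task", "doctor-written"),
        Assumption("interaction_type", "task", "single-turn"),
        Assumption("decision_mediation", "outcome", "model diagnosis"),
    ],
)

# The practitioner describes their own deployment context.
deployment = {
    "query_distribution": "patient-written",
    "interaction_type": "multi-turn",
    "decision_mediation": "patient action",
}

for a in card.violated(deployment):
    print(f"{a.kind} assumption does not hold: {a.name}")
```

In this example all three assumptions fail to hold, which matches the running example: two task assumptions and one outcome assumption separate the benchmark from the deployment.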

Once assumptions are identified, we propose staged evaluation: an iterative process where assumptions are tested one by one and evaluation protocols updated accordingly. The stages are:

  1. Compare BenchmarkCards against Deployment – Use BenchmarkCards to identify which assumptions hold and which don’t. 
  2. Collect Data for Task Assumptions – For example, collect data on real user interactions to capture the difference in query distribution. This augments a pre-existing benchmark so it is more applicable to a real-world setting. 
  3. Test Task Assumptions – Measure performance degradations and, for assumptions with large drops, improve the model or collect more targeted data. Once task assumptions are satisfied, move to outcome assumptions.
  4. Test Outcome Assumptions – Using domain expertise, prioritize which outcome assumptions matter most, then run behavioral studies or randomized controlled trials (RCTs) to test them.
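The ordering of the stages above can be sketched as a small planning function. This is a rough illustration under our own assumptions: the function name, the tuple layout, and the choice to tackle task assumptions in order of largest measured drop are hypothetical, and the outcome assumptions are assumed to arrive already prioritized by domain experts.

```python
def plan_staged_evaluation(violations):
    """Order violated assumptions: task assumptions first, then outcome ones.

    `violations` is a list of (name, kind, measured_drop) tuples, where
    measured_drop is the performance drop in percentage points for task
    assumptions and None for outcome assumptions (hypothetical layout).
    """
    task = [v for v in violations if v[1] == "task"]
    outcome = [v for v in violations if v[1] == "outcome"]

    # Illustrative choice: address the largest measured drops first.
    task.sort(key=lambda v: v[2], reverse=True)

    plan = [("test_task_assumption", name) for name, _, _ in task]
    # Outcome assumptions need behavioral studies or RCTs, run after task ones.
    plan += [("run_behavioral_study", name) for name, _, _ in outcome]
    return plan


# Violated assumptions and measured drops from the running example.
violations = [
    ("query_distribution", "task", 12),
    ("decision_mediation", "outcome", None),
    ("interaction_type", "task", 19),
]

plan = plan_staged_evaluation(violations)
for step, name in plan:
    print(step, name)
```

Keeping outcome assumptions last mirrors the stages above: they are the most expensive to test, so they are only reached once the task assumptions are satisfied.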

A Call to Action

Better benchmarks are necessary but not sufficient for deploying LLMs safely in healthcare. The fix requires benchmark designers to state plainly what their evaluation does and does not capture, practitioners to check those assumptions against their deployment context, and the community to build the infrastructure that makes this standard procedure rather than exceptional effort. The ask looks different depending on where you sit. For AI teams considering deployment: test assumptions before you ship, not after; don’t wait for real-world failure to tell you where your evaluation fell short. For researchers building the next healthcare benchmark: document your assumptions, so future users can judge for themselves whether your evaluation applies to their setting. For clinicians: treat high benchmark numbers as a starting point for conversation, not a green light.

Acknowledgements: This blog post is based on our paper Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions, co-authored with Santiago Cortes-Gomez, Mateo Dulce Rubio, Fei Fang, and Bryan Wilder. Many thanks to Lawrence Jang, Amanda Coston, Luke Guerdan, Sang Truong, and Tori Qiu for their comments on this work.