
May 15, 2026

AI in HEOR: Toward an AI-Native Research Cycle

By Dr. M. Christopher Roebuck

A recent NBER session on AI in healthcare, chaired by Kosali Simon, surfaced two moments worth carrying directly into the AI conversation in Health Economics and Outcomes Research (HEOR).

The first came from Scott Cunningham. He described what he called an "AlphaFold moment" for empirical research — pointing at work from the University of Zurich Social Catalyst Lab on agentic systems, sometimes called APE for Autonomous Policy Evaluation, that can take a policy question, download public data, choose an empirical design (difference-in-differences, regression discontinuity, instrumental variables), run robustness checks, produce figures and tables, and write a 20-to-25-page applied econometrics manuscript in a few hours. The traces from those agents are not abstract. Cunningham described JSON-style logs of one agent trying, abandoning, and re-trying specifications. He has, in his own work, accidentally spent $700 running such experiments. His framing was that someone, today, with a modest annual budget, could "swarm health economics with papers."

The second came from David Bradford, who edits Health Economics. The journal sees about 1,300 submissions a year. Many papers get filtered out fast at the first read because they are off-topic, badly framed, or obviously not fit. AI weakens that filter. A paper that is not really health economics can be rewritten to look like health economics. A weak paper can be polished into something that survives the first screen. Bradford described an experiment the journal is preparing, using old papers and AI-generated referee reports, to see whether decisions change when AI is in the loop. He joked, half-seriously, that they may find flaws in papers the journal has already published.

Take those two moments together and the picture is clear. The production cost of plausible empirical research in health economics is collapsing. The cost of verification — checking that the data, code, assumptions, specifications, and interpretation are sound — is not. The AI in HEOR question, in 2026, is not whether a chatbot can draft a manuscript section. That answer is yes, and it has been yes for a while. The harder question is what HEOR does about a world in which polished empirical research is cheap to produce and trust is harder to earn.

The conversation HEOR is having, and the one it needs to have

The Professional Society for Health Economics and Outcomes Research (ISPOR) began this conversation earlier than its academic counterparts did, and has built real institutional machinery around it. But it is still organized around a question that is becoming the smaller one.

ISPOR’s published AI work is substantive. The PALISADE Checklist anchored a Good Practices Report on machine-learning methods. CHEERS-AI consolidated reporting standards for AI interventions in economic evaluations. ELEVATE-GenAI laid out reporting guidelines across ten domains (transparency, accuracy, reproducibility, and more) for large language model use in HEOR. The Taxonomy of Generative AI in HEOR organized the conceptual landscape. The Good Practices Report on Generative AI for HTA addressed opportunities, challenges, and policy considerations. The GenAI for HEOR Systematic Literature Reviews Task Force is actively building guidance for one of the most labor-heavy parts of evidence generation. AI now rightly sits at the top of ISPOR’s published 2026–2027 trends list. ISPOR 2026 itself frames a significant share of its AI programming as "going beyond the use of chatbots."

None of this is a use-case catalog. Each of these outputs is doing necessary work.

What they do not yet do, taken together, is redesign the research cycle around the fact that AI has changed what is producible and what needs to be checked. They tell you how to describe what AI did. They tell you what to report. They tell you what makes machine learning methodologically sound. They tell you how to use AI for SLRs. They do not connect protocols to statistical analysis plans (SAPs), to code, to models, to manuscripts, and to review as inspectable artifacts. The next layer is structural, not descriptive, and that is where the most consequential work is going to happen.

What HEOR actually looks like up close

It helps to be concrete about what the work consists of, because the AI conversation often hovers above it.

A burden-of-illness study in claims data is not really a document. It is a chain of decisions: a research question, a target population, a set of ICD-10-CM, ICD-10-PCS, CPT, HCPCS, NDC, and place-of-service codes used to identify it, an index date rule, a continuous enrollment requirement, washout and baseline periods, a risk adjustment strategy, propensity scoring or matching with caliper choices, an outcome definition, an outcome window, a censoring rule, site-of-care definitions, episode construction logic, subgroup definitions, exclusion criteria, missingness handling, a regression specification, and a series of sensitivity analyses. You get the point. Some of these are conventional. Many are judgment calls that materially move the answer.
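To make one slice of that chain concrete, here is a minimal sketch, in Python, of what a cohort definition looks like once it is written down as structured data instead of prose. The field names and codes below are invented for illustration, not drawn from any real protocol:

```python
from dataclasses import dataclass, field

@dataclass
class CohortSpec:
    """Illustrative subset of the decisions behind a claims-based cohort."""
    diagnosis_codes: list[str]          # ICD-10-CM codes identifying the condition
    index_rule: str                     # e.g., first qualifying claim in the ID period
    continuous_enrollment_days: int     # pre- and post-index enrollment requirement
    washout_days: int                   # clean period with no prior qualifying claims
    baseline_days: int                  # lookback window for covariates
    outcome_window_days: int            # follow-up window for the outcome
    exclusions: list[str] = field(default_factory=list)

# Placeholder values for illustration only -- not a real study definition.
spec = CohortSpec(
    diagnosis_codes=["E11.9", "E11.65"],   # example type 2 diabetes codes
    index_rule="first_qualifying_claim",
    continuous_enrollment_days=365,
    washout_days=180,
    baseline_days=365,
    outcome_window_days=365,
    exclusions=["age_under_18", "prior_insulin_use"],
)
```

The value of writing it this way is not automation. It is that every judgment call above becomes something a reviewer can diff against the SAP and the code.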

A cost-effectiveness model is similar. The structure is one decision. The parameter sources are another. The survival extrapolation, utilities, discount rate, time horizon, comparator, perspective, cycle length, and the way uncertainty is propagated through scenarios: each is both standard and contested.
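As a toy illustration only, with invented states, transition probabilities, costs, and utilities, the mechanics of a simple Markov cohort model fit in a couple of dozen lines. The arithmetic is never the contested part; the inputs are:

```python
import numpy as np

def discounted_totals(trans, state_costs, state_utils, cycles, rate, cycle_years=1.0):
    """Run a simple Markov cohort model; return discounted total cost and QALYs."""
    dist = np.zeros(len(state_costs))
    dist[0] = 1.0                              # whole cohort starts in the first state
    total_cost = total_qaly = 0.0
    for t in range(cycles):
        disc = 1.0 / (1.0 + rate) ** (t * cycle_years)
        total_cost += disc * dist @ state_costs
        total_qaly += disc * dist @ state_utils * cycle_years
        dist = dist @ trans                    # advance the cohort one cycle
    return total_cost, total_qaly

# Invented 3-state example (stable, progressed, dead); nothing here is a real input.
trans_new = np.array([[0.86, 0.09, 0.05],
                      [0.00, 0.80, 0.20],
                      [0.00, 0.00, 1.00]])
trans_std = np.array([[0.78, 0.14, 0.08],
                      [0.00, 0.80, 0.20],
                      [0.00, 0.00, 1.00]])
utils     = np.array([0.80, 0.55, 0.00])
costs_new = np.array([32000.0, 30000.0, 0.0])   # hypothetical new therapy: higher drug cost
costs_std = np.array([12000.0, 30000.0, 0.0])

c_new, q_new = discounted_totals(trans_new, costs_new, utils, cycles=20, rate=0.03)
c_std, q_std = discounted_totals(trans_std, costs_std, utils, cycles=20, rate=0.03)
icer = (c_new - c_std) / (q_new - q_std)        # incremental cost per QALY gained
```

Change the discount rate, the horizon, the cycle length, or any transition probability and the ICER moves, which is exactly why each of those choices deserves to be inspectable.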

Around the analysis sits the documentation: the protocol, the SAP, the model file, the manuscript, the value dossier, the HTA submission. In a healthy world those objects say the same thing. Often they do not. The protocol describes one cohort and the code constructs another. The SAP names one outcome and the table reports another. The model documents one assumption and the deck oversimplifies it. The manuscript claims more than the analysis supports. Reviewers catch some of this. Most of it lives quietly in folders.

This is the surface AI is now being introduced onto, with agents that can search the decision space far more rapidly than a human would. Cunningham’s specific warning is exactly this: the agent finds many specifications, the final paper looks principled, and the search history is invisible, sometimes editable or deletable after the fact. For health economics, where each of those choices can shift a coverage decision, a price negotiation, or a regulatory submission, that is not a theoretical concern.

Production is cheap. Verification isn’t.

This is the asymmetry the next phase of AI in HEOR has to take seriously. Much of the public AI-in-HEOR conversation right now is some version of "minutes, not months." That framing is not wrong about speed. It is wrong about what matters.

Production gets cheaper. Protocols can be drafted faster. Code can be generated and revised faster. Models can be assembled, reformatted, and rerun faster. Manuscripts, reviewer responses, value dossiers, briefing documents, and regulatory submissions (the whole scaffolding of evidence work) all move faster.

Verification does not get faster on its own. Not the surface kind, the substantive kind: whether the protocol matches the SAP, whether the SAP matches the code, whether the code matches the tables, whether the tables match the manuscript’s claims, whether the cited sources actually support them, whether the model’s assumptions are documented anywhere a reader can find them, whether stated limitations are real or polite, whether another team could reproduce the work at all.

The honest answer is that HEOR has not been perfect at any of these questions in a slow-research era. In a fast-research era, the gap widens unless verification becomes a first-class problem.

Parts of the field are already moving in this direction, and the NBER discussion was useful for naming them. Coady Wing made the distinction that probably matters most in the short term: between AI as autonomous paper generator and AI as workflow accelerator. The autonomous-paper world is the one that worries Cunningham. The workflow-accelerator world is the one that, used well, raises the floor on research infrastructure: cleaner code, better documentation, more transparent workflows, replication packages worth the name. Bradford’s Health Economics experiment with AI-generated referee reports against old manuscripts is in that workflow-accelerator spirit, applied to the review side rather than the writing side. That is what taking the verification problem seriously looks like, in practice.

What this implies for AI in HEOR is that the most valuable AI work is not, primarily, generation. It is also the comparison work. Compare the protocol to the SAP. Compare the SAP to the code. Compare the code to the table. Compare the table to the manuscript. Compare the cited claim to the cited source. Compare the model file to its documentation. Compare what was actually done to what the field would expect.
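A hedged sketch of what one layer of that comparison work can look like, assuming the artifacts have already been parsed into structured form, which is most of the real difficulty. The field names and values are hypothetical:

```python
def check_consistency(protocol: dict, sap: dict, analysis_codes: set, table_outcomes: set) -> list:
    """Flag disagreements between study artifacts. Illustrative only."""
    issues = []

    # Protocol vs. SAP: do they name the same primary outcome?
    if protocol["primary_outcome"] != sap["primary_outcome"]:
        issues.append("Primary outcome differs between protocol and SAP")

    # Protocol vs. code: codes used in the analysis that the protocol never listed
    undeclared = analysis_codes - set(protocol["diagnosis_codes"])
    if undeclared:
        issues.append(f"Codes used but not declared in protocol: {sorted(undeclared)}")

    # SAP vs. tables: outcomes reported that were never pre-specified
    unplanned = table_outcomes - set(sap["outcomes"])
    if unplanned:
        issues.append(f"Outcomes reported but not pre-specified: {sorted(unplanned)}")

    return issues

issues = check_consistency(
    protocol={"primary_outcome": "all-cause hospitalization",
              "diagnosis_codes": ["E11.9", "E11.65"]},
    sap={"primary_outcome": "all-cause hospitalization",
         "outcomes": ["all-cause hospitalization", "total cost of care"]},
    analysis_codes={"E11.9", "E11.65", "E11.22"},               # one code crept in during analysis
    table_outcomes={"all-cause hospitalization", "ED visits"},  # one unplanned outcome in the tables
)
```

None of these checks is intelligent on its own. Their value is that they run every time, on every artifact, instead of depending on a tired reviewer noticing the mismatch.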

Modern large language models can do every one of those things — well enough today that the bottleneck is no longer the model. It is the workflow around the model.

A second HEOR-specific problem: restricted data

There is one issue here that is sharper for HEOR than for most of academic economics, and the NBER panel surfaced it directly. Most of the evidence work that matters in health economics happens behind a data wall: Medicare claims, Medicaid TAF, commercial claims, EHR data, registry data, state APCDs, and secure enclaves like the CMS Virtual Research Data Center (VRDC). None of those datasets can simply be handed to a frontier model.

Kosali Simon laid out the plausible institutional responses: no AI inside the enclave with iteration happening externally on mock or synthetic data; local open-weight models deployed inside; or vendor-based secure cloud arrangements operating under HIPAA business associate agreements. Open-weight models trail the frontier only slightly, and probably not for long, when it comes to most empirical HEOR work. An audience suggestion added a fourth path worth taking seriously: synthetic or differentially private datasets that allow AI-enabled development outside, with final code submitted to the data holder for execution against the real data. The point for HEOR is that the verification problem is hardest exactly where the data are most valuable. The fields most exposed to AI-amplified specification search are also the fields most dependent on restricted real-world evidence access.

An AI-native research cycle, defined more concretely

What I mean by an AI-native research cycle is not a slogan. It is the recognition that the artifacts of HEOR were designed in a world where humans were the only entities reading them. They no longer are.

A protocol can stop being a static Word document and start functioning as a structured specification: one that maps to a cohort definition in code, to a list of diagnosis, procedure, and drug codes, to inclusion and exclusion criteria as they were actually implemented, to a table shell, to a model assumption, to a reporting standard, to a defined check. The evidence behind a model parameter can become a queryable object whose updates flow through to the analyses that depend on it. A manuscript, its underlying code, its tables, its model documentation, and its review history can be inspected as a single connected research object rather than as five files in five places.
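As a small, hypothetical sketch of the "queryable object" idea, with invented names and values: a model parameter that carries its own provenance and knows which analyses depend on it, so that an evidence update becomes a visible event rather than a silent edit.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceSource:
    citation: str          # where the value comes from
    value: float
    retrieved: str         # date the value was extracted

@dataclass
class ModelParameter:
    name: str
    source: EvidenceSource
    used_by: list[str] = field(default_factory=list)   # analyses that depend on this input

    def update(self, new_source: EvidenceSource) -> list[str]:
        """Swap in new evidence and return the analyses that now need to be re-run."""
        self.source = new_source
        return list(self.used_by)

# Invented example: a utility value feeding two downstream analyses.
utility_progressed = ModelParameter(
    name="utility_progressed_state",
    source=EvidenceSource("Hypothetical et al. 2024", 0.55, "2026-01-10"),
    used_by=["base_case_cea", "scenario_short_horizon"],
)
stale = utility_progressed.update(EvidenceSource("Hypothetical update 2026", 0.58, "2026-05-01"))
# 'stale' now lists the analyses whose results no longer reflect the current evidence.
```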

That is the structural redesign the field needs. It is the layer that ELEVATE-GenAI, PALISADE, CHEERS-AI, the GenAI Taxonomy, the HTA Good Practices Report, and the GenAI for HEOR SLRs Task Force will sit on, when it exists. Reporting guidelines are necessary. They are not sufficient.

Why HEOR is especially exposed

HEOR is, by structure, a worst-case test of casual AI adoption.

The field combines heavy reliance on real-world data, operational definitions sensitive to small choices, semi-standardized workflows that still demand judgment, payer- and HTA-facing stakes, and evidence that frequently travels far outside the academic literature into commercial and regulatory settings. Models can look precise while hiding assumptions. Manuscripts can read fluently while resting on weak methods. Decks can compress nuance until it disappears.

AI in HEOR is powerful enough to help and plausible enough to mislead. A chatbot will draft a protocol that reads well. That does not mean the protocol is correct. A model will produce code that runs. That does not mean it implements the estimand. An AI summary will describe a literature. That does not mean it distinguishes credible from outdated or biased work. An AI reviewer will flag inconsistencies. That does not mean it understands the therapy area, the payer context, or how the evidence will be used.

The risk is not that AI replaces good HEOR researchers. The risk is that AI lets mediocre research move faster and read more confidently than it has any right to. Fluency is not credibility. The field knows this in theory. It needs to act on it in practice.

What I have been building

Two things are worth introducing here, since the verification thesis is also the thesis I have been building toward.

The first is HEOR Coder, a protocol-to-codebook generator I have been developing. The premise is straightforward. Take a study protocol or statistical analysis plan and produce analyst-reviewable claims variables and exportable code sets across the coding systems claims research lives in, including ICD-10-CM, ICD-10-PCS, CPT, HCPCS, and NDC. The output is a codebook a reviewer can actually challenge: every variable traceable back to the protocol it came from, every coding decision visible rather than buried. The point is not auto-coding. The point is making cohort and variable construction inspectable, which is the side of the research cycle Cunningham’s warning most directly hits.

The second is PeerReviewer, a structured peer review platform built for journals and editors, and useful to authors who want their manuscripts pressure-tested before they submit. It treats a manuscript and its supporting materials, including tables, figures, appendices, and statistical reports, as a connected research object, runs multiple structured review configurations calibrated to journal and discipline norms, and produces feedback that an editor or a serious author can actually act on. It is not an author-side chatbot. If the kind of journal-side experiment Bradford described at NBER is the leading edge of what review systems need, PeerReviewer is the kind of tool that edge eventually needs to land on.

Both are coming out in the next several weeks. I have been quieter about this work than I should have been. Part of that is that the verification problem is hard and the tools that take it seriously have to be built carefully. But standing in front of the AI-in-HEOR conversation as it exists today, I think it is time to be more direct: there is a difference between thinking about AI in HEOR and building for it.

Two communities at the same problem

I have worked for more than twenty-five years across both sides of the health economics map. On the academic and policy side: AEA, iHEA, ASHEcon, AcademyHealth, and the NBER programming health economists follow closely, including the AI session this essay opens with. On the applied side: ISPOR, and to a certain extent AMCP, where health economists collaborate with pharmacists on coverage and managed-care research. I have been a member, presenter, journal contributor, and peer reviewer across these communities. The cross-membership is not a credential. It is a vantage point, and from it the AI conversation looks different on each side.

Academic health economists, including the NBER-adjacent groups, are asking foundational questions about scientific production: what AI does to identification strategies, to peer review, to publication incentives, to the long-run credibility of empirical work. Applied HEOR is closer to industry timelines, payer evidence needs, regulatory pressure, and operational reproducibility. The applied side may move faster, not because its questions are smaller but because its deliverables are more immediate and its financial incentives are sharper.

Both groups end up at the same problem from opposite directions. AI will make it easier to produce research-shaped outputs, and it will not, on its own, make those outputs more reliable. Academic economics is more likely to frame that as a question about knowledge accumulation. Applied HEOR is more likely to ask whether a dossier, a model, or a payer submission still holds up six months later. They are the same question. It would help if more people sat across both rooms.

Where this should leave the field

Few fields are better positioned than HEOR to lead here. Few are also more exposed to getting it wrong.

The field is well positioned because the work is empirical, applied, methodologically serious, and tied to consequential decisions. There is enough repeated structure for AI to be genuinely useful and enough complexity to demand real expertise. It is exposed because the field’s existing artifacts (protocols, SAPs, code, models, manuscripts, dossiers) were designed to be read by humans in slow loops, not connected and inspected at the speed AI now enables.

If HEOR treats AI as a faster way to produce the same disconnected artifacts, the verification gap will widen until something embarrassing forces a correction. If HEOR treats AI as an occasion to redesign the research cycle — making protocols structured, code traceable, model assumptions inspectable, manuscripts connected to their underlying analysis, and review systematic — then the field has a real chance to set the standard rather than chase it.

The next frontier is not more text. It is more traceability.

The field should measure AI not only by what it generates, but also by what it validates. Both contributions matter. The second is the one finally getting attention.
