For about thirty years the security community has relied on a well-understood approach for handling dangerous findings. Coordinated vulnerability disclosure is standard practice for a reason: it can resolve hazard disclosure problems with transparency and technical rigor. A security researcher finds a flaw and reports it privately to the vendor, the vendor ships a fix, deployers update, and only once that window has closed do the details go public. The approach works because of one critical assumption: the affected system can be repaired, and repairing it ends the hazard.
That assumption does not survive contact with AI systems. We recognized this early, in the course of building our safety and jailbreak benchmarks, and it has since become one of the defining governance problems in evaluating frontier models. The findings an evaluation produces are valuable precisely because they describe how a system behaves under pressure. The fundamental challenge is that the value of those findings does not respect the boundary between defenders and adversaries. At MLCommons, we are addressing this challenge in two ways: designing our own disclosure practice for the benchmarks we run, and helping write the standard that supports the whole field of AI evaluation.
Why coordinated disclosure breaks down for AI
Three properties of AI evaluation undermine the traditional model of responsible disclosure.
The findings are dual-use by nature. A result that tells defenders, regulators, and users how a system behaves tells adversaries the same thing. It effectively shines a spotlight on which systems, which categories of input, and which failure modes are worth their effort. The risk isn’t usually that a finding exposes an otherwise secret capability; it’s that it lowers the cost of locating one. We are describing uplift – the reduction in effort, time, expertise, or resources an actor needs to accomplish a task. Uplift is most of what makes AI valuable for legitimate users, which is exactly why it is dangerous to hand to the wrong ones. There is also a subtler trap. If results are published by default and one category is quietly left out, the omission itself becomes a signal. The structure of a disclosure carries information independent of its content.
Telling the developer too much corrupts the test. A benchmark meant to be run more than once faces a tension that one-off vulnerability reports have not had to face. The developer of a system under test needs enough feedback to improve the general property being measured, but not so much that they can target the specific items on the test. Hand over the exact prompts, and you get a model that scores better on the test without improving in practice. Over enough cycles, the benchmark score stops tracking the thing it was built to measure. The discipline is to communicate the general case, never the instances, and never to accept self-attestation alone as proof that something was fixed. This is a general challenge to the legitimacy and reliability of benchmark evaluations in AI, which MLCommons is addressing in multiple ways, for example with our continuous prompt stewardship work.
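To make "the general case, never the instances" concrete, here is a minimal sketch in Python of developer feedback that reports per-category failure rates while keeping the test items themselves private. The names `ItemResult` and `developer_feedback` and the category labels are hypothetical, chosen for illustration; this is not a description of the actual MLCommons tooling or policy.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemResult:
    """One test item's outcome (hypothetical structure)."""
    prompt: str      # held-out test item; must never leave the evaluator
    category: str    # e.g. a hazard or attack category
    failed: bool     # True if the system under test failed on this item


def developer_feedback(results):
    """Summarize results per category without exposing any prompt text."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        totals[r.category] += 1
        failures[r.category] += int(r.failed)
    # Only counts and rates are returned; prompts are deliberately dropped.
    return {
        cat: {"items": totals[cat], "failure_rate": failures[cat] / totals[cat]}
        for cat in sorted(totals)
    }
```

A developer receiving this summary learns where the system is weak in general, but has no specific item to train against. Verifying a claimed fix would still mean re-running the evaluation rather than taking the developer's word for it.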
You can’t patch a released, open-weight model. This is the property that breaks the central assumption of the prior model of responsible disclosure. Open-source software can be patched in place; a deployer updates and the hazard closes. An open-weight model cannot. A new version is a new artifact, not an update. Every copy of the prior weights stays operational, unmodified, and in the hands of anyone who retained them, indefinitely. A hazard identified in such a system persists in deployment even after a successor ships, and no defender is positioned to remediate it. If a CBRNE hazard is found in a previously released model, for example, that hazard persists for as long as copies of those weights remain in circulation. Findings, therefore, have to be pinned to specific versions, and in the most sensitive categories, results may need to be aggregated or uniformly withheld across systems, because granular per-model disclosure in those categories functions less like a public-interest report and more like a targeting map for systems nobody can fix.
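The reporting rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the policy itself: the `Finding` structure, the `public_report` function, and the contents of `SENSITIVE_CATEGORIES` are all hypothetical. Every finding is pinned to an exact model version, and sensitive categories are aggregated for every system regardless of its result, so that no individual omission can act as a signal.

```python
from dataclasses import dataclass

# Hypothetical set of categories too sensitive for per-model disclosure.
SENSITIVE_CATEGORIES = frozenset({"cbrne"})


@dataclass(frozen=True)
class Finding:
    model: str
    version: str         # findings are pinned to an exact version
    category: str
    failure_rate: float


def public_report(findings, sensitive=SENSITIVE_CATEGORIES):
    """Split findings into per-model results and cross-system aggregates."""
    per_model = []
    sensitive_rates = {}
    for f in findings:
        if f.category in sensitive:
            # Applied uniformly to every system, whatever its score.
            sensitive_rates.setdefault(f.category, []).append(f.failure_rate)
        else:
            per_model.append({
                "model": f.model,
                "version": f.version,
                "category": f.category,
                "failure_rate": f.failure_rate,
            })
    aggregated = {
        cat: {"systems": len(rates), "mean_failure_rate": sum(rates) / len(rates)}
        for cat, rates in sorted(sensitive_rates.items())
    }
    return {"per_model": per_model, "aggregated": aggregated}
```

The aggregate carries no model names or versions, so it describes the state of the field without pointing at any one system. An aggregate over very few systems can still leak information, so a real policy would likely also need a minimum group size or a rule for withholding the category entirely.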
From principle to standard
We are acting to codify a defensible response to this challenge. We have brought our approach and practices to ISO/IEC JTC 1/SC 42, the international body responsible for AI standards, for review and discussion, and we are contributing these lessons and the corresponding responsible-disclosure principles to the work on ISO/IEC TS 42119-8. The aim is a real, citable standard that any evaluator (first-party, second-party, or independent) can build on, rather than a patchwork of one-off policies. Coordination on this issue will be critical to ensuring that the most hazardous findings are addressed by good actors without being broadcast to bad ones.
Disclosure norms only work if they are shared, and shared norms come from standards bodies, not from any single lab or benchmark operator acting alone.
What this means for our jailbreak benchmark
When our jailbreak benchmark launches, it will ship with a documented responsible-disclosure policy built around these three considerations — protecting the public from harmful uplift, protecting the integrity of the evaluation over repeated runs, and protecting against hazards in systems that cannot be centrally remediated. That policy is deliberately aligned, in advance, with the standard taking shape in SC 42. We would rather launch already pointed in the direction the field is heading than retrofit a practice once the standard lands.
The patch model gave the software security world a common language for three decades. AI evaluation needs its own. We recently launched an agentic-focused security working group to tackle critical challenges in AI security, such as this one, in the coming years. We welcome you to join us on that journey by signing up to be an MLCommons member.