Ask an AI coding agent to build a modal dialog. A good one will produce a <div> with role="dialog", add aria-modal="true", wire up Escape to close, and give it an accessible name. It will look right. It might even pass a linter.
But open that dialog and press Tab. Does focus stay inside it? Can a screen reader user reach the page content behind the overlay? When the dialog closes, does focus return to the element that triggered it? In most cases, the answer to at least one of these is no.
This isn't a guess. Microsoft recently published a11y-llm-eval, a benchmark that tests how well LLMs generate accessible HTML. It renders LLM output in a real browser via Playwright, runs axe-core, and executes hand-written assertions for each test case. For modal dialogs specifically, the assertion "each modal dialog hides content behind it while open" failed between 68% and 92% of the time across all models tested. Focus management on open failed 30-65% of the time. Focus restoration on close, 20-60%.
The model knows about modals. It knows the ARIA roles, the attributes, even some of the keyboard interactions. What it doesn't know is how to implement a modal - the sequence of DOM operations, focus management, and state synchronization that makes the pattern actually work for someone navigating with a keyboard or screen reader.
This gap between knowing about accessibility and implementing it correctly is the central problem. And the solution isn't better models, or at least not only better models. It's better harnesses.
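To make the three dialog behaviours above concrete (hide the page behind the dialog, keep Tab inside it, return focus on close), here is a minimal sketch. It assumes simplified stand-in types so it runs without a browser: ModalSketch, FocusableLike, and InertableLike are hypothetical names, not part of any library. In a real page the objects would be HTMLElement instances, whose inert property is what the sketch toggles.

```typescript
// Stand-ins for HTMLElement so the sketch runs outside a browser (assumption).
interface FocusableLike { focus(): void }
interface InertableLike { inert: boolean }

class ModalSketch {
  private trigger: FocusableLike | null = null;
  private focusables: FocusableLike[]; // tabbable elements inside the dialog, in DOM order
  private background: InertableLike[]; // page content outside the dialog

  constructor(focusables: FocusableLike[], background: InertableLike[]) {
    this.focusables = focusables;
    this.background = background;
  }

  // Open: remember the trigger, hide the page behind the dialog, move focus in.
  show(trigger: FocusableLike): void {
    this.trigger = trigger;
    for (const el of this.background) el.inert = true;
    this.focusables[0]?.focus();
  }

  // Tab / Shift+Tab: wrap at the edges so focus never leaves the dialog.
  // Returns the index of the element that received focus, or -1 if there is none.
  handleTab(currentIndex: number, shiftKey: boolean): number {
    const n = this.focusables.length;
    if (n === 0) return -1;
    const next = (currentIndex + (shiftKey ? -1 : 1) + n) % n;
    this.focusables[next]?.focus();
    return next;
  }

  // Close: restore the page, then return focus to whatever opened the dialog.
  hide(): void {
    for (const el of this.background) el.inert = false;
    this.trigger?.focus();
    this.trigger = null;
  }
}
```

Each method maps to one of the assertions the benchmark reports as frequently failing. Markup that sets role and aria-modal but has no equivalent of show, handleTab, and hide is the "looks right" dialog described at the top of this post.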
What the benchmark reveals
The a11y-llm-eval report tested eight models across 32 prompt cases covering common components: forms, checkboxes, radio groups, disclosure widgets, modal dialogs, and a shopping page. Five samples per test, rendered and evaluated in Chromium. The overall baseline pass rate was 12%.
That number is worth sitting with. Twelve percent. For components that appear on nearly every website. When models are given no specific accessibility instructions, they produce code that fails automated accessibility checks roughly seven out of eight times.
But the benchmark didn't stop at measuring the problem. It tested interventions.
The simplest intervention was a one-line instruction appended to the system prompt: "All output MUST be accessible." That raised the average pass rate to 37%. A more detailed instruction set covering landmarks, keyboard behavior, form labeling, contrast, and focus management pushed it to 60%. And a multi-turn skill, where the agent generates code and then self-reviews against a checklist, reached 96% on the best-performing model.
The progression from 12% to 37% to 60% to 96% is the most important finding in the benchmark. It says: the knowledge is already in the model. What's missing is the structure to activate it at the right time, in the right way.
The pipeline as the unit of intervention
When people talk about making AI-generated code more accessible, the conversation usually centers on the model. Train it on better data. Fine-tune it on accessible examples. Add accessibility to the system prompt. These are useful, but they treat the model as a monolith - a box that takes prompts and produces code, where the only lever is what goes into the box.
Agent harnesses break that monolith into a pipeline. A typical agent workflow has at least three distinct phases, and accessibility failures cluster differently in each.
Before code is written, the agent needs to make design decisions. Which pattern fits the user's intent? Should this be a native HTML element or a custom ARIA widget? When the prompt is ambiguous, should the agent guess or ask? These decisions have enormous downstream consequences for accessibility, and they happen before a single line of markup is emitted.
Pattern selection is a good example. A prompt like "build a sidebar with three sections that users can switch between" could be tabs, could be navigation, could be an accordion. The accessible implementation differs radically for each. The choice depends on context the prompt often doesn't provide: whether content is preloaded or fetched on demand, whether the URL should change, whether the sections represent persistent navigation targets or in-page views. An agent that guesses wrong at this stage will produce a component that might look right but behaves wrong for assistive technology - and the error is much harder to catch downstream than to prevent upstream.
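One way to picture a planning-phase rule for the sidebar prompt is a function that refuses to guess when the deciding context is missing. This is a sketch under stated assumptions: SectionContext, selectPattern, and the multipleOpenAtOnce signal are hypothetical, and the mapping is a simplified heuristic, not an authoritative decision procedure.

```typescript
type PatternChoice = "tabs" | "navigation" | "accordion" | "ask-user";

// Context the prompt often omits; `undefined` means "the prompt did not say".
interface SectionContext {
  changesUrl?: boolean;         // should switching sections update the URL?
  persistentTargets?: boolean;  // are the sections persistent navigation targets?
  multipleOpenAtOnce?: boolean; // hypothetical signal: may several sections be open together?
}

function selectPattern(ctx: SectionContext): PatternChoice {
  // Any positive navigation signal settles the question.
  if (ctx.changesUrl === true || ctx.persistentTargets === true) return "navigation";

  // Otherwise every signal must be known; a missing one means ask, not guess.
  if (
    ctx.changesUrl === undefined ||
    ctx.persistentTargets === undefined ||
    ctx.multipleOpenAtOnce === undefined
  ) {
    return "ask-user";
  }

  // In-page views: accordion if several may be open, tabs if exactly one is shown.
  return ctx.multipleOpenAtOnce ? "accordion" : "tabs";
}
```

The useful part is the fourth return value. A harness that can answer "ask-user" turns a silent wrong guess into a visible question.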
The same applies to the native-versus-custom decision. Models consistently underestimate what native HTML can do. <dialog>, opened with showModal(), handles modals. <details> handles disclosure. <select> handles most selection patterns. Every time an agent builds a custom widget where a native element would have worked, it takes on the full burden of keyboard behavior, focus management, ARIA state synchronization, and cross-browser consistency - and the benchmark data shows it frequently drops at least one of those.
Organizations that use agents for code generation have an advantage the benchmark didn't test: they already have design systems and component libraries. An agent working within an organization's codebase shouldn't be assembling a modal from raw <div> elements and ARIA attributes. It should be reaching for the organization's existing <Modal> component, which has already been built, tested, and validated for accessibility. The harness's job at the planning phase includes constraining the agent to the organization's component inventory and flagging when a prompt would require something outside it.
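The preference order described above (the organization's vetted component first, a native element second, a flag otherwise) can be sketched as a small lookup. The widget labels, resolveWidget, and the inventory shape are illustrative assumptions. Only the three native elements come from the text.

```typescript
type Resolution =
  | { kind: "inventory"; component: string }
  | { kind: "native"; element: string }
  | { kind: "flag"; reason: string };

// Native elements named in the text, keyed by hypothetical widget labels.
const NATIVE_ELEMENTS: Record<string, string> = {
  modal: "dialog",
  disclosure: "details",
  selection: "select",
};

// Own-property lookup so keys like "constructor" never match by accident.
function lookupOwn(table: Record<string, string>, key: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// `inventory` maps a widget label to the organization's component name.
function resolveWidget(widget: string, inventory: Record<string, string>): Resolution {
  const owned = lookupOwn(inventory, widget);
  if (owned !== undefined) return { kind: "inventory", component: owned };

  const native = lookupOwn(NATIVE_ELEMENTS, widget);
  if (native !== undefined) return { kind: "native", element: native };

  return {
    kind: "flag",
    reason: `No vetted component or native element for "${widget}"; needs human review.`,
  };
}
```

A custom ARIA widget is never the default outcome here. It is only reachable through the flag, which is where a person decides whether building one is justified.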
Composition matters too. Even when individual components are accessible, combining them can create barriers. A modal that opens another modal. A disclosure widget nested inside a tab panel inside a dialog. A form that dynamically injects new fields without managing focus. These are UI anti-patterns that an organization can enumerate and the harness can enforce. The agent doesn't need to understand why stacking modals is problematic for screen reader users if the harness simply won't let it do so.
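An enumerated anti-pattern such as "a modal that opens another modal" can be enforced with a walk over the component tree. The sketch below assumes the harness can see generated UI as a simple tree of typed nodes. UiNode, NestingRule, and findNestingViolations are hypothetical names, and the rules only cover ancestor and descendant pairs, not longer chains.

```typescript
interface UiNode { type: string; children?: UiNode[] }

// Each rule forbids `descendant` anywhere beneath `ancestor`.
interface NestingRule { ancestor: string; descendant: string; message: string }

function findNestingViolations(
  root: UiNode,
  rules: NestingRule[],
  ancestors: string[] = [],
): string[] {
  const violations: string[] = [];

  // Does the current node break any rule, given the ancestors above it?
  for (const rule of rules) {
    if (rule.descendant === root.type && ancestors.indexOf(rule.ancestor) !== -1) {
      violations.push(rule.message);
    }
  }

  // Recurse with this node added to the ancestor path.
  const path = ancestors.concat(root.type);
  for (const child of root.children ?? []) {
    violations.push(...findNestingViolations(child, rules, path));
  }
  return violations;
}
```

The rule list is where the organization's knowledge lives. The agent never needs to be told why a nesting is forbidden, only that the plan was rejected.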
During code generation, the agent is choosing markup, attributes, styles, and scripted behavior. This is where instructions and skills have their greatest effect. The benchmark's "Basic" instruction set, the detailed one that reached 60%, is essentially a condensed accessibility checklist: use landmarks, associate labels, manage focus, check contrast. The fact that it raises pass rates by nearly 50 percentage points (from 12% to 60%) tells us that models already know this information - they just don't reliably apply it without being reminded.
Skills go further. A multi-turn skill can separate generation from verification: first produce the code, then review it against specific criteria. The benchmark's "Building Accessible UI" skill achieved a 96% pass rate on its best model, 84 percentage points above the 12% overall baseline. That's not incremental. That's a qualitative change in output quality, achieved entirely through harness engineering.
After code is generated, the output can be verified. The benchmark itself demonstrates one form of verification: render in a browser, run axe-core, execute custom assertions. An agent harness can internalize this loop - generate, test, fix, re-test - before presenting the output to the user. The multi-turn skill's results suggest this self-correction loop is where the largest gains come from.
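The generate, test, fix, re-test loop can be sketched with the model call and the browser check passed in as plain functions. Everything here is an assumption for illustration: generateAndRepair, CheckResult, and LoopOutcome are made-up names. A real harness would await a model API and a browser run (for example Playwright plus axe-core) where this sketch calls synchronous stand-ins.

```typescript
interface CheckResult { failures: string[] }
interface LoopOutcome { code: string; rounds: number; remaining: string[] }

function generateAndRepair(
  generate: (prompt: string, feedback: string[]) => string, // stand-in for the model call
  verify: (code: string) => CheckResult,                    // stand-in for render + checks
  prompt: string,
  maxRounds = 3,
): LoopOutcome {
  const limit = Math.max(1, maxRounds); // always attempt at least one round
  let feedback: string[] = [];
  let code = "";

  for (let round = 1; round <= limit; round++) {
    // The previous round's failures are fed back into generation.
    code = generate(prompt, feedback);
    feedback = verify(code).failures;
    if (feedback.length === 0) return { code, rounds: round, remaining: [] };
  }

  // Out of rounds: return the last attempt along with what still fails,
  // so the caller can surface it instead of presenting the code as clean.
  return { code, rounds: limit, remaining: feedback };
}
```

The remaining field is the design choice that matters. When the loop gives up, the harness still knows exactly which checks failed and can say so.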
What works, what doesn't, and why
The benchmark data tells a clear story about the relationship between intervention specificity and effectiveness.
Vague instructions help, but not enough. "All output MUST be accessible" is the accessibility equivalent of "write good code." The model already intends to write good code. The instruction doesn't tell it anything it can act on. It raised pass rates by 25 percentage points, which sounds meaningful until you notice where that leaves them: at 37%, the output still failed more often than it passed.
Specific, structured instructions work much better. The "Basic" instruction set is roughly 800 words of concrete guidance: use <label for> on form fields, put one <main> per page, don't skip heading levels, trap focus in dialogs, restore it on close. Each instruction maps to a specific thing the model should do or avoid. This raised pass rates to 60%.
Multi-turn skills work best. Generate, then review. The review step catches errors the generation step missed - not because the model couldn't have gotten it right the first time, but because generation and verification are different cognitive tasks that benefit from separation. This is true for human developers too. Code review catches things the author missed, not because the author is incompetent, but because writing code and evaluating code activate different kinds of attention.
But even the best results come with a caveat the benchmark is careful to state: 100% on their automated checks does not mean WCAG conformance. And full WCAG conformance doesn't necessarily mean a good experience. Automated testing catches a subset of accessibility issues. Color contrast, missing labels, absent ARIA attributes - these are machine-verifiable. But whether alt text is meaningful, whether focus order is logical, whether a live region announcement is useful, whether intended user flows are coherent and friendly - these require human judgment. The harness can catch the modal that doesn't trap focus. It cannot catch the modal whose content is confusing when encountered without visual context.
This is the ceiling of automated verification, and it matters for harness design. The harness can be very good at the things machines can check. But it needs a strategy for the things machines can't.
The harness as a feedback loop
This is where the conversation about harnesses usually stops: build instructions, add checks, measure results. But that framing treats the harness as a static artifact - a test suite you write once and maintain. The more useful framing is the harness as a feedback loop, a system that encodes human knowledge about accessibility and updates that encoding as new knowledge arrives.
Every instruction in a harness is an encoding of something someone learned about how code should work for people with disabilities. "Trap focus in modals" is an encoding of the experience that screen reader and keyboard users get lost when focus escapes an overlay. "Use <fieldset> and <legend> for related radio buttons" is an encoding of the experience that ungrouped radio buttons are announced without context. These encodings didn't come from specifications. They came from people using assistive technology, encountering barriers, and reporting what went wrong.
The question isn't whether to encode this knowledge. The benchmark proves that encoding it works. The question is how the encoding stays alive - how new findings enter the system, how existing encodings get validated or revised, and who does the translation between "a user struggled here" and "the agent should behave differently."
That translation is harder than it looks. User research might reveal that a particular disclosure pattern confuses screen reader users because collapsed content remains in the accessibility tree even when visually hidden. The benchmark actually caught a version of this: the "collapsed content is hidden from everyone" assertion failed 44-84% of the time across disclosure widget variants. But someone had to write that assertion. Someone had to know that this specific failure mode exists, formulate it as a testable condition, and implement it in a way the harness can execute.
That "someone" occupies a role that doesn't have a clean name yet. It requires accessibility domain expertise (knowing what fails and why), testing expertise (knowing how to express a failure mode as an assertion), and enough understanding of the agent pipeline to know where the intervention belongs - is this a planning-phase constraint, a generation-phase instruction, or a verification-phase check? Different failure modes belong at different points, and placing them wrong reduces their effectiveness.
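To show what "formulate it as a testable condition" can involve, here is a sketch of the collapsed-content check described above. It is not the benchmark's implementation. PanelSnapshot is a hypothetical, simplified record of a panel's computed styles, and the list of hiding techniques is not exhaustive. It relies on one well-established fact: display: none and visibility: hidden remove content from rendering, from the tab order, and from the accessibility tree, while zero height or zero opacity only hide it visually.

```typescript
// Simplified snapshot of a collapsed panel's computed styles (hypothetical shape).
interface PanelSnapshot {
  display: string;    // computed `display`
  visibility: string; // computed `visibility`
  heightPx: number;   // rendered height in pixels
  opacity: number;    // computed `opacity`
}

// Hidden for sighted users, keyboard users, and assistive technology alike.
function isHiddenFromEveryone(p: PanelSnapshot): boolean {
  return p.display === "none" || p.visibility === "hidden";
}

// Appears collapsed on screen, by any technique.
function looksCollapsed(p: PanelSnapshot): boolean {
  return isHiddenFromEveryone(p) || p.heightPx === 0 || p.opacity === 0;
}

// The failure mode: looks collapsed, yet still exposed to screen readers.
function collapsedButExposed(p: PanelSnapshot): boolean {
  return looksCollapsed(p) && !isHiddenFromEveryone(p);
}
```

Writing the check takes three short functions. Knowing that the check is needed, and which styles belong in each function, is the domain expertise.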
How findings enter the system. A usability study reveals a pattern failure. An assistive technology user reports a barrier. An audit uncovers a systematic issue with generated output. The path from any of these to a harness update currently relies on someone manually recognizing the finding, generalizing it into a rule, and placing it in the pipeline. That's expensive and slow. The question for harness designers is whether parts of this path can be shortened - not automated away (the judgment is essential) but structured so that the translation from finding to encoding is a defined workflow rather than an ad hoc effort.
How encodings get validated. An instruction that improves pass rates on one component might degrade them on another. The benchmark data hints at this: the "Minimal" instruction set actually increased certain ARIA violation rates (like aria-allowed-attr) compared to the control. When you tell a model to be more accessible, it sometimes overcompensates - adding ARIA attributes that aren't valid for the element's role, or wrapping things in landmarks that don't need them. This means harness updates need regression testing: not just new assertions, but verification that new instructions don't introduce new failure modes.
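A regression check of this kind can be as simple as comparing per-rule violation rates between a control run and a run with the candidate instructions. The sketch assumes rates are stored as percentages keyed by rule id. findRegressions, rateOf, and the two-point default tolerance are illustrative choices, not values from the benchmark.

```typescript
// Violation rate per rule id, in percent (0 to 100).
type RuleRates = Record<string, number>;

// Own-property lookup; a rule absent from a run counts as 0%.
function rateOf(table: RuleRates, rule: string): number {
  return Object.prototype.hasOwnProperty.call(table, rule) ? table[rule] ?? 0 : 0;
}

// Returns the rule ids whose violation rate rose by more than `tolerancePoints`
// when moving from the control run to the candidate instruction set.
function findRegressions(
  control: RuleRates,
  candidate: RuleRates,
  tolerancePoints = 2,
): string[] {
  return Object.keys(candidate)
    .filter((rule) => rateOf(candidate, rule) - rateOf(control, rule) > tolerancePoints)
    .sort();
}
```

Run against every component in the suite, not only the one the new instruction targets, this is what turns "we added a rule" into "we added a rule and nothing else got worse".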
How the loop stays current. Models change. The components people build with agents evolve. Assistive technology behavior shifts between versions. A harness encoding that was accurate six months ago might be stale today. This isn't hypothetical - browser implementations of accessibility APIs diverge from specs regularly, as anyone who's tracked <dialog> support across screen readers can attest. The feedback loop needs a cadence, a rhythm of re-evaluation that keeps encodings aligned with reality. What that cadence looks like in practice - and who drives it - is an open design problem.
The questions worth asking now
The field is moving fast enough that prescriptive answers would be stale before this post is published. But some questions are stable, and getting them right will matter more than any specific technical choice.
How should agents handle ambiguity in pattern selection? When a prompt could map to multiple accessible patterns, the agent faces a choice: guess conservatively and risk the wrong pattern, or ask the user and risk friction. The answer probably depends on the stakes - a disclosure widget has fewer accessibility consequences from a wrong guess than a navigation structure does. But encoding that risk assessment into the harness is itself a design challenge. It requires knowing which pattern choices have high accessibility stakes, which is domain knowledge that most agent builders don't have.
What's the right granularity for accessibility instructions? The benchmark showed that more specific instructions work better than vague ones. But there's a tradeoff: longer, more detailed instructions consume context window, potentially displacing other useful information. There's probably a sweet spot between "be accessible" (too vague) and a 2,000-word checklist (too heavy), and it might vary by component type. Simple components might need a nudge; complex interactive widgets might need a detailed protocol. Knowing which is which, again, requires domain knowledge.
How do you close the loop between user research and the harness? This is the core engineering question. User research with assistive technology users is the most reliable source of truth about what actually fails in practice. But research findings are typically expressed in human terms - narratives, observations, severity ratings - not in the formalized assertions a harness can execute. The translation layer between research output and harness input is where the most leverage lives, and it's currently artisanal. Can it be made more systematic? What would a structured format for "accessibility finding → harness encoding" look like?
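As a starting point for discussion, not a proposal for a standard, one possible shape for that structured format is sketched below. Every name and field is hypothetical, and the placement rule in placeFinding is a deliberately crude heuristic for which of the three phases should receive the encoding.

```typescript
type HarnessPhase = "planning" | "generation" | "verification";

// A research finding, kept close to how it was originally reported.
interface A11yFinding {
  id: string;
  source: "usability-study" | "user-report" | "audit";
  observation: string;           // what the person experienced, in their terms
  component: string;             // e.g. "disclosure", "modal"
  preventableByPattern: boolean; // would a different pattern choice avoid it entirely?
  machineCheckable: boolean;     // can an automated assertion detect it?
}

// What the harness actually consumes.
interface HarnessEncoding { findingId: string; phase: HarnessPhase; text: string }

// Crude placement heuristic: prevent upstream if possible, otherwise check
// downstream if a machine can, otherwise remind the model during generation.
function placeFinding(f: A11yFinding): HarnessPhase {
  if (f.preventableByPattern) return "planning";
  if (f.machineCheckable) return "verification";
  return "generation";
}

function encodeFinding(f: A11yFinding, text: string): HarnessEncoding {
  return { findingId: f.id, phase: placeFinding(f), text };
}
```

Keeping findingId on the encoding preserves the link back to the original observation, so an instruction can be revisited when the research behind it is revised.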
Who owns the accessibility of agent-generated code? This question sounds abstract but has practical consequences for harness design. If the agent builder is responsible, the harness needs to be comprehensive and opinionated. If the user is responsible, the harness needs to surface decisions rather than make them - flagging tradeoffs, asking clarifying questions, exposing what was checked and what wasn't. If it's shared, the harness needs to clearly delineate what it guarantees and what it doesn't. The benchmark's careful disclaimer - that 100% on their checks doesn't mean WCAG conformance - is an example of that delineation done well.
What does the harness owe its users about its own limitations? An agent that produces code passing all automated checks might create a false sense of security. The user might reasonably assume the output is accessible because the agent didn't flag any issues. But the harness knows what it checked and what it didn't. Should it say so? Should the agent append a note: "This code passes automated accessibility checks for keyboard focus, ARIA roles, and color contrast. It has not been evaluated for meaningful alt text, logical reading order, or cognitive accessibility"? Transparency about the verification boundary is itself an accessibility practice - it tells downstream humans where their attention is still needed.
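If the harness tracks what it checked, producing a note like the one quoted above is mostly string assembly. A minimal sketch, with verificationNote and joinList as made-up helper names:

```typescript
// Joins items as "a", "a and b", or "a, b, and c" using the given conjunction.
function joinList(items: string[], conjunction: string): string {
  if (items.length <= 1) return items.join("");
  if (items.length === 2) return items.join(` ${conjunction} `);
  const head = items.slice(0, -1).join(", ");
  return `${head}, ${conjunction} ${items[items.length - 1]}`;
}

// Builds a plain-language statement of the verification boundary.
function verificationNote(checked: string[], notChecked: string[]): string {
  const passed =
    checked.length > 0
      ? `This code passes automated accessibility checks for ${joinList(checked, "and")}.`
      : "No automated accessibility checks were run on this code.";
  const gaps =
    notChecked.length > 0
      ? ` It has not been evaluated for ${joinList(notChecked, "or")}.`
      : "";
  return passed + gaps;
}
```

The hard part is not the formatting. It is maintaining an honest notChecked list, which again depends on someone knowing what automated checks cannot see.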
Where this is headed
The a11y-llm-eval benchmark is a snapshot. It tests a specific set of components, with a specific set of assertions, against models that will be updated by the time you read this. The specific numbers will change. The structural insight won't.
The insight is that making AI-generated code accessible is not primarily a model problem. It is a pipeline problem. The knowledge is in the models. What's missing is the engineering to activate that knowledge at the right phase, validate it against real-world outcomes, and keep it current as both the technology and the understanding of accessibility evolve.
The harness is where that engineering lives. Not as a test suite. As a feedback loop - one that encodes what people with disabilities experience, translates it into machine-actionable constraints, verifies that those constraints actually improve outcomes, and revises itself when they don't.
The models will get better. The harnesses need to get better faster.