🔴 Breaking · Technology

87% of AI Agent Pull Requests Introduce Security Flaws, New Report Finds

DryRun Security's Agentic Coding Security Report tested Claude, Codex, and Gemini building real applications from scratch, and found that autonomous AI agents are effectively "security blind" by default, with 26 out of 30 pull requests introducing at least one high-severity vulnerability.

March 12, 2026 · 📖 6 min read

AUSTIN, TX – A new report from DryRun Security has delivered a sharp warning to the engineering teams racing to adopt AI-native development workflows: autonomous AI agents build fast, but they build dangerously.

The Agentic Coding Security Report, released on March 11, 2026, evaluated three leading AI coding agents (Anthropic's Claude, tested on Sonnet 4.5/4.6; OpenAI's Codex, on GPT-5.2; and Google's Gemini) on two realistic application-building tasks. The result: 87% of the pull requests submitted by the agents (26 out of 30) introduced at least one high-severity security vulnerability.

⚡
87% of AI-generated pull requests in the study contained at least one high-severity security flaw. The agents were not writing toy code: they were building full applications under conditions designed to mirror real-world engineering workflows.

How the Study Was Conducted

DryRun had each agent build two complete, functional applications through sequential feature requests, mimicking the way a developer might prompt an autonomous agent over a multi-day sprint:

  • A health-tracking web application: user authentication, data storage, API endpoints
  • A browser-based game: client-side logic, state management, user interaction

Each feature request generated a pull request. DryRun's security tooling then analyzed every PR for high-severity vulnerabilities: SQL injection, insecure authentication, exposed credentials, XSS vectors, and similar OWASP Top 10-class issues. The results were tracked cumulatively across the full application lifecycle, not just per-PR in isolation.
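For readers unfamiliar with the vulnerability classes DryRun scanned for, here is a minimal illustrative sketch of the most famous one, SQL injection. This example is ours, not code from the report: it contrasts the string-interpolated query pattern that scanners flag with the parameterized form that defeats it.

```python
import sqlite3

# In-memory database standing in for an app's user store (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

def find_user_vulnerable(name: str):
    # The shortcut pattern: attacker-controlled input is spliced into the SQL.
    return conn.execute(
        f"SELECT name FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(name: str):
    # Parameterized query: the driver treats the input as data, not SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "' OR '1'='1"                 # classic injection payload
print(find_user_vulnerable(payload))    # matches every row
print(find_user_safe(payload))          # matches nothing
```

The vulnerable version turns the WHERE clause into a tautology and dumps the table; the safe version returns an empty result for the same input.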

Key Findings

| Finding | Detail |
| --- | --- |
| Overall failure rate | 87% (26 of 30 pull requests contained at least one high-severity flaw) |
| Agents tested | Anthropic Claude (Sonnet 4.5/4.6), OpenAI Codex (GPT-5.2), Google Gemini |
| Applications built | Health-tracking web app + browser-based game (sequential feature requests) |
| Cumulative risk pattern | Flaws accumulated: agents rarely reviewed prior logic, so early mistakes compounded |
| Worst for unresolved flaws | Claude (Sonnet 4.5/4.6): highest number of unresolved high-severity issues in final codebases |
| Best overall | OpenAI Codex (GPT-5.2): fewest total vulnerabilities; best self-remediation when prompted |
| Primary vulnerability classes | SQL injection, insecure auth, exposed credentials, XSS, insecure direct object references |

The Cumulative Risk: How Small Mistakes Become Big Vulnerabilities

The most significant structural finding in the report is not any single flaw; it is the pattern of accumulation. Because agentic coding systems generate each new feature in the context of a prompt, not a security audit, they rarely look backward to evaluate whether a new change interacts dangerously with existing logic.

A minor authentication shortcut introduced in PR #3 may be harmless in isolation. But by PR #12, when the application has a user database, a payment layer, and an external API, that same shortcut has become an attack vector with a significantly larger blast radius. DryRun's methodology was specifically designed to capture this compounding effect, and all three agents exhibited it.
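The compounding dynamic is easy to see in miniature. The sketch below is hypothetical (not taken from the report's codebases): an early endpoint fetches any record by id with no ownership check, an insecure direct object reference. It is harmless while one user owns everything, but each later feature quietly widens what the unchanged shortcut can leak.

```python
# Hypothetical sketch of a compounding auth shortcut (illustrative only).
RECORDS = {
    1: {"owner": "alice", "kind": "health", "data": "steps: 9000"},
}

def get_record(record_id: int):
    # Early-PR shortcut: trusts the caller-supplied id completely,
    # with no check that the caller owns the record.
    return RECORDS[record_id]

# Later features add more sensitive data without anyone revisiting
# get_record -- the shortcut's blast radius grows silently.
RECORDS[2] = {"owner": "bob", "kind": "payment", "data": "card on file"}

# Now any authenticated user can read bob's payment record by guessing an id.
leaked = get_record(2)

def get_record_checked(record_id: int, current_user: str):
    # The fix the agents rarely made unprompted: verify ownership.
    record = RECORDS[record_id]
    if record["owner"] != current_user:
        raise PermissionError("not your record")
    return record
```

The fix is a one-line ownership check, but it only happens if something in the workflow forces a look back at code written several pull requests earlier.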

⚡
The cumulative risk pattern is the most practically dangerous finding for enterprise teams. A single-PR review process will not surface it. Security evaluation of AI-generated code requires full-codebase static analysis at regular intervals, not just PR-level diff review.

The Claude Paradox: Most "Thoughtful," Most Vulnerable

The report's most counterintuitive finding involves Anthropic's Claude (tested on Sonnet 4.5 and 4.6). Claude is widely regarded in developer communities as the most "thoughtful" coding model, praised for verbose reasoning, nuanced explanations, and careful handling of edge cases in natural language. In DryRun's security-specific evaluation, that reputation did not carry over.

Claude produced the highest number of unresolved high-severity flaws in the final codebases across both applications. DryRun researchers note that Claude's tendency toward verbose code generation (producing more lines per feature than Codex) may actually increase the vulnerability surface, creating more places for insecure patterns to hide. Claude's recent revenue milestones and enterprise adoption (covered in our Anthropic $5B revenue analysis) make this finding particularly notable for enterprise security teams currently deploying Claude as a coding agent.

The Codex Edge: Fewer Flaws, Better Self-Remediation

OpenAI's Codex (powered by GPT-5.2) finished the study with the fewest total vulnerabilities across both applications. But the more significant data point is behavioral: when DryRun researchers explicitly prompted Codex to review its own prior code for security issues, it demonstrated a meaningfully better ability to identify and remediate its own mistakes compared to Claude and Gemini.

This "prompted self-remediation" capability suggests a viable mitigation strategy for engineering teams: integrating a dedicated security-review step into the agentic workflow, explicitly asking the agent to audit each completed module before the next feature build begins. It does not eliminate the underlying vulnerability rate, but it reduces the accumulation problem.
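In workflow terms, the mitigation amounts to one extra pass per feature. The sketch below is our own illustration of that control flow: `build_feature`, `audit_module`, and `remediate` are stubs standing in for whatever agent API a team actually uses, so only the loop structure, not the stub behavior, is the point.

```python
# Hypothetical agentic-sprint loop with an explicit security-review step.
# All three helpers are stubs so the control flow is runnable as-is.
def build_feature(name: str) -> str:
    return f"code for {name}"                 # stub: agent writes the feature

def audit_module(code: str) -> list[str]:
    # stub: agent reviews its own prior code; a real audit would come
    # from prompting the model (or running SAST) against the module.
    return ["hardcoded credential"] if "login" in code else []

def remediate(code: str, findings: list[str]) -> str:
    return code + f"  # fixed: {', '.join(findings)}"   # stub

def agentic_sprint(features: list[str]) -> list[str]:
    modules = []
    for feature in features:
        code = build_feature(feature)
        findings = audit_module(code)         # explicit security-review step
        if findings:
            code = remediate(code, findings)  # fix before moving on
        modules.append(code)                  # only audited code accumulates
    return modules

modules = agentic_sprint(["login", "dashboard"])
```

The design choice is that the audit gates each module before the next feature begins, which is what prevents an early flaw from compounding across later pull requests.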

💬
The finding is not that AI agents cannot produce secure code; it is that they will not produce it by default. The security review pass has to be deliberately engineered into the workflow. It will not happen automatically.

What Engineering Teams Should Do Right Now

| Recommendation | Why It Matters |
| --- | --- |
| Run full-codebase SAST at regular intervals | PR-level review misses cumulative risk: vulnerabilities compound across the full commit history |
| Add an explicit security-review prompt step | Codex demonstrated meaningfully better self-remediation when explicitly asked to audit; apply this pattern to all agents |
| Treat AI PRs as untrusted input | Apply the same scrutiny you would to a contractor's first commit; agents have no security training signal by default |
| Audit authentication and data-access logic specifically | DryRun found insecure auth and data exposure were the most common high-severity classes across all three agents |
| Track vulnerability density over time | Monitor whether your codebase is accumulating flaws sprint-over-sprint, not just whether individual PRs pass review |

The Bigger Picture

The DryRun report lands at a moment when the industry is actively debating whether agentic coding represents a genuine productivity leap or a technical debt time bomb. The answer, based on this data, is: both, simultaneously. The speed gains are real. The security blindness is equally real.

For enterprises currently evaluating Claude, Codex, or Gemini for autonomous coding tasks, the practical implication is straightforward: AI agent output requires a dedicated security layer that is designed into the workflow from day one, not bolted on after the codebase has grown.

📊
26 out of 30 AI-generated pull requests introduced a high-severity security flaw. Source: DryRun Security, Agentic Coding Security Report, March 11, 2026.

Tags

#AI Security · #DryRun Security · #Agentic AI · #Claude · #Codex · #Gemini · #Security Vulnerabilities · #Developer Tools · #Software Engineering · #AI Coding

Written by

Jack Wang

Technology Desk

Part of ObjectWire coverage