AUSTIN, TX – A new report from DryRun Security has delivered a sharp warning to the engineering teams racing to adopt AI-native development workflows: autonomous AI agents build fast, but they build dangerously.
The Agentic Coding Security Report, released on March 11, 2026, evaluated three leading AI coding agents – Anthropic's Claude (Sonnet 4.5/4.6), OpenAI's Codex (GPT-5.2), and Google's Gemini – on two realistic application-building tasks. The result: 87% of the pull requests submitted by the agents (26 out of 30) introduced at least one high-severity security vulnerability.
How the Study Was Conducted
DryRun had each agent build two complete, functional applications through sequential feature requests, mimicking the way a developer might prompt an autonomous agent over a multi-day sprint:
- A health-tracking web application: user authentication, data storage, API endpoints
- A browser-based game: client-side logic, state management, user interaction
Each feature request generated a pull request. DryRun's security tooling then analyzed every PR for high-severity vulnerabilities: SQL injection, insecure authentication, exposed credentials, XSS vectors, and similar OWASP Top 10-class issues. The results were tracked cumulatively across the full application lifecycle, not just per-PR in isolation.
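To make the first of those vulnerability classes concrete, here is a minimal, hypothetical illustration of the SQL injection pattern such tooling flags (the table schema and payload are invented for this sketch, not taken from DryRun's test applications):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

def find_user_unsafe(name):
    # Vulnerable: string interpolation lets crafted input rewrite the query.
    return conn.execute(f"SELECT id FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name):
    # Safe: a parameterized query treats the input strictly as data.
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
print(find_user_unsafe(payload))  # [(1,)] -- the injected clause matches every row
print(find_user_safe(payload))    # [] -- no user is literally named "' OR '1'='1"
```

The unsafe variant is exactly the kind of code that reads fine in a quick PR review but fails a security scan.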
Key Findings
| Finding | Detail |
|---|---|
| Overall failure rate | 87% (26 of 30 pull requests contained at least one high-severity flaw) |
| Agents tested | Anthropic Claude (Sonnet 4.5/4.6), OpenAI Codex (GPT-5.2), Google Gemini |
| Applications built | Health-tracking web app + browser-based game (sequential feature requests) |
| Cumulative risk pattern | Flaws accumulated: agents rarely reviewed prior logic, so early mistakes compounded |
| Worst for unresolved flaws | Claude (Sonnet 4.5/4.6): highest number of unresolved high-severity issues in final codebases |
| Best overall | OpenAI Codex (GPT-5.2): fewest total vulnerabilities; best self-remediation when prompted |
| Primary vulnerability classes | SQL injection, insecure auth, exposed credentials, XSS, insecure direct object references |
The Cumulative Risk: How Small Mistakes Become Big Vulnerabilities
The most significant structural finding in the report is not any single flaw; it is the pattern of accumulation. Because agentic coding systems generate each new feature in the context of a prompt, not a security audit, they rarely look backwards to evaluate whether a new change interacts dangerously with existing logic.
A minor authentication shortcut introduced in PR #3 may be harmless in isolation. But by PR #12, when the application has a user database, a payment layer, and an external API, that same shortcut has become an attack vector with a significantly larger blast radius. DryRun's methodology was specifically designed to capture this compounding effect, and all three agents exhibited it.
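A hypothetical sketch of that compounding pattern, using an insecure-direct-object-reference shortcut (one of the classes DryRun flagged; the data and function names here are invented for illustration):

```python
# Invented records standing in for data the app accumulates over many PRs.
RECORDS = {"u1": {"owner": "alice", "note": "blood pressure log"},
           "u2": {"owner": "bob", "note": "payment token"}}

def get_record_v1(record_id):
    # The early "shortcut": fetch by ID with no ownership check.
    # Harmless when the app has one user and no sensitive data.
    return RECORDS[record_id]

def get_record_v2(record_id, session_user):
    # The guard a later security review should add: verify the
    # authenticated user actually owns the record (IDOR protection).
    record = RECORDS[record_id]
    if record["owner"] != session_user:
        raise PermissionError("not your record")
    return record
```

With `get_record_v1`, any logged-in user who can guess an ID can read another user's payment token; the flaw was present from the start, but only the later features made it exploitable.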
The Claude Paradox: Most "Thoughtful," Most Vulnerable
The report's most counterintuitive finding involves Anthropic's Claude (tested on Sonnet 4.5 and 4.6). Claude is widely regarded in developer communities as the most "thoughtful" coding model โ praised for verbose reasoning, nuanced explanations, and careful handling of edge cases in natural language. In DryRun's security-specific evaluation, that reputation did not carry over.
Claude produced the highest number of unresolved high-severity flaws in the final codebases across both applications. DryRun researchers note that Claude's tendency toward verbose code generation (producing more lines per feature than Codex) may actually increase the vulnerability surface, creating more places for insecure patterns to hide. Claude's recent revenue milestones and enterprise adoption (covered in our Anthropic $5B revenue analysis) make this finding particularly notable for enterprise security teams currently deploying Claude as a coding agent.
The Codex Edge: Fewer Flaws, Better Self-Remediation
OpenAI's Codex (powered by GPT-5.2) finished the study with the fewest total vulnerabilities across both applications. But the more significant data point is behavioral: when DryRun researchers explicitly prompted Codex to review its own prior code for security issues, it demonstrated a meaningfully better ability to identify and remediate its own mistakes compared to Claude and Gemini.
This "prompted self-remediation" capability suggests a viable mitigation strategy for engineering teams: integrating a dedicated security-review step into the agentic workflow, explicitly asking the agent to audit each completed module before the next feature build begins. It does not eliminate the underlying vulnerability rate, but it reduces the accumulation problem.
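The workflow change is simple to sketch. This is a minimal outline, not DryRun's tooling: `run_agent` is a placeholder for whatever agent SDK call a team actually uses, and the prompt wording is illustrative.

```python
FEATURES = ["user auth", "data storage API", "export endpoint"]

def run_agent(prompt: str) -> str:
    # Placeholder for a real agent invocation (e.g. an SDK request to
    # Claude, Codex, or Gemini). Here it just echoes the prompt.
    return f"[agent output for: {prompt}]"

def build_with_review(features):
    transcript = []
    for feature in features:
        transcript.append(run_agent(f"Implement the next feature: {feature}"))
        # The dedicated security-review step: audit the module just
        # written before the next feature build begins.
        transcript.append(run_agent(
            "Audit the module you just wrote for OWASP Top 10 issues "
            "(SQL injection, broken auth, exposed credentials, XSS, IDOR) "
            "and patch anything you find before we continue."))
    return transcript

# Two agent calls per feature: one build, one audit.
print(len(build_with_review(FEATURES)))  # 6
```

The point is structural: the audit prompt runs every cycle, so flaws are caught while the surrounding code is still small, before the compounding described above can take hold.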
What Engineering Teams Should Do Right Now
| Recommendation | Why It Matters |
|---|---|
| Run full-codebase SAST at regular intervals | PR-level review misses cumulative risk; vulnerabilities compound across the full commit history |
| Add an explicit security-review prompt step | Codex demonstrated meaningfully better self-remediation when explicitly asked to audit; apply this pattern to all agents |
| Treat AI PRs as untrusted input | Apply the same scrutiny you would to a contractor's first commit; agents have no security training signal by default |
| Audit authentication and data access logic specifically | DryRun found insecure auth and data exposure were the most common high-severity classes across all three agents |
| Track vulnerability density over time | Monitor whether your codebase is accumulating flaws sprint-over-sprint, not just whether individual PRs pass review |
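The last recommendation is easy to operationalize. A minimal sketch of a density metric (the sprint figures below are invented; in practice they would come from your SAST tool's findings export and a line count):

```python
# Hypothetical per-sprint data: high-severity SAST findings and codebase size.
sprints = [
    {"name": "sprint-1", "high_sev_findings": 2, "loc": 4_000},
    {"name": "sprint-2", "high_sev_findings": 5, "loc": 9_000},
    {"name": "sprint-3", "high_sev_findings": 9, "loc": 12_000},
]

def density_per_kloc(sprint):
    # Normalize by size so growth alone doesn't mask (or fake) a trend.
    return sprint["high_sev_findings"] / (sprint["loc"] / 1000)

for s in sprints:
    print(f"{s['name']}: {density_per_kloc(s):.2f} high-sev findings/KLOC")
```

A rising findings-per-KLOC number is the signal the report warns about: the codebase is accumulating risk faster than it is growing, even if each individual PR passed review.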
The Bigger Picture
The DryRun report lands at a moment when the industry is actively debating whether agentic coding represents a genuine productivity leap or a technical debt time bomb. The answer, based on this data, is: both, simultaneously. The speed gains are real. The security blindness is equally real.
For enterprises currently evaluating Claude, Codex, or Gemini for autonomous coding tasks, the practical implication is straightforward: AI agent output requires a dedicated security layer that is designed into the workflow from day one, not bolted on after the codebase has grown.