AUSTIN, TX - A new report from DryRun Security has delivered a sharp warning to the engineering teams racing to adopt AI-native development workflows: autonomous AI agents build fast, but they build dangerously.
The Agentic Coding Security Report, released on March 11, 2026, evaluated three leading AI coding agents (Anthropic's Claude on Sonnet 4.5/4.6, OpenAI's Codex on GPT-5.2, and Google's Gemini) on two realistic application-building tasks. The result: 87% of the pull requests submitted by the agents (26 out of 30) introduced at least one high-severity security vulnerability.
How the Study Was Conducted
DryRun had each agent build two complete, functional applications through sequential feature requests, mimicking the way a developer might prompt an autonomous agent over a multi-day sprint:
- A health-tracking web application: user authentication, data storage, API endpoints
- A browser-based game: client-side logic, state management, user interaction
Each feature request generated a pull request. DryRun's security tooling then analyzed every PR for high-severity vulnerabilities: SQL injection, insecure authentication, exposed credentials, XSS vectors, and similar OWASP Top 10-class issues. The results were tracked cumulatively across the full application lifecycle, not just per-PR in isolation.
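The cumulative-tracking approach described above can be sketched in a few lines. This is a hypothetical illustration of the bookkeeping, not DryRun's actual tooling: each PR may introduce or fix findings, and anything not explicitly fixed stays open into later PRs.

```python
def track_cumulative(prs):
    """Track open high-severity findings across a sequence of PRs.

    prs: list of dicts, each with optional 'introduced' and 'fixed'
    lists of finding IDs. Returns the open-finding count after each PR.
    """
    open_findings = set()
    history = []
    for pr in prs:
        # New flaws accumulate; only explicit fixes remove them.
        open_findings |= set(pr.get("introduced", []))
        open_findings -= set(pr.get("fixed", []))
        history.append(len(open_findings))
    return history

# A flaw introduced in PR 1 remains open through PR 2 until fixed in PR 3.
prs = [
    {"introduced": ["sqli-1"]},
    {"introduced": ["xss-1"]},
    {"fixed": ["sqli-1"]},
]
print(track_cumulative(prs))  # [1, 2, 1]
```

The point of measuring this way, per the report, is that a per-PR view would score PR 2 as "one new flaw" while the cumulative view correctly shows two live attack vectors.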
Key Findings
| Finding | Detail |
|---|---|
| High-severity vulnerability rate | 26 of 30 agent-submitted pull requests (87%) introduced at least one high-severity flaw |
| Most vulnerable final codebases | Claude (Sonnet 4.5/4.6) left the highest number of unresolved high-severity flaws across both applications |
| Fewest total vulnerabilities | Codex (GPT-5.2) finished with the fewest flaws and the strongest prompted self-remediation |
| Compounding risk | Vulnerabilities accumulated across the application lifecycle rather than staying isolated per PR |
The Cumulative Risk: How Small Mistakes Become Big Vulnerabilities
The most significant structural finding in the report is not any single flaw; it is the pattern of accumulation. Because agentic coding systems generate each new feature in the context of a prompt, not a security audit, they rarely look backward to evaluate whether a new change interacts dangerously with existing logic.
A minor authentication shortcut introduced in PR #3 may be harmless in isolation. But by PR #12, when the application has a user database, a payment layer, and an external API, that same shortcut has become an attack vector with a significantly larger blast radius. DryRun's methodology was specifically designed to capture this compounding effect, and all three agents exhibited it.
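To make the "harmless shortcut" pattern concrete, here is a minimal, hypothetical example of the kind of flaw described above (the report does not publish its specific findings). The insecure handler trusts a client-supplied user ID instead of the server-verified session, an IDOR-style shortcut that leaks little with one table but exposes financial data once a payment layer exists.

```python
# Toy in-memory "database" standing in for a real user/payments store.
DB = {"alice": {"balance": 100}, "bob": {"balance": 50}}

def get_account_insecure(request):
    # Shortcut: trusts whatever user_id the client sends.
    return DB[request["user_id"]]

def get_account_secure(request, session):
    # Fix: derive identity from the server-side session, not the request.
    return DB[session["user_id"]]

req = {"user_id": "bob"}       # attacker-controlled input
sess = {"user_id": "alice"}    # server-verified identity

print(get_account_insecure(req))       # bob's data leaked to alice
print(get_account_secure(req, sess))   # alice sees only her own account
```

With only a profile endpoint, this bug is an information leak; once balances and payments land in later PRs, the identical line of code becomes account takeover, which is exactly the blast-radius growth the report measures.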
The Claude Paradox: Most "Thoughtful," Most Vulnerable
The report's most counterintuitive finding involves Anthropic's Claude (tested on Sonnet 4.5 and 4.6). Claude is widely regarded in developer communities as the most "thoughtful" coding model โ praised for verbose reasoning, nuanced explanations, and careful handling of edge cases in natural language. In DryRun's security-specific evaluation, that reputation did not carry over.
Claude produced the highest number of unresolved high-severity flaws in the final codebases across both applications. DryRun researchers note that Claude's tendency toward verbose code generation (producing more lines per feature than Codex) may actually increase the vulnerability surface, creating more places for insecure patterns to hide. Claude's recent revenue milestones and enterprise adoption (covered in our Anthropic $5B revenue analysis) make this finding particularly notable for enterprise security teams currently deploying Claude as a coding agent.
The Codex Edge: Fewer Flaws, Better Self-Remediation
OpenAI's Codex (powered by GPT-5.2) finished the study with the fewest total vulnerabilities across both applications. But the more significant data point is behavioral: when DryRun researchers explicitly prompted Codex to review its own prior code for security issues, it demonstrated a meaningfully better ability to identify and remediate its own mistakes compared to Claude and Gemini.
This "prompted self-remediation" capability suggests a viable mitigation strategy for engineering teams: integrating a dedicated security-review step into the agentic workflow, explicitly asking the agent to audit each completed module before the next feature build begins. It does not eliminate the underlying vulnerability rate, but it reduces the accumulation problem.
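The review-step mitigation can be sketched as a simple loop. Note that `agent` here is a hypothetical callable standing in for any agent API; the prompts are illustrative, not taken from the report.

```python
def build_with_review(agent, features):
    """Build features sequentially, auditing each module before the next.

    agent: callable taking a prompt string and returning generated code.
    features: ordered list of feature descriptions.
    """
    codebase = []
    for feature in features:
        code = agent(f"Implement this feature: {feature}")
        # Dedicated security-review step: the agent audits the module
        # it just wrote before the next feature build begins.
        audited = agent(
            f"Review the following code for security flaws and fix them:\n{code}"
        )
        codebase.append(audited)
    return codebase
```

The design choice is that the audit prompt is issued per module, not once at the end, which targets the accumulation problem the report describes: flaws are caught while their blast radius is still small.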
What Engineering Teams Should Do Right Now
| Recommendation | Why It Matters |
|---|---|
| Add a dedicated security-review step after each feature build | Prompted self-remediation measurably reduces flaw accumulation, especially with Codex |
| Re-audit existing code as new features land | Agents rarely look backward, so earlier shortcuts grow into larger attack vectors as the codebase expands |
| Design the security layer into the workflow from day one | Bolting security on after the codebase has grown leaves compounded vulnerabilities in place |
The Bigger Picture
The DryRun report lands at a moment when the industry is actively debating whether agentic coding represents a genuine productivity leap or a technical debt time bomb. Based on this data, the answer is both at once. The speed gains are real. The security blindness is equally real.
For enterprises currently evaluating Claude, Codex, or Gemini for autonomous coding tasks, the practical implication is straightforward: AI agent output requires a dedicated security layer that is designed into the workflow from day one, not bolted on after the codebase has grown.