🔴 Breaking · Technology

87% of AI Agent Pull Requests Introduce Security Flaws, New Report Finds

DryRun Security's Agentic Coding Security Report tested Claude, Codex, and Gemini building real applications from scratch, and found that autonomous AI agents are effectively "security blind" by default, with 26 out of 30 pull requests introducing at least one high-severity vulnerability.

March 12, 2026 · 📖 6 min read

AUSTIN, TX – A new report from DryRun Security has delivered a sharp warning to the engineering teams racing to adopt AI-native development workflows: autonomous AI agents build fast, but they build dangerously.

The Agentic Coding Security Report, released on March 11, 2026, evaluated three leading AI coding agents (Anthropic's Claude, tested on Sonnet 4.5/4.6; OpenAI's Codex, on GPT-5.2; and Google's Gemini) on two realistic application-building tasks. The result: 87% of the pull requests submitted by the agents (26 out of 30) introduced at least one high-severity security vulnerability.

⚡
87% of AI-generated pull requests in the study contained at least one high-severity security flaw. The agents were not writing toy code: they were building full applications under conditions designed to mirror real-world engineering workflows.

How the Study Was Conducted

DryRun had each agent build two complete, functional applications through sequential feature requests, mimicking the way a developer might prompt an autonomous agent over a multi-day sprint:

  • A health-tracking web application: user authentication, data storage, API endpoints
  • A browser-based game: client-side logic, state management, user interaction

Each feature request generated a pull request. DryRun's security tooling then analyzed every PR for high-severity vulnerabilities: SQL injection, insecure authentication, exposed credentials, XSS vectors, and similar OWASP Top 10-class issues. The results were tracked cumulatively across the full application lifecycle, not just per-PR in isolation.
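For readers unfamiliar with the vulnerability classes DryRun scanned for, here is a minimal illustrative sketch of the most famous one, SQL injection. This example is ours, not code from the report: it contrasts the string-interpolated query pattern that scanners flag with the parameterized form that defeats it.

```python
import sqlite3

# In-memory database standing in for an app's user store (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

def find_user_vulnerable(name: str):
    # The shortcut pattern: attacker-controlled input is spliced into the SQL.
    return conn.execute(
        f"SELECT name FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(name: str):
    # Parameterized query: the driver treats the input as data, not SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "' OR '1'='1"                 # classic injection payload
print(find_user_vulnerable(payload))    # matches every row
print(find_user_safe(payload))          # matches nothing
```

The vulnerable version turns the WHERE clause into a tautology and dumps the table; the safe version returns an empty result for the same input.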

Key Findings

| Finding | Detail |
| --- | --- |
| Overall failure rate | 87% (26 of 30 pull requests contained at least one high-severity flaw) |
| Agents tested | Anthropic Claude (Sonnet 4.5/4.6), OpenAI Codex (GPT-5.2), Google Gemini |
| Applications built | Health-tracking web app + browser-based game (sequential feature requests) |
| Cumulative risk pattern | Flaws accumulated: agents rarely reviewed prior logic, so early mistakes compounded |
| Worst for unresolved flaws | Claude (Sonnet 4.5/4.6): highest number of unresolved high-severity issues in final codebases |
| Best overall | OpenAI Codex (GPT-5.2): fewest total vulnerabilities; best self-remediation when prompted |
| Primary vulnerability classes | SQL injection, insecure auth, exposed credentials, XSS, insecure direct object references |

The Cumulative Risk: How Small Mistakes Become Big Vulnerabilities

The most significant structural finding in the report is not any single flaw; it is the pattern of accumulation. Because agentic coding systems generate each new feature in the context of a prompt, not a security audit, they rarely look backward to evaluate whether a new change interacts dangerously with existing logic.

A minor authentication shortcut introduced in PR #3 may be harmless in isolation. But by PR #12, when the application has a user database, a payment layer, and an external API, that same shortcut has become an attack vector with a significantly larger blast radius. DryRun's methodology was specifically designed to capture this compounding effect, and all three agents exhibited it.
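The compounding dynamic is easy to see in miniature. The sketch below is hypothetical (not taken from the report's codebases): an early endpoint fetches any record by id with no ownership check, an insecure direct object reference. It is harmless while one user owns everything, but each later feature quietly widens what the unchanged shortcut can leak.

```python
# Hypothetical sketch of a compounding auth shortcut (illustrative only).
RECORDS = {
    1: {"owner": "alice", "kind": "health", "data": "steps: 9000"},
}

def get_record(record_id: int):
    # Early-PR shortcut: trusts the caller-supplied id completely,
    # with no check that the caller owns the record.
    return RECORDS[record_id]

# Later features add more sensitive data without anyone revisiting
# get_record -- the shortcut's blast radius grows silently.
RECORDS[2] = {"owner": "bob", "kind": "payment", "data": "card on file"}

# Now any authenticated user can read bob's payment record by guessing an id.
leaked = get_record(2)

def get_record_checked(record_id: int, current_user: str):
    # The fix the agents rarely made unprompted: verify ownership.
    record = RECORDS[record_id]
    if record["owner"] != current_user:
        raise PermissionError("not your record")
    return record
```

The fix is a one-line ownership check, but it only happens if something in the workflow forces a look back at code written several pull requests earlier.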

⚡
The cumulative risk pattern is the most practically dangerous finding for enterprise teams. A single-PR review process will not surface it. Security evaluation of AI-generated code requires full-codebase static analysis at regular intervals, not just PR-level diff review.

The Claude Paradox: Most "Thoughtful," Most Vulnerable

The report's most counterintuitive finding involves Anthropic's Claude (tested on Sonnet 4.5 and 4.6). Claude is widely regarded in developer communities as the most "thoughtful" coding model, praised for verbose reasoning, nuanced explanations, and careful handling of edge cases in natural language. In DryRun's security-specific evaluation, that reputation did not carry over.

Claude produced the highest number of unresolved high-severity flaws in the final codebases across both applications. DryRun researchers note that Claude's tendency toward verbose code generation (producing more lines per feature than Codex) may actually increase the vulnerability surface, creating more places for insecure patterns to hide. Claude's recent revenue milestones and enterprise adoption (covered in our Anthropic $5B revenue analysis) make this finding particularly notable for enterprise security teams currently deploying Claude as a coding agent.

The Codex Edge: Fewer Flaws, Better Self-Remediation

OpenAI's Codex (powered by GPT-5.2) finished the study with the fewest total vulnerabilities across both applications. But the more significant data point is behavioral: when DryRun researchers explicitly prompted Codex to review its own prior code for security issues, it demonstrated a meaningfully better ability to identify and remediate its own mistakes compared to Claude and Gemini.

This "prompted self-remediation" capability suggests a viable mitigation strategy for engineering teams: integrating a dedicated security-review step into the agentic workflow, explicitly asking the agent to audit each completed module before the next feature build begins. It does not eliminate the underlying vulnerability rate, but it reduces the accumulation problem.
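In workflow terms, the mitigation amounts to one extra pass per feature. The sketch below is our own illustration of that control flow: `build_feature`, `audit_module`, and `remediate` are stubs standing in for whatever agent API a team actually uses, so only the loop structure, not the stub behavior, is the point.

```python
# Hypothetical agentic-sprint loop with an explicit security-review step.
# All three helpers are stubs so the control flow is runnable as-is.
def build_feature(name: str) -> str:
    return f"code for {name}"                 # stub: agent writes the feature

def audit_module(code: str) -> list[str]:
    # stub: agent reviews its own prior code; a real audit would come
    # from prompting the model (or running SAST) against the module.
    return ["hardcoded credential"] if "login" in code else []

def remediate(code: str, findings: list[str]) -> str:
    return code + f"  # fixed: {', '.join(findings)}"   # stub

def agentic_sprint(features: list[str]) -> list[str]:
    modules = []
    for feature in features:
        code = build_feature(feature)
        findings = audit_module(code)         # explicit security-review step
        if findings:
            code = remediate(code, findings)  # fix before moving on
        modules.append(code)                  # only audited code accumulates
    return modules

modules = agentic_sprint(["login", "dashboard"])
```

The design choice is that the audit gates each module before the next feature begins, which is what prevents an early flaw from compounding across later pull requests.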

💬
The finding is not that AI agents cannot produce secure code; it is that they will not produce it by default. The security review pass has to be deliberately engineered into the workflow. It will not happen automatically.

What Engineering Teams Should Do Right Now

| Recommendation | Why It Matters |
| --- | --- |
| Run full-codebase SAST at regular intervals | PR-level review misses cumulative risk: vulnerabilities compound across the full commit history |
| Add an explicit security-review prompt step | Codex demonstrated meaningfully better self-remediation when explicitly asked to audit; apply this pattern to all agents |
| Treat AI PRs as untrusted input | Apply the same scrutiny you would to a contractor's first commit; agents have no security training signal by default |
| Audit authentication and data-access logic specifically | DryRun found insecure auth and data exposure were the most common high-severity classes across all three agents |
| Track vulnerability density over time | Monitor whether your codebase is accumulating flaws sprint-over-sprint, not just whether individual PRs pass review |

The Bigger Picture

The DryRun report lands at a moment when the industry is actively debating whether agentic coding represents a genuine productivity leap or a technical debt time bomb. The answer, based on this data, is: both, simultaneously. The speed gains are real. The security blindness is equally real.

For enterprises currently evaluating Claude, Codex, or Gemini for autonomous coding tasks, the practical implication is straightforward: AI agent output requires a dedicated security layer that is designed into the workflow from day one, not bolted on after the codebase has grown.

📊
26 out of 30 AI-generated pull requests introduced a high-severity security flaw. Source: DryRun Security, Agentic Coding Security Report, March 11, 2026.

Tags

#AI Security · #DryRun Security · #Agentic AI · #Claude · #Codex · #Gemini · #Security Vulnerabilities · #Developer Tools · #Software Engineering · #AI Coding

Written by

Jack Wang

Technology Desk

Part of ObjectWire coverage