Amazon's AI Speeds Up Cybersecurity Rule Creation by 336%: Here's How It Works
Amazon has built an AI system called RuleForge that generates cybersecurity detection rules more than three times faster than human experts can create them manually, while actually improving accuracy by reducing false alarms. The system uses multiple specialized AI agents working together to transform vulnerability disclosures into production-ready security rules, addressing a critical gap in how quickly organizations can defend against newly discovered threats.
Why Does Speed Matter in Cybersecurity?
Every day, security teams face a growing avalanche of new vulnerabilities. In 2025 alone, the National Vulnerability Database published more than 48,000 new common vulnerabilities and exposures, or CVEs, reflecting how automated tools have accelerated vulnerability discovery. For security teams at large organizations like Amazon, knowing about a new vulnerability isn't enough; they must translate each disclosure into detection logic fast enough to protect complex systems before attackers exploit the weakness.
Traditionally, this process was slow and manual. A security analyst would download proof-of-concept exploit code, study how the attack works, write detection rules to catch malicious traffic, test those rules against real traffic logs, and iterate until the rule performed well enough for production deployment. Only then would another engineer review it before it went live. This workflow produced high-quality rules, but the time investment meant teams had to carefully prioritize which vulnerabilities to cover first, leaving gaps in protection.
How Does RuleForge's Multi-Agent System Work?
RuleForge reimagines this workflow by decomposing the task into stages that mirror how human experts work, with specialized AI agents handling different parts of the process. Rather than relying on a single model to solve the entire problem, the system breaks the challenge into manageable pieces.
- Automated Ingestion: RuleForge downloads publicly available exploit proof-of-concept code and scores each one using content analysis and threat intelligence sources, ensuring rule generation focuses on threats that matter most.
- Parallel Rule Generation: For each prioritized vulnerability, a generation agent running on AWS Fargate with Amazon Bedrock proposes multiple candidate detection rules in parallel, exploring different detection strategies before selecting the most promising ones.
- AI-Powered Evaluation: A separate "judge" model reviews each candidate rule on two dimensions that human experts use: sensitivity (the probability the rule will miss malicious requests) and specificity (whether the rule targets the actual vulnerability or just a correlated feature).
- Multistage Validation: Rules that pass the judge move through increasingly rigorous tests, including synthetic testing and validation against real traffic logs from Amazon's MadPot honeypot system, with feedback sent back to the generation agent if rules fail.
- Human Review: The best-performing rule enters code review by a security engineer, with any feedback going back to the generation agent for revision before final production deployment.
The key innovation is separating the generation of rules from their evaluation. When Amazon asked the rule generation model to rate its own work, it thought almost everything it produced was good. This aligns with research showing that large language models, or LLMs, tend to have poor calibration on security topics. The solution was using a dedicated judge model instead.
What Makes the "Judge" Model So Effective?
Using a dedicated judge model reduced false positives by 67% while maintaining the same number of true positive detections. Two specific techniques improved the judge's accuracy significantly. First, asking "what is the probability that the rule fails to flag malicious requests?" produced better calibration than asking "what is the probability that the rule correctly flags all malicious requests?" Given that LLMs tend toward affirmation, framing the evaluation as a search for problems yields more honest assessments.
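In practice, the failure-seeking framing is just a different question put to the judge model, with detection confidence recovered as the complement of the reported miss probability. The prompt wording and the CVE identifier below are illustrative, not Amazon's actual text:

```python
def build_sensitivity_prompt(rule: str, cve_id: str) -> str:
    """Frame the evaluation as a search for problems rather than affirmation,
    which the article reports gives better-calibrated answers."""
    return (
        f"You are reviewing a detection rule for {cve_id}.\n"
        f"Rule: {rule}\n"
        "What is the probability (0-1) that this rule FAILS to flag "
        "malicious requests exploiting this vulnerability?"
    )

def detection_confidence(p_miss: float) -> float:
    """Convert the judge's miss probability into detection confidence."""
    assert 0.0 <= p_miss <= 1.0
    return 1.0 - p_miss
```

A judge answering `0.05` to the failure-framed question implies 95% detection confidence, without ever asking the model to affirm its own work.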
Second, domain-specific prompts outperformed generic ones. Rather than asking the model to rate its overall confidence in a rule, the questions that worked encoded what security engineers actually look for: whether the rule targets the vulnerability mechanism itself versus a correlated surface feature, and whether the rule covers the full range of exploit variations. The system also generates reasoning chains explaining its scores, and when Amazon evaluated those reasoning chains against human assessments, the AI judge's reasoning matched expert human reasoning for six out of nine rules.
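A domain-specific rubric of this kind might look like the following sketch. The two questions mirror what the article describes engineers looking for; the JSON response schema and the parsing helper are our own assumptions, not Amazon's actual prompt:

```python
import json

# Hypothetical judge rubric: domain-specific questions plus a request for a
# reasoning chain, rather than a generic "rate your confidence" prompt.
JUDGE_RUBRIC = """\
Evaluate this detection rule for {cve_id}:
{rule}

1. Does the rule target the vulnerability mechanism itself, or only a
   correlated surface feature of the exploit proof of concept?
2. Does the rule cover the full range of exploit variations
   (encodings, whitespace, case, alternative payloads)?

Explain your reasoning step by step, then answer in JSON:
{{"targets_mechanism": bool, "covers_variations": bool, "reasoning": str}}
"""

def parse_judge_verdict(response: str) -> dict:
    """Extract the JSON verdict that follows the judge's reasoning chain."""
    start, end = response.find("{"), response.rfind("}") + 1
    return json.loads(response[start:end])
```

Keeping the free-text reasoning alongside the structured verdict is what made the comparison against human assessments possible in the first place.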
"When a human evaluator noted, 'That SQL injection regex is too loose,' the judge had independently determined that 'the regex pattern will catch any query parameter with a single quote, which is broader than just the specific vulnerability,'" Amazon noted in its analysis.
Amazon Science, RuleForge Technical Documentation
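To make the judge's point concrete, here is an illustrative comparison (our own, not one of Amazon's production rules) of a loose single-quote regex versus one scoped to the injection mechanism:

```python
import re

# A loose rule that fires on any single quote, versus one scoped to the
# actual injection mechanism (a quote followed by SQL keywords).
loose = re.compile(r"'")
scoped = re.compile(r"'\s*(?:OR|AND|UNION)\b", re.IGNORECASE)

benign = "q=O'Reilly books"     # legitimate apostrophe in a search query
attack = "id=1' OR '1'='1"      # classic SQL injection probe
```

The loose rule flags both requests, producing a false positive on the legitimate apostrophe, while the scoped rule matches only the attack string; that is exactly the "broader than just the specific vulnerability" failure the judge flagged.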
How to Implement AI-Driven Security Rule Generation?
- Decompose Complex Tasks: Break down security rule creation into stages that mirror human expert workflows, using specialized AI agents for each stage rather than attempting to solve the entire problem with a single model.
- Separate Generation from Evaluation: Use different AI models for generating candidate rules and evaluating them, preventing the generation model from biasing its own assessment and reducing false positives significantly.
- Use Domain-Specific Prompting: Frame evaluation questions around what domain experts actually look for, such as whether rules target the vulnerability mechanism itself or just correlated features, rather than asking generic confidence questions.
- Maintain Human-in-the-Loop Oversight: Keep security engineers in the final decision-making process, using AI to accelerate rule generation and evaluation while preserving human judgment for production deployment.
- Implement Feedback Loops: Send specific feedback from validation stages back to the generation agent, creating a closed loop of improvement that allows rules to be refined based on test results.
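The feedback-loop principle above reduces to prompt accumulation: each validation failure becomes explicit context for the next generation round. The function name, prompt wording, and CVE identifier below are hypothetical:

```python
def build_generation_prompt(cve_id: str, poc: str, failures: list[str]) -> str:
    """Compose a generation prompt that folds prior validation failures back
    in, so the next round of candidate rules avoids the same mistakes."""
    prompt = (f"Write a detection rule for {cve_id}.\n"
              f"Proof-of-concept exploit:\n{poc}\n")
    if failures:
        prompt += "Previous attempts failed validation because:\n"
        prompt += "".join(f"- {msg}\n" for msg in failures)
        prompt += "Address each failure in the new rule.\n"
    return prompt

# After a round of validation, its feedback is appended and the prompt rebuilt:
failures: list[str] = []
failures.append("rule flags any single quote -- too broad")
prompt = build_generation_prompt("CVE-2025-0001", "id=1' OR '1'='1", failures)
```

Because the feedback messages are specific ("too broad", "misses encoded variants"), the generation model gets actionable constraints rather than a bare pass/fail signal.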
Amazon deployed the confidence scoring system in August 2025, accelerating how quickly its analysts can deploy detection rules for new vulnerabilities. The 336% productivity advantage means that security teams can now turn high-severity CVEs into validated detection rules at a pace and scale that would be impossible with traditional methods, providing more comprehensive protection for customers.
The implications extend beyond Amazon's internal security operations. As the number of high-severity vulnerabilities published to the National Vulnerability Database continues to grow, AI-powered automation is becoming essential for security at scale. Organizations using AWS services can leverage similar approaches through Amazon Bedrock, AWS's managed service for building with foundation models, to accelerate their own security rule generation and close the gap between vulnerability disclosure and defense.