
Set the AWS Security Agent Loose on Our Production Site. Here is What 10 Hours Got Us.

The conversation that started it

At AWS Summit Hamburg I ended up in a conversation with Desiree Brunner, Security Specialist Solutions Architect at Amazon Web Services (AWS). We talked about the new wave of AI-driven security tooling and, of course, the AWS Security Agent. It was the kind of conversation that does what good summit conversations should do: it leaves you with a question you cannot put down.

Mine was simple: what does this thing actually do against a real production workload? Not a CTF, not a demo target. Our own site.

So the day after the summit, I sat down and ran it against cloud.tallence.com, the new Tallence Cloud site, which I built with AWS Kiro. This article is the field report.

Setting it up: harder than the marketing slide suggests

Before any scanner runs, the agent needs to be both legitimised and plumbed in. Four things had to come together:

  • Space + IAM authentication. Created a dedicated agent space and wired up IAM auth so I could access the agent space through regular IAM roles.
  • Domain validation. Validated the attack targets by setting up a Route 53 TXT record, so the agent is allowed (and legally clear) to actually pentest them.
  • Networking. A simple VPC with a private subnet and a NAT Gateway. I only wanted to test public endpoints, so this is the minimum viable plumbing.
  • Credentials. Login credentials for authenticated paths were dropped into AWS Secrets Manager so the agent could exercise logged-in flows.
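The domain validation step comes down to publishing a TXT record in the hosted zone of the target. A minimal Python sketch of what that record change could look like follows. The record name `_validation.cloud.tallence.com.` and the token are placeholders invented for illustration; the real name and value are whatever the Security Agent setup asks for. The helper only builds the `ChangeBatch` payload, and the boto3 call that would submit it is left as a comment.

```python
import json


def build_txt_change_batch(record_name: str, token: str, ttl: int = 300) -> dict:
    """Build a Route 53 ChangeBatch that upserts a single TXT record."""
    return {
        "Comment": "Domain validation for a pentest target (illustrative)",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "TXT",
                    "TTL": ttl,
                    # Route 53 expects TXT values wrapped in double quotes.
                    "ResourceRecords": [{"Value": f'"{token}"'}],
                },
            }
        ],
    }


# Hypothetical record name and token, not the values used in the real setup.
batch = build_txt_change_batch("_validation.cloud.tallence.com.", "example-token-123")
print(json.dumps(batch, indent=2))

# Submitting it needs boto3 and credentials; the zone ID below is a placeholder:
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZXXXXXXXXXXXXX", ChangeBatch=batch
# )
```

Using `UPSERT` keeps the change idempotent, which helps when a setup takes a few attempts.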

It took three failed runs before the test actually started. Each failure was a small config error, the kind you only learn about by hitting it. Once it kicked off, I went to bed. The next morning, after about 10 hours, I stopped it manually: it was not clear whether it was still finding new ground or just thrashing, and most categories had clearly closed out.

Stats from the run: 10h 1m wall-clock, 57.04 task-hours of underlying agent work, and 6 findings across the severity spectrum.
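Those two time figures imply a fair amount of parallel work. A back-of-the-envelope calculation, assuming task-hours are simply summed across concurrently running tasks (an assumption about how the stat is computed, not a documented definition):

```python
# Figures from the run; the interpretation of "task-hours" is an assumption.
wall_clock_hours = 10 + 1 / 60   # 10h 1m
task_hours = 57.04

avg_parallelism = task_hours / wall_clock_hours
print(f"average concurrent tasks: {avg_parallelism:.2f}")  # about 5.69
```

Under that reading, the agent kept roughly five to six tasks in flight on average over the night.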

What it actually tested

This is where the agent earns its keep. Without prompting, it sequenced 60+ tasks across the OWASP-style attack surface, validator passes included. The high-level coverage:

  • Network & TLS scanning (3 tasks): Endpoint discovery, TLS/SSL vulns, crawler
  • Cross-Site Scripting (XSS) (7 tasks): Reflected, stored, DOM, WAF bypass attempts
  • Local File Inclusion (4 tasks): Including locale parameter WAF bypass
  • Path Traversal (5 tasks): findingKey parameter, multiple validator passes
  • IDOR (3 tasks): Cross-tenant access checks on authenticated APIs
  • Privilege Escalation (5 tasks): Cognito JWT, admin /leads endpoint
  • Server-Side Template Injection (3 tasks): IAM template generator endpoints
  • Command Injection (3 tasks): POST /scans roleArn -> ECS Fargate
  • Code Injection (4 tasks): Contact form Lambda, accountName, search params
  • SQL / NoSQL Injection (3 tasks): DynamoDB FilterExpression injection
  • SSRF (5 tasks): roleArn, HubSpot integration, user export
  • JWT Vulnerabilities (4 tasks): Cognito authorizer, audience claim
  • XXE (2 tasks): Forced XML parsing across endpoints
  • Security Misconfiguration (2 tasks): BREACH, CORS wildcard
  • Server Error 500 (2 tasks): CRLF, HubSpot upstream injection
  • Information Disclosure (1 task): API Gateway error responses

A few things to call out:

  • It tries to bypass your WAF. Multiple tasks were explicitly framed as "WAF bypass" attempts: mixed-case event handlers, encoded locale parameters, header smuggling. If you rely on a WAF as your last line, this is humbling to watch.
  • It chains context. The agent authenticated as a user, then tried to escalate to admin endpoints. That is not a static scanner pattern, that is an actor.
  • It validates its own hypotheses. Almost every high-confidence finding has a follow-up "Validator task", an attempt to confirm exploitability rather than ship a noisy alert.
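To see why a mixed-case event handler is a classic bypass attempt, here is a toy Python illustration. Both filter functions are hypothetical and deliberately simplistic; neither reflects how AWS WAF rules work or what payloads the agent actually sent.

```python
import re


def naive_filter_blocks(payload: str) -> bool:
    """Case-sensitive substring check: the kind of rule mixed case slips past."""
    return "onerror=" in payload


def case_insensitive_filter_blocks(payload: str) -> bool:
    """Match any on<event>= attribute regardless of letter case."""
    return re.search(r"\bon\w+\s*=", payload, re.IGNORECASE) is not None


# Illustrative payload with a mixed-case event handler.
payload = "<img src=x OnErRoR=alert(1)>"

print(naive_filter_blocks(payload))             # False: the payload gets through
print(case_insensitive_filter_blocks(payload))  # True: the payload is caught
```

Browsers treat HTML attribute names case-insensitively, so a filter has to as well.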

The findings

Six findings total. Three Medium, one Low, two Informational. After manual triage:

  • CRLF Injection in Contact Form Lambda via pageUri field - Medium - Accepted -> fixed
  • Contact Form HubSpot Upstream Error (502) via hutk injection - Medium - Accepted -> fixed
  • API Gateway error response reveals internal architecture - Medium - False positive
  • JWT validation oracle + missing audience claim check (CWE-204) - Low - False positive
  • BREACH vulnerability - HTTP compression side-channel - Informational - False positive
  • CORS wildcard Access-Control-Allow-Origin on API responses - Informational - False positive

Two of the three Medium findings were legitimate and got fixed on the spot: the CRLF injection in the Contact Form Lambda’s pageUri field, and the HubSpot upstream 502 triggered by hutk parameter injection. Both are exactly the kind of issue that hides in third-party integration code that nobody looks at twice.
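For readers who have not met CRLF injection before, the general shape of a guard looks like the Python sketch below. It is an illustration of the technique, not the fix that went into the Contact Form Lambda: the function name is hypothetical, and only the `pageUri` field name comes from the finding.

```python
from urllib.parse import unquote


def reject_crlf(value: str, field: str = "pageUri") -> str:
    """Return value unchanged, or raise if it carries CR/LF, raw or percent-encoded."""
    decoded = unquote(value)  # turns %0d%0a into real CR/LF characters
    for candidate in (value, decoded):
        if "\r" in candidate or "\n" in candidate:
            raise ValueError(f"{field} contains CR or LF characters")
    return value


# A clean value passes through untouched.
print(reject_crlf("https://cloud.tallence.com/contact"))

# An encoded CRLF sequence is rejected before it can reach a header or log line.
try:
    reject_crlf("https://cloud.tallence.com/%0d%0aX-Injected: 1")
except ValueError as err:
    print(err)
```

Rejecting the input outright is usually safer than stripping the characters, because the value is often forwarded to a third party that may decode it again.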

The rest were false positives after triage. High agent confidence, but mitigated by controls the agent could not see (audience claim handled upstream, BREACH not exploitable in our setup, CORS scoped at a layer the agent did not observe).
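As background on the audience finding, this is roughly what an audience claim check does, sketched in Python with the standard library only. It decodes the token payload without verifying the signature, so it is for illustration and nothing else; a real service verifies the signature first with a proper JWT library. The client ID and helper names are made up, and in the setup described here the equivalent check happens upstream.

```python
import base64
import json


def decode_claims_unverified(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def audience_matches(claims: dict, expected: str) -> bool:
    """The aud claim may be a single string or a list of strings."""
    aud = claims.get("aud")
    if isinstance(aud, str):
        return aud == expected
    if isinstance(aud, list):
        return expected in aud
    return False  # a missing aud claim never matches


def _b64url(obj: dict) -> str:
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Unsigned demo token with a hypothetical client ID as its audience.
demo_token = ".".join([_b64url({"alg": "none"}), _b64url({"aud": "example-client-id"}), ""])
claims = decode_claims_unverified(demo_token)

print(audience_matches(claims, "example-client-id"))  # True
print(audience_matches(claims, "some-other-client"))  # False
```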

What I could not give it

The honest gap: I could not enable the new code-review feature. Tallence runs on GitLab and at the time of writing the Security Agent’s code review integration is GitHub-only. A GitLab.com integration would meaningfully change the calculus. (AWS, if you are reading: please.)

What I did hand the agent: five AI-generated Markdown files describing the API surface and application architecture. That alone made the scan visibly smarter. The privilege-escalation and path-traversal probes targeted real endpoints by name, not generic wordlists.

The verdict (so far)

This is the part I want fellow AWS Architects, security folks, and platform owners to take away:

  • It is the best "water-level check" I have seen. What I got in 10 hours is what a 2- to 4-week external consulting engagement would have produced as a first-pass report, minus the kickoff meetings.
  • It is not a replacement for a real pentest. Not yet. Whether the depth matches a senior human pentester is still an open question, and one I will keep testing.
  • It is dramatically better than nothing. And "nothing" is what most mid-market AWS workloads get between formal audits.
  • Context is the multiplier. The architecture docs I fed in mattered more than I expected. If you run this, do not run it blind.

One more note on the result: the reason there were so few findings is not the agent. I built cloud.tallence.com with security by design from day one, using AWS Kiro: IAM-scoped resources, validated inputs, no implicit trust between services. The Security Agent is a confirmation, not a discovery, of that posture. That is precisely how I want to use it.