Automating Hack The Box 'Editor' with an LLM-driven n8n Agent

by Van Graham

TL;DR

I built an n8n workflow that used an LLM-driven agent to fully enumerate and exploit HTB’s Editor box end-to-end from a single input target. The agent chained dozens of sub-workflows (HTTP discovery, fingerprinting, exploit selection, payload staging, shell stabilization, and flag extraction), adapted to odd command-pipeline failures and ephemeral listening sockets, and ultimately retrieved user.txt.

Automated workflow map (n8n, many subworkflows):

n8n workflow diagram showing automated exploitation chain

Final structured output from the agent (flag hex):

Terminal output showing extracted flag in hex format

The Idea

Use an LLM as a decision-making core inside n8n so the automation can:

  • Discover the attack surface (HTTP endpoints, headers, open ports).
  • Recognize software and likely CVE classes from responses.
  • Choose or adapt exploits from a local/online knowledge base.
  • Stage commands and payloads, handle ephemeral shells/netcats, and validate success.

Put simply: the LLM stands in for the human operator doing reconnaissance and exploitation, while the workflow supplies deterministic tooling (curl, nmap-like probes, exploit modules) plus structured memory and parsers to keep state.
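
To make that division of labor concrete, here is a minimal Python sketch (not the actual n8n workflow, which lives in nodes and subworkflows): the LLM proposes the next action as structured JSON, deterministic code executes it, and normalized results accumulate as state. The propose_next_action callable stands in for whatever LLM call you wire up; every name here is illustrative.

    import json
    import subprocess

    def run_tool(command: list[str], timeout: int = 30) -> dict:
        """Deterministic execution: run one vetted tool invocation and return structured results."""
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
        return {
            "command": command,
            "exit_code": proc.returncode,
            "stdout": proc.stdout[-4000:],  # truncate so the model's context stays bounded
            "stderr": proc.stderr[-4000:],
        }

    def agent_loop(target: str, propose_next_action, max_steps: int = 20) -> dict:
        """LLM proposes, deterministic tooling executes, parsed results accumulate as state."""
        state = {"target": target, "history": []}
        for _ in range(max_steps):
            # propose_next_action stands in for the LLM call; it must return JSON like
            # {"done": false, "command": ["curl", "-sI", "http://10.129.0.1/"]}
            action = json.loads(propose_next_action(state))
            if action.get("done"):
                break
            state["history"].append(run_tool(action["command"]))
        return state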

How It Worked (Concise Step-by-Step)

Input: Single target URL/IP (the HTB machine).

Stage 1 — Recon: HTTP GETs to default paths, basic header & body fingerprinting, and automated heuristics to match software (e.g., version strings, error pages).
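
A minimal sketch of the kind of fingerprinting heuristic this stage relies on; the probe paths and signature strings below are illustrative placeholders rather than the box's real fingerprints, and should only ever be pointed at targets you are authorized to test.

    import requests

    # Illustrative signatures only; the real workflow carried many more fingerprint rules.
    SIGNATURES = {
        "wordpress": ["wp-content", "wp-login"],
        "jenkins": ["x-jenkins"],
        "gitea": ["gitea"],
    }

    def fingerprint(base_url: str, paths=("/", "/robots.txt", "/login")) -> dict:
        """Probe a few default paths and match known markers in headers and bodies."""
        findings = {"products": {}, "headers": {}}
        for path in paths:
            try:
                resp = requests.get(base_url.rstrip("/") + path, timeout=5)
            except requests.RequestException:
                continue
            haystack = (resp.text + " " + str(resp.headers)).lower()
            for product, markers in SIGNATURES.items():
                if any(marker in haystack for marker in markers):
                    findings["products"].setdefault(product, []).append(path)
            # Server / X-Powered-By headers often leak exact version strings
            for header in ("Server", "X-Powered-By"):
                if header in resp.headers:
                    findings["headers"][header] = resp.headers[header]
        return findings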

Stage 2 — Triage: LLM produced a ranked list of likely vulnerabilities and matching exploit heuristics with confidence scores.
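
That ranked list is only useful downstream if it is forced into a strict schema. A minimal validation sketch, with hypothetical field names (component, vuln_class, confidence, rationale):

    import json
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        component: str     # e.g. "gitea 1.17"
        vuln_class: str    # e.g. "authenticated RCE via git hooks"
        confidence: float  # 0.0-1.0, as reported by the model
        rationale: str     # which fingerprint evidence supports the guess

    def parse_triage(raw_json: str) -> list[Candidate]:
        """Validate the LLM's triage output and return candidates ranked by confidence."""
        candidates = []
        for item in json.loads(raw_json):
            conf = float(item["confidence"])
            if not 0.0 <= conf <= 1.0:
                raise ValueError(f"confidence out of range: {conf}")
            candidates.append(
                Candidate(item["component"], item["vuln_class"], conf, item["rationale"])
            )
        return sorted(candidates, key=lambda c: c.confidence, reverse=True)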

Stage 3 — Exploit selection: A subworkflow pulled an exploit template (local or remote) and attempted a non-invasive proof-of-concept.

Stage 4 — Shell handling: When a reverse or bind shell was obtained, the workflow used an adaptive node to detect listening ports and then stitched a stable channel (staging netcat, spawner, or SSH).
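
The listener-detection piece can be as simple as polling a candidate port until it accepts a connection or a deadline passes; a minimal sketch, with illustrative timings:

    import socket
    import time

    def wait_for_port(host: str, port: int, deadline: float = 30.0, interval: float = 0.5) -> bool:
        """Poll until a (possibly ephemeral) shell port accepts connections, or give up."""
        stop_at = time.monotonic() + deadline
        while time.monotonic() < stop_at:
            try:
                with socket.create_connection((host, port), timeout=2):
                    return True
            except OSError:
                time.sleep(interval)
        return False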

Stage 5 — Cleanup + extraction: Commands to read user.txt were validated, output normalized (hex), and passed back to the structured output parser.
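
A minimal sketch of that normalization step, assuming the usual HTB convention that user.txt holds 32 lowercase hex characters; hex-encoding the raw bytes keeps stray whitespace or control characters from corrupting the structured output:

    import binascii
    import re

    def normalize_flag(raw_output: bytes) -> dict:
        """Hex-encode raw command output and sanity-check that it looks like an HTB user flag."""
        stripped = raw_output.strip()
        return {
            "hex": binascii.hexlify(stripped).decode("ascii"),
            # HTB user flags are normally 32 lowercase hex characters; anything else is suspect
            "looks_like_flag": bool(re.fullmatch(rb"[0-9a-f]{32}", stripped)),
        }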

The entire chain was kicked off with a single input and a pre-built set of ~70 tool nodes and many tiny subworkflows for modularity.

What Made It Work

Modularity: dozens of small, testable subflows (HTTP parsing, fingerprint normalization, exploit templating).

Structured output parsing: having a parser node to normalize results (exit code, stdout, stderr) let the LLM reason about success/failure.
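
For example, a small classifier can turn those normalized fields into an explicit verdict, so success is never inferred from free-form text; the failure markers below are illustrative:

    FAILURE_MARKERS = ("command not found", "permission denied", "no such file")

    def classify(result: dict) -> dict:
        """Turn normalized fields (exit_code, stdout, stderr) into an explicit verdict,
        so success is decided programmatically rather than guessed at by the model."""
        stderr = result.get("stderr", "").lower()
        hits = [marker for marker in FAILURE_MARKERS if marker in stderr]
        return {
            "success": result.get("exit_code") == 0 and not hits,
            "exit_code": result.get("exit_code"),
            "failure_markers": hits,
        }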

Memory and state: short-lived memory stores kept context (active shells, IPs/ports, credentials) across branches.

Prompt engineering: instructing the LLM to prefer verification (e.g., “only run this command if prior output contains X”) prevented blind destructive attempts.

Key Pitfalls & Failure Modes

Race conditions and ephemeral listeners. Some exploits spawn a listener for a short time; the automated chain must race to connect — fragile without tight orchestration.

Broken command pipelines. Real shells on exploited targets often behave differently (missing /bin/sh features, truncated streams, no tty). The LLM sometimes generated pipelines that worked in a proper bash but failed on the remote shell.

Ambiguous outputs. Services that return HTML or nonstandard errors can confuse the fingerprinting stage and lead to wrong exploit choices.

LLM hallucinations about commands/exploits. The LLM can invent plausible-sounding shell syntax or claim success when the structured parser shows nonzero exit codes — so always validate with programmatic checks.

Operational safety & ethics. Automating exploitation at scale increases risk — accidental scans/exploits against third parties, or repeated destructive actions. Always isolate tests to authorized labs (HTB, CTFs, owned infra).

Detection & noise. Automated, deterministic chains are easy to detect by IDS/EDR (patterned connection attempts, repeated probes). A human-in-the-loop or better opsec layer is necessary for stealthy engagements.

Practical Mitigations and Hardening

Use verification gates: parser nodes that require expected markers (e.g., file present, correct file size) before moving to the next stage.
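
A minimal sketch of such a gate, assuming the previous step ran something like stat -c '%n %s' on the target so its stdout contains the path and size; the path and size threshold in the usage comment are illustrative:

    import re

    def gate_file_present(stat_output: str, path: str, min_size: int = 1) -> bool:
        """Pass only if prior output proves the file exists and is non-trivial.
        Expects the stdout of something like: stat -c '%n %s' <path>"""
        match = re.search(re.escape(path) + r"\s+(\d+)", stat_output)
        return bool(match) and int(match.group(1)) >= min_size

    # Usage: advance to extraction only when the gate passes
    # if gate_file_present(last_result["stdout"], "/home/<user>/user.txt", min_size=32):
    #     ...stage the read/normalize step...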

Provide sandboxed execution (containers, ephemeral VMs) for risky exploit attempts and to capture side effects.

Add retry logic with backoff for race conditions and ephemeral listeners; but cap retries to avoid flooding the target.
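
Capped, jittered exponential backoff covers most of the ephemeral-listener races described earlier; a minimal sketch:

    import random
    import time

    def retry_with_backoff(attempt, max_attempts: int = 5, base_delay: float = 1.0, cap: float = 20.0):
        """Retry a flaky step (e.g. catching an ephemeral listener) with capped, jittered
        exponential backoff so the target is never flooded with attempts."""
        for n in range(max_attempts):
            result = attempt()
            if result is not None:
                return result
            time.sleep(min(cap, base_delay * 2 ** n) * random.uniform(0.5, 1.5))
        return None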

Keep an audit trail: every command the agent ran should be logged and reversible where possible.

Constrain LLM outputs by templates and enumerated allowed commands — prevent arbitrary command synthesis.
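
One way to enforce this is to let the model pick only from an enumerated set of parameterized templates and to quote every parameter before execution; a minimal sketch with illustrative template names and commands:

    import shlex

    # Enumerated allowlist: the model may only pick a template and supply parameters.
    ALLOWED_TEMPLATES = {
        "http_head": "curl -sI {url}",
        "read_file": "cat {path}",
        "list_dir": "ls -la {path}",
    }

    def render_command(template_name: str, **params) -> list[str]:
        """Reject anything outside the allowlist and quote every parameter so the model
        cannot smuggle in extra shell syntax. Execute the result with subprocess.run(cmd),
        never with shell=True."""
        if template_name not in ALLOWED_TEMPLATES:
            raise ValueError(f"template not allowed: {template_name}")
        safe = {key: shlex.quote(str(value)) for key, value in params.items()}
        return shlex.split(ALLOWED_TEMPLATES[template_name].format(**safe))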

Why This Matters — Potential Use Cases

Red team augmentation: automate repetitive discovery tasks and surface high-probability paths for human analysts to validate and finalize.

Blue team simulation: create realistic automated adversaries to test detection, logging, and incident response.

Training & labs: scale CTF-style problems and grading by programmatically verifying flags and solutions.

Responsible Cautions

This approach should not be used for unsanctioned scanning or exploitation. Automated agents lower the bar to action; make sure legal/ethical boundaries are explicit, and use a conservative policy (human approval required before any destructive step).

Lessons Learned & Next Steps

The LLM is extremely useful for triage and choosing a likely exploit, but not reliable as an unsupervised operator — always bind it to strong verification.

Improve robustness around shell stabilization (pty/expect emulation, multi-stage stagers, fallback strategies).

Consider hybrid agents: let the LLM propose a short list of actions, but require human confirmation for the final exploit stage.

Expand the toolkit with curated exploit templates (parameterized, tested in containers) to reduce hallucination risk.

Maintenance, Expertise, and Operational Cost

Automated LLM-driven agents are powerful, but they are not “set-and-forget” systems. They require ongoing maintenance, domain expertise, and careful operational controls. Key reasons and practical implications:

Model and Prompt Drift

As the underlying model, the exploited software landscape, and the wording of prompts evolve, the agent's outputs will change. Regular prompt review and regression testing of agent behavior are necessary to keep actions predictable.

Exploit Database Upkeep

Exploit templates and proof-of-concept code must be curated, versioned, and tested on representative containers/VMs. CVEs and exploit techniques change quickly — if you rely on stale templates you risk failure or, worse, unsafe behavior.

Testing & CI for Templates

Treat exploit templates like code: put them under version control, run automated tests in isolated sandboxes, and require code review for changes. Have a CI pipeline that validates expected success/failure markers.
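
A minimal pytest-style sketch of such a check, with a hypothetical template registry and a hypothetical sandbox-target host that would exist only inside the isolated CI environment:

    # test_templates.py -- run by CI against disposable lab containers only
    import subprocess

    # Hypothetical registry: each template names its command and a marker that proves success.
    TEMPLATES = {
        "http_fingerprint": {
            "command": ["curl", "-sI", "http://sandbox-target:8080/"],
            "success_marker": "HTTP/",
        },
    }

    def run_template(name: str) -> str:
        spec = TEMPLATES[name]
        proc = subprocess.run(spec["command"], capture_output=True, text=True, timeout=30)
        assert proc.returncode == 0, proc.stderr
        return proc.stdout

    def test_http_fingerprint_reports_expected_marker():
        assert TEMPLATES["http_fingerprint"]["success_marker"] in run_template("http_fingerprint")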

Operational Monitoring

Agents need observability: logs of every action, structured metrics (success rate, false positives, time-to-exploit), and alerts for anomalous behavior. This allows rapid detection of regressions and malicious use.

Secret and Credential Management

Many actions (e.g., staging payloads, SSH handoffs) require credentials or ephemeral keys. Integrate secret management (vaults) and rotate keys frequently. Do not bake secrets into prompts or workflow nodes.
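
A minimal sketch of the runtime-resolution pattern; it reads from the environment for brevity, where a real deployment would query a vault client:

    import os

    def resolve_secret(name: str) -> str:
        """Resolve secrets at execution time (here from the environment; a real deployment
        would query a vault client) so they never appear in prompts or workflow nodes."""
        value = os.environ.get(name)
        if value is None:
            raise RuntimeError(f"secret {name!r} is not provisioned")
        return value

    # The model only ever sees a placeholder such as "{{SSH_KEY_PASSPHRASE}}"; the executor
    # substitutes the real value just before the command runs and redacts it from logs.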

Human-in-the-Loop Governance

Enforce approval gates for potentially destructive steps. Even if the agent reports high confidence, require a human operator to approve final execution in non-lab environments.
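
A minimal sketch of a blocking approval gate; in practice the prompt would be routed to a ticketing or chat system rather than stdin:

    def require_approval(action: dict, approver=input) -> bool:
        """Blocking approval gate: a destructive step pauses until a human confirms it."""
        print(f"[APPROVAL NEEDED] target={action['target']} command={action['command']}")
        return approver("Type 'approve' to execute this step: ").strip().lower() == "approve"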

Skill Requirements

Running these agents well requires a multidisciplinary team: prompt engineers, security researchers (to curate exploits and interpret results), SRE/devops (to maintain n8n infrastructure and sandboxing), and legal/ops staff (to set policies and guardrails).

Regular Red-Team/Blue-Team Exercises

Use scheduled simulations to verify agent behavior against detection tools and to teach defenders how to interpret agent-driven adversaries. This also surfaces deficiencies in templates and orchestration.

Legal and Authorization

Automating exploitation expands scale, so ensure explicit authorization, audit logging, and a written policy on which targets and tests are allowed. Keep legal counsel involved for non-trivial or real-world engagements.

Cost & Resource Management

Running large numbers of nodes, sandbox VMs, and repeated tests costs CPU, storage, and human time. Budget for continuous testing and maintenance.

Practical Checklist for Teams Adopting LLM-Driven Agents

  • Keep exploit templates in a git repo with CI tests that run in isolated containers.
  • Require human approval gates for any destructive action outside a lab.
  • Integrate a secret manager (Vault) and ensure no secrets appear in logs or prompts.
  • Add structured telemetry and dashboards for success rates and anomalies.
  • Schedule periodic prompt and template reviews (monthly or after major model updates).
  • Provide operator training and documented runbooks for handling failures and rollbacks.

Final Thoughts

I think this experiment shows a glimpse of a middle ground between fully manual testing and a fully autonomous agent: the LLM supplies adaptive reasoning, while workflow automation provides the determinism that exploitation needs. The sweet spot is a verified, auditable, modular architecture where LLMs recommend and orchestrate, and humans or strict verifiers sign off on destructive actions. With careful controls, this pattern can speed up assessments, scale labs, and improve both offensive and defensive tooling.

