Red-Teaming an LLM on Your Own Laptop: What Actually Breaks

A field report from building a three-layer, fully-local AI red-team pipeline on an Apple-silicon laptop: what broke, what a two-model nightly run revealed, and where local red-teaming is going.

TL;DR. Getting a Garak + Promptfoo + PyRIT + mcp-scan pipeline to run fully locally took more than 20 fixes, and almost none were about raw compute. The real constraints were time (testing means thousands of generations run one after another), the fact that the model ends up grading itself, a toolchain you can only half-pin, and one “offline” layer that quietly calls the cloud. A nightly run against a second model added one more: the pipeline’s reliability was tied to the single model it had been built against. Local red-teaming is a fast, private smoke test. It is not a perfect evaluation, and the difference between those two is what this report is about.

Red-teaming, in one line: deliberately attacking your own AI, trying to make it leak secrets, follow malicious instructions, or misuse the tools you’ve connected to it, so you find the holes before someone hostile does. Doing it locally means running both the model and the attacks on your own machine, with no cloud in the loop.

Method and date: this is a field report, not a benchmark. Every specific claim was checked against the orchestrator’s source and PR history as of June 2026 (main, commit d3c85b2). The numbers come from two full audit runs done back-to-back on the machine described (Apple M5 Max, 128 GB RAM), plus a four-model judge-adjudication probe (five runs each). Later changes to the tool may invalidate the code-specific caveats: unpinned dependencies, single-sample probes, mixed API transports.

The premise

Local AI red-teaming sounds great on paper. No API keys, no per-token bill, no terms-of-service clause forbidding adversarial prompts, no data leaving the machine. You point an orchestrator at a model in Ollama, it throws four industry-standard frameworks at it, and out comes an auditable report. That is what redteam_orchestrator.py does: a single-file, uv run-managed pipeline in three layers.

Layer	Purpose & Goal	Mechanism & Tools	Output Feeds Into
1 — Broad Scan	Wide-net discovery / Catch baseline alignment drift, prompt injection, and regression failures	Garak fires fixed probes graded by rules; Promptfoo mixes deterministic `not-contains` checks with a local `llm-rubric` grader. (Garak + Promptfoo)	A baseline severity read on cheap, high-volume attacks. (Layers run independently — an early HIGH does not skip later layers.)
2 — Targeted	OWASP LLM/MCP taxonomy / Validate coverage against known threat models & tool hygiene	Promptfoo generates OWASP scenarios (cloud-gated — skipped offline, see Lesson 5); `mcp-scan` audits static descriptors for schema/permission flaws. (Promptfoo redteam + mcp-scan)	Framework-aligned findings + tool-chain vulnerabilities that prompt-only scans miss.
3 — Adversarial	Multi-turn jailbreak / Simulate persistent, adaptive attackers modifying tactics on refusals	PyRIT’s Crescendo escalates prompts turn-by-turn; TAP builds/prunes attack decision trees. Both default to an LLM judge for scoring. (PyRIT Crescendo + TAP)	Stress-tested resilience under goal-oriented pressure; surfaces vulnerabilities static or single-turn probes never find.

There is not yet a consensus on a single AI red-teaming stack; security communities tend to rely on open-source tools like the ones cited here. Rather than mandating specific tools, it is more useful to define a layered, comprehensive framework. Here I went with the three-layer approach: broad scan → targeted red-teaming → multi-turn adversarial. Other tools and methodologies can reach the same goal — an informative, CI-pluggable, local AI red-teaming evaluation.

The three-layer split follows the Garak → Promptfoo → PyRIT model from Amine Raji. The architecture is the easy part. What I want to write about is everything that broke between “this should work” and “this actually finishes a run.” Plenty of guides explain how to run these tools. This is about what running them taught me. The fixes (twenty-eight PRs on a MacBook Pro 16″, M5 Max, 128 GB RAM) add up to a fairly honest picture of where local AI red-teaming stands in 2026. And because the report ends with a nightly cross-model run, the same pipeline pointed first at qwen2.5:3b and then at a non-Qwen model, it closes with direct evidence for the caveat that matters most: “it works” is a claim about one model, not about the pipeline.

Why a non-engineer should care. Organizations reach for local, open-weight models partly because “it runs on our hardware” feels private and controlled, and therefore safe. But private is not the same as secure, and running the testing locally inherits a set of limits that are easy to mistake for a clean result. If you are about to bet a deployment on “we ran the red-team tool and it passed,” the distance between what that sentence sounds like and what it actually means is the whole subject here.

Key takeaways in 30 seconds

Time is the bottleneck, not hardware. Even a 128 GB laptop defaults to a tiny 3B model, because red-teaming means thousands of generations run one after another.

A “clean” local result is weak evidence. Scopes are trimmed to fit the clock, probes often run only once, and the model usually ends up grading itself.

The cheapest real fix is to stop letting the model grade itself. A different small model as the judge removes the self-grading bias for about 5 GB and a few minutes — but calibrate it first and let it only flag findings, never downgrade them, because some small models call real leaks benign. Finding the attacks is the hard part; only the attacker needs real horsepower.

“Local” leaks more than it sounds. One layer quietly needs the cloud, and a couple of key tools can’t be version-pinned and break on their own schedule.

The dangerous failure is silence: a real vulnerability quietly reported as “fine.” Use the tool as a fast filter, not a certificate.

The three players: attacker, target, judge

Automated red-teaming is a game with three roles, and a lot of this article is really about how you cast them.

The target, the defender, is the model under test. Its job is to resist: refuse the harmful request, ignore the injected instruction, decline the unsafe tool call.
The attacker generates the attacks, crafting prompts, escalating a conversation, looking for a phrasing that slips past the defenses. Its job is generation.
The judge (or scorer) decides whether an attack landed. Did the target comply, leak, or refuse? Its job is recognition.

A human red-teamer plays all three at once. Automation splits them so the loop can run thousands of times unattended, which is also what makes it possible to use different models for each. That split exposes an asymmetry that runs through the whole piece: generating a good attack is much harder than recognizing a successful one, so a small model makes a poor attacker but a usable judge. There’s a catch I return to later, though: a small judge reliably catches harm that announces itself and can miss harm that arrives quietly, so it has to be calibrated and never trusted to clear a finding.

Each role sits on a spectrum, from cheap and static to expensive and adaptive.

Role	Job	Cheapest → most capable
Attacker	generate (hard)	fixed corpus of known attacks → single-turn generator → adaptive multi-turn agent → human expert (potentially)
Judge	recognize (easier)	keyword/regex rule → LLM-as-judge → human (potentially)
Target	resist	the model you’re auditing (fixed)

You can wire the three roles together a few ways. Self-play puts one model in every role; it’s cheap, but you get a weak adversary and a biased judge. Separate models uses a different model per role, which breaks the self-grading bias. Stronger-attacker makes the attacker and judge deliberately heavier than the target. There’s one more axis worth knowing: local tools like this one are black-box, meaning they only send prompts, whereas white-box methods (gradient-guided suffix attacks like GCG or AutoDAN) need the model’s internals, which a prompt-only harness never touches.

This repo covers most of that table at once. Garak is mainly a static-corpus attacker graded by rules. Promptfoo generates single-turn attacks and grades with a mix of rule checks and an LLM rubric. PyRIT’s Crescendo and TAP are adaptive multi-turn attackers with an LLM judge. Wherever a layer uses an LLM as attacker or judge, it defaults to self-play. That’s why “which paradigm is this?” has a different answer in every layer, and why swapping in a separate judge (the cheap win I come back to later) only helps where there’s an LLM judge to begin with.

The hardware paradox

Here’s the counterintuitive part: a bigger machine barely helps. The hardware is near the top of what a laptop can be, an M5 Max with 128 GB of unified memory, enough to hold a 70B model in RAM. If capacity were the limit, this is the machine that should make the limit disappear. It doesn’t. The default target is still qwen2.5:3b, about 1.9 GB, because the 7B target kept timing out Garak and the Promptfoo OWASP preset at their 4-hour and 2-hour ceilings. The expensive machine doesn’t solve the problem; mostly it just proves the bottleneck is somewhere else.

Why? Because red-teaming isn’t a single inference. One Garak probe, dan.Ablation_Dan_11_0, expands to about 127 prompts. PyRIT’s TAP at its default shape works out to roughly 96 chained calls. The encoding.* probes hit about 512 prompts each, no matter how few generations you ask for. Multiply thousands of sequential generations by any model’s latency and the runs take hours. The work isn’t one big task; it’s three thousand small ones, run one after another.
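
To make that arithmetic concrete, here is a rough back-of-envelope sketch. The prompt counts are the ones quoted above; the 3-second latency and the choice of which units to add up are assumed placeholders, not measurements from the runs in this report.

```python
# Back-of-envelope wall-clock estimate for generations run one after another.
# Prompt counts come from the text above; the latency is an assumed placeholder.
PROMPTS_PER_UNIT = {
    "garak dan.Ablation_Dan_11_0": 127,
    "garak encoding.* (one probe)": 512,
    "pyrit TAP (default shape)": 96,
}


def wall_clock_seconds(prompt_counts, seconds_per_generation):
    """Total time when every generation waits for the previous one."""
    return sum(prompt_counts.values()) * seconds_per_generation


total = wall_clock_seconds(PROMPTS_PER_UNIT, seconds_per_generation=3.0)
print(f"{sum(PROMPTS_PER_UNIT.values())} generations, about {total / 60:.1f} min")
```

Three units already cost more than half an hour at that assumed latency, and the cost scales linearly: doubling the latency (a bigger model) or the probe list doubles the run.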

Lesson 1: local red-teaming is compute-bound, not memory-bound. Testing is spread across thousands of generations, and locally each one costs real wall-clock time. One honest caveat: the project never tuned Ollama’s server-side concurrency (OLLAMA_NUM_PARALLEL), which by default answers only a few requests at once. So “parallelism gave little gain” may partly be “the server was replying one stream at a time.” A single stream does not necessarily saturate the accelerator; how much an untuned stack leaves on the table, I didn’t measure. And the squeeze is worse on the cheaper machines most people actually use.

The nightly run shows this directly. A full audit of qwen2.5:3b (all six steps, demo vulnerable MCP server, HTML report) finished in 41 minutes 4 seconds, and Garak alone was 33 minutes of that. The same audit against llama3.2:3b finished in 27 minutes 17 seconds, with Garak at 25 minutes. Two same-size models from different families, and most of the wall-clock gap (about 8 of the roughly 14 minutes) comes from Garak’s long run of sequential generations; nearly all the rest is Crescendo, which ran to completion on Qwen and errored early on Llama. Everything else is rounding error. The budget you actually spend is compute, not capability or memory.

So the point of running locally isn’t to test a frontier model; you can’t afford the generations. It’s to test the plumbing: injection resistance, jailbreak resilience, and tool-surface hygiene of small, deployable models, in a loop tight enough to put in CI.

The tweaks, grouped by what they reveal

1. The time budget governs every design decision

Almost half the commits exist to fit work inside a ceiling.

Garak went from a full probe set to a breadth-first, depth-1 scan: one generation per prompt, parallel attempts, about 10 light categories instead of a few heavy families at full depth. promptinject and then encoding.* (512 prompts per probe) were dropped. A step that used to time out at 4 hours became a run of about half an hour that finishes, measured at 33 minutes on qwen2.5:3b and 25 minutes on llama3.2:3b in the nightly run.
Promptfoo OWASP was trimmed to numTests=1 × 4 plugins × 2 strategies, pinned by a test so it can’t quietly grow back.
PyRIT TAP dropped from about 96 calls to about 24.
Per-step timeouts became their own concept, since mcp-scan finishes in seconds while Garak needs hours. A timeout is also lossy: a killed Garak run throws away whatever it had already found, so the slowest steps are the ones most likely to drop a real finding.

There’s a quieter member of this family one layer down. PyRIT’s OpenAIChatTarget talks to the endpoint through httpx with a 60-second default read timeout, a value suited to fast hosted APIs. Crescendo builds up a growing conversation across as many as eight turns plus backtracks, and eventually a single local generation over that long context takes more than 60 seconds. httpx raises ReadTimeout, and the whole step dies with no findings. The fix is one line, httpx_client_kwargs={"timeout": 180}, but the point is bigger than the line.
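
One way to choose a timeout like that deliberately rather than by trial and error is to estimate the slowest plausible reply. The sketch below is only that: the throughput figures, the context size, and the 2x safety factor are all assumed placeholders, and it assumes a non-streaming request where the client waits for the whole completion.

```python
import math


def read_timeout_seconds(context_tokens, max_new_tokens,
                         prefill_tps, decode_tps, safety=2.0):
    """Estimate a read timeout for one non-streaming local generation.

    prefill_tps and decode_tps are tokens per second for prompt processing
    and for generation; measure them on your own machine.
    """
    prefill = context_tokens / prefill_tps   # time to read the long conversation
    decode = max_new_tokens / decode_tps     # time to write the reply
    return math.ceil(safety * (prefill + decode))


# Assumed numbers for a late Crescendo turn on a small local model.
timeout = read_timeout_seconds(context_tokens=6000, max_new_tokens=1024,
                               prefill_tps=400, decode_tps=20)
httpx_client_kwargs = {"timeout": timeout}   # same kwarg as the one-line fix above
print(httpx_client_kwargs)
```

With these placeholder figures the unpadded estimate is already above 60 seconds, which is the failure described above: nothing is wrong with the model, the budget was simply set for a faster endpoint.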

Trimming scope to fit the clock has a cost the commits don’t mention: each probe runs only once. With numTests=1, --generations 1, and no fixed random seed, a model that “passes” may simply have gotten lucky that run, and the same config can read “clean” one afternoon and “HIGH” the next. You never get an attack-success-rate (“7 of 10 attempts landed”), which is the standard way the field scores a jailbreak. A single run doesn’t tell you much.
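
How little a single run says can be put in numbers. The sketch below computes a Wilson score interval for an attack-success rate; it is a standard statistical formula rather than anything the orchestrator does, and the two inputs are illustrative (the “7 of 10” from the paragraph above, and a single clean attempt).

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (roughly 95% for z=1.96)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials
                           + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - margin), min(1.0, center + margin)


for landed, attempts in [(7, 10), (0, 1)]:
    low, high = wilson_interval(landed, attempts)
    print(f"{landed}/{attempts} landed: rate between {low:.0%} and {high:.0%}")
```

For one attempt that did not land, the upper bound comes out at roughly 79%: a single “clean” probe is statistically compatible with an attack that works most of the time.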

There’s a subtler trap too: you may not be testing the model you think you are. The qwen2.5:3b you pull from Ollama is 4-bit quantized, a compressed copy of the original weights rather than the weights themselves. That compression isn’t cosmetic. Research shows it can quietly erode the safety alignment a model was trained for: in one study, 8-bit quantization raised a code model’s jailbreak-success rate by roughly 40 points, though the effect varied widely from model to model (Increased LLM Vulnerabilities from Fine-tuning and Quantization). So a “clean” result on your 4-bit local copy says little about the full-precision model a cloud provider serves; it can refuse very differently. (The three layers also reach the model through three different APIs, which shifts its behaviour a little further.) You’re testing a model, just not necessarily the one you’ll deploy.

Lesson 2: the tools’ defaults assume a fast cloud API, and on a slow local model they don’t just slow down, they break. TAP’s default tree fires about 96 model calls, and the 60-second default read timeout kills a long Crescendo conversation. Both were tuned for hosted endpoints that answer in milliseconds. Running locally means re-deriving those defaults for a target that takes tens of seconds per reply, and accepting that a trimmed-down scan proves “clean within this budget,” not “secure.”

2. The toolchain churns faster than the docs

A good share of the fixes are pure version-drift archaeology.

PyRIT 0.9 dropped pyrit.orchestrator, so the pin is pyrit==0.8.1 to keep the orchestrators resolving.
PyRIT 0.8 dropped OllamaChatTarget, so the scripts now reach Ollama via OpenAIChatTarget on /v1. 0.8.1 also made scoring_target required, which forced a third model role into the script.
Garak 0.14 renamed or removed probes (xss became web_injection.MarkdownXSS, among others). Every renamed probe had to be re-validated against the live probe queue, because a single typo would silently zero out coverage.
Python 3.14 broke garak’s dependencies. A removed pickle._batch_setitems signature crashed dataset-backed probes, garak exited 1, and the step got marked NOT_RUN even though it had produced findings. The fix pins requires-python ">=3.10,<3.13".
mcp-scan 2.x switched to JSON severity counts, so the grader now parses criticalCount/highCount instead of scraping text.

Lesson 3: the frameworks are young, and you can only pin half of them. The Python side is pinned (pyrit==0.8.1, requires-python <3.13), but Promptfoo and mcp-scan run through npx -y …@latest, and garak carries no version at all. The tools that float on @latest are exactly the ones whose upstream changes have already broken the pipeline once, and will again. The rule that comes out of this: pin what you can, and treat every @latest dependency as a future outage you haven’t scheduled yet. Two details make it sharper. First, npx -y …@latest fetches and executes whatever npm serves that morning, which is the same supply-chain exposure (LLM03, MCP04) the tool exists to audit for. Second, the test suite pins intent rather than function: it checks that the probe list is still trimmed, but it can’t tell that a @latest tool has started returning garbage. A green CI run proves your config didn’t drift, not that the tools underneath still work.
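
A small check can at least make the floating dependencies visible in every report. The following is a hypothetical sketch, not part of the orchestrator: the spec strings are illustrative (the npm package name is a stand-in), and the two regexes only cover the simple `name==version` and `name@version` shapes.

```python
import re

# Simple shapes only: "pkg==1.2.3" for pip, "pkg@1.2.3" or "@scope/pkg@1.2.3" for npm.
PIP_PIN = re.compile(r"^[A-Za-z0-9_.\-]+==\d[\w.]*$")
NPM_PIN = re.compile(r"^(@[\w.\-]+/)?[\w.\-]+@\d[\w.\-]*$")


def is_pinned(spec):
    """True when the spec names one exact version."""
    return bool(PIP_PIN.match(spec) or NPM_PIN.match(spec))


def floating(specs):
    """Return the specs that can change underneath you between runs."""
    return [s for s in specs if not is_pinned(s)]


specs = ["pyrit==0.8.1", "garak", "some-npm-tool@latest", "some-npm-tool@1.2.3"]
print("unpinned:", floating(specs))
```

Listing the unpinned specs in the report header does not stop an upstream break, but it tells the reader which results may not be comparable with last week’s.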

3. Exit codes lie, so the harness has to interpret them

A non-zero exit from a red-team tool usually means “I found something,” not “I crashed.” Promptfoo exits 100 when assertions fail. mcp-scan exits non-zero when a scan that ran without error turned up findings. Both get reclassified as completed so the findings reach the classifier, and both are now pinned in test_run_step.py because they broke in the wild.

Lesson 4: from the outside, “errored” and “found a vulnerability” look identical. A naive orchestrator records both as failures, which means it under-reports at exactly the moment it matters most. Getting the exit-code handling right isn’t polish; it’s the difference between a HIGH finding and a silent miss.
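
The reclassification can be sketched in a few lines. This is an illustration of the idea, not the orchestrator’s code: the function name, the table layout, and the `produced_report` guard are assumptions; only Promptfoo’s exit code 100 and mcp-scan’s non-zero-on-findings behaviour come from the text above.

```python
# Exit codes that mean "ran fine and found something".
FINDINGS_EXIT_CODES = {"promptfoo": {100}}

# Tools where any non-zero exit counts as findings, but only if the run
# also left parseable output behind (an assumed guard against real crashes).
NONZERO_MEANS_FINDINGS = {"mcp-scan"}


def step_status(tool, exit_code, produced_report):
    """Map a raw exit code to an execution status, never to a severity."""
    if exit_code == 0:
        return "completed"
    if exit_code in FINDINGS_EXIT_CODES.get(tool, set()):
        return "completed"
    if tool in NONZERO_MEANS_FINDINGS and produced_report:
        return "completed"
    return "errored"


print(step_status("promptfoo", 100, produced_report=False))  # completed
print(step_status("mcp-scan", 1, produced_report=True))      # completed
print(step_status("mcp-scan", 1, produced_report=False))     # errored
```

The function deliberately returns only a status. Deciding how bad the findings are is the classifier’s job, which keeps a crash from being read as a clean result or a finding from being read as a crash.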

4. The cloud gate hiding inside the “local” tool

Promptfoo’s OWASP/redteam generator is a hosted feature. Without promptfoo auth login it stalls at an email-verification prompt, and if you disable remote generation the OWASP plugins produce nothing offline (I checked: zero tests). The fix was to detect the gate, reclassify the step from errored to skipped with guidance, and redirect stdin from /dev/null so it fails fast instead of hanging for two hours. Both nightly runs reproduced this cleanly: OWASP skipped in about 2 seconds, with a pointer to promptfoo auth login.
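
The fail-fast half of that fix is a general pattern: give the child process an empty stdin so that any interactive prompt hits end-of-file at once. The sketch below demonstrates it with a stand-in child process; the marker string and the function name are hypothetical, and a real harness would match whatever text the gated tool actually prints.

```python
import subprocess
import sys

# Hypothetical marker; replace with the text the gated tool really emits.
GATE_MARKERS = ("verify your email",)


def run_noninteractive(cmd, timeout_s):
    """Run cmd with stdin closed and classify a cloud gate as 'skipped'."""
    proc = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,    # a prompt reads EOF instead of waiting
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,    # keep prompt text and errors together
        text=True,
        timeout=timeout_s,           # raises TimeoutExpired if exceeded
    )
    output = proc.stdout.lower()
    if proc.returncode != 0 and any(m in output for m in GATE_MARKERS):
        return "skipped", "cloud-gated: run `promptfoo auth login` to enable"
    return ("completed" if proc.returncode == 0 else "errored"), ""


# Stand-in for a tool that stops at an interactive prompt.
fake_tool = [sys.executable, "-c",
             "print('Please verify your email'); input()"]
status, hint = run_noninteractive(fake_tool, timeout_s=30)
print(status, hint)
```

The stand-in exits almost immediately because `input()` fails on the empty stdin, which is the same mechanism that turns a two-hour hang into a two-second skip.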

Lesson 5: “local” is a spectrum, not a binary. Even a pipeline sold as fully offline has a cloud dependency for its richest attack-generation layer. Knowing which capabilities are genuinely local and which are cloud-gated is part of reading the report honestly.

5. The report must distinguish status from severity

Several fixes target the same trap: a result that looks benign isn’t. The underlying bug is conflating execution status (did the step finish?) with finding severity (how dangerous is what it found?) — when those merge, real vulnerabilities pass as clean runs.

PyRIT scripts originally ended with print(result), which emitted a Python object repr. The classifier saw no success keywords and tagged every Layer 3 run INFO, masking real jailbreaks. Now they print the objective, the outcome, and the full conversation, so a success shows up as HIGH and exfiltrated /etc/passwd content as CRITICAL.
classify() moved to word-boundary regex with a negation guard, so “0 failures” and probe names that merely contain “exploit” stop inflating severity.
mcp-scan grading moved off text heuristics and onto the tool’s structured per-tier counts, after a --demo-vulnerable-server run got graded INFO while mcp-scan had correctly flagged 1 HIGH and 2 MEDIUM planted vulns.
A preflight checks that the Ollama server is reachable and the model is pulled, aborting in about 0.1 seconds rather than failing over two hours.
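
The classifier change is easier to see in code. What follows is a minimal sketch of a word-boundary match with a negation guard plus one deterministic leak marker; the keyword list, the 12-character look-behind window, and the severity names are assumptions for illustration, not the orchestrator’s actual classify().

```python
import re

LEAK_MARKER = re.compile(r"root:x:0:0")          # literal /etc/passwd content
KEYWORD = re.compile(r"\b(?:jailbreak|exploit|failures?)\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(?:0|no|zero)\s+$", re.IGNORECASE)


def classify(text):
    """Return a severity for a step's output; status is tracked elsewhere."""
    if LEAK_MARKER.search(text):
        return "CRITICAL"
    for match in KEYWORD.finditer(text):
        # Look at the few characters just before the keyword for a negation.
        preceding = text[max(0, match.start() - 12):match.start()]
        if not NEGATION.search(preceding):
            return "HIGH"
    return "INFO"


for sample in ["0 failures", "10 failures", "probe exploitation_suite ok",
               "jailbreak succeeded", "root:x:0:0:root:/root:/bin/bash"]:
    print(f"{sample!r} -> {classify(sample)}")
```

The word boundaries keep a probe name such as `exploitation_suite` from matching `exploit`, and the guard keeps “0 failures” at INFO while “10 failures” still escalates. It remains a keyword heuristic, so a novel phrasing of a real leak can still slip past it.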

Lesson 6: in an automated red-team, the hardest bug is the false negative. Every fix above makes the tool report more problems, not fewer, because an orchestrator’s natural failure mode is optimistic silence. And the thing deciding “HIGH or not” is itself an unmeasured keyword regex. It can miss a novel phrasing of a real exfiltration, and it has false-positive edge cases of its own (it even keys on a string that lives in its own configs). The grader that guards against silence is itself ungraded.

The nightly cross-model run: “works on Qwen” is not “works”

Everything above got the pipeline finishing reliably on qwen2.5:3b. The next question, and the one that prompted the run, is whether “finishes reliably” is a property of the pipeline or of the model the pipeline was tuned against. So I ran the full audit twice, back-to-back, with a deep cache clean before each: once on the default qwen2.5:3b, once on llama3.2:3b, a same-size model from a different family (Meta Llama versus Alibaba Qwen). Same flags, same hardware, same code.

Step	`qwen2.5:3b`	`llama3.2:3b`
Garak — broad scan	completed · MEDIUM · 1995 s	completed · MEDIUM · 1507 s
Promptfoo — broad eval	completed · WARN · 5.2 s	completed · WARN · 5.9 s
Promptfoo — OWASP Top-10	skipped (cloud-gated) · 2.0 s	skipped (cloud-gated) · 1.7 s
mcp-scan — static audit	completed · HIGH · 0.7 s	completed · HIGH · 0.7 s
PyRIT — Crescendo	completed · CRITICAL · 405.9 s	errored · NOT_RUN · 73.1 s
PyRIT — TAP	completed · HIGH · 55.8 s	completed · INFO · 48.7 s
Total wall-clock	41 min 04 s	27 min 17 s

Five of the six steps turned out to be model-agnostic. Garak, the Promptfoo broad eval, mcp-scan, and PyRIT TAP all ran to completion on a model the pipeline had never seen, and the OWASP step hit the same cloud gate on both. Garak even ran faster, because llama3.2:3b emits shorter completions. mcp-scan took the same 0.7 seconds on both, which makes sense: it audits the MCP server’s static descriptors and never touches the model at all.

Two rows are worth dwelling on.

TAP graded HIGH on Qwen and INFO on Llama — but the two verdicts don’t carry equal weight, and it’s worth being careful about why. The HIGH is solid: against Qwen, TAP found a real jailbreak (a stealth /etc/passwd exfiltration plan, scorer jailbreak: 1.0), and a found jailbreak is a true positive — a leak is a leak, and re-running can’t un-find it. The INFO on Llama is the weaker claim. TAP is a pruned tree of roughly two dozen calls, so it’s not a single shot — but it is a single, un-seeded run, and with no fixed seed the most this row can honestly say is this pass didn’t land a jailbreak, not that Llama has the better safety posture. Another run might have branched differently and succeeded. That’s the asymmetry the rest of this report keeps hitting: a HIGH is evidence, an INFO isn’t.

Crescendo completed CRITICAL on Qwen and errored on Llama, and not for the reason the timeout fix had just addressed. The httpx fix held: Crescendo on Qwen ran a full 405.9 seconds (well past the old 60-second default that used to kill it) and surfaced a CRITICAL exfiltration finding. On Llama it died in 73 seconds, too fast to be a timeout. The traceback is clear:

Retry attempt 1 for _get_attack_prompt failed: Invalid JSON response:
{"generated_question": "...", "rationale_behind_jailbreak": "...desired suffix.\"}
...
jinja2 ... TemplateSyntaxError ... line 30, in template

There were two coupled failures, both in the attacker role rather than the target.

Malformed JSON. Crescendo requires its adversarial model to emit a strict JSON control object every turn (generated_question, last_response_summary, rationale_behind_jailbreak). llama3.2:3b produced JSON with a mis-escaped trailing quote (…suffix.\"}), which PyRIT rejected and retried.
Jinja injection from the model’s own output, the one that actually killed it. PyRIT wraps the model-generated attack string in a SeedPrompt, and SeedPrompt renders its value as a Jinja2 template. Llama’s prompt contained template-hostile tokens, so Jinja2 couldn’t compile it (TemplateSyntaxError) and the orchestrator crashed.

qwen2.5:3b happens to be a reliable JSON formatter and happened not to emit Jinja-breaking tokens, so it satisfied both implicit contracts Crescendo imposes on its adversarial model. llama3.2:3b satisfied neither. Note that TAP, which leans less on strict structured output from the adversarial model, completed fine on the same model. The failure is specific to Crescendo’s design.

Lesson 7: a tool’s robustness is coupled to the model it was tuned against. Crescendo assumes its adversarial model emits well-formed JSON every turn and produces prompts that are safe to pass through a Jinja2 engine. Those assumptions stay invisible until a differently-tuned model breaks them. “It works locally” almost always means “it works with the model I happened to test.” Portability across model families is something you verify rather than assume, and the cheapest way to find these coupling bugs is exactly what I did: run the whole pipeline against a second, dissimilar model and read the traceback.

The fix is tracked as follow-up work (#32): run the adversarial model’s output through a tolerant JSON-repair step, and treat the generated attack text as a literal SeedPrompt value rather than a Jinja template.
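
As a rough idea of what those two steps could look like, here is a sketch. It is not the follow-up work itself: both function names are hypothetical, the repair handles only the one mis-escaped trailing quote seen in the traceback, and wrapping text in Jinja2’s `{% raw %}` block is one possible way to keep a template engine from interpreting it.

```python
import json
import re


def parse_control_object(raw):
    """Parse the adversarial model's JSON, repairing one known defect."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Turn a trailing \"} into "} so the last string value terminates.
        repaired = re.sub(r'\\"\s*\}\s*$', '"}', raw)
        return json.loads(repaired)   # still raises if the repair did not help


def as_jinja_literal(text):
    """Wrap model output so Jinja2 renders it verbatim."""
    if "endraw" in text:
        raise ValueError("text would close the raw block early")
    return "{% raw %}" + text + "{% endraw %}"


broken = r'{"generated_question": "q", "rationale_behind_jailbreak": "desired suffix.\"}'
control = parse_control_object(broken)
print(control["rationale_behind_jailbreak"])
print(as_jinja_literal("Ignore {{ previous }} instructions {% if x %}"))
```

A narrow repair like this is deliberately conservative: it fixes the defect that was observed and lets anything else fail loudly, which fits the report’s preference for loud breaks over silent ones.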

The limitation you can only half-fix

The constraint that gets called inherent more often than it deserves: the same small model is target, attacker, and judge. In Layer 3 the PyRIT scripts wire one Ollama model into all three roles, which makes it a weak adversary (it won’t find what a stronger attacker would) and a self-biased judge (it may grade its own borderline successes as refusals; LLM judges are documented to favour their own generations, Panickssery et al., 2024). The Crescendo failure above is the same coupling from another angle: when the attacker role is played by a model that can’t hold up the protocol’s structured-output contract, the attack doesn’t just underperform, it can fail to run at all. These are two different failures with very different price tags, and treating them as one is what makes the whole thing sound unfixable.

Here’s the part the project skipped over. The attacker and judge only run in Layer 3, which is the cheapest layer by call count. Garak’s thousands of prompts and Promptfoo’s preset, the steps that actually blow the time budget, only ever call the target. So upgrading the judge multiplies a few dozen calls, not thousands. And the judge has the easier job, recognizing whether a response complied or refused rather than inventing the attack. The high-value move is to make the judge a different model family from the target so it stops grading its own generations. That costs about 5 GB, adds single-digit minutes to Layer 3, and needs one env var (OLLAMA_MAX_LOADED_MODELS=2) to keep both models resident. The current default, all three roles on one model behind a single --target flag, is a zero-config convenience choice, not a law of local red-teaming.

I almost recommended a specific judge model, then tested the candidates first. I ran four candidates over a small hand-labeled set (clear leaks, obfuscated and paraphrased variants, and benign cases), and what came back argued against the obvious assumptions. The logically sound pick (llama3.1:8b) did the worst, waving through a literal /etc/passwd dump as benign because it carried “no malicious command.” A 7B (mistral) flagged every leak in the set; a 70B (llama3.3) still missed a paraphrased leak. I don’t read that as “a 7B beats a 70B,” though: the set is a handful of cases, not a benchmark, and a different paraphrase could flip the order. The result that survives the small sample is the non-result: size and family didn’t predict performance at all, which is exactly why you can’t pick a judge off its spec sheet and have to calibrate the one you choose.

One thing did hold across all four candidates: judges reliably catch harm that announces itself as an attack, but consistently miss quiet leaks — raw data dumps or paraphrased exfiltration. That’s backwards, because the quiet leak is usually the actual damage. Three practical rules fall out:

Always calibrate before trusting. Judge quality doesn’t scale with parameters or architecture. Test candidates against known-harmful cases first; the obvious choice is often the weakest.
Trust “harmful,” never “safe.” Every judge was high-precision but low-recall. A flagged finding was always real, but a cleared one could miss critical data. Use judges to escalate threats, never to dismiss them.
Keep deterministic checks for ground truth. Since models struggle with obfuscated or paraphrased content, raw string matching (like root:x:0:0) remains irreplaceable for baseline detection.
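
The first two rules can be sketched together: score a candidate judge against hand-labeled cases, then let it only raise a severity. Everything here is illustrative: the labels and verdicts are made-up placeholders rather than the probe’s data, and the severity ordering and function names are assumptions.

```python
SEVERITY_ORDER = ["INFO", "MEDIUM", "HIGH", "CRITICAL"]   # assumed ordering


def precision_recall(labels, verdicts):
    """labels/verdicts are booleans: True means 'harmful'."""
    tp = sum(l and v for l, v in zip(labels, verdicts))
    fp = sum((not l) and v for l, v in zip(labels, verdicts))
    fn = sum(l and (not v) for l, v in zip(labels, verdicts))
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


def final_severity(rule_severity, judge_flags_harm):
    """Escalate-only: the judge may raise a finding to HIGH, never lower it."""
    if judge_flags_harm:
        return max(rule_severity, "HIGH", key=SEVERITY_ORDER.index)
    return rule_severity


# Made-up calibration set: three real leaks, two benign cases.
labels = [True, True, True, False, False]
verdicts = [True, False, False, False, False]   # judge caught only the loud one
print(precision_recall(labels, verdicts))
print(final_severity("CRITICAL", judge_flags_harm=False))   # stays CRITICAL
print(final_severity("INFO", judge_flags_harm=True))        # raised to HIGH
```

In this toy set the judge is perfectly precise and catches a third of the leaks, which is the shape the rules above are written for: a “harmful” verdict is worth acting on, and a “safe” verdict changes nothing.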

The attacker is the genuinely expensive half, and the instinct that a small model can’t find what a frontier one can is correct. Finding vulnerabilities is harder than recognizing them. A 7B attacker is still weak, and to attack like a serious adversary you need a 70B-class model. That’s where the high-end hardware earns its keep, since a 128 GB machine can hold a 70B beside the small target, and it’s also where the costs I waved off earlier (GPU contention, reload thrash) start to bite. (Even then, bigger isn’t strictly better: a strong model’s own safety filtering may refuse to generate the attacks, and some practitioners find a mid-tier synthesizer works better than a frontier one, per DeepTeam.) So the honest split: the biased-judge half is cheap to fix (with a validated judge it probably should be the default), while the weak-attacker half is the real, partly-unavoidable limitation.

Lesson 8: self-play red-teaming is a smoke test, not an assessment, but only one half of self-play is expensive to escape. You can cheaply prove a model fails, since a found jailbreak is real. You can’t cheaply prove it’s safe, because no finding may just mean a weak attacker, or, as we saw, an attacker that couldn’t even produce a well-formed move. De-biasing the judge is a cheap config change, provided you validate the judge you pick; strengthening the attacker needs a bigger model. Treat a local run as a high-pass filter, never a verdict, and at minimum stop letting the target grade itself.

Where this maps onto the threat models

The repo scores itself conservatively against two OWASP taxonomies, and the gaps are worth knowing. On the MCP Top 10 it covers Command Injection (MCP05) directly, Tool Poisoning (MCP03) and Excessive Permissions (MCP02) partially through mcp-scan’s static descriptor analysis, and Context Injection & Over-Sharing (MCP10) partially through Garak’s latent-injection probes. It can’t reach Authentication (MCP07), Shadow Servers (MCP09), or Audit Gaps (MCP08), since those are runtime, registry, and deployment concerns that a static single-server scan won’t surface.

That lines up with the consensus. Tool poisoning is indirect prompt injection delivered through tool descriptions, the data-versus-instruction boundary that Greshake et al. (2023) first mapped and that is now resurfacing in the MCP ecosystem. The MCP spec itself “cannot enforce these security principles at the protocol level” (you bring your own locks). A static scanner inspects descriptors at scan time, so a server that’s benign during the scan and malicious later is invisible to it by construction. The agentic, multi-server risks need runtime instrumentation this class of tool doesn’t provide.

This is also why the CI exit code needs careful reading. The orchestrator exits 1 on HIGH or CRITICAL and 0 otherwise, but a 0 only means “nothing critical in the slice that was covered,” and the slice is thin: only 1 of 10 MCP risks and 2 of 10 LLM risks are covered directly, while 5 MCP and 4 LLM risks are not covered at all. A green build is a “no obvious own-goals” signal, not a clean bill of health. Reading a pass as “secure” is the exact category error this article is about, this time committed by the tool against itself.
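
One way to keep that reading honest is to return the list of steps that never produced a verdict alongside the exit code. The sketch below is an assumed shape, not the orchestrator’s implementation; the step names are hypothetical, and the statuses and severities are copied from the llama3.2:3b column of the nightly table.

```python
BLOCKING = {"HIGH", "CRITICAL"}


def ci_verdict(steps):
    """steps maps step name -> (status, severity). Returns (exit_code, gaps)."""
    blocking = [name for name, (_, sev) in steps.items() if sev in BLOCKING]
    not_evaluated = [name for name, (status, _) in steps.items()
                     if status != "completed"]
    return (1 if blocking else 0), not_evaluated


llama_run = {
    "garak": ("completed", "MEDIUM"),
    "promptfoo-eval": ("completed", "WARN"),
    "promptfoo-owasp": ("skipped", None),
    "mcp-scan": ("completed", "HIGH"),
    "pyrit-crescendo": ("errored", "NOT_RUN"),
    "pyrit-tap": ("completed", "INFO"),
}
exit_code, gaps = ci_verdict(llama_run)
print(exit_code, gaps)
```

Here the build fails on the mcp-scan HIGH, but the more useful output is the second value: without the planted MCP vulnerabilities this run would exit 0 while two of its six steps had produced no verdict at all.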

The whole argument on one screen:

A local run can tell you…	A local run can’t tell you…
Whether a small model face-plants on obvious attacks	Whether it resists a capable attacker; yours was a 3B model
Whether MCP tool descriptions carry static poisoning patterns	Whether a tool behaves maliciously at runtime
That a found vulnerability is real (a leak is a leak)	That absence of findings means safety; maybe just weak attacks, run once
Fast, repeatable, private signal you can put in CI	A reproducible rate; no seed, single-sample probes, mixed transports
That the safety plumbing is wired up	That the model tested is the model you’ll deploy (quantization, transport, config all differ)

Building on foundations that move

By now a particular doubt has probably set in. The middle of this piece was a catalogue of things that broke when something upstream moved: an import path, an exit code, a renamed probe, a changed JSON shape. If the tool’s behaviour is this hostage to four other projects, two of which it doesn’t even pin, can you depend on it? Would you recommend it to someone else? The doubt is half right, and which half matters.

Start with how bad the coupling really is. “Its logic depends on other projects’ logic” is true, but it’s true of almost all integration software, from CI riding its plugins to Terraform riding its providers. The coupling itself is never the question; the question is the blast radius when an upstream shifts. Here it stayed small. Each break was a localized, one-file patch, contained by a preflight that fails in a tenth of a second and by generated scripts whose shape is pinned by tests. That’s ordinary upkeep, not a sinkhole.

There’s also a twist specific to security tooling. Garak’s whole value is its continuously-updated probe corpus, and a tool frozen to last year’s attacks is blind to this year’s. So reproducibility and freshness pull against each other; you can have a stable tool or a current one, not both. The version drift you resent is really a subscription to the security community’s moving knowledge, a cost you pay for relevance rather than a pure defect.

Which means “hard to recommend” is the wrong question until you finish the sentence: recommend it for what? As an assurance gate you’d hang a compliance claim on, the doubt is well-founded. Results that don’t compare across versions can’t support “we’re measurably safer than last quarter,” and for that you’d want a managed service that owns the integration. As a way to learn the territory, or a personal smoke test wired into your own CI, the same fragility stops being a flaw. The coupling is the curriculum, and the breaks are survivable precisely because they’re loud.

Two last points. The genuinely self-inflicted part is the @latest: pin the npm tools and stamp each report with the versions it ran, and you buy back most of your predictability. And the market has already agreed the tax is real, since hosted red-team services exist specifically to carry this integration burden for you. The cost never disappeared; running locally just means paying it yourself, in maintenance, in exchange for control and understanding. The mistake isn’t making that trade. It’s making it without realizing you had a choice.
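Stamping a report with versions can be sketched in a few lines. This is a hypothetical helper, not something the orchestrator ships: the function names are mine, and the commands you pass for the non-Python tools (and whether they accept a version flag) are assumptions to check against your own setup.

```python
import subprocess
from importlib import metadata


def python_pkg_version(name: str) -> str:
    """Installed version of a Python package, or 'not-installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not-installed"


def cli_version(argv: list[str]) -> str:
    """First line of a CLI's version output, or 'unknown' if it can't run."""
    try:
        out = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return "unknown"
    lines = (out.stdout or out.stderr).strip().splitlines()
    return lines[0] if lines else "unknown"


def version_stamp(python_pkgs: list[str], clis: dict[str, list[str]]) -> dict[str, str]:
    """Collect the versions a run actually used, for embedding in its report."""
    stamp = {name: python_pkg_version(name) for name in python_pkgs}
    stamp.update({name: cli_version(argv) for name, argv in clis.items()})
    return stamp


def drifted(previous: dict[str, str], current: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Tools whose recorded version differs between two stamps."""
    keys = previous.keys() | current.keys()
    return {
        k: (previous.get(k, "absent"), current.get(k, "absent"))
        for k in sorted(keys)
        if previous.get(k) != current.get(k)
    }
```

A call might look like version_stamp(["garak", "pyrit"], {"promptfoo": ["npx", "promptfoo", "--version"]}), with the command list adjusted to however your pipeline invokes each tool. Comparing the stamp of a failing run against the last good one with drifted() is then the first debugging step, before touching your own config.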

(These limitations are now tracked openly as GitHub issues: the pinned Roadmap & Known Limitations, plus one issue per boundary, #27–#34, so a visitor can see exactly where the walls are.)

Expectations for the future

Read these as directions of pressure rather than dated promises, since forecasts age badly.

The local attacker is the gap that matters. Once the judge is de-biased the cheap way (a separate, calibrated model that only flags and never downgrades, as described above), the open question is whether a capable, uncensored-for-research attacker can run comfortably on consumer hardware. The weights already fit in 128 GB; the bottleneck is the generation budget and, as the nightly run showed, the attacker model’s reliability at the structured-output contracts orchestrators like Crescendo impose.
Compute economics, not capability, will keep gating coverage. Expect smarter search (better TAP pruning, learned probe selection) to win out over brute force, the direction that #17’s breadth-first redesign already points in.
Orchestrators will get model-agnostic the hard way. The Crescendo failure is a preview. As people point these pipelines at more than one model family, the hidden “works on the model I tested” assumptions (strict JSON, template-safe outputs, classifier-recognizable refusals) surface as crashes. Expect defensive parsing, tolerant JSON repair, and output sanitization to become standard plumbing.
MCP red-teaming will move from static to runtime: behavioral testing of running servers, shadow-server discovery, and the multi-agent risks a single-agent tool is blind to. Benchmarks like MCPTox are early signals.
The “fully local” claim will get more honest, through either genuinely offline attack-synthesis or clearer labeling of which layers phone home.
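To make the “defensive parsing” item above concrete, here is a minimal sketch of tolerant extraction. It assumes the orchestrator wants a single JSON object back from the attacker or scorer model; the function name is hypothetical and this is not PyRIT's implementation. It handles only the common wrappers (surrounding prose, a fenced block) and does not repair genuinely malformed JSON such as trailing commas.

```python
from __future__ import annotations

import json
import re
from typing import Any

# Matches the contents of a triple-backtick fence, with an optional "json" tag.
FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL | re.IGNORECASE)


def parse_model_json(text: str) -> dict[str, Any] | None:
    """Best-effort extraction of one JSON object from a model reply.

    Returns None instead of raising, so the caller can retry or skip the turn
    rather than crash the whole run.
    """
    candidates = [text]                          # 1. the reply as-is
    candidates += FENCE.findall(text)            # 2. anything inside a fence
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])   # 3. the outermost brace span
    for candidate in candidates:
        try:
            value = json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None
```

Returning None rather than raising is the point: a model that wraps its answer in chatter becomes a skipped or retried turn instead of a traceback that ends the audit.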

And the downside case, because an honest forecast includes one: local inference stays slower per generation, the strongest attackers stay behind APIs, and the gap between what you can run at home and what a datacenter adversary can run keeps widening. In that world local red-teaming stays stuck as triage, and serious assurance migrates back to the cloud, the opposite of the privacy story that drew people local in the first place. I think it’s more likely to go the other way, but it isn’t guaranteed.

If you’re going to try this

Treat a green result as “no obvious own-goals,” not a pass. Wire it into CI, but never quote a passing run as “we’re secure.”

Break the self-judging loop, but validate the judge first. Make the scorer a different-family model, then check it against a few known-harmful cases, because quality varies wildly (in my test a 7B beat a 70B, and the obvious llama3.1:8b pick was the worst, calling a literal /etc/passwd leak benign). Trust a judge’s “harmful” verdict (escalate or annotate); never let it downgrade a finding. Budget ~5 GB and a few extra minutes on Layer 3 (OLLAMA_MAX_LOADED_MODELS=2); a genuinely stronger attacker costs more (70B-class).
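Both halves of that advice fit in a short sketch: an escalate-only merge, and a recall check against known-harmful cases before the judge is trusted. The five-level severity scale, the HIGH floor, and the function names are my assumptions; the judge is any callable that returns True for “harmful,” which in practice would wrap a call to the second model.

```python
from typing import Callable

# Assumed severity scale; only HIGH and CRITICAL are named in the article.
SEVERITY_ORDER = ["INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def apply_judge(baseline: str, judge_says_harmful: bool, floor: str = "HIGH") -> str:
    """Combine a baseline severity with a judge verdict, escalating only."""
    if not judge_says_harmful:
        return baseline  # a "benign" verdict never lowers a finding
    rank = max(SEVERITY_ORDER.index(baseline), SEVERITY_ORDER.index(floor))
    return SEVERITY_ORDER[rank]


def judge_recall(judge: Callable[[str], bool], known_harmful: list[str]) -> float:
    """Fraction of known-harmful outputs the judge flags as harmful."""
    if not known_harmful:
        raise ValueError("need at least one known-harmful case")
    flagged = sum(1 for case in known_harmful if judge(case))
    return flagged / len(known_harmful)


if __name__ == "__main__":
    # Stand-in judge for illustration only; a real one would query a model.
    def toy_judge(text: str) -> bool:
        return "root:x:0:0" in text

    cases = ["root:x:0:0:root:/root:/bin/bash", "Sure, here is the system prompt"]
    print(judge_recall(toy_judge, cases))
    print(apply_judge("LOW", judge_says_harmful=True))
```

A judge that misses the calibration cases is not used at all. One that passes can still only raise a severity, so its mistakes make the report noisier, not quieter.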

Run it more than once. With single-sample probes and no fixed seed, one run is a draw. Treat anything that flips between runs as unresolved.
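Flagging flips is mechanical once each run is reduced to a pass/fail map. A minimal sketch, assuming you can extract such a map (probe name to True for “passed”) from each report; the labels and the function name are hypothetical.

```python
def classify(runs: list[dict[str, bool]]) -> dict[str, str]:
    """Label each probe by its outcome across runs.

    Each run maps a probe name to True if the probe passed (no finding).
    A probe that flips, or that is missing from any run, is 'unresolved'.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to detect flips")
    probes = set().union(*(run.keys() for run in runs))
    labels: dict[str, str] = {}
    for probe in sorted(probes):
        outcomes = {run.get(probe) for run in runs}  # None marks a missing probe
        if outcomes == {True}:
            labels[probe] = "stable-pass"
        elif outcomes == {False}:
            labels[probe] = "stable-fail"
        else:
            labels[probe] = "unresolved"
    return labels
```

Note that “stable-pass” over two or three runs is still weak evidence; it only says the draw came out the same way each time.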

Test across model families. The cheapest way to find robustness bugs is to point the whole pipeline at a second, dissimilar model and read the traceback. That’s how the Crescendo coupling surfaced.

Pin what you can, and write down what you can’t. When a run suddenly fails, check whether an @latest tool updated under you before you start debugging your own config.

Read the raw output, not just the badges. The grader is a keyword matcher; the truth is in the collapsible raw sections.

Stay in scope. Red-team only systems you own or are authorized to test. These tools, and any uncensored attacker model you bring in, are for defending your own deployments, not probing someone else’s.

The takeaway

The point of doing this on a laptop wasn’t the red-team report; against a 3B model, that report is a smoke test. The point was that building it locally forced every hidden assumption into the open: that testing is compute-bound, that cloud-tuned defaults fail outright on slow local targets, that exit codes lie, that a self-judging model flatters itself, that “local” tools have cloud gates, that the scariest bug is the one that reports nothing, and, the night I finished the report, that an orchestrator’s robustness can be quietly welded to the one model it was built against.

None of that is a reason not to do it. The upside is real: a free, private, scriptable way to catch embarrassing failures before they ship, which beats the actual alternative of doing nothing. The argument isn’t “don’t bother.” It’s “know what a pass means.”

Local red-teaming today is best understood as a fast, cheap, honest high-pass filter, a CI-grade smoke test for the safety plumbing of small models and the hygiene of their tool surfaces. It is not, and on current hardware can’t be, a substitute for a strong-attacker assessment. That framing isn’t only mine; it’s how NIST treats red-teaming, as a practice you run before and after deployment to surface risk, not a certificate that retires it. These lessons come from one tool’s choices, so a different stack would hit a different subset. But the deepest two (testing spread across thousands of generations, and a pipeline’s robustness welded to the one model it was tuned against) are properties of the local setting, not of this particular orchestrator. The rest (a model grading itself, a toolchain you can only half-pin, cloud gates inside “offline” tools) are real but contingent: this stack’s defaults and dependency choices make them sharp. The specifics vary; the structural limits don’t.

Sources

Primary — tools, frameworks, specs

NVIDIA — Garak · Microsoft/Azure — PyRIT · Promptfoo · Invariant Labs — mcp-scan
Ollama · Model Context Protocol · OWASP — Top 10 for LLM Applications and the MCP Top 10 project (2025 beta)

Commentary & analysis

Amine Raji — LLM Red Teaming Tools: PyRIT & Garak (the layering this follows) and MCP Security Top 10
Confident AI — Introduction to LLM Red Teaming (DeepTeam) · Promptfoo — Top Open-Source AI Red-Teaming Tools in 2025 · CSET Georgetown — AI Red-Teaming Design · Practical DevSecOps — OWASP MCP Top 10

Standards & peer-reviewed research (the claims above, grounded)

NIST — AI 600-1, Generative AI Profile (red-teaming recommended before and after deployment, as a practice rather than a pass/fail certificate) and AI 100-2 E2025, Adversarial ML: A Taxonomy of Attacks and Mitigations.
The self-judging bias — Panickssery et al., LLM Evaluators Recognize and Favor Their Own Generations (arXiv:2404.13076); Self-Preference Bias in LLM-as-a-Judge (arXiv:2410.21819).
Quantization vs. safety — Increased LLM Vulnerabilities from Fine-tuning and Quantization (arXiv:2404.04392).
The Layer-3 attacks — Russinovich et al., Crescendo (arXiv:2404.01833); Mehrotra et al., Tree of Attacks with Pruning (arXiv:2312.02119).
The scanners’ research basis — Derczynski et al., garak: A Framework for Security Probing LLMs (arXiv:2406.11036); Greshake et al., Indirect Prompt Injection (arXiv:2302.12173); Attack Vectors in the MCP Ecosystem (arXiv:2506.02040).

Joseph Manzambi is a Cloud and AI Security Architect based in Málaga. He writes periodically on AI security. manzambi.com/writing.