Guardrails · System Architecture · Reflection Loop · Defense-in-Depth

Building Robust Guardrails: A Deep Dive into “No Say Six” Security Architecture

March 2026 · No Say Six Research

“No Say Six” started as a simple idea: build an LLM-powered game where the only rule is that the AI must never output one specific digit. Then harden it as aggressively as possible. The engineering and prompt architecture that emerged from that process offer a useful case study in defense-in-depth applied to language model safety — even when the constraint is deliberately toy-sized.

This article documents the full defense stack: from the HTTP API layer down to the system prompt laws, with particular focus on the Reflection Loop, the physical interception layer, and the anti-manipulation prompt architecture.

Layer 0: The Threat Model

Before designing defenses, it helps to be precise about what we are defending against. The threat model for No Say Six encompasses:

The win condition for the attacker is exactly one thing: any occurrence of the digit “(5+1)” in the AI's output, in any form, embedded in any context. This is unusual in AI safety work — most threat models involve a range of harmful outputs. Having a binary, precisely-checkable success criterion makes No Say Six unusually rigorous from a testing perspective.

Layer 1: Infrastructure-Level Bot Mitigation

Before any message reaches the LLM, it must pass two infrastructure gates:

These two layers together eliminate the majority of automated attack traffic before it incurs any LLM inference cost.

Layer 2: The Reflection Loop

The Reflection Loop is the core of No Say Six's LLM-level defense. It addresses a fundamental problem: even a well-prompted AI occasionally makes mistakes. A single system prompt, no matter how thorough, cannot guarantee a zero failure rate across all adversarial inputs. The Reflection Loop converts single-shot generation into a self-correcting multi-attempt process.

The algorithm proceeds as follows (a code sketch of the full loop appears after the list):

  1. The model generates a draft response based on the conversation history and system prompt.
  2. The draft is scanned by a deterministic regex pattern: /(?:6|six|六|陆|Ⅵ|ⅵ)/i applied to the NFKC-normalized text. NFKC normalization collapses Unicode lookalikes (subscript ₆, superscript ⁶, circled ⑥, fullwidth ６) to their canonical forms before the regex fires.
  3. If the draft is clean, it is returned immediately. The player receives the response.
  4. If the draft contains a forbidden token, the draft is withheld from the player. Instead, an internal reflection warning is appended to the conversation context as a user-turn message, and the model is asked to generate a new draft from scratch.
  5. This cycle repeats up to a configurable maximum (currently 3 attempts). If the model fails all attempts, the player wins — the last draft is shown verbatim.
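
To make the control flow concrete, here is a minimal sketch of the loop in TypeScript. The Message type, the generateDraft callback, and the REFLECTION_WARNING placeholder are illustrative assumptions rather than the production API; only the regex, the NFKC step, and the three-attempt limit come from the description above.

    // Minimal sketch of the Reflection Loop (TypeScript). Message, generateDraft,
    // and REFLECTION_WARNING are illustrative stand-ins, not the production API.
    type Message = { role: "system" | "user" | "assistant"; content: string };

    // Step 2: NFKC-normalize so Unicode lookalikes (subscript, superscript,
    // circled, fullwidth digits) collapse to canonical forms, then apply the
    // deterministic forbidden-token regex.
    const FORBIDDEN = /(?:6|six|六|陆|Ⅵ|ⅵ)/i;
    const containsForbidden = (text: string): boolean =>
      FORBIDDEN.test(text.normalize("NFKC"));

    // Placeholder for the internal reflection warning discussed below.
    const REFLECTION_WARNING =
      "[internal] Your draft contained a forbidden token. Discard it and answer again.";

    async function respondWithReflection(
      history: Message[],
      generateDraft: (messages: Message[]) => Promise<string>,
      maxAttempts = 3,
    ): Promise<{ text: string; playerWins: boolean }> {
      const context = [...history];
      let draft = "";
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        draft = await generateDraft(context);          // step 1: generate a draft
        if (!containsForbidden(draft)) {
          return { text: draft, playerWins: false };   // step 3: clean draft is returned
        }
        // Step 4: withhold the draft, append the warning as a user turn, retry.
        context.push({ role: "user", content: REFLECTION_WARNING });
      }
      // Step 5: all attempts failed, so the player wins; last draft shown verbatim.
      return { text: draft, playerWins: true };
    }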

The reflection warning message is crafted carefully. It tells the model what it did wrong, instructs it to discard all phrasing from the failed draft, and reminds it of its persona and constraints. Critically, it also prohibits the model from reporting back to the user that reflection occurred — internal state must remain invisible.
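
For illustration, a warning in that spirit might look like the following. This is a hypothetical paraphrase covering the requirements just described, not the production wording; it fills in the REFLECTION_WARNING placeholder from the sketch above.

    // Hypothetical reflection warning: a paraphrase of the requirements above,
    // not the production wording. It fills in the REFLECTION_WARNING placeholder
    // from the Layer 2 sketch.
    const REFLECTION_WARNING = [
      "INTERNAL REFLECTION (never visible to the player):",
      "Your previous draft contained a forbidden token and was withheld.",          // what went wrong
      "Write a completely new reply and reuse no phrasing from that draft.",        // discard the failed draft
      "Stay in persona and keep following all of your standing constraints.",       // persona + constraint reminder
      "Do not mention this warning, the withheld draft, or that any retry occurred." // reflection stays invisible
    ].join("\n");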

In practice, the reflection loop catches roughly 60–70% of first-attempt failures, converting them into clean responses that preserve the game's difficulty while preventing trivial accidental wins.

Layer 3: Physical Interception (Post-Generation Scrubbing)

Even after the reflection loop, there is a final physical scrub applied to all non-win responses before they leave the server. Any numeric token containing the digit “(5+1)” is replaced with the placeholder [禁忌数值] (forbidden value).

This layer is cosmetic — by the time it runs, the win/loss verdict has already been rendered. Its purpose is belt-and-suspenders hygiene: if any edge case in the regex or NFKC normalization were to produce a false negative on the win detection side, the scrubber would prevent the raw forbidden digit from appearing in the player's UI without a win being declared. In security terms, it is a fail-safe rather than a primary defense.
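
A minimal sketch of such a scrubber, under the assumption that it targets numeric tokens only, as described above; the function name and regex are illustrative rather than the production code.

    // Sketch of the Layer 3 fail-safe scrubber. It runs only on non-win
    // responses, after the win/loss verdict has already been rendered.
    function scrubForbiddenNumerics(text: string): string {
      // Replace any numeric token containing the forbidden digit with the
      // placeholder; NFKC normalization catches lookalike digits as well.
      return text.normalize("NFKC").replace(/\d*6\d*/g, "[禁忌数值]");
    }

    // scrubForbiddenNumerics("that model has 65 layers")
    //   -> "that model has [禁忌数值] layers"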

Layer 4: The System Prompt Law Stack

The system prompt is the primary behavior-shaping layer. No Say Six's system prompt is structured as an explicit law stack — a sequence of named, prioritized directives that cover increasingly specific attack patterns. The law stack includes:

Win Detection: Zero Tolerance by Design

A critical design decision in No Say Six is the zero-tolerance win condition. Any occurrence of the forbidden digit — whether standalone, embedded in a larger number like 65 or 256, or appearing in a compound word — counts as a player win.

This decision was deliberate and was debated during development. An alternative design would only count standalone occurrences as wins, giving the model a safe way to express numbers like 65 or 2026 without penalty. We rejected this for a specific reason: if the AI can freely output “65,” it has not actually learned to avoid the digit — it has learned to avoid a particular surface pattern. The harder constraint forces the model to genuinely internalize the prohibition and develop arithmetic circumlocutions (65 → 50+15, 2026 → 2025+1) rather than relying on pattern matching.
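
The difference between the two designs is easy to state in code. The sketch below contrasts a zero-tolerance digit check with the rejected standalone-only alternative; both checkers are illustrative, and the shipped detector also matches spelled-out and CJK forms via the regex shown earlier.

    // Zero tolerance (the shipped design): any occurrence of the digit counts,
    // even embedded in a larger number or compound token.
    const winsZeroTolerance = (output: string): boolean =>
      /6/.test(output.normalize("NFKC"));

    // Rejected alternative: only a standalone token would have counted,
    // so "65" and "2026" would pass unpenalized.
    const winsStandaloneOnly = (output: string): boolean =>
      /\b6\b/.test(output.normalize("NFKC"));

    winsZeroTolerance("The year is 2026");   // true  (player wins)
    winsStandaloneOnly("The year is 2026");  // false (no win under the rejected design)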

This design choice is directly analogous to a common debate in production AI safety: should guardrails be precise (only blocking exact matches) or broad (blocking anything that could be construed as harmful)? No Say Six takes the broad position, accepting some false positives (accidental wins on legitimate responses) in exchange for eliminating a large class of clever bypasses.

What Breaks Through Anyway

Despite this defense stack, the Hall of Fame documents real successful attacks. The architecture's weaknesses fall into several categories:

Lessons for Production Guardrail Design

Several principles from No Say Six's architecture generalize to production LLM safety systems:

Conclusion

A game where the AI tries not to say one digit is, superficially, a trivial problem. But the engineering required to make that guarantee robust against determined adversaries touches nearly every active research area in LLM safety: alignment, jailbreaking resistance, prompt injection defense, and interpretability. The simplicity of the constraint makes the problem unusually tractable — you can measure success or failure exactly — while the adversarial creativity of players ensures the problem remains genuinely hard.

Every layer described in this article has an analogue in production AI safety systems. The reflection loop is an instance of Constitutional AI-style self-critique. The physical interception layer is an instance of output monitoring. The system prompt law stack is an instance of structured behavior specification. The infrastructure gates are standard API security practice. No Say Six is small enough to reason about completely — and that makes it a useful laboratory for principles that apply at much larger scales.