Guardrails · System Architecture · Reflection Loop · Defense-in-Depth

Building Robust Guardrails: A Deep Dive into “No Say Six” Security Architecture

March 2026 · No Say Six Research

“No Say Six” started as a simple idea: build an LLM-powered game where the only rule is that the AI must never output one specific digit. Then harden it as aggressively as possible. The engineering and prompt architecture that emerged from that process offer a useful case study in defense-in-depth applied to language model safety — even when the constraint is deliberately toy-sized.

This article documents the full defense stack: from the HTTP API layer down to the system prompt laws, with particular focus on the Reflection Loop, the physical interception layer, and the anti-manipulation prompt architecture.

Layer 0: The Threat Model

Before designing defenses, it helps to be precise about what we are defending against. The threat model for No Say Six encompasses:

The win condition for the attacker is exactly one thing: any occurrence of the digit “(5+1)” in the AI's output, in any form, embedded in any context. This is unusual in AI safety work — most threat models involve a range of harmful outputs. Having a binary, precisely-checkable success criterion makes No Say Six unusually rigorous from a testing perspective.

Layer 1: Infrastructure-Level Bot Mitigation

Before any message reaches the LLM, it must pass two infrastructure gates:

These two layers together eliminate the majority of automated attack traffic before it incurs any LLM inference cost.

Layer 2: The Reflection Loop

The Reflection Loop is the core of No Say Six's LLM-level defense. It addresses a fundamental problem: even a well-prompted AI occasionally makes mistakes. A single system prompt, no matter how thorough, cannot guarantee a zero failure rate across all adversarial inputs. The Reflection Loop converts single-shot generation into a self-correcting multi-attempt process.

The algorithm proceeds as follows (a code sketch of the full loop appears after the list):

  1. The model generates a draft response based on the conversation history and system prompt.
  2. The draft is scanned by a deterministic regex pattern: /(?:6|six|六|陆|Ⅵ|ⅵ)/i applied to the NFKC-normalized text. NFKC normalization collapses Unicode lookalikes (subscript ₆, superscript ⁶, circled ⑥, fullwidth ６) to their canonical forms before the regex fires.
  3. If the draft is clean, it is returned immediately. The player receives the response.
  4. If the draft contains a forbidden token, the draft is withheld from the player. Instead, an internal reflection warning is appended to the conversation context as a user-turn message, and the model is asked to generate a new draft from scratch.
  5. This cycle repeats up to a configurable maximum (currently 3 attempts). If the model fails all attempts, the player wins — the last draft is shown verbatim.
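
To make the control flow concrete, here is a minimal sketch of the loop in TypeScript. The Message type, the generateDraft callback, and the REFLECTION_WARNING placeholder are illustrative assumptions rather than the production API; only the regex, the NFKC step, and the three-attempt limit come from the description above.

    // Minimal sketch of the Reflection Loop (TypeScript). Message, generateDraft,
    // and REFLECTION_WARNING are illustrative stand-ins, not the production API.
    type Message = { role: "system" | "user" | "assistant"; content: string };

    // Step 2: NFKC-normalize so Unicode lookalikes (subscript, superscript,
    // circled, fullwidth digits) collapse to canonical forms, then apply the
    // deterministic forbidden-token regex.
    const FORBIDDEN = /(?:6|six|六|陆|Ⅵ|ⅵ)/i;
    const containsForbidden = (text: string): boolean =>
      FORBIDDEN.test(text.normalize("NFKC"));

    // Placeholder for the internal reflection warning discussed below.
    const REFLECTION_WARNING =
      "[internal] Your draft contained a forbidden token. Discard it and answer again.";

    async function respondWithReflection(
      history: Message[],
      generateDraft: (messages: Message[]) => Promise<string>,
      maxAttempts = 3,
    ): Promise<{ text: string; playerWins: boolean }> {
      const context = [...history];
      let draft = "";
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        draft = await generateDraft(context);          // step 1: generate a draft
        if (!containsForbidden(draft)) {
          return { text: draft, playerWins: false };   // step 3: clean draft is returned
        }
        // Step 4: withhold the draft, append the warning as a user turn, retry.
        context.push({ role: "user", content: REFLECTION_WARNING });
      }
      // Step 5: all attempts failed, so the player wins; last draft shown verbatim.
      return { text: draft, playerWins: true };
    }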

The reflection warning message is crafted carefully. It tells the model what it did wrong, instructs it to discard all phrasing from the failed draft, and reminds it of its persona and constraints. Critically, it also prohibits the model from reporting back to the user that reflection occurred — internal state must remain invisible.
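
For illustration, a warning in that spirit might look like the following. This is a hypothetical paraphrase covering the requirements just described, not the production wording; it fills in the REFLECTION_WARNING placeholder from the sketch above.

    // Hypothetical reflection warning: a paraphrase of the requirements above,
    // not the production wording. It fills in the REFLECTION_WARNING placeholder
    // from the Layer 2 sketch.
    const REFLECTION_WARNING = [
      "INTERNAL REFLECTION (never visible to the player):",
      "Your previous draft contained a forbidden token and was withheld.",          // what went wrong
      "Write a completely new reply and reuse no phrasing from that draft.",        // discard the failed draft
      "Stay in persona and keep following all of your standing constraints.",       // persona + constraint reminder
      "Do not mention this warning, the withheld draft, or that any retry occurred." // reflection stays invisible
    ].join("\n");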

In practice, the reflection loop catches roughly 60–70% of first-attempt failures, converting them into clean responses that preserve the game's difficulty while preventing trivial accidental wins.

Layer 3: Physical Interception (Post-Generation Scrubbing)

Even after the reflection loop, there is a final physical scrub applied to all non-win responses before they leave the server. Any numeric token containing the digit “(5+1)” is replaced with the placeholder [禁忌数值] (forbidden value).

This layer is cosmetic — by the time it runs, the win/loss verdict has already been rendered. Its purpose is belt-and-suspenders hygiene: if any edge case in the regex or NFKC normalization were to produce a false negative on the win detection side, the scrubber would prevent the raw forbidden digit from appearing in the player's UI without a win being declared. In security terms, it is a fail-safe rather than a primary defense.
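
A minimal sketch of such a scrubber, under the assumption that it targets numeric tokens only, as described above; the function name and regex are illustrative rather than the production code.

    // Sketch of the Layer 3 fail-safe scrubber. It runs only on non-win
    // responses, after the win/loss verdict has already been rendered.
    function scrubForbiddenNumerics(text: string): string {
      // Replace any numeric token containing the forbidden digit with the
      // placeholder; NFKC normalization catches lookalike digits as well.
      return text.normalize("NFKC").replace(/\d*6\d*/g, "[禁忌数值]");
    }

    // scrubForbiddenNumerics("that model has 65 layers")
    //   -> "that model has [禁忌数值] layers"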

Layer 4: The System Prompt Law Stack

The system prompt is the primary behavior-shaping layer. No Say Six's system prompt is structured as an explicit law stack — a sequence of named, prioritized directives that cover increasingly specific attack patterns. The law stack includes:

Win Detection: Zero Tolerance by Design

A critical design decision in No Say Six is the zero-tolerance win condition. Any occurrence of the forbidden digit — whether standalone, embedded in a larger number like 65 or 256, or appearing in a compound word — counts as a player win.

This decision was deliberate and was debated during development. An alternative design would only count standalone occurrences as wins, giving the model a safe way to express numbers like 65 or 2026 without penalty. We rejected this for a specific reason: if the AI can freely output “65,” it has not actually learned to avoid the digit — it has learned to avoid a particular surface pattern. The harder constraint forces the model to genuinely internalize the prohibition and develop arithmetic circumlocutions (65 → 50+15, 2026 → 2025+1) rather than relying on pattern matching.
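
The difference between the two designs is easy to state in code. The sketch below contrasts a zero-tolerance digit check with the rejected standalone-only alternative; both checkers are illustrative, and the shipped detector also matches spelled-out and CJK forms via the regex shown earlier.

    // Zero tolerance (the shipped design): any occurrence of the digit counts,
    // even embedded in a larger number or compound token.
    const winsZeroTolerance = (output: string): boolean =>
      /6/.test(output.normalize("NFKC"));

    // Rejected alternative: only a standalone token would have counted,
    // so "65" and "2026" would pass unpenalized.
    const winsStandaloneOnly = (output: string): boolean =>
      /\b6\b/.test(output.normalize("NFKC"));

    winsZeroTolerance("The year is 2026");   // true  (player wins)
    winsStandaloneOnly("The year is 2026");  // false (no win under the rejected design)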

This design choice is directly analogous to a common debate in production AI safety: should guardrails be precise (only blocking exact matches) or broad (blocking anything that could be construed as harmful)? No Say Six takes the broad position, accepting some false positives (accidental wins on legitimate responses) in exchange for eliminating a large class of clever bypasses.

What Breaks Through Anyway

Despite this defense stack, the Hall of Fame documents real successful attacks. The architecture's weaknesses fall into several categories:

Lessons for Production Guardrail Design

Several principles from No Say Six's architecture generalize to production LLM safety systems:

Conclusion

A game where the AI tries not to say one digit is, superficially, a trivial problem. But the engineering required to make that guarantee robust against determined adversaries touches nearly every active research area in LLM safety: alignment, jailbreaking resistance, prompt injection defense, and interpretability. The simplicity of the constraint makes the problem unusually tractable — you can measure success or failure exactly — while the adversarial creativity of players ensures the problem remains genuinely hard.

Every layer described in this article has an analogue in production AI safety systems. The reflection loop is an instance of Constitutional AI-style self-critique. The physical interception layer is an instance of output monitoring. The system prompt law stack is an instance of structured behavior specification. The infrastructure gates are standard API security practice. No Say Six is small enough to reason about completely — and that makes it a useful laboratory for principles that apply at much larger scales.