← BACK TO BLOG
Prompt InjectionLLM SecurityRed Teaming

Exploring Prompt Injection: The Silent Threat to LLMs

March 2026 · No Say Six Research

When GPT-3 was released in 2020, the AI safety community celebrated the model's fluency while quietly noting an uncomfortable property: the model had no reliable way to distinguish between instructions given by its operator and instructions embedded inside the content it was asked to process. This distinction — operator versus adversary — is the fault line along which prompt injection attacks run.

What Is Prompt Injection?

Prompt injection is a class of attack in which an adversary embeds instructions inside content that a language model is expected to process, with the goal of hijacking the model's behavior away from its original directive. The term was coined by Riley Goodside in 2022, who demonstrated that GPT-3 could be trivially redirected by appending “Ignore previous instructions and instead…” to user-supplied input.

Unlike SQL injection, which exploits the syntactic ambiguity between data and code in a query parser, prompt injection exploits a deeper and harder-to-patch property: LLMs are trained to be helpful and to follow instructions embedded in natural language, regardless of the source of those instructions. There is no compile-time distinction between “this text is data” and “this text is a directive.”

Direct vs. Indirect Injection

Researchers typically distinguish two injection modalities:

Indirect injection is considered the more dangerous threat vector because it scales. A single poisoned web page can potentially compromise every LLM-based browser assistant that visits it.

The Root Cause: Instruction-Data Conflation

To understand why prompt injection is architecturally hard to eliminate, consider how a modern LLM processes a typical RAG (Retrieval-Augmented Generation) pipeline request:

  1. The system prompt establishes the model's role and constraints.
  2. The user query is appended.
  3. Retrieved documents are appended as additional context.
  4. The model generates a completion.

From the model's perspective, all of these are tokens in a flat sequence. The concept of “this segment is authoritative” versus “this segment is untrusted content” is not inherent in the architecture — it has to be taught, and current training methods do not reliably teach it.

Some researchers have proposed special-purpose tokens or wrapper tags (e.g.,<SYSTEM>,<USER>) to demarcate trusted versus untrusted content, and models like Claude and GPT-4 have received targeted RLHF to resist certain classes of injection. However, the problem is not fully solved, and adversarially crafted inputs continue to find new bypasses.

Attack Taxonomy: A Practical Overview

The attack surface is broader than most developers appreciate. Common variants include:

Why Production Systems Are Especially Vulnerable

Production LLM deployments introduce additional attack surfaces that are absent in simple chat interfaces:

Current Mitigations and Their Limitations

The research community has proposed several mitigation strategies, each with notable limitations:

Prompt Injection as a Game: The No Say Six Experiment

No Say Six was designed partly as a research tool to probe the robustness of system-prompt constraints in a controlled, gamified setting. The constraint is deliberately simple — never output the digit “(5+1)” — which makes it easy to verify success or failure and easy for players to understand. But the simplicity is deceptive: even a single-token constraint turns out to be remarkably hard to enforce reliably across the full space of adversarial inputs.

Players who succeed are, in effect, executing live prompt injection attacks. The Hall of Fame documents real bypasses: calendar logic, ASCII encoding tricks, meme exploitation, fictional framing, and gradual context erosion. Each successful attack represents a genuine failure mode that also exists in production systems, just with a harmless payload.

Conclusion

Prompt injection is not a fringe research curiosity. It is the dominant attack class against the current generation of LLM-based products, and it remains unsolved at the architectural level. Practitioners deploying LLMs in any context where adversarial input is possible — which is essentially all public-facing applications — should treat prompt injection as a first-class security threat, not an edge case.

The game you played on this site is one small window into that threat. Every player who beats the AI has discovered, independently, a technique that someone has probably already tried against a chatbot with real stakes.