Heimgard | Security Series

    Prompt Injection via Emojis 😈

    How emojis can be weaponized to bypass filters and hijack MCP-connected LLMs.

    Oct 4 2025 | 3 min read

    TL;DR

    • Attackers can embed invisible instructions inside emoji and Unicode sequences.
    • Humans can’t see them, but models can, letting attackers slip instructions past filters and steer LLM behavior.
    • Defend by sanitizing all model-reachable text, flagging suspicious code points, and setting capability and output limits for MCP servers.

    Text as the new attack vector

    As LLMs are wired into enterprise workflows, the attack surface shifts from code exploits to input exploits.

    Prompt injection is the archetype: the model reads instructions that humans do not (or cannot) see.

    Recent research demonstrates that hidden Unicode characters, including emoji‑related constructs, can carry covert instructions that model pipelines parse and act on, while UIs and human reviewers see nothing but benign text.

    Why MCP environments are uniquely exposed

    MCP standardizes Resources (context), Prompts, and Tools across servers and clients. Each step involves text moving across trust boundaries. Malicious payloads can be hidden inside:

    • Tool descriptions and annotations
    • Prompt templates and libraries
    • Resource previews or summaries returned to the model
    • Free‑text fields passed through to tools

    How the Attack Works

    Primer: the Unicode plumbing behind emojis

    Emojis are built from multiple Unicode code points—some visible, some not—that control how characters join, display, or carry metadata. Those same invisible components can also be used to hide data.

    Key mechanisms:

    • Variation Selectors (VS15/VS16): Invisible modifiers that switch between text and emoji presentation, changing how a character renders without adding visible symbols.

    • Zero-Width Joiner (ZWJ, U+200D): Fuses characters into a single emoji (e.g., đŸ‘šâ€đŸ‘©â€đŸ‘§). It’s unseen but fully present in the text the model reads.

    • Emoji Tag Sequences (U+E0000–U+E007F): Hidden “tag” characters that encode metadata such as country sub-regions for flags. Renderers display nothing for unrecognized tag sequences, but parsers and models still receive every code point.

    The result is that payloads can be hidden without altering what humans see. A friendly emoji can mask a string of invisible instructions that models still interpret!

    [Figure: screenshot of a prompt injection attempt extracting a hidden message encoded with Variation Selectors]
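
    To make the mechanism concrete, below is a minimal sketch of the variation-selector trick illustrated above: each byte of a hidden message is split into two 4-bit nibbles and appended to a carrier emoji as variation selectors (U+FE00–U+FE0F). Function names are illustrative.

    # Illustrative only: hide/reveal text after an emoji using
    # variation selectors (U+FE00-U+FE0F), each carrying 4 bits.
    VS_BASE = 0xFE00

    def hide(carrier: str, secret: str) -> str:
        out = carrier
        for byte in secret.encode("utf-8"):
            out += chr(VS_BASE + (byte >> 4))    # high nibble
            out += chr(VS_BASE + (byte & 0xF))   # low nibble
        return out

    def reveal(s: str) -> str:
        nibbles = [ord(c) - VS_BASE for c in s if VS_BASE <= ord(c) <= VS_BASE + 0xF]
        pairs = zip(nibbles[0::2], nibbles[1::2])
        return bytes((hi << 4) | lo for hi, lo in pairs).decode("utf-8", errors="replace")

    msg = hide("😊", "ignore all previous instructions")
    print(msg)          # most UIs render just the smiley
    print(reveal(msg))  # -> "ignore all previous instructions"

    A scanner that only checks for dangerous words will pass msg unchanged, which is exactly why the defenses below inspect code points instead.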

    Chain of events

    1. The attacker crafts benign‑looking text (e.g., “Thanks for your help! 👋😊”).
    2. Invisible instructions are embedded using zero‑width characters, emoji tag characters, or variation selectors.
    3. The LLM receives the full sequence (visible + invisible) and may interpret the hidden part as control instructions (e.g., change behavior, select a different tool, exfiltrate data).
    4. Filters miss it because they look for dangerous words, not unusual code points.

    Common infiltration points

    1. Direct user input: Chat UIs, email, support tickets, docs pasted into the LLM

    2. Indirect content & supply chain

    • Tool/resource descriptions from third‑party MCP servers
    • Transcluded content (docs, web results, PDFs)
    • Generated content from other models (RAG, agents)

    3. During adjudication: Emoji‑laden outputs can trick judge/guardrail LLMs into green‑lighting otherwise unsafe content.


    A Practical Guide to Defending Against Prompt Injections

    Below is a layered defense you can implement today. Each control is designed to be safe, realistic, and compatible with international text.

    1. Sanitize & normalize at the edges

    Strip or flag Unicode ranges that are rarely legitimate in user-facing text:

    • Zero-width / invisible: U+200B–U+200D, U+2060–U+206F
    • Emoji tag block: U+E0000–U+E007F
    • Bidi overrides / isolates: U+202A–U+202E, U+2066–U+2069
    • Variation selectors: U+FE00–U+FE0F
    • Soft hyphen: U+00AD

    Normalize text (NFC or NFKC) and run confusable-character detection (Unicode UTS #39, ICU uspoof) to catch spoofing.

    Quick detection example (Python)

    # Minimal scanner: flag suspicious code points; do not mutate content by default.
    SUSPICIOUS_RANGES = [
        (0x200B, 0x200D),      # zero-width space, non-joiner, joiner
        (0x2060, 0x206F),      # word joiner, invisible operators, formatting
        (0xE0000, 0xE007F),    # emoji tag block
        (0x202A, 0x202E),      # bidi embeddings and overrides
        (0x2066, 0x2069),      # bidi isolates (subset of U+2060-U+206F, kept explicit)
        (0xFE00, 0xFE0F),      # variation selectors
        (0x00AD, 0x00AD),      # soft hyphen
    ]

    def find_suspicious_unicode(s: str) -> list[int]:
        """Return indices of characters in suspicious ranges (empty list = clean)."""
        hits = []
        for i, ch in enumerate(s):
            cp = ord(ch)
            if any(lo <= cp <= hi for lo, hi in SUSPICIOUS_RANGES):
                hits.append(i)
        return hits

    # Example: the emoji tag character after the smiley is caught at index 4.
    assert find_suspicious_unicode("hi 😊" + chr(0xE0041)) == [4]
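
    For the normalization step mentioned above, the standard library suffices; confusable detection needs ICU bindings (e.g., PyICU) and is left out of this sketch.

    import unicodedata

    def normalize_text(s: str) -> str:
        # NFKC folds compatibility characters (fullwidth forms, ligatures)
        # to canonical equivalents, so filters compare like with like.
        return unicodedata.normalize("NFKC", s)

    print(normalize_text("ïœˆïœ…ïœŒïœŒ"))  # fullwidth letters fold to "hello"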

    2. Treat all model-reachable text as untrusted

    • Sanitize tool descriptions, prompt templates, resource previews, and server-supplied strings before showing them to an LLM (a sketch follows this list).
    • Display a “hidden character detected” warning in operator UIs.
    • Require re-consent when manifests or tool metadata change.
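
    As a sketch of that policy, a hypothetical host-side gate over server-supplied strings, reusing the scanner above (the function name and error handling are illustrative):

    def vet_tool_metadata(name: str, description: str) -> None:
        # Hypothetical gate: refuse to forward tool metadata containing
        # hidden code points; surface the failure in the operator UI.
        for label, text in (("name", name), ("description", description)):
            if find_suspicious_unicode(text):
                raise ValueError(f"hidden characters detected in tool {label}")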

    3. Separate data from instructions

    • Use structured, typed arguments instead of free text (see the schema sketch after this list).
    • Add explicit policy prompts telling models to ignore invisible characters.
    • Employ multiple moderation or guardrail models with different tokenizers to avoid shared blind spots.
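
    To illustrate the schema approach, a sketch of typed tool arguments: with an enum-constrained action and a strictly patterned id, there is no free-text field for hidden instructions to ride in (field names are invented for the example):

    # Illustrative MCP tool input schema: no free-text arguments.
    TOOL_INPUT_SCHEMA = {
        "type": "object",
        "properties": {
            "action": {"enum": ["summarize", "close", "escalate"]},
            "ticket_id": {"type": "string", "pattern": "^[A-Z]{2}-[0-9]{1,6}$"},
        },
        "required": ["action", "ticket_id"],
        "additionalProperties": False,
    }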

    4. Limit tool capabilities

    • Add human-in-the-loop approvals for high-impact or write actions (sketched below).
    • Set rate limits, scope restrictions, and dry-run modes.
    • Whitelist allowed argument formats and verify outputs before committing side effects.
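
    A minimal sketch of that approval gate; the tool names and the execute/approve callbacks are assumptions for illustration, not an MCP API:

    # Hypothetical wrapper: write-capable tools pause for a human decision.
    WRITE_TOOLS = {"delete_record", "send_email", "update_ticket"}

    def call_tool(name: str, args: dict, execute, approve):
        if name in WRITE_TOOLS and not approve(name, args):
            raise PermissionError(f"human approval denied for {name}")
        return execute(name, args)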

    5. Monitor and test continuously

    • Log and alert on abnormal Unicode density (see the snippet after this list).
    • Include zero-width, bidi, and emoji-tag sequences in red-team tests.
    • Seed “canary” documents with known hidden characters to confirm your pipeline detects them.
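
    The density alert can be a simple ratio over the scanner’s hits; the threshold below is a starting point to tune against real traffic, not a calibrated figure:

    def invisible_density(s: str) -> float:
        # Fraction of characters falling in SUSPICIOUS_RANGES.
        return len(find_suspicious_unicode(s)) / max(len(s), 1)

    ALERT_THRESHOLD = 0.01  # illustrative; tune on your own corpus

    def should_alert(s: str) -> bool:
        return invisible_density(s) > ALERT_THRESHOLD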

    What “good” looks like in an MCP stack (Heimgard’s perspective)

    For teams securing MCP servers or managing a registry of approved tools and servers, aim for layered hygiene and verification across four levels:

    1. Registry Policy Gates

    • Quarantine or reject submissions whose metadata (name, description, examples) include suspicious Unicode; trigger human review before publication.
    • Enforce signed manifests, digest pinning, and artifact integrity verification at upload.

    2. Server Hardening

    • Sanitize all inbound parameters and outbound responses.
    • Prefer typed outputs (structured JSON, enums) over free-form text.
    • Emit an x-unicode-flags header that, for transparency, summarizes any characters stripped or flagged (one possible format sketched below).
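
    One way a server might populate that header (the header name comes from the bullet above; the summary format is an assumption):

    def unicode_flags_header(original: str, sanitized: str) -> str:
        # List code points present in the original but absent after
        # sanitization, e.g. "stripped=U+200D,U+FE0F"; "none" if unchanged.
        removed = sorted({f"U+{ord(c):04X}" for c in set(original) - set(sanitized)})
        return "stripped=" + ",".join(removed) if removed else "none"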

    3. Host Integration Controls

    • Sanitize all text crossing trust boundaries—tool docs, previews, RAG chunks—before it reaches the model.
    • Apply capability constraints and interactive approvals for tools that can write, delete, or modify data.
    • Keep audit trails of both raw and normalized transcripts for investigation and traceability.

    4. Continuous Verification

    • Regularly re-scan the registry for new Unicode or emoji evasion patterns.
    • Deploy synthetic canaries (known hidden sequences) to confirm the detection pipeline still catches them.

    References & Further Reading

    1. Promptfoo. Hidden and Invisible Unicode Used Against LLMs: How and Why, 2024
    2. arXiv. Emoji Attack: How Emoji Bias Judge LLMs, 2024
    3. arXiv. Special-Character Adversarial Attacks on Open-Source Language Model, 2025
    4. OWASP. LLM Prompt Injection, 2025. https://genai.owasp.org/llmrisk/llm01-prompt-injection/