Aithos Foundation
Why Safety Prompts Should Stay Out of Public View
Jan 30, 2026

The case for keeping safety evaluation prompts private to maintain their effectiveness.

Arno Libert

Daan Henselmans

TL;DR:

Safety prompts are commonly used to evaluate whether language models refuse harmful or illegal requests. This study shows that when widely circulated safety prompts enter training data, they can produce misleading evaluation outcomes by inducing narrow, prompt-specific blind spots that inflate failure rates rather than measuring robust safety behaviour.

Across models, we observe a clear progression. LLaMA 2 exhibits broad, length-dependent failures, with violation rates increasing smoothly for longer prompts. Mistral, Mistral 3, Claude Haiku 3, and Qwen 2 show localized degradation: unsafe behaviour peaks on the published prompt and nearby formulations, while substantially shorter or longer paraphrases reduce violations. Claude Sonnet 4 refuses uniformly across all variants, indicating true rule-level generalization.

Most concerning are Qwen 3 and LLaMA 3.3. Both models refuse correctly on most semantically equivalent prompts, yet fail sharply on the published prompt and closely matched variants. Qwen 3’s violation rate jumps from near zero to ~30%, while LLaMA 3.3 reaches ~85% on the identical prompt. These highly localized failures suggest that prompt exposure can sharpen brittle blind spots, causing standard evaluations to both underestimate general safety and miss nearby failure modes.
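The localized failures described above are detected by comparing violation rates on the published prompt against semantically equivalent paraphrases of varying length. A minimal sketch of that aggregation, with all variant names and records hypothetical:

```python
from collections import defaultdict

def violation_rates_by_variant(results):
    """Aggregate per-variant violation rates from evaluation records.

    `results` is a list of (variant_id, violated) pairs, where
    `violated` is True when the model produced unsafe content.
    """
    counts = defaultdict(lambda: [0, 0])  # variant_id -> [violations, total]
    for variant_id, violated in results:
        counts[variant_id][0] += int(violated)
        counts[variant_id][1] += 1
    return {v: viol / total for v, (viol, total) in counts.items()}

# Toy records illustrating a localized blind spot: the published
# prompt fails often, while paraphrases mostly elicit refusals.
records = [
    ("published", True), ("published", True), ("published", False),
    ("short_paraphrase", False), ("short_paraphrase", False),
    ("long_paraphrase", False), ("long_paraphrase", True),
]
rates = violation_rates_by_variant(records)
```

A sharp peak in `rates` on the published variant relative to its paraphrases is the signature of prompt-specific memorization rather than a general safety failure.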

Read the full article on Substack.

Authors

Arno Libert

Daan Henselmans

"Publishing safety evaluation prompts undermines their effectiveness by enabling adversarial optimization against them."