Aithos Foundation
Why Safety Prompts Should Stay Out of Public View
Jan 30, 2026

The case for keeping safety evaluation prompts private to maintain their effectiveness.

Arno Libert

Daan Henselmans

TL;DR:

Safety prompts are commonly used to evaluate whether language models refuse harmful or illegal requests. This study shows that when widely circulated safety prompts enter training data, they can produce misleading evaluation outcomes by inducing narrow, prompt-specific blind spots that inflate failure rates rather than measuring robust safety behaviour.

Across models, we observe a clear progression. LLaMA 2 exhibits broad, length-dependent failures, with violation rates increasing smoothly for longer prompts. Mistral, Mistral 3, Claude Haiku 3, and Qwen 2 show localized degradation: unsafe behaviour peaks on the published prompt and nearby formulations, while substantially shorter or longer paraphrases reduce violations. Claude Sonnet 4 refuses uniformly across all variants, indicating true rule-level generalization.

Most concerning are Qwen 3 and LLaMA 3.3. Both models refuse correctly on most semantically equivalent prompts, yet fail sharply on the published prompt and closely matched variants. Qwen 3’s violation rate jumps from near zero to ~30%, while LLaMA 3.3 reaches ~85% on the identical prompt. These highly localized failures suggest that prompt exposure can sharpen brittle blind spots, causing standard evaluations to both underestimate general safety and miss nearby failure modes.
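The localized failures described above are detected by comparing violation rates on the published prompt against semantically equivalent paraphrases of varying length. A minimal sketch of that aggregation, with all variant names and records hypothetical:

```python
from collections import defaultdict

def violation_rates_by_variant(results):
    """Aggregate per-variant violation rates from evaluation records.

    `results` is a list of (variant_id, violated) pairs, where
    `violated` is True when the model produced unsafe content.
    """
    counts = defaultdict(lambda: [0, 0])  # variant_id -> [violations, total]
    for variant_id, violated in results:
        counts[variant_id][0] += int(violated)
        counts[variant_id][1] += 1
    return {v: viol / total for v, (viol, total) in counts.items()}

# Toy records illustrating a localized blind spot: the published
# prompt fails often, while paraphrases mostly elicit refusals.
records = [
    ("published", True), ("published", True), ("published", False),
    ("short_paraphrase", False), ("short_paraphrase", False),
    ("long_paraphrase", False), ("long_paraphrase", True),
]
rates = violation_rates_by_variant(records)
```

A sharp peak in `rates` on the published variant relative to its paraphrases is the signature of prompt-specific memorization rather than a general safety failure.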

Read the full article on Substack.

Authors

Arno Libert

Daan Henselmans

"Publishing safety evaluation prompts undermines their effectiveness by enabling adversarial optimization against them."