Public safety prompts create systematic blind spots in evaluation frameworks by enabling targeted evasion.
Daan Henselmans
Arno Libert
Safety prompts are often used as benchmarks to test whether language models refuse harmful requests. When a widely circulated safety prompt enters training data, it can create prompt-specific blind spots rather than robust safety behaviour.
Specifically for Qwen 3 and LlaMA 3, we found significantly increased violation rates for the exact published prompt, as well as for semantically equivalent prompts of roughly the same size. This suggests some newer models learn the rule, but also develop localized attractors around canonical prompt formulations.
Robust safety evaluation likely requires families of held-out prompts, not single published exemplar responses.
Read the full article...
Read on LessWrong →
Daan Henselmans

Arno Libert
"Public safety prompts create systematic blind spots in evaluation frameworks by enabling targeted evasion."