Aithos Research Foundation

TL;DR:

Safety prompts are often used as benchmarks to test whether language models refuse harmful requests. When a widely circulated safety prompt enters training data, it can create prompt-specific blind spots rather than robust safety behaviour.

Specifically for Qwen 3 and LlaMA 3, we found significantly increased violation rates for the exact published prompt, as well as for semantically equivalent prompts of roughly the same size. This suggests some newer models learn the rule, but also develop localized attractors around canonical prompt formulations.

Robust safety evaluation likely requires families of held-out prompts, not single published exemplar responses.

Published Safety Prompts May Create Evaluation Blind Spots

TL;DR:

Authors