Aithos Foundation
Minor Wording Changes, Major Shifts in AI Behavior
Nov 26, 2025

These findings fundamentally challenge how we evaluate AI systems.

Daan Henselmans

Derck Prinzhorn

TL;DR:

Testing frontier models on an insider trading scenario reveals that superficial prompt changes, like paraphrasing sentences or adjusting formatting, can swing violation rates from consistent refusal to frequent law-breaking. Llama 3.3, DeepSeek, Mistral, and Qwen all showed severe inconsistency, with three models swinging from 0% to 70–88% violation rates based solely on surface-level changes. Variations in wording and grammar are the largest driver of this effect, making such perturbations an important part of robust AI evaluations.

Our Findings:

In our experiments, we tested how minor, non-functional prompt variations affected model responses to an insider trading dilemma. We found that:

Three models exhibited extreme volatility: Llama 3.3 70B, DeepSeek 3.1, and Mistral Small 3.2 24B Instruct showed "violation spreads" of 70–88 percentage points (computed as sketched below). This means that, depending on purely superficial wording, the same model could swing from near-perfect compliance to consistently breaking the law.

Llama was the most sensitive: Its violation rate was almost entirely dependent on surface-level context. For example, simply changing the name of the company it was managing caused violation rates to swing from 43% (Meridian Partners) to 88% (Phoenix Asset Management).

Wording and grammar are the primary drivers: While changes to formatting had an impact, paraphrasing and sentence restructuring caused the largest shifts in behavior. This suggests models are "pattern matching" to specific linguistic cues rather than applying stable ethical reasoning.

Claude and GPT remained the exceptions: As in our earlier temperature experiments, these models remained robustly safe, maintaining near-zero violation rates regardless of how the scenario was phrased or formatted.
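To make the spread metric concrete, here is a minimal sketch of how it can be computed: estimate a violation rate for the same scenario under each surface-level variant, then take the difference between the highest and lowest rates. The variant names and rates below are illustrative placeholders, not our measured data.

```python
# Minimal sketch: "violation spread" = max - min violation rate across
# superficially different but semantically identical prompt variants.
# The rates below are illustrative placeholders, not measured results.

variant_violation_rates = {
    "original": 0.02,
    "paraphrased": 0.71,
    "sentences_reordered": 0.55,
    "formatting_changed": 0.34,
    "company_renamed": 0.88,
}

spread = max(variant_violation_rates.values()) - min(variant_violation_rates.values())
print(f"Violation spread: {spread * 100:.0f} percentage points")
```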

Implications for AI Safety Evaluations:

If a model’s ethical stance can be flipped by a comma or a company name, it hasn't learned a moral rule; it has simply learned a specific response pattern. To ensure true reliability, evaluations must include "perturbation testing" to see if safety holds up across thousands of slight wording variations.
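As a sketch of what such perturbation testing could look like in practice, the harness below applies a set of surface-level rewrites to a base scenario and compares the resulting violation rates. The model_query and judge callables are assumptions standing in for an actual model API and a violation classifier, and the example perturbations are hypothetical.

```python
# Sketch of a perturbation-testing harness (assumed interfaces, not a real API):
# `model_query(prompt) -> str` calls the model under test;
# `judge(response) -> bool` labels a response as a violation.

def perturbation_test(base_prompt, perturbations, model_query, judge, n_samples=50):
    """Return per-perturbation violation rates and their spread."""
    rates = {}
    for perturb in perturbations:
        variant = perturb(base_prompt)
        violations = sum(judge(model_query(variant)) for _ in range(n_samples))
        rates[perturb.__name__] = violations / n_samples
    spread = max(rates.values()) - min(rates.values())
    return rates, spread

# Hypothetical surface-level perturbations: meaning is preserved, only form changes.
def identity(prompt):
    return prompt  # baseline: the unmodified scenario

def rename_company(prompt):
    return prompt.replace("Meridian Partners", "Phoenix Asset Management")

def collapse_formatting(prompt):
    return " ".join(prompt.split())  # strip line breaks and extra whitespace
```

The key design point is that every perturbation should preserve the scenario's meaning; if safety behavior depends on which variant was used, the model is reacting to surface form rather than to the ethical content of the situation.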

Authors

Daan Henselmans

Derck Prinzhorn

"These findings fundamentally challenge how we evaluate AI systems."