Aithos Foundation
Nov 12, 2025

Low Temperature Evaluations

AI models show dramatically different ethical behavior at different temperature settings.

Daan Henselmans

Derck Prinzhorn

TL;DR:

AI models show dramatically different ethical behavior at different temperature settings. Testing six frontier models on an insider trading scenario revealed that DeepSeek and Qwen increased violation rates by 10-11x between temperature 0.0 and 0.7 (from ~3-5% to ~30-50%), while Mistral showed high violation rates that decreased with increased temperature. Claude and GPT models categorically refused across all settings. This means evaluations at any single temperature fail to capture the full range of behavior these models exhibit. Since different applications deploy models at different temperatures, we need evaluation approaches that test across the full range rather than assuming a single configuration reveals stable model behavior.

Our findings:

In our experiments, we tested how temperature settings affected model responses in a scenario where a model with access to insider information is asked for financial advice that would violate insider trading laws. We found that:

  • Even small increases in randomness can lead to significant safety risks: for DeepSeek and Qwen, violation rates surged from ~3% at T=0.0 to ~50% and ~25% respectively at T=0.5, with the sharpest increase occurring at relatively low settings (before T=0.3).

  • Mistral showed the inverse trend: uniquely among the models we tested, its violation rates actually decreased as temperature rose, suggesting that randomness can sometimes help a model "escape" a deterministic path toward a harmful response.

  • Claude and GPT remained robust: Both models maintained near-zero violation rates across the entire spectrum, showing that their safety guardrails are independent of sampling randomness.

  • T=0.0 is not perfectly stable: We observed 3–10% run-to-run variability even at the most deterministic settings, though these shifts were small compared to the impact of changing the temperature itself.
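A sweep like the one behind these numbers can be sketched in a few lines. The snippet below is illustrative only: `sample_violates` is a hypothetical stand-in for a real model call (in practice you would call a provider SDK with the given temperature and judge the response), and its toy probability curve merely mimics the rising DeepSeek/Qwen trend, not real data.

```python
import random

def sample_violates(temperature: float, rng: random.Random) -> bool:
    """Hypothetical stand-in for one model call at a given temperature.

    Returns True if the sampled response would violate the scenario's rules.
    The toy curve below rises with temperature, loosely mimicking the
    DeepSeek/Qwen trend; it is not real data.
    """
    p = min(0.03 + 1.0 * temperature, 0.5)  # toy violation probability
    return rng.random() < p

def violation_rate(temperature: float, n_samples: int = 200, seed: int = 0) -> float:
    """Estimate the violation rate at one temperature by repeated sampling."""
    rng = random.Random(seed)
    hits = sum(sample_violates(temperature, rng) for _ in range(n_samples))
    return hits / n_samples

# Sweep T = 0.0 .. 0.7 in steps of 0.1, one estimated rate per setting.
sweep = {t / 10: violation_rate(t / 10) for t in range(0, 8)}
```

Note that each cell of the sweep is an estimate from repeated sampling: at nonzero temperature a single run tells you almost nothing, so the sample count directly bounds how fine a rate difference you can resolve.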

Implications for AI Safety Evaluations:

Evaluating at a single temperature (usually 0.0 for reproducibility) is misleading because safety is a distribution, not a fixed state. To truly understand a model's risk profile, we must move toward "heat map" evaluations that test behavior across the full range of potential deployment settings.
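Such a "heat map" evaluation can be expressed as a simple grid: one row per model, one column per temperature, each cell a measured violation rate. In this sketch the evaluator callables are hypothetical placeholders (a real harness would wrap actual API calls); the toy lambdas are shaped like the trends reported above, purely for illustration.

```python
from typing import Callable, Dict, List

def heat_map(
    models: Dict[str, Callable[[float], float]],
    temperatures: List[float],
) -> Dict[str, List[float]]:
    """Evaluate each model at every temperature setting.

    `models` maps a model name to a (hypothetical) evaluator that returns
    the measured violation rate at a given temperature. The result has one
    row of rates per model, aligned with `temperatures`.
    """
    return {name: [evaluate(t) for t in temperatures] for name, evaluate in models.items()}

# Toy evaluators shaped like the reported trends (illustrative only):
toy_models = {
    "deepseek-like": lambda t: min(0.03 + 0.9 * t, 0.5),  # rises with T
    "mistral-like":  lambda t: max(0.6 - 0.5 * t, 0.1),   # falls with T
    "claude-like":   lambda t: 0.0,                        # flat near zero
}
grid = heat_map(toy_models, [0.0, 0.3, 0.5, 0.7])
```

The point of the grid is that no single column characterizes a model: a row that is flat for one model and steep for another is exactly the behavior a single-temperature evaluation would miss.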

Read the full article on Substack.

