Aithos Foundation

BLOG

Research updates, essays, and insights on AI alignment and value pluralism.

1. Substack

Low Temperature Evaluations

Examining how sampling-temperature settings affect AI model evaluations, and what this means for the reliability of safety and alignment assessments.

2. LessWrong + Substack

Minor Wording Changes Produce Major Shifts in AI Behavior

How subtle prompt modifications can lead to dramatically different AI outputs, highlighting how sensitive current alignment approaches are to linguistic details.

3. Substack

Why Safety Prompts Should Stay Out of Public View

An argument for keeping safety evaluation prompts private, so that they cannot be adversarially optimized against and remain effective in AI safety assessments.

4. LessWrong

Published Safety Prompts May Create Evaluation Blind Spots

An investigation into how publicly available safety prompts might inadvertently create vulnerabilities by allowing bad actors to optimize against known safety measures.

5. LessWrong + Substack

Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking

Analysis of Claude Opus 4.6's reasoning patterns when engaging in alignment faking behavior, exploring what the model thinks but doesn't explicitly state.

More content

Follow us on Substack for regular updates, or explore our research on LessWrong.