Low Temperature Evaluations
Examining how temperature settings affect AI model evaluations and the implications for safety and alignment assessments.
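At its core, a sampling temperature rescales a model's output logits before the softmax, which is why evaluation results can shift with the temperature setting. A minimal sketch (with hypothetical logits, not taken from any particular model) shows how a low temperature concentrates probability mass on the top token, making evaluated outputs nearly deterministic:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by T before normalizing: low T sharpens the
    # distribution toward the argmax, high T flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for illustration only.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.1)   # near-greedy sampling
high = softmax_with_temperature(logits, 2.0)  # much flatter distribution

print(low)
print(high)
```

With these example logits, the low-temperature distribution puts almost all probability on the first token, while the high-temperature one spreads mass across all three, so repeated evaluation runs at low temperature will see far less output variance.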
Research updates, essays, and insights on AI alignment and value pluralism.
How subtle prompt modifications can lead to dramatically different AI outputs, highlighting the sensitivity of current alignment approaches to linguistic details.
Arguing for keeping safety evaluation prompts private to prevent adversarial optimization and maintain their effectiveness in AI safety assessments.
An investigation into how publicly available safety prompts might inadvertently create vulnerabilities by allowing bad actors to optimize against known safety measures.
Follow us on Substack for regular updates, or explore our research on LessWrong.