Low Temperature Evaluations
Examining how temperature settings affect AI model evaluations and the implications for safety and alignment assessments.
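At its core, a sampling temperature rescales a model's output logits before the softmax, which is why evaluation results can shift with the temperature setting. A minimal sketch (with hypothetical logits, not taken from any particular model) shows how a low temperature concentrates probability mass on the top token, making evaluated outputs nearly deterministic:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by T before normalizing: low T sharpens the
    # distribution toward the argmax, high T flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for illustration only.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.1)   # near-greedy sampling
high = softmax_with_temperature(logits, 2.0)  # much flatter distribution

print(low)
print(high)
```

With these example logits, the low-temperature distribution puts almost all probability on the first token, while the high-temperature one spreads mass across all three, so repeated evaluation runs at low temperature will see far less output variance.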
Research updates, essays, and insights on AI alignment and value pluralism.
How subtle prompt modifications can lead to dramatically different AI outputs, highlighting the sensitivity of current alignment approaches to linguistic details.
Arguing for keeping safety evaluation prompts private to prevent adversarial optimization and maintain their effectiveness in AI safety assessments.
An investigation into how publicly available safety prompts might inadvertently create vulnerabilities by allowing bad actors to optimize against known safety measures.
Follow us on Substack for regular updates, or explore our research on LessWrong.