An independent AI alignment lab. Our work focuses on capability evaluation, alignment methodology, and the human side of the loop.

Capability evaluation and elicitation
Paired benchmarks with directional controls for failure modes (deception, scheming, sandbagging, motivated reasoning) that resist naive measurement.
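As a minimal sketch of what a directional control can look like for one of these failure modes, sandbagging: each task is scored under a neutral framing and under a framing that rewards underperformance, and the signed gap between the two arms, rather than raw accuracy, is the measurement. Everything below (the framing fields, the `query_model` callable, the sandbagging focus) is an illustrative assumption, not this lab's actual benchmark design.

```python
"""Sketch of a paired benchmark with a directional control for sandbagging.

Hypothetical names throughout: `query_model` stands in for whatever
inference call is used; the framing strings live in the items themselves.
"""
from dataclasses import dataclass
from typing import Callable


@dataclass
class PairedItem:
    task: str            # the underlying problem, identical in both arms
    neutral_frame: str   # framing with no incentive to underperform
    pressure_frame: str  # framing that rewards hiding the capability
    answer: str          # gold answer, used to score both arms


def directional_score(
    items: list[PairedItem],
    query_model: Callable[[str], str],  # hypothetical inference function
) -> float:
    """Mean signed accuracy gap (neutral minus pressure) over paired items.

    ~0 : performance is insensitive to the incentive framing.
    >0 : the model does worse exactly when underperforming is rewarded,
         the directional signature of sandbagging rather than a plain
         capability limit.
    """
    gaps = []
    for item in items:
        neutral_ok = query_model(f"{item.neutral_frame}\n{item.task}").strip() == item.answer
        pressure_ok = query_model(f"{item.pressure_frame}\n{item.task}").strip() == item.answer
        gaps.append(int(neutral_ok) - int(pressure_ok))
    return sum(gaps) / len(gaps)
```

The point of pairing is that a genuine capability deficit depresses both arms roughly equally and cancels out of the gap, while a motivated failure shows up as a signed difference, which is what lets the measurement resist naive single-score reporting.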
Alignment methodology
How different post-training choices (instruction tuning, reasoning distillation, RLHF and its successors) produce different deployment-relevant policies, often invisible to standard capability benchmarks.
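One way to make "invisible to standard capability benchmarks" concrete: score two post-training variants of the same base model both on an exact-match capability set and on a crude behavioral probe, such as hedge or refusal rate, and compare where they separate. The probe, the keyword list, and the `Model` type below are hypothetical placeholders under that assumption, not a real harness.

```python
"""Sketch: two post-training recipes can tie on a capability benchmark
yet diverge as deployment policies. All names here are hypothetical."""
from typing import Callable

Model = Callable[[str], str]  # stand-in for an inference endpoint


def accuracy(model: Model, qa_pairs: list[tuple[str, str]]) -> float:
    """Standard capability score: fraction of exact-match answers."""
    return sum(model(q).strip() == a for q, a in qa_pairs) / len(qa_pairs)


def hedge_rate(model: Model, prompts: list[str]) -> float:
    """A deployment-relevant policy probe: how often the model hedges or
    refuses. Keyword matching is a deliberately crude placeholder."""
    markers = ("i can't", "i cannot", "as an ai", "i'm not able")
    return sum(
        any(m in model(p).lower() for m in markers) for m in prompts
    ) / len(prompts)


# Two checkpoints from the same base, different post-training: accuracy()
# may match between them while hedge_rate() separates them, which is the
# kind of policy difference a capability-only benchmark never reports.
```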
The human side of the loop
How expectations form between humans and AI systems, where those expectations break under load, and how evaluation can account for the human side of the loop.