An independent AI alignment lab. Our work focuses on capability evaluation, alignment methodology, and the human side of the loop.

Capability evaluation and elicitation
Paired benchmarks with directional controls for failure modes (deception, scheming, sandbagging, motivated reasoning) that resist naive measurement.
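As a minimal sketch of what a directional control can look like for one of these failure modes, sandbagging: each task is scored under a neutral framing and under a framing that rewards underperformance, and the signed gap between the two arms, rather than raw accuracy, is the measurement. Everything below (the framing fields, the `query_model` callable, the sandbagging focus) is an illustrative assumption, not this lab's actual benchmark design.

```python
"""Sketch of a paired benchmark with a directional control for sandbagging.

Hypothetical names throughout: `query_model` stands in for whatever
inference call is used; the framing strings live in the items themselves.
"""
from dataclasses import dataclass
from typing import Callable


@dataclass
class PairedItem:
    task: str            # the underlying problem, identical in both arms
    neutral_frame: str   # framing with no incentive to underperform
    pressure_frame: str  # framing that rewards hiding the capability
    answer: str          # gold answer, used to score both arms


def directional_score(
    items: list[PairedItem],
    query_model: Callable[[str], str],  # hypothetical inference function
) -> float:
    """Mean signed accuracy gap (neutral minus pressure) over paired items.

    ~0 : performance is insensitive to the incentive framing.
    >0 : the model does worse exactly when underperforming is rewarded,
         the directional signature of sandbagging rather than a plain
         capability limit.
    """
    gaps = []
    for item in items:
        neutral_ok = query_model(f"{item.neutral_frame}\n{item.task}").strip() == item.answer
        pressure_ok = query_model(f"{item.pressure_frame}\n{item.task}").strip() == item.answer
        gaps.append(int(neutral_ok) - int(pressure_ok))
    return sum(gaps) / len(gaps)
```

The point of pairing is that a genuine capability deficit depresses both arms roughly equally and cancels out of the gap, while a motivated failure shows up as a signed difference, which is what lets the measurement resist naive single-score reporting.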
Alignment methodology
How different post-training choices (instruction tuning, reasoning distillation, RLHF and its successors) produce different deployment-relevant policies, often invisible to standard capability benchmarks.
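One way to make "invisible to standard capability benchmarks" concrete: score two post-training variants of the same base model both on an exact-match capability set and on a crude behavioral probe, such as hedge or refusal rate, and compare where they separate. The probe, the keyword list, and the `Model` type below are hypothetical placeholders under that assumption, not a real harness.

```python
"""Sketch: two post-training recipes can tie on a capability benchmark
yet diverge as deployment policies. All names here are hypothetical."""
from typing import Callable

Model = Callable[[str], str]  # stand-in for an inference endpoint


def accuracy(model: Model, qa_pairs: list[tuple[str, str]]) -> float:
    """Standard capability score: fraction of exact-match answers."""
    return sum(model(q).strip() == a for q, a in qa_pairs) / len(qa_pairs)


def hedge_rate(model: Model, prompts: list[str]) -> float:
    """A deployment-relevant policy probe: how often the model hedges or
    refuses. Keyword matching is a deliberately crude placeholder."""
    markers = ("i can't", "i cannot", "as an ai", "i'm not able")
    return sum(
        any(m in model(p).lower() for m in markers) for m in prompts
    ) / len(prompts)


# Two checkpoints from the same base, different post-training: accuracy()
# may match between them while hedge_rate() separates them, which is the
# kind of policy difference a capability-only benchmark never reports.
```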
The human side of the loop
How expectations form between humans and AI systems, where those expectations break under load, and how evaluation can account for the human side of the loop.