Our work currently clusters around the following themes.
Interpretability and model internals
We investigate what large language models represent internally, and how those representations give rise to behavior in deployment-realistic settings. We are particularly interested in the link between probing readouts and behavioral policies — and in the methodological standards needed to distinguish a feature that correlates with a behavior from a feature that causes it.
Capability evaluation and elicitation
Some failure modes — deception, scheming, sandbagging, motivated reasoning — resist naive measurement, because their absence under one elicitation condition does not imply absence under another. We design paired benchmarks with directional controls that isolate the construct of interest from broader instruction-following, and we treat measurement that survives adversarial scrutiny as the standard worth reaching.
Human–AI interaction and trust
AI safety is also a question about systems that include humans — reviewers, auditors, decision-makers, end users — who form expectations about model behavior and adjust their actions accordingly. We study how these expectations form, where they break under load, and how evaluation procedures can account for the human side of the loop without diluting the technical content of the alignment problem.
Alignment methodology
Post-training pathways shape model behavior in ways that are often opaque from the outside and are not captured by standard capability benchmarks. We study how different post-training choices — instruction tuning, reasoning distillation, constitutional approaches, RLHF and its successors — produce different deployment-relevant policies, and what this implies for how alignment claims should be reported and audited.
Cross-cultural perspectives on AI safety
Alignment targets — what counts as a good outcome, what counts as a deceptive output, what counts as appropriate deference — are not culturally neutral, and the field's center of gravity has been narrow so far. We are interested in how different research traditions frame these questions, as both an empirical broadening of the work and a robustness check on safety claims validated in only one context.