Ornis Research

About our research

Ornis Research is an independent AI alignment laboratory. We study the behavior of large language models under realistic deployment pressures, the internal mechanisms that produce that behavior, and the relationships between these systems and the people who rely on them. Our work spans interpretability, capability evaluation, and alignment methodology. We share the field's view that AI safety is a genuinely open problem, and we believe it benefits when these questions are pursued from a wider set of research traditions than the current literature reflects.

Research themes

Our work currently clusters around the following themes.

Interpretability and model internals

We investigate what large language models represent internally, and how those representations give rise to behavior in deployment-realistic settings. We are particularly interested in the link between probing readouts and behavioral policies — and in the methodological standards needed to distinguish a feature that correlates with a behavior from a feature that causes it.

Capability evaluation and elicitation

Some failure modes — deception, scheming, sandbagging, motivated reasoning — resist naive measurement, because their absence under one elicitation condition does not imply absence under another. We design paired benchmarks with directional controls that isolate the construct of interest from broader instruction-following, and we treat measurement that survives adversarial scrutiny as the standard worth reaching.

Human–AI interaction and trust

AI safety is also a question about systems that include humans — reviewers, auditors, decision-makers, end users — who form expectations about model behavior and adjust their actions accordingly. We study how these expectations form, where they break under load, and how evaluation procedures can account for the human side of the loop without diluting the technical content of the alignment problem.

Alignment methodology

Post-training pathways shape model behavior in ways that are often opaque from the outside and are not captured by standard capability benchmarks. We study how different post-training choices — instruction tuning, reasoning distillation, constitutional approaches, RLHF and its successors — produce different deployment-relevant policies, and what this implies for how alignment claims should be reported and audited.

Cross-cultural perspectives on AI safety

Alignment targets — what counts as a good outcome, what counts as a deceptive output, what counts as appropriate deference — are not culturally neutral, and the field's center of gravity has been narrow so far. We are interested in how different research traditions frame these questions, as both an empirical broadening of the work and a robustness check on safety claims validated in only one context.

Current work

Several manuscripts are in preparation for submission. Where possible, we release pre-registrations, code, and full result tables in advance of publication, so that the work is auditable and reusable by the wider research community.

How we work

We operate as an independent research group, unaffiliated with any large company or specific university. Most of our work is published openly — papers, blog posts, pre-registrations, code, and result tables — and we do not undertake proprietary research. We place a high weight on methodological rigor and reproducibility: paired controls, pre-registered verdict criteria, public datasets and analysis scripts, and walking back claims when replication fails. The lab writes for the international academic literature, and we welcome correspondence from researchers working on similar questions in alignment, evaluation, and interpretability.