Hello
I spend my time thinking about how to make powerful AI systems honest and aligned with human values. I work on things like honesty training, character training, dangerous capability evaluations, and control.
I’m currently an Anthropic Fellow working on alignment research with Jon Kutasov, Sara Price, and Sam Marks. Previously, I was the Program Lead and TA of ARENA, an ML engineering program that upskills people for technical AI safety work. Before that, I was the director of the Cambridge AI Safety Hub, where I founded the research program MARS and led upskilling ML programs like CaMLAB. I have an MSc in machine learning from UCL and a BA (Hons) in psychology & neuroscience from the University of Cambridge.
Publications & other work
- Chloe Li, Mary Phuong, Daniel Tan (2025). Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives. Under review at ICLR 2026.
- Chloe Li, Mary Phuong, Noah Y. Siegel (2025). LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring. In Proceedings of IJCNLP-AACL 2025. (Oral Presentation)
- ARENA LLM Evaluations Curriculum