Leon Liangyu Chen   
Hi! 👋
I'm Leon, a first-year Computer Science Ph.D. student at Stanford University. I am fortunate to work with Prof. Nick Haber, Prof. Ludwig Schmidt, and Prof. Serena Yeung during my rotations.
Before my Ph.D., I completed my undergraduate studies and worked at Nanyang Technological University, advised by Prof. Ziwei Liu. I've also worked with Prof. Alan Yuille and Dr. Zongwei Zhou at Johns Hopkins University.
My research focuses on developing scalable frameworks for training and deploying AI agents. I work on multimodal reasoning agents, data-efficient training methodologies, data collection pipelines, and practical architectures that enable agents to perform complex reasoning tasks. My interests span from foundational model development to building robust pipelines for agent training and evaluation.
I'm passionate about bridging the gap between research and real-world applications of reasoning systems. Whether you're working on agent architectures, exploring collaborative research opportunities, or building practical reasoning applications, I'd love to connect and discuss potential synergies. Feel free to schedule a meeting.
Email / Google Scholar / Semantic Scholar / Github / LinkedIn
Selected Publications
Open Thoughts
Open Thoughts Team
Summary: The first open-source model trained on public reasoning data to match DeepSeek-R1-Distill's performance, developed through 1000+ systematic data curation and synthesis experiments.
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
Alejandro Lozano*, Min Woo Sun*, James Burgess*, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu, Anita Rau, Xiaohan Wang, Yuhui Zhang, Alfred Seunghoon Song, Robert Tibshirani, Serena Yeung-Levy
Computer Vision and Pattern Recognition (CVPR), 2025
Summary: A large, categorized dataset of 24+ million biomedical image-text pairs spanning multiple domains, with expert-guided annotations.
Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Simon Yu*, Liangyu Chen*, Sara Ahmadian, Marzieh Fadaee
ICML Workshop on DataWorld: Unifying Data Curation Frameworks Across Domains, 2025
Summary: Prioritizing global data diversity over local instance quality for instruction data selection, using iterative k-means clustering.
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
Yubo Ma*, Yuhang Zang*, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun
Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024 (Spotlight)
Summary: A challenging long-context multi-modal document understanding benchmark with 1,062 expert-annotated questions requiring cross-page reasoning.
MMInA: Benchmarking Multihop Multimodal Internet Agents
Shulin Tian*, Ziniu Zhang*, Liangyu Chen*, Ziwei Liu
Association for Computational Linguistics (ACL) Findings, 2025
Summary: The first benchmark evaluating AI agents on evolving real-world websites with 1,050 human-written multihop tasks requiring long-range reasoning.
Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order
Aurora-M Team
International Conference on Computational Linguistics (COLING) Industry Track, 2025
Summary: An open-source multilingual model fine-tuned on human-reviewed safety instructions aligned with the Biden-Harris AI safety executive order.
Benchmarking and Analyzing Generative Data for Visual Recognition
Bo Li, Haotian Liu, Liangyu Chen, Yong Jae Lee, Chunyuan Li, Ziwei Liu
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
Summary: An extensive benchmark systematically analyzing generative data across visual recognition tasks with a novel CLER Score metric.
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li*, Yuanhan Zhang*, Liangyu Chen*, Jinghao Wang*, Fanyi Pu*, Joshua Adrian Cahyono, Jingkang Yang, Ziwei Liu
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
Summary: The first large-scale multi-modal instruction tuning dataset, with 2.8 million instruction-response pairs derived from images and videos; a model trained on this dataset achieves state-of-the-art performance on vision-language tasks.
Large Language Models are Visual Reasoning Coordinators
Liangyu Chen*, Bo Li*, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, Ziwei Liu
Neural Information Processing Systems (NeurIPS), 2023
ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023
Summary: The first multimodal agent study to use natural language as the communication medium for LLMs to coordinate multiple vision-language models for complex reasoning.
Panoptic Video Scene Graph Generation
Jingkang Yang, Wenxuan Peng, Xiangtai Li, Zujin Guo, Liangyu Chen, Bo Li, Zheng Ma, Wayne Zhang, Kaiyang Zhou, Chen Change Loy, Ziwei Liu
Computer Vision and Pattern Recognition (CVPR), 2023
Summary: Extends scene graph generation from static images to dynamic videos with unified panoptic understanding of objects and stuff.
Making Your First Choice: To Address Cold Start Problem in Vision Active Learning
Liangyu Chen, Yutong Bai, Siyu Huang, Yongyi Lu, Bihan Wen, Alan Yuille, Zongwei Zhou
Medical Imaging with Deep Learning (MIDL), 2023
Radiological Society of North America (RSNA), Abstracts, 2024
NeurIPS Workshop on Human in the Loop Learning, 2022
Summary: A systematic approach to solving the cold start problem in vision active learning using self-supervised contrastive features without labels.
Automatic Calcification Morphology and Distribution Classification for Breast Mammograms with Multi-task Graph Convolutional Neural Network
Hao Du, Melissa Min-Szu Yao, Siqi Liu, Liangyu Chen, Wing P. Chan, Mengling Feng
Journal of Biomedical and Health Informatics (JBHI), 2023
Summary: Modeling spatial and visual relationships among calcifications using graph convolutional networks for breast cancer diagnosis.
Baconian: A Unified Open-source Framework for Model-Based Reinforcement Learning
Linsen Dong, Guanyu Gao, Xinyi Zhang, Liangyu Chen, Yonggang Wen
arXiv preprint, 2020
Summary: A unified open-source framework specifically designed for model-based reinforcement learning research with modular components.
I love teaching AI and was honored to coach Singapore's teams for the first International Olympiad in Artificial Intelligence in 2024. Our two teams excelled on the global stage, securing two of the four gold medals awarded in the Scientific Round.
Reviewer for The Visual Computer, IET Computer Vision, IJCV, TMLR, NeurIPS 2025/23, ICLR 2025, COLM 2025, AAAI 2025, ECCV 2024, CVPR 2024, MLHC 2025, ICCV CVAMD 2023, CVPR CVinW 2023, ICML IMLH 2023/22, NeurIPS GenAI4Health 2024.
I host virtual office hours for anyone who wants to share thoughts on AI research, reading (scientific or otherwise), grad school applications and support, or any other topics of interest. Please schedule via my calendar.