About Me

Hi there! I am a research scientist at Salesforce AI Research, led by Silvio Savarese.

Previously, I completed my PhD at UC San Diego, working with Julian McAuley. My research interests are in vision & language, with a current focus on building and understanding scalable models, e.g., multimodal LLMs, DiTs, and unified models.

Selected Research:

Models

Blip3o-Next: Next frontier of native image generation
• Unified model for image generation and editing.
• Jiuhai Chen et al.

BLIP-3: A Family of Open Large Multimodal Models
• Open recipe of data and training for M-LLMs.
• ICCV 2025, Le Xue et al.

An Empirical Study of Attention Mechanisms in Video Diffusion Models
• Full training pipeline for autoregressive video models on TPUs with JAX.
• ICCV 2025 workshop, An Yan et al.

MTA-Agent: An Open Recipe for Multimodal Deep Search Agents
• Synthetic RL data pipeline to train multimodal search agents.
• Xiangyu Peng et al.

Datasets & Benchmarks

How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning
• Building real houses with visual coding agents.
• Luyu Yang et al.

Bridging Language and Items for Retrieval and Recommendation
• LLM semantic benchmarking & Amazon dataset.
• ACL 2026, Yupeng Hou et al.

Trust but Verify: Programmatic VLM Evaluation in the Wild
• Automating VLM benchmark creation.
• ICCV 2025, Viraj Prabhu et al.

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
• Synthetic data recipe for M-LLM post-training.
• COLM 2024, An Yan et al.

Personalized Showcases: Generating Multi-Modal Explanations for Recommendations
• Multimodal personalization & Google review dataset.
• SIGIR 2023, An Yan et al.

Misc

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
• An early attempt at multimodal GUI agents.
• An Yan et al.

Learning Concise and Descriptive Attributes for Visual Recognition
• Concept pruning for interpretable vision models.
• ICCV 2023, An Yan et al.

RadBERT: Adapting Language Models to Radiology
• A family of widely used medical language models.
• Journal of Radiology 2022, An Yan et al.

Work Experience

Research Intern at Microsoft, Redmond, WA.
Hosts: Zhengyuan Yang, Jianwei Yang, Jianfeng Wang, Linjie Li, Zicheng Liu, Lijuan Wang.
GPT-4V as agents; data recipes and training of multimodal LLMs.
Sep 2023 - Mar 2024.

Research Intern at Adobe, San Jose, CA.
Hosts: Raghav Addanki, Zhao Song, Tong Yu.
Gradient-based constrained sampling from LMs.
Jun 2023 - Sep 2023.

Research Intern at Meta, Menlo Park, CA.
Hosts: Cem Akkaya, Licheng Yu, Jian Jin.
Multi-modal pre-training for ads understanding and generation.
Jun 2022 - Sep 2022.

Applied Scientist Intern at Amazon, Seattle, WA.
Hosts: Chaosheng Dong, Yan Gao, Jinmiao Fu, Tong Zhao.
Personalized complementary recommendation. Among the top 10 most-viewed publications of 2022 at Amazon Science.
Jun 2021 - Sep 2021.

Applied Scientist Intern at Amazon, Santa Barbara, CA.
Hosts: Craig Bennett, Nic Jedema.
QA quality evaluation with BERT.
Jun 2020 - Sep 2020.

Education

University of California San Diego
Ph.D. & M.S. in Computer Science
Sep 2018 - Mar 2024.

University of Science and Technology of China
B.E. in Electronic Engineering & Information Science
Sep 2014 - Jun 2018.