Zaiyan Xu

I am a Ph.D. student in the Department of Electrical and Computer Engineering at Texas A&M University, advised by Dr. Dileep Kalathil. Prior to joining TAMU, I attended the University of Illinois at Urbana-Champaign, where I received my B.S. degree in Statistics and Computer Science as well as in Actuarial Science (double major) in 2020.

My research interests include reinforcement learning, distributionally robust optimization, and large language model (LLM) alignment.

Curriculum Vitae

Education
  • Texas A&M University
    Ph.D. in Computer Engineering
    Aug. 2020 - present
  • University of Illinois at Urbana-Champaign
    B.S. in Statistics and Computer Science / Actuarial Science
    Aug. 2015 - Jul. 2020
Experience
  • Mitsubishi Electric Research Laboratories
    Research Intern
    May 2023 - Aug. 2023
Honors & Awards
  • NeurIPS Top Reviewer
    2023
  • Dept. of Electrical and Computer Engineering Graduate Merit Fellowship
    2020
  • Willis Towers Watson Actuarial Science Scholarship, Dept. of Mathematics
    2018
Selected Publications
Distributionally Robust Direct Preference Optimization

Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran

arXiv Preprint. Under review. 2025

We tackle the problem of aligning large language models (LLMs) with human preferences under preference distribution shifts using distributionally robust optimization. We introduce two novel distributionally robust direct preference optimization (DPO) algorithms—Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO)—and analyze their sample complexity. Additionally, we develop scalable gradient-based implementations for WDPO and KLDPO, demonstrating their superior alignment performance empirically in scenarios with preference distribution shifts.

Improved Sample Complexity Bounds For Distributionally Robust Reinforcement Learning

Zaiyan Xu*, Kishan Panaganti*, Dileep Kalathil (* equal contribution)

The 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023

We study the problem of learning a robust control policy that handles parameter mismatches between training and testing environments, formulated as a distributionally robust reinforcement learning (DR-RL) problem in a tabular episodic setting. We propose Robust Phased Value Learning (RPVL), an algorithm capable of solving DR-RL problems defined by uncertainty sets based on total variation, chi-square, Kullback-Leibler, and Wasserstein divergences. We establish that our algorithm achieves improved sample complexity of $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}| H^{5})$ and present the first known complexity results for the Wasserstein uncertainty set, validated through simulation experiments.

Robust Reinforcement Learning Using Offline Data

Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh

The Thirty-Sixth Annual Conference on Neural Information Processing Systems (NeurIPS) 2022

We propose Robust Fitted Q-Iteration (RFQI), an algorithm designed for robust reinforcement learning (RL) using only offline data, addressing key challenges such as model uncertainty and the computational complexity introduced by the robust Bellman operator. We theoretically show that RFQI achieves near-optimal robust performance under standard conditions. Empirical experiments demonstrate its effectiveness on benchmark tasks compared to existing approaches.
