I am a Ph.D. student in the Department of Electrical and Computer Engineering at Texas A&M University, advised by Dr. Dileep Kalathil. Prior to joining TAMU, I attended the University of Illinois Urbana-Champaign, where I received my B.S. in Computer Science and Statistics and in Actuarial Science (a double major) in 2020.
My research focuses on reinforcement learning, distributionally robust optimization, and large language model (LLM) alignment.
") does not match the recommended repository name for your site ("
").
", so that your site can be accessed directly at "http://
".
However, if the current repository name is intended, you can ignore this message by removing "{% include widgets/debug_repo_name.html %}
" in index.html
.
",
which does not match the baseurl
("
") configured in _config.yml
.
baseurl
in _config.yml
to "
".
Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran
arXiv Preprint. Under review. 2025
We tackle the problem of aligning large language models (LLMs) with human preferences under preference distribution shifts using distributionally robust optimization. We introduce two novel distributionally robust direct preference optimization (DPO) algorithms—Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO)—and analyze their sample complexity. Additionally, we develop scalable gradient-based implementations for WDPO and KLDPO, demonstrating their superior alignment performance empirically in scenarios with preference distribution shifts.
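At its core, a KL-robust preference objective replaces the average DPO loss with a worst-case reweighting of per-sample losses over a KL ball around the empirical preference distribution. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's KLDPO implementation: the function names, the temperature `tau`, and the exponential-tilting aggregation are illustrative assumptions motivated by the standard KL duality argument.

```python
import torch
import torch.nn.functional as F

def dpo_per_sample_losses(policy_chosen_logps, policy_rejected_logps,
                          ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard per-pair DPO loss:
    -log sigmoid(beta * (policy margin - reference margin))."""
    pi_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_margin - ref_margin))

def kl_robust_aggregate(losses, tau=1.0):
    """Hypothetical robust aggregation: exponential tilting up-weights
    high-loss preference pairs, interpolating between the mean loss
    (tau -> infinity) and the max loss (tau -> 0)."""
    weights = torch.softmax(losses.detach() / tau, dim=0)  # worst-case weights
    return (weights * losses).sum()

# Usage with dummy log-probabilities for a batch of 4 preference pairs:
lp = lambda: -torch.rand(4)  # placeholder log-probs, not real model outputs
losses = dpo_per_sample_losses(lp(), lp(), lp(), lp())
robust_loss = kl_robust_aggregate(losses, tau=0.5)  # backprop through this
```

The appeal of this formulation is that robustness enters only through how the batch losses are pooled, so it drops into an existing DPO training loop without changing the per-sample computation.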
Zaiyan Xu*, Kishan Panaganti*, Dileep Kalathil (* equal contribution)
The 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023
We study the problem of learning a robust control policy that handles parameter mismatches between training and testing environments, formulated as a distributionally robust reinforcement learning (DR-RL) problem in a tabular episodic setting. We propose Robust Phased Value Learning (RPVL), an algorithm capable of solving DR-RL problems defined by uncertainty sets based on total variation, chi-square, Kullback-Leibler, and Wasserstein divergences. We establish that our algorithm achieves improved sample complexity of $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}| H^{5})$ and present the first known complexity results for the Wasserstein uncertainty set, validated through simulation experiments.
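The workhorse of this setting is the robust Bellman backup, which replaces the expected next-step value with its worst case over the uncertainty set. Below is a minimal tabular sketch assuming a Kullback-Leibler uncertainty set, evaluated through its scalar dual with a simple grid search; `kl_robust_expectation` and `robust_backup` are illustrative names, and RPVL itself proceeds in phases over an episodic MDP and also covers the total-variation, chi-square, and Wasserstein sets.

```python
import numpy as np

def kl_robust_expectation(p, v, rho, lam_grid=np.logspace(-2, 2, 100)):
    """Worst-case E_q[v] over {q : KL(q || p) <= rho}, via the dual
    sup_{lam > 0} -lam * log E_p[exp(-v / lam)] - lam * rho."""
    support = p > 0
    if not support.any():
        return float(v.min())      # no data for this pair: be maximally pessimistic
    p, v = p[support], v[support]  # KL(q || p) forces q to share p's support
    shifted = v - v.min()          # shift for a numerically stable log-sum-exp
    duals = [v.min() - lam * np.log(p @ np.exp(-shifted / lam)) - lam * rho
             for lam in lam_grid]
    return max(duals)

def robust_backup(r, P_hat, V_next, rho, gamma=1.0):
    """One robust backup: Q(s, a) = r(s, a) + gamma * worst-case E[V_next]."""
    S, A = r.shape
    Q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            Q[s, a] = r[s, a] + gamma * kl_robust_expectation(P_hat[s, a], V_next, rho)
    return Q
```

In the episodic setting this backup is applied backwards over the horizon, h = H, ..., 1, with V_next the value estimate at step h + 1.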
Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh
The Thirty-Sixth Annual Conference on Neural Information Processing Systems (NeurIPS) 2022
We propose Robust Fitted Q-Iteration (RFQI), an algorithm designed for robust reinforcement learning (RL) using only offline data, addressing key challenges such as model uncertainty and the computational complexity introduced by the robust Bellman operator. We theoretically show that RFQI achieves near-optimal robust performance under standard conditions. Empirical experiments demonstrate its effectiveness on benchmark tasks compared to existing approaches.
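As a rough illustration of the offline, model-uncertain setting, the sketch below runs fitted Q-iteration against an empirical model built from logged transitions, using the same worst-case backup as in the previous sketch. This is a tabular caricature under assumed names (`empirical_model`, `robust_fqi`), reusing `kl_robust_expectation` and `robust_backup` from above; it is not the RFQI algorithm itself, which works with general function approximation directly on the offline data.

```python
import numpy as np

def empirical_model(transitions, S, A):
    """Estimate mean rewards and transition frequencies from (s, a, r, s') tuples."""
    counts = np.zeros((S, A, S))
    rewards = np.zeros((S, A))
    visits = np.zeros((S, A))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        rewards[s, a] += r
        visits[s, a] += 1
    visits = np.maximum(visits, 1)  # guard unseen (s, a) pairs against division by zero
    return rewards / visits, counts / visits[:, :, None]

def robust_fqi(transitions, S, A, rho=0.1, gamma=0.9, iters=50):
    """Iterate the robust backup toward a fixed point on the empirical model."""
    r_hat, P_hat = empirical_model(transitions, S, A)
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                               # greedy value of current iterate
        Q = robust_backup(r_hat, P_hat, V, rho, gamma)  # worst-case backup from above
    return Q
```

The returned Q induces a policy that hedges against every transition model inside the uncertainty ball, which is the sense in which the learned behavior is robust to mismatch between the data-collecting and deployment environments.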