Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh
The 7th Annual Learning for Dynamics & Control Conference (L4DC) 2025
We bridge offline reinforcement learning (RL) and distributionally robust learning (DRL) by proposing two offline RL algorithms that leverage a minimax DRL framework to address distributional shift between the offline data and learned policy. Under the single policy concentrability assumption, we characterize the sample complexities of our algorithms in both tabular and linear function approximation settings. Empirical simulations demonstrate the superior performance of our DRL-based offline RL methods.
Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran
arXiv Preprint. Under review. 2025
We tackle the problem of aligning large language models (LLMs) with human preferences under preference distribution shifts using distributionally robust optimization. We introduce two novel distributionally robust direct preference optimization (DPO) algorithms—Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO)—and analyze their sample complexity. Additionally, we develop scalable gradient-based implementations for WDPO and KLDPO, demonstrating their superior alignment performance empirically in scenarios with preference distribution shifts.
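As a rough illustration of the KL-robust idea (not the paper's exact KLDPO algorithm — the function names and the scalar-dual treatment here are my own simplifications), one can combine per-example DPO losses with the standard dual of a worst-case expectation over a KL ball around the empirical preference distribution:

```python
import numpy as np

def dpo_losses(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin
    compares policy vs. reference log-probabilities of the chosen (w) and
    rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.log1p(np.exp(-margin))  # numerically stable -log sigmoid(margin)

def kl_robust_objective(losses, tau, rho):
    """Dual form of sup over distributions within a KL ball of radius rho:
    tau * log E[exp(loss / tau)] + tau * rho, for a fixed temperature tau > 0.
    The exponential tilt upweights the hardest preference pairs."""
    shifted = losses / tau
    m = shifted.max()  # log-sum-exp stabilization
    return tau * (m + np.log(np.mean(np.exp(shifted - m)))) + tau * rho
```

By Jensen's inequality the robust objective upper-bounds the average loss, which is the point: minimizing it hedges against shifts in which preference pairs dominate at deployment. A full method would also optimize or anneal `tau`.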
Kishan Panaganti*, Zaiyan Xu*, Dileep Kalathil, Mohammad Ghavamzadeh (* equal contribution)
The 62nd IEEE Conference on Decision and Control (CDC) 2023
We introduce the robust imitation learning (IL) problem, where the goal is to learn a policy from expert demonstrations that remains effective under uncertainties in model parameters without requiring additional online environment interactions. To address this, we propose DR-BC, an algorithm that integrates distributionally robust optimization (DRO) into behavioral cloning (BC). We theoretically and empirically demonstrate that DR-BC achieves robust performance against model perturbations, validating our approach through experiments on MuJoCo continuous control tasks.
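To sketch how DRO can enter a behavioral-cloning objective (this is a generic CVaR-style surrogate, not the DR-BC algorithm from the paper; the function names are illustrative), one can replace the average negative log-likelihood over expert demonstrations with the mean of the worst-performing fraction, which is the worst case over a simple uncertainty ball around the demo distribution:

```python
import numpy as np

def bc_losses(pred_logp):
    """Per-demonstration behavioral-cloning loss: negative log-likelihood
    of the expert action under the learned policy."""
    return -np.asarray(pred_logp)

def cvar_bc_objective(losses, alpha):
    """CVaR_alpha of per-demo losses: the mean of the worst alpha-fraction.
    This equals the worst-case expected loss over all reweightings of the
    data with density ratio bounded by 1/alpha."""
    losses = np.asarray(losses)
    k = max(1, int(np.ceil(alpha * losses.size)))
    worst = np.sort(losses)[-k:]
    return worst.mean()
```

Training against this objective forces the cloned policy to fit the demonstrations it currently imitates worst, rather than letting a few easy trajectories dominate the average.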
Zaiyan Xu*, Kishan Panaganti*, Dileep Kalathil (* equal contribution)
The 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023
We study the problem of learning a robust control policy that handles parameter mismatches between training and testing environments, formulated as a distributionally robust reinforcement learning (DR-RL) problem in a tabular episodic setting. We propose Robust Phased Value Learning (RPVL), an algorithm capable of solving DR-RL problems defined by uncertainty sets based on total variation, chi-square, Kullback-Leibler, and Wasserstein divergences. We establish that our algorithm achieves improved sample complexity of $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}| H^{5})$ and present the first known complexity results for the Wasserstein uncertainty set, validated through simulation experiments.
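The core subroutine in this line of work is evaluating a worst-case expected value over an uncertainty set of transition models. For the total-variation case this has a simple exact solution (a minimal sketch, not the RPVL algorithm itself; the function name is mine): move up to `rho` probability mass from the highest-value next states onto the lowest-value one.

```python
import numpy as np

def tv_worst_case_value(p, v, rho):
    """Worst-case E_q[v] over the TV ball {q : TV(q, p) <= rho} around the
    nominal transition distribution p. The greedy optimum shifts up to rho
    mass from the highest-value states to the single lowest-value state."""
    order = np.argsort(v)              # states sorted by value, ascending
    q = np.asarray(p, dtype=float).copy()
    lo = order[0]                      # all removed mass lands here
    budget = rho
    for i in order[::-1]:              # take mass from the most valuable states
        if i == lo or budget <= 0:
            continue
        take = min(q[i], budget)
        q[i] -= take
        q[lo] += take
        budget -= take
    return float(q @ v)
```

A robust (phased) value update then plugs this in place of the nominal expectation: for each state-action pair, the backup target becomes `r + tv_worst_case_value(P_hat[s, a], V, rho)`. The KL, chi-square, and Wasserstein sets each admit their own dual in place of this greedy solution.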
Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh
The Thirty-Sixth Annual Conference on Neural Information Processing Systems (NeurIPS) 2022
We propose Robust Fitted Q-Iteration (RFQI), an algorithm designed for robust reinforcement learning (RL) using only offline data, addressing key challenges such as model uncertainty and the computational complexity introduced by the robust Bellman operator. We theoretically show that RFQI achieves near-optimal robust performance under standard conditions. Empirical experiments demonstrate its effectiveness on benchmark tasks compared to existing approaches.
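A robust Bellman target can be made tractable through duality. As an illustration only (using a KL ball for concreteness — the choice of divergence and the grid search over the dual temperature are assumptions of this sketch, not the RFQI construction), the worst-case next-state value reduces to a one-dimensional optimization:

```python
import numpy as np

def kl_worst_case(p, v, rho, taus=np.logspace(-2, 2, 200)):
    """Dual of inf E_q[v] over {q : KL(q || p) <= rho}:
    sup_{tau > 0}  -tau * log E_p[exp(-v / tau)] - tau * rho,
    approximated here by a grid search over the temperature tau."""
    best = -np.inf
    for tau in taus:
        s = -np.asarray(v) / tau
        m = s.max()  # log-sum-exp stabilization
        val = -tau * (m + np.log(np.asarray(p) @ np.exp(s - m))) - tau * rho
        best = max(best, val)
    return best
```

A fitted-Q-style method would then regress the function approximator toward robust targets of the form `r + gamma * kl_worst_case(P_hat[s, a], V, rho)` computed from the offline batch; the scalar dual is what keeps the inner minimization over models from being solved explicitly.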