Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh
The 7th Annual Learning for Dynamics & Control Conference (L4DC) 2025
We bridge offline reinforcement learning (RL) and distributionally robust learning (DRL) by proposing two offline RL algorithms that leverage a minimax DRL framework to address distributional shift between the offline data and learned policy. Under the single policy concentrability assumption, we characterize the sample complexities of our algorithms in both tabular and linear function approximation settings. Empirical simulations demonstrate the superior performance of our DRL-based offline RL methods.
Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran
arXiv Preprint. Under review. 2025
We tackle the problem of aligning large language models (LLMs) with human preferences under preference distribution shifts using distributionally robust optimization. We introduce two novel distributionally robust direct preference optimization (DPO) algorithms—Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO)—and analyze their sample complexity. Additionally, we develop scalable gradient-based implementations for WDPO and KLDPO, demonstrating their superior alignment performance empirically in scenarios with preference distribution shifts.
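As a rough illustration of the KL-robust idea (not the paper's exact KLDPO algorithm — the function names and the scalar-dual treatment here are my own simplifications), one can combine per-example DPO losses with the standard dual of a worst-case expectation over a KL ball around the empirical preference distribution:

```python
import numpy as np

def dpo_losses(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin
    compares policy vs. reference log-probabilities of the chosen (w) and
    rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.log1p(np.exp(-margin))  # numerically stable -log sigmoid(margin)

def kl_robust_objective(losses, tau, rho):
    """Dual form of sup over distributions within a KL ball of radius rho:
    tau * log E[exp(loss / tau)] + tau * rho, for a fixed temperature tau > 0.
    The exponential tilt upweights the hardest preference pairs."""
    shifted = losses / tau
    m = shifted.max()  # log-sum-exp stabilization
    return tau * (m + np.log(np.mean(np.exp(shifted - m)))) + tau * rho
```

By Jensen's inequality the robust objective upper-bounds the average loss, which is the point: minimizing it hedges against shifts in which preference pairs dominate at deployment. A full method would also optimize or anneal `tau`.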
Kishan Panaganti*, Zaiyan Xu*, Dileep Kalathil, Mohammad Ghavamzadeh (* equal contribution)
The 62nd IEEE Conference on Decision and Control (CDC) 2023
We introduce the robust imitation learning (IL) problem, where the goal is to learn a policy from expert demonstrations that remains effective under uncertainties in model parameters without requiring additional online environment interactions. To address this, we propose DR-BC, an algorithm that integrates distributionally robust optimization (DRO) into behavioral cloning (BC). We theoretically and empirically demonstrate that DR-BC achieves robust performance against model perturbations, validating our approach through experiments on MuJoCo continuous control tasks.
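To sketch how DRO can enter a behavioral-cloning objective (this is a generic CVaR-style surrogate, not the DR-BC algorithm from the paper; the function names are illustrative), one can replace the average negative log-likelihood over expert demonstrations with the mean of the worst-performing fraction, which is the worst case over a simple uncertainty ball around the demo distribution:

```python
import numpy as np

def bc_losses(pred_logp):
    """Per-demonstration behavioral-cloning loss: negative log-likelihood
    of the expert action under the learned policy."""
    return -np.asarray(pred_logp)

def cvar_bc_objective(losses, alpha):
    """CVaR_alpha of per-demo losses: the mean of the worst alpha-fraction.
    This equals the worst-case expected loss over all reweightings of the
    data with density ratio bounded by 1/alpha."""
    losses = np.asarray(losses)
    k = max(1, int(np.ceil(alpha * losses.size)))
    worst = np.sort(losses)[-k:]
    return worst.mean()
```

Training against this objective forces the cloned policy to fit the demonstrations it currently imitates worst, rather than letting a few easy trajectories dominate the average.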
Zaiyan Xu*, Kishan Panaganti*, Dileep Kalathil (* equal contribution)
The 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023
We study the problem of learning a robust control policy that handles parameter mismatches between training and testing environments, formulated as a distributionally robust reinforcement learning (DR-RL) problem in a tabular episodic setting. We propose Robust Phased Value Learning (RPVL), an algorithm capable of solving DR-RL problems defined by uncertainty sets based on total variation, chi-square, Kullback-Leibler, and Wasserstein divergences. We establish that our algorithm achieves improved sample complexity of $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}| H^{5})$ and present the first known complexity results for the Wasserstein uncertainty set, validated through simulation experiments.
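The core subroutine in this line of work is evaluating a worst-case expected value over an uncertainty set of transition models. For the total-variation case this has a simple exact solution (a minimal sketch, not the RPVL algorithm itself; the function name is mine): move up to `rho` probability mass from the highest-value next states onto the lowest-value one.

```python
import numpy as np

def tv_worst_case_value(p, v, rho):
    """Worst-case E_q[v] over the TV ball {q : TV(q, p) <= rho} around the
    nominal transition distribution p. The greedy optimum shifts up to rho
    mass from the highest-value states to the single lowest-value state."""
    order = np.argsort(v)              # states sorted by value, ascending
    q = np.asarray(p, dtype=float).copy()
    lo = order[0]                      # all removed mass lands here
    budget = rho
    for i in order[::-1]:              # take mass from the most valuable states
        if i == lo or budget <= 0:
            continue
        take = min(q[i], budget)
        q[i] -= take
        q[lo] += take
        budget -= take
    return float(q @ v)
```

A robust (phased) value update then plugs this in place of the nominal expectation: for each state-action pair, the backup target becomes `r + tv_worst_case_value(P_hat[s, a], V, rho)`. The KL, chi-square, and Wasserstein sets each admit their own dual in place of this greedy solution.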
Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh
The Thirty-Sixth Annual Conference on Neural Information Processing Systems (NeurIPS) 2022
We propose Robust Fitted Q-Iteration (RFQI), an algorithm designed for robust reinforcement learning (RL) using only offline data, addressing key challenges such as model uncertainty and the computational complexity introduced by the robust Bellman operator. We theoretically show that RFQI achieves near-optimal robust performance under standard conditions. Empirical experiments demonstrate its effectiveness on benchmark tasks compared to existing approaches.
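A robust Bellman target can be made tractable through duality. As an illustration only (using a KL ball for concreteness — the choice of divergence and the grid search over the dual temperature are assumptions of this sketch, not the RFQI construction), the worst-case next-state value reduces to a one-dimensional optimization:

```python
import numpy as np

def kl_worst_case(p, v, rho, taus=np.logspace(-2, 2, 200)):
    """Dual of inf E_q[v] over {q : KL(q || p) <= rho}:
    sup_{tau > 0}  -tau * log E_p[exp(-v / tau)] - tau * rho,
    approximated here by a grid search over the temperature tau."""
    best = -np.inf
    for tau in taus:
        s = -np.asarray(v) / tau
        m = s.max()  # log-sum-exp stabilization
        val = -tau * (m + np.log(np.asarray(p) @ np.exp(s - m))) - tau * rho
        best = max(best, val)
    return best
```

A fitted-Q-style method would then regress the function approximator toward robust targets of the form `r + gamma * kl_worst_case(P_hat[s, a], V, rho)` computed from the offline batch; the scalar dual is what keeps the inner minimization over models from being solved explicitly.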