COLT 2021 Tutorial: Statistical Foundations of Reinforcement Learning


The past decade has seen tremendous interest in sequential decision making under uncertainty, a broad class of problems involving an agent interacting with an unknown environment to accomplish some goal. Reinforcement learning approaches to these problems have led to recent AI breakthroughs in game playing, robotics, and elsewhere. Inspired by these empirical demonstrations, many researchers from the learning theory community have turned their attention to reinforcement learning, in an attempt to better understand these problems and develop new algorithmic principles. Their efforts have led to a more modern statistical foundation for reinforcement learning that places an emphasis on non-asymptotic characterizations via global convergence, sample complexity, and regret analyses.

This tutorial will provide an overview of this emerging theory with a focus on the most challenging online exploration setting. The tutorial is organized into three sections:
  1. Part 1 will cover the necessary background and definitions. We focus here on the most basic setting of tabular Markov Decision Processes and consider problems of increasing difficulty: from planning, to optimization with an exploratory distribution, to online exploration. We will present two algorithms, Natural Policy Gradient (NPG) for the optimization problem and UCB-Value Iteration (UCB-VI) for exploration, along with their guarantees; a brief sketch of both update rules appears after this list.
  2. Part 2 will be a recitation/practice session. We have prepared a problem set that covers the analyses of NPG and UCB-VI in detail, highlighting key lemmas that are broadly useful in reinforcement learning, as well as technical connections to related fields. This session will take place in Gather.town, and a number of experts in the field will be available to assist with the problem set or to answer other questions.
  3. Part 3 will focus on online exploration beyond the tabular setting, where function approximation is required for generalization. Here we will provide a tour of the zoo of RL models and complexity measures that enable tractable learning, as well as some statistical barriers and algorithms. We will close with some open problems and future directions.
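
For those who like to see the updates concretely, below is a minimal, illustrative sketch of the two algorithms named in Part 1, written for a toy finite-horizon tabular MDP. It is not the tutorial's reference implementation: the problem sizes, the step size eta, the bonus constant c, and the "collected" counts and model estimates in the UCB-VI portion are all placeholder assumptions; the precise algorithm statements (with the correct logarithmic bonus terms) are in the slides and exercises.

    # Illustrative sketch only: tabular NPG and the UCB-VI optimistic backup on a
    # toy finite-horizon MDP. Sizes, step size, and bonus constant are arbitrary.
    import numpy as np

    S, A, H = 3, 2, 4                              # states, actions, horizon (toy)
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] = next-state distribution
    R = rng.uniform(size=(S, A))                   # rewards in [0, 1]

    def q_values(pi):
        """Exact finite-horizon Q^pi via backward dynamic programming."""
        Q = np.zeros((H, S, A))
        V_next = np.zeros(S)
        for h in reversed(range(H)):
            Q[h] = R + P @ V_next                          # Bellman backup under pi
            V_next = np.einsum('sa,sa->s', pi[h], Q[h])    # V^pi(s) = sum_a pi(a|s) Q(s, a)
        return Q

    # Natural Policy Gradient, softmax/tabular form: multiplicative-weights update
    # pi_{t+1}(a|s) proportional to pi_t(a|s) * exp(eta * Q^{pi_t}_h(s, a))
    # (equivalent to the advantage form, since subtracting V^pi(s) cancels in the
    # normalization).
    eta, iters = 0.5, 50
    pi = np.full((H, S, A), 1.0 / A)               # start from the uniform policy
    for _ in range(iters):
        Q = q_values(pi)
        logits = np.log(pi) + eta * Q
        pi = np.exp(logits - logits.max(axis=-1, keepdims=True))
        pi /= pi.sum(axis=-1, keepdims=True)

    # UCB-VI: optimistic value iteration on an estimated model with a count-based
    # bonus, roughly b(s, a) ~ c / sqrt(N(s, a)). Here the counts N and the
    # "estimated" model (P_hat, R_hat) are placeholders for the statistics an
    # exploring agent would have gathered so far.
    N = rng.integers(1, 20, size=(S, A)).astype(float)
    P_hat, R_hat, c = P, R, 1.0
    V_next = np.zeros(S)
    for h in reversed(range(H)):
        bonus = c / np.sqrt(N)
        Q_opt = np.minimum(R_hat + P_hat @ V_next + bonus, H - h)  # clip at max return-to-go
        V_next = Q_opt.max(axis=-1)                                # optimistic state value
    print("optimistic value at the initial state:", V_next[0])

In the full UCB-VI algorithm, the greedy policy with respect to Q_opt is executed for one episode, the counts and model estimates are updated from the observed transitions, and the optimistic backup is recomputed; the exercises in Part 2 walk through why this yields a regret guarantee.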

The tutorial will be accessible to all COLT attendees. No background knowledge in RL is required, but we do expect tutorial attendees to be comfortable with the standard mathematical tools used in learning theory research, such as concentration inequalities and some linear algebra.

Staff


Main Presenters: Akshay Krishnamurthy (MSR NYC), Wen Sun (Cornell)

Discussion Leads: Chris Dann, Fei Feng, Christina Yu, Andrea Zanette

Tutorial time: August 5th, 7:15-10:15am (Mountain Time)

How to attend

The tutorial will be entirely virtual, via Zoom and Gather.town. Attendees must register for COLT 2021, which is free for students. You can register for the conference here. Conference attendees can join the tutorial from the COLT Gather.town space.

Schedule

Time | Session | Materials
7:15am-8:00am | Fundamentals: Markov Decision Processes, Natural policy gradient, and Upper confidence bound exploration | Slides, Video
8:15am-9:00am | Recitation / Practice: Analysis of NPG and UCB-VI (in Gather.town) | Exercises, Solutions
9:15am-10:15am | Generalization in RL: Linear methods, Bellman rank, Bilinear classes, Representation learning | Slides, Video

Relevant Resources

RL Theory Seminar series (virtual): Keep up with advances in RL theory by joining the mailing list and tuning in.

Primary references

  • Alekh Agarwal, Nan Jiang, Sham M. Kakade, Wen Sun. Reinforcement Learning Theory and Algorithms. Monograph available here.
  • Alekh Agarwal, Sham M. Kakade, Gaurav Mahajan, Jason D. Lee. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. COLT 2020.
  • Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans. On the Global Convergence Rates of Softmax Policy Gradient Methods. ICML 2020.
  • Mohammad Gheshlaghi Azar, Ian Osband, Remi Munos. Minimax Regret Bounds for Reinforcement Learning. ICML 2017.
  • Yuanhao Wang, Ruosong Wang, Sham M. Kakade. An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap. arXiv 2021.
  • Chi Jin, Zhuoran Yang, Zhaoran Wang, Michael I. Jordan. Provably Efficient Reinforcement Learning with Linear Function Approximation. COLT 2020.
  • Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, Robert E. Schapire. Contextual Decision Processes with Low Bellman Rank are PAC-Learnable. ICML 2017.
  • Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, Ruosong Wang. Bilinear Classes: A Structural Framework for Provable Generalization in RL. ICML 2021.
  • Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, John Langford. Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning. ICML 2020.
  • Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, Wen Sun. FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs. NeurIPS 2020.

Other references

  • Martin L. Puterman. Markov Decision Processes: Discrete and Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • Michael Kearns, Yishay Mansour, Andrew Ng. A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning 2002.
  • MDPs with linear structures:
    • Zheng Wen, Benjamin Van Roy. Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization. Mathematics of Operations Research, 2017.
    • Simon S. Du, Yuping Luo, Ruosong Wang, Hanrui Zhang. Provably Efficient Q-learning with Function Approximation via Distribution Shift Error Checking Oracle. NeurIPS 2019.
    • Tor Lattimore, Csaba Szepesvari, Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. ICML 2020.
    • Ruosong Wang, Dean P. Foster, Sham M. Kakade. What are the Statistical Limits of Offline RL with Linear Function Approximation? ICLR 2021.
    • Gellert Weisz, Philip Amortila, Csaba Szepesvári. Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. ALT 2021.
    • Gellert Weisz, Philip Amortila, Barnabas Janzer, Yasin Abbasi-Yadkori, Nan Jiang, Csaba Szepesvari. On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function. arXiv 2021.
    • Lin Yang, Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. ICML 2020.
    • Aditya Modi, Nan Jiang, Ambuj Tewari, Satinder Singh. Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles. AISTATS 2020.
    • Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, Lin Yang. Model-based reinforcement learning with value-targeted regression. ICML 2020.
    • Dongruo Zhou, Quanquan Gu, Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. COLT 2021.
    • Gergely Neu, Ciara Pike-Burke. A unifying view of optimism in episodic reinforcement learning. NeurIPS 2020.
    • Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill. Learning Near Optimal Policies with Low Inherent Bellman Error. ICML 2020.
  • Function approximation with Eluder dimension:
    • Daniel Russo, Benjamin Van Roy. Eluder Dimension and the Sample Complexity of Optimistic Exploration. NeurIPS 2013.
    • Gene Li, Pritish Kamath, Dylan J. Foster, Nathan Srebro. Eluder Dimension and Generalized Rank. arXiv 2021.
  • Nonlinear function approximation:
    • Andras Antos, Csaba Szepesvari, Remi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning 2008.
    • Kefan Dong, Jiaqi Yang, Tengyu Ma. Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature. arXiv 2021.
    • Chi Jin, Qinghua Liu, Sobhan Miryoosefi. Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms. arXiv 2021.
    • Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. COLT 2019.
    • Simon S. Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudík, John Langford. Provably efficient RL with Rich Observations via Latent State Decoding. ICML 2019.
    • Fei Feng, Ruosong Wang, Wotao Yin, Simon S. Du, Lin F. Yang. Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning. NeurIPS 2020.
    • Dylan J. Foster, Alexander Rakhlin, David Simchi-Levi, Yunzong Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. COLT 2021
    • Tianhao Wu, Yunchang Yang, Simon S. Du, Liwei Wang. On Reinforcement Learning with Adversarial Corruption and Its Application to Block MDP. ICML 2021.
    • Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal. Model-free Representation Learning and Exploration in Low-rank MDPs. arXiv 2021.
    • Sarah Dean, Benjamin Recht. Certainty Equivalent Perception-Based Control. L4DC 2021.
    • Zakaria Mhammedi, Dylan J. Foster, Max Simchowitz, Dipendra Misra, Wen Sun, Akshay Krishnamurthy, Alexander Rakhlin, John Langford. Learning the Linear Quadratic Regulator from Nonlinear Observations. NeurIPS 2020.
    • Dipendra Misra, Qinghua Liu, Chi Jin, John Langford. Provable Rich Observation Reinforcement Learning with Combinatorial Latent States. ICLR 2021.
  • Policy gradient methods:
    • Jalaj Bhandari, Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv 2019.
    • Matthieu Geist, Bruno Scherrer, Olivier Pietquin. A theory of regularized Markov decision processes. ICML 2019.
    • Alekh Agarwal, Mikael Henaff, Sham Kakade, Wen Sun. PC-PG: Policy cover directed exploration for provable policy gradient learning. NeurIPS 2020.
    • Andrea Zanette, Ching-An Cheng, Alekh Agarwal. Cautiously optimistic policy optimization and exploration with linear function approximation. COLT 2021.