RL-REP
Representation-based Reinforcement Learning

When faced with the intricate trade-off among modeling, exploration, and planning, one might ask the following question:

What is a practical algorithm with rigorous guarantees that achieves such a balance?

To tackle this problem, we aim to exploit model-based information to provide a reasonable linear function space for model-free planning and exploration.

Table of Contents

I. LV-Rep: Latent Variable Representation for Reinforcement Learning

II. CTRL: Making Linear MDPs Practical via Contrastive Representation Learning

III. μLV-Rep: Efficient Reinforcement Learning from Partial Observability

Abstract

Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear, both theoretically and empirically, how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of latent variable models for state-action value functions, which admits both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.

Introduction

For both model-free and model-based algorithms, there has been insufficient work that addresses statistical and computational tractability and efficiency, in terms of learning, planning, and exploration, from a unified and coherent perspective on algorithm design. This raises the question:
Is there a way to design a provable and practical algorithm that can effectively address both the statistical and computational challenges of Reinforcement Learning (RL)?
Here, by “provable” we mean that the statistical complexity of the algorithm can be rigorously characterized without explicit dependence on the number of states, but instead on the fundamental complexity of the parameterized representation space; by “practical” we mean that the learning, planning, and exploration components of the algorithm are computationally tractable and can be implemented in real-world scenarios.

This work provides an affirmative answer to the question above by establishing a representation view of latent variable dynamics models through a connection to linear MDPs. This connection immediately provides a computationally tractable approach to planning and exploration in the linear space constructed by a flexible deep latent variable model. The latent variable view also yields a variational learning method that remedies the intractability of maximum likelihood estimation (MLE) for general linear MDPs (a brief schematic of this representation view is given after the list below). Our main contributions consist of the following:

  • We establish the representation view of latent variable dynamics models in RL, which naturally induces the Latent Variable Representation (LV-Rep) for linearly representing state-action value functions and paves the way for a practical variational method for representation learning;
  • We provide computationally efficient algorithms that implement the principle of optimism/pessimism in the face of uncertainty with the learned LV-Rep for online and offline RL;
  • We theoretically analyze the sample complexity of LV-Rep in both the online and offline settings, revealing the essential complexity beyond the cardinality of the latent variable;
  • We empirically demonstrate that LV-Rep outperforms state-of-the-art model-based and model-free RL algorithms on several RL benchmarks.
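To give a concrete sense of the representation view above, here is a brief schematic (our notation, stated informally rather than taken verbatim from the paper): a latent variable dynamics model factors the transition through a latent z, which is exactly the low-rank factorization a linear MDP requires, and in turn makes value functions linear in the induced features; optimism can then be implemented with a standard elliptical confidence bonus on those features. The symbols φ, μ, w^π, β, and Λ_t below are illustrative.

```latex
% Latent variable dynamics model: the transition factors through a latent z,
% which is precisely the low-rank structure of a linear MDP.
\[
P(s' \mid s, a) \;=\; \int p(z \mid s, a)\, p(s' \mid z)\, \mathrm{d}z
\;=\; \big\langle \phi(s,a),\, \mu(s') \big\rangle,
\qquad \phi(s,a) = p(\cdot \mid s,a), \quad \mu(s') = p(s' \mid \cdot).
\]

% Because the Bellman backup stays in the span of \phi (assuming the reward is
% also linear in \phi, as in linear MDPs), state-action values are linear:
\[
Q^\pi(s,a) \;=\; r(s,a) + \gamma \int P(s' \mid s,a)\, V^\pi(s')\, \mathrm{d}s'
\;=\; \big\langle \phi(s,a),\, w^\pi \big\rangle .
\]

% Optimism in the face of uncertainty can then reuse the usual elliptical
% bonus on the learned features, with \Lambda_t a regularized covariance:
\[
b_t(s,a) \;=\; \beta \sqrt{\phi(s,a)^\top \Lambda_t^{-1} \phi(s,a)},
\qquad
\Lambda_t \;=\; \lambda I + \sum_{i < t} \phi(s_i,a_i)\, \phi(s_i,a_i)^\top .
\]
```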

Experiments


We extensively test our algorithm on the MuJoCo and DeepMind Control Suite benchmarks.

Dense-Reward MuJoCo Benchmarks

[Table: Results on dense-reward MuJoCo benchmarks]
This table presents all experimental results, averaged over 4 random seeds. In practice, we found that LV-Rep-C provides comparable or better performance, so we report its results for LV-Rep in the table. For comparison, we report the best model-based RL performance.

The results clearly show that LV-Rep provides significantly better or comparable performance relative to all model-based algorithms. In particular, in the most challenging domains, such as Walker and Ant, most model-based methods fail at the task entirely, while LV-Rep achieves state-of-the-art performance. Furthermore, LV-Rep dominates two representative representation-learning-based RL methods, Deep Successor Feature (DeepSF) and SPEDE, in all domains. LV-Rep also achieves better performance than the strongest model-free algorithm, SAC, in the most challenging domains, with the exception of Humanoid.

[Figure: Learning curves of LV-Rep-C and LV-Rep-D compared to SAC]
Finally, this figure provides learning curves of LV-Rep-C and LV-Rep-D in comparison to SAC, which clearly show that, compared to the state-of-the-art model-free baseline SAC, LV-Rep enjoys substantially better sample efficiency on these tasks.

Sparse-Reward DeepMind Control Suite

[Table: Results on sparse-reward DeepMind Control Suite (200K environment steps)]
We compare all algorithms after running 200K environment steps, across 4 random seeds; results are presented in the table above. We report the result of LV-Rep-C for LV-Rep, as it gives better empirical performance. We can clearly observe that LV-Rep dominates across all domains. In the relatively dense-reward problems, cheetah-run and walker-run, LV-Rep outperforms all baselines by a large margin. Remarkably, in the sparser-reward problems, hopper-hop and humanoid-run, LV-Rep provides reasonable results while the other methods do not even start learning.
[Figure: Learning curves on the DeepMind Control Suite]
We also plot the learning curves of LV-Rep against all competitors in the figure above, which shows that LV-Rep outperforms the other baselines in terms of both sample efficiency and final performance.

II. CTRL: Making Linear MDPs Practical via Contrastive Representation Learning

Abstract

It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations. This motivates much of the recent theoretical study on linear MDPs. However, most approaches require a given representation under unrealistic assumptions about the normalization of the decomposition or introduce unresolved computational challenges in practice. Instead, we consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning via contrastive estimation. The framework also admits confidence-adjusted index algorithms, enabling an efficient and principled approach to incorporating optimism or pessimism in the face of uncertainty. To the best of our knowledge, this provides the first practical representation learning method for linear MDPs that achieves both strong theoretical guarantees and empirical performance. Theoretically, we prove that the proposed algorithm is sample efficient in both the online and offline settings. Empirically, we demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.

Introduction

We address the lingering computational efficiency issues in representation learning for linear MDPs. We provide solutions to the aforementioned difficulties, bridge the gap between theory and practice, and eventually establish a sound yet practical foundation for learning in linear MDPs. More specifically:

  • We clarify the importance of the normalization condition and propose a variant of linear MDPs wherein it is easy to enforce normalization;
  • We develop the ConTrastive Representation Learning algorithm, CTRL, which can implicitly satisfy the density requirement (a minimal sketch of a generic contrastive objective of this kind is given after this list);
  • We incorporate the representation learning techniques into optimistic and pessimistic confidence-adjusted RL algorithms, CTRL-UCB and CTRL-LCB, for online and offline RL problems respectively, and establish their sample complexities;
  • We conduct a comprehensive comparison to existing state-of-the-art model-based and model-free RL algorithms on several benchmarks for the online and offline settings, demonstrating superior empirical performance of the proposed CTRL.
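As an illustration of the contrastive estimation step, the sketch below shows a generic InfoNCE-style noise-contrastive loss over observed (s, a, s') transitions, where in-batch next states serve as negatives. This is a minimal sketch under our own assumptions: the encoder names phi_net and mu_net are placeholders, and the paper's exact objective and negative-sampling scheme may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_representation_loss(phi_net, mu_net, s, a, s_next):
    """InfoNCE-style contrastive loss for learning features (phi, mu) such that
    the inner product <phi(s, a), mu(s')> scores observed transitions above
    negatives drawn from the rest of the batch.

    phi_net: module mapping (state, action) -> feature vector of dimension d
    mu_net:  module mapping next state      -> feature vector of dimension d
    s, a, s_next: batched tensors with leading batch dimension B
    """
    phi = phi_net(s, a)      # (B, d) state-action features
    mu = mu_net(s_next)      # (B, d) next-state features

    # Pairwise scores: logits[i, j] = <phi_i, mu_j>. Diagonal entries are the
    # positive (observed) transitions; off-diagonal entries act as negatives.
    logits = phi @ mu.T      # (B, B)

    # Cross-entropy against the diagonal implements the contrastive objective.
    targets = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, targets)
```

The learned phi can then serve as the feature map for the linearly parameterized Q function and its confidence bonus or penalty in the confidence-adjusted algorithms above.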

Experiments


We show that CTRL-UCB achieves state-of-the-art performance in MuJoCo and DeepMind Control Suite benchmarks, compared to both model-based (e.g., PETS, ME-TRPO) and model-free RL algorithms (e.g., SAC).

Dense-Reward MuJoCo Benchmarks

[Table: Results on dense-reward MuJoCo benchmarks]

In the table above, we can see that CTRL-UCBL achieves state-of-the-art (SoTA) results among all model-based RL algorithms and significantly improves over prior baselines. Despite the gains in final return, prior model-based RL methods can hardly learn behaviors such as walking or moving the ant in these two tasks; by contrast, CTRL-UCBL learns both behaviors successfully.

We also compare our algorithm with the SoTA model-free RL method SAC. Our method achieves comparable, and sometimes better, performance on most of the tasks. Lastly, compared with two representation learning baselines, DeepSF and SPEDE, CTRL-UCBL also shows dominant performance. However, the performance of CTRL-UCBL is not satisfactory in the Humanoid and Hopper environments, potentially due to their hard-exploration nature and the linearly parameterized Q function failing to capture the bonus needed for exploration.

Sparse-Reward DeepMind Control Suite

[Results on the sparse-reward DeepMind Control Suite]

To better understand how CTRL performs in a sparse-reward setting, we conduct experiments on the DeepMind Control Suite. We compare CTRL-UCBM with model-free RL methods such as SAC (using either a 2-layer or a 3-layer MLP as the critic network) and PPO (a 3-layer MLP as the value network). In addition, we also compare with the representation learning method DeepSF.

CTRL-UCBM outperforms SAC (2-layer) by a large margin even in relatively dense-reward settings such as cheetah-run and walker-run. This margin becomes even larger in sparse-reward settings such as walker-run-sparse and hopper-hop. Even compared with SAC using a 3-layer MLP, CTRL-UCBM achieves better performance, which again shows the effectiveness of our method.

Offline Reinforcement Learning

[Table: Average normalized scores on offline RL benchmarks]

We compare CTRL-LCB with state-of-the-art offline RL algorithms and include CPC as a representation learning baseline. Besides the state-action representation learned in CTRL-LCB, one major difference between CPC and CTRL-LCB is the additional reward penalty introduced by CTRL-LCB. The average normalized scores are presented in the table above. CTRL-LCB performs well on the medium-expert set of tasks, and the medium set of tasks also benefits from our method. However, CTRL-LCB is unable to improve performance on the medium-replay tasks, potentially because representations induced from low-quality data cannot recover a good policy.
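For reference, the reward penalty mentioned above typically takes a pessimistic (lower-confidence-bound) form computed from the learned representation; the following is a generic sketch with illustrative symbols β and Λ rather than the paper's exact formulation.

```latex
% Pessimistic reward used for offline planning: subtract an elliptical
% uncertainty penalty built from the learned features \phi and the
% (regularized) covariance \Lambda of the offline dataset \mathcal{D}.
\[
\tilde{r}(s,a) \;=\; r(s,a) \;-\; \beta \sqrt{\phi(s,a)^\top \Lambda^{-1} \phi(s,a)},
\qquad
\Lambda \;=\; \lambda I + \sum_{(s_i, a_i) \in \mathcal{D}} \phi(s_i,a_i)\, \phi(s_i,a_i)^\top .
\]
```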

III. μLV-Rep: Efficient Reinforcement Learning from Partial Observability

Abstract

In most real-world reinforcement learning applications, state information is only partially observable, which breaks the Markov decision process assumption and leads to inferior performance for algorithms that conflate observations with state. Partially Observable Markov Decision Processes (POMDPs), on the other hand, provide a general framework that allows partial observability to be accounted for in learning, exploration, and planning, but present significant computational and statistical challenges. To address these difficulties, we develop a representation-based perspective that leads to a coherent framework and a tractable algorithmic approach for practical reinforcement learning from partial observations. We provide a theoretical analysis justifying the statistical efficiency of the proposed algorithm, and empirically demonstrate that it can surpass state-of-the-art performance with partial observations across various benchmarks, advancing reliable reinforcement learning toward more practical applications.

Introduction

Empirical successes have motivated investigation into structured POMDPs that allow some of the core computational and statistical complexities to be overcome, providing an improved understanding of exploitable structure and yielding practical new algorithms with rigorous justification. However, most works rely on the existence of an ideal computational oracle for planning, which, unsurprisingly, is infeasible in most cases and hence difficult to apply in practice. Although there have been a few attempts to overcome the computational complexity of POMDPs, these algorithms are either only applicable to the tabular setting or rely on an integration oracle that quickly becomes intractable for large observation spaces. This gap immediately motivates the question:
Can efficient and practical RL algorithms be designed for partial observations by exploiting natural structures?
By “efficient” we mean the statistical complexity avoids an exponential dependence on history length, while by “practical” we mean that every component of learning, planning and exploration can be easily implemented and applied in practical settings.

In this paper, we provide an affirmative answer to this question. More specifically,

  • We reveal for the first time that an m-step decodable POMDP admits a sufficient representation, the Multi-step Latent Variable Representation (μLV-Rep), that supports an exact and tractable linear representation of the value functions, breaking fundamental computational barriers (a rough sketch is given after this list);
  • We design a computationally efficient planning algorithm that can implement the principles of optimism and pessimism in the face of uncertainty for online and offline POMDPs respectively, by leveraging the learned sufficient representation μLV-Rep;
  • We provide a theoretical analysis of the sample complexity of the proposed algorithm, justifying its efficiency in balancing exploitation versus exploration;
  • We conduct a comprehensive empirical comparison to existing RL algorithms for POMDPs on several benchmarks, demonstrating the superior empirical performance of μLV-Rep.
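As a rough sketch of the first bullet (our notation, stated informally): under m-step decodability the relevant latent state can be recovered from a finite window of recent observations and actions, so value functions admit a linear representation over features of that window rather than of the full history.

```latex
% Let z_h collect the most recent m observations and actions,
%   z_h = (o_{h-m+1}, a_{h-m+1}, \dots, a_{h-1}, o_h).
% A latent-variable factorization of the window dynamics yields features
% \phi such that, for any policy \pi,
\[
Q^\pi(z_h, a_h) \;=\; \big\langle \phi(z_h, a_h),\, w^\pi \big\rangle ,
\]
% so planning and exploration can reuse the linear machinery above while
% avoiding an exponential dependence on the full history length.
```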

Experiments


We evaluate the proposed method on Meta-world, an open-source simulated benchmark consisting of 50 distinct robotic manipulation tasks with visual observations. We also provide experimental results on partially observable control problems constructed from OpenAI Gym MuJoCo.

MuJoCo Benchmarks

[Results on partially observable continuous control problems]

Performance on various continuous control problems with partial observations. All results are averaged over 4 random seeds with a smoothing window of 10K steps. μLV-Rep achieves the best performance compared to the baselines. Here, Best-FO denotes the performance of LV-Rep using full observations as inputs, providing a reference for the best performance an algorithm could be expected to achieve in our tests.

The results clearly demonstrate that the proposed method consistently delivers either competitive or superior outcomes across all domains compared to both the model-based and model-free baselines. We note that in most domains, μLV-Rep nearly matches the performance of Best-FO, further confirming that the proposed method is able to extract useful representations for decision-making in partially observable environments.

Meta-world Benchmarks

[Figure: Success rates on 50 Meta-world tasks after 1M interactions]

Success rate on 50 Meta-world tasks after 1 million interactions. The blue bars show our results; the gray bars show the best performance achieved by DreamerV2 and MWM. Our method succeeds on most tasks (> 90% success on 33 tasks), and the results are better than or equivalent to (within a 10% difference) the best of DreamerV2 and MWM on 41 tasks.

It is noteworthy that while MWM achieves sample efficiency comparable to μLV-Rep on certain tasks, it incurs higher computational costs and longer running times due to the ViT network in its model. In our experimental configuration, μLV-Rep trains at 21.3 steps per second, compared to 8.1 steps per second for MWM, highlighting the computational efficiency of the proposed method.

Citations

@misc{ren2023latent,
  title={Latent Variable Representation for Reinforcement Learning},
  author={Tongzheng Ren and Chenjun Xiao and Tianjun Zhang and Na Li and Zhaoran Wang and Sujay Sanghavi and Dale Schuurmans and Bo Dai},
  year={2023},
  eprint={2212.08765},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

@misc{zhang2022making,
  title={Making Linear MDPs Practical via Contrastive Representation Learning},
  author={Tianjun Zhang and Tongzheng Ren and Mengjiao Yang and Joseph E. Gonzalez and Dale Schuurmans and Bo Dai},
  year={2022},
  eprint={2207.07150},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

@misc{zhang2024efficient,
  title={Efficient Reinforcement Learning from Partial Observability},
  author={Hongming Zhang and Tongzheng Ren and Chenjun Xiao and Dale Schuurmans and Bo Dai},
  year={2024},
  eprint={2311.12244},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}