When faced with the intricate trade-off among modeling, exploration, and planning, one might ask the following question:

What is a practical algorithm with rigorous guarantees that achieves such a balance?

To tackle this problem, we aim to exploit model-based information to provide a reasonable linear function space for model-free planning and exploration.

Table of Contents

I. LV-Rep: Latent Variable Representation for Reinforcement Learning
II. CTRL: Making Linear MDPs Practical via Contrastive Representation Learning
III. μLV-Rep: Provable Representation with Efficient Planning for Partially Observable Reinforcement Learning
IV. Diff-SR: Diffusion Spectral Representation for Reinforcement Learning
V. SPEDER: Spectral Decomposition Representation for Reinforcement Learning

I. LV-Rep: Latent Variable Representation for Reinforcement Learning

Abstract

Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of the latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.

Introduction

For both model-free and model-based algorithms, there has been insufficient work considering statistical and computational tractability and efficiency in learning, planning, and exploration from a unified and coherent perspective for algorithm design. This raises the question:
Is there a way to design a provable and practical algorithm that can effectively address both the statistical and computational challenges of Reinforcement Learning (RL)?
Here, by “provable” we mean that the statistical complexity of the algorithm can be rigorously characterized without explicit dependence on the number of states, but instead in terms of the fundamental complexity of the parameterized representation space; by “practical” we mean that the learning, planning, and exploration components of the algorithm are computationally tractable and can be implemented in real-world scenarios.

This work provides an affirmative answer to the question above by establishing a representation view of latent variable dynamics models through a connection to linear MDPs. This connection immediately provides a computationally tractable approach to planning and exploration in the linear space constructed by the flexible deep latent variable model. The latent variable view also provides a variational learning method that remedies the intractability of MLE for general linear MDPs. Our main contributions are the following:

  • We establish the representation view of latent variable dynamics models in RL, which naturally induces the Latent Variable Representation (LV-Rep) for linearly representing the state-action value function (see the sketch after this list), and paves the way for a practical variational method for representation learning;
  • We provide computationally efficient algorithms that implement the principles of optimism and pessimism in the face of uncertainty with the learned LV-Rep for online and offline RL;
  • We theoretically analyze the sample complexity of LV-Rep in both online and offline settings, which reveals the essential complexity beyond the cardinality of the latent variable;
  • We empirically demonstrate that LV-Rep outperforms state-of-the-art model-based and model-free RL algorithms on several RL benchmarks.
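To make the linear MDP connection above concrete, here is a schematic sketch in our own simplified notation; the precise assumptions and normalizations are spelled out in the paper. A latent variable dynamics model factorizes the transition through a latent variable $z$:
\[
P(s' \mid s, a) \;=\; \int p(z \mid s, a)\, p(s' \mid z)\, dz
\;=\; \big\langle \phi(s, a), \mu(s') \big\rangle,
\qquad \phi(s, a) := p(\cdot \mid s, a), \quad \mu(s') := p(s' \mid \cdot),
\]
which can be read as a (possibly infinite-dimensional) linear MDP factorization. Consequently, for any policy $\pi$,
\[
Q^{\pi}(s, a) \;=\; r(s, a) + \int P(s' \mid s, a)\, V^{\pi}(s')\, ds'
\;=\; r(s, a) + \big\langle \phi(s, a), w^{\pi} \big\rangle,
\qquad w^{\pi} := \int \mu(s')\, V^{\pi}(s')\, ds',
\]
so planning and UCB-style exploration reduce to regression and confidence bonuses in the learned feature space $\phi(s, a)$.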

Experiments


We extensively test our algorithm on the MuJoCo and DeepMind Control Suite benchmarks.

Dense-Reward MuJoCo Benchmarks

This table presents all experiment results, averaged over 4 random seeds. In practice, we found that LV-Rep-C provides comparable or better performance, so we report its results for LV-Rep in the table. We present the best model-based RL performance for comparison.

The results clearly show that LV-Rep provides significantly better or comparable performance compared to all model-based algorithms. In particular, in the most challenging domains, such as Walker and Ant, most model-based methods completely fail at the task, while LV-Rep achieves state-of-the-art performance. Furthermore, LV-Rep shows dominant performance in all domains compared to two representative representation-learning-based RL methods, Deep Successor Feature (DeepSF) and SPEDE. LV-Rep also achieves better performance than the strongest model-free algorithm, SAC, in the most challenging domains, except Humanoid.

Finally, the learning curves of LV-Rep-C and LV-Rep-D in comparison to SAC clearly show that, compared to the SOTA (state-of-the-art) model-free baseline, LV-Rep enjoys far better sample efficiency on these tasks.

Sparse-Reward DeepMind Control Suite

We compare all algorithms after running 200K environment steps across 4 random seeds. Results are presented in the table above. We report the results of LV-Rep-C for LV-Rep, as it gives better empirical performance. We can clearly observe that LV-Rep delivers the best performance across all domains. In the relatively dense-reward problems, cheetah-run and walker-run, LV-Rep outperforms all baselines by a large margin. Remarkably, in the sparser-reward problems, hopper-hop and humanoid-run, LV-Rep provides reasonable results while other methods do not even start learning.
We also plot the learning curves of LV-Rep against all competitors in the figure above, which shows that LV-Rep outperforms the other baselines in terms of both sample efficiency and final performance.

II. CTRL: Making Linear MDPs Practical via Contrastive Representation Learning

Abstract

It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations. This motivates much of the recent theoretical study on linear MDPs. However, most approaches require a given representation under unrealistic assumptions about the normalization of the decomposition or introduce unresolved computational challenges in practice. Instead, we consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning via contrastive estimation. The framework also admits confidence-adjusted index algorithms, enabling an efficient and principled approach to incorporating optimism or pessimism in the face of uncertainty. To the best of our knowledge, this provides the first practical representation learning method for linear MDPs that achieves both strong theoretical guarantees and empirical performance. Theoretically, we prove that the proposed algorithm is sample efficient in both the online and offline settings. Empirically, we demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.

Introduction

We address the lingering computational efficiency issues in representation learning for linear MDPs. We provide solutions to the aforementioned difficulties, bridge the gap between theory and practice, and eventually establish a sound yet practical foundation for learning in linear MDPs. More specifically:

  • We clarify the importance of the normalization condition and propose a variant of linear MDPs wherein it is easy to enforce normalization;
  • We develop the ConTrastive Representation Learning algorithm, CTRL, which implicitly satisfies the density requirement (see the sketch after this list);
  • We incorporate the representation learning techniques into optimistic and pessimistic confidence-adjusted RL algorithms, CTRL-UCB and CTRL-LCB, for online and offline RL problems respectively, and establish their sample complexities;
  • We conduct a comprehensive comparison to existing state-of-the-art model-based and model-free RL algorithms on several benchmarks for the online and offline settings, demonstrating superior empirical performance of the proposed CTRL.
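As a rough illustration of these two ingredients, the snippet below is a minimal sketch in PyTorch; it is ours, not the exact CTRL objective or implementation, and the network sizes, the InfoNCE form of the contrastive loss, and the bonus scale alpha are all assumptions. It fits features phi(s, a) and mu(s') so that the true next state scores higher than in-batch negatives, and computes the standard elliptical confidence bonus that a UCB-style planner adds to the reward (negate it for a pessimistic, LCB-style penalty in the offline setting).

# Hypothetical sketch (ours), not the exact CTRL implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PhiNet(nn.Module):
    """Maps (s, a) to a d-dimensional feature phi(s, a)."""
    def __init__(self, s_dim, a_dim, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


class MuNet(nn.Module):
    """Maps a next state s' to a d-dimensional feature mu(s')."""
    def __init__(self, s_dim, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, s_next):
        return self.net(s_next)


def contrastive_loss(phi, mu):
    """InfoNCE-style loss: each (s, a) should score its own s' highest,
    with the other next states in the batch acting as negatives."""
    logits = phi @ mu.T                     # [B, B] pairwise scores <phi_i, mu_j>
    labels = torch.arange(phi.shape[0])     # positives sit on the diagonal
    return F.cross_entropy(logits, labels)


def elliptical_bonus(phi, cov_inv, alpha=1.0):
    """UCB-style bonus alpha * sqrt(phi^T Lambda^{-1} phi)."""
    quad = torch.einsum("bi,ij,bj->b", phi, cov_inv, phi)
    return alpha * torch.sqrt(quad.clamp(min=0.0))


if __name__ == "__main__":
    s_dim, a_dim, d, B = 17, 6, 64, 128
    phi_net, mu_net = PhiNet(s_dim, a_dim, d), MuNet(s_dim, d)
    s, a, s_next = torch.randn(B, s_dim), torch.randn(B, a_dim), torch.randn(B, s_dim)

    phi, mu = phi_net(s, a), mu_net(s_next)
    loss = contrastive_loss(phi, mu)                           # representation learning step
    cov = phi.detach().T @ phi.detach() + 1e-3 * torch.eye(d)  # regularized feature covariance
    bonus = elliptical_bonus(phi.detach(), torch.linalg.inv(cov))
    print(loss.item(), bonus.mean().item())

In a full agent, the covariance would be accumulated over the whole replay buffer rather than a single batch, and the learned phi would feed a critic that is (near-)linear in the features.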

Experiments


We show that CTRL-UCB achieves state-of-the-art performance in MuJoCo and DeepMind Control Suite benchmarks, compared to both model-based (e.g., PETS, ME-TRPO) and model-free RL algorithms (e.g., SAC).

Dense-Reward MuJoCo Benchmarks


In the table above, we can see that CTRL-UCBL achieves state-of-the-art (SoTA) results among all model-based RL algorithms and significantly improves over the prior baselines. Despite improvements in final return, prior model-based RL methods can hardly learn behaviors such as walking and moving the ant in the Walker and Ant tasks. By contrast, CTRL-UCBL learns both behaviors successfully.

We also compare our algorithm with the SoTA model-free RL method SAC. Our method achieves comparable, and sometimes better, performance in most of the tasks. Lastly, compared with the two representation learning baselines, DeepSF and SPEDE, CTRL-UCBL also shows dominant performance. However, the performance of CTRL-UCBL is not satisfactory in the Humanoid and Hopper environments, potentially due to their hard-exploration nature and the linearly parametrized Q function failing to capture the bonus needed for exploration.

Sparse-Reward DeepMind Control Suite


To better understand how CTRL performs in sparse-reward settings, we conduct experiments on the DeepMind Control Suite. We compare CTRL-UCBM with model-free RL methods, namely SAC (using 2-layer and 3-layer MLPs as the critic network) and PPO (using a 3-layer MLP as the value network). In addition, we also compare with the representation learning method DeepSF.

CTRL-UCBM outperforms SAC (2-layer) by a large margin even in relatively dense-reward settings like cheetah-run and walker-run. The margin becomes even larger in sparse-reward settings like walker-run-sparse and hopper-hop. Even compared with SAC using a 3-layer MLP, CTRL-UCBM achieves better performance. This again shows the effectiveness of our method.

Offline Reinforcement Learning


We compare CTRL-LCB with state-of-the-art offline RL algorithms and include CPC as a representation learning baseline. Besides the state-action representation learned by CTRL-LCB, one major difference between CPC and CTRL-LCB is the additional reward penalty introduced by CTRL-LCB. The average normalized scores are presented in the table above. CTRL-LCB performs well on the medium-expert set of tasks, and the medium set of tasks also benefits from our method. However, CTRL-LCB is unable to improve performance on medium-replay, potentially because representations induced from low-quality data are incapable of recovering a good policy.

III. μLV-Rep: Provable Representation with Efficient Planning for Partially Observable Reinforcement Learning

Abstract

In most real-world reinforcement learning applications, state information is only partially observable, which breaks the Markov decision process assumption and leads to inferior performance for algorithms that conflate observations with state. Partially Observable Markov Decision Processes (POMDPs), on the other hand, provide a general framework that allows partial observability to be accounted for in learning, exploration, and planning, but present significant computational and statistical challenges. To address these difficulties, we develop a representation-based perspective that leads to a coherent framework and a tractable algorithmic approach for practical reinforcement learning from partial observations. We provide a theoretical analysis justifying the statistical efficiency of the proposed algorithm, and empirically demonstrate that it can surpass state-of-the-art performance with partial observations across various benchmarks, advancing reliable reinforcement learning towards more practical applications.

Introduction

Empirical successes have motivated investigation into structured POMDPs that allow some of the core computational and statistical complexities to be overcome, providing an improved understanding of exploitable structure as well as practical new algorithms with rigorous justification. However, most works rely on the existence of an ideal computational oracle for planning, which, unsurprisingly, is infeasible in most cases and hence difficult to apply in practice. Although there have been a few attempts to overcome the computational complexity of POMDPs, these algorithms are either only applicable to the tabular setting or rely on an integration oracle that quickly becomes intractable for large observation spaces. This gap immediately motivates the question:
Can efficient and practical RL algorithms be designed for partial observations by exploiting natural structures?
By “efficient” we mean that the statistical complexity avoids an exponential dependence on the history length, while by “practical” we mean that every component of learning, planning, and exploration can be easily implemented and applied in practical settings.

In this paper, we provide an affirmative answer to this question. More specifically,

  • We reveal for the first time that an L-decodable POMDP admits a sufficient representation, the Multi-step Latent Variable Representation (μLV-Rep), that supports exact and tractable linear representation of the value functions (sketched after this list), breaking the fundamental computational barriers;
  • We design a computationally efficient planning algorithm that can implement both the principles of optimism and pessimism in the face of uncertainty for online and offline POMDPs respectively, by leveraging the learned sufficient representation μLV-Rep;
  • We provide a theoretical analysis of the sample complexity of the proposed algorithm, justifying its efficiency in balancing exploitation versus exploration;
  • We conduct a comprehensive empirical comparison to existing RL algorithms for POMDPs on several benchmarks, demonstrating the superior empirical performance of μLV-Rep.
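Schematically, in our own simplified notation (the paper's multi-step construction is more careful about how future actions are handled), the idea is that in an $L$-decodable POMDP the recent observation-action window plays the role of the state, so the latent variable construction from Section I carries over. Let $\tau_h := (o_{h-L+1}, a_{h-L+1}, \ldots, o_h)$ denote the recent length-$L$ window, which by decodability suffices to infer the latent state. Modeling the window dynamics with a latent variable $z$ gives
\[
P(o_{h+1} \mid \tau_h, a_h) \;=\; \int p(z \mid \tau_h, a_h)\, p(o_{h+1} \mid z)\, dz
\;=\; \big\langle \phi(\tau_h, a_h), \mu(o_{h+1}) \big\rangle,
\]
and the paper shows that a multi-step extension of these features, μLV-Rep, suffices to represent $Q^{\pi}(\tau_h, a_h)$ linearly, which is what makes planning and UCB/LCB exploration tractable under partial observability.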

Experiments


We evaluate the proposed method on Meta-world, an open-source simulated benchmark consisting of 50 distinct robotic manipulation tasks with visual observations. We also provide experimental results on partially observable control problems constructed from OpenAI Gym MuJoCo.

MuJoCo Benchmarks


Performance on various continuous control problems with partial observations. All results are averaged across 4 random seeds and a window size of 10K steps. μLV-Rep achieves the best performance compared to the baselines. Here, Best-FO denotes the performance of LV-Rep using full observations as inputs, providing a reference for the best an algorithm can achieve in our tests.

The results clearly demonstrate that the proposed method consistently delivers either competitive or superior outcomes across all domains compared to both the model-based and model-free baselines. We note that in most domains, μLV-Rep nearly matches the performance of Best-FO, further confirming that the proposed method is able to extract useful representations for decision-making in partially observable environments.

Meta-world Benchmarks


Success rate on 50 Meta-world tasks after 1 million interactions. The blue bars show our results; the gray bars show the best performance achieved by DreamerV2 and MWM. Our method succeeds on most of the tasks (success rate above 90% on 33 tasks). The results are better than or equivalent to (within a 10% difference) the best of DreamerV2 and MWM on 41 tasks.

It is noteworthy that while MWM achieves sample efficiency comparable to μLV-Rep on certain tasks, it incurs higher computational costs and longer running times due to the ViT network in its model. In our experimental configuration, μLV-Rep trains at 21.3 steps per second, whereas MWM trains at 8.1 steps per second, highlighting the computational efficiency of the proposed method.

IV. Diff-SR: Diffusion Spectral Representation for Reinforcement Learning

Abstract

Diffusion-based models have achieved notable empirical successes in reinforcement learning (RL) due to their expressiveness in modeling complex distributions. Despite their promise, the key challenge in extending existing methods to broader real-world applications lies in the computational cost at inference time: sampling from a diffusion model is considerably slow, as it often requires tens to hundreds of iterations to generate even one sample. To circumvent this issue, we propose to leverage the flexibility of diffusion models for RL from a representation learning perspective. In particular, by exploiting the connection between diffusion models and energy-based models, we develop Diffusion Spectral Representation (Diff-SR), a coherent algorithmic framework that enables extracting sufficient representations for value functions in Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). We further demonstrate how Diff-SR facilitates efficient policy optimization and practical algorithms while explicitly bypassing the difficulty and inference cost of sampling from the diffusion model. Finally, we provide comprehensive empirical studies to verify the benefits of Diff-SR in delivering robust and advantageous performance across various benchmarks with both fully and partially observable settings.

Introduction

Although diffusion models are powerful tools for representing complex, multimodal data distributions, their flexibility comes with a substantial inference cost: generating even a single sample from a diffusion model is notably slow, typically requiring tens to thousands of iterations. These computational demands are particularly problematic for RL, since the learning agent must frequently query the model during interactions with the environment.

In this paper, we exploit the connection between diffusion models and energy-based models (EBMs), and develop Diffusion Spectral Representation (Diff-SR), which leverages the flexibility of diffusion models by extracting representations that capture the dynamics structure. Specifically,

  • We show that such diffusion-based representations are sufficiently expressive to represent the value function of any policy (see the sketch after this list), which paves the way for efficient planning and exploration;
  • We circumvent the need for sample generation from the diffusion model, thereby avoiding the inference costs associated with prior diffusion-based methods;
  • Integrated with existing model-free policy optimization algorithms, Diff-SR demonstrates robust, superior performance and efficiency across various benchmarks.
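As a rough sketch, in our own notation, of the diffusion-to-EBM reading that motivates Diff-SR (the precise construction and training objective are given in the paper): a noise-conditional diffusion model of the transition can be viewed as an energy-based model,
\[
p_t(s' \mid s, a) \;\propto\; \exp\!\big(-E_t(s, a, s')\big),
\qquad
\nabla_{s'} \log p_t(s' \mid s, a) \;=\; -\nabla_{s'} E_t(s, a, s'),
\]
where the score on the right is what denoising score matching estimates. Factorizing the unnormalized density spectrally,
\[
\exp\!\big(-E_t(s, a, s')\big) \;\approx\; \big\langle \phi(s, a), \mu(s') \big\rangle,
\]
yields features $\phi(s, a)$ in which, as in the linear MDP view above, value functions admit a (near-)linear representation, $Q^{\pi}(s, a) \approx r(s, a) + \langle \phi(s, a), w^{\pi} \rangle$, so policy optimization can work directly with $\phi$ and never needs to draw samples $s'$ from the diffusion model.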

Experiments

We evaluate Diff-SR with state-based MDP and image-based POMDP tasks to investigate the capability of Diff-SR comprehensively.

Gym-MuJoCo Benchmarks

The following table presents the performance of Diff-SR and baseline RL algorithms after 200K environment steps. Results are averaged across 4 random seeds and a window size of 10K steps. Diff-SR achieves significantly better or comparable performance relative to all baseline methods, including the diffusion-based method PolyGRAD, in nearly all of the tasks.

Besides its superior performance, Diff-SR does not require sampling from the diffusion model while still leveraging the flexibility of diffusion models. This leads to a 4x faster training speed compared to other diffusion-based RL methods, as we observed in our experiments (see the next figure).

Meta-World Benchmarks

In the most difficult setting, we evaluate Diff-SR on 8 visual-input tasks selected from the Meta-World benchmark. Diff-SR achieves a success rate above 90% on seven of the tasks, four more than the second-best baseline, μLV-Rep. Overall, Diff-SR exhibits superior performance, faster convergence, and stable optimization on most of the tasks compared to the baseline methods.

V. SPEDER: Spectral Decomposition Representation for Reinforcement Learning

Abstract

Representation learning often plays a critical role in reinforcement learning by managing the curse of dimensionality. A representative class of algorithms exploits a spectral decomposition of the stochastic transition dynamics to construct representations that enjoy strong theoretical properties in an idealized setting. However, current spectral methods suffer from limited applicability because they are constructed for state-only aggregation and derived from a policy-dependent transition kernel, without considering the issue of exploration. To address these issues, we propose an alternative spectral method, Spectral Decomposition Representation (SPEDER), that extracts a state-action abstraction from the dynamics without inducing spurious dependence on the data collection policy, while also balancing the exploration-versus-exploitation trade-off during learning. A theoretical analysis establishes the sample efficiency of the proposed algorithm in both the online and offline settings. In addition, an experimental investigation demonstrates superior performance over current state-of-the-art algorithms across several benchmarks.

Introduction

Despite their elegance and desirable properties, current spectral representation algorithms exhibit several drawbacks. One drawback is that current methods generate state-only features, which are heavily influenced by the behavior policy and can fail to generalize well to alternative policies. Moreover, most existing spectral representation learning algorithms ignore the intimate coupling between representation learning and exploration, and instead learn the representation from a pre-collected static dataset. This is problematic, as effective exploration depends on having a good representation, while learning the representation requires comprehensively covered experience; failing to properly manage this interaction can lead to fundamentally sample-inefficient data collection. These limitations lead to suboptimal features and limited empirical performance. In this paper, we address these important but largely ignored issues, and provide a novel spectral representation learning method that generates policy-independent features and provably manages the delicate balance between exploration and exploitation. More specifically:

  • We provide a spectral decomposition view of several current representation learning methods, and identify the cause of spurious dependencies in state-only spectral features;
  • We develop a novel model-free objective, Spectral Decomposition Representation (SPEDER), that factorizes the policy-independent transition kernel to eliminate policy-induced dependencies (see the sketch after this list), while revealing the connection between model-free and model-based representation learning;
  • We provide algorithms that implement the principles of optimism and pessimism in the face of uncertainty using the SPEDER features for online and offline RL, and equip behavior cloning with SPEDER for imitation learning;
  • We analyze the sample complexity of SPEDER in both the online and offline settings, to justify the achieved balance between exploration and exploitation.
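To make the factorization concrete, here is a minimal sketch of a model-free spectral factorization loss; it is ours and not necessarily the exact SPEDER objective, and the in-batch negative scheme and least-squares form are assumptions. It fits phi(s, a) and mu(s') so that their inner product approximates the density ratio P(s' | s, a) / rho(s'), where rho is the marginal over next states from which negatives are drawn.

# Hypothetical sketch (ours), not necessarily the exact SPEDER objective.
import torch


def spectral_factorization_loss(phi, mu):
    """Least-squares density-ratio fitting of f(s, a, s') = <phi(s, a), mu(s')>
    to P(s' | s, a) / rho(s'): loss = -2 E_pos[f] + E_neg[f^2], where positives
    pair (s_i, a_i) with its observed s'_i and negatives use mismatched in-batch
    pairings. phi: [B, d] features of (s_i, a_i); mu: [B, d] features of s'_i."""
    B = phi.shape[0]
    positive = (phi * mu).sum(dim=-1)                  # <phi_i, mu_i> on matched pairs
    cross = phi @ mu.T                                 # all pairings <phi_i, mu_j>
    off_diag = cross[~torch.eye(B, dtype=torch.bool)]  # mismatched pairs ~ rho(s,a) x rho(s')
    return -2.0 * positive.mean() + (off_diag ** 2).mean()


if __name__ == "__main__":
    d, B = 64, 256
    phi = torch.randn(B, d, requires_grad=True)  # stand-ins for encoder outputs
    mu = torch.randn(B, d, requires_grad=True)
    loss = spectral_factorization_loss(phi, mu)
    loss.backward()
    print(loss.item())

In a full agent, phi and mu would be neural encoders of (s, a) and s', and the learned phi would then plug into the optimistic/pessimistic planners described above.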

Experiments


We show that SPEDER achieves state-of-the-art performance in MuJoCo and DeepMind Control Suite benchmarks, compared to both model-based (e.g., PETS, ME-TRPO) and model-free RL algorithms (e.g., SAC).

Dense-Reward MuJoCo Benchmarks


In the table above, we show that SPEDER achieves SoTA results among all model-based RL algorithms and significantly improves over the prior baselines. We also compare the algorithm with the SoTA model-free RL method SAC; the proposed method achieves comparable or better performance in most of the tasks. Lastly, compared to the two representation learning baselines, Deep SF and SPEDE, SPEDER also shows superior performance, which demonstrates that SPEDER is able to overcome the aforementioned drawbacks of vanilla spectral representations.

Sparse-Reward DeepMind Control Suite


To evaluate the exploration performance of SPEDER, we additionally run experiments on the DeepMind Control Suite; results are in the table above. We compare the proposed method with SAC (using 2-layer, 3-layer, and 5-layer MLPs for the critic network), PPO, Dreamer-v2, Deep SF, and Proto-RL. We see that SPEDER achieves superior performance compared to SAC using the 2-layer critic network. Compared to SAC and PPO with deeper critic networks, SPEDER shows significant gains in tasks with sparse rewards (e.g., walker-run-sparse and hopper-hop).

Citations

@misc{ren2023latent,
    title={Latent Variable Representation for Reinforcement Learning},
    author={Tongzheng Ren and Chenjun Xiao and Tianjun Zhang and Na Li and Zhaoran Wang and Sujay Sanghavi and Dale Schuurmans and Bo Dai},
    year={2023},
    eprint={2212.08765},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
@misc{zhang2022making,
    title={Making Linear MDPs Practical via Contrastive Representation Learning},
    author={Tianjun Zhang and Tongzheng Ren and Mengjiao Yang and Joseph E. Gonzalez and Dale Schuurmans and Bo Dai},
    year={2022},
    eprint={2207.07150},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
@misc{zhang2024provable,
    title={Provable Representation with Efficient Planning for Partially Observable Reinforcement Learning},
    author={Hongming Zhang and Tongzheng Ren and Chenjun Xiao and Dale Schuurmans and Bo Dai},
    year={2024},
    eprint={2311.12244},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
@misc{shribak2024diffusion,
    title={Diffusion Spectral Representation for Reinforcement Learning},
    author={Dmitry Shribak and Chen-Xiao Gao and Yitong Li and Chenjun Xiao and Bo Dai},
    year={2024},
    eprint={2406.16121},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
@misc{ren2023spectraldecompositionrepresentationreinforcement,
    title={Spectral Decomposition Representation for Reinforcement Learning},
    author={Tongzheng Ren and Tianjun Zhang and Lisa Lee and Joseph E. Gonzalez and Dale Schuurmans and Bo Dai},
    year={2023},
    eprint={2208.09515},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}