Introduction
Empirical successes have motivated the investigation of structured POMDPs in which some of the core computational and statistical complexities can be overcome, providing an improved understanding of exploitable structure together with practical new algorithms that have rigorous justification. However, most of these works rely on an ideal computational oracle for planning, which, unsurprisingly, is infeasible in most cases and hence difficult to apply in practice. Although there have been a few attempts to overcome the computational complexity of POMDPs, the resulting algorithms are either applicable only to the tabular setting or rely on an integration oracle that quickly becomes intractable for large observation spaces. This gap immediately motivates the question:
Can efficient and practical RL algorithms be designed for partial observations by exploiting natural structures?
By “efficient” we mean that the statistical complexity avoids an exponential dependence on the history length,
while by “practical” we mean that every component of learning, planning, and exploration can be easily implemented and applied in practical settings.
In this paper, we provide an affirmative answer to this question. More specifically,
- We reveal for the first time that an -decodable POMDP admits a sufficient representation, the Multi-step Latent Variable Representation (μLV-Rep), that supports an exact and tractable linear representation of the value function, breaking a fundamental computational barrier;
- We design a computationally efficient planning algorithm that can implement both the principles of optimism and pessimism in the face of uncertainty for online and offline POMDPs respectively, by leveraging the learned sufficient representation μLV-Rep;
- We provide a theoretical analysis of the sample complexity of the proposed algorithm, justifying its efficiency in balancing exploitation versus exploration;
- We conduct a comprehensive empirical comparison with existing RL algorithms for POMDPs on several benchmarks, demonstrating the superior empirical performance of μLV-Rep.
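To make the planning contribution concrete, the sketch below shows a generic optimistic least-squares value-iteration step on top of a learned feature map, in the spirit of the second bullet. Everything here (the stand-in feature map `phi`, the synthetic dataset, the dimensions, and the bonus coefficient `beta`) is an illustrative assumption, not the paper's actual μLV-Rep implementation: once the value function is linear in learned features, optimism reduces to ridge regression plus an elliptical confidence bonus.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 8, 3          # feature dimension and action count (assumed)

def phi(obs_window, a):
    """Stand-in for a learned sufficient representation: deterministically
    maps a short observation window and an action to a d-dim feature."""
    seed = abs(hash((tuple(np.round(obs_window, 3)), a))) % (2**32)
    return np.random.default_rng(seed).normal(size=d) / np.sqrt(d)

# Synthetic replay data: (obs_window, action, reward, next_obs_window).
dataset = []
for _ in range(200):
    o, o2 = rng.normal(size=2), rng.normal(size=2)
    a = int(rng.integers(n_actions))
    dataset.append((o, a, float(np.tanh(o.sum())), o2))

beta, lam = 0.1, 1.0
w = np.zeros(d)                                  # linear value weights
Lambda = lam * np.eye(d)                         # ridge covariance
for o, a, r, o2 in dataset:
    f = phi(o, a)
    Lambda += np.outer(f, f)
Lambda_inv = np.linalg.inv(Lambda)

def q_optimistic(o, a, w):
    """Linear Q-value plus elliptical bonus: optimism in the face of
    uncertainty (flip the bonus sign for pessimism in the offline case)."""
    f = phi(o, a)
    return f @ w + beta * np.sqrt(f @ Lambda_inv @ f)

# One least-squares value-iteration backup against optimistic targets.
Phi = np.stack([phi(o, a) for o, a, r, o2 in dataset])
targets = np.array([
    r + max(q_optimistic(o2, a2, w) for a2 in range(n_actions))
    for o, a, r, o2 in dataset
])
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ targets)

# Greedy action under the optimistic Q at a test observation window.
o_test = rng.normal(size=2)
greedy_a = max(range(n_actions), key=lambda a: q_optimistic(o_test, a, w))
print(greedy_a)
```

Because the bonus is a closed-form quadratic in the features, both the regression and the exploration bonus avoid any planning or integration oracle; that is the computational payoff of a linear value representation.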
Experiments
We evaluate the proposed method on Meta-world, an open-source simulated benchmark consisting of 50 distinct robotic manipulation tasks with visual observations. We also provide experimental results on partially observable control problems constructed from the OpenAI Gym MuJoCo suite.
Mujoco Benchmarks
Performance on various continuous control problems with partial observations. All results are averaged across 4 random seeds with a smoothing window of 10K steps. μLV-Rep achieves the best performance among the baselines. Here, Best-FO denotes the performance of LV-Rep with full observations as inputs, providing a reference for the best performance an algorithm can be expected to achieve in our tests.
The results clearly demonstrate that the proposed method consistently delivers either competitive or superior outcomes across all domains compared to both the model-based and model-free baselines. We note that in most domains, μLV-Rep nearly matches the performance of Best-FO, further confirming that the proposed method is able to extract useful representations for decision-making in partially observable environments.
Meta-world Benchmarks
Success rate on 50 Meta-world tasks after 1 million interactions. The blue bars show our results; the gray bars show the best performance achieved by DreamerV2 and MWM. Our method succeeds on most tasks (> 90% on 33 tasks). The results are better than or equivalent to (within a 10% difference) the best of DreamerV2 and MWM on 41 tasks.
It is noteworthy that while MWM achieves sample efficiency comparable to μLV-Rep on certain tasks, it incurs higher computational costs and longer running times due to the ViT network in its model. In our experimental configuration, μLV-Rep trains at 21.3 steps per second, outperforming MWM, which trains at 8.1 steps per second. This highlights the computational efficiency of the proposed method.
Citations
@misc{ren2023latent,
  title={Latent Variable Representation for Reinforcement Learning},
  author={Tongzheng Ren and Chenjun Xiao and Tianjun Zhang and Na Li and Zhaoran Wang and Sujay Sanghavi and Dale Schuurmans and Bo Dai},
  year={2023},
  eprint={2212.08765},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
@misc{zhang2022making,
  title={Making Linear MDPs Practical via Contrastive Representation Learning},
  author={Tianjun Zhang and Tongzheng Ren and Mengjiao Yang and Joseph E. Gonzalez and Dale Schuurmans and Bo Dai},
  year={2022},
  eprint={2207.07150},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
@misc{zhang2024efficient,
  title={Efficient Reinforcement Learning from Partial Observability},
  author={Hongming Zhang and Tongzheng Ren and Chenjun Xiao and Dale Schuurmans and Bo Dai},
  year={2024},
  eprint={2311.12244},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}