Luis Goncalves, Carlos Enrique (2025)
Uncertainty Representations in Reinforcement Learning.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00028956
Dissertation, first publication, publisher's version
Abstract
Reinforcement learning (RL) has achieved tremendous success over the last decade, primarily through massive compute in simulated environments. However, applications of RL in physical systems have lagged behind as a result of open challenges such as sample-efficient learning, partial observability, generalization and adaptation to unseen tasks. Consequently, there exists a large gap to be filled before RL becomes the standard for enabling autonomous systems in the real world.
In this thesis, we argue that proper handling of uncertainty is key to addressing these challenges. We introduce uncertainty estimation techniques that account for the sequential nature of decision-making, which enables a seamless integration of the resulting uncertainty estimates into RL algorithms.
First, we adopt the model-based RL paradigm and investigate methods that propagate uncertainty from the learned dynamics up to long-term predictions of the value of a control policy. Key to these approaches are probabilistic models that separate aleatoric and epistemic uncertainty: the former is an inherent part of the problem and therefore irreducible, while the latter exists due to a lack of knowledge about the dynamics and can be reduced by strategically collecting more data. We first tackle the problem of estimating the epistemic variance around the predicted performance (value) of a policy. We derive a theoretically grounded estimation algorithm that effectively propagates model uncertainty and recovers the desired variance. We then demonstrate how to use such epistemic variance estimates for improved exploration in tabular problems. For more challenging continuous control tasks, we identify challenges in applying our theory and propose a suitable approximation, which leads to a practical deep RL architecture that accommodates risk-seeking or risk-averse policy optimization.
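The split between aleatoric and epistemic uncertainty can be illustrated with a minimal sketch. It is not the thesis's value-variance estimator: it only applies the law of total variance to one-step predictions of an equally weighted ensemble of Gaussian models, which is an assumed setup. The function name `decompose_uncertainty` and the numbers are hypothetical.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split the predictive variance of an equally weighted Gaussian ensemble.

    means[i] and variances[i] are the mean and variance predicted by ensemble
    member i for the same input (hypothetical interface, not from the thesis).
    """
    if not means or len(means) != len(variances):
        raise ValueError("need one mean and one variance per ensemble member")
    # Aleatoric part: average noise each member predicts (irreducible).
    aleatoric = fmean(variances)
    # Epistemic part: disagreement between members (shrinks with more data).
    epistemic = pvariance(means)
    # Law of total variance: the two parts sum to the total predictive variance.
    return aleatoric, epistemic, aleatoric + epistemic


# Illustrative numbers only: three members predicting the same next state.
aleatoric, epistemic, total = decompose_uncertainty(
    means=[1.0, 1.2, 0.8], variances=[0.10, 0.12, 0.08]
)
print(f"aleatoric={aleatoric:.4f} epistemic={epistemic:.4f} total={total:.4f}")
```

In this toy setting, only the epistemic term would be expected to fall as the members agree more closely after training on additional data.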
As a natural next step, we show how to efficiently learn an entire distribution of policy values, rather than just its mean and variance. The distributional representation of epistemic uncertainty around values is more expressive and allows for a wider range of policy optimization objectives while having low computational overhead. Furthermore, empirical evaluation in diverse control tasks indicates a substantial improvement in final performance and sample efficiency over state-of-the-art methods.
Next, we consider the problem of partial observability in model-free RL. That is, the environment observations provide limited information for decision-making, so the hidden state of the environment must be inferred from trajectory data. In this setting, we propose sequence models composed of Kalman filter (KF) layers that perform closed-form Gaussian inference in linear state-space models and train them end-to-end to maximize returns. By design, the KF layers are a drop-in replacement for previous recurrent layers in model-free architectures, but they are equipped with an explicit mechanism for probabilistic filtering of the latent state representation. We empirically demonstrate that KF layers excel in tasks where reasoning over uncertainty is crucial for decision-making.
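As a rough illustration of closed-form Gaussian inference in a linear state-space model, the sketch below runs a textbook scalar Kalman filter. It is not the KF layer proposed in the thesis: the parameters `a`, `c`, `q` and `r` are fixed, hand-picked assumptions rather than quantities learned end-to-end, and the function names are hypothetical.

```python
def kalman_step(mean, var, obs, a=1.0, c=1.0, q=0.01, r=0.1):
    """One predict/update step for x_t = a*x_{t-1} + w, y_t = c*x_t + v.

    w ~ N(0, q) is process noise and v ~ N(0, r) is observation noise.
    All parameter values are illustrative assumptions.
    """
    # Predict: push the Gaussian belief through the linear dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update: condition the predicted belief on the new observation.
    gain = var_pred * c / (c * c * var_pred + r)
    mean_new = mean_pred + gain * (obs - c * mean_pred)
    var_new = (1.0 - gain * c) * var_pred
    return mean_new, var_new


def kalman_filter(observations, mean0=0.0, var0=1.0, **params):
    """Filter a sequence and return the (mean, variance) belief at each step."""
    mean, var = mean0, var0
    beliefs = []
    for obs in observations:
        mean, var = kalman_step(mean, var, obs, **params)
        beliefs.append((mean, var))
    return beliefs


# Noisy readings of a roughly constant hidden state (illustrative data).
for mean, var in kalman_filter([0.9, 1.1, 1.0, 0.95]):
    print(f"mean={mean:.3f} var={var:.3f}")
```

The belief variance is the explicit uncertainty signal here; a recurrent layer without such a mechanism would have to represent it implicitly in its hidden state.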
Item type: | Dissertation
---|---
Published: | 2025
Author(s): | Luis Goncalves, Carlos Enrique
Type of entry: | First publication
Title: | Uncertainty Representations in Reinforcement Learning
Language: | English
Referees: | Peters, Prof. Jan; Bellemare, Prof. Marc G.
Publication date: | 8 January 2025
Place of publication: | Darmstadt
Collation: | xii, 125 pages
Date of oral examination: | 6 December 2024
DOI: | 10.26083/tuprints-00028956
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/28956
Status: | Publisher's version
URN: | urn:nbn:de:tuda-tuprints-289562
Dewey Decimal Classification (DDC) subject group: | 000 Generalities, computer science, information science > 004 Computer science
Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Intelligent Autonomous Systems
Date deposited: | 08 Jan 2025 13:05
Last modified: | 15 Jan 2025 12:58