Reinforcement Learning with Non-Exponential Discounting

Schultheis, Matthias ; Rothkopf, Constantin A. ; Koeppl, Heinz (2022)
Reinforcement Learning with Non-Exponential Discounting.
36th Conference on Neural Information Processing Systems (NeurIPS 2022). New Orleans, USA (28.11.-09.12.2022)
Conference publication, bibliography

Abstract

Commonly in reinforcement learning (RL), rewards are discounted over time using an exponential function to model time preference, thereby bounding the expected long-term reward. In contrast, in economics and psychology, it has been shown that humans often adopt a hyperbolic discounting scheme, which is optimal when a specific task termination time distribution is assumed. In this work, we propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions. This formulation covers the case in which there is a non-exponential random termination time. We derive a Hamilton–Jacobi–Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method, which uses deep learning for function approximation. Further, we show how the inverse RL problem can be approached, in which one tries to recover properties of the discount function given decision data. We validate the applicability of our proposed approach on two simulated problems. Our approach opens the way for the analysis of human discounting in sequential decision-making tasks.
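The link between hyperbolic discounting and a random termination time can be illustrated with a small numerical check. This is a minimal sketch of a standard textbook argument, not code from the paper: it assumes the task ends at a constant but unknown hazard rate, and that this rate follows an exponential prior with mean k. The abstract does not say which termination distribution the authors consider, so that choice, along with the function names `hyperbolic_discount` and `expected_survival`, is an assumption made here for illustration. Under this assumption, the probability that the task is still running at time t equals 1 / (1 + k t).

```python
import math


def hyperbolic_discount(t, k):
    """Hyperbolic discount function 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)


def expected_survival(t, k, n_steps=200_000, upper=60.0):
    """Probability that the task has not terminated by time t.

    Assumption (illustrative only): the hazard rate lam is unknown and has
    an exponential prior with mean k. For a fixed lam the survival
    probability is exp(-lam * t); we average it over the prior.

    With the substitution u = lam / k the prior density becomes exp(-u),
    and the integral over u in [0, upper] is computed with the trapezoid
    rule. The tail beyond `upper` is negligible (about exp(-60)).
    """
    h = upper / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        u = i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight * math.exp(-u) * math.exp(-k * u * t)
    return total * h


if __name__ == "__main__":
    k = 0.5  # hypothetical mean hazard rate
    for t in (0.0, 1.0, 5.0, 20.0):
        print(t, expected_survival(t, k), hyperbolic_discount(t, k))
```

The two printed columns should agree up to numerical integration error. Replacing the exponential prior with a point mass at a single hazard rate would recover the usual exponential discount function instead.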

Item type: Conference publication
Published: 2022
Author(s): Schultheis, Matthias ; Rothkopf, Constantin A. ; Koeppl, Heinz
Entry type: Bibliography
Title: Reinforcement Learning with Non-Exponential Discounting
Language: English
Publication date: 31 October 2022
Event title: 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
Event location: New Orleans, USA
Event dates: 28.11.-09.12.2022
URL / URN: https://openreview.net/forum?id=yjWir-w3gki
Uncontrolled keywords: Reinforcement Learning
Department(s)/field(s): 18 Department of Electrical Engineering and Information Technology
18 Department of Electrical Engineering and Information Technology > Institute for Telecommunications > Bioinspired Communication Systems
18 Department of Electrical Engineering and Information Technology > Institute for Telecommunications
18 Department of Electrical Engineering and Information Technology > Self-Organizing Systems Lab
Date deposited: 04 Apr 2024 11:19
Last modified: 04 Apr 2024 11:19