
Compatible natural gradient policy search

Pajarinen, Joni ; Thai, Hong Linh ; Akrour, Riad ; Peters, Jan ; Neumann, Gerhard (2022)
Compatible natural gradient policy search.
In: Machine Learning, 108 (8-9)
doi: 10.1007/s10994-019-05807-0
Article, Bibliography

This is the latest version of this item.

Abstract

Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use the KL-divergence to bound the trust region, resulting in a natural gradient policy update. We show that natural gradient and trust-region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method, compatible policy search (COPOS), which bounds the entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
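
For orientation, the trust-region view referenced in the abstract can be sketched in standard notation. This is generic natural-gradient background under assumed symbols (the step bound \epsilon, entropy bound \beta, gradient g, and Fisher matrix F are notation introduced here, not necessarily the paper's): the KL-constrained update

    \max_{\Delta\theta} \; g^{\top}\Delta\theta \quad \text{s.t.} \quad \tfrac{1}{2}\,\Delta\theta^{\top} F\,\Delta\theta \le \epsilon ,

where g = \nabla_\theta J(\theta) is the policy gradient and F the Fisher information matrix arising from a second-order approximation of the KL-divergence, has the natural gradient solution

    \Delta\theta^{*} = \sqrt{\frac{2\epsilon}{g^{\top} F^{-1} g}} \; F^{-1} g .

Entropy control in the spirit of COPOS additionally bounds the entropy loss per update, i.e. a constraint of the form H(\pi_\theta) - H(\pi_{\theta+\Delta\theta}) \le \beta, so that exploration is not reduced faster than the KL bound alone would allow.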

Item type: Article
Published: 2022
Author(s): Pajarinen, Joni ; Thai, Hong Linh ; Akrour, Riad ; Peters, Jan ; Neumann, Gerhard
Type of entry: Bibliography
Title: Compatible natural gradient policy search
Language: English
Year of publication: 2022
Publisher: Springer
Journal or publication title: Machine Learning
Volume: 108
Issue number: 8-9
DOI: 10.1007/s10994-019-05807-0
Dewey Decimal Classification (DDC): 000 Generalities, computer science, information science > 004 Computer science
600 Technology, medicine, applied sciences > 600 Technology
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Intelligent Autonomous Systems
TU projects: EC/H2020|640554|SKILLS4ROBOTS
Date deposited: 02 Aug 2024 12:37
Last modified: 02 Aug 2024 12:37