
Compatible natural gradient policy search

Pajarinen, Joni ; Thai, Hong Linh ; Akrour, Riad ; Peters, Jan ; Neumann, Gerhard (2022):
Compatible natural gradient policy search. (Publisher's Version)
In: Machine Learning, 108 (8-9), pp. 1443-1466. Springer, ISSN 0885-6125, e-ISSN 1573-0565,
DOI: 10.26083/tuprints-00020531,
[Article]

Abstract

Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use the KL divergence to bound the trust region, resulting in a natural gradient policy update. We show that natural gradient and trust-region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds the entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks as well as in discrete partially observable tasks.
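
As background for the abstract's claim, the following is a minimal sketch (standard formulation from the natural policy gradient literature, not taken verbatim from the paper) of the trust-region problem and the natural gradient update it refers to; here \theta_k denotes the current policy parameters and \epsilon the KL bound:

% Trust-region policy update: maximize the expected advantage under a KL constraint
\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_k}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[ \mathrm{KL}\!\big( \pi_{\theta_k}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \le \epsilon

% A second-order expansion of the KL constraint yields the Fisher information matrix
F(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right]

% so the constrained step reduces to the natural gradient update with step size \eta
\theta_{k+1} = \theta_k + \eta\, F(\theta_k)^{-1}\, \nabla_\theta J(\theta_k)

With a compatible advantage approximation \tilde{A}_w(s,a) = w^{\top} \nabla_\theta \log \pi_\theta(a \mid s), the natural gradient direction coincides with the fitted weights w; the paper establishes when this update is exactly equivalent to solving the trust-region problem and additionally bounds the entropy loss per update to avoid premature convergence.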

Item Type: Article
Published: 2022
Creators: Pajarinen, Joni ; Thai, Hong Linh ; Akrour, Riad ; Peters, Jan ; Neumann, Gerhard
Origin: Secondary publication service
Status: Publisher's Version
Title: Compatible natural gradient policy search
Language: English
Journal or Publication Title: Machine Learning
Volume of the journal: 108
Issue Number: 8-9
Publisher: Springer
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Intelligent Autonomous Systems
TU-Projects: EC/H2020|640554|SKILLS4ROBOTS
Date Deposited: 10 Feb 2022 13:10
DOI: 10.26083/tuprints-00020531
URL / URN: https://tuprints.ulb.tu-darmstadt.de/20531
URN: urn:nbn:de:tuda-tuprints-205319