
Compatible natural gradient policy search

Pajarinen, Joni ; Thai, Hong Linh ; Akrour, Riad ; Peters, Jan ; Neumann, Gerhard (2022):
Compatible natural gradient policy search. (Publisher's Version)
In: Machine Learning, 108 (8-9), pp. 1443-1466. Springer, ISSN 0885-6125, e-ISSN 1573-0565,
DOI: 10.26083/tuprints-00020531,
[Article]

Abstract

Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use the KL divergence to bound the trust region, resulting in a natural gradient policy update. We show that natural gradient and trust-region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds the entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks as well as in discrete partially observable tasks.
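
As background for the abstract's claim, the following is a minimal sketch (standard formulation from the natural policy gradient literature, not taken verbatim from the paper) of the trust-region problem and the natural gradient update it refers to; here \theta_k denotes the current policy parameters and \epsilon the KL bound:

% Trust-region policy update: maximize the expected advantage under a KL constraint
\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_k}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[ \mathrm{KL}\!\big( \pi_{\theta_k}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \le \epsilon

% A second-order expansion of the KL constraint yields the Fisher information matrix
F(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right]

% so the constrained step reduces to the natural gradient update with step size \eta
\theta_{k+1} = \theta_k + \eta\, F(\theta_k)^{-1}\, \nabla_\theta J(\theta_k)

With a compatible advantage approximation \tilde{A}_w(s,a) = w^{\top} \nabla_\theta \log \pi_\theta(a \mid s), the natural gradient direction coincides with the fitted weights w; the paper establishes when this update is exactly equivalent to solving the trust-region problem and additionally bounds the entropy loss per update to avoid premature convergence.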

Item Type: Article
Published: 2022
Creators: Pajarinen, Joni ; Thai, Hong Linh ; Akrour, Riad ; Peters, Jan ; Neumann, Gerhard
Origin: Secondary publication service
Status: Publisher's Version
Title: Compatible natural gradient policy search
Language: English
Journal or Publication Title: Machine Learning
Volume of the journal: 108
Issue Number: 8-9
Publisher: Springer
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Intelligent Autonomous Systems
TU-Projects: EC/H2020|640554|SKILLS4ROBOTS
Date Deposited: 10 Feb 2022 13:10
DOI: 10.26083/tuprints-00020531
URL / URN: https://tuprints.ulb.tu-darmstadt.de/20531
URN: urn:nbn:de:tuda-tuprints-205319