Pajarinen, Joni; Thai, Hong Linh; Akrour, Riad; Peters, Jan; Neumann, Gerhard (2022):
Compatible natural gradient policy search. (Publisher's Version)
In: Machine Learning, 108 (8-9), pp. 1443-1466. Springer. ISSN 0885-6125, e-ISSN 1573-0565.
DOI: 10.26083/tuprints-00020531
[Article]
Abstract
Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use the KL divergence to bound the trust region, resulting in a natural gradient policy update. We show that the natural gradient and trust-region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds the entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
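For orientation, a minimal sketch of the update scheme the abstract refers to, written in the standard KL-constrained formulation. The notation here is generic textbook notation assumed for illustration, not taken from the paper: $g$ is the sampled policy gradient, $F$ the Fisher information matrix of the policy, $\epsilon$ the KL bound, and $\beta$ the entropy bound.

$$
\theta_{k+1} = \arg\max_{\theta}\; g^{\top}(\theta - \theta_k)
\quad \text{s.t.} \quad
\tfrac{1}{2}\,(\theta - \theta_k)^{\top} F(\theta_k)\,(\theta - \theta_k) \le \epsilon ,
$$

where the quadratic constraint is the second-order approximation of $\mathrm{KL}\!\left(\pi_{\theta_k} \,\|\, \pi_{\theta}\right)$. The closed-form solution is the natural gradient step

$$
\theta_{k+1} = \theta_k + \sqrt{\frac{2\epsilon}{g^{\top} F(\theta_k)^{-1} g}}\; F(\theta_k)^{-1} g .
$$

COPOS additionally controls how fast the policy entropy $H(\pi_\theta)$ may shrink; schematically, an extra constraint of the form

$$
H\!\left(\pi_{\theta_k}\right) - H\!\left(\pi_{\theta}\right) \le \beta
$$

keeps the update from collapsing the exploration distribution prematurely. The exact constraint and its compatible-value-function derivation are given in the paper; the above is only the common trust-region template.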
Item Type: | Article |
---|---|
Published: | 2022 |
Creators: | Pajarinen, Joni; Thai, Hong Linh; Akrour, Riad; Peters, Jan; Neumann, Gerhard |
Origin: | Secondary publication service |
Status: | Publisher's Version |
Title: | Compatible natural gradient policy search |
Language: | English |
Abstract: | Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use the KL divergence to bound the trust region, resulting in a natural gradient policy update. We show that the natural gradient and trust-region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds the entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks. |
Journal or Publication Title: | Machine Learning |
Volume of the journal: | 108 |
Issue Number: | 8-9 |
Publisher: | Springer |
Divisions: | 20 Department of Computer Science; 20 Department of Computer Science > Intelligent Autonomous Systems |
TU-Projects: | EC/H2020\|640554\|SKILLS4ROBOTS |
Date Deposited: | 10 Feb 2022 13:10 |
DOI: | 10.26083/tuprints-00020531 |
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/20531 |
URN: | urn:nbn:de:tuda-tuprints-205319 |