Song, Yunlong (2023)
Minimax and entropic proximal policy optimization.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00024754
Master's thesis, first publication, publisher's version
Abstract
First-order gradient descent is to date the most commonly used optimization method for training deep neural networks, especially networks with shared parameters or recurrent neural networks (RNNs). Policy gradient methods offer several advantages over other reinforcement learning algorithms; for example, they naturally handle continuous state and action spaces. In this thesis, we contribute two policy gradient algorithms that are straightforward to implement and effective at solving challenging environments; both methods are compatible with large nonlinear function approximators and are optimized using stochastic gradient descent.

First, we propose a new family of policy gradient algorithms, which we call minimax entropic policy optimization (MMPO). The method combines trust region policy optimization (TRPO) with the idea of minimax training: stable policy improvement is achieved by formulating the KL-divergence constraint of TRPO as a loss function via a ramp function transformation, and then carrying out a minimax optimization between two stochastic gradient optimizers, one optimizing the "surrogate" objective and the other maximizing the ramp-transformed KL-divergence loss. Our experiments on several challenging continuous control tasks demonstrate that MMPO achieves performance comparable to TRPO and proximal policy optimization (PPO), while being much easier to implement than TRPO and guaranteeing that the KL-divergence bound is satisfied.

Second, we investigate the use of the f-divergence as a regularizer for policy improvement. The f-divergence is a general class of functionals measuring the divergence between two probability distributions, with the KL-divergence as a special case. The f-divergence can be treated either as a hard constraint or as a soft constraint added to the objective. We propose to treat it as a soft constraint by penalizing the policy update with a penalty term on the f-divergence between successive policy distributions. We term this unconstrained policy optimization method f-divergence penalized policy optimization (f-PPO). We focus on a one-parameter family of α-divergences, a special case of f-divergences, and study the influence of the choice of divergence function on policy optimization. Empirical results on a series of MuJoCo environments show that f-PPO with a proper choice of α-divergence is effective at solving challenging continuous control tasks; the α-divergences act differently on the policy entropy and hence on the policy improvement.
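To make the MMPO idea above concrete, here is a minimal PyTorch-style sketch of one update step, assuming a Lagrangian-style reading of the minimax game: one stochastic gradient optimizer updates the policy on the negative surrogate objective plus the ramp-transformed KL term max(0, KL − δ) scaled by a multiplier, while a second optimizer updates that multiplier to maximize the ramp-transformed term. All names (`mmpo_step`, `lam_raw`, `delta`, the toy Gaussian policy) and the exact parameterization of the "max" player are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of one MMPO-like update step (assumed reading, not thesis code).
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

obs_dim, act_dim, delta = 8, 2, 0.01              # delta: assumed trust-region radius

mean_net = nn.Linear(obs_dim, act_dim)            # toy Gaussian policy: state-dependent mean
log_std = nn.Parameter(torch.zeros(act_dim))      # state-independent log standard deviation
lam_raw = nn.Parameter(torch.zeros(()))           # unconstrained multiplier parameter

policy_opt = torch.optim.Adam(list(mean_net.parameters()) + [log_std], lr=3e-4)
lam_opt = torch.optim.Adam([lam_raw], lr=1e-2)

def mmpo_step(obs, actions, advantages, old_log_probs, old_mean, old_std):
    pi = Normal(mean_net(obs), log_std.exp())     # current policy
    pi_old = Normal(old_mean, old_std)            # frozen behaviour policy from data collection

    ratio = (pi.log_prob(actions).sum(-1) - old_log_probs).exp()
    surrogate = (ratio * advantages).mean()       # importance-sampled "surrogate" objective

    kl = kl_divergence(pi_old, pi).sum(-1).mean()
    ramp_kl = torch.clamp(kl - delta, min=0.0)    # ramp(x) = max(0, x) applied to KL - delta
    lam = torch.nn.functional.softplus(lam_raw)   # keep the multiplier positive

    # "Min" player: one gradient step on the penalized negative surrogate.
    policy_opt.zero_grad()
    (-surrogate + lam.detach() * ramp_kl).backward()
    policy_opt.step()

    # "Max" player: one gradient step increasing lam * ramp(KL - delta)
    # (written as descent on its negative), so trust-region violations are punished harder.
    lam_opt.zero_grad()
    (-torch.nn.functional.softplus(lam_raw) * ramp_kl.detach()).backward()
    lam_opt.step()
```

Similarly, a sketch of an f-PPO-style penalized objective, assuming the Amari-style parameterization of the α-divergence, a sample-based estimator from actions drawn under the old policy, and a fixed penalty coefficient `beta`; the thesis may use a different parameterization, divergence direction, or estimator.

```python
def alpha_divergence(log_ratio, alpha):
    # Sample-based estimate of D_alpha(pi_new || pi_old) from actions drawn under
    # pi_old, with importance ratios r = pi_new / pi_old:
    #   D_alpha = (E_old[r^alpha] - 1) / (alpha * (alpha - 1)),  alpha not in {0, 1},
    # which approaches KL(pi_new || pi_old) as alpha -> 1 and the reverse KL as alpha -> 0.
    return ((alpha * log_ratio).exp().mean() - 1.0) / (alpha * (alpha - 1.0))

def f_ppo_loss(log_probs, old_log_probs, advantages, alpha=0.5, beta=1.0):
    log_ratio = log_probs - old_log_probs
    surrogate = (log_ratio.exp() * advantages).mean()
    # Unconstrained, penalized objective: maximize the surrogate minus a divergence
    # penalty, i.e. minimize its negative with an ordinary SGD/Adam optimizer.
    return -surrogate + beta * alpha_divergence(log_ratio, alpha)
```

Under this reading, α → 1 recovers a KL-penalized update, and other choices of α reweight large importance ratios differently, which is one way the choice of divergence can influence the policy entropy during improvement.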
| Item type: | Master's thesis |
|---|---|
| Published: | 2023 |
| Author(s): | Song, Yunlong |
| Type of entry: | First publication |
| Title: | Minimax and entropic proximal policy optimization |
| Language: | English |
| Date of publication: | 26 October 2023 |
| Place of publication: | Darmstadt |
| Collation: | vi, 42 pages |
| DOI: | 10.26083/tuprints-00024754 |
| URL / URN: | https://tuprints.ulb.tu-darmstadt.de/24754 |
| Origin: | Secondary publication service |
| Status: | Publisher's version |
| URN: | urn:nbn:de:tuda-tuprints-247547 |
| Dewey Decimal Classification (DDC): | 000 Generalities, computer science, information science > 004 Computer science |
| Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Intelligent Autonomous Systems |
| TU projects: | EC/H2020\|640554\|SKILLS4ROBOTS |
| Date deposited: | 26 Oct 2023 13:43 |
| Last modified: | 27 Oct 2023 09:20 |