Vinogradska, Julia (2018)
Gaussian Processes in Reinforcement Learning: Stability Analysis and Efficient Value Propagation.
Technische Universität Darmstadt
Dissertation, first publication
Abstract
Control of nonlinear systems on continuous domains is a challenging task for various reasons. For robust and accurate control of complex systems, a precise model of the system dynamics is essential. Building such highly precise dynamics models from physical knowledge often requires substantial manual effort and poses a great challenge in industrial applications. Acquiring a model automatically from system measurements by means of regression techniques reduces manual effort and thus offers an attractive alternative to knowledge-based modeling. Based on such a learned dynamics model, an approximately optimal controller can be inferred automatically. Such approaches are the subject of model-based reinforcement learning (RL) and learn optimal control from interactions with the system. Especially when probabilistic dynamics models such as Gaussian processes (GPs) are employed, model-based RL has been tremendously successful and has attracted much attention from both the control and machine learning communities. However, several problems need to be solved to facilitate widespread deployment of model-based RL for learning control in real-world scenarios. In this thesis, we address two current limitations of model-based RL whose resolution is an indispensable prerequisite for its widespread deployment in real-world tasks. In many real-world applications, a poor controller can cause severe damage to the system or even put the safety of humans at risk. Thus, it is essential to ensure that the controlled system behaves as desired. While this question has been studied extensively in classical control, stability of closed-loop control systems with dynamics given as a Gaussian process has not been considered yet. We propose an automatic tool to compute regions of the state space where the desired behavior of the system can be guaranteed. We consider dynamics given as the mean of a GP as well as the full GP posterior distribution.
In the first case, the proposed tool constructs regions of the state space such that trajectories starting in these regions converge to the target state. From this asymptotic result, we derive statements for finite time horizons and for stability in the presence of disturbances. In the second case, the system dynamics is given as a GP posterior distribution. Thus, computation of multi-step-ahead predictions requires averaging over all plausible dynamics models given the observations. As a consequence, multi-step-ahead predictions become analytically intractable. We propose an approximation based on numerical quadrature that can handle complex state distributions, e.g., with multiple modes, and provides upper bounds for the approximation error. Exploiting these error bounds, we present an automatic tool to compute stability regions. In these regions of the state space, our tool guarantees that for a finite time horizon the system behaves as desired with a given probability. Furthermore, we analyze the asymptotic behavior of closed-loop control systems with dynamics given as a GP posterior distribution. In this case, we show that for some common choices of the prior, the system has a unique stationary distribution to which the system state converges irrespective of the starting state. Another major challenge of RL for real-world control applications is to minimize the interactions with the system required for learning. While RL approaches based on GP dynamics models have demonstrated great data efficiency, the average number of required system interactions can be reduced further. To achieve this goal, we propose to employ the quadrature-based approximation to propagate the value of a state. To show how this approximation can further increase data efficiency, we employ it in the two main classes of model-based RL: policy search and value iteration. In policy search, the state distribution must be computed to evaluate the expected long-term reward for a policy.
The proposed quadrature-based approximation substantially improves estimates of the expected long-term reward and its gradients. As a result, data efficiency is significantly increased. For value-function-based approaches to policy learning, the value propagation step is completely characterized by the Bellman equation. However, this equation is intractable for nonlinear dynamics. In this case, we propose a projection-based value iteration approach. We employ numerical quadrature to facilitate projection of the value function onto a linear feature space. Suitable features for value function representation are learned online without manual effort. This feature learning is constructed such that upper bounds for the projection error can be obtained. The proposed value iteration approach learns globally optimal policies and significantly benefits from the introduced highly accurate approximations.
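The idea of propagating a state distribution by numerical quadrature, as summarized in the abstract, can be illustrated with a minimal sketch. This is not the algorithm of the thesis: it assumes a one-dimensional state, a plain trapezoidal rule on a fixed grid, and two hypothetical functions, `dynamics_mean` and `dynamics_std`, that stand in for the predictive mean and standard deviation of a learned GP model. It provides no error bounds. It only shows that a grid-based density, unlike a single Gaussian, can represent a state distribution with several modes.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def trapezoid_weights(grid):
    """Trapezoidal quadrature weights for an equally spaced grid."""
    h = grid[1] - grid[0]
    weights = [h] * len(grid)
    weights[0] = weights[-1] = 0.5 * h
    return weights


def propagate(density, grid, weights, mean_fn, std_fn):
    """One-step prediction p'(y) ~= sum_i w_i * N(y; m(x_i), s(x_i)^2) * p(x_i)."""
    return [
        sum(w * gaussian_pdf(y, mean_fn(x), std_fn(x)) * p
            for x, w, p in zip(grid, weights, density))
        for y in grid
    ]


def dynamics_mean(x):
    # Hypothetical stand-in for a GP predictive mean (assumption, not from the thesis).
    return 0.9 * x + 0.5 * math.sin(x)


def dynamics_std(x):
    # Hypothetical stand-in for a GP predictive standard deviation (assumption).
    return 0.1 + 0.05 * abs(x)


# Equally spaced grid on [-6, 6] that comfortably contains the state distribution.
grid = [-6.0 + 12.0 * i / 400 for i in range(401)]
weights = trapezoid_weights(grid)

# Bimodal initial state distribution: an equal mixture of two Gaussians.
density = [0.5 * gaussian_pdf(x, -2.0, 0.3) + 0.5 * gaussian_pdf(x, 2.0, 0.3)
           for x in grid]

# Three-step-ahead prediction by repeated one-step quadrature.
for _ in range(3):
    density = propagate(density, grid, weights, dynamics_mean, dynamics_std)

total_mass = sum(w * p for w, p in zip(weights, density))
state_mean = sum(w * x * p for x, w, p in zip(grid, weights, density))
print(f"mass={total_mass:.6f}, mean={state_mean:.6f}")
```

A fixed grid scales poorly with the state dimension, so this sketch only conveys the principle for a scalar state. The grid bounds and resolution are chosen by hand here, and the printed probability mass serves as a simple sanity check that little mass leaves the grid.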
Item type: | Dissertation |
---|---|
Published: | 2018 |
Author(s): | Vinogradska, Julia |
Type of entry: | First publication |
Title: | Gaussian Processes in Reinforcement Learning: Stability Analysis and Efficient Value Propagation |
Language: | English |
Referees: | Peters, Prof. Dr. Jan ; Rasmussen, Prof. Dr. Carl |
Year of publication: | 2018 |
Place: | Darmstadt |
Date of oral examination: | 29 November 2017 |
URL / URN: | http://tuprints.ulb.tu-darmstadt.de/7286 |
URN: | urn:nbn:de:tuda-tuprints-72865 |
Dewey Decimal Classification (DDC) subject group: | 000 Generalities, computer science, information science > 004 Computer science |
Division(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Intelligent Autonomous Systems |
Date deposited: | 29 Apr 2018 19:55 |
Last modified: | 29 Apr 2018 19:55 |