Parisi, Simone (2020)
Reinforcement Learning with Sparse and Multiple Rewards.
Technische Universität Darmstadt
doi: 10.25534/tuprints-00011372
Dissertation, Erstveröffentlichung
Kurzbeschreibung (Abstract)
Over the course of the last decade, the framework of reinforcement learning has developed into a promising tool for learning a large variety of task. The idea of reinforcement learning is, at its core, very simple yet effective. The learning agent is left to explore the world by performing actions based on its observations of the state of the world, and in turn receives a feedback, called reward, assessing the quality of its behavior. However, learning soon becomes challenging and even impractical as the complexity of the environment and of the task increase. In particular, learning without any pre-defined behavior (1) in the presence of rarely emitted or sparse rewards, (2) maintaining stability even with limited data, and (3) with possibly multiple conflicting objectives are some of the most prominent issues that the agent has to face. Consider the simple problem of a robot that needs to learn a parameterized controller in order to reach a certain point based solely on the raw sensory observation, e.g., internal reading of joints position and camera images of the surrounding environment, and on the binary reward "success'' / "failure''. Without any prior knowledge of the world's dynamics, or any hint on how to behave, the robot will start acting randomly. Such exploration strategy will be (1) very unlikely to bring the robot closer to the goal, and thus to experience the "success'' feedback, and (2) likely generate useless trajectories and, subsequently, learning will be unstable. Furthermore, (3) there are many different ways the robot can reach the goal. For instance, the robot can quickly accelerate and then suddenly stop at the desired point, or it can slowly and smoothly navigate to the goal. These behaviors are clearly opposite, but the binary feedback does not provide any hint on which is more desirable. It should be clear that even simple problems such as a reaching task can turn non-trivial for reinforcement learning. One possible solution is to pre-engineer the task, e.g., hand-crafting the initial exploration behavior with imitation learning, shaping the reward based on the distance from the goal, or adding auxiliary rewards based on speed and safety. Following this solution, in recent years a lot of effort has been directed towards scaling reinforcement learning to solve complex real-world problems, such as robotic tasks with many degrees of freedom, videogames, and board games like Chess, Go, and Shogi. These advances, however, were possible largely thanks to experts prior knowledge and engineering, such as pre-initialized parameterized agent behaviors and reward shaping, and often required a prohibitive amount of data. This large amount of required prior knowledge and pre-structuring is arguably in stark contrast to the goal of developing autonomous learning.
In this thesis we will present methods to increase the autonomy of reinforcement learning algorithms, i.e., learning without expert pre-engineering, by addressing the issues discussed above. The key points of our research address (1) techniques to deal with multiple conflicting reward functions, (2) methods to enhance exploration in the presence of sparse rewards, and (3) techniques to enable more stable and safer learning. Progress in each of these aspects will lift reinforcement learning to a higher level of autonomy.
First, we will address the presence of conflicting objective from a multi-objective optimization perspective. In this scenario, the standard concept of optimality is replaced by Pareto optimality, a concept for representing compromises among the objectives. Subsequently, the goal is to find the Pareto frontier, a set of solutions representing different compromises among the objectives. Despite recent advances in multi-objective optimization, achieving an accurate representation of the Pareto frontier is still an important challenge. Common practical approaches rely on experts to manually set priority or thresholds on the objectives. These methods require prior knowledge and are not able to learn the whole Pareto frontier but just a portion of it, possibly missing interesting solutions. On the contrary, we propose a manifold-based method which learn a continuous approximation of the frontier without the need of any prior knowledge.
We will then consider learning in the presence of sparse rewards and present novel exploration strategies. Classical exploration techniques in reinforcement learning mostly revolve around the immediate reward, that is, how to choose an action to balance between exploitation and exploration for the current state. These methods, however, perform extremely poorly if only sparse rewards are provided. State-of-the-art exploration strategies, thus, rely either on local exploration along the current solution together with sensible initialization, or on handcrafted strategies based on heuristics. These approaches, however, either require prior knowledge or have poor guarantees of convergence, and often falls in local optima. On the contrary, we propose an approach that plans exploration actions far into the future based on what we call long-term visitation value. Intuitively, this value assesses the number of unvisited states that the agent can visit in the future by performing that action.
Finally, we address the problem of stabilizing learning when little data is available. Even assuming efficient exploration strategies, dense rewards, and the presence of only one objective, reinforcement learning can exhibit unstable behavior. Interestingly, the most successful algorithms, namely actor-critic methods, are also the most sensible to this issue. These methods typically separate the problem of learning the value of a given state from the problem of learning the optimal action to execute in such a state. The former is fullfilled by the so-called critic, while the latter by the so-called actor. In this scenario, the instability is due the interplay between these two components, especially when nonlinear approximators, such as neural networks, are employed. To avoid such issues, we propose to regularize the learning objective of the actor by penalizing the error of the critic. This improves stability by avoiding large steps in the actor update whenever the critic is highly inaccurate.
Altogether, the individual contributions of this thesis allow reinforcement learning to rely less on expert pre-engineering. The proposed methods can be applied to a large variety of common algorithms, and are evaluated on a wide array of tasks. Results on both standard and novel benchmarks confirm their effectiveness.
Typ des Eintrags: | Dissertation | ||||
---|---|---|---|---|---|
Erschienen: | 2020 | ||||
Autor(en): | Parisi, Simone | ||||
Art des Eintrags: | Erstveröffentlichung | ||||
Titel: | Reinforcement Learning with Sparse and Multiple Rewards | ||||
Sprache: | Englisch | ||||
Referenten: | Peters, Prof. Dr. Jan ; Boedeker, Prof. Dr. Joschka | ||||
Publikationsjahr: | Januar 2020 | ||||
Ort: | Darmstadt | ||||
Datum der mündlichen Prüfung: | 19 Dezember 2019 | ||||
DOI: | 10.25534/tuprints-00011372 | ||||
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/11372 | ||||
Kurzbeschreibung (Abstract): | Over the course of the last decade, the framework of reinforcement learning has developed into a promising tool for learning a large variety of task. The idea of reinforcement learning is, at its core, very simple yet effective. The learning agent is left to explore the world by performing actions based on its observations of the state of the world, and in turn receives a feedback, called reward, assessing the quality of its behavior. However, learning soon becomes challenging and even impractical as the complexity of the environment and of the task increase. In particular, learning without any pre-defined behavior (1) in the presence of rarely emitted or sparse rewards, (2) maintaining stability even with limited data, and (3) with possibly multiple conflicting objectives are some of the most prominent issues that the agent has to face. Consider the simple problem of a robot that needs to learn a parameterized controller in order to reach a certain point based solely on the raw sensory observation, e.g., internal reading of joints position and camera images of the surrounding environment, and on the binary reward "success'' / "failure''. Without any prior knowledge of the world's dynamics, or any hint on how to behave, the robot will start acting randomly. Such exploration strategy will be (1) very unlikely to bring the robot closer to the goal, and thus to experience the "success'' feedback, and (2) likely generate useless trajectories and, subsequently, learning will be unstable. Furthermore, (3) there are many different ways the robot can reach the goal. For instance, the robot can quickly accelerate and then suddenly stop at the desired point, or it can slowly and smoothly navigate to the goal. These behaviors are clearly opposite, but the binary feedback does not provide any hint on which is more desirable. It should be clear that even simple problems such as a reaching task can turn non-trivial for reinforcement learning. One possible solution is to pre-engineer the task, e.g., hand-crafting the initial exploration behavior with imitation learning, shaping the reward based on the distance from the goal, or adding auxiliary rewards based on speed and safety. Following this solution, in recent years a lot of effort has been directed towards scaling reinforcement learning to solve complex real-world problems, such as robotic tasks with many degrees of freedom, videogames, and board games like Chess, Go, and Shogi. These advances, however, were possible largely thanks to experts prior knowledge and engineering, such as pre-initialized parameterized agent behaviors and reward shaping, and often required a prohibitive amount of data. This large amount of required prior knowledge and pre-structuring is arguably in stark contrast to the goal of developing autonomous learning. In this thesis we will present methods to increase the autonomy of reinforcement learning algorithms, i.e., learning without expert pre-engineering, by addressing the issues discussed above. The key points of our research address (1) techniques to deal with multiple conflicting reward functions, (2) methods to enhance exploration in the presence of sparse rewards, and (3) techniques to enable more stable and safer learning. Progress in each of these aspects will lift reinforcement learning to a higher level of autonomy. First, we will address the presence of conflicting objective from a multi-objective optimization perspective. In this scenario, the standard concept of optimality is replaced by Pareto optimality, a concept for representing compromises among the objectives. Subsequently, the goal is to find the Pareto frontier, a set of solutions representing different compromises among the objectives. Despite recent advances in multi-objective optimization, achieving an accurate representation of the Pareto frontier is still an important challenge. Common practical approaches rely on experts to manually set priority or thresholds on the objectives. These methods require prior knowledge and are not able to learn the whole Pareto frontier but just a portion of it, possibly missing interesting solutions. On the contrary, we propose a manifold-based method which learn a continuous approximation of the frontier without the need of any prior knowledge. We will then consider learning in the presence of sparse rewards and present novel exploration strategies. Classical exploration techniques in reinforcement learning mostly revolve around the immediate reward, that is, how to choose an action to balance between exploitation and exploration for the current state. These methods, however, perform extremely poorly if only sparse rewards are provided. State-of-the-art exploration strategies, thus, rely either on local exploration along the current solution together with sensible initialization, or on handcrafted strategies based on heuristics. These approaches, however, either require prior knowledge or have poor guarantees of convergence, and often falls in local optima. On the contrary, we propose an approach that plans exploration actions far into the future based on what we call long-term visitation value. Intuitively, this value assesses the number of unvisited states that the agent can visit in the future by performing that action. Finally, we address the problem of stabilizing learning when little data is available. Even assuming efficient exploration strategies, dense rewards, and the presence of only one objective, reinforcement learning can exhibit unstable behavior. Interestingly, the most successful algorithms, namely actor-critic methods, are also the most sensible to this issue. These methods typically separate the problem of learning the value of a given state from the problem of learning the optimal action to execute in such a state. The former is fullfilled by the so-called critic, while the latter by the so-called actor. In this scenario, the instability is due the interplay between these two components, especially when nonlinear approximators, such as neural networks, are employed. To avoid such issues, we propose to regularize the learning objective of the actor by penalizing the error of the critic. This improves stability by avoiding large steps in the actor update whenever the critic is highly inaccurate. Altogether, the individual contributions of this thesis allow reinforcement learning to rely less on expert pre-engineering. The proposed methods can be applied to a large variety of common algorithms, and are evaluated on a wide array of tasks. Results on both standard and novel benchmarks confirm their effectiveness. |
||||
Alternatives oder übersetztes Abstract: |
|
||||
URN: | urn:nbn:de:tuda-tuprints-113721 | ||||
Sachgruppe der Dewey Dezimalklassifikatin (DDC): | 000 Allgemeines, Informatik, Informationswissenschaft > 004 Informatik 500 Naturwissenschaften und Mathematik > 510 Mathematik |
||||
Fachbereich(e)/-gebiet(e): | 20 Fachbereich Informatik 20 Fachbereich Informatik > Intelligente Autonome Systeme |
||||
Hinterlegungsdatum: | 26 Jan 2020 20:57 | ||||
Letzte Änderung: | 26 Jan 2020 20:57 | ||||
PPN: | |||||
Referenten: | Peters, Prof. Dr. Jan ; Boedeker, Prof. Dr. Joschka | ||||
Datum der mündlichen Prüfung / Verteidigung / mdl. Prüfung: | 19 Dezember 2019 | ||||
Export: | |||||
Suche nach Titel in: | TUfind oder in Google |
Frage zum Eintrag |
Optionen (nur für Redakteure)
Redaktionelle Details anzeigen |