Muratore, Fabio (2021)
Randomizing Physics Simulations for Robot Learning.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00019940
Dissertation, first publication, publisher's version
Abstract
The ability to mentally evaluate variations of the future may well be the key to intelligence. Combined with the ability to reason, it makes humans excellent at handling new and complex situations. If we want robots to solve varying tasks autonomously, we need to endow them with this kind of ‘mental rehearsal’. Physics simulations allow predicting how the environment will change depending on a sequence of actions. For example, robots can simulate multiple control policies in different simulation instances, collect the results, and subsequently reason about which policy to execute in the real world. Because physics simulations are highly customizable, they enable generating vast amounts of diverse data at relatively low cost. They thus make it possible to apply deep learning methods to physical systems despite the exorbitant demand for data. Since state-of-the-art deep learning methods come with few guarantees, it is essential to test them in many simulated scenarios before deploying them on the real system.

Over the last decade, the speed and modeling power of general-purpose physics engines have increased significantly. State-of-the-art simulators feature rigid-body, soft-body, and fluid dynamics, as well as massive GPU-based parallelization. Despite this impressive progress, simulations will always remain an idealized model of the real world and are thus inevitably flawed. Typical error sources are unmodeled physical phenomena or suboptimal parameter values of the underlying generative model. These discrepancies between the real and the simulated world are summarized by the term ‘reality gap’. The gap can manifest in various ways when learning from simulations. In the best case, it causes only a performance drop, e.g., a lower success rate or a reduced tracking accuracy. More likely, the learned policy is not transferable to the robot, for instance because unknown friction effects lead to underestimating the friction in simulation, so that the commanded actions are not strong enough to get the robot moving. Another cause of failure is small parameter estimation errors, which can quickly lead to unstable system dynamics; this case is particularly dangerous for humans and robots. For these reasons, bridging the reality gap is the essential step toward endowing robots with the ability to learn from simulated experience.

In this thesis, we tackle the challenge of learning robot control policies from simulations such that the results can be (directly) transferred to the real world. We focus on scenarios where the source domain is a randomized simulator and the target domain is either a different simulation instance (sim-to-sim) or the physical robot (sim-to-real). We strive to answer the following research questions:

1. How can we quantitatively estimate the transferability of a control policy from one domain to another?
2. Does randomizing the simulator during learning make the resulting policy more robust against modeling imperfections?
3. How do we adapt the randomized simulator based on real-world evaluations?
4. Can we infer the source domain parameter distribution from data and subsequently use it for learning?
5. What are the necessary assumptions and technical requirements to learn robot control policies from randomized simulations?

Despite the recent popularity of sim-to-real methods, the first question has remained unanswered up to this point. As a consequence, state-of-the-art algorithms cannot make a quantitative statement about the transferability of the resulting control policies. Moreover, they stop training according to some heuristic, such as a fixed number of iterations, which can waste computation time. The sketch below illustrates the randomized-simulation setting that all subsequent chapters build on.
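As a minimal, purely illustrative sketch of this setting (not code from the thesis; the toy simulator, the fixed linear policy, and all parameter ranges are invented for this example), one can randomize the physical parameters of a simple system for every rollout and average the returns:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_domain_params():
    """Draw one simulator instance from a hypothetical domain parameter
    distribution, here independent mass and friction values."""
    return {"mass": rng.normal(loc=1.0, scale=0.1),        # kg
            "friction": rng.uniform(low=0.05, high=0.15)}  # viscous coeff.

def rollout(gains, params, num_steps=200, dt=0.01):
    """Regulate a 1-D point mass to the origin with linear state feedback;
    the negative quadratic cost serves as the episode return (a toy
    stand-in for a full physics engine and a learned policy)."""
    pos, vel, ret = 1.0, 0.0, 0.0
    for _ in range(num_steps):
        act = -gains[0] * pos - gains[1] * vel
        acc = (act - params["friction"] * vel) / params["mass"]
        vel += acc * dt
        pos += vel * dt
        ret -= (pos**2 + 0.1 * act**2) * dt
    return ret

gains = np.array([10.0, 2.0])  # fixed policy parameters for illustration
returns = [rollout(gains, sample_domain_params()) for _ in range(100)]
print(f"estimated expected return over domains: {np.mean(returns):.3f}")
```

A policy search algorithm would now adjust `gains` to maximize this Monte Carlo estimate of the expected return over domains, rather than the return of a single nominal simulator.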
In Chapter 3, we derive the simulation optimization bias as a measure of the reality gap and show that policies learned in a source domain are optimistically biased in terms of their performance in the target domain, even if both domains originate from the same distribution. To mitigate this problem, we propose a policy search algorithm that estimates the simulation optimization bias and continues training until an estimated upper confidence bound on this bias falls below a given threshold. The resulting policy thus satisfies a probabilistic guarantee on the performance loss when it is transferred to a different environment drawn from the same source domain distribution. Moreover, our sim-to-real evaluations answer the second question with a clear “yes”.

Straightforwardly learning from randomized source domains, however, tends to be slower and to yield lower performance on the nominal model than methods that close the sim-to-real loop by adapting the domain parameter distribution. We therefore tackle the third question in Chapter 4 by introducing a policy search algorithm that incorporates Bayesian optimization to adapt the domain parameter distribution based on real-world data. The sample efficiency of Bayesian optimization allows updating the distribution’s parameters, including its uncertainty, while requiring only a few evaluations on the physical device. Most notably, the data yielded by these evaluations can be very scarce, e.g., a single scalar return per trial. In this way, the connection between the distribution over simulator parameters and the target domain performance is captured by a probabilistic model. At the same time, we can eliminate the common assumption of knowing the distribution’s mean and variance a priori.

Existing domain randomization approaches further assume that each domain parameter is independent and obeys a known type of probability distribution, typically chosen to be normal or uniform. These and other assumptions impose unnecessary restrictions on the posterior distribution over simulators and prevent us from utilizing the full power of domain randomization. To overcome this limitation, we propose to combine reinforcement learning with state-of-the-art likelihood-free inference methods, powered by flexible neural density estimators, to learn the posterior over domain parameters. The proposed method requires only a parametric generative model (e.g., a physics simulator), coarse prior ranges, and a small set of real-world trajectories. Together with a policy optimization algorithm, this approach iteratively updates the posterior over simulators and learns how to solve the given task. Most importantly, the generative model does not need to be differentiable, and the neural posterior can capture dependencies between domain parameters. By drastically reducing the number and restrictiveness of its assumptions while still learning transferable control policies, this procedure answers the fourth and the fifth question in Chapter 5. The two sketches below illustrate the stopping criterion of Chapter 3 and the likelihood-free inference of Chapter 5 in miniature.
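First, a minimal sketch of an optimism-aware stopping rule in the spirit of Chapter 3. This is not the thesis’s algorithm: the per-domain returns are synthetic, and the nonparametric bootstrap is merely one simple way to obtain an upper confidence bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ucb(samples, num_resamples=2000, alpha=0.05):
    """One-sided (1 - alpha) upper confidence bound on the mean of
    'samples', estimated with a nonparametric bootstrap."""
    idx = rng.integers(0, len(samples), size=(num_resamples, len(samples)))
    means = samples[idx].mean(axis=1)
    return float(np.quantile(means, 1.0 - alpha))

# Synthetic per-domain returns of one candidate policy: on the domains it
# was optimized for ('seen') and on fresh draws from the same distribution
# ('unseen'); in practice these would come from simulator rollouts.
returns_seen = rng.normal(loc=-50.0, scale=5.0, size=30)
returns_unseen = rng.normal(loc=-56.0, scale=5.0, size=30)

# Per-domain optimism gap; its mean estimates how much the policy's
# performance was overestimated on the domains it was trained on.
gaps = returns_seen.mean() - returns_unseen
threshold = 8.0  # acceptable performance loss, chosen by the user
ucb = bootstrap_ucb(gaps)
print(f"UCB on the optimism gap: {ucb:.2f} ->",
      "stop training" if ucb < threshold else "continue training")
```

Training would continue, producing new candidate policies and fresh domain samples, until the bound drops below the chosen threshold.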
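Second, a dependency-free stand-in for the likelihood-free inference of Chapter 5. The thesis uses flexible neural density estimators; the rejection sampler below is a much cruder substitute, but it likewise infers a posterior over domain parameters from trajectories without ever evaluating a likelihood (the simulator, the “real” data, and all ranges are invented for this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulator(mass, friction, num_steps=200, dt=0.01):
    """Toy generative model: explicit-Euler decay of a unit initial
    velocity under viscous friction (stand-in for a physics engine)."""
    return (1.0 - friction / mass * dt) ** np.arange(1, num_steps + 1)

# Hypothetical 'real' trajectory with unknown ground-truth parameters.
real_traj = simulator(mass=1.3, friction=0.4) + rng.normal(0.0, 0.005, 200)

# Coarse prior ranges, the only domain knowledge the method requires.
num_sims, eps = 20_000, 0.15
masses = rng.uniform(0.5, 2.0, num_sims)
frictions = rng.uniform(0.1, 1.0, num_sims)

# Rejection step: keep parameters whose simulated trajectory lies close to
# the real one (a neural density estimator would replace this step).
dists = np.array([np.linalg.norm(simulator(m, f) - real_traj)
                  for m, f in zip(masses, frictions)])
keep = dists < eps
corr = np.corrcoef(masses[keep], frictions[keep])[0, 1]
print(f"accepted {keep.sum()} / {num_sims} samples,",
      f"mass-friction correlation in the posterior: {corr:.2f}")
```

Because only the ratio of friction to mass is identifiable from this decay, the accepted samples form a strongly correlated ridge, exactly the kind of dependency between domain parameters that a factorized normal or uniform randomization cannot represent.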
The methods presented in this thesis will greatly benefit from the continuing increase in computational power, which allows the randomization schemes to perform more exhaustive searches through the domain parameter space. As a consequence, both the required computation time and the variance will be reduced, alleviating the two biggest drawbacks of domain randomization approaches. Meanwhile, financially strong actors such as the video gaming industry are heavily pushing the development of physics simulators. Current niche applications, such as simulations of muscles or of interactions between fluid and solid particles, will thus become a consumer standard in the near future. The facilitated access to high-fidelity simulators will open the door to a whole new range of tasks that can be solved with the methods presented in this thesis. One example could be to train control policies for active robotic prostheses in simulation such that they support human motion; in a subsequent step, these controllers could be customized based on user-specific data. The foreseeable establishment of (differentiable) probabilistic simulation engines will provide access to the simulator’s likelihood function and hence boost the applicability of Bayesian inference. As a consequence, research on highly data-efficient simulation-based inference methods will gain popularity, leading to new algorithms that can perform complex inference in real time. These approaches have the potential to become the next megatrend in robotics research after the era of deep learning.
Type of entry: | Dissertation
Published: | 2021
Author(s): | Muratore, Fabio
Type of record: | First publication
Title: | Randomizing Physics Simulations for Robot Learning
Language: | English
Referees: | Peters, Prof. Dr. Jan ; Ramos, Prof. Dr. Fabio
Year of publication: | 2021
Place of publication: | Darmstadt
Collation: | xx, 138 pages
Date of oral examination: | 28 September 2021
DOI: | 10.26083/tuprints-00019940
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/19940
Status: | Publisher's version
URN: | urn:nbn:de:tuda-tuprints-199400
Dewey Decimal Classification (DDC): | 000 Generalities, computer science, information science > 004 Computer science ; 600 Technology, medicine, applied sciences > 600 Technology ; 600 Technology, medicine, applied sciences > 620 Engineering and mechanical engineering
Department(s): | 20 Department of Computer Science ; 20 Department of Computer Science > Intelligent Autonomous Systems
Date deposited: | 01 Dec 2021 13:30
Last modified: | 08 Dec 2021 07:54