Hur, Junhwa (2022)
Joint Motion, Semantic Segmentation, Occlusion, and Depth Estimation.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00021624
Dissertation, first publication, publisher's version
Abstract
Visual scene understanding is one of the most important components of autonomous navigation. It comprises multiple computer vision tasks such as recognizing objects, perceiving their 3D structure, and analyzing their motion, all of which have seen remarkable progress in recent years. However, most earlier studies have explored these components individually, so the potential benefits of exploiting the relationships between them have been overlooked. In this dissertation, we explore the relationships between these tasks and the benefits that arise from formulating multiple tasks jointly. The joint formulation allows each task to exploit the others as additional input cues, ultimately improving the accuracy of all joint tasks.

We first present the joint estimation of semantic segmentation and optical flow. Though not directly related, the two tasks provide important cues to each other in the temporal domain: semantic information constrains the plausible physical motion of its associated pixels, and accurate pixel-level temporal correspondences improve the temporal consistency of semantic segmentation. We demonstrate that the joint formulation improves the accuracy of both tasks.

Second, we investigate the mutual relationship between optical flow and occlusion estimation. Unlike most previous methods, which treat occlusions as outliers, we highlight the importance of reasoning about the two tasks jointly during optimization. Specifically, by exploiting forward-backward consistency and occlusion-disocclusion symmetry in the energy formulation, we demonstrate that the joint formulation brings substantial performance benefits for both tasks on standard benchmarks. We further demonstrate that optical flow and occlusion can exploit their mutual relationship within a convolutional neural network as well. We propose to iteratively and residually refine the estimates using a single weight-shared network, which substantially improves accuracy without adding network parameters, and can even reduce them depending on the backbone network.

Next, we propose joint depth and 3D scene flow estimation from only two temporally consecutive monocular images. We solve this ill-posed problem by taking an inverse-problem view: we design a single convolutional neural network that simultaneously estimates depth and 3D motion from a classical optical flow cost volume. With self-supervised learning, we leverage unlabeled data for training, avoiding the shortage of 3D annotations for direct supervision.

Finally, we conclude by summarizing the contributions and discussing future directions that could resolve the remaining challenges of our approaches.
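To make the forward-backward consistency cue from the flow/occlusion contribution concrete, here is a minimal NumPy sketch of the standard consistency check for detecting occlusions from a forward/backward flow pair. The nearest-neighbor warping, the threshold values `alpha` and `beta`, and the function names are illustrative assumptions, not the exact energy terms used in the dissertation.

```python
import numpy as np

def warp_backward_flow(flow_fwd, flow_bwd):
    """Sample the backward flow at positions displaced by the forward flow.
    Nearest-neighbor sampling for simplicity; real implementations
    typically use bilinear interpolation."""
    h, w, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Target coordinates of each pixel under the forward flow.
    xt = np.clip(np.round(xs + flow_fwd[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow_fwd[..., 1]).astype(int), 0, h - 1)
    return flow_bwd[yt, xt]

def occlusion_from_fb_consistency(flow_fwd, flow_bwd, alpha=0.01, beta=0.5):
    """Flag a pixel as occluded when the forward flow and the warped backward
    flow fail to cancel out, i.e. |f_fwd(x) + f_bwd(x + f_fwd(x))|^2 exceeds
    a motion-dependent threshold (alpha, beta are common heuristic values)."""
    bwd_warped = warp_backward_flow(flow_fwd, flow_bwd)
    residual = np.sum((flow_fwd + bwd_warped) ** 2, axis=-1)
    motion_mag = np.sum(flow_fwd ** 2, axis=-1) + np.sum(bwd_warped ** 2, axis=-1)
    return residual > alpha * motion_mag + beta  # boolean occlusion mask
```

For non-occluded pixels the two flows are (approximately) inverse mappings, so the residual stays small; where the forward flow lands in a region invisible in the other frame, the check fails and the pixel is flagged.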
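The iterative residual refinement idea, a single weight-shared network applied repeatedly, with each pass predicting a residual update to the current flow estimate, can be sketched in PyTorch as follows. `RefinementNet`, its layer sizes, and the number of iterations are placeholders, not the dissertation's actual architecture.

```python
import torch
import torch.nn as nn

class RefinementNet(nn.Module):
    """Placeholder decoder: one shared set of weights reused at every step.
    Input: both images concatenated with the current flow estimate."""
    def __init__(self, in_channels=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 2, 3, padding=1),  # 2-channel residual flow
        )

    def forward(self, x):
        return self.net(x)

def iterative_residual_flow(img1, img2, shared_net, num_iters=4):
    """Refine the flow iteratively with a single weight-shared network:
    each iteration adds a predicted residual to the running estimate,
    so accuracy improves without adding parameters per iteration."""
    b, _, h, w = img1.shape
    flow = torch.zeros(b, 2, h, w, device=img1.device)
    for _ in range(num_iters):
        inp = torch.cat([img1, img2, flow], dim=1)
        flow = flow + shared_net(inp)  # residual update, same weights each pass
    return flow

# Usage sketch: two RGB frames, four shared refinement passes.
net = RefinementNet(in_channels=3 + 3 + 2)
frames = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
flow = iterative_residual_flow(*frames, net)
```

Because the same module is unrolled rather than stacked, the parameter count is independent of the number of refinement iterations, which is what allows the accuracy gains without growing the model.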
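The self-supervised training signal mentioned for the monocular depth and scene flow model can be sketched as a photometric reconstruction loss. In this hypothetical PyTorch snippet, `grid_warp` is assumed to be precomputed from the predicted depth, 3D motion, and camera intrinsics; the L1 loss form and occlusion masking are common choices in self-supervised training, not necessarily the dissertation's exact formulation.

```python
import torch
import torch.nn.functional as F

def photometric_loss(img_ref, img_src, grid_warp, occlusion_mask=None):
    """Self-supervised proxy loss: reconstruct the reference frame by
    bilinearly sampling the source frame at the predicted correspondences
    (grid_warp of shape (N, H, W, 2) in [-1, 1] normalized coordinates),
    and penalize the photometric difference only where pixels are visible."""
    img_warped = F.grid_sample(img_src, grid_warp, align_corners=True)
    diff = (img_ref - img_warped).abs().mean(dim=1, keepdim=True)
    if occlusion_mask is not None:
        diff = diff * occlusion_mask  # ignore occluded pixels
        return diff.sum() / occlusion_mask.sum().clamp(min=1.0)
    return diff.mean()
```

A loss of this shape needs no 3D ground truth: any pair of consecutive frames supervises the predictions through the quality of the reconstruction, which is why unlabeled video suffices for training.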
| Item type: | Dissertation |
| --- | --- |
| Published: | 2022 |
| Author(s): | Hur, Junhwa |
| Type of entry: | First publication |
| Title: | Joint Motion, Semantic Segmentation, Occlusion, and Depth Estimation |
| Language: | English |
| Referees: | Prof. Stefan Roth, Ph.D. ; Prof. Deva Ramanan, Ph.D. |
| Year of publication: | 2022 |
| Place of publication: | Darmstadt |
| Collation: | xviii, 154 pages |
| Date of oral examination: | 18 May 2022 |
| DOI: | 10.26083/tuprints-00021624 |
| URL / URN: | https://tuprints.ulb.tu-darmstadt.de/21624 |
| Status: | Publisher's version |
| URN: | urn:nbn:de:tuda-tuprints-216242 |
| Dewey Decimal Classification (DDC): | 000 Generalities, computer science, information science > 004 Computer science |
| Division(s): | 20 Department of Computer Science > Visual Inference |
| Date deposited: | 21 Jul 2022 12:15 |
| Last modified: | 16 Dec 2022 07:35 |
| PPN: | 497916274 |