TU Darmstadt / ULB / TUbiblio

Combining Appearance, Depth and Motion for Efficient Semantic Scene Understanding

Rehfeld, Timo (2018):
Combining Appearance, Depth and Motion for Efficient Semantic Scene Understanding.
Darmstadt, Technische Universität, [Online-Edition: http://tuprints.ulb.tu-darmstadt.de/7315],
[Ph.D. Thesis]

Abstract

Computer vision plays a central role in autonomous vehicle technology because cameras are comparatively cheap and capture rich information about the environment. In particular, object classes, i.e. whether a certain object is a pedestrian, cyclist, or vehicle, can be extracted very well from image data. Environment perception in urban city centers is a highly challenging computer vision problem, as the environment is complex and cluttered: road boundaries and markings, traffic signs and lights, and many different kinds of objects that can mutually occlude each other need to be detected in real time. Existing automotive vision systems do not easily scale to these requirements because every problem or object class is treated independently. Scene labeling, on the other hand, which assigns object class information to every pixel in the image, is the most promising approach to avoid this overhead by sharing extracted features across multiple classes. Compared to bounding box detectors, scene labeling additionally provides richer and denser information about the environment. However, most existing scene labeling methods require a large amount of computational resources, which makes them infeasible for real-time in-vehicle applications. In addition, in terms of bandwidth, a dense pixel-level representation is not ideal for transmitting the perceived environment to other modules of an autonomous vehicle, such as localization or path planning.

This dissertation addresses the scene labeling problem in an automotive context by constructing a scene labeling concept around the "Stixel World" model of Pfeiffer (2011), which compresses dense information about the environment into a set of small "sticks" that stand upright, perpendicular to the ground plane. This work provides the first extension of the existing Stixel formulation that takes learned dense pixel-level appearance features into account. In a second step, Stixels are used as primitive scene elements to build a highly efficient region-level labeling scheme. The last part of this dissertation proposes a model that combines pixel-level and region-level scene labeling into a single model that yields state-of-the-art or better labeling accuracy and can be executed in real time at typical camera refresh rates. This work further investigates how existing depth information, e.g. from a stereo camera, can help to improve labeling accuracy and reduce runtime.
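The core compression idea behind the Stixel representation can be illustrated in strongly simplified form: a dense per-pixel label image is reduced to a short list of column-aligned vertical runs ("sticks"), each carrying one class label. The sketch below is a hypothetical illustration of that principle only; the actual Stixel World model additionally incorporates depth, a ground-plane model, and a probabilistic segmentation, none of which are shown here. The `Stixel` class and function names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Stixel:
    """One vertical 'stick': a column-aligned run of pixels with one class label."""
    col: int     # image column the stick lives in
    top: int     # first row of the run (inclusive)
    bottom: int  # last row of the run (inclusive)
    label: int   # semantic class id (e.g. road, vehicle, pedestrian)


def compress(labels):
    """Compress a dense per-pixel label image (list of rows) into column runs."""
    h, w = len(labels), len(labels[0])
    stixels = []
    for col in range(w):
        top = 0
        for row in range(1, h + 1):
            # close the current run when the label changes or the column ends
            if row == h or labels[row][col] != labels[top][col]:
                stixels.append(Stixel(col, top, row - 1, labels[top][col]))
                top = row
    return stixels


def reconstruct(stixels, h, w):
    """Expand the stick list back into a dense h x w label image."""
    out = [[None] * w for _ in range(h)]
    for s in stixels:
        for row in range(s.top, s.bottom + 1):
            out[row][s.col] = s.label
    return out
```

For a typical street scene, where each column is dominated by a few large regions (road, obstacle, sky), such a run-based encoding needs far fewer elements than there are pixels, which is the bandwidth argument made in the abstract above.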

Item Type: Ph.D. Thesis
Published: 2018
Creators: Rehfeld, Timo
Title: Combining Appearance, Depth and Motion for Efficient Semantic Scene Understanding
Language: English

Place of Publication: Darmstadt
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Visual Inference
Date Deposited: 29 Apr 2018 19:55
Official URL: http://tuprints.ulb.tu-darmstadt.de/7315
URN: urn:nbn:de:tuda-tuprints-73155
Referees: Roth, Prof. Dr. Stefan and Rother, Prof. Dr. Carsten
Refereed / Date of defense / oral examination: 26 September 2017
Alternative Abstract: German (a German translation of the abstract above)
