Koppanyi, Zoltan ; Iwaszczuk, Dorota ; Zha, Bing ; Saul, Can Jozef ; Toth, Charles K. ; Yilmaz, Alper
Eds.: Yang, Michael Ying ; Rosenhahn, Bodo ; Murino, Vittorio (2019)
Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks.
In: Multimodal Scene Understanding
doi: 10.1016/B978-0-12-817358-9.00009-3
Book chapter, Bibliography
Abstract
Semantic segmentation has been an active field in the computer vision and photogrammetry communities for over a decade. Pixel-level semantic labeling of images is generally achieved by assigning labels to pixels using machine learning techniques. Among others, encoder–decoder convolutional neural networks (CNNs) have recently become the baseline approach for this problem. The majority of papers on this topic use only RGB images as input, despite the availability of other data sources, such as depth, which can improve segmentation and labeling. In this chapter, we investigate a number of encoder–decoder CNN architectures for semantic labeling in which the depth data is fused with the RGB data using three different approaches: (1) fusion with the RGB image through color space transformation, (2) stacking depth images and RGB images, and (3) using Siamese network structures, such as FuseNet or VNet. The chapter also presents our approach of using surface normals in place of depth data. The advantage of the surface normal representation is its viewpoint independence: the direction of a surface normal vector remains the same when the camera pose changes. This is a clear advantage over raw depth data, where the depth value of a single scene point changes as the camera moves. The chapter provides a comprehensive analysis of the three fusion approaches using the SegNet, FuseNet, and VNet deep learning architectures. The analysis is conducted on both the Stanford 2D-3D-Semantics indoor dataset and aerial images from the ISPRS Vaihingen dataset. The depth images of the Stanford dataset are acquired directly by flash LiDAR, whereas the ISPRS depth images are generated by dense 3D reconstruction. We show that the surface normal representation generalizes better to different scenes. In our experiments, using surface normals with FuseNet achieved a 5% improvement over using depth, resulting in 81.5% global accuracy on the Stanford dataset.
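To make the surface-normal and stacking ideas from the abstract concrete, the following is a minimal sketch, not the chapter's implementation: it assumes a metric depth image and hypothetical pinhole intrinsics (fx, fy, cx, cy), derives a surface-normal map, and concatenates it with the RGB channels as a single multi-channel network input. The function name `depth_to_normals` and all parameter values are illustrative assumptions.

```python
# Minimal sketch (assumed, not the chapter's code): depth -> surface normals,
# then channel stacking with RGB as described in fusion approach (2).
import numpy as np

def depth_to_normals(depth, fx, fy, cx, cy):
    """Estimate unit surface normals (H, W, 3) from a metric depth map (H, W)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project each pixel into the camera frame (pinhole model).
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.dstack([x, y, depth])            # (H, W, 3) 3D points
    # Local tangent vectors via central differences along the image axes.
    du = np.gradient(points, axis=1)
    dv = np.gradient(points, axis=0)
    normals = np.cross(du, dv)                   # unnormalized normals
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    return normals / np.clip(norm, 1e-8, None)   # unit-length normals

# Stacking fusion: concatenate RGB and normal channels into one 6-channel
# input; an encoder-decoder CNN such as SegNet would only need its first
# convolution widened from 3 to 6 input channels to consume this tensor.
rgb = np.random.rand(480, 640, 3).astype(np.float32)                 # placeholder RGB
depth = np.random.uniform(0.5, 5.0, (480, 640)).astype(np.float32)   # placeholder depth [m]
normals = depth_to_normals(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
stacked = np.concatenate([rgb, normals], axis=2)
print(stacked.shape)  # (480, 640, 6)
```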
Type of entry: | Book chapter |
---|---|
Published: | 2019 |
Editors: | Yang, Michael Ying ; Rosenhahn, Bodo ; Murino, Vittorio |
Author(s): | Koppanyi, Zoltan ; Iwaszczuk, Dorota ; Zha, Bing ; Saul, Can Jozef ; Toth, Charles K. ; Yilmaz, Alper |
Type of record: | Bibliography |
Title: | Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks |
Language: | English |
Year of publication: | 2019 |
Publisher: | Academic Press |
Book title: | Multimodal Scene Understanding |
DOI: | 10.1016/B978-0-12-817358-9.00009-3 |
URL / URN: | http://www.sciencedirect.com/science/article/pii/B9780128173... |
Free keywords: | Deep learning, CNN, Sensor fusion, Semantic labeling |
Department(s)/Research area(s): | 13 Fachbereich Bau- und Umweltingenieurwissenschaften ; 13 Fachbereich Bau- und Umweltingenieurwissenschaften > Institut für Geodäsie ; 13 Fachbereich Bau- und Umweltingenieurwissenschaften > Institut für Geodäsie > Fernerkundung und Bildanalyse |
Date deposited: | 04 Oct 2019 06:46 |
Last modified: | 04 Oct 2019 06:46 |