
Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks

Koppanyi, Zoltan and Iwaszczuk, Dorota and Zha, Bing and Saul, Can Jozef and Toth, Charles K. and Yilmaz, Alper; Yang, Michael Ying and Rosenhahn, Bodo and Murino, Vittorio (eds.) (2019):
Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks.
In: Multimodal Scene Understanding, Academic Press, pp. 41-64, DOI: 10.1016/B978-0-12-817358-9.00009-3,
[Online-Edition: http://www.sciencedirect.com/science/article/pii/B9780128173...],
[Book Section]

Abstract

Semantic segmentation has been an active field in the computer vision and photogrammetry communities for over a decade. Pixel-level semantic labeling of images is generally achieved by assigning labels to pixels using machine learning techniques. Among these, encoder–decoder convolutional neural networks (CNNs) have recently become the baseline approach for this problem. The majority of papers on this topic use only RGB images as input, despite the availability of other data sources, such as depth, which can improve segmentation and labeling. In this chapter, we investigate a number of encoder–decoder CNN architectures for semantic labeling, where the depth data is fused with the RGB data using three different approaches: (1) fusion with the RGB image through color space transformation, (2) stacking depth images and RGB images, and (3) using Siamese network structures, such as FuseNet or VNet. The chapter also presents our approach to using surface normals in place of depth data. The advantage of the surface normal representation is that it introduces viewpoint independence: the direction of a surface normal vector remains the same when the camera pose changes. This is a clear advantage over raw depth data, where the depth value for a single scene point changes when the camera moves. The chapter provides a comprehensive analysis of the three fusion approaches using the SegNet, FuseNet and VNet deep learning architectures. The analysis is conducted on both the Stanford 2D-3D-Semantics indoor dataset and aerial images from the ISPRS Vaihingen dataset. Depth images of the Stanford dataset are acquired directly by flash LiDAR, whereas the ISPRS depth images are generated by dense 3D reconstruction. We show that the surface normal representation generalizes better to different scenes. In our experiments, using surface normals with FuseNet achieved a 5% improvement over using depth, resulting in 81.5% global accuracy on the Stanford dataset.
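The chapter itself does not ship code; the snippet below is only a minimal illustrative sketch (not the authors' implementation) of how a per-pixel surface normal map can be derived from a depth image with finite differences, which is one common way to obtain the viewpoint-stable representation described in the abstract. Function and variable names are chosen here for illustration.

```python
import numpy as np

def depth_to_normals(depth):
    """Estimate per-pixel surface normals from a depth image (H x W).

    Minimal sketch: finite differences of depth approximate the local surface
    gradient; real pipelines typically back-project pixels to 3D using the
    camera intrinsics before differentiating.
    """
    # Gradients of depth along image rows (y) and columns (x).
    dz_dy, dz_dx = np.gradient(depth.astype(np.float32))

    # The normal of the surface z = f(x, y) is proportional to (-dz/dx, -dz/dy, 1).
    normals = np.dstack((-dz_dx, -dz_dy, np.ones_like(depth, dtype=np.float32)))

    # Normalize to unit length so each pixel stores only a direction.
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    normals /= np.clip(norm, 1e-6, None)
    return normals  # H x W x 3, components in [-1, 1]

# The resulting 3-channel normal map could then replace raw depth as the
# input to the second encoder of a FuseNet-style Siamese network, or be
# stacked with the RGB channels for a single-encoder architecture.
```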

Item Type: Book Section
Published: 2019
Editors: Yang, Michael Ying and Rosenhahn, Bodo and Murino, Vittorio
Creators: Koppanyi, Zoltan and Iwaszczuk, Dorota and Zha, Bing and Saul, Can Jozef and Toth, Charles K. and Yilmaz, Alper
Title: Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks
Language: English
Title of Book: Multimodal Scene Understanding
Publisher: Academic Press
ISBN: 978-0-12-817358-9
Uncontrolled Keywords: Deep learning, CNN, Sensor fusion, Semantic labeling
Divisions: 13 Department of Civil and Environmental Engineering Sciences
13 Department of Civil and Environmental Engineering Sciences > Institute of Geodesy
13 Department of Civil and Environmental Engineering Sciences > Institute of Geodesy > Remote Sensing and Image Analysis
Date Deposited: 04 Oct 2019 06:46
DOI: 10.1016/B978-0-12-817358-9.00009-3
Official URL: http://www.sciencedirect.com/science/article/pii/B9780128173...