Infrared and Visible Image Fusion: A Region-Based Deep Learning Method

1 Introduction

2 Background

3 Method

3.1 Segmentation of Foreground and Background

3.2 Fusion of Foreground and Background

3.3 Reconstruction

3.4 Implementation Details

4 Results and Comparison

4.1 Results

4.2 Comparison

5 Conclusion

References

Infrared and Visible Image Fusion: A Region-Based Deep Learning Method

Chunyu Xie(1) and Xinde Li(1,2)(B)

(1) Key Laboratory of Measurement and Control of CSE, School of Automation, Southeast University, Nanjing, China
{cyxie,xindeli}@seu.edu.cn
(2) School of Cyber Science and Engineering, Southeast University, Nanjing, China

Abstract. Infrared and visible image fusion plays an important role in robot perception. The key to fusion is to extract useful information from the source images by appropriate methods. In this paper, we propose a deep learning method for infrared and visible image fusion based on region segmentation. First, the source infrared image is segmented into a foreground part and a background part. We then build an infrared and visible image fusion network on the basis of the neural style transfer algorithm. We propose a foreground loss and a background loss to control the fusion of the two parts respectively. Finally, the fused image is reconstructed by combining the two parts. The experimental results show that, compared with other state-of-the-art methods, our method retains both the saliency information of the target and the detailed texture information of the background.

Keywords: Infrared image · Visible image · Image fusion · Region segmentation · Deep learning

1 Introduction

The purpose of infrared and visible image fusion is to combine the images obtained by infrared and visible sensors to generate robust and informative images for further processing. Infrared images can distinguish targets from their backgrounds based on the radiation difference, while visible images can provide texture details with high spatial resolution and definition in a manner consistent with the human visual system [1]. The goal of infrared and visible image fusion is to combine the thermal radiation information in the infrared image with the detailed texture information in the visible image. In recent years, research on fusion algorithms has been developing rapidly.
However, an appropriate image information extraction method is key to ensuring good fusion performance for infrared and visible images. Existing fusion algorithms fall into seven categories: multi-scale transform [2], sparse representation [3], neural network [4], subspace [5], and saliency-based [6] methods, hybrid models [7], and other methods [8].

© Springer Nature Switzerland AG 2019
H. Yu et al. (Eds.): ICIRA 2019, LNAI 11744, pp. 604–615, 2019.
https://doi.org/10.1007/978-3-030-27541-9_49

The main steps of these methods are to decompose the source images into several levels, fuse corresponding layers with particular rules, and reconstruct the target images. Many fusion methods operate at the pixel level. These methods cannot effectively extract the target area of interest, and they depend heavily on predefined transforms and corresponding levels for decomposition and reconstruction. However, in several practical applications, attention is focused on the objects in an image at the region level [1]. Hence, region-level information should be considered during image fusion [9]. Consequently, region-based fusion rules have been widely used in infrared and visible image fusion [10].

Many region-based fusion methods have been proposed for infrared and visible image fusion, such as feature region extraction [11], regional uniformity [12], regional energy [10], and the multi-judgment fusion rule [13]. Some representative methods are based on the salient region [13]. These methods aim to identify regions that are more salient than other areas. This model has been used to extract visually salient regions of images, which can be used to obtain saliency maps of multi-scale sub-images [1]. Zhang et al. adopted a super-pixel-based saliency method to extract salient target regions; the fused coefficients could then be obtained from the extracted target region using a morphological method [14]. These methods have two disadvantages: (1) they adopt the same fusion rules for the target and background areas; (2) they cannot extract the target region accurately. Better methods are therefore needed.

In this paper, we propose a region-based fusion method to solve these problems. The source infrared image is segmented into a foreground part and a background part by semantic segmentation.
We propose a deep learning fusion method with a foreground loss and a background loss that control the fusion of the different regions respectively. The fused image is reconstructed by combining the foreground part and the background part. The rest of this paper is structured as follows. Sect. 2 introduces the background of this research. Sect. 3 describes the proposed method in detail. Sect. 4 presents the performance of our method and the experimental results on public data sets. Finally, Sect. 5 concludes the paper.

2 Background

Semantic segmentation, also called scene labeling, refers to the process of assigning a semantic label to each pixel of an image [15]. With the development of deep learning, semantic segmentation has improved significantly. Semantic segmentation based on deep learning can accurately classify each pixel and has already achieved good performance on very complex RGB image data sets. Compared with RGB images, infrared images are usually gray-scale, and the difference between the target area and the background area is more obvious. Therefore, we believe that semantic segmentation will also achieve good results on infrared images.

Gatys et al. [16] proposed a deep learning method for creating artistic imagery by separating and recombining image content and style.

This process of using Convolutional Neural Networks (CNNs) to render a content image in different styles is referred to as Neural Style Transfer (NST) [17]. They extract deep features at different layers from images by using CNNs. A content loss and a style loss are defined to control the fusion of content and texture. Unlike traditional methods, they use deep features to reconstruct images. Inspired by their work, we segment the source infrared image into sub-regions and fuse them with the visible image separately. We propose a foreground loss and a background loss to control the fusion of the two regions, based on the work of Gatys et al. [16]. The details of our method are presented in the next section.

3 Method

In this section, our proposed method is presented in four parts. To solve the problems raised above, we propose an infrared and visible image fusion method based on deep learning. The infrared image is segmented into foreground and background parts by semantic segmentation. The two parts are fused separately by a deep learning network based on NST. The framework of our proposed method is shown in Fig. 1.

Fig. 1. The framework of our method

3.1 Segmentation of Foreground and Background

In this paper, we define the target area, which is usually the region of interest in an infrared image, as the foreground, and the other areas as the background. The purpose of foreground fusion is to preserve the saliency information of the target in the infrared image while retaining as much of the texture information of the visible image as possible. The purpose of background fusion is to preserve the features of the infrared image while retaining the texture details of the visible image. To achieve better fusion performance, we divide the source images into foreground and background and fuse each part with its own strategy and parameters.
An input image $I$ can be represented as the combination of its foreground and background parts as follows:

$$I = I_f + I_b \tag{1}$$

$I_f$ is the foreground part, and $I_b$ is the background part. To divide the image, we use a semantic segmentation network to segment an infrared image into a foreground part and a background part, and train it on the TNO and INO datasets.

3.2 Fusion of Foreground and Background

Fusion of Foreground. In an infrared image, the foreground is usually the salient area. Hence, we take the foreground part of the source infrared image as the basis of the fused image, so that the salient information of the target is preserved. We extract texture and detail features from the foreground part of the source visible image. To extract the optimal detail features, we use CNNs to extract the deep features of the image. We define the foreground loss to control the fusion of the foreground. In the work of Gatys et al., the content loss of the $l$-th layer is defined as:

$$\mathcal{L}_c^l = \frac{1}{2 N_l D_l} \sum_{ij} \left( F_l[O] - F_l[I] \right)_{ij}^2 \tag{2}$$

The style loss of the $l$-th layer is defined as:

$$\mathcal{L}_s^l = \frac{1}{2 N_l^2} \sum_{ij} \left( G_l[O] - G_l[S] \right)_{ij}^2 \tag{3}$$

$I$ is the input image, $O$ is the output image, and $S$ is the reference style image. $N_l$ is the number of filters in the $l$-th layer. $D_l$ is the size of the vectorized feature map of each filter in the $l$-th layer. $F_{f,l}[\cdot]$ is the feature matrix, with $(i, j)$ indicating its index. $G_{f,l}[\cdot] = F_{f,l}[\cdot] F_{f,l}[\cdot]^T$ is the Gram matrix, which is defined as the inner product between the vectorized feature maps. Inspired by the work of Gatys et al., we use the content loss to constrain the basic content information, and the style loss, which is called the texture loss in this paper, to constrain the detail and texture information of the fused image. For the input infrared image $I$, the input visible image $V$, and the output fused image $O$, the foreground loss function of the fusion network is defined as:

$$\mathcal{L}_f = \sum_{l=1}^{L} \alpha_f^l \mathcal{L}_{f,c}^l + \sum_{l=1}^{L} \beta_f^l \mathcal{L}_{f,s}^l \tag{4}$$

The network contains $L$ layers; $\mathcal{L}_{f,c}^l$ and $\mathcal{L}_{f,s}^l$ denote the content loss and texture loss of the foreground fusion in the $l$-th layer.
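The content and style losses of Eqs. (2) and (3) can be illustrated with a minimal NumPy sketch. It assumes that the features of one layer are already available as an (N_l, D_l) matrix; the random matrices below are hypothetical stand-ins for CNN activations, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def gram(F):
    """Gram matrix G = F F^T of an (N_l, D_l) feature matrix."""
    return F @ F.T

def content_loss(F_out, F_in):
    """Eq. (2): squared feature difference scaled by 1 / (2 N_l D_l)."""
    n, d = F_out.shape
    return np.sum((F_out - F_in) ** 2) / (2.0 * n * d)

def style_loss(F_out, F_style):
    """Eq. (3): squared Gram-matrix difference scaled by 1 / (2 N_l^2)."""
    n = F_out.shape[0]
    return np.sum((gram(F_out) - gram(F_style)) ** 2) / (2.0 * n ** 2)

# Hypothetical toy features: N_l = 4 filters, D_l = 6 positions per filter.
rng = np.random.default_rng(0)
F_I = rng.standard_normal((4, 6))  # features of the input image I
F_S = rng.standard_normal((4, 6))  # features of the style image S
F_O = F_I.copy()                   # output initialised from the input

print(content_loss(F_O, F_I))  # zero here, because F_O equals F_I
print(style_loss(F_O, F_S))    # positive unless the Gram matrices coincide
```

In an actual fusion network these quantities would be computed on features of a pre-trained CNN and minimised with respect to the output image; the sketch only shows how the two terms are formed.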
The content loss and texture loss of foreground fusion are controlled by the weights $\alpha_f^l$ and $\beta_f^l$. The content loss and texture loss are defined as follows. The content loss of the $l$-th layer is:

$$\mathcal{L}_{f,c}^l = \frac{1}{2 N_l D_l} \sum_{ij} \left( F_{f,l}[O] - F_{f,l}[I] \right)_{ij}^2 \tag{5}$$

and the texture loss of the $l$-th layer is:

$$\mathcal{L}_{f,s}^l = \frac{1}{2 N_l^2} \sum_{ij} \left( G_{f,l}[O] - G_{f,l}[V] \right)_{ij}^2 \tag{6}$$

and:

$$F_{f,l}[O] = F_l[O] M_{f,l}[I] \tag{7}$$

$$F_{f,l}[I] = F_l[I] M_{f,l}[I] \tag{8}$$

$$F_{f,l}[V] = F_l[V] M_{f,l}[I] \tag{9}$$

$M_{f,l}[I]$ denotes the foreground segmentation mask; to match the feature maps of each layer $l$, the mask is downsampled to the resolution of that layer. In our test data, the infrared and visible images have been strictly registered, so $M_{f,l}[I] = M_{f,l}[V]$. All $M_{f,l}[V]$ terms have therefore been replaced by $M_{f,l}[I]$ in the formulas above.

Fusion of Background. Unlike in the foreground, we pay more attention to detailed textures in the background. Hence, we take the background part of the source visible image as the basis of the fused image, so that the detail information of the visible image is preserved. We extract textures from the source infrared image. For the background part, we define the background loss to control the fusion. The loss function is defined as:

$$\mathcal{L}_b = \sum_{l=1}^{L} \alpha_b^l \mathcal{L}_{b,c}^l + \sum_{l=1}^{L} \beta_b^l \mathcal{L}_{b,s}^l \tag{10}$$

$\mathcal{L}_{b,c}^l$ and $\mathcal{L}_{b,s}^l$ denote the content loss and texture loss of the background fusion in the $l$-th layer. The weights of the content loss and texture loss of background fusion are $\alpha_b^l$ and $\beta_b^l$. The content loss and texture loss are defined as follows. The content loss of the $l$-th layer is:

$$\mathcal{L}_{b,c}^l = \frac{1}{2 N_l D_l} \sum_{ij} \left( F_{b,l}[O] - F_{b,l}[V] \right)_{ij}^2 \tag{11}$$

and the texture loss of the $l$-th layer is:

$$\mathcal{L}_{b,s}^l = \frac{1}{2 N_l^2} \sum_{ij} \left( G_{b,l}[O] - G_{b,l}[I] \right)_{ij}^2 \tag{12}$$

and:

$$F_{b,l}[O] = F_l[O] M_{b,l}[I] \tag{13}$$

$$F_{b,l}[I] = F_l[I] M_{b,l}[I] \tag{14}$$

$$F_{b,l}[V] = F_l[V] M_{b,l}[I] \tag{15}$$

Similar to the foreground fusion, $M_{b,l}[I] = M_{b,l}[V]$, and all $M_{b,l}[V]$ terms have been replaced by $M_{b,l}[I]$ in the formulas above.
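The masked losses of Eqs. (5) to (15) can be sketched for a single layer as follows. This is an illustration under stated assumptions, not the authors' implementation: the mask is downsampled by nearest-neighbour sampling (the paper does not specify the method), the background mask is taken as the complement of the foreground mask, the background Gram matrix is formed from the masked features in the same way as in the foreground case, and the weights `alpha` and `beta` as well as all array values are hypothetical.

```python
import numpy as np

def downsample_mask(mask, h, w):
    """Nearest-neighbour resize of a 2-D binary mask to (h, w); an assumed choice."""
    rows = np.arange(h) * mask.shape[0] // h
    cols = np.arange(w) * mask.shape[1] // w
    return mask[np.ix_(rows, cols)]

def masked(F, m):
    """Flatten (N_l, H, W) features to (N_l, D_l) and apply the mask, as in Eqs. (7)-(9)."""
    return F.reshape(F.shape[0], -1) * m.reshape(1, -1)

def region_loss(F_out, F_content, F_texture, m, alpha, beta):
    """alpha * content loss + beta * texture loss of one region in one layer."""
    O, C, T = masked(F_out, m), masked(F_content, m), masked(F_texture, m)
    n, d = O.shape
    l_c = np.sum((O - C) ** 2) / (2.0 * n * d)              # Eqs. (5), (11)
    l_s = np.sum((O @ O.T - T @ T.T) ** 2) / (2.0 * n ** 2)  # Eqs. (6), (12)
    return alpha * l_c + beta * l_s

# Hypothetical 8x8 segmentation mask with a square foreground target.
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
m_f = downsample_mask(mask, 4, 4)  # foreground mask at the layer resolution
m_b = 1.0 - m_f                    # background mask (assumed complement)

# Hypothetical layer features: 3 filters on a 4x4 grid.
rng = np.random.default_rng(0)
F_I = rng.standard_normal((3, 4, 4))  # infrared features
F_V = rng.standard_normal((3, 4, 4))  # visible features
F_O = F_I * m_f + F_V * m_b           # output: infrared foreground, visible background

# Foreground: content from infrared, texture from visible. Background: the reverse.
loss_f = region_loss(F_O, F_I, F_V, m_f, alpha=1.0, beta=1e-3)
loss_b = region_loss(F_O, F_V, F_I, m_b, alpha=1.0, beta=1e-3)
print(loss_f, loss_b)
```

With this choice of `F_O` both content terms vanish, so the remaining loss comes entirely from the texture terms; this mirrors the design in which each region keeps one source as its basis and borrows texture from the other.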

Fig. 2. The procedure of fusion

3.3 Reconstruction

We reconstruct the fused image by combining the fused foreground part and background part. The total loss function of the fusion network is formed by combining the foreground loss and the background loss. We add an $\mathcal{L}_{tv}$ term to suppress the noise generated in the fusion process. The total loss of the fusion network is:

$$\mathcal{L}_{total} = \mathcal{L}_f + \mathcal{L}_b + \mathcal{L}_{tv} = \sum_{l=1}^{L} \left( \alpha_f^l \mathcal{L}_{f,c}^l + \alpha_b^l \mathcal{L}_{b,c}^l \right) + \sum_{l=1}^{L} \left( \beta_f^l \mathcal{L}_{f,s}^l + \beta_b^l \mathcal{L}_{b,s}^l \right) + \mathcal{L}_{tv} \tag{16}$$

3.4 Implementation Details

This part describes the implementation details of our method. We adopt a state-of-the-art semantic segmentation network, SegNet [18], to segment the infrared images and generate masks for further processing. We use it to segment the image into two categories, and the network is trained on 1000 images from the TNO and INO data sets. To generate the mask, we segment only the infrared image: since the infrared and visible image pairs have been registered, it is straightforward to map the segmentation mask to the visible image.

In our fusion network, as shown in Fig. 2, a pre-trained VGG-19 network is employed as the feature extractor. For foreground fusion, we choose layer conv2_2 to extract the content feature, and layers conv1_1 and conv2_1 to extract the texture feature. For background fusion, we choose layer conv4_2 to extract the content feature, and layers conv3_1, conv4_1, and conv5_1 to extract the texture feature. The mask is downsampled to match the feature maps of the different layers.

4 Results and Comparison

In this section, the performance of our method is evaluated by experiments on common data sets and compared with other methods.

4.1 Results

We select 1000 pairs of infrared and visible images from the TNO and INO data sets, format them into 360 × 480 images, and feed them into SegNet [18]. We train a semantic segmentation network that performs well and use it to segment the input infrared images. To test the performance of our method, we select 20 pairs of infrared and visible images from the TNO data set for the experiment. Several segmentation results are shown in Fig. 3. After segmentation, the mask and the infrared and visible images are fed into the fusion network. The fusion result is shown in Fig. 4.

Fig. 3. The results of semantic segmentation on the TNO dataset. The input infrared images are shown in the first row, the output masks in the second row.

Fig. 4. The fusion result of our method. (a) The infrared image. (b) The visible image. (c) The mask. (d) The fused image.

4.2 Comparison

In the experiment, we select several state-of-the-art infrared and visible image fusion methods for comparison. These methods include the curvelet transform (CVT) [19], the dual-tree complex wavelet transform (DTCWT) [20], a weighted least square optimization-based method (WLS) [7], gradient transfer fusion (GTF) [21], and a generative adversarial network for infrared and visible image fusion (FusionGAN) [22]. The experiment is carried out on a 3.4 GHz Intel(R) Core(TM) CPU with 8 GB RAM.

Subjective Evaluation. Five pairs of infrared and visible images are selected for subjective evaluation. As shown in Fig. 5, the first two rows show the original infrared images and visible images, the last row shows the results of our method, and the other rows correspond to the five methods used for comparison. All the methods fuse the features of the infrared and visible images successfully. The fusion results of CVT and DTCWT contain rich detail features, but the targets are not obvious. Compared with CVT and DTCWT, WLS has stronger target saliency, but some infrared features are lost in the background.

Fig. 5. Results of five infrared and visible image pairs from the TNO dataset. From top to bottom: infrared images, visible images, results of CVT, DTCWT, WLS, GTF, FusionGAN, and our method. Some detail parts are zoomed in and placed at the bottom right corner for clear comparison.
