Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes
arXiv:1807.02242v2 [cs.CV] 1 Aug 2018

Pengyuan Lyu1[0000-0003-3153-8519], Minghui Liao1[0000-0002-2583-4314], Cong Yao2[0000-0001-6564-4796], Wenhao Wu2, and Xiang Bai1[0000-0002-3449-5940]
1 Huazhong University of Science and Technology
2 Megvii (Face++) Technology Inc.
lvpyuan@gmail.com, mhliao@hust.edu.cn, yaocong2010@gmail.com, wwh@megvii.com, xbai@hust.edu.cn
Authors contribute equally. Corresponding author.

Abstract. Recently, models based on deep neural networks have dominated the fields of scene text detection and recognition. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. An end-to-end trainable neural network model for scene text spotting is proposed. The proposed model, named Mask TextSpotter, is inspired by the newly published work Mask R-CNN. Different from previous methods that also accomplish text spotting with end-to-end trainable deep neural networks, Mask TextSpotter takes advantage of a simple and smooth end-to-end learning procedure, in which precise text detection and recognition are acquired via semantic segmentation. Moreover, it is superior to previous methods in handling text instances of irregular shapes, for example, curved text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the proposed method achieves state-of-the-art results in both scene text detection and end-to-end text recognition tasks.

Keywords: Scene Text Spotting · Neural Network · Arbitrary Shapes

1 Introduction

In recent years, scene text detection and recognition have attracted growing research interest from the computer vision community, especially after the revival of neural networks and the growth of image datasets. Scene text detection and recognition provide an automatic, rapid approach to access the textual information embodied in natural scenes, benefiting a variety of real-world applications, such as geo-location [58], instant translation, and assistance for the blind.

Scene text spotting, which aims at concurrently localizing and recognizing text from natural scenes, has been previously studied in numerous works [49,
21]. However, in most works, except [27] and [3], text detection and subsequent recognition are handled separately. Text regions are first hunted from the original image by a trained detector and then fed into a recognition module. This procedure seems simple and natural, but might lead to sub-optimal performance for both detection and recognition, since the two tasks are highly correlated and complementary. On the one hand, the quality of detections largely determines the accuracy of recognition; on the other hand, the results of recognition can provide feedback that helps reject false positives in the detection phase.

Recently, two methods [27, 3] that devise end-to-end trainable frameworks for scene text spotting have been proposed. Benefiting from the complementarity between detection and recognition, these unified models significantly outperform previous competitors. However, there are two major drawbacks in [27] and [3]. First, neither of them can be completely trained in an end-to-end manner. [27] applied a curriculum learning paradigm [1] in the training period, where the sub-network for text recognition is locked in the early iterations and the training data for each period is carefully selected. Busta et al. [3] first pre-train the networks for detection and recognition separately and then jointly train them until convergence. There are two main reasons that prevent [27] and [3] from training their models in a smooth, end-to-end fashion. One is that the text recognition part requires accurate locations for training, while the locations in the early iterations are usually inaccurate. The other is that the adopted LSTM [17] or CTC loss [11] is more difficult to optimize than general CNNs. The second limitation of [27] and [3] is that these methods only focus on reading horizontal or oriented text. However, the shapes of text instances in real-world scenarios may vary significantly, from horizontal or oriented to curved forms.

Fig. 1: Illustrations of different text spotting methods. Left: horizontal text spotting methods [30, 27]; middle: oriented text spotting methods [3]; right: our proposed method. Green bounding box: detection result; red text on green background: recognition result.

In this paper, we propose a text spotter named Mask TextSpotter, which can detect and recognize text instances of arbitrary shapes. Here, arbitrary shapes means text instances of various forms in the real world. Inspired by Mask R-CNN [13], which can generate shape masks of objects, we detect text by segmenting the text instance regions, so our detector is able to detect text of arbitrary shapes. Besides, different from previous sequence-based recognition methods [45, 44, 26], which are designed for 1-D sequences, we recognize text via semantic
segmentation in 2-D space, to solve the issues in reading irregular text instances. Another advantage is that it does not require accurate locations for recognition. Therefore, the detection task and the recognition task can be trained completely end-to-end, benefiting from feature sharing and joint optimization.

We validate the effectiveness of our model on datasets that include horizontal, oriented and curved text. The results demonstrate the advantages of the proposed algorithm in both text detection and end-to-end text recognition tasks. Specifically, on ICDAR2015, evaluated at a single scale, our method achieves an F-Measure of 0.86 on the detection task and outperforms the previous top performers by 13.2%–25.3% on the end-to-end recognition task.

The main contributions of this paper are four-fold. (1) We propose an end-to-end trainable model for text spotting, which enjoys a simple, smooth training scheme. (2) The proposed method can detect and recognize text of various shapes, including horizontal, oriented, and curved text. (3) In contrast to previous methods, precise text detection and recognition in our method are accomplished via semantic segmentation. (4) Our method achieves state-of-the-art performance in both text detection and text spotting on various benchmarks.

2 Related Work

2.1 Scene Text Detection

In scene text recognition systems, text detection plays an important role [59]. A large number of methods have been proposed to detect scene text [7, 36, 37, 50, 19, 23, 54, 21, 47, 54, 56, 30, 52, 55, 34, 15, 48, 43, 57, 16, 35, 31]. In [21], Jaderberg et al. use Edge Boxes [60] to generate proposals and refine candidate boxes by regression. Zhang et al. [54] detect scene text by exploiting the symmetry property of text. Adapted from Faster R-CNN [40] and SSD [33] with well-designed modifications, [56, 30] are proposed to detect horizontal words.

Multi-oriented scene text detection has become a hot topic recently. Yao et al. [52] and Zhang et al. [55] detect multi-oriented scene text by semantic segmentation. Tian et al. [48] and Shi et al. [43] propose methods which first detect text segments and then link them into text instances by spatial relationships or link predictions. Zhou et al. [57] and He et al. [16] regress text boxes directly from dense segmentation maps. Lyu et al. [35] propose to detect and group the corner points of the text to generate text boxes. Rotation-sensitive regression for oriented scene text detection is proposed by Liao et al. [31].

Compared to the popularity of horizontal or multi-oriented scene text detection, few works focus on text instances of arbitrary shapes. Recently, detection of text with arbitrary shapes has gradually drawn the attention of researchers due to application requirements in real-life scenarios. In [41], Risnumawan et al. propose a system for arbitrary text detection based on text symmetry properties. In [4], a dataset which focuses on curved text detection is proposed. Different from most of the above-mentioned methods, we propose to detect scene text by instance segmentation, which can detect text of arbitrary shapes.
2.2 Scene Text Recognition

Scene text recognition [53, 46] aims at decoding detected or cropped image regions into character sequences. Previous scene text recognition approaches can be roughly split into three branches: character-based methods, word-based methods, and sequence-based methods. Character-based recognition methods [2, 22] mostly first localize individual characters and then recognize and group them into words. In [20], Jaderberg et al. propose a word-based method which treats text recognition as a classification problem over common English words (90k classes). Sequence-based methods solve text recognition as a sequence labeling problem. In [44], Shi et al. use a CNN and an RNN to model image features and output the recognized sequences with CTC [11]. In [26, 45], Lee et al. and Shi et al. recognize scene text via attention-based sequence-to-sequence models.

The text recognition component of our framework can be classified as a character-based method. However, in contrast to previous character-based approaches, we use an FCN [42] to localize and classify characters simultaneously. Besides, compared with sequence-based methods, which are designed for 1-D sequences, our method is more suitable for handling irregular text (multi-oriented text, curved text, etc.).

2.3 Scene Text Spotting

Most previous text spotting methods [21, 30, 12, 29] split the spotting process into two stages. They first use a scene text detector [21, 30, 29] to localize text instances and then use a text recognizer [20, 44] to obtain the recognized text. In [27, 3], Li et al. and Busta et al. propose end-to-end methods to localize and recognize text in a unified network, but these require relatively complex training procedures. Compared with these methods, our proposed text spotter can not only be trained completely end-to-end, but also detect and recognize scene text of arbitrary shapes (horizontal, oriented, and curved).

2.4 General Object Detection and Semantic Segmentation

With the rise of deep learning, general object detection and semantic segmentation have achieved great development. A large number of object detection and segmentation methods [9, 8, 40, 6, 32, 33, 39, 42, 5, 28, 13] have been proposed. Benefiting from those methods, scene text detection and recognition have achieved notable progress in the past few years. Our method is also inspired by those methods. Specifically, our method is adapted from the general object instance segmentation model Mask R-CNN [13]. However, there are key differences between the mask branch of our method and that of Mask R-CNN. Our mask branch can not only segment text regions but also predict character probability maps, which means that our method can recognize the character sequence inside the character maps rather than predicting an object mask only.
3 Methodology

The proposed method is an end-to-end trainable text spotter which can handle text of various shapes. It consists of an instance-segmentation based text detector and a character-segmentation based text recognizer.

3.1 Framework

The overall architecture of our proposed method is presented in Fig. 2. Functionally, the framework consists of four components: a feature pyramid network (FPN) [32] as the backbone, a region proposal network (RPN) [40] for generating text proposals, a Fast R-CNN [40] for bounding box regression, and a mask branch for text instance segmentation and character segmentation. In the training phase, a number of text proposals are first generated by RPN, and then the RoI features of the proposals are fed into the Fast R-CNN branch and the mask branch to generate the accurate text candidate boxes, the text instance segmentation maps, and the character segmentation maps.

Fig. 2: Illustration of the architecture of our method (RPN, Fast R-CNN with box classification and box regression, and the mask branch with word segmentation and character instance segmentation).

Backbone. Text in natural images varies in size. In order to build high-level semantic feature maps at all scales, we apply a feature pyramid structure [32] as the backbone, with a ResNet [14] of depth 50. FPN uses a top-down architecture to fuse features of different resolutions from a single-scale input, which improves accuracy at marginal cost.

RPN. RPN is used to generate text proposals for the subsequent Fast R-CNN and mask branch. Following [32], we assign anchors to different stages depending on the anchor size. Specifically, the areas of the anchors are set to $\{32^2, 64^2, 128^2, 256^2, 512^2\}$ pixels on the five stages $\{P_2, P_3, P_4, P_5, P_6\}$ respectively. Aspect ratios of $\{0.5, 1, 2\}$ are also adopted at each stage as in [40]. In this way, RPN can handle text of various sizes and aspect ratios. RoI Align [13] is adopted to extract the region features of the proposals. Compared to RoI Pooling [8], RoI Align preserves more accurate location information, which is quite beneficial to the segmentation task in the mask branch. Note that no text-specific design is adopted, such as special aspect ratios or orientations of anchors for text, as in previous works [30, 15, 34].
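To make the anchor assignment concrete, the sketch below enumerates the anchor shapes this configuration implies; the helper function and constant names are hypothetical illustrations, not the authors' code.

```python
# One anchor area per FPN stage P2-P6, with aspect ratios {0.5, 1, 2} at
# every stage, as described above (hypothetical helper, not the authors' code).
ANCHOR_AREAS = {            # stage -> anchor area in pixels
    "P2": 32 ** 2,
    "P3": 64 ** 2,
    "P4": 128 ** 2,
    "P5": 256 ** 2,
    "P6": 512 ** 2,
}
ASPECT_RATIOS = (0.5, 1.0, 2.0)  # height / width

def anchor_shapes(stage: str):
    """Return the (height, width) of the anchors assigned to an FPN stage."""
    area = ANCHOR_AREAS[stage]
    shapes = []
    for ratio in ASPECT_RATIOS:
        # area = h * w and ratio = h / w  =>  w = sqrt(area / ratio)
        w = (area / ratio) ** 0.5
        h = ratio * w
        shapes.append((round(h), round(w)))
    return shapes

print(anchor_shapes("P4"))  # [(91, 181), (128, 128), (181, 91)]
```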
Fast R-CNN. The Fast R-CNN branch includes a classification task and a regression task. The main function of this branch is to provide more accurate bounding boxes for detection. The inputs of Fast R-CNN are of 7 × 7 resolution, generated by RoI Align from the proposals produced by RPN.

Mask Branch. There are two tasks in the mask branch: a global text instance segmentation task and a character segmentation task. As shown in Fig. 3, given an input RoI whose size is fixed to 16 × 64, the mask branch predicts, through four convolutional layers and a de-convolutional layer, 38 maps (of size 32 × 128): a global text instance map, 36 character maps, and a background map of characters. The global text instance map gives accurate localization of a text region, regardless of the shape of the text instance. The character maps are maps of 36 characters, including 26 letters and 10 Arabic numerals. The background map of characters, which excludes the character regions, is also needed for post-processing.

Fig. 3: Illustration of the mask branch. It consists of four convolutional layers, one de-convolutional layer, and a final convolutional layer which predicts maps of 38 channels (1 for the global text instance map; 36 for character maps; 1 for the background map of characters).
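As a reference for the shapes involved, here is a minimal PyTorch sketch of such a head; the 3×3 kernel sizes, the 256-channel width, and the activation placement are assumptions borrowed from Mask R-CNN defaults, not details given in the paper.

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    """Sketch of the mask branch in Fig. 3: four conv layers on a 16x64 RoI
    feature, one stride-2 de-convolution up to 32x128, and a final conv that
    predicts 38 maps (1 global text instance + 36 characters + 1 background)."""

    def __init__(self, in_channels: int = 256, num_maps: int = 38):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):  # four 3x3 convs (kernel size assumed)
            layers += [nn.Conv2d(c, 256, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            c = 256
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.predictor = nn.Conv2d(256, num_maps, kernel_size=1)

    def forward(self, roi_feat: torch.Tensor) -> torch.Tensor:
        x = self.convs(roi_feat)        # (B, 256, 16, 64)
        x = torch.relu(self.deconv(x))  # (B, 256, 32, 128)
        return self.predictor(x)        # (B, 38, 32, 128), raw logits

maps = MaskBranch()(torch.zeros(2, 256, 16, 64))
assert maps.shape == (2, 38, 32, 128)
```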
3.2 Label Generation

For a training sample with input image $I$ and its corresponding ground truth, we generate targets for RPN, Fast R-CNN, and the mask branch. Generally, the ground truth contains $P = \{p_1, p_2, \dots, p_m\}$ and $C = \{c_1 = (cc_1, cl_1), c_2 = (cc_2, cl_2), \dots, c_n = (cc_n, cl_n)\}$, where $p_i$ is a polygon which represents the localization of a text region, and $cc_j$ and $cl_j$ are the category and location of a character respectively. Note that in our method $C$ is not necessary for all training samples.

We first transform the polygons into horizontal rectangles which cover the polygons with minimal areas. We then generate targets for RPN and Fast R-CNN following [8, 40, 32]. Two types of target maps are generated for the mask branch from the ground truth $P$, $C$ (which may not exist), and the proposals yielded by RPN: a global map for text instance segmentation and a character map for character semantic segmentation. Given a positive proposal $r$, we first use the matching mechanism of [8, 40, 32] to obtain the best matched horizontal rectangle, from which the corresponding polygon and characters (if any) can be obtained. Next, the matched polygon and character boxes are shifted and resized to align with the proposal and the target map of $H \times W$ by the following formulas:

$$B_x = (B_{x_0} - \min(r_x)) \times W / (\max(r_x) - \min(r_x)) \qquad (1)$$
$$B_y = (B_{y_0} - \min(r_y)) \times H / (\max(r_y) - \min(r_y)) \qquad (2)$$

where $(B_x, B_y)$ and $(B_{x_0}, B_{y_0})$ are the updated and original vertexes of the polygon and all character boxes, and $(r_x, r_y)$ are the vertexes of the proposal $r$.

After that, the target global map can be generated by drawing the normalized polygon on a zero-initialized mask and filling the polygon region with the value 1. The character map generation is visualized in Fig. 4a. We first shrink all character bounding boxes by fixing their center points and shortening their sides to a quarter of the original length. Then, the values of the pixels inside the shrunk character bounding boxes are set to their corresponding category indices, and those outside are set to 0. If there are no character bounding box annotations, all values are set to −1.

Fig. 4: (a) Label generation of the mask branch. Left: the blue box is a proposal yielded by RPN, the red polygon and yellow boxes are the ground truth polygon and character boxes, and the green box is the horizontal rectangle which covers the polygon with minimal area. Right: the global map (top) and the character map (bottom). (b) Overview of the pixel voting algorithm. Left: the predicted character maps; right: for each connected region, we calculate the score of each character by averaging the probability values in the corresponding region.
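The following NumPy sketch shows how Eqs. (1)–(2) and the character-map shrinking could be implemented; the helper names are hypothetical, and the rasterization is simplified to axis-aligned character boxes for illustration.

```python
import numpy as np

def normalize_points(pts, proposal, H=32, W=128):
    """Map vertices of the matched polygon / character boxes into the H x W
    target map of a proposal, following Eqs. (1)-(2).
    pts: (K, 2) array of [x, y]; proposal: (4, 2) array of proposal vertices."""
    pts = np.asarray(pts, dtype=float)
    proposal = np.asarray(proposal, dtype=float)
    x_min, y_min = proposal[:, 0].min(), proposal[:, 1].min()
    sx = W / (proposal[:, 0].max() - x_min)   # scale in Eq. (1)
    sy = H / (proposal[:, 1].max() - y_min)   # scale in Eq. (2)
    return np.stack([(pts[:, 0] - x_min) * sx,
                     (pts[:, 1] - y_min) * sy], axis=1)

def character_target(char_boxes, char_classes, H=32, W=128):
    """Build the character map: shrink each normalized character box to a
    quarter of its side lengths around its center, then fill it with the
    character's class index (1..36); everything else stays background 0."""
    target = np.zeros((H, W), dtype=np.int32)
    for (x1, y1, x2, y2), cls in zip(char_boxes, char_classes):
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        w, h = (x2 - x1) / 4.0, (y2 - y1) / 4.0   # shrunk side lengths
        target[int(round(cy - h / 2)): int(round(cy + h / 2)) + 1,
               int(round(cx - w / 2)): int(round(cx + w / 2)) + 1] = cls
    return target
```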
3.3 Optimization

As discussed in Sec. 3.1, our model includes multiple tasks. We naturally define a multi-task loss function:

$$L = L_{rpn} + \alpha_1 L_{rcnn} + \alpha_2 L_{mask}, \qquad (3)$$

where $L_{rpn}$ and $L_{rcnn}$ are the loss functions of RPN and Fast R-CNN, identical to those in [40] and [8]. The mask loss $L_{mask}$ consists of a global text instance segmentation loss $L_{global}$ and a character segmentation loss $L_{char}$:

$$L_{mask} = L_{global} + \beta L_{char}, \qquad (4)$$

where $L_{global}$ is an average binary cross-entropy loss and $L_{char}$ is a weighted spatial soft-max loss. In this work, $\alpha_1$, $\alpha_2$, and $\beta$ are empirically set to 1.0.

Text instance segmentation loss. The output of the text instance segmentation task is a single map. Let $N$ be the number of pixels in the global map, $y_n \in \{0, 1\}$ the label of pixel $n$, and $x_n$ the output at that pixel; we define $L_{global}$ as:

$$L_{global} = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log(S(x_n)) + (1 - y_n) \log(1 - S(x_n)) \right] \qquad (5)$$

where $S(\cdot)$ is the sigmoid function.

Character segmentation loss. The output of the character segmentation consists of 37 maps, which correspond to 37 classes (36 character classes and the background class). Let $T$ be the number of classes and $N$ the number of pixels in each map. The output maps $X$ can be viewed as an $N \times T$ matrix. The weighted spatial soft-max loss is then defined as:

$$L_{char} = -\frac{1}{N} \sum_{n=1}^{N} W_n \sum_{t=0}^{T-1} Y_{n,t} \log\left( \frac{e^{X_{n,t}}}{\sum_{k=0}^{T-1} e^{X_{n,k}}} \right), \qquad (6)$$

where $Y$ is the ground truth corresponding to $X$. The weight $W$ is used to balance the loss values of the positives (character classes) and the background class. Let $N_{neg}$ be the number of background pixels and let the background class index be 0; the weights are calculated as:

$$W_i = \begin{cases} 1 & \text{if } Y_{i,0} = 1, \\ N_{neg} / (N - N_{neg}) & \text{otherwise.} \end{cases} \qquad (7)$$

Note that in inference, a sigmoid function and a soft-max function are applied to generate the global map and the character segmentation maps respectively.

3.4 Inference

Different from the training process, where the input RoIs of the mask branch come from RPN, in the inference phase we use the outputs of Fast R-CNN as proposals to generate the predicted global maps and character maps, since the Fast R-CNN outputs are more accurate. Specifically, the inference process is as follows: first, given a test image, we obtain the outputs of Fast R-CNN as in [40] and filter out redundant candidate boxes by NMS; then, the kept proposals are fed into the mask branch to generate the global maps and the character maps; finally, the predicted polygons are obtained directly by calculating the contours of text regions on the global maps, and the character sequences are generated by our proposed pixel voting algorithm on the character maps.

Pixel Voting. We decode the predicted character maps into character sequences with our proposed pixel voting algorithm. We first binarize the background map, whose values range from 0 to 255, with a threshold of 192. Then we obtain
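Based on the steps described so far and the Fig. 4(b) caption, a minimal sketch of pixel voting might look as follows; treating low background-map values as character pixels and ordering regions left to right are assumptions, and the helper is hypothetical.

```python
import numpy as np
from scipy import ndimage

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # 36 classes; order assumed

def pixel_voting(background_map, char_maps, thresh=192):
    """background_map: (H, W) uint8 in [0, 255]; char_maps: (36, H, W) probs.
    Binarize the background map at 192, find connected character regions, and
    let each region vote by averaging every character map inside it."""
    char_region = background_map < thresh      # assumed: low background = character
    labels, num = ndimage.label(char_region)   # connected regions
    votes = []
    for region_id in range(1, num + 1):
        mask = labels == region_id
        scores = char_maps[:, mask].mean(axis=1)   # average prob per class
        x_center = np.where(mask)[1].mean()        # region position
        votes.append((x_center, ALPHABET[int(scores.argmax())]))
    votes.sort(key=lambda v: v[0])             # assumed left-to-right reading order
    return "".join(ch for _, ch in votes)
```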