Mask TextSpotter: An End-to-End Trainable
Neural Network for Spotting Text with
Arbitrary Shapes
Pengyuan Lyu1[0000−0003−3153−8519], Minghui Liao1[0000−0002−2583−4314],
Cong Yao2[0000−0001−6564−4796], Wenhao Wu2, and
Xiang Bai1[0000−0002−3449−5940]
1 Huazhong University of Science and Technology
2 Megvii (Face++) Technology Inc.
lvpyuan@gmail.com, mhliao@hust.edu.cn, yaocong2010@gmail.com,
wwh@megvii.com, xbai@hust.edu.cn
Abstract. Recently, models based on deep neural networks have dom-
inated the fields of scene text detection and recognition. In this paper,
we investigate the problem of scene text spotting, which aims at simul-
taneous text detection and recognition in natural images. An end-to-end
trainable neural network model for scene text spotting is proposed. The
proposed model, named Mask TextSpotter, is inspired by the newly
published work Mask R-CNN. Different from previous methods that also
accomplish text spotting with end-to-end trainable deep neural networks,
Mask TextSpotter takes advantage of a simple and smooth end-to-end
learning procedure, in which precise text detection and recognition are
acquired via semantic segmentation. Moreover, it is superior to previ-
ous methods in handling text instances of irregular shapes, for example,
curved text. Experiments on ICDAR2013, ICDAR2015 and Total-Text
demonstrate that the proposed method achieves state-of-the-art results
in both scene text detection and end-to-end text recognition tasks.
Keywords: Scene Text Spotting · Neural Network · Arbitrary Shapes
1 Introduction
In recent years, scene text detection and recognition have attracted growing re-
search interest from the computer vision community, especially after the revival
of neural networks and growth of image datasets. Scene text detection and recog-
nition provide an automatic, rapid approach to access the textual information
embodied in natural scenes, benefiting a variety of real-world applications, such
as geo-location [58], instant translation, and assistance for the blind.
Scene text spotting, which aims at concurrently localizing and recognizing
text from natural scenes, has been previously studied in numerous works [49,
21]. However, in most works, except [27] and [3], text detection and subsequent
recognition are handled separately. Text regions are first hunted from the original
image by a trained detector and then fed into a recognition module. This pro-
cedure seems simple and natural, but might lead to sub-optimal performances
for both detection and recognition, since these two tasks are highly correlated
and complementary. On one hand, the quality of detections largely determines
the accuracy of recognition; on the other hand, the results of recognition can
provide feedback to help reject false positives in the phase of detection.
Recently, two methods [27, 3] that devise end-to-end trainable frameworks
for scene text spotting have been proposed. Benefiting from the complementarity
between detection and recognition, these unified models significantly outperform
previous competitors. However, there are two major drawbacks in [27] and [3].
First, neither of them can be trained completely in an end-to-end manner.
[27] applied a curriculum learning paradigm [1] in the training period, where the
sub-network for text recognition is locked at the early iterations and the training
data for each period is carefully selected. Busta et al. [3] at first pre-train the
networks for detection and recognition separately and then jointly train them
until convergence. There are mainly two reasons that stop [27] and [3] from
training the models in a smooth, end-to-end fashion. One is that the text recog-
nition part requires accurate locations for training while the locations in the
early iterations are usually inaccurate. The other is that the adopted LSTM [17]
or CTC loss [11] are more difficult to optimize than general CNNs. The second limi-
tation of [27] and [3] lies in that these methods only focus on reading horizontal
or oriented text. However, the shapes of text instances in real-world scenarios
may vary significantly, from horizontal or oriented, to curved forms.
In this paper, we propose a text spotter named Mask TextSpotter, which
can detect and recognize text instances of arbitrary shapes. Here, arbitrary
shapes mean the various forms of text instances in the real world. Inspired by Mask
R-CNN [13], which can generate shape masks of objects, we detect text by segmenting
the text instance regions. Thus our detector is able to detect text of arbitrary
shapes. Besides, different from the previous sequence-based recognition methods
[45, 44, 26] which are designed for 1-D sequences, we recognize text via semantic
segmentation in 2-D space, to solve the issues in reading irregular text instances.
Another advantage is that it does not require accurate locations for recognition.
Therefore, the detection task and the recognition task can be trained end-to-end
completely, and benefit from feature sharing and joint optimization.
Fig. 1: Illustrations of different text spotting methods. Left: horizontal text
spotting methods [30, 27]; middle: oriented text spotting methods [3]; right: our
proposed method. Green bounding box: detection result; red text on green
background: recognition result.
We validate the effectiveness of our model on the datasets that include hor-
izontal, oriented and curved text. The results demonstrate the advantages of
the proposed algorithm in both text detection and end-to-end text recognition
tasks. Specifically, on ICDAR2015, evaluated at a single scale, our method achieves
an F-Measure of 0.86 on the detection task and outperforms the previous top
performers by 13.2% − 25.3% on the end-to-end recognition task.
The main contributions of this paper are four-fold. (1) We propose an end-
to-end trainable model for text spotting, which enjoys a simple, smooth train-
ing scheme. (2) The proposed method can detect and recognize text of vari-
ous shapes, including horizontal, oriented, and curved text. (3) In contrast to
previous methods, precise text detection and recognition in our method are ac-
complished via semantic segmentation. (4) Our method achieves state-of-the-art
performances in both text detection and text spotting on various benchmarks.
2 Related Work
2.1 Scene Text Detection
In scene text recognition systems, text detection plays an important role [59].
A large number of methods have been proposed to detect scene text [7, 36, 37,
50, 19, 23, 54, 21, 47, 54, 56, 30, 52, 55, 34, 15, 48, 43, 57, 16, 35, 31]. In [21],
Jaderberg et al. use Edge Boxes [60] to generate proposals and refine candidate
boxes by regression. Zhang et al. [54] detect scene text by exploiting the sym-
metry property of text. Adapted from Faster R-CNN [40] and SSD [33] with
well-designed modifications, [56, 30] are proposed to detect horizontal words.
Multi-oriented scene text detection has become a hot topic recently. Yao et
al. [52] and Zhang et al. [55] detect multi-oriented scene text by semantic seg-
mentation. Tian et al. [48] and Shi et al. [43] propose methods which first detect
text segments and then link them into text instances by spatial relationship or
link predictions. Zhou et al. [57] and He et al. [16] regress text boxes directly
from dense segmentation maps. Lyu et al. [35] propose to detect and group the
corner points of the text to generate text boxes. Rotation-sensitive regression
for oriented scene text detection is proposed by Liao et al. [31].
Compared to the popularity of horizontal or multi-oriented scene text detec-
tion, there are few works focusing on text instances of arbitrary shapes. Recently,
detection of text with arbitrary shapes has gradually drawn the attention of re-
searchers due to the application requirements in real-life scenarios. In [41],
Risnumawan et al. propose a system for arbitrary text detection based on text
symmetry properties. In [4], a dataset which focuses on curved text
detection is proposed. Different from most of the above-mentioned methods, we
propose to detect scene text by instance segmentation which can detect text
with arbitrary shapes.
2.2 Scene Text Recognition
Scene text recognition [53, 46] aims at decoding the detected or cropped image
regions into character sequences. The previous scene text recognition approaches
can be roughly split into three branches: character-based methods, word-based
methods, and sequence-based methods. The character-based recognition meth-
ods [2, 22] mostly first localize individual characters and then recognize and
group them into words. In [20], Jaderberg et al. propose a word-based method
which treats text recognition as a classification problem over 90k common English
words. Sequence-based methods solve text recognition as a sequence labeling
problem. In [44], Shi et al. use CNN and RNN to model image features and
output the recognized sequences with CTC [11]. In [26, 45], Lee et al. and Shi
et al. recognize scene text via attention based sequence-to-sequence model.
The proposed text recognition component in our framework can be classified
as a character-based method. However, in contrast to previous character-based
approaches, we use an FCN [42] to localize and classify characters simultaneously.
Besides, compared with sequence-based methods which are designed for a 1-D
sequence, our method is more suitable for handling irregular text (multi-oriented
text, curved text, etc.).
2.3 Scene Text Spotting
Most of the previous text spotting methods [21, 30, 12, 29] split the spotting
process into two stages. They first use a scene text detector [21, 30, 29] to localize
text instances and then use a text recognizer [20, 44] to obtain the recognized
text. In [27, 3], Li et al. and Busta et al. propose end-to-end methods to localize
and recognize text in a unified network, but require relatively complex training
procedures. Compared with these methods, our proposed text spotter can not
only be trained end-to-end completely, but also has the ability to detect and
recognize arbitrary-shape (horizontal, oriented, and curved) scene text.
2.4 General Object Detection and Semantic Segmentation
With the rise of deep learning, general object detection and semantic segmenta-
tion have advanced significantly. A large number of object detection and
segmentation methods [9, 8, 40, 6, 32, 33, 39, 42, 5, 28, 13] have been pro-
posed. Benefiting from those methods, scene text detection and recognition have
achieved obvious progress in the past few years. Our method is also inspired
by those methods. Specifically, our method is adapted from a general object in-
stance segmentation model Mask R-CNN [13]. However, there are key differences
between the mask branch of our method and that in Mask R-CNN. Our mask
branch can not only segment text regions but also predict character probabil-
ity maps, which means that our method can be used to recognize the instance
sequence inside character maps rather than predicting an object mask only.
3 Methodology
The proposed method is an end-to-end trainable text spotter, which can handle
various shapes of text. It consists of an instance-segmentation based text detector
and a character-segmentation based text recognizer.
3.1 Framework
The overall architecture of our proposed method is presented in Fig. 2. Func-
tionally, the framework consists of four components: a feature pyramid network
(FPN) [32] as backbone, a region proposal network (RPN) [40] for generating
text proposals, a Fast R-CNN [40] for bounding box regression, and a mask branch
for text instance segmentation and character segmentation. In the training phase,
a lot of text proposals are first generated by RPN, and then the RoI features of
the proposals are fed into the Fast R-CNN branch and the mask branch to gen-
erate the accurate text candidate boxes, the text instance segmentation maps,
and the character segmentation maps.
Backbone Text in natural images varies in size. In order to build high-level
semantic feature maps at all scales, we apply a feature pyramid structure [32]
backbone with ResNet [14] of depth 50. FPN uses a top-down architecture to
fuse features of different resolutions from a single-scale input, which improves
accuracy with marginal cost.
RPN RPN is used to generate text proposals for the subsequent Fast R-CNN
and mask branch. Following [32], we assign anchors on different stages de-
pending on the anchor size. Specifically, the areas of the anchors are set to
{32^2, 64^2, 128^2, 256^2, 512^2} pixels on the five stages {P2, P3, P4, P5, P6}, respectively.
Different aspect ratios {0.5, 1, 2} are also adopted at each stage as in [40]. In this
way, the RPN can handle text of various sizes and aspect ratios. RoI Align [13] is
adopted to extract the region features of the proposals. Compared to RoI Pool-
ing [8], RoI Align preserves more accurate location information, which is quite
beneficial to the segmentation task in the mask branch. Note that no special
design for text is adopted, such as the special aspect ratios or orientations of
anchors for text, as in previous works [30, 15, 34].
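As an illustration of this anchor configuration (not the authors' code), the sketch below derives anchor heights and widths per FPN stage from the areas and aspect ratios given above; the dictionary and function names are our own.

```python
# A sketch (not the authors' code) of the anchor configuration described above.
import math

ANCHOR_AREAS = {            # anchor areas in pixels^2, one per FPN stage
    "P2": 32 ** 2, "P3": 64 ** 2, "P4": 128 ** 2, "P5": 256 ** 2, "P6": 512 ** 2,
}
ASPECT_RATIOS = (0.5, 1.0, 2.0)     # height / width

def anchor_shapes(stage):
    """Return the (height, width) of the three anchors used on one FPN stage."""
    area = ANCHOR_AREAS[stage]
    shapes = []
    for ratio in ASPECT_RATIOS:
        w = math.sqrt(area / ratio)     # from area = h * w and ratio = h / w
        shapes.append((round(w * ratio), round(w)))
    return shapes

print(anchor_shapes("P3"))  # [(45, 91), (64, 64), (91, 45)]
```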
Fig. 2: Illustration of the architecture of our method.
Fast R-CNN The Fast R-CNN branch includes a classification task and a
regression task. The main function of this branch is to provide more accurate
bounding boxes for detection. The inputs of Fast R-CNN are of 7 × 7 resolution,
which are generated by RoI Align from the proposals produced by RPN.
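To make the RoI feature extraction concrete, here is a small sketch using torchvision's roi_align. The 7 × 7 (Fast R-CNN) and 16 × 64 (mask branch) output sizes come from the text; the feature map shape, the 1/8 spatial scale, and the example box are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 100, 168)               # one FPN level (assumed stride 8)
boxes = torch.tensor([[0.0, 48.0, 40.0, 240.0, 88.0]]) # (batch_idx, x0, y0, x1, y1) in image coords

rcnn_feats = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1 / 8)    # Fast R-CNN head
mask_feats = roi_align(features, boxes, output_size=(16, 64), spatial_scale=1 / 8)  # mask branch
print(rcnn_feats.shape, mask_feats.shape)  # torch.Size([1, 256, 7, 7]) torch.Size([1, 256, 16, 64])
```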
Mask Branch There are two tasks in the mask branch, including a global
text instance segmentation task and a character segmentation task. As shown
in Fig. 3, given an input RoI whose size is fixed to 16 × 64, the mask branch passes
it through four convolutional layers and a de-convolutional layer, and predicts 38 maps
(of size 32 × 128), including a global text instance map, 36 character maps,
and a background map of characters. The global text instance map can give ac-
curate localization of a text region, regardless of the shape of the text instance.
The character maps are maps of 36 characters, including 26 letters and 10 Ara-
bic numerals. The background map of characters, which excludes the character
regions, is also needed for post-processing.
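A minimal PyTorch sketch of the shape flow just described: four convolutional layers, one de-convolutional layer, and a final layer predicting the 38 maps. Kernel sizes and activations are assumptions, since the text only specifies channel counts and map sizes.

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    def __init__(self, num_maps=38):
        super().__init__()
        layers = []
        for _ in range(4):                       # four 3x3 conv layers, 256 channels, 16x64 kept
            layers += [nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)  # 16x64 -> 32x128
        # 38 output maps: 1 global text instance map + 36 character maps + 1 background map
        self.predictor = nn.Conv2d(256, num_maps, kernel_size=1)

    def forward(self, roi_features):             # roi_features: (B, 256, 16, 64)
        x = self.convs(roi_features)
        x = torch.relu(self.deconv(x))
        return self.predictor(x)                 # (B, 38, 32, 128)

maps = MaskBranch()(torch.zeros(2, 256, 16, 64))
assert maps.shape == (2, 38, 32, 128)
```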
3.2 Label Generation
For a training sample with the input image I and the corresponding ground
truth, we generate targets for RPN, Fast R-CNN and mask branch. Generally,
the ground truth contains P = {p_1, p_2, ..., p_m} and C = {c_1 = (cc_1, cl_1), c_2 =
(cc_2, cl_2), ..., c_n = (cc_n, cl_n)}, where p_i is a polygon which represents the local-
ization of a text region, cc_j and cl_j are the category and location of a character
respectively. Note that, in our method C is not necessary for all training samples.
We first transform the polygons into horizontal rectangles which cover the
polygons with minimal areas. Then we generate targets for RPN and Fast
R-CNN following [8, 40, 32]. There are two types of target maps to be generated
for the mask branch with the ground truth P , C (may not exist) as well as the
proposals yielded by RPN: a global map for text instance segmentation and a
character map for character semantic segmentation. Given a positive proposal
r, we first use the matching mechanism of [8, 40, 32] to obtain the best matched
horizontal rectangle. The corresponding polygon as well as characters (if any)
can be obtained further. Next, the matched polygon and character boxes are
shifted and resized to align with the proposal and the target map of H × W according
to the following formulas:

$$B_x = (B_{x_0} - \min(r_x)) \times W/(\max(r_x) - \min(r_x)) \quad (1)$$
$$B_y = (B_{y_0} - \min(r_y)) \times H/(\max(r_y) - \min(r_y)) \quad (2)$$

where (B_x, B_y) and (B_{x_0}, B_{y_0}) are the updated and original vertexes of the
polygon and all character boxes, and (r_x, r_y) are the vertexes of the proposal r.

Fig. 3: Illustration of the mask branch. It consists of four convolutional layers,
one de-convolutional layer, and a final convolutional layer which predicts maps
of 38 channels (1 for the global text instance map; 36 for the character maps; 1
for the background map of characters).
After that, the target global map can be generated by just drawing the
normalized polygon on a zero-initialized mask and filling the polygon region
with the value 1. The character map generation is visualized in Fig. 4a. We first
shrink all character bounding boxes by fixing their center points and shortening
the sides to a fourth of their original length. Then, the values of the pixels in the
shrunk character bounding boxes are set to their corresponding category indices
and those outside the shrunk character bounding boxes are set to 0. If there are
no character bounding boxes annotations, all values are set to −1.
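The following sketch (assuming numpy and OpenCV; all names are illustrative) summarizes this label generation: vertices are normalized into the H × W target frame via Eqs. (1)-(2), the global map is filled with 1 inside the polygon, and the character map is filled with class indices inside the shrunk character boxes, or with −1 when no character annotations exist.

```python
import numpy as np
import cv2

H, W = 32, 128  # target map size

def normalize_points(points, proposal):
    """Shift and resize polygon/character vertices into the H x W target frame (Eqs. 1-2)."""
    x0, y0, x1, y1 = proposal                      # min/max corners of the proposal rectangle
    pts = np.asarray(points, dtype=np.float32).copy()
    pts[:, 0] = (pts[:, 0] - x0) * W / (x1 - x0)
    pts[:, 1] = (pts[:, 1] - y0) * H / (y1 - y0)
    return pts

def make_global_map(polygon, proposal):
    """Draw the normalized text polygon on a zero-initialized map and fill it with 1."""
    mask = np.zeros((H, W), dtype=np.uint8)
    pts = normalize_points(polygon, proposal).astype(np.int32)
    cv2.fillPoly(mask, [pts], 1)
    return mask

def make_char_map(char_boxes, proposal):
    """Fill shrunk character boxes with their class indices (1..36); -1 everywhere if unannotated."""
    if not char_boxes:
        return np.full((H, W), -1, dtype=np.int32)
    char_map = np.zeros((H, W), dtype=np.int32)
    for box, cls_idx in char_boxes:                # box: 4 corner points, cls_idx in [1, 36]
        pts = normalize_points(box, proposal)
        center = pts.mean(axis=0)
        shrunk = center + (pts - center) * 0.25    # shorten the sides to a fourth of the original
        cv2.fillPoly(char_map, [shrunk.astype(np.int32)], int(cls_idx))
    return char_map
```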
3.3 Optimization
As discussed in Sec. 3.1, our model includes multiple tasks. We naturally define
a multi-task loss function:
$$L = L_{rpn} + \alpha_1 L_{rcnn} + \alpha_2 L_{mask}, \quad (3)$$
where L_{rpn} and L_{rcnn} are the loss functions of RPN and Fast R-CNN, which
are identical to those in [40] and [8]. The mask loss L_{mask} consists of a global
text instance segmentation loss L_{global} and a character segmentation loss L_{char}:
$$L_{mask} = L_{global} + \beta L_{char}, \quad (4)$$
where L_{global} is an average binary cross-entropy loss and L_{char} is a weighted
spatial soft-max loss. In this work, α_1, α_2, and β are all empirically set to 1.0.
Fig. 4: (a) Label generation of mask branch. Left: the blue box is a proposal
yielded by RPN, the red polygon and yellow boxes are ground truth polygon
and character boxes, the green box is the horizontal rectangle which covers the
polygon with minimal area. Right: the global map (top) and the character map
(bottom). (b) Overview of the pixel voting algorithm. Left: the predicted char-
acter maps; right: for each connected region, we calculate the scores for each
character by averaging the probability values in the corresponding region.
Text instance segmentation loss The output of the text instance segmenta-
tion task is a single map. Let N be the number of pixels in the global map, y_n
be the pixel label (y_n ∈ {0, 1}), and x_n be the output pixel. We define L_{global}
as follows:
$$L_{global} = -\frac{1}{N}\sum_{n=1}^{N}\left[y_n \log(S(x_n)) + (1 - y_n)\log(1 - S(x_n))\right] \quad (5)$$
where S(x) is a sigmoid function.
Character segmentation loss The output of the character segmentation con-
sists of 37 maps, which correspond to 37 classes (36 classes of characters and the
background class). Let T be the number of classes, N be the number of pixels in
each map. The output maps X can be viewed as an N × T matrix. In this way,
the weighted spatial soft-max loss can be defined as follows:
$$L_{char} = -\frac{1}{N}\sum_{n=1}^{N} W_n \sum_{t=0}^{T-1} Y_{n,t} \log\left(\frac{e^{X_{n,t}}}{\sum_{k=0}^{T-1} e^{X_{n,k}}}\right), \quad (6)$$
where Y is the corresponding ground truth of X. The weight W is used to
balance the loss value of the positives (character classes) and the background
class. Let the number of background pixels be N_{neg} and the background
class index be 0; the weights can be calculated as:
$$W_i = \begin{cases} 1 & \text{if } Y_{i,0} = 1, \\ N_{neg}/(N - N_{neg}) & \text{otherwise} \end{cases} \quad (7)$$
Note that in inference, a sigmoid function and a soft-max function are applied
to generate the global map and the character segmentation maps respectively.
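As a concrete reference, the sketch below implements the mask losses of Eqs. (4)-(7) in PyTorch. The tensor layouts, function names, and the handling of unannotated RoIs (labels set to −1, as in Sec. 3.2) are our own assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def global_loss(global_logits, global_gt):
    """Eq. 5: average binary cross-entropy over the pixels of the global map."""
    return F.binary_cross_entropy_with_logits(global_logits, global_gt.float())

def char_loss(char_logits, char_gt):
    """Eqs. 6-7: weighted spatial soft-max loss over 37 classes (36 characters + background).
    char_logits: (B, 37, H, W); char_gt: (B, H, W) class indices, -1 where unannotated."""
    valid = char_gt >= 0                                  # skip RoIs without character labels
    if not valid.any():
        return char_logits.sum() * 0.0                    # zero loss, keeps the graph connected
    logits = char_logits.permute(0, 2, 3, 1)[valid]       # (M, 37)
    labels = char_gt[valid]                               # (M,)
    ce = F.cross_entropy(logits, labels, reduction="none")
    n_neg = (labels == 0).sum().float()                   # background pixels (class index 0)
    n_pos = (labels.numel() - n_neg).clamp(min=1.0)
    # Eq. 7: background pixels get weight 1, character pixels get N_neg / (N - N_neg).
    weights = torch.where(labels == 0, torch.ones_like(ce),
                          torch.full_like(ce, (n_neg / n_pos).item()))
    return (weights * ce).mean()

def mask_loss(global_logits, global_gt, char_logits, char_gt, beta=1.0):
    """Eq. 4: L_mask = L_global + beta * L_char, with beta = 1.0 in the paper."""
    return global_loss(global_logits, global_gt) + beta * char_loss(char_logits, char_gt)
```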
3.4 Inference
Different from the training process, where the input RoIs of the mask branch come
from RPN, in the inference phase we use the outputs of Fast R-CNN as proposals
to generate the predicted global maps and character maps, since the Fast R-CNN
outputs are more accurate.
Specifically, the inference process is as follows: first, given a test image, we
obtain the outputs of Fast R-CNN as in [40] and filter out the redundant candidate
boxes by NMS; then, the kept proposals are fed into the mask branch to generate
the global maps and the character maps; finally, the predicted polygons are
obtained directly by calculating the contours of text regions on the global maps,
and the character sequences are generated by our proposed pixel voting algorithm
on the character maps.
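The final step above can be sketched as follows, assuming OpenCV (version 4) for contour extraction; the 0.5 binarization threshold for the global map is an assumption, as the paper does not specify it.

```python
import numpy as np
import cv2

def global_map_to_polygons(global_map, box, score_thresh=0.5):
    """global_map: (32, 128) probabilities for one RoI; box: its (x0, y0, x1, y1) in the image."""
    binary = (global_map > score_thresh).astype(np.uint8)
    # OpenCV 4 returns (contours, hierarchy).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x0, y0, x1, y1 = box
    h, w = global_map.shape
    polygons = []
    for contour in contours:
        pts = contour.reshape(-1, 2).astype(np.float32)
        # Map contour points from the 32x128 RoI frame back to image coordinates.
        pts[:, 0] = pts[:, 0] * (x1 - x0) / w + x0
        pts[:, 1] = pts[:, 1] * (y1 - y0) / h + y0
        polygons.append(pts)
    return polygons
```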
Pixel Voting We decode the predicted character maps into character sequences
by our proposed pixel voting algorithm. We first binarize the background map,
where the values are from 0 to 255, with a threshold of 192. Then we obtain