This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2019.2892076, IEEE
Transactions on Visualization and Computer Graphics
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. XX, NO. X, MONTH YEAR
1
Scribble based 3D shape segmentation via
weakly-supervised learning
Zhenyu Shu, Xiaoyong Shen, Shiqing Xin*, Qingjun Chang, Jieqing Feng, Ladislav Kavan, and Ligang Liu
Abstract—Shape segmentation is a fundamental problem in shape analysis. Previous research shows that prior knowledge helps to
improve the segmentation accuracy and quality. However, completely labeling each 3D shape in a large training data set requires a
heavy manual workload. In this paper, we propose a novel weakly-supervised algorithm for segmenting 3D shapes using deep
learning. Our method jointly propagates information from scribbles to unlabeled faces and learns deep neural network parameters.
Therefore, it does not rely on completely labeled training shapes and only needs a really simple and convenient scribble-based partially
labeling process, instead of the extremely time-consuming and tedious fully labeling processes. Various experimental results
demonstrate the proposed method’s superior segmentation performance over the previous unsupervised approaches and comparable
segmentation performance to the state-of-the-art fully supervised methods.
Index Terms—3D shapes, Segmentation, Scribble, Weakly-supervised, Deep learning
!
1 INTRODUCTION
A Utomatic segmentation of 3D shapes is a fundamental
problem in shape analysis [1]. It is also central to many
computer graphics problems, including mesh parameteri-
zation [2], skeleton extraction [3], resolution modeling [4],
shape retrieval [5] and so on. A commonly used idea in
traditional segmentation algorithms is to first map all the tri-
angular faces on 3D shapes, which are usually represented
by 3D triangular meshes, into feature space using shape
signatures, then clustering the faces on a mesh into several
clusters in the feature space, and finally transforming the
result onto the dual graph of the mesh to obtain the final seg-
mentation result. Earlier segmentation work [1], [6], [7] does
not utilize any prior knowledge and focuses on employing
unsupervised methods to cluster faces in the feature space
to obtain the desired results. Recent work [8], [9], [10] uses
prior knowledge to improve performance by classifying
faces in the feature space using supervised methods.
There are two trends for improving the performance of
3D shape segmentation. One trend is to use multiple shape
signatures simultaneously to improve the ability to describe
a shape’s geometric features from different perspectives, as a
single signature can only characterize a shape from a specific
perspective.
• Zhenyu Shu and Qingjun Chang are with School of Computer and
Data Engineering, Ningbo Institute of Technology, Zhejiang University,
Ningbo, PR China.
E-mail: shuzhenyu@nit.zju.edu.cn (Zhenyu Shu)
• Xiaoyong Shen is with Department of Computer Science and Engineering,
•
The Chinese University of Hong Kong, HongKong.
Shiqing Xin is with School of Computer Science and Technology, Shan-
Dong University, Jinan, PR China. Corresponding author.
E-mail: xinshiqing@163.com (Shiqing Xin)
Jieqing Feng is with State Key Lab of CAD&CG, Zhejiang University,
Hangzhou, PR China.
Ladislav Kavan is with School of Computing, University of Utah, Salt
Lake City, USA.
Ligang Liu is with Graphics & Geometric Computing Laboratory, School
of Mathematical Sciences, University of Science and Technology of China,
Anhui, PR China.
•
•
•
Manuscript received month day, year; revised month day, year.
Fig. 1. Some representative results of our approach.
The other trend is to use and learn from prior knowl-
edge, such as training data, to improve the faces’ classifi-
cation results in feature space. These two techniques work
very well and are used in the state-of-the-art segmentation
systems [8], [9], [10], [11], [12]. Due to their nature, most
existing learning-based 3D shape segmentation methods
rely heavily on high-quality training data to gain a signif-
icant performance improvement. However, preparing high-
quality training data for 3D shape segmentation is quite
expensive because manually labeling each face on a 3D
shape is a very tedious and time-consuming task, even
though some tools [13] have been developed to accelerate
the labeling process. As an example, preparing the training
data for 380 3D shapes in the Princeton Shape Benchmark
dataset, 12 volunteers took about 10 hours to complete the
manual labeling.
In this paper, we propose an effective weakly-supervised
method for shape segmentation based on deep learning to
address the problem of tedious manual labeling. In our
method, we use sparse scribble-based labels, rather than the
complete high-quality labels to train our 3D shape segmen-
tation model. Our method does not need the tedious and
time-consuming labeling process for preparing training data
and therefore is much more practical than existing learning-
1077-2626 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Shanghai Jiaotong University. Downloaded on February 21,2020 at 10:35:33 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2019.2892076, IEEE
Transactions on Visualization and Computer Graphics
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. XX, NO. X, MONTH YEAR
2
Fig. 2. Workflow of our proposed approach for 3D shape segmentation.
based 3D shape segmentation approaches. Experimental
results on the Princeton Shape Benchmark dataset [13] show
that our method outperforms other state-of-the-art scribble-
based or non-scribble-based segmentation methods, and
can achieve comparable or only slightly worse perfor-
mance when compared with existing fully supervised deep
learning-based 3D shape segmentation methods. Figure 1
shows some representative results of our approach, and an
overview of our algorithm is shown in Figure 2.
The main contribution of this work is our novel shape
segmentation method using weakly-supervised deep learn-
ing techniques, eliminating the tedious and time-consuming
training data preparation processes.
The remaining parts of the paper are organized as
follows. Section 2 reviews the related work on 3D shape
segmentation. In Section 3, we give our novel scribble-
based deep training model and the overall framework of
our algorithm. After that, we show extensive experimental
results, as well as comparisons with the state of the art, in
Section 4. Finally, we discuss the limitations and future work
in Section 5 and conclude this paper in Section 6.
2 RELATED WORK
Shape segmentation [14], [15] aims at dividing a 3D shape
into meaningful parts and plays an important role in shape
analysis and shape understanding. Until now, many meth-
ods have been proposed to address this problem. A com-
mon practice is to first extract geometric feature vectors
and then apply some decomposition techniques to obtain
a satisfactory segmentation result: extreme learning ma-
chine [9], approximate convexity analysis [16], concavity-
aware fields [17], spectral clustering [18], K-Means [19], core
extraction [20], randomized cuts [21], normalized cuts [21],
graph cuts [21], [22], random walks [23]. The existing shape
segmentation methods can be mainly divided into two
categories, including unsupervised 3D shape segmentation
and supervised 3D shape segmentation.
Unsupervised 3D shape segmentation. Earlier work focuses
on segmenting one single 3D shape without prior knowl-
edge. For example, Sidi and van Kaick et al. [7] proposed
an unsupervised method that treats segmentation as a
clustering problem in a descriptor space. Golovinskiy and
Funkhouser [24] use graph clustering techniques to segment
a set of 3D shapes simultaneously. Their assumption is that
a global rigid alignment exists between the input shapes,
which facilitates iteratively establishing correspondence be-
tween respective parts. Huang and Koltun et al. [25] sug-
gested formulating the joint segmentation problem as an
integer quadratic programming problem. Experimental re-
sults show that segmenting a set of similar 3D shapes si-
multaneously significantly outperforms traditional segmen-
tation that targets a single shape. Hu and Fan et.al [26]
also presented an unsupervised approach for segmenting
a set of 3D shapes by over-segmenting the input shapes
into primitive patches and then grouping similar patches
via a subspace clustering scheme. Recently, Meng and Xia
et al. [6] proposed to improve the segmentation results by a
multi-label optimization process.
Compared with segmenting a single 3D shape, segment-
1077-2626 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Shanghai Jiaotong University. Downloaded on February 21,2020 at 10:35:33 UTC from IEEE Xplore. Restrictions apply.
PredictFeature DescriptorsPredictPredictWeakly supervised Deep LearningFeature DescriptorsFeature DescriptorsOver-segmentationOver-segmentationOver-segmentationFeature Vectors(Partly with Labels)Graph CutsGraph CutsGraph CutsSegmentation on training shapesPredictPredictGraph CutsGraph CutsSegmentation on testing shapes
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2019.2892076, IEEE
Transactions on Visualization and Computer Graphics
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. XX, NO. X, MONTH YEAR
3
ing a family of 3D shapes in the same class can achieve
better results, assuming that their underlying relevance
is effectively utilized. Therefore, more work employing
this idea was proposed by being combining with various
techniques. Wu and Wang et al. [1] suggested a spectral
clustering method that generates consistent segmentation
by performing spectral clustering in a fused space of shape
descriptors. Xu and Li et al. [27] used anisotropic part scales
to part correspondence and their algorithm performs well
on a diverse set of shapes. Later, Zheng and Cohen-Or et
al. [28] suggested an indirect top-down approach to deal
with large shape variations.
A significant advantage of unsupervised 3D shape seg-
mentation is that it does not need any training data, which
can be expensive to prepare. However, due to the absence of
prior knowledge, their segmentation performance may not
be as good as supervised ones.
Supervised 3D shape segmentation. Because 3D shape seg-
mentation can actually be regarded as a face clustering
process in the feature space, a data-driven method may
be helpful to obtain better results. Therefore, more recent
work tries to further improve performance by utilizing
prior knowledge during segmentation. For example, Wang
and Asafi et al. [29] presented a semi-supervised learning
method where the user actively assists in the co-analysis
by iteratively providing inputs. Their method requires only
a sparse set of constraints to quickly converge toward
a consistent and error-free semantic labeling of the set.
Kalogerakis and Hertzmann et al. [8] proposed a supervised
approach to do labeling and segmentation simultaneously.
They showed that the segmentation performance can be
dramatically improved by learning techniques. Van Kaick
and Tagliasacchi et al.[30] introduced an approach to part
correspondence which incorporates prior knowledge im-
parted by a training set of pre-segmented, labeled models
and combines that knowledge with content-driven analysis
based on the geometric similarity between the matched
shapes. Recently, Guo and Zou et al. [10] train convolutional
neural networks to effectively classify faces in feature space
to segment 3D shapes. Supervised segmentation methods
usually outperform unsupervised ones. However, they often
require a huge set of manually labeled segmentation results,
which greatly limits their applicability.
Existing learning-based segmentation methods [8], [9],
[10] need a fully labeled training shapes to train a learning
model. Due to the limited generalizability of learning-based
methods, the prediction models must be learned by using
3D training shapes similar to the testing 3D shapes, to
achieve satisfactory segmentation results. For example, to
segment a chair shape, one has to construct a prediction
model by learning from training shapes of chairs. If we want
to segment a variety of 3D shapes using learning-based
methods, we have to manually prepare many different train-
ing 3D shapes, which usually require a different set of train-
ing data for different types of 3D shapes, where each face
on the 3D shapes needs to be labeled. Therefore, expanding
the capabilities of the system to different types of shapes
becomes prohibitively expensive in terms of the manual
labor required for producing the training data. Although
a tool [13] for manually segmenting 3D shapes has been
provided by the computer graphics research community, the
Fig. 3. Some examples of weakly labeled shapes with scribbles (top).
The labeling information is automatically propagated to unlabeled parts
on the 3D shapes by our algorithm (bottom). Note that we use different
colors to show disconnected segments even if they have the same label.
labeling process is still quite tedious and time-consuming.
To reduce the time of training data preparation, we pro-
pose a novel optimization model to remove the necessity of
fully labeled training 3D shapes. With our new optimization
model, only some sparsely labeled training 3D shapes are
required for the learning process. The labeling information
will be automatically propagated to the unlabeled parts
of the training 3D shapes and given as a part of the
optimization results in our algorithm. Figure 3 shows the
weakly labeled training 3D shapes used as the input of our
algorithm and the corresponding propagation results.
To our best knowledge, we are the first to introduce
weakly-supervised deep learning techniques to 3D shape
segmentation. It is worth pointing out that the distinguish-
ing feature of our method is that it does not need fully
labeled training shapes for the learning process, which can
greatly reduce the expenses spent on training data prepa-
ration in existing learning-based segmentation method.
For existing learning-based segmentation methods, a large
number of similar fully labeled training shapes must be
prepared. That means the users must fully label a lot
of training shapes, which is quite expensive. Also, when
segmenting different target 3D shapes, different groups of
corresponding training 3D shapes are required for training
the learning model, which is a significant drawback of the
existing learning-based segmentation methods. Our method
removes the necessity of the tedious and time-consuming
training 3D shapes preparation process, and directly learns
from scribble-based weakly labeled training 3D shapes.
Therefore, our method is much more practical when com-
pared with the existing learning-based methods, because it
does not require fully labeled training 3D shapes.
3 SCRIBBLE-BASED LEARNING MODEL
3.1 Objective Function
We use a graphical model to propagate information from
scribbles to unknown faces. For acceleration, we over-
segment 3D shapes first, where each shape is divided into
1077-2626 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Shanghai Jiaotong University. Downloaded on February 21,2020 at 10:35:33 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2019.2892076, IEEE
Transactions on Visualization and Computer Graphics
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. XX, NO. X, MONTH YEAR
4
primitive patches, such as the ones achieved in [21]. We then
build a dual graph of the primitive patches. Each primitive
patch is regarded as a vertex in the dual graph and an edge
in the graph represents the similarity between two patches.
Given a training 3D shape M, the scribble labels of M are
S = {sk, lk} where sk are the faces of a scribble k and lk
is that scribble’s category label. For a primitive patch xi, we
want to find its category label yi.
In order to merge the initial small patches on the sur-
face of a 3D shape into resulting segments, we have three
considerations.
Ei
First, the segmentation results should respect the scribble
labels provided by users and regard them as the ground
truth. Therefore, we use the following unary energy term
Ei
scr (yi) to ensure the optimization results comply with the
training labels provided by scribbles.
if yi = lk and xi ∩ sk = ∅
if yi ∈ {lk} and xi ∩ S = ∅
otherwise
0
1|{lk}|
− log
∞
scr (yi) =
The meaning of Ei
(1)
scr (yi) is the following: If a primitive
patch xi overlaps with a scribble sk, then its corresponding
penalty is zero when assigned lk or infinity when the
assigned label is not lk. Otherwise, xi can be assigned any
labels with equal probability if it does not overlap with any
scribble. However, it can never be assigned any labels that
do not occur in this 3D shape.
Second, the segmentation results should conform to the
predicted results of our trained model, i.e., a deep neural
network. We use a unary term Ei
neu (yi, Θ) to respect the
output of the deep neural network and define it as:
neu (yi, Θ) = − log P (yi |X, Θ ) .
Ei
(2)
Here Θ represents the parameters of the deep neural net-
work. log P (yi |X, Θ ) denotes the log probability of predict-
ing xi to have the label yi. It can be computed by summing
the log probabilities of all faces in the patch xi.
Finally, faces with similar geometric features should be
merged into one common segment. To measure the simi-
larity between two primitive patches in the feature space,
we define a pairwise term Ei,j
f ea (yi, yj) in our optimization
problem as follows:
−f (xi)−f (xj )2
2
δ2
exp
0
if yi = yj
otherwise
(3)
Ei,j
f ea (yi, yj) =
where f (xi) and f (xj) represent the corresponding feature
vectors of primitive patches xi and xj, respectively. The
purpose of this term is to ensure that two primitive patches
should be very different in the feature space if they have
different labels.
With the above considerations, we give the following ob-
jective function to minimize by our scribble-based learning
model:
E = Escr (Y ) + Eneu (Y, Θ) + λEf ea (Y ) ,
where Y
scr (yi), Eneu (Y, Θ) =
denotes
the
Ei
(4)
{yi}, Escr (Y ) =
neu (yi, Θ) and Ef ea (Y ) =
Ei
set of
i
Ei,j
f ea (yi, yj).
i
i,j
There are two sets of variables to be optimized, which
are primitive patches’ category labels Y and the deep neural
network’s parameters Θ, such that the above objective func-
tion is minimized to get the trained deep learning models. In
this paper, λ is empirically set to be 1.0. Equation (4) is only
for one 3D shape. Our final energy function sums Equation
(4) over all 3D shapes in the same category.
3.2 Optimization
The optimization problem proposed in the previous section
can be effectively solved by alternating optimization (also
known as block coordinate descent). We fix Θ to solve for Y
and then fix Y to solve for Θ.
Propagating users’ scribble information to unlabeled faces.
When Θ is fixed, the solver propagates labels to unmarked
faces based on users’ scribble, geometric features, and
deep neural network predictions. In this step, the objec-
tive function can be effectively minimized by using graph
cuts algorithm [31] because the objective function in Equa-
tion (4) can be seen as a linear combination of the unary
neu (yi, Θ)) and the pairwise term
f ea (yi, yj)). This formulation is widely used for 3D
scr (yi) +
term (
(
i
Ei,j
Ei
Ei
i
i,j
shape segmentation [6], [9], [10].
Optimizing deep neural network parameters. When Y is
fixed, minimizing the objective function turns into training
a deep neural network and the solver learns a deep neural
network for face-wise 3D shape segmentation. Given the
labels Y for all faces on 3D shapes, the deep neural net-
work can be trained by regarding feature vectors and label
information of all faces as input vectors and output proba-
bility vectors respectively. The last layer of the deep neural
network outputs the face-wise log probabilities, which are
used to update the unary term in our energy function.
Our solver is initialized from the graph cuts step without
using network prediction and then alternates between the
two steps. In our experiments, the neural network contains
4 fully connected layers and the numbers of neurons in
the two hidden layers are experimentally tuned to best fit
the training set. The first layer and the last layer contains
d and c neurons respectively, where d is the dimension of
input feature vectors and c is the number of label categories
provided by scribbles. The rectified linear unit [32] is em-
ployed as the activation function for each layer except the
last layer. The last layer of our neural network is a softmax
classifier producing the probabilities of each faces belonging
to different categories. Our solver stops iterations when the
total energy does not decrease anymore.
Note that here the graph cuts step is applied on the patch
level, while the deep neural network is directly trained on
the face level. Theoretically, we can change our objective
function so that our method can apply the graph cuts step
and the training step on the face level. However, applying
graph cuts on the patch level instead of the face level can
further reduce the computation time. Therefore, we apply
graph cuts algorithm on the patch level (usually 50 patches)
instead of the face level in our method.
Figure 4 shows the optimization process of our algorithm
for an ant model. In this figure, we can see that the interme-
diate segmentations improve with each successive iteration.
1077-2626 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Shanghai Jiaotong University. Downloaded on February 21,2020 at 10:35:33 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2019.2892076, IEEE
Transactions on Visualization and Computer Graphics
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. XX, NO. X, MONTH YEAR
5
(a) Scribble
(b) Iteration 1
(c) Iteration 2
(d) Iteration 3
(e) Final result
Fig. 4. The optimization process for an ant model. With the decreasing value of the objective function, the intermediate segmentation result of each
iteration becomes better and better. The solver converges after 4 iterations. The top row and the bottom row show the ant model from the top view
and bottom view respectively. Average geodesic distance, shape diameter function and Gaussian curvature are used as feature descriptors.
In each iteration, the labeling information is learned from
the result of the graph cut by the deep neural network, and
then used to more accurately predict the labeling informa-
tion. As the deep neural network gets updated, the labeling
information from the network becomes more reliable, im-
proving the accuracy propagated labeling information from
the graph cuts. This propagated labeling information in turn
improve the deep neural network learning. In our exper-
iments, we empirically find that the optimization process
converges after four iterations and the satisfactory result is
obtained.
3.3 Shape signatures selection
To extract feature vectors for faces and primitive patches,
four widely known feature descriptors, including average
geodesic distance (AGD) [33], shape diameter function
(SDF) [3], Gaussian curvature (GC) [34] and scale-invariant
heat kernel signatures (SIHKS) [35], are selected in this
paper. These feature descriptors are deemed to have the
capability of characterizing the geometric properties well
from different perspectives. For each face, we concatenate
the values (for AGD, SDF, and GC) and vectors (for SIHKS,
the dimension is experimentally set to be 19) together and
obtain a corresponding feature vector, whose dimension is
22, to be used in Equation (2). For each primitive patch, a
different strategy is adopted for extracting feature vectors.
For scalar fields defined on each face, such as AGD, SDF, and
GC, we use histograms capturing the feature distribution
to extract the feature vectors for each primitive patch. For
SIHKS, which is defined as a vector field on each face, we
extract feature vectors by employing the famous bag-of-
feature (BoF) technique for each primitive patch. We note
that the number of bins in the histograms and that of the
bags for the bag-of-feature representation are both set to be
B = 100. Like the face-level situation, the feature vectors
from different shape signatures are also concatenated into
a single feature vector to be used in Equation (1) and (3).
In this way, any feature descriptor can be adapted into our
algorithm framework.
3.4 Algorithm
Our algorithm works on a set of similar 3D shapes and
it consists of five stages, as illustrated in Figure 2. First,
similar to the super-pixel based image segmentation [36],
[37], we divide each shape into primitive patches for speed-
up, like in [21]. Second, we calculate feature vectors for each
patch, as described in Section 3.3. Then in the next stage,
we manually scribble the shapes and get the weakly-labeled
training data for our segmentation method. After that, with
the scribble-based labels and the extracted feature vectors
for each face and primitive patch, a deep learning model is
constructed and trained to predict the label of each primitive
patch in the shapes. Finally, the learned model is applied
to the shapes to obtain the segmentation results. We apply
graph cuts algorithm [6], [31] to further improve the quality
of the segmentation.
Our algorithm can be summarized as follows. Given a
set of manually scribbled training 3D shapes,
1)
2)
For each training 3D shape, divide it into several
primitive patches.
For each primitive patch, calculate corresponding
feature vectors.
3) Minimize the objective function (4) using alternating
optimization.
a) Apply graph cuts technique to get an initial
b)
guess of Y .
Fix Y and train deep neural network param-
eters Θ to minimize the objective function
(4).
1077-2626 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Shanghai Jiaotong University. Downloaded on February 21,2020 at 10:35:33 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2019.2892076, IEEE
Transactions on Visualization and Computer Graphics
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. XX, NO. X, MONTH YEAR
6
Parameter settings. In this paper, the weight coefficient
λ in the optimization problem (4) is set to 1.0. In our
experiments, we tried various choices of L. We show the
average Rand Index scores with different values of L on
PSB in Figure 5. It can be seen that the Rand Index score
gets worse when L is far less than 50 because small parts
cannot be captured very well by our method in this case.
However, if L is far larger than 50, too many patches will
lead our method to be too inefficient. Based on the above
observation, the number of patches for over-segmenting is
experimentally set to L = 50.
4.1 Results
Our method is a weakly-supervised one and the training
dataset only requires partially annotated 3D shapes. It is
worth pointing out that, although the 3D shapes in the
training set only contain partially annotated information
(obtained from scribbles) instead of fully annotated ones,
all the 3D shapes in the training set are automatically fully
annotated by our learning process as a by-product of our
algorithm. Therefore, our algorithm can produce two sets of
segmentation results. One set of results are the segmentation
results on the testing set, obtained by applying our learned
model to the testing set. The other set of results are the
segmentation results on the training set, obtained from the
learning process of our method.
Figure 6 shows the segmentation results on the training
shapes by using our method for some representative cate-
gories of 3D shapes in PSB. Figure 7 shows the segmentation
results on the testing shapes by using our method on the
same dataset. Although there are large shape variations,
a large majority of the segmentation results are desirable
and consistent with our perception. The Rand Index score
statistics of our segmentation for training shapes on the PSB
dataset, as well as those of other methods, are detailed in
Table 1. These were obtained by first computing the average
Rand Index score for each category respectively, and then
averaging over all categories in the corresponding dataset.
From Table 1 we can see that our algorithm obtains an
average Rand Index of 0.076, which outperforms related
algorithms.
4.2 Sensitivities to scribble quality
The performance of our algorithm relies on the scribbles
provided by users. To test the sensitivity of our algorithm
to scribble quality, we compute the Rand Index scores of
our algorithm for each 3D shape in PSB, while varying the
percentage of the surface covered by the scribbles.. For a
3D shape, let S be the set of faces covered by the scribbles
provided by our volunteers. We gradually remove faces
from S to reduce the coverage of scribbles. For each step, we
randomly remove faces from S and get the corresponding
Rand Index scores from our algorithm. Figure 8 shows the
corresponding Rand Index scores after averaging over the
whole PSB. We can see that although the average Rand In-
dex score decreases rapidly when more scribble information
provided by users, the score is already very small even when
the scribbles provided by the user only cover 1.0% area of
the whole surfaces of 3D shapes.
Fig. 5. The average Rand Index scores with regard to L on the PSB
dataset, where lower values indicate better results.
c)
Fix Θ and optimize Y to minimize the objec-
tive function (4).
d) Repeat step (3b) and step (3c) until conver-
gence.
The results of our algorithm include not only segmented
training 3D shapes but also a neural network which can be
used to predict and segment test 3D shapes.
4 EVALUATION
Experimental dataset. We test our segmentation algorithm on
the Princeton Segmentation Benchmark (PSB) dataset [13]
and ShapeNetCore dataset [12], [38], which are both open
datasets for 3D shape segmentation. The PSB dataset con-
tains 19 different object categories and each category con-
tains 20 3D shapes. For PSB dataset, we use the ground-
truth provided by [8] for evaluation. For ShapeNetCore
dataset, we use the ground-truth provided by [38], which
are converted from point-based segmentation results pro-
vided by [12]. It is worth pointing out that most of the
3D shapes in the ShapeNetCore dataset are non-manifold
while our algorithm needs a manifold shape as the input.
Therefore, we resample and reconstruct all the shapes in
the dataset so that all 3D shapes become manifold. We
developed a simple software tool to facilitate the labeling
process and employed 3 volunteers to scribble the training
3D shapes. Some examples of scribble results are shown in
Appendix.
Evaluation metrics. We use four metrics, Rand Index,
Cut Discrepancy, Hamming Distance and Consistency Error,
which are defined by [13] to evaluate the segmentation
results. Rand Index is used to compute the similarity be-
tween two segmentations. Cut Discrepancy measures the
similarity between different cuts by comparing the bound-
aries. As opposed to Cut Discrepancy, Hamming Distance
compares the regions and evaluates the minimum number
of substitutions required when changing one region into
another. Consistency Errors, including the global one (GCE)
and the local one (LCE), measure the hierarchical differences
between two segmentations of the same shape.
1077-2626 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Shanghai Jiaotong University. Downloaded on February 21,2020 at 10:35:33 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2019.2892076, IEEE
Transactions on Visualization and Computer Graphics
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. XX, NO. X, MONTH YEAR
7
(a) AirPlane
(b) Table
(c) Teddy
(d) Cup
Fig. 6. Representative segmentation results produced by our approach (on training shapes) on the Princeton segmentation benchmark dataset.
(a) AirPlane
(b) Table
(c) Teddy
(d) Cup
Fig. 7. Representative segmentation results produced by our approach (on testing shapes) on the Princeton segmentation benchmark dataset.
The Rand Index scores of segmentation for each category of 3D shapes in PSB dataset with different methods.
TABLE 1
Human
Cup
Octopus
Ant
Chair
Glasses
Airplane
Object Categories Ours
0.064
0.021
0.048
0.081
0.007
0.029
0.022
0.014
0.068
0.078
0.050
0.100
0.092
0.065
0.222
0.128
0.106
0.101
0.148
0.076
Table
Teddy
Hand
Plier
Fish
Bird
Bust
Mech
Bearing
Vase
Fourleg
Average
Armadillo
PMC WcSeg
0.059
0.084
0.125
0.089
0.124
0.167
0.070
0.108
0.009
0.057
0.066
0.116
0.026
0.066
0.025
0.056
0.040
0.294
0.103
0.220
0.078
0.110
0.144
0.471
0.094
0.212
0.195
0.076
0.189
0.256
0.064
0.132
0.087
0.159
0.109
0.351
0.139
0.175
0.162
0.098
RandCuts NormCuts
RandWalks
Kmeans
0.117
0.268
0.100
0.129
0.041
0.172
0.093
0.413
0.050
0.132
0.119
0.307
0.105
0.093
0.227
0.348
0.160
0.115
0.182
0.167
0.126
0.432
0.137
0.200
0.083
0.081
0.101
0.260
0.122
0.180
0.175
0.450
0.183
0.126
0.327
0.367
0.250
0.285
0.201
0.215
0.162
0.443
0.246
0.249
0.076
0.143
0.103
0.274
0.113
0.194
0.206
0.429
0.220
0.117
0.330
0.388
0.303
0.301
0.232
0.238
0.136
0.559
0.195
0.236
0.109
0.193
0.142
0.465
0.177
0.218
0.179
0.476
0.182
0.108
0.352
0.505
0.322
0.389
0.197
0.270
SDF
0.133
0.384
0.197
0.083
0.009
0.075
0.041
0.174
0.043
0.199
0.368
0.263
0.104
0.076
0.308
0.242
0.105
0.212
0.127
0.165
FitPrim CoreExtra
0.139
0.488
0.223
0.179
0.100
0.198
0.132
0.309
0.130
0.234
0.170
0.477
0.191
0.088
0.315
0.417
0.202
0.271
0.189
0.234
0.199
0.309
0.260
0.272
0.056
0.166
0.048
0.210
0.103
0.166
0.067
0.303
0.111
0.165
0.281
0.335
0.380
0.182
0.164
0.199
In order to obtain a desirable segmentation result, there
are two key aspects. On the one hand, users’ input scribbles
should be representative, i.e., a scribble should be able to
reflect the key geometric features or the overall geometric
shape of the target area dominated by the scribble. On
the other hand, the segmentation algorithm should be suf-
ficiently intelligent to enable different scribbles to snatch
their own regions while maintaining a regional balance.
Generally speaking, the two aspects are closely related and
inseparable. If the geometric shapes of different patches
are much different from each other, then it decouples the
requirements of input scribbles. For example, for the desk
model in Figure 9, whether long scribbles or short scribbles,
the segmentation results are identical. However, when the
geometric features of different patches are quite similar,
scribbles that are more representative have to be required.
If the input scribbles are very short and not representative
(see the top row of Figure 10), our algorithm may report
a low-quality segmentation result. But generally speaking,
our algorithm is able to yield a high-quality segmentation
result as long as the input scribbles are not intentionally
bad. See the bottom row of Figure 10.
1077-2626 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Shanghai Jiaotong University. Downloaded on February 21,2020 at 10:35:33 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2019.2892076, IEEE
Transactions on Visualization and Computer Graphics
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. XX, NO. X, MONTH YEAR
8
Fig. 8. The average Rand Index scores of our segmentation results on
training shapes when using different percent of coverage with scribbles.
Fig. 9. Segmentation results produced by using short and long scribbles
for a Table model. The top line shows short scribbles and the corre-
sponding segmentation results. The bottom line shows long scribbles
and the corresponding segmentation results.
4.3 Comparison
We compare our method with 12 other segmentation al-
gorithms including Paint mesh cutting (PMC, [39]), ap-
proximate convexity analysis (WcSeg, [16]), Randomized
cuts [21], Normalized cuts [21], Random walks [23], K-
Means [19], SDF [3], Fitting primitive [40], Core extrac-
tion [20], FSDL [10], the method of Kalogerakis et al.
(SB19, [8]) and Extreme Learning Machine (ELM, [9]) on
the PSB database. All the related segmentation algorithms
are compared against human-generated segmentations (pro-
vided by [8]). Table 1 shows the comparison of results
from our method on training shapes and other algorithms
using Rand Index. Additionally, we also use the other four
metrics for evaluation and the detailed statistics plots are
shown in Figure 11. It can be seen that our segmentation
method on training shapes outperforms other segmentation
approaches no matter what kind of evaluation metric is
used, except the WcSeg method. When comparing with the
WcSeg method, our algorithm is also better for all metrics
except Consistency Error. No matter what kind of metric is
used, lower values mean closer similarity to the human-
generated ground truth. Figure 12 provides a qualitative
comparison on the Teddy model, the Cup model, and the
Armadillo model.
Comparison with another scribble-based method. Table 1
shows the comparison between our method and Paint mesh
Fig. 10. Segmentation results produced by using short and long scrib-
bles on a Bird model. The top line shows short scribbles and the cor-
responding segmentation results. The bottom line shows long scribbles
and the corresponding segmentation results.
cutting (PMC, [39]), which is also a 3D shape segmentation
algorithm based on users’ scribbles, with Rand Index score.
From the results, we can see that our method outperforms
PMC for 16 out of 19 categories. Furthermore, the average
Rand Index score of our method for all categories is lower
than PMCs’. Additionally, Figure 11 shows the superior
performance of our method compared to PMC.
Comparison with fully supervised deep learning method. We
compare our method with three other 3D shape segmenta-
tion methods (FSDL, [10], SB19, [8], ELM, [9]), which em-
ploys fully supervised deep learning techniques to segment
3D shapes. Table 2 shows the Rand Index scores for each
category of 3D shapes in PSB using these four methods.
Note that the results from our method on testing shapes
and other methods are obtained by regarding 19 3D shapes
in each category (there are 20 3D shapes in each category in
total) as training set and the one left out 3D shape is used
as testing set. From the results, we can see that Rand Index
scores of our method on training shapes for 7 categories are
better than FSDLs’ and worse for the other 12 categories.
The Rand Index scores of our method on testing shapes
for 6 categories is better than FSDLs’ and worse for the
other categories. The average performance of our method is
worse than FSDL. Similarly, SB19 also achieve slightly better
performance than our method. We believe the main reason
is that our method is weakly-supervised and utilizes much
fewer human labeled training samples than FSDL. Figure 13
shows a qualitative comparison between our method and
FSDL on the Fish model, the Bird model, and the Armadillo
model.
Additionally, we compare our method with the state-of-
the-art method PFCN proposed in [38] on ShapeNetCore
dataset ([12]), which is also fully supervised. The method
combines image-based fully convolutional networks and
surface-based conditional random fields to obtain coherent
segmentations of 3D shapes. The ShapeNetCore dataset
contains 4916 3D shapes totally, where the most 3D shapes
in the dataset are non-manifold. Because our method can
only be applied to manifold 3D shapes, we remesh all the
non-manifold shapes in ShapeNetCore and convert them
to manifold version. The training and testing setting used
in the two methods follows the one provided by PFCN.
Table 3 shows the Rand Index scores of two methods for
14 categories of 3D shapes in the manifold ShapeNetCore
dataset. We can see that Rand Index scores of our method on
training shapes for 5 categories are better than PFCNs’ and
worse for the other 9 categories. The Rand Index scores of
1077-2626 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Shanghai Jiaotong University. Downloaded on February 21,2020 at 10:35:33 UTC from IEEE Xplore. Restrictions apply.