Demystifying SegNet
Written by: Sirawat Pitaksarit
The inner workings of SegNet are not very well documented. In this guide, I researched SegNet through observation and script digging. Along the way, we can also learn about neural networks on the implementation side, that is, how people actually implement them.
Demystifying SegNet
First of all if you follow the previous guide
Let’s look at SegNet’s output
What does SegNet Tutorial’s training data look like?
What’s wrong with the CamVid training data
Answering the questions
Questions not answered
Finally, how to train our own data?
Conclusion
First of all if you follow the previous guide
The demo’s Scripts/webcam_demo.py execution command is:

python Scripts/webcam_demo.py --model Example_Models/segnet_model_driving_webdemo.prototxt --weights Example_Models/segnet_weights_driving_webdemo.caffemodel --colours Scripts/camvid12.png
The trained weights are segnet_weights_driving_webdemo.caffemodel. The demo outputs colored images. Keep in mind that these images never contain the color black; we will come back to this later.
Let’s look at SegNet’s output
In the Segnet-Tutorial repository, if you look in Scripts/webcam_demo.py,

segmentation_ind = np.squeeze(net.blobs['argmax'].data)

is where we get the result from the final layer of the network. This result’s shape is (1, 1, 360, 480), which is then np.squeeze-d into (360, 480).
I tested the network, which is supposed to segment road scenes, with a close-up webcam picture of my face.
Printing the result and inspecting its .shape, the output looks like this:
[[ 1. 1. 1. ..., 0. 0. 0.]
[ 1. 1. 1. ..., 0. 0. 0.]
[ 1. 1. 1. ..., 0. 0. 0.]
...,
[ 1. 1. 1. ..., 5. 5. 5.]
[ 1. 1. 1. ..., 5. 5. 5.]
[ 1. 1. 1. ..., 5. 5. 5.]]
(360, 480)
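For reference, here is a minimal sketch of how this output can be reproduced with pycaffe. The preprocessing is paraphrased from Scripts/webcam_demo.py, so treat the details (input size, blob names, the hypothetical test image path) as assumptions to verify against the script:

import caffe
import cv2
import numpy as np

# model files from the demo command shown earlier
net = caffe.Net('Example_Models/segnet_model_driving_webdemo.prototxt',
                'Example_Models/segnet_weights_driving_webdemo.caffemodel',
                caffe.TEST)

frame = cv2.imread('my_face.png')                # hypothetical test image
frame = cv2.resize(frame, (480, 360))            # network input size (W, H)
input_image = frame.transpose((2, 0, 1))         # HWC -> CHW
input_image = np.asarray([input_image], dtype=np.float32)  # add batch dimension

net.forward_all(data=input_image)                # run inference
segmentation_ind = np.squeeze(net.blobs['argmax'].data)    # shape (360, 480)
print(segmentation_ind)
print(segmentation_ind.shape)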
Each value corresponds to one pixel, and the value is the class label. To see how many instances of each label exist in this output, we can use numpy.unique(A, False, False, True) to count them:
(array([ 0., 1., 2., 4., 5., 6., 7., 9., 10., 11.],
dtype=float32), array([51467, 19657, 297, 7994, 2120, 14383, 157,
70013, 6698, 14]))
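The same call in keyword form reads more clearly (using the segmentation_ind array from above; return_counts is the flag that was passed positionally as the last True):

import numpy as np

# count how many pixels were assigned to each class label
labels, counts = np.unique(segmentation_ind, return_counts=True)
for label, count in zip(labels, counts):
    print('label %2d: %d pixels' % (int(label), count))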
In this image there are many classes, ranging from label 0 to 11 (3 and 8 are absent in this image), and the frequency of class label 0 is the highest. The squeezed array is then replicated into 3 channels and passed through a lookup-table function (cv2.LUT) to add colors.
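The colorization step in Scripts/webcam_demo.py roughly does the following (paraphrased from the script; exact variable names and shapes may differ):

import cv2
import numpy as np

# label map from the network: shape (360, 480), values 0..11
segmentation_ind = np.squeeze(net.blobs['argmax'].data)

# replicate the single-channel label map into 3 channels so the
# lookup table can produce a 3-channel (BGR) color per pixel
ind_3ch = np.resize(segmentation_ind, (3, 360, 480))
ind_3ch = ind_3ch.transpose(1, 2, 0).astype(np.uint8)

# camvid12.png is the lookup table: column x holds the color for label x
label_colours = cv2.imread('Scripts/camvid12.png').astype(np.uint8)

segmentation_rgb = cv2.LUT(ind_3ch, label_colours)

Now look at what SegNet can segment.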
You can see in the image:
The purple area is the largest (count: 70013) and corresponds to vehicle, which is label #9 (somehow my face looks like a vehicle). The second largest is grey, label #0, sky (count: 51467).
We can conclude that these numbers align exactly with the horizontal legend that the SegNet web demo provides, running from #0 to #11. There is no vermillion color in my image; vermillion is label #3, “Road Marking”, and the array debug shows no value 3 (it skips from 2 to 4, as you can see). This proves that segnet_weights_driving_webdemo.caffemodel is indeed the same model as their web demo.
One interesting point is that the output from this model never contains “Unknown”. This means segnet_weights_driving_webdemo.caffemodel has been trained with only known labels. To verify this, look at the Scripts/camvid12.png that we used for lookup-table color matching.
The mouse is hovering over the final light blue color, at X=11. This means values 0~11 in the network’s answer get mapped to these colors. To get the color black, the answer would need to be 12 or higher, which never happens with segnet_weights_driving_webdemo.caffemodel. All of this fits together well, but in the next section I encountered more confusing things.
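A quick way to verify this programmatically is to read the lookup image directly (assuming it is laid out with one color per label along the X axis, as described above):

import cv2

lut = cv2.imread('Scripts/camvid12.png')
print(lut.shape)       # the width should cover at least labels 0..11
print(lut[0, 11])      # BGR color for label 11, the final light blue
if lut.shape[1] > 12:
    print(lut[0, 12])  # expected to be black [0 0 0] beyond the known labels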
What does SegNet Tutorial’s training data look like?
On this page, http://mi.eng.cam.ac.uk/projects/segnet/tutorial.html, the author shows how to train SegNet from scratch with the provided CamVid data, which you can get together with the Segnet-Tutorial repo.
The author states that training will end with 88.6% global test accuracy. You can follow the tutorial to learn how Caffe works for training/testing/evaluation. However, to train your own data or create your own network, you need to understand how this works; then you will be able to prepare your own data.
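Training in the tutorial is launched with Caffe’s command-line tool, along the lines of the following (paths depend on your checkout; check the tutorial page for the exact command):

./caffe-segnet/build/tools/caffe train -gpu 0 -solver Models/segnet_solver.prototxt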
In this section, I will prove that finishing the training in the tutorial will not produce the same model as segnet_weights_driving_webdemo.caffemodel (their web demo) that I tested in the previous section.
From the previous section we understand the input. Each training example for SegNet is a pair of images: the image data and the label. One such pair from Segnet-Tutorial/CamVid/train and trainannot looks like this:
The second image is not a black image. Each pixel is in greyscale format and has been labeled with a number. This corresponds to the answer we got earlier, as the network directly uses these numbers as the target answer in the final layer. The whole network will try to adjust its weights and biases to produce these numbers as the answer for each pixel. If you repeatedly train with these almost-black images, you will get a network that can predict these small numbers. Then you can map the answers to colors to get something meaningful.
Let’s look at some numbers in this label image.
As these numbers are very small (for example R,G,B values of [0,0,0], [2,2,2], [3,3,3]), the image looks nearly black. Using Photoshop’s Curves to increase the contrast between colors reveals some interesting data. In Photoshop’s “Info” panel, we can see the original value before applying Curves. The red arrow is where the Info panel is sampling. (My screen capture program cannot capture the mouse pointer.)
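Alternatively, you can skip Photoshop and inspect the raw label values directly (the filename here is hypothetical; substitute any image from CamVid/trainannot):

import cv2
import numpy as np

# read a label image as a single-channel greyscale array
label = cv2.imread('CamVid/trainannot/0001TP_006690.png', cv2.IMREAD_GRAYSCALE)

# which label values appear in this image, and how many pixels carry each one
print(np.unique(label, return_counts=True))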