A Year in Computer Vision
The M Tank
Website: http://themtank.org/
Contact: info@themtank.com
Note: This document is intended for educational purposes only. Any information
contained within is representative of the editors’ professional views. This piece
cites a number of academic publications, for which references are provided where
appropriate.
Edited for The M Tank by
Benjamin F. Duffy
&
Daniel R. Flynn
Table of Contents

Introduction
Part One: Classification/Localisation, Object Detection, Object Tracking
    Classification/Localisation
    Object Detection
    Object Tracking
Part Two: Segmentation, Super-res/Colourisation/Style Transfer, Action Recognition
    Segmentation
    Super-resolution, Style Transfer & Colourisation
    Action Recognition
Part Three: Toward a 3D understanding of the world
    Other uncategorised 3D
    In summation
Part Four: ConvNet Architectures, Datasets, Ungroupable Extras
    ConvNet Architectures
    Datasets
    Ungroupable extras and interesting trends
Conclusion
Introduction
Computer Vision typically refers to the scientific discipline of giving machines the ability
to see or, perhaps more colourfully, enabling machines to visually analyse their
environments and the stimuli within them. This process typically involves the evaluation
of an image, images or video. The British Machine Vision Association (BMVA) defines
Computer Vision as “the automatic extraction, analysis and understanding of useful
information from a single image or a sequence of images.” 1
The term understanding provides an interesting counterpoint to an otherwise
mechanical definition of vision, one which serves to demonstrate both the significance
and complexity of the Computer Vision field. True understanding of our environment is
not achieved through visual representations alone. Rather, in a highly stylised sense,
visual cues travel through the optic nerve to the primary visual cortex, where they are
interpreted by the brain. The interpretations drawn from this sensory information encompass the
near-totality of our natural programming and subjective experiences, i.e. how evolution
has wired us to survive and what we learn about the world throughout our lives.
In this respect, vision relates only to the transmission of images for interpretation, while
computing said images is more analogous to thought or cognition, drawing on a
multitude of the brain’s faculties. Hence, many believe that Computer Vision, a true
understanding of visual environments and their contexts, paves the way for future
iterations of Strong Artificial Intelligence, due to its cross-domain mastery.
However, put down the pitchforks: we’re still very much in the embryonic stages of
this fascinating field. This piece simply aims to shed some light on 2016’s biggest
Computer Vision advancements, and hopefully to ground some of these advancements
in a healthy mix of expected near-term societal interactions and, where applicable,
tongue-in-cheek prognostications about the end of life as we know it.
While our work is always written to be as accessible as possible, sections within this
particular piece may be oblique at times due to the subject matter. We do provide
rudimentary definitions throughout; however, these convey only a facile understanding
of key concepts. In keeping our focus on work produced in 2016, omissions are often
made in the interest of brevity.
One such glaring omission relates to the functionality of Convolutional Neural Networks
(hereafter CNNs or ConvNets), which are ubiquitous within the field of Computer Vision.
1 British Machine Vision Association (BMVA). 2016. What is computer vision? [Online] Available at:
http://www.bmva.org/visionoverview [Accessed 21/12/2016]
The success of AlexNet2 in 2012, a CNN architecture which blindsided ImageNet
competitors, proved the instigator of a de facto revolution within the field, with numerous
researchers adopting neural network-based approaches as part of Computer Vision’s
new period of ‘normal science’.3
Over four years later, CNN variants still make up the bulk of new neural network
architectures for vision tasks, with researchers reconstructing them like Lego bricks; a
working testament to the power of both open source information and Deep Learning.
However, an explanation of CNNs could easily span several postings and is best left to
those with deeper expertise on the subject and an affinity for making the complex
understandable.
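
For flavour, that Lego-like recombination can be shown in a handful of lines. The
following is a minimal, self-contained sketch of a small CNN classifier, written in
PyTorch purely for illustration; it is our own toy example and is not drawn from any of
the works cited here:

```python
import torch
import torch.nn as nn

# A minimal CNN image classifier, illustrating the interchangeable
# "building blocks" (convolution, activation, pooling, fully connected
# layers) that researchers recombine. A toy sketch for illustration only.
class TinyConvNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3-channel RGB input
            nn.ReLU(),
            nn.MaxPool2d(2),                             # halve spatial resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 inputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Example: a batch of four 32x32 RGB images -> four class-score vectors.
logits = TinyConvNet()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```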
For casual readers who wish to gain a quick grounding before proceeding we
recommend the first two resources below. For those who wish to go further still, we
have ordered the resources below to facilitate that:
● What a Deep Neural Network thinks about your #selfie from Andrej Karpathy
is one of our favourites for helping people understand the applications and
functionalities behind CNNs. 4
● Quora: “what is a convolutional neural network?” - Has no shortage of great
links and explanations. Particularly suited to those with no prior understanding. 5
● CS231n: Convolutional Neural Networks for Visual Recognition from
Stanford University is an excellent resource for more depth. 6
● Deep Learning (Goodfellow, Bengio & Courville, 2016) provides detailed
explanations of CNN features and functionality in Chapter 9. The textbook has
been kindly made available for free in HTML format by the authors. 7
For those wishing to understand more about Neural Networks and Deep Learning in
general we suggest:
2 Krizhevsky, A., Sutskever, I. and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional
Neural Networks, NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada. Available:
http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf
3 Kuhn, T. S. 1962. The Structure of Scientific Revolutions. 4th ed. United States: The University of
Chicago Press.
4 Karpathy, A. 2015. What a Deep Neural Network thinks about your #selfie. [Blog] Andrej Karpathy Blog.
Available: http://karpathy.github.io/2015/10/25/selfie/ [Accessed: 21/12/2016]
5 Quora. 2016. What is a convolutional neural network? [Online] Available:
https://www.quora.com/What-is-a-convolutional-neural-network [Accessed: 21/12/2016]
6 Stanford University. 2016. Convolutional Neural Networks for Visual Recognition. [Online] CS231n.
Available: http://cs231n.stanford.edu/ [Accessed 21/12/2016]
7 Goodfellow et al. 2016. Deep Learning. MIT Press. [Online] http://www.deeplearningbook.org/
[Accessed: 21/12/2016] Note: Chapter 9, Convolutional Networks [Available:
http://www.deeplearningbook.org/contents/convnets.html]
● Neural Networks and Deep Learning (Nielsen, 2017) is a free online textbook
which provides the reader with a really intuitive understanding of the complexities
of Neural Networks and Deep Learning. Even just completing chapter one should
greatly illuminate the subject matter of this piece for first-timers. 8
As a whole, this piece is disjointed and spasmodic, a reflection of the authors’
excitement and of the spirit in which it is intended to be utilised, section by section.
Information is partitioned using our own heuristics and judgements, a necessary
compromise due to the cross-domain influence of much of the work presented.
We hope that readers benefit from our aggregation of the information here to further
their own knowledge, regardless of previous experience.
From all our contributors,
The M Tank
8 Nielsen, M. 2017. Neural Networks and Deep Learning. [Online] EBook. Available:
http://neuralnetworksanddeeplearning.com/index.html [Accessed: 06/03/2017].
Part One: Classification/Localisation, Object Detection, Object
Tracking
Classification/Localisation
The task of classification, when it relates to images, generally refers to assigning a label
to the whole image, e.g. ‘cat’. Assuming this, Localisation may then refer to finding
where the object is in said image, usually denoted by the output of some form of
bounding box around the object. Current classification/localisation techniques on
ImageNet9 have likely surpassed an ensemble of trained humans.10 For this reason, we
place greater emphasis on subsequent sections of the blog.
Figure 1: Computer Vision Tasks
Source: Fei-Fei Li, Andrej Karpathy & Justin Johnson (2016) cs231n, Lecture 8 - Slide 8, Spatial
Localization and Detection (01/02/2016). Available:
http://cs231n.stanford.edu/slides/2016/winter1516_lecture8.pdf
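
Concretely, localisation is usually cast as a regression problem trained alongside
classification: a shared feature extractor feeds one head that scores classes and
another that predicts the four coordinates of a bounding box. The PyTorch sketch below
is our own minimal illustration of that two-headed formulation, not a reproduction of any
specific published model:

```python
import torch
import torch.nn as nn

# Classification + localisation as commonly formulated: a shared
# backbone, a classification head (class scores: 'what') and a
# regression head (4 bounding-box coordinates: 'where') for the single
# dominant object. Illustrative sketch only.
class ClassifyAndLocalise(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1)
        )
        self.class_head = nn.Linear(32, num_classes)
        self.box_head = nn.Linear(32, 4)  # e.g. (x, y, width, height)

    def forward(self, x):
        feats = self.backbone(x).flatten(1)
        return self.class_head(feats), self.box_head(feats)

scores, box = ClassifyAndLocalise()(torch.randn(1, 3, 224, 224))
# Training would combine a classification loss (cross-entropy on scores)
# with a regression loss (e.g. L1/L2 on box) into a single objective.
```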
However, the introduction of larger datasets with an increased number of classes11 will
likely provide new metrics for progress in the near future. On that point, François
Chollet, the creator of Keras,12 has applied new techniques, including the popular
architecture Xception, to an internal Google dataset with over 350 million multi-label
images containing 17,000 classes.13, 14
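
A dataset in which each image can carry several labels changes the classification
objective: rather than a softmax over mutually exclusive classes, each class is typically
scored independently with a sigmoid and trained with binary cross-entropy. The sketch
below contrasts the two standard formulations in PyTorch; it makes no claim about the
cited papers’ exact training details:

```python
import torch
import torch.nn as nn

# Single-label vs multi-label classification losses. With multi-label
# data, every class is scored independently (sigmoid + BCE) rather than
# competing in a softmax. Illustrative of the standard setups only.
num_classes = 17_000
logits = torch.randn(2, num_classes)          # raw scores for 2 images

# Single-label: exactly one correct class per image (softmax + CE).
single_label = torch.tensor([3, 42])
ce = nn.CrossEntropyLoss()(logits, single_label)

# Multi-label: any subset of classes may be present (sigmoid + BCE).
multi_label = torch.zeros(2, num_classes)
multi_label[0, [3, 7, 512]] = 1.0             # image 0 carries three labels
multi_label[1, [42]] = 1.0
bce = nn.BCEWithLogitsLoss()(logits, multi_label)
print(ce.item(), bce.item())
```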
9 ImageNet refers to a popular image dataset for Computer Vision. Each year entrants compete in a
series of different tasks called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
Available: http://image-net.org/challenges/LSVRC/2016/index
10 See “What I learned from competing against a ConvNet on ImageNet” by Andrej Karpathy. The blog
post details the author’s journey to provide a human benchmark against the ILSVRC 2014 dataset. The
error rate was approximately 5.1% versus a then state-of-the-art GoogLeNet classification error of 6.8%.
Available:
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
11 See new datasets later in this piece.
12 Keras is a popular neural network-based deep learning library: https://keras.io/
13 Chollet, F. 2016. Information-theoretical label embeddings for large-scale image classification. [Online]
arXiv: 1607.05691. Available: arXiv:1607.05691v1
Figure 2: Classification/Localisation results from ILSVRC (2010-2016)
Note: ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The jump in results from
2011 to 2012 reflects the AlexNet submission. For a review of the challenge requirements relating to
Classification and Localization see: http://www.image-net.org/challenges/LSVRC/2016/index#comp
Source: Jia Deng (2016). ILSVRC2016 object localisation: introduction, results. Slide 2. Available:
http://image-net.org/challenges/talks/2016/ILSVRC2016_10_09_clsloc.pdf
Interesting takeaways from the ImageNet LSVRC (2016):
● Scene Classification refers to the task of labelling an image with a certain
scene class like ‘greenhouse’, ‘stadium’, ‘cathedral’, etc. ImageNet held a Scene
Classification challenge last year with a subset of the Places2 dataset:15 8 million
images for training with 365 scene categories. Hikvision won with a 9% top-5
error using an ensemble of deep Inception-style networks and not-so-deep
residual networks.16
● Trimps-Soushen won the ImageNet Classification task with 2.99% top-5
classification error and 7.71% localisation error. The team employed an
ensemble for classification (averaging the results of Inception, Inception-ResNet,
ResNet and Wide Residual Networks models17) and Faster R-CNN for
localisation based on the labels.18 The general ensembling recipe is sketched in
code after this list. The dataset was distributed across 1,000 image classes, with
1.2 million images provided as training data; the partitioned test data comprised
a further 100 thousand unseen images.
14 Chollet, F. 2016. Xception: Deep Learning with Depthwise Separable Convolutions. [Online]
arXiv:1610.02357. Available: arXiv:1610.02357v2
15 Places2 dataset, details available: http://places2.csail.mit.edu/. See also new datasets section.
16 Hikvision. 2016. Hikvision ranked No.1 in Scene Classification at ImageNet 2016 challenge. [Online]
Security News Desk. Available:
http://www.securitynewsdesk.com/hikvision-ranked-no-1-scene-classification-imagenet-2016-challenge/
[Accessed: 20/03/2017].
17 See Residual Networks in Part Four of this publication for more details.
18 Details available under team information Trimps-Soushen from:
http://image-net.org/challenges/LSVRC/2016/results
● ResNeXt by Facebook came a close second in top-5 classification error with
3.03% by using a new architecture that extends the original ResNet architecture.19
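
Two pieces of mechanics recur throughout these results: ensembling (averaging the
class probabilities of several models) and the top-5 error metric (a prediction counts as
correct if the true class appears among the five highest-scoring classes). The PyTorch
sketch below shows the general recipe only; the winning entries’ exact weighting and
post-processing will differ:

```python
import torch

# Ensembling by averaging class probabilities across models, then
# evaluating top-5 error, as quoted in the ILSVRC results above.
def ensemble_top5_error(per_model_logits, true_labels):
    # per_model_logits: list of (batch, num_classes) tensors, one per model
    probs = torch.stack([l.softmax(dim=1) for l in per_model_logits]).mean(0)
    top5 = probs.topk(5, dim=1).indices                  # (batch, 5)
    correct = (top5 == true_labels.unsqueeze(1)).any(1)  # true class in top 5?
    return 1.0 - correct.float().mean().item()

# Example with three mock 'models' on a 10-image, 1000-class batch.
logits = [torch.randn(10, 1000) for _ in range(3)]
labels = torch.randint(0, 1000, (10,))
print(ensemble_top5_error(logits, labels))
```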
Object Detection
As one can imagine, the process of Object Detection does exactly that: it detects
objects within images. The definition provided for object detection by the ILSVRC 201620
includes outputting bounding boxes and labels for individual objects. This differs from
the classification/localisation task by applying classification and localisation to many
objects instead of just a single dominant object.
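
Because detections are judged by how well predicted boxes overlap the ground truth,
intersection-over-union (IoU) is the workhorse quantity behind detection benchmarks
such as the ILSVRC’s; a predicted box typically counts as a match only above some
IoU threshold (0.5 is a common choice). A minimal sketch of the standard computation:

```python
# Intersection over Union (IoU): the standard measure of how well a
# predicted bounding box matches a ground-truth box. Boxes are given as
# (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping rectangle (zero area if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```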
Figure 3: Object Detection With Face as the Only Class
Note: Picture is an example of face detection, Object Detection of a single class. The authors cite one of
the persistent issues in Object Detection to be the detection of small objects. Using small faces as a test
class they explore the role of scale invariance, image resolution, and contextual reasoning.
Source: Hu and Ramanan (2016, p. 1)21
One of 2016’s major trends in Object Detection was the shift towards quicker, more
efficient detection systems. This was visible in approaches like YOLO, SSD and R-FCN
as a move towards sharing computation on a whole image. Hence, these approaches
differentiate themselves from the costly subnetworks associated with Fast/Faster
R-CNN.
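
The computation-sharing idea can be made concrete: a single-shot detector runs the
backbone once over the whole image and reads off class scores and box offsets for
every cell of the resulting feature map with a small convolutional head, rather than
evaluating a subnetwork per region proposal. The PyTorch sketch below is a loose,
YOLO/SSD-flavoured illustration of that idea, not either paper’s actual architecture:

```python
import torch
import torch.nn as nn

# Single-shot detection in miniature: one backbone pass over the whole
# image, then a 1x1 conv head predicting, for every grid cell, class
# scores plus 4 box offsets. All computation is shared across the image,
# in contrast to running a costly subnetwork per region proposal as in
# Fast/Faster R-CNN. Loose illustration only.
num_classes = 20
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
)
head = nn.Conv2d(128, num_classes + 4, kernel_size=1)  # per-cell predictions

image = torch.randn(1, 3, 256, 256)
preds = head(backbone(image))  # one pass, all locations at once
print(preds.shape)             # torch.Size([1, 24, 64, 64])
# Each of the 64x64 cells now holds 20 class scores and 4 box offsets;
# decoding plus non-maximum suppression would turn these into detections.
```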
19 Xie, S., Girshick, R., Dollar, P., Tu, Z. & He, K. 2016. Aggregated Residual Transformations for Deep
Neural Networks. [Online] arXiv: 1611.05431. Available: arXiv:1611.05431v1
20 ImageNet Large Scale Visual Recognition Challenge (2016), Part II, Available:
http://image-net.org/challenges/LSVRC/2016/#det [Accessed: 22/11/2016]
21 Hu and Ramanan. 2016. Finding Tiny Faces. [Online] arXiv: 1612.04402. Available:
arXiv:1612.04402v1