Understanding vision: theory, models, and data
(provisional title)
© Li Zhaoping
University College London, UK
This book originated from lecture notes used to teach students computational/theoretical
vision. Some readers may find that a paper, Zhaoping (2006) “Theoretical understanding of the
early visual processes by data compression and data selection” in Network: Computation in Neural
Systems 17(4):301-334, is an abridged version of parts of chapters 2-4 of this book.
The book is still a very rough draft — I hope to update the draft continuously and make it
available to students. Feedback is welcome. If you would like better explanations or more details in
any part of this manuscript, or if you find certain parts of the text unclear or confusing, or have
any other comments, please do not hesitate to contact me at z.li@ucl.ac.uk.
This document was produced on September 29, 2011
Contents

1 Introduction and scope
  1.1 The approach
    1.1.1 Theory, models, and data
  1.2 The problem of vision
    1.2.1 Vision seen through visual encoding, selection, and decoding
    1.2.2 Retina and V1 seen through visual encoding and bottom-up selection
    1.2.3 Visual decoding and higher visual cortical areas
  1.3 What is known about vision experimentally
    1.3.1 Neurons, neural circuits, cortical areas, and the brain
    1.3.2 Visual processing stages along the visual pathway
    1.3.3 Retina
      Receptive fields of the retinal ganglion cells
      Contrast sensitivity to sinusoidal gratings
      Color processing in the retina
      Spatial sampling on the retina
    1.3.4 The primary visual cortex (V1)
      The retinotopic map
      The receptive fields in the primary visual cortex — the feature detectors
      The influences on a V1 neuron’s response from contextual stimuli outside the receptive field
    1.3.5 The higher visual areas
    1.3.6 Behavioral studies on vision
    1.3.7 Etc

2 Information encoding in early vision: the efficient coding principle
  2.1 A brief introduction on information theory — skip if not needed
      Measuring information amount
      Information transmission, information channels, and mutual information
      Information redundancy and error correction
  2.2 Formulation of the efficient coding principle
  2.3 Efficient neural sampling in the retina
    2.3.1 Contrast sampling in a fly’s compound eye
    2.3.2 Spatial sampling by receptor distribution on the retina
    2.3.3 Color sampling by wavelength sensitivities of the cones
  2.4 Efficient coding by early visual receptive fields
    2.4.1 Obtaining the efficient code, and the related sparse code, in low noise limit by numerical simulations
    2.4.2 The general analytical solution to efficient codings of gaussian signals
  2.5 Illustration: stereo coding in V1
    2.5.1 Principal component analysis
    2.5.2 Gain control
    2.5.3 Contrast enhancement, decorrelation, and whitening in the high S/N region
    2.5.4 Many equivalent solutions of optimal encoding
    2.5.5 A special, most local, class of optimal coding
    2.5.6 Smoothing and output correlation in the low S/N region
    2.5.7 Adaptation of the optimal code to the statistics of the input environment
      Binocular cells, monocular cells, and ocular dominance columns
      Coupling between stereo coding and spatial scale coding
      Adaptation of stereo coding to light levels
      Strabismus
      Adaptation of stereo coding with animal species
      Coupling between stereo coding and the preferred orientation of V1 neurons
      Monocular deprivation
  2.6 Applying efficient coding to understand coding in space, color, time, and scale in retina and V1
    2.6.1 Efficient spatial coding for retina
    2.6.2 Efficient coding in time
    2.6.3 Efficient coding in color
    2.6.4 Coupling space and color coding in retina
    2.6.5 Efficient Spatial Coding in V1
    2.6.6 Coupling the spatial and color coding in V1
    2.6.7 Coupling spatial coding with stereo coding
    2.6.8 Coupling spatial space with temporal, chromatic, and stereo coding in V1
  2.7 How to get the efficient codes by developmental rules or unsupervised learning?

3 V1 and information coding
  3.1 Pursuit of efficient coding in V1 by reducing higher order redundancy
    3.1.1 Higher order statistics contains much of the meaningful information about visual objects
    3.1.2 Characterizing higher order statistics
    3.1.3 Efforts to understand V1 by removal of higher order redundancy
  3.2 Problems in understanding V1 by the goal of efficient coding
  3.3 Meanings versus Amount of Information, and Information Selection

4 Information selection in early vision: the V1 hypothesis — creating a bottom up saliency map for pre-attentive selection and segmentation
  4.1 The problems and frameworks
    4.1.1 The problem of visual segmentation
    4.1.2 Visual selection, attention, and saliency
      Visual saliency, and a brief overview of its behavioral manifestation
      How can one probe bottom-up saliency through reaction times when behavior is controlled by both top-down and bottom-up factors?
      Saliency regardless of input features
      Detailed formulation of the V1 saliency hypothesis
  4.2 Testing the V1 saliency map in a V1 model
    4.2.1 The V1 model: its neural elements, connections, and desired behavior
    4.2.2 Calibration of the V1 model to the biological reality
    4.2.3 Computational requirements on the dynamic behavior of the model
    4.2.4 Applying the V1 model to visual segmentation and visual search
      Quantitative assessments of saliency from the V1 responses
      Feature search and conjunction search by the V1 model
      A trivial case of visual search asymmetry through the presence or the absence of a feature in the target
      The influence of background variability on the ease of visual search
      Influence of the density of input items on saliencies by feature contrast
      How does a hole in a texture attract attention?
      Segmenting two identical abutting textures from each other
      More subtle examples of visual search asymmetry
  4.3 Neural circuit and nonlinear dynamics in the primary visual cortex for saliency computation
    4.3.1 A minimal model of the primary visual cortex
      A less-than-minimal recurrent model of V1
      A minimal recurrent model with hidden units
    4.3.2 Dynamic analysis
      A single pair of neurons
      Two interacting pairs of neurons with non-overlapping receptive fields
      A one dimensional array of identical bars
      Two dimensional textures and texture boundaries
      Translation invariance and pop-out
      Filling-in and leaking-out
      Hallucination prevention, and neural oscillations
    4.3.3 Discussions
  4.4 Psychophysical test of the V1 theory of bottom up saliency
    4.4.1 Testing the feature-blind “auction” framework of the V1 saliency map
      Further discussions and explorations on the interference by task irrelevant features
      Contrasting with the feature-map-to-master-map framework of the previous views
    4.4.2 Fingerprints of V1 in the bottom-up saliency
      Fingerprint of V1’s conjunctive cells
      Fingerprint of V1’s monocular cells
      Fingerprint of V1’s colinear facilitation
  4.5 The respective roles of V1 and other cortical areas for attentional guidance

5 Visual recognition and discrimination

6 Summary

Chapter 1
Introduction and scope
1.1 The approach
Vision is the most intensively studied aspect of the brain, physiologically, anatomically, and
behaviorally [162]. The saying that our eyes are the windows to our brain is not unreasonable since, at least
in primates, brain areas devoted to visual functions occupy a large portion, about 50% in monkeys
(see Fig. (1.5)), of the cerebral cortex. Understanding visual functions can hopefully reveal much
about how the brain works. Vision researchers come from many specialist fields, including
physiology, psychology, anatomy, medicine, engineering, mathematics, and physics, each with its distinct
approach and value. A common language is essential for effective communication and
collaboration between vision scientists. One way to achieve this is to frame and define everything clearly
before communicating the details. This is what I will try my best to do in this book, with a clear
definition of the problems and terms used whenever the need arises. These definitions also include
scoping, or the division of problems or domains into sub-problems or sub-domains in order to better
study them. For example, vision may be divided into low-level, mid-level, and high-level vision
according to a rough temporal progression of the computation involved, and visual attentional
selection may be divided into selection by top-down factors and selection by bottom-up factors.
Many of these divisions and scopings are likely to appear sub-optimal, and can be improved, as more
knowledge is obtained through research. However, not dividing or scoping the problems and domains
now for fear of imperfections in the process often slows research progress.
1.1.1 Theory, models, and data
This book aims to understand vision through the interplay between theory, models, and data, each
playing its respective role, as illustrated in Fig. (1.1). Theoretical studies of vision suggest
computational principles or hypotheses to explain why physiology and anatomy are as they are
given visual behavior, and vice versa. They should provide non-trivial insights into the multitude of
experimental observations, link seemingly unrelated data to each other, and motivate experimental
investigations. Often, appropriate mathematical formulations of the theories are necessary to make
the theories sufficiently precise and powerful. Experimental data of all kinds, physiological,
behavioral, and anatomical, provide inspiration for, and ultimate tests of, the theories. For example, this
book presents detailed material on two theories of early vision: the efficient coding theory
of the early visual receptive fields (details in chapter 2), and the V1 saliency hypothesis
on a functional role of the primary visual cortex (in chapter 4). The experimental data inspiring the
theories include the receptive fields of the neurons in the retina and cortex, their dependence on
the animal species, and their adaptation to the environment; human sensitivities to various visual
stimuli; the intra-cortical circuits in V1; and the visual behavior in visual search and segmentation
tasks. Models, including phenomenological, biophysical, and neural circuit models of neural
mechanisms, are very useful tools in linking the theory and data, particularly when their complexity is
designed to suit the questions asked. They can, for example, be used to illustrate or demonstrate the
theoretical hypotheses, or to test the feasibility of implementing the hypotheses by specific neural mechanisms.

[Figure 1.1 diagram: three boxes connected by arrows labeled demonstrate, implement, inspire, test, predict, explain, and fit.
Theory: hypotheses, principles; e.g., early visual processing has a goal to find an efficient representation of visual input information.
Data: physiological and psychological observations; e.g., neurons’ visual receptive fields, visual behavioral sensitivities, and their adaptations to environment.
Models: characterizing the mechanisms or phenomena; e.g., a difference of two gaussian spatial functions for a receptive field of a retinal ganglion neuron.]
Figure 1.1: The roles of theory, models, and data in understanding vision.

Note that while the models are very useful, they are just tools intended to illustrate,
demonstrate, and link the theory and the data. They often involve simplifications and
approximations which make them quantitatively inaccurate; this is acceptable as long as their purpose in
specific applications does not require quantitative precision. Hence, their quantitative imprecision
should not be the basis for dismissing a theory, especially when simplified toy models are used to
illustrate a theoretical concept. For example, if Newton’s Laws could not predict the trajectory of a
rocket precisely because the knowledge about the Earth’s atmosphere was insufficient, the Laws should
not be thrown out with the bath water. Similarly, the theoretical proposal that early visual processing
has the goal of recoding the raw visual input into an efficient representation (details in chapter 2)
could still be correct even if the visual receptive fields of the retinal ganglion cells are modelled
simply as differences of gaussians to illustrate the efficient coding transform.
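
To make the toy model mentioned above concrete, the following is a minimal sketch of a difference-of-gaussians receptive field and its linear response to an image. It is only an illustration under stated assumptions: the function names, the widths and weights of the two gaussians, and the test stimuli are all hypothetical choices made for this example and are not fitted to any physiological data or taken from later chapters.

```python
import math


def gaussian_2d(x, y, sigma):
    """Isotropic 2D gaussian with unit volume, evaluated at (x, y)."""
    norm = 2.0 * math.pi * sigma * sigma
    return math.exp(-(x * x + y * y) / (2.0 * sigma * sigma)) / norm


def dog_receptive_field(x, y, sigma_center=1.0, sigma_surround=3.0,
                        w_center=1.0, w_surround=0.9):
    """Difference of two gaussians: a narrow excitatory center minus a
    broad inhibitory surround. All parameter values are assumptions."""
    return (w_center * gaussian_2d(x, y, sigma_center)
            - w_surround * gaussian_2d(x, y, sigma_surround))


def response(image, cx, cy):
    """Linear response: pixel values weighted by the kernel centered at (cx, cy)."""
    total = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            total += value * dog_receptive_field(x - cx, y - cy)
    return total


# Hypothetical stimuli on a 21 x 21 pixel grid.
size = 21
c = size // 2
uniform = [[1.0] * size for _ in range(size)]          # spatially uniform light
spot = [[1.0 if (x - c) ** 2 + (y - c) ** 2 <= 4 else 0.0
         for x in range(size)] for y in range(size)]   # small spot on the center

r_uniform = response(uniform, c, c)
r_spot = response(spot, c, c)
print("uniform field:", r_uniform)
print("small central spot:", r_spot)
```

With these assumed parameters the kernel is positive at its center and negative in the surround, so a small spot covering the center drives the model unit more strongly than uniform illumination does. This is the qualitative center-surround behavior; the numerical values themselves carry no significance.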
Focusing on the why of the physiology, this book de-emphasizes purely descriptive models
concerning what and how, e.g., models of the center-surround receptive fields of the retinal ganglion
cells, or mechanistic models of how orientation tuning in V1 develops, except when using them for
illustrative or other purposes.
1.2 The problem of vision
Vision could be defined as the inverse problem of imaging or computer graphics, which is the
operation of transforming the three-dimensional visual world, containing objects that reflect light, into
two-dimensional images formed by this light hitting the imaging plane; see Fig. (1.2). Any visual
world gives rise to a unique image for a given viewing direction, simply by projecting
in that direction from the 3D scene to a 2D image. Hence, this imaging problem is well understood,
as manifested in the success of computer graphics applied to movie making. Meanwhile, the in-
verse problem of imaging or graphics is to obtain the three dimensional scene information from
the two dimensional images. Human vision is poorly understood, partly because, if we see vision
as the inverse problem of imaging, there is typically no unique solution of the three dimensional
visual world given the two dimensional images. This can be illustrated explicitly in a simplified