
Understanding vision: theory, models, and data.pdf

Understanding vision: theory, models, and data (provisional title)

© Li Zhaoping
University College London, UK

This book originated from lecture notes used to teach students computational and theoretical vision. Readers may find that the paper by Zhaoping (2006), “Theoretical understanding of the early visual processes by data compression and data selection”, Network: Computation in Neural Systems 17(4):301-334, is an abridged version of parts of chapters 2-4 of this book. The book is still a very rough draft; I hope to update it continuously and make it available to students. Feedback is welcome. If you would like better explanations or more detail in any part of this manuscript, if you find parts of the text unclear or confusing, or for anything else, please do not hesitate to contact me at z.li@ucl.ac.uk.

This document was produced on September 29, 2011.
Contents

1 Introduction and scope . . . 7
1.1 The approach . . . 7
1.1.1 Theory, models, and data . . . 7
1.2 The problem of vision . . . 8
1.2.1 Vision seen through visual encoding, selection, and decoding . . . 9
1.2.2 Retina and V1 seen through visual encoding and bottom-up selection . . . 11
1.2.3 Visual decoding and higher visual cortical areas . . . 12
1.3 What is known about vision experimentally . . . 13
1.3.1 Neurons, neural circuits, cortical areas, and the brain . . . 13
1.3.2 Visual processing stages along the visual pathway . . . 14
1.3.3 Retina . . . 16
Receptive fields of the retinal ganglion cells . . . 16
Contrast sensitivity to sinusoidal gratings . . . 20
Color processing in the retina . . . 23
Spatial sampling on the retina . . . 24
1.3.4 The primary visual cortex (V1) . . . 24
The retinotopic map . . . 25
The receptive fields in the primary visual cortex: the feature detectors . . . 26
The influences on a V1 neuron’s response from contextual stimuli outside the receptive field . . . 30
1.3.5 The higher visual areas . . . 31
1.3.6 Behavioral studies on vision . . . 31
1.3.7 Etc. . . . 31
2 Information encoding in early vision: the efficient coding principle . . . 33
2.1 A brief introduction on information theory (skip if not needed) . . . 33
Measuring information amount . . . 34
Information transmission, information channels, and mutual information . . . 35
Information redundancy and error correction . . . 37
2.2 Formulation of the efficient coding principle . . . 40
2.3 Efficient neural sampling in the retina . . . 43
2.3.1 Contrast sampling in a fly’s compound eye . . . 43
2.3.2 Spatial sampling by receptor distribution on the retina . . . 44
2.3.3 Color sampling by wavelength sensitivities of the cones . . . 47
2.4 Efficient coding by early visual receptive fields . . . 48
2.4.1 Obtaining the efficient code, and the related sparse code, in low noise limit by numerical simulations . . . 48
2.4.2 The general analytical solution to efficient codings of gaussian signals . . . 50
2.5 Illustration: stereo coding in V1 . . . 52
2.5.1 Principal component analysis . . . 53
2.5.2 Gain control . . . 57
2.5.3 Contrast enhancement, decorrelation, and whitening in the high S/N region . . . 59
2.5.4 Many equivalent solutions of optimal encoding . . . 60
2.5.5 A special, most local, class of optimal coding . . . 61
2.5.6 Smoothing and output correlation in the low S/N region . . . 61
2.5.7 Adaptation of the optimal code to the statistics of the input environment . . . 63
Binocular cells, monocular cells, and ocular dominance columns . . . 63
Coupling between stereo coding and spatial scale coding . . . 64
Adaptation of stereo coding to light levels . . . 65
Strabismus . . . 65
Adaptation of stereo coding with animal species . . . 65
Coupling between stereo coding and the preferred orientation of V1 neurons . . . 65
Monocular deprivation . . . 66
2.6 Applying efficient coding to understand coding in space, color, time, and scale in retina and V1 . . . 66
2.6.1 Efficient spatial coding for retina . . . 68
2.6.2 Efficient coding in time . . . 75
2.6.3 Efficient coding in color . . . 78
2.6.4 Coupling space and color coding in retina . . . 79
2.6.5 Efficient spatial coding in V1 . . . 81
2.6.6 Coupling the spatial and color coding in V1 . . . 85
2.6.7 Coupling spatial coding with stereo coding . . . 86
2.6.8 Coupling spatial space with temporal, chromatic, and stereo coding in V1 . . . 86
2.7 How to get the efficient codes by developmental rules or unsupervised learning? . . . 86
3 V1 and information coding . . . 89
3.1 Pursuit of efficient coding in V1 by reducing higher order redundancy . . . 89
3.1.1 Higher order statistics contains much of the meaningful information about visual objects . . . 89
3.1.2 Characterizing higher order statistics . . . 91
3.1.3 Efforts to understand V1 by removal of higher order redundancy . . . 93
3.2 Problems in understanding V1 by the goal of efficient coding . . . 93
3.3 Meanings versus amount of information, and information selection . . . 95
4 Information selection in early vision: the V1 hypothesis, creating a bottom-up saliency map for pre-attentive selection and segmentation . . . 97
4.1 The problems and frameworks . . . 97
4.1.1 The problem of visual segmentation . . . 97
4.1.2 Visual selection, attention, and saliency . . . 98
Visual saliency, and a brief overview of its behavioral manifestation . . . 99
How can one probe bottom-up saliency through reaction times when behavior is controlled by both top-down and bottom-up factors? . . . 101
Saliency regardless of input features . . . 102
Detailed formulation of the V1 saliency hypothesis . . . 104
4.2 Testing the V1 saliency map in a V1 model . . . 107
4.2.1 The V1 model: its neural elements, connections, and desired behavior . . . 107
4.2.2 Calibration of the V1 model to the biological reality . . . 114
4.2.3 Computational requirements on the dynamic behavior of the model . . . 114
4.2.4 Applying the V1 model to visual segmentation and visual search . . . 117
Quantitative assessments of saliency from the V1 responses . . . 118
Feature search and conjunction search by the V1 model . . . 120
A trivial case of visual search asymmetry through the presence or the absence of a feature in the target . . . 122
The influence of background variability on the ease of visual search . . . 123
Influence of the density of input items on saliencies by feature contrast . . . 124
How does a hole in a texture attract attention? . . . 124
Segmenting two identical abutting textures from each other . . . 128
More subtle examples of visual search asymmetry . . . 130
4.3 Neural circuit and nonlinear dynamics in the primary visual cortex for saliency computation . . . 131
4.3.1 A minimal model of the primary visual cortex . . . 132
A less-than-minimal recurrent model of V1 . . . 132
A minimal recurrent model with hidden units . . . 137
4.3.2 Dynamic analysis . . . 144
A single pair of neurons . . . 145
Two interacting pairs of neurons with non-overlapping receptive fields . . . 145
A one dimensional array of identical bars . . . 146
Two dimensional textures and texture boundaries . . . 148
Translation invariance and pop-out . . . 151
Filling-in and leaking-out . . . 152
Hallucination prevention, and neural oscillations . . . 154
4.3.3 Discussions . . . 156
4.4 Psychophysical test of the V1 theory of bottom-up saliency . . . 158
4.4.1 Testing the feature-blind “auction” framework of the V1 saliency map . . . 158
Further discussions and explorations on the interference by task irrelevant features . . . 160
Contrasting with the feature-map-to-master-map framework of the previous views . . . 161
4.4.2 Fingerprints of V1 in the bottom-up saliency . . . 163
Fingerprint of V1’s conjunctive cells . . . 163
Fingerprint of V1’s monocular cells . . . 166
Fingerprint of V1’s colinear facilitation . . . 170
4.5 The respective roles of V1 and other cortical areas for attentional guidance . . . 172
5 Visual recognition and discrimination . . . 179
6 Summary . . . 181
Chapter 1 Introduction and scope

1.1 The approach

Vision is the most intensively studied aspect of the brain, physiologically, anatomically, and behaviorally [162]. The saying that our eyes are the windows to our brain is not unreasonable: at least in primates, brain areas devoted to visual functions occupy a large portion of the cerebral cortex, about 50% in monkeys (see Fig. (1.5)). Understanding visual functions can therefore be expected to reveal much about how the brain works. Vision researchers come from many specialist fields, including physiology, psychology, anatomy, medicine, engineering, mathematics, and physics, each with its distinct approach and value. A common language is essential for effective communication and collaboration among vision scientists. One way to achieve this is to frame and define everything clearly before communicating the details. This is what I will try my best to do in this book, clearly defining problems and terms whenever the need arises. These definitions also include scoping, i.e., dividing problems or domains into sub-problems or sub-domains so that they can be studied more effectively. For example, vision may be divided into low-level, mid-level, and high-level vision according to a rough temporal progression of the computation involved, and visual attentional selection may be divided into selection driven by top-down factors and selection driven by bottom-up factors. Many of these divisions and scopings will likely turn out to be sub-optimal and can be improved as research yields more knowledge. However, refusing to divide or scope problems and domains now for fear of imperfection often slows research progress.

1.1.1 Theory, models, and data

This book aims to understand vision through the interplay between theory, models, and data, each playing its own role, as illustrated in Fig. (1.1).
Theoretical studies of vision propose computational principles or hypotheses that explain why physiology and anatomy are as they are in terms of visual behavior, and vice versa. They should provide non-trivial insights into the multitude of experimental observations, link seemingly unrelated data to each other, and motivate experimental investigations. Appropriate mathematical formulations are often necessary to make theories sufficiently precise and powerful. Experimental data of all kinds, whether physiological, behavioral, or anatomical, provide both inspiration for the theories and their ultimate tests. For example, this book presents detailed material on two theories of early vision: the efficient coding theory of early visual receptive fields (chapter 2), and the V1 saliency hypothesis about a functional role of the primary visual cortex (chapter 4). The experimental data inspiring these theories include the receptive fields of neurons in the retina and cortex, together with their dependence on animal species and their adaptation to the environment; human sensitivities to various visual stimuli; the intra-cortical circuits in V1; and visual behavior in visual search and segmentation tasks.

Models, including phenomenological, biophysical, and neural circuit models of neural mechanisms, are very useful tools for linking theory and data, particularly when their complexity is designed to suit the questions asked. They can, for example, be used to illustrate or demonstrate theoretical hypotheses, or to test whether the hypotheses are feasible given specific neural mechanisms.

[Figure 1.1: The roles of theory, models, and data in understanding vision. Theory: hypotheses, principles (e.g., early visual processing has a goal to find an efficient representation of visual input information). Models: characterizing the mechanisms or phenomena (e.g., a difference of two gaussian spatial functions for the receptive field of a retinal ganglion neuron). Data: physiological and psychological observations (e.g., neurons’ visual receptive fields, visual behavioral sensitivities, and their adaptations to the environment). Arrows between the boxes are labeled inspire, test, predict, explain, demonstrate, implement, and fit.]

Note that although models are very useful, they are only tools for illustrating and demonstrating theories and for linking theory to data. They often involve simplifications and approximations that make them quantitatively incorrect; this is acceptable as long as their purpose in a specific application does not require quantitative precision. Hence, their quantitative imprecision should not be the basis for dismissing a theory, especially when simplified toy models are used to illustrate a theoretical concept. For example, if Newton’s laws could not predict the trajectory of a rocket precisely because knowledge about the Earth’s atmosphere was insufficient, the laws should not be thrown out like the baby with the bathwater. Similarly, the theoretical proposal that early visual processing has the goal of recoding the raw visual input into an efficient representation (details in chapter 2) could still be correct even if the receptive fields of the retinal ganglion cells are modeled simply as differences of gaussians to illustrate the efficient coding transform.
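To make the difference-of-gaussians (DoG) receptive field model mentioned above concrete, here is a minimal one-dimensional sketch: an excitatory gaussian center minus a weaker, broader gaussian surround, with the cell's linear response computed as a weighted sum of the stimulus over positions. It is purely illustrative. The function names `gaussian`, `dog_profile`, and `response` and the parameter values (center width 1.0, surround width 3.0, surround weight 0.9) are hypothetical choices, not values taken from this book, and real receptive fields are two-dimensional.

```python
import math


def gaussian(x: float, sigma: float) -> float:
    """Unit-area 1D gaussian evaluated at position x."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def dog_profile(x: float, sigma_c: float = 1.0, sigma_s: float = 3.0, w_s: float = 0.9) -> float:
    """Hypothetical DoG receptive field: excitatory center (width sigma_c) minus a
    weaker, broader inhibitory surround (width sigma_s, weight w_s).
    The default parameter values are illustrative assumptions only."""
    return gaussian(x, sigma_c) - w_s * gaussian(x, sigma_s)


def response(stimulus, positions, **params) -> float:
    """Linear response: sum over positions of receptive-field weight times stimulus value."""
    return sum(dog_profile(x, **params) * s for x, s in zip(positions, stimulus))


if __name__ == "__main__":
    xs = [0.5 * i for i in range(-40, 41)]              # positions from -20 to 20, step 0.5
    uniform = [1.0] * len(xs)                           # spatially uniform illumination
    spot = [1.0 if abs(x) <= 1.0 else 0.0 for x in xs]  # small bright spot on the center
    print(f"center weight   : {dog_profile(0.0):+.3f}")
    print(f"surround weight : {dog_profile(4.0):+.3f}")
    print(f"uniform response: {response(uniform, xs):+.3f}")
    print(f"spot response   : {response(spot, xs):+.3f}")
```

With these assumed parameters, the script prints a positive weight at the center, a negative weight in the surround, and a response to the small central spot that is several times larger than the response to uniform illumination. This center-surround antagonism is what such a descriptive model is meant to capture. Because the surround weight is below 1, uniform illumination still evokes a weak positive response; calling `response(..., w_s=1.0)` makes it nearly zero.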
Focusing on the “why” of physiology, this book de-emphasizes purely descriptive models concerning the “what” and “how”, e.g., models of the center-surround receptive fields of the retinal ganglion cells, or mechanistic models of how orientation tuning develops in V1, except when they serve illustrative or other purposes.

1.2 The problem of vision

Vision could be defined as the inverse problem of imaging, or of computer graphics. Imaging is the operation that transforms the three-dimensional visual world, containing objects that reflect light, into two-dimensional images formed by this light hitting the imaging planes; see Fig. (1.2). Given a viewing direction or imaging setup, any visual world gives rise to a unique image, obtained simply by projecting the 3D scene in that direction onto a 2D image. Hence, this imaging problem is well understood, as manifested by the success of computer graphics in movie making. The inverse problem of imaging or graphics, in contrast, is to obtain the three-dimensional scene information from the two-dimensional images. Human vision is poorly understood partly because, if we view vision as the inverse problem of imaging, there is typically no unique three-dimensional visual world consistent with the given two-dimensional images. This can be illustrated explicitly in a simplified