logo资料库

受限玻尔兹曼机的英文详细介绍(论文).pdf

第1页 / 共21页
第2页 / 共21页
第3页 / 共21页
第4页 / 共21页
第5页 / 共21页
第6页 / 共21页
第7页 / 共21页
第8页 / 共21页
资料共21页,剩余部分请下载后查看
An Introduction to Restricted Boltzmann Machines Asja Fischer1,2 and Christian Igel2 1 Institut f¨ur Neuroinformatik, Ruhr-Universit¨at Bochum, Germany 2 Department of Computer Science, University of Copenhagen, Denmark Abstract. Restricted Boltzmann machines (RBMs) are probabilistic graphical models that can be interpreted as stochastic neural networks. The increase in computational power and the development of faster learn- ing algorithms have made them applicable to relevant machine learning problems. They attracted much attention recently after being proposed as building blocks of multi-layer learning systems called deep belief net- works. This tutorial introduces RBMs as undirected graphical models. The basic concepts of graphical models are introduced first, however, basic knowledge in statistics is presumed. Different learning algorithms for RBMs are discussed. As most of them are based on Markov chain Monte Carlo (MCMC) methods, an introduction to Markov chains and the required MCMC techniques is provided. 1 Introduction Boltzmann machines (BMs) have been introduced as bidirectionally connected networks of stochastic processing units, which can be interpreted as neural net- work models [1,16]. A BM can be used to learn important aspects of an unknown probability distribution based on samples from this distribution. In general, this learning process is difficult and time-consuming. However, the learning problem can be simplified by imposing restrictions on the network topology, which leads us to restricted Boltzmann machines (RBMs, [34]), the topic of this tutorial. A (restricted) BM is a parameterized generative model representing a prob- ability distribution. Given some observations, the training data, learning a BM means adjusting the BM parameters such that the probability distribution repre- sented by the BM fits the training data as well as possible. Boltzmann machines consist of two types of units, so called visible and hidden neurons, which can be thought of as being arranged in two layers. The visible units constitute the first layer and correspond to the components of an observation (e.g., one visible unit for each pixel of a digital input image). The hidden units model dependencies between the components of observations (e.g., dependencies between pixels in images). They can be viewed as non-linear feature detectors [16]. Boltzmann machines can also be regarded as particular graphical models [22], more precisely undirected graphical models also known as Markov random fields. The embedding of BMs into the framework of probabilistic graphical models provides immediate access to a wealth of theoretical results and well-developed L. Alvarez et al. (Eds.): CIARP 2012, LNCS 7441, pp. 14–36, 2012. c Springer-Verlag Berlin Heidelberg 2012
An Introduction to Restricted Boltzmann Machines 15 algorithms. Therefore, our tutorial introduces RBMs from this perspective. Com- puting the likelihood of an undirected model or its gradient for inference is in general computationally intensive, and this also holds for RBMs. Thus, sampling based methods are employed to approximate the likelihood and its gradient. Sampling from an undirected graphical model is in general not straightforward, but for RBMs Markov chain Monte Carlo (MCMC) methods are easily appli- cable in the form of Gibbs sampling, which will be introduced in this tutorial along with basic concepts of Markov chain theory. After successful learning, an RBM provides a closed-form representation of the distribution underlying the observations. It can be used to compare the probabilities of (unseen) observations and to sample from the learnt distribution (e.g., to generate image textures [25,21]), in particular from marginal distribu- tions of interest. For example, we can fix some visible units corresponding to a partial observation and sample the remaining visible units for completing the observation (e.g., to solve an image inpainting task [21]). Boltzmann machines have been proposed in the 1980s [1,34]. Compared to the times when they were first introduced, RBMs can now be applied to more interesting problems due to the increase in computational power and the de- velopment of new learning strategies [15]. Restricted Boltzmann machines have received a lot of attention recently after being proposed as building blocks of multi-layer learning architectures called deep belief networks (DBNs, [19,17]). The idea is that the hidden neurons extract relevant features from the observa- tions. These features can serve as input to another RBM. By stacking RBMs in this way, one can learn features from features in the hope of arriving at a high level representation. It is an important property that single as well as stacked RBMs can be rein- terpreted as deterministic feed-forward neural networks. Than they are used as functions from the domain of the observations to the expectations of the la- tent variables in the top layer. Such a function maps the observations to learnt features, which can, for example, serve as input to a supervised learning sys- tem. Further, the neural network corresponding to a trained RBM or DBN can be augmented by an output layer, where units in the new added output layer represent labels corresponding to observations. Then the model corresponds to a standard neural network for classification or regression that can be further trained by standard supervised learning algorithms [31]. It has been argued that this initialization (or unsupervised pretraining) of the feed-forward neural net- work weights based on a generative model helps to overcome problems observed when training multi-layer neural networks [19]. This introduction to RBMs is meant to supplement existing tutorials, such as the highly recommended review by Bengio [2], by providing more background information on Markov random fields and MCMC methods in Section 2 and Section 3, respectively. However, basic knowledge in statistics is presumed. We put an emphasis on topics that are – based on our experience – sometimes not familiar to people starting with RBMs. Restricted Boltzmann machines will be presented in Section 4. Section 5 will consider RBM training algorithms based
16 A. Fischer and C. Igel on approximations of the log-likelihood gradient. This includes a discussion of contrastive divergence learning [15] as well as parallel tempering [10]. We will close by hinting at generalizations of RBMs in sections 6 and 7. 2 Graphical Models Probabilistic graphical models describe probability distributions by mapping conditional dependence and independence properties between random variables on a graph structure (two sets of random variables X 1 and X 2 are condi- tionally independent given a set of random variables X 3 if p(X 1, X 2|X 3) = p(X 1|X 3)p(X 2|X 3)). Visualization by graphs can help to develop, understand and motivate probabilistic models. Furthermore, complex computations (e.g., marginalization) can be derived efficiently by using algorithms exploiting the graph structure. There exist graphical models associated with different kind of graph struc- tures, for example factor graphs, Bayesian networks associated with directed graphs, and Markov random fields, which are also called Markov networks or undirected graphical models. This tutorial focuses on the latter. A general in- troduction to graphical models for machine learning can, for example be found in [5]. The most comprehensive resource on graphical models is the textbook by Koller and Friedman [22]. 2.1 Undirected Graphs and Markov Random Fields First, we will summarize some fundamental concepts from graph theory. An undirected graph is a tuple G = (V, E), where V is a finite set of nodes and E is a set of undirected edges. An edge consists out of a pair of nodes from V . If there exists an edge between two nodes v and w, i.e. {v, w} ∈ E, w belongs to the neighborhood of v and vice versa. The neighborhood Nv := {w ∈ V : {w, v} ∈ E} of v is defined by the set of nodes connected to v. A clique is a subset of V in which all nodes are pairwise connected. A clique is called maximal if no node can be added such that the resulting set is still a clique. In the following we will denote by C the set of all maximal cliques of an undirected graph. We call a sequence of nodes v1, v2, . . . , vm ∈ V , with {vi, vi+1} ∈ E for i = 1, . . . , m − 1 a path from v1 to vm. A set V ⊂ V separates two nodes v /∈ V and w /∈ V, if every path from v to w contains a node from V. We now associate a random variable Xv taking values in a state space Λv with each node v in an undirected graph G = (V, E). To ease the notation, we assume Λv = Λ for all v ∈ V . The random variables X = (Xv)v∈V are called Markov random field (MRF) if the joint probability distribution p fulfills the (global) Markov property w.r.t. the graph: For all disjunct subsets A,B,S ⊂ V , where all nodes in A and B are separated by S the variables (Xa)a∈A and (Xb)b∈B are conditional independent given (Xs)s∈S, i.e. for all x ∈ Λ it holds p ((xa)a∈A|(xt)t∈S∪B) = p ((xa)a∈A|(xt)t∈S). A set of nodes MB(v) is called the Markov blanket of node v, if for any set of nodes B with v ∈ B we have |V |
An Introduction to Restricted Boltzmann Machines 17 p(v | MB(v),B) = p(v | MB(v)). This means that v is conditional independent from any other variables given MB(v). In an MRF, the Markov blanket MB(v) is given by the neighborhood Nv of v, a fact that is also referred to as local Markov property. Since conditional independence of random variables and the factorization properties of the joint probability distribution are closely related, one can ask if there exists a general factorization form of the distributions of MRFs. An an- swer to this question is given by the Hammersley-Clifford Theorem (for rigorous formulations and proofs we refer to [23,22]). The theorem states that a strictly positive distribution p satisfies the Markov property w.r.t. an undirected graph G if and only if p factorizes over G. A distribution is said to factorize about an undirected graph G with maximal cliques C if there exists a set of non-negative functions {ψC}C⊂C, called potential functions, with ∀x, ˆx ∈ Λ |V | and : (xc)c∈C = (ˆxc)c∈C ⇒ ψC (x) = ψC (ˆx) p(x) = 1 Z ψC (x). (1) (2) (3) The normalization constant Z = C∈C C∈C ψC (xC ) is called partition function. If p is strictly positive, the same holds for the potential functions. Thus we can write x p(x) = ψC (xC ) = C∈C ln ψC (xC ) = 1 Z e −E(x) , e 1 Z C∈C 1 Z C∈C ln ψC (xC) the energy function. Thus, the probability where we call E := distribution of every MRF can be expressed in the form given by (3), which is also called Gibbs distribution. 2.2 Unsupervised Learning Unsupervised learning means learning (important aspects of) an unknown dis- tribution q based on sample data. This includes finding new representations of data that foster learning, generalization, and communication. If we assume that the structure of the graphical model is known and the energy function belongs to a known family of functions parameterized by θ, unsupervised learning of a data distribution with an MRF means adjusting the parameters θ. We write p(x|θ) when we want to emphasize the dependency of a distribution on its parameters. We consider training data S = {x1, . . . , x}. The data samples are assumed to be independent and identically distributed (i.i.d.). That is, they are drawn independently from some unknown distribution q. A standard way of estimating the parameters of a statistical model is maximum-likelihood estimation. Applied to MRFs, this corresponds to finding the MRF parameters that maximize the probability of S under the MRF distribution, i.e. training corresponds to finding the parameters θ that maximize the likelihood given the training data. The
18 A. Fischer and C. Igel likelihood L : Θ → R of an MRF given the data set S maps parameters θ from a parameter space Θ to L(θ | S) = p(xi|θ). Maximizing the likelihood is the same as maximizing the log-likelihood given by i=1 lnL(θ | S) = ln p(xi|θ) = ln p(xi|θ) . (4) i=1 i=1 For the Gibbs distribution of an MRF it is in general not possible to find the maximum likelihood parameters analytically. Thus, numerical approximation methods have to be used, for example gradient ascent which is described below. Maximizing the likelihood corresponds to minimizing the distance between the unknown distribution q underlying S and the distribution p of the MRF in terms of the Kullback-Leibler-divergence (KL-divergence), which for a finite state space Ω is given by: KL(q||p) = q(x) ln q(x) p(x) = q(x) ln q(x) − q(x) ln p(x) (5) x∈Ω x∈Ω x∈Ω The KL-divergence is a (non-symmetric) measure of the difference between two distributions. It is always positive and zero if and only if the distributions are the same. As becomes clear by equation (5) the KL-divergence can be expressed as the difference between the entropy of q and a second term. Only the latter depends on the parameters subject to optimization. Approximating the expec- tation over q in this term by the training samples from q results in the log- likelihood. Therefore, maximizing the log-likelihood corresponds to minimizing the KL-divergence. N lnL(θ(t)|xi) Optimization by Gradient Ascent. If it is not possible to find parameters maximizing the likelihood analytically, the usual way to find them is gradient ascent on the log-likelihood. This corresponds to iteratively updating the param- eters θ(t) to θ(t+1) based on the gradient of the log-likelihood. Let us consider the following update rule: ∂ i=1 ∂θ(t) θ(t+1) = θ(t) + η −λθ(t) + νΔθ(t−1) If the constants λ ∈ R+ 0 and ν ∈ R+ 0 are set to zero, we have vanilla gradient ascent. The constant η ∈ R+ is the learning rate. As we will see later, it can be desirable to strive for models with weights having small absolute values. To achieve this, we can optimize an objective function consisting of the log- likelihood minus half of the norm of the parameters θ2/2 weighted by λ. This method called weight decay penalizes weights with large magnitude. It leads to the −λθ(t) term in our update rule (6). In a Bayesian framework, weight decay := Δθ(t) (6)
An Introduction to Restricted Boltzmann Machines 19 can be interpreted as assuming a zero-mean Gaussian prior on the parameters. The update rule can be further extended by a momentum term Δθ(t−1), weighted by the parameter ν. Using a momentum term helps against oscillations in the iterative update procedure and can speed-up the learning process as known from feed-forward neural network training [31]. Introducing Latent Variables. Suppose we want to model an m-dimensional unknown probability distribution q (e.g., each component of a sample corresponds to one of m pixels of an image). Typically, not all variables X = (Xv)v∈V in an MRF need to correspond to some observed component, and the number of nodes is larger than m. We split X into visible (or observed ) variables V = (V1, . . . , Vm) corresponding to the components of the observations and latent (or hidden) vari- ables H = (H1, . . . , Hn) given by the remaining n = |V | − m variables. Using latent variables allows to describe complex distributions over the visible variables by means of simple (conditional) distributions. In this case the Gibbs distribu- tion of an MRF describes the joint probability distribution of (V , H) and one is usually interested in the marginal distribution of V which is given by h 1 Z p(v) = p(v, h) = h −E(v,h) , e (7) −E(v,h). While the visible variables correspond to the com- where Z = ponents of an observation, the latent variables introduce dependencies between the visible variables (e.g., between pixels of an input image). v,h e Log-Likelihood Gradient of MRFs with Latent Variables. Restricted Boltzmann machines are MRFs with hidden variables and RBM learning algo- rithms are based on gradient ascent on the log-likelihood. For a model of the form (7) with parameters θ, the log-likelihood given a single training example v is lnL(θ | v) = ln p(v | θ) = ln and for the gradient we get: −E(v,h) − ln −E(v,h) = ln −E(v,h) 1 Z e h e h (8) e v,h ∂lnL(θ | v) 1 ∂θ = − = ∂ ∂θ −E(v,h) ln e − ∂ ∂θ h −E(v,h) ∂E(v, h) e−E(v,h) e h h 1 v,h ln v,h −E(v,h) e v,h ∂θ + e−E(v,h) e −E(v,h) ∂E(v, h) ∂θ = − p(h| v) ∂E(v, h) ∂θ + p(v, h) v,h ∂E(v, h) ∂θ (9) h In the last step we used that the conditional probability can be written in the following way:
20 A. Fischer and C. Igel p(h| v) = p(v, h) p(v) = 1 Z 1 Z e h −E(v,h) e−E(v,h) e h = −E(v,h) e−E(v,h) (10) Note that the last expression of equality (9) is the difference of two expectations: the expected values of the energy function under the model distribution and under the conditional distribution of the hidden variables given the training ex- ample. Directly calculating this sums, which run over all values of the respective variables, leads to a computational complexity which is in general exponential in the number of variables of the MRF. To avoid this computational burden, the expectations can be approximated by samples drawn from the corresponding distributions based on MCMC techniques. 3 Markov Chains and Markov Chain Monte Carlo Techniques Markov chains play an important role in RBM training because they provide a method to draw samples from ’complex’ probability distributions like the Gibbs distribution of an MRF. This section will serve as an introduction to some funda- mental concepts of Markov chain theory. A detailed introduction can be found, for example, in [6] and the aforementioned textbooks [5,22]. The section will describe Gibbs sampling as an MCMC technique often used for MRF training and in particular for training RBMs. 3.1 Definition of a Markov Chain and Convergence to Stationarity A Markov chain is a time discrete stochastic process for which the Markov take values in a (in the following considerations finite) set Ω and for which property holds, that is, a family of random variables X = {X (k)|k ∈ N0} which ∀k ≥ 0 and ∀j, i, i0, . . . , ik−1 ∈ Ω it holds X (k+1) = j|X (k) = i, X (k−1) = ik−1, . . . , X (0) = i0 X (k+1) = j|X (k) = i p(k) ij := P (11) (12) = P . This means that the next state of the system depends only on the current state and not on the sequence of events that preceded it. If for all k ≥ 0 the p(k) ij have the same value pij, the chain is called homogeneous and the matrix P = (pij)i,j∈Ω is called transition matrix of the homogeneous Markov chain. If the starting distribution μ(0) (i.e., the probability distribution of X (0)) is given by the probability vector μ(0) = (μ(0)(i))i∈Ω, with μ(0)(i) = P (X (0) = i), the distribution μ(k) of X (k) is given by μ(k) T = μ(0) TPk . A distribution π for which it holds πT = πTP is called stationary distribution. If the Markov chain for any time k reaches the stationary distribution μ(k) = π all subsequent states will be distributed accordingly, that is, μ(k+n) = π for
An Introduction to Restricted Boltzmann Machines 21 all n ∈ N. A sufficient (but not necessary) condition for a distribution π to be stationary w.r.t. a Markov chain described by the transition probabilities pij, i, j ∈ Ω is that ∀i, j ∈ Ω it holds: π(i)pij = π(j)pji . (13) This is called the detailed balance condition. Especially relevant are Markov chains for which it is known that there exists an unique stationary distribution. For finite Ω this is the case if the Markov chain is irreducible. A Markov chain is irreducible if one can get from any state in Ω to any other in a finite number of transitions or more formally ∀i, j ∈ Ω ∃k > 0 with P (X (k) = j|X (0) = i) > 0. A chain is called aperiodic if for all i ∈ Ω the greatest common divisor of {k|P (X (k) = i|X (0) = i) > 0 ∧ k ∈ N0} is 1. One can show that an irreducible and aperiodic Markov chain on a finite state space is guarantied to converge to its stationary distribution (see, e.g., [6]). That is, for an arbitrary starting distribution μ it holds k→∞ dV (μT Pk, πT ) = 0 , lim (14) where dV is the distance of variation. For two distributions α and β on a finite state space Ω, the distance of variation is defined as x∈Ω dV (α, β) = |α − β| = 1 2 1 2 |α(x) − β(x)| . (15) To ease the notation, we allow both row and column probability vectors as arguments of the functions in (15). Markov chain Monte Carlo methods make use of this convergence theorem for producing samples from certain probability distribution by setting up a Markov chain that converges to the desired distributions. Suppose you want to sample from a distribution q with a finite state space. Then you construct an irreducible and aperiodic Markov chain with stationary distribution π = q. This is a non- trivial task. If t is large enough, a sample X (t) from the constructed chain is then approximately a sample from π and therefore from q. Gibbs Sampling [13] is such a MCMC method and will be described in the following section. 3.2 Gibbs Sampling Gibbs Sampling belongs to the class of Metropolis-Hastings algorithms [14]. It is a simple MCMC algorithm for producing samples from the joint probability dis- tribution of multiple random variables. The basic idea is to update each variable subsequently based on its conditional distribution given the state of the others. We will describe it in detail by explaining how Gibbs sampling can be used to simulate the Gibbs distribution of an MRF. We consider an MRF X = (X1, . . . , XN ) w.r.t. a graph G = (V, E), where V = {1, . . . , N} for the sake of clearness of notation. The random variables Xi, i ∈ V take values in a finite set Λ and π(x) = 1 −E(x) is the joint probability Z e
分享到:
收藏