Tutorial on Variational Autoencoders

Contents
1 Introduction
1.1 Preliminaries: Latent Variable Models
2 Variational Autoencoders
2.1 Setting up the objective
2.2 Optimizing the objective
2.3 Testing the learned model
2.4 Interpreting the objective
2.4.1 The error from D[Q(z|X)‖P(z|X)]
2.4.2 The information-theoretic interpretation
2.4.3 VAEs and the regularization parameter
3 Conditional Variational Autoencoders
4 Examples
4.1 MNIST variational autoencoder
5 MNIST conditional variational autoencoder
A Proof in 1D that VAEs have zero approximation error given arbitrarily powerful learners.
arXiv:1606.05908v2 [stat.ML] 13 Aug 2016

Tutorial on Variational Autoencoders

CARL DOERSCH
Carnegie Mellon / UC Berkeley

August 16, 2016

Abstract

In just three years, Variational Autoencoders (VAEs) have emerged as one of the most popular approaches to unsupervised learning of complicated distributions. VAEs are appealing because they are built on top of standard function approximators (neural networks), and can be trained with stochastic gradient descent. VAEs have already shown promise in generating many kinds of complicated data, including handwritten digits [1, 2], faces [1, 3, 4], house numbers [5, 6], CIFAR images [6], physical models of scenes [4], segmentation [7], and predicting the future from static images [8]. This tutorial introduces the intuitions behind VAEs, explains the mathematics behind them, and describes some empirical behavior. No prior knowledge of variational Bayesian methods is assumed.

Keywords: variational autoencoders, unsupervised learning, structured prediction, neural networks

1 Introduction

“Generative modeling” is a broad area of machine learning which deals with models of distributions P(X), defined over datapoints X in some potentially high-dimensional space X. For instance, images are a popular kind of data for which we might create generative models. Each “datapoint” (image) has thousands or millions of dimensions (pixels), and the generative model’s job is to somehow capture the dependencies between pixels, e.g., that nearby pixels have similar color, and are organized into objects. Exactly what it means to “capture” these dependencies depends on exactly what we want to do with the model. One straightforward kind of generative model simply allows us to compute P(X) numerically. In the case of images, X values
which look like real images should get high probability, whereas images that look like random noise should get low probability. However, models like this are not necessarily useful: knowing that one image is unlikely does not help us synthesize one that is likely. Instead, one often cares about producing more examples that are like those already in a database, but not exactly the same. We could start with a database of raw images and synthesize new, unseen images. We might take in a database of 3D models of something like plants and produce more of them to fill a forest in a video game. We could take handwritten text and try to produce more handwritten text. Tools like this might actually be useful for graphic designers. We can formalize this setup by saying that we get examples X distributed according to some unknown distribution Pgt(X), and our goal is to learn a model P which we can sample from, such that P is as similar as possible to Pgt.

Training this type of model has been a long-standing problem in the machine learning community, and classically, most approaches have had one of three serious drawbacks. First, they might require strong assumptions about the structure in the data. Second, they might make severe approximations, leading to suboptimal models. Or third, they might rely on computationally expensive inference procedures like Markov Chain Monte Carlo. More recently, some works have made tremendous progress in training neural networks as powerful function approximators through backpropagation [9]. These advances have given rise to promising frameworks which can use backpropagation-based function approximators to build generative models.

One of the most popular such frameworks is the Variational Autoencoder [1, 3], the subject of this tutorial. The assumptions of this model are weak, and training is fast via backpropagation. VAEs do make an approximation, but the error introduced by this approximation is arguably small given high-capacity models. These characteristics have contributed to a quick rise in their popularity.

This tutorial is intended to be an informal introduction to VAEs, and not a formal scientific paper about them. It is aimed at people who might have uses for generative models, but might not have a strong background in the variational Bayesian methods and “minimum description length” coding models on which VAEs are based. This tutorial began its life as a presentation for computer vision reading groups at UC Berkeley and Carnegie Mellon, and hence has a bias toward a vision audience. Suggestions for improvement are appreciated.
1.1 Preliminaries: Latent Variable Models

When training a generative model, the more complicated the dependencies between the dimensions, the more difficult the models are to train. Take, for example, the problem of generating images of handwritten characters. Say for simplicity that we only care about modeling the digits 0-9. If the left half of the character contains the left half of a 5, then the right half cannot contain the left half of a 0, or the character will very clearly not look like any real digit. Intuitively, it helps if the model first decides which character to generate before it assigns a value to any specific pixel. This kind of decision is formally called a latent variable. That is, before our model draws anything, it first randomly samples a digit value z from the set [0, ..., 9], and then makes sure all the strokes match that character. z is called ‘latent’ because given just a character produced by the model, we don’t necessarily know which settings of the latent variables generated the character. We would need to infer it using something like computer vision.

Before we can say that our model is representative of our dataset, we need to make sure that for every datapoint X in the dataset, there is at least one setting of the latent variables (possibly many) which causes the model to generate something very similar to X. Formally, say we have a vector of latent variables z in a high-dimensional space Z which we can easily sample according to some probability density function (PDF) P(z) defined over Z. Then, say we have a family of deterministic functions f(z; θ), parameterized by a vector θ in some space Θ, where f : Z × Θ → X. f is deterministic, but if z is random and θ is fixed, then f(z; θ) is a random variable in the space X. We wish to optimize θ such that we can sample z from P(z) and, with high probability, f(z; θ) will be like the X’s in our dataset.

To make this notion precise mathematically, we are aiming to maximize the probability of each X in the training set under the entire generative process, according to:

P(X) = ∫ P(X|z; θ) P(z) dz.    (1)

Here, f(z; θ) has been replaced by a distribution P(X|z; θ), which allows us to make the dependence of X on z explicit by using the law of total probability. The intuition behind this framework—called “maximum likelihood”—is that if the model is likely to produce training set samples, then it is also likely to produce similar samples, and unlikely to produce dissimilar ones. In VAEs, the choice of this output distribution is often Gaussian, i.e., P(X|z; θ) = N(X | f(z; θ), σ² · I). That is, it has mean f(z; θ) and covariance equal to the identity matrix I times some scalar σ² (where σ is a hyperparameter). This replacement is necessary to formalize the intuition that some z needs to result in samples that are merely like X.
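To make the generative process concrete, here is a minimal sketch (not from the paper) of sampling X under this model: draw z from the prior N(0, I), push it through a decoder f(z; θ), and add Gaussian noise with standard deviation σ. The decoder below is a hypothetical single affine layer with a sigmoid, standing in for an arbitrary neural network.

import numpy as np

rng = np.random.default_rng(0)

latent_dim, data_dim = 2, 784          # e.g. flattened 28x28 images
sigma = 0.1                            # hyperparameter of P(X|z)
theta = (rng.normal(size=(data_dim, latent_dim)), np.zeros(data_dim))

def f(z, theta):
    # Hypothetical decoder: one affine layer plus a sigmoid, standing in
    # for an arbitrary neural network; outputs lie in (0, 1) like pixels.
    W, b = theta
    return 1.0 / (1.0 + np.exp(-(W @ z + b)))

# Sample from the model: z ~ P(z) = N(0, I), then X ~ N(f(z; theta), sigma^2 I).
z = rng.standard_normal(latent_dim)
X = f(z, theta) + sigma * rng.standard_normal(data_dim)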
Figure 1: The standard VAE model represented as a graphical model. Note the conspicuous lack of any structure or even an “encoder” pathway: it is possible to sample from the model without any input. Here, the rectangle is “plate notation” meaning that we can sample from z and X N times while the model parameters θ remain fixed.

In general, and particularly early in training, our model will not produce outputs that are identical to any particular X. By having a Gaussian distribution, we can use gradient descent (or any other optimization technique) to increase P(X) by making f(z; θ) approach X for some z, i.e., gradually making the training data more likely under the generative model. This wouldn’t be possible if P(X|z) was a Dirac delta function, as it would be if we used X = f(z; θ) deterministically! Note that the output distribution is not required to be Gaussian: for instance, if X is binary, then P(X|z) might be a Bernoulli parameterized by f(z; θ). The important property is simply that P(X|z) can be computed, and is continuous in θ. From here onward, we will omit θ from f(z; θ) to avoid clutter.

2 Variational Autoencoders

The mathematical basis of VAEs actually has relatively little to do with classical autoencoders, e.g. sparse autoencoders [10, 11] or denoising autoencoders [12, 13]. VAEs approximately maximize Equation 1, according to the model shown in Figure 1. They are called “autoencoders” only because the final training objective that derives from this setup does have an encoder and a decoder, and resembles a traditional autoencoder. Unlike sparse autoencoders, there are generally no tuning parameters analogous to the sparsity penalties. And unlike sparse and denoising autoencoders, we can sample directly from P(X) (without performing Markov Chain Monte Carlo, as in [14]).
Figure 2: Given a random variable z with one distribution, we can create another random variable X = g(z) with a completely different distribution. Left: samples from a Gaussian distribution. Right: those same samples mapped through the function g(z) = z/10 + z/||z|| to form a ring. This is the strategy that VAEs use to create arbitrary distributions: the deterministic function g is learned from data.

To solve Equation 1, there are two problems that VAEs must deal with: how to define the latent variables z (i.e., decide what information they represent), and how to deal with the integral over z. VAEs give a definite answer to both.

First, how do we choose the latent variables z such that we capture latent information? Returning to our digits example, the ‘latent’ decisions that the model needs to make before it begins painting the digit are actually rather complicated. It needs to choose not just the digit, but the angle that the digit is drawn, the stroke width, and also abstract stylistic properties. Worse, these properties may be correlated: a more angled digit may result if one writes faster, which also might tend to result in a thinner stroke. Ideally, we want to avoid deciding by hand what information each dimension of z encodes (although we may want to specify it by hand for some dimensions [4]). We also want to avoid explicitly describing the dependencies—i.e., the latent structure—between the dimensions of z. VAEs take an unusual approach to dealing with this problem: they assume that there is no simple interpretation of the dimensions of z, and instead assert that samples of z can be drawn from a simple distribution, i.e., N(0, I), where I is the identity matrix. How is this possible? The key is to notice that any distribution in d dimensions can be generated by taking a set of d variables that are normally distributed and mapping them through a sufficiently complicated function¹. For example, say we wanted to construct a 2D random variable whose values lie on a ring. If z is 2D and normally distributed, g(z) = z/10 + z/||z|| is roughly ring-shaped, as shown in Figure 2.
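As a quick check of the construction in Figure 2, the following sketch (illustrative only, not the paper’s code) maps standard-normal samples through g(z) = z/10 + z/||z|| and confirms that the resulting radii concentrate in a narrow band, i.e., the samples form a ring.

import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 2))                           # z ~ N(0, I) in 2D
g = z / 10.0 + z / np.linalg.norm(z, axis=1, keepdims=True)  # the mapping from Figure 2

radii = np.linalg.norm(g, axis=1)
print(radii.mean(), radii.std())   # radii cluster tightly a little above 1: a ring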
Figure 3: It’s hard to measure the likelihood of images under a model using only sampling. Given an image X (a), the middle sample (b) is much closer in Euclidean distance than the one on the right (c). Because pixel distance is so different from perceptual distance, a sample needs to be extremely close in pixel distance to a datapoint X before it can be considered evidence that X is likely under the model.

Hence, provided powerful function approximators, we can simply learn a function which maps our independent, normally-distributed z values to whatever latent variables might be needed for the model, and then map those latent variables to X. In fact, recall that P(X|z; θ) = N(X | f(z; θ), σ² · I). If f(z; θ) is a multi-layer neural network, then we can imagine the network using its first few layers to map the normally distributed z’s to the latent values (like digit identity, stroke weight, angle, etc.) with exactly the right statistics. Then it can use later layers to map those latent values to a fully-rendered digit. In general, we don’t need to worry about ensuring that the latent structure exists. If such latent structure helps the model accurately reproduce (i.e. maximize the likelihood of) the training set, then the network will learn that structure at some layer.

¹ In one dimension, you can use the inverse cumulative distribution function (CDF) of the desired distribution composed with the CDF of a Gaussian. This is an extension of “inverse transform sampling.” For multiple dimensions, do the stated process starting with the marginal distribution for a single dimension, and repeat with the conditional distribution of each additional dimension. See the “inversion method” and the “conditional distribution method” in Devroye et al. [15].
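The footnote’s one-dimensional recipe can be made concrete with a short sketch (an illustration, not from the paper): push standard-normal samples through the Gaussian CDF to get uniform samples, then through the inverse CDF of a target distribution, here an exponential chosen purely as an example.

import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)
z = rng.standard_normal(10000)   # z ~ N(0, 1)
u = norm.cdf(z)                  # Gaussian CDF: u ~ Uniform(0, 1)
x = expon.ppf(u)                 # inverse CDF of the target: x ~ Exponential(1)

print(x.mean(), x.var())         # both close to 1, as expected for Exponential(1)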
Now all that remains is to maximize Equation 1, where P(z) = N(z|0, I). As is common in machine learning, if we can find a computable formula for P(X), and we can take the gradient of that formula, then we can optimize the model using stochastic gradient ascent. It is actually conceptually straightforward to compute P(X) approximately: we first sample a large number of z values {z1, ..., zn}, and compute P(X) ≈ (1/n) ∑i P(X|zi). The problem here is that in high dimensional spaces, n might need to be extremely large before we have an accurate estimate of P(X). To see why, consider our example of handwritten digits. Say that our digit datapoints are stored in pixel space, in 28x28 images as shown in Figure 3. Since P(X|z) is an isotropic Gaussian, the negative log probability of X is proportional to the squared Euclidean distance between f(z) and X. Say that Figure 3(a) is the target (X) for which we are trying to find P(X). A model which produces the image shown in Figure 3(b) is probably a bad model, since this digit is not much like a 2. Hence, we should set the σ hyperparameter of our Gaussian distribution such that this kind of erroneous digit does not contribute to P(X). On the other hand, a model which produces Figure 3(c) (identical to X but shifted down and to the right by half a pixel) might be a good model. We would hope that this sample would contribute to P(X). Unfortunately, however, we can’t have it both ways: the squared distance between X and Figure 3(c) is .2693 (assuming pixels range between 0 and 1), but between X and Figure 3(b) it is just .0387. The lesson here is that in order to reject samples like Figure 3(b), we need to set σ very small, such that the model needs to generate something significantly more like X than Figure 3(c)! Even if our model is an accurate generator of digits, we would likely need to sample many thousands of digits before we produce a 2 that is sufficiently similar to the one in Figure 3(a). We might solve this problem by using a better similarity metric, but in practice these are difficult to engineer in complex domains like vision, and they’re difficult to train without labels that indicate which datapoints are similar to each other. Instead, VAEs alter the sampling procedure to make it faster, without changing the similarity metric.
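The naive estimator P(X) ≈ (1/n) ∑i P(X|zi) described above, and the role σ plays in it, can be sketched as follows. This is a toy illustration under the isotropic-Gaussian assumption; the decoder f is a placeholder supplied by the caller, and none of this is the paper’s code.

import numpy as np

def log_isotropic_gaussian(x, mean, sigma):
    # log N(x | mean, sigma^2 I); the data-dependent part is the squared distance
    d = x.size
    return -0.5 * np.sum((x - mean) ** 2) / sigma**2 - d * np.log(sigma * np.sqrt(2.0 * np.pi))

def naive_log_px(x, f, sigma, n=10000, latent_dim=2, seed=0):
    # Estimate log P(X) = log (1/n) sum_i P(X|z_i) with z_i ~ N(0, I),
    # accumulated in log space for numerical stability.
    rng = np.random.default_rng(seed)
    zs = rng.standard_normal((n, latent_dim))
    log_p = np.array([log_isotropic_gaussian(x, f(z), sigma) for z in zs])
    return np.logaddexp.reduce(log_p) - np.log(n)

# With a small sigma, only samples whose f(z) is extremely close to x in squared
# distance contribute, which is why n must be enormous in high dimensions.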
2.1 Setting up the objective

Is there a shortcut we can take when using sampling to compute Equation 1? In practice, for most z, P(X|z) will be nearly zero, and hence contribute almost nothing to our estimate of P(X). The key idea behind the variational autoencoder is to attempt to sample values of z that are likely to have produced X, and compute P(X) just from those. This means that we need a new function Q(z|X) which can take a value of X and give us a distribution over z values that are likely to produce X. Hopefully the space of z values that are likely under Q will be much smaller than the space of all z’s that are likely under the prior P(z). This lets us, for example, compute Ez∼Q P(X|z) relatively easily. However, if z is sampled from an arbitrary distribution with PDF Q(z), which is not N(0, I), then how does that help us optimize P(X)? The first thing we need to do is relate Ez∼Q P(X|z) and P(X). We’ll see where Q comes from later.

The relationship between Ez∼Q P(X|z) and P(X) is one of the cornerstones of variational Bayesian methods. We begin with the definition of Kullback-Leibler divergence (KL divergence or D) between P(z|X) and Q(z), for some arbitrary Q (which may or may not depend on X):

D[Q(z)‖P(z|X)] = Ez∼Q [log Q(z) − log P(z|X)].    (2)

We can get both P(X) and P(X|z) into this equation by applying Bayes rule to P(z|X):

D[Q(z)‖P(z|X)] = Ez∼Q [log Q(z) − log P(X|z) − log P(z)] + log P(X).    (3)

Here, log P(X) comes out of the expectation because it does not depend on z. Negating both sides, rearranging, and contracting part of Ez∼Q into a KL-divergence term yields:

log P(X) − D[Q(z)‖P(z|X)] = Ez∼Q [log P(X|z)] − D[Q(z)‖P(z)].    (4)

Note that X is fixed, and Q can be any distribution, not just a distribution which does a good job mapping X to the z’s that can produce X. Since we’re interested in inferring P(X), it makes sense to construct a Q which does depend on X, and in particular, one which makes D[Q(z)‖P(z|X)] small:

log P(X) − D[Q(z|X)‖P(z|X)] = Ez∼Q [log P(X|z)] − D[Q(z|X)‖P(z)].    (5)

This equation serves as the core of the variational autoencoder, and it’s worth spending some time thinking about what it says². In two sentences, the left hand side has the quantity we want to maximize: log P(X) (plus an error term, which makes Q produce z’s that can reproduce a given X; this term will become small if Q is high-capacity). The right hand side is something we can optimize via stochastic gradient descent given the right choice of Q (although it may not be obvious yet how). Note that the framework—in particular, the right hand side of Equation 5—has suddenly taken a form which looks like an autoencoder, since Q is “encoding” X into z, and P is “decoding” it to reconstruct X. We’ll explore this connection in more detail later.

² Historically, this math (particularly Equation 5) was known long before VAEs. For example, Helmholtz Machines [16] (see Equation 5) use nearly identical mathematics, with one crucial difference. The integral in our expectations is replaced with a sum in Dayan et al. [16], because Helmholtz Machines assume a discrete distribution for the latent variables. This choice prevents the transformations that make gradient descent tractable in VAEs.
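As a concrete reading of the right hand side of Equation 5, the sketch below estimates it for a single X, assuming Q(z|X) is a diagonal Gaussian N(mu, diag(sig_q²)) and P(X|z) is the isotropic Gaussian from before. Here mu, sig_q, and the decoder f are hypothetical placeholders (in a real VAE they are produced by the encoder and decoder networks), and the KL term uses the closed form for two Gaussians.

import numpy as np

def elbo_estimate(x, mu, sig_q, f, sigma, n_samples=16, seed=0):
    # Right hand side of Equation 5: E_{z~Q}[log P(X|z)] - D[Q(z|X) || P(z)],
    # a lower bound on log P(X).
    rng = np.random.default_rng(seed)
    d = x.size

    # Reconstruction term, estimated by sampling z from Q(z|X) = N(mu, diag(sig_q^2))
    recon = 0.0
    for _ in range(n_samples):
        z = mu + sig_q * rng.standard_normal(mu.size)
        recon += (-0.5 * np.sum((x - f(z)) ** 2) / sigma**2
                  - d * np.log(sigma * np.sqrt(2.0 * np.pi)))
    recon /= n_samples

    # KL term in closed form for a diagonal Gaussian Q against P(z) = N(0, I)
    kl = 0.5 * np.sum(sig_q**2 + mu**2 - 1.0 - 2.0 * np.log(sig_q))

    return recon - kl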