MSc Artificial Intelligence
Master Thesis

Semi-Supervised Learning with Generative Adversarial Networks

by
Liam Schoneveld
11139013

September 1, 2017
36 ECTS
February – September, 2017

Supervisor: Prof. Dr. M. Welling
Daily Supervisor: T. Cohen
MSc Assessor: Dr. E. Gavves

Faculteit der Natuurkunde, Wiskunde en Informatica
Abstract

As society continues to accumulate more and more data, demand for machine learning algorithms that can learn from data with limited human intervention only increases. Semi-supervised learning (SSL) methods, which extend supervised learning algorithms by enabling them to use unlabeled data, play an important role in addressing this challenge. In this thesis, a framework unifying the traditional assumptions and approaches to SSL is defined. A synthesis of the SSL literature then places a range of contemporary approaches into this common framework. Our focus is on methods which use generative adversarial networks (GANs) to perform SSL. We analyse in detail one particular GAN-based SSL approach [Dai et al. (2017)]. This is shown to be closely related to two preceding approaches. Through synthetic experiments we provide an intuitive understanding of, and motivate the formulation of, our focus approach. We then theoretically analyse potential alternative formulations of its loss function. This analysis motivates a number of research questions that centre on possible improvements to, and experiments to better understand, the focus model. While we find support for our hypotheses, our conclusion more broadly is that the focus method is not especially robust.
Acknowledgements

I would like to thank Taco Cohen for supervising my thesis. Despite his busy schedule, Taco was able to provide me with invaluable feedback throughout the course of this project. I am also extremely grateful to Auke Wiggers for his mentoring, discussions and guidance. He really helped me to think about these problems in a more effective way. Tijmen and the rest of the Scyfer team deserve a special mention for providing a working environment that was fun but also set the bar high. My gratitude also goes out to my committee of Max Welling and Efstratios Gavves for agreeing to read and assess my work amid their demanding schedules. I gratefully acknowledge Zihang Dai, author of the paper which is the central focus of this thesis, for his timely and insightful correspondence via email. Finally, I would like to thank my parents, brother, and my grandpa.
Contents

I Background

1 Introduction
1.1 The scarcity of labels
1.2 Semi-supervised learning

2 Overview and contributions
2.1 Overview
2.2 Contributions

3 Assumptions and approaches that enable semi-supervised learning
3.1 Assumptions required for semi-supervised learning
3.1.1 Smoothness assumption
3.1.2 Cluster assumption
3.1.3 Low-density separation
3.1.4 Existence of a discoverable manifold
3.2 Classes of semi-supervised learning algorithms
3.2.1 Methods based on generative models
3.2.2 Methods based on low-density separation
3.2.3 Graph-based methods
3.2.4 Methods based on a change of representation

II Related Work

4 Review of Semi-Supervised Learning Literature
4.1 Semi-supervised learning before the deep learning era
4.2 Semi-supervised deep learning
4.2.1 Autoencoder-based approaches
4.2.2 Regularisation and data augmentation-based approaches
4.2.3 Other approaches
4.3 Semi-supervised generative adversarial networks
4.3.1 Generative adversarial networks
4.3.2 Approaches by which generative adversarial networks can be used for semi-supervised learning

5 Model focus: Good Semi-Supervised Learning that Requires a Bad GAN
5.1 Shannon's entropy and its relation to decision boundaries
5.2 CatGAN
5.3 Improved GAN
5.3.1 The Improved GAN SSL Model
5.4 BadGAN
5.4.1 Implications of the BadGAN model

III Analysis and experiments

6 Enforcing low-density separation
6.1 Approaches to low-density separation based on entropy
6.1.1 Synthetic experiments used in this section
6.1.2 Using entropy to incorporate the low-density separation assumption into our model
6.1.3 Taking advantage of a known prior class distribution
6.1.4 Generating data in low-density regions of the input space
6.2 Viewing entropy maximisation as minimising a Kullback-Leibler divergence
6.3 Theoretical evaluation of different entropy-related loss functions
6.3.1 Similarity of Improved GAN and CatGAN approaches
6.3.2 The CatGAN and Reverse KL approaches may be 'forgetful'
6.3.3 Another approach: removing the K+1th class' constraint in the Improved GAN formulation
6.3.4 Summary
6.4 Experiments with alternative loss functions on synthetic datasets

7 Research questions and hypotheses
RQ7.1 Which loss function formulation is best?
RQ7.2 Can the PixelCNN++ model be replaced by some other density estimate?
7.2.1 Discriminator from a pre-trained generative adversarial network
7.2.2 Pre-trained denoising autoencoder
RQ7.3 Do the generated examples actually contribute to feature learning?
RQ7.4 Is VAT or InfoReg really complementary to BadGAN?

8 Experiments
8.1 PI-MNIST-100
8.1.1 Experimental setup
8.1.2 Effectiveness of different proxies for entropy
8.1.3 Potential for replacing PixelCNN++ model with a DAE or discriminator from a GAN
8.1.4 Extent to which generated images contribute to feature learning
8.1.5 Complementarity of VAT and BadGAN
8.2 SVHN-1k
8.2.1 Experimental setup
8.2.2 Effectiveness of different proxies for entropy
8.2.3 Potential for replacing PixelCNN++ model with a DAE or discriminator from a GAN
8.2.4 Hypotheses as to why our BadGAN implementation does not perform well on SVHN-1k
8.2.5 Experiments summary

9 Conclusions, practical recommendations and suggestions for future work

IV Appendices

A Information regularisation for neural networks
A.1 Derivation
A.2 Intuition and experiments on synthetic datasets
A.3 FastInfoReg: overcoming InfoReg's speed issues
A.4 Performance on PI-MNIST-100

B Viewing entropy minimisation as a KL divergence minimisation problem

References
Part I
Background

1 Introduction

1.1 The scarcity of labels

“If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake.”
Yann LeCun (2016)

Today, data is of ever-increasing abundance. It is estimated that 4.4 zettabytes of digital data existed in 2013, and this is set to rise to 44 zettabytes (i.e., 44 trillion gigabytes) by 2020 [International Data Corporation (2014)]. Such a vast store of information presents substantial opportunities to harness. Machine learning provides us with ways to make the most of these opportunities, through algorithms that essentially enable computers to learn from data.

One significant drawback of many machine learning approaches, however, is that they require annotated data. Annotations can take different forms and be used in different ways. Most commonly though, the annotations required by machine learning algorithms consist of the desired targets or outputs our trained model should produce after being shown the corresponding input.

It is relatively uncommon that data found ‘in the wild’ comes with ready-made annotations. There are some exceptions, such as captions added to images by online content creators, but even these may not be appropriate for the particular objective. Manually annotating data is usually time-consuming and costly. Hence, as Yann LeCun’s quote suggests, there is an ever-growing need for machine learning methods designed to work with a limited stock of annotations. As we explain in the next section, algorithms designed to do so are known as unsupervised or semi-supervised approaches.

1.2 Semi-supervised learning

To define semi-supervised learning (SSL), we begin by defining supervised and unsupervised learning, as SSL lies somewhere in between these two concepts.

Supervised learning algorithms are machine learning approaches which require that every input data point has a corresponding output data point. The goal of these algorithms is often to train a model that can accurately predict the correct outputs for inputs that were not seen during training. That is, the algorithm learns a function from training data that we hope will generalise to unseen data points.

Unsupervised learning algorithms are those which require only input data points – no corresponding outputs are provided or assumed to exist. These algorithms can have a variety of goals: unsupervised generative models are generally tasked with being able to generate new data points from the same distribution as the input data, while a range of other unsupervised methods are focused on learning some new representation of the input data. For instance, one might aim to learn a representation of the data that requires less storage space while retaining most of the input information (i.e., compression).
Figure 1: Illustration showing the general idea behind many SSL classification algorithms, based on the cluster assumption. The known labels in the top panel are propagated to the unlabeled points in the clusters they belong to.

SSL algorithms fall in between these paradigms. Strictly speaking, SSL methods are those designed to be used with datasets comprised of both annotated (or labeled) and unannotated (or unlabeled) subsets. Generally though, these methods assume the number of labeled instances is much smaller than the number of unlabeled instances. This is because unlabeled data tends to be more useful when we have few labeled examples. As explained in Section 3.1, in general these methods rely on some assumption of smoothness or clustering of the input data. The main intuition at the core of most semi-supervised classification methods is illustrated in Figure 1.

In the context of today’s data-rich world as described in the introduction, we believe that SSL methods play a particularly important role. While unsupervised learning is vital, as we cannot expect to annotate even a tiny proportion of the world’s data, we believe SSL might prove equally important, due to its ability to give direction to unsupervised learning methods. That is, the two sides of SSL can assist one another with regards to the practitioner’s task; the supervised side directs the unsupervised side into extracting structure that is more relevant to our particular task, while the unsupervised side provides the supervised side with more usable information.
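To make this setting concrete, the following minimal sketch (not taken from the thesis) trains a classifier on a small labeled subset while adding an auxiliary term computed on unlabeled data. Here that auxiliary term is simple entropy minimisation over the model's own predictions, one of the loss components revisited in Section 6; the toy data, network size and weighting factor `lam` are illustrative assumptions rather than the thesis's experimental setup.

```python
# A minimal sketch of the semi-supervised setting: a small labeled subset,
# a large unlabeled one, and a classifier trained with a supervised loss
# plus an unlabeled-data term (here: entropy minimisation, for illustration).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data: 10,000 inputs of dimension 784 (MNIST-like), but only 100 labels.
x_all = torch.randn(10_000, 784)
y_all = torch.randint(0, 10, (10_000,))
x_lab, y_lab = x_all[:100], y_all[:100]   # small labeled subset
x_unl = x_all[100:]                       # large unlabeled subset

classifier = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
lam = 0.1  # weight on the unlabeled-data term (illustrative choice)

for step in range(100):
    opt.zero_grad()
    # Supervised part: standard cross-entropy on the labeled subset.
    sup_loss = F.cross_entropy(classifier(x_lab), y_lab)
    # Unsupervised part: encourage confident (low-entropy) predictions on a
    # random unlabeled minibatch, pushing decision boundaries away from the
    # regions where the model is unsure.
    idx = torch.randint(0, x_unl.size(0), (256,))
    probs = F.softmax(classifier(x_unl[idx]), dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    loss = sup_loss + lam * entropy
    loss.backward()
    opt.step()
```

The GAN-based models studied later in this thesis replace this naive unlabeled-data term with richer objectives, but the overall structure of a supervised loss plus a term computed on unlabeled data is the same.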
2 Overview and contributions

2.1 Overview

This thesis is structured as follows. In Section 3, we present the assumptions that are required for SSL, and then place the main categories of historic SSL approaches into a contemporary context. In Section 4 we synthesise the SSL literature, focusing mainly on approaches that involve deep learning in some way. Readers who are already well-versed in the basic concepts of SSL and the surrounding literature may wish to skip or skim Sections 3 and 4. In Section 5 we give a detailed background on three related SSL models. These are all based on generative adversarial networks (GANs) and form a central focus of this thesis.

In Section 6 we motivate a number of more basic cost functions used to enable SSL and illustrate their behaviour through synthetic experiments. This exercise gives readers a more intuitive understanding of the approach taken in our focus model. We then undertake a more theoretical analysis of these loss functions, derive some new alternatives, and hypothesise about their potential advantages and disadvantages in the context of our focus model.

Based on the preceding analysis and discussion, in Section 7 we formulate a number of research questions. Then in Section 8 we address each of these questions through larger empirical experiments, and present and discuss the results. Finally, in Section 9 we conclude our study, give some practical recommendations, and suggest promising directions for future research.

2.2 Contributions

The contributions made in this thesis include:

• A broad review of deep learning-based approaches to SSL, placed into a historical context (Sections 3, 4 and 5).

• An intuitive walk-through that motivates and illustrates the behaviour of a number of loss functions commonly found in SSL models (Section 6.1). This is also used to more clearly explain the logic behind our focus model (the BadGAN model, introduced in Section 5.4).

• Theoretical analysis of these loss functions and the introduction of alternative options, alongside a theoretical comparison between the approaches used by the CatGAN (introduced in Section 5.2) and Improved GAN (introduced in Section 5.3) models (remainder of Section 6).

• Larger empirical experiments, which address the following research questions:
  – Which of our analysed loss functions, or approaches to the BadGAN-style model, works best?
  – Can the PixelCNN++ component of the BadGAN model be replaced by something that is faster to train and infer from?
  – Do the generated images actively contribute to higher-order feature learning of the classifier network in some way?
  – Is Virtual Adversarial Training [Miyato et al. (2017)] truly orthogonal to a BadGAN-style approach, as asserted in Dai et al. (2017)?

• The revival of Information Regularisation, an SSL approach from 2006, derivations making it suitable for use with neural networks, and evidence suggesting it is competitive with modern approaches (Appendix A).