Neural Network Methods for Natural Language Processing
Contents
Preface
Introduction
Challenges of NLP
Neural Networks and Deep Learning
Deep Learning in NLP
Coverage and Organization
What's not Covered
A Note on Terminology
Mathematical Notation
Part I: Supervised Classification and Feed-Forward Neural Networks
Learning Basics & Linear Models
Supervised Learning and Parameterized Functions
Train, Test, and Validation Sets
Linear Models
Binary Classification
Log-linear Binary Classification
Multi-class Classification
Representations
One-Hot and Dense Vector Representations
Log-linear Multi-class Classification
Training as Optimization
Loss Functions
Regularization
Gradient-based Optimization
Stochastic Gradient Descent
Worked-out Example
Beyond SGD
From Linear Models to Multi-layer Perceptrons
Limitations of Linear Models: The XOR Problem
Nonlinear Input Transformations
Kernel Methods
Trainable Mapping Functions
Feed-Forward NNs
A Brain-inspired Metaphor
In Mathematical Notation
Representation Power
Common Nonlinearities
Loss Functions
Regularization and Dropout
Similarity and Distance Layers
Embedding Layers
Neural Network Training
The Computation Graph Abstraction
Forward Computation
Backward Computation (Derivatives, Backprop)
Software
Implementation Recipe
Network Composition
Practicalities
Choice of Optimization Algorithm
Initialization
Restarts and Ensembles
Vanishing and Exploding Gradients
Saturation and Dead Neurons
Shuffling
Learning Rate
Minibatches
Part II: Working with Natural Language Data
Features for Textual Data
Typology of NLP Classification Problems
Features for NLP Problems
Directly Observable Properties
Inferred Linguistic Properties
Core Features vs. Combination Features
Ngram Features
Distributional Features
Case Studies of NLP Features
Document Classification: Language Identification
Document Classification: Topic Classification
Document Classification: Authorship Attribution
Word-in-context: Part of Speech Tagging
Word-in-context: Named Entity Recognition
Word in Context, Linguistic Features: Preposition Sense Disambiguation
Relation Between Words in Context: Arc-Factored Parsing
From Textual Features to Inputs
Encoding Categorical Features
One-hot Encodings
Dense Encodings (Feature Embeddings)
Dense Vectors vs. One-hot Representations
Combining Dense Vectors
Window-based Features
Variable Number of Features: Continuous Bag of Words
Relation Between One-hot and Dense Vectors
Odds and Ends
Distance and Position Features
Padding, Unknown Words, and Word Dropout
Feature Combinations
Vector Sharing
Dimensionality
Embeddings Vocabulary
Network's Output
Example: Part-of-Speech Tagging
Example: Arc-factored Parsing
Language Modeling
The Language Modeling Task
Evaluating Language Models: Perplexity
Traditional Approaches to Language Modeling
Further Reading
Limitations of Traditional Language Models
Neural Language Models
Using Language Models for Generation
Byproduct: Word Representations
Pre-trained Word Representations
Random Initialization
Supervised Task-specific Pre-training
Unsupervised Pre-training
Using Pre-trained Embeddings
Word Embedding Algorithms
Distributional Hypothesis and Word Representations
From Neural Language Models to Distributed Representations
Connecting the Worlds
Other Algorithms
The Choice of Contexts
Window Approach
Sentences, Paragraphs, or Documents
Syntactic Window
Multilingual
Character-based and Sub-word Representations
Dealing with Multi-word Units and Word Inflections
Limitations of Distributional Methods
Using Word Embeddings
Obtaining Word Vectors
Word Similarity
Word Clustering
Finding Similar Words
Similarity to a Group of Words
Odd-one Out
Short Document Similarity
Word Analogies
Retrofitting and Projections
Practicalities and Pitfalls
Case Study: A Feed-Forward Architecture for Sentence Meaning Inference
Natural Language Inference and the SNLI Dataset
A Textual Similarity Network
Part III: Specialized Architectures
Ngram Detectors - Convolutional NNs
Basic Convolution + Pooling
1D Convolutions Over Text
Vector Pooling
Variations
Alternative: Feature Hashing
Hierarchical Convolutions
Recurrent NNs - Modeling Sequences & Stacks
The RNN Abstraction
RNN Training
Common RNN Usage-patterns
Acceptor
Encoder
Transducer
Bidirectional RNNs (biRNN)
Multi-layer (stacked) RNNs
RNNs for Representing Stacks
A Note on Reading the Literature
Concrete Recurrent NN Architectures
CBOW as an RNN
Simple RNN
Gated Architectures
LSTM
GRU
Other Variants
Dropout in RNNs
Modeling with Recurrent Networks
Acceptors
Sentiment Classification
Subject-verb Agreement Grammaticality Detection
RNNs as Feature Extractors
Part-of-speech Tagging
RNN–CNN Document Classification
Arc-factored Dependency Parsing
Conditioned Generation
RNN Generators
Training Generators
Conditioned Generation (Encoder-Decoder)
Sequence to Sequence Models
Applications
Other Conditioning Contexts
Unsupervised Sentence Similarity
Conditioned Generation with Attention
Computational Complexity
Interpretability
Attention-based Models in NLP
Machine Translation
Morphological Inflection
Syntactic Parsing
Part IV: Additional Topics
Modeling Trees with Recursive NNs
Formal Definition
Extensions and Variations
Training Recursive Neural Networks
A Simple Alternative: Linearized Trees
Outlook
Structured Output Prediction
Search-based Structured Prediction
Structured Prediction with Linear Models
Nonlinear Structured Prediction
Probabilistic Objective (CRF)
Approximate Search
Reranking
See Also
Greedy Structured Prediction
Conditional Generation as Structured Output Prediction
Examples
Search-based Structured Prediction: First-order Dependency Parsing
Neural-CRF for Named Entity Recognition
Approximate NER-CRF With Beam-Search
Cascaded, Multi-task & Semi-supervised Learning
Model Cascading
Multi-task Learning
Training in a Multi-task Setup
Selective Sharing
Word-embeddings Pre-training as Multi-task Learning
Multi-task Learning in Conditioned Generation
Multi-task Learning as Regularization
Caveats
Semi-supervised Learning
Examples
Gaze-prediction and Sentence Compression
Arc Labeling and Syntactic Parsing
Preposition Sense Disambiguation and Preposition Translation Prediction
Conditioned Generation: Multilingual Machine Translation, Parsing, and Image Captioning
Outlook
Conclusion
What Have We Seen?
The Challenges Ahead
Bibliography
Neural Network Methods for Natural Language Processing
Yoav Goldberg, Bar Ilan University
Synthesis Lectures on Human Language Technologies #37
Morgan & Claypool Publishers
Copyright © 2017 by Morgan & Claypool
Neural Network Methods for Natural Language Processing
Yoav Goldberg
www.morganclaypool.com
ISBN: 9781627052986 (paperback)
ISBN: 9781627052955 (ebook)
DOI: 10.2200/S00762ED1V01Y201703HLT037
A Publication in the Morgan & Claypool Publishers series
Synthesis Lectures on Human Language Technologies, Lecture #37
Series Editor: Graeme Hirst, University of Toronto
Series ISSN: 1947-4040 (print), 1947-4059 (electronic)
ABSTRACT
Neural networks are a family of powerful machine learning models. This book focuses on the application of neural network models to natural language data. The first half of the book (Parts I and II) covers the basics of supervised machine learning and feed-forward neural networks, the basics of working with machine learning over language data, and the use of vector-based rather than symbolic representations for words. It also covers the computation-graph abstraction, which allows one to easily define and train arbitrary neural networks, and is the basis behind the design of contemporary neural network software libraries.
The second part of the book (Parts III and IV) introduces more specialized neural network architectures, including 1D convolutional neural networks, recurrent neural networks, conditioned-generation models, and attention-based models. These architectures and techniques are the driving force behind state-of-the-art algorithms for machine translation, syntactic parsing, and many other applications. Finally, the book also discusses tree-shaped networks, structured prediction, and the prospects of multi-task learning.

KEYWORDS
natural language processing, machine learning, supervised learning, deep learning, neural networks, word embeddings, recurrent neural networks, sequence to sequence models
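
The abstract highlights the computation-graph abstraction as the basis of contemporary neural network libraries. As a purely illustrative aside (not code from the book), the minimal Python sketch below shows the idea: each node records its inputs and a local backward rule during the forward pass, and backpropagation is a reverse walk over the resulting graph. All names here (Node, sigmoid, backward) are hypothetical, chosen only for this sketch.

    # Minimal sketch of a computation graph with reverse-mode automatic
    # differentiation over scalars. Illustrative only; not the book's code.
    import math

    class Node:
        # A scalar value plus a record of how it was computed.
        def __init__(self, value, parents=(), backward_fn=lambda grad: ()):
            self.value = value
            self.parents = parents          # nodes this value was computed from
            self.backward_fn = backward_fn  # maps d(output)/d(self) to parent grads
            self.grad = 0.0

        def __add__(self, other):
            return Node(self.value + other.value, (self, other),
                        lambda g: (g, g))

        def __mul__(self, other):
            return Node(self.value * other.value, (self, other),
                        lambda g: (g * other.value, g * self.value))

    def sigmoid(x):
        s = 1.0 / (1.0 + math.exp(-x.value))
        return Node(s, (x,), lambda g: (g * s * (1.0 - s),))

    def backward(output):
        # Topologically order the graph, then accumulate gradients in reverse.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node.parents:
                    visit(p)
                order.append(node)
        visit(output)
        output.grad = 1.0
        for node in reversed(order):
            for parent, g in zip(node.parents, node.backward_fn(node.grad)):
                parent.grad += g

    # Tiny example: y = sigmoid(w * x + b); gradients flow back to w and b.
    w, x, b = Node(0.5), Node(2.0), Node(-1.0)
    y = sigmoid(w * x + b)
    backward(y)
    print(y.value, w.grad, b.grad)   # prints 0.5 0.5 0.25

Real frameworks generalize this pattern from scalars to tensors and supply many more operations, but the define-the-graph-then-differentiate workflow is the same one the book builds on.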