Neural Network Methods for Natural Language Processing
Contents
Preface
Introduction
Challenges of NLP
Neural Networks and Deep Learning
Deep Learning in NLP
Coverage and Organization
What's not Covered
A Note on Terminology
Mathematical Notation
Part I: Supervised Classification and Feed-Forward Neural Networks
Learning Basics & Linear Models
Supervised Learning and Parameterized Functions
Train, Test, and Validation Sets
Linear Models
Binary Classification
Log-linear Binary Classification
Multi-class Classification
Representations
One-Hot and Dense Vector Representations
Log-linear Multi-class Classification
Training as Optimization
Loss Functions
Regularization
Gradient-based Optimization
Stochastic Gradient Descent
Worked-out Example
Beyond SGD
From Linear Models to Multi-layer Perceptrons
Limitations of Linear Models: The XOR Problem
Nonlinear Input Transformations
Kernel Methods
Trainable Mapping Functions
Feed-Forward NNs
A Brain-inspired Metaphor
In Mathematical Notation
Representation Power
Common Nonlinearities
Loss Functions
Regularization and Dropout
Similarity and Distance Layers
Embedding Layers
Neural Network Training
The Computation Graph Abstraction
Forward Computation
Backward Computation (Derivatives, Backprop)
Software
Implementation Recipe
Network Composition
Practicalities
Choice of Optimization Algorithm
Initialization
Restarts and Ensembles
Vanishing and Exploding Gradients
Saturation and Dead Neurons
Shuffling
Learning Rate
Minibatches
Part II: Working with Natural Language Data
Features for Textual Data
Typology of NLP Classification Problems
Features for NLP Problems
Directly Observable Properties
Inferred Linguistic Properties
Core Features vs. Combination Features
Ngram Features
Distributional Features
Case Studies of NLP Features
Document Classification: Language Identification
Document Classification: Topic Classification
Document Classification: Authorship Attribution
Word-in-context: Part of Speech Tagging
Word-in-context: Named Entity Recognition
Word in Context, Linguistic Features: Preposition Sense Disambiguation
Relation Between Words in Context: Arc-Factored Parsing
From Textual Features to Inputs
Encoding Categorical Features
One-hot Encodings
Dense Encodings (Feature Embeddings)
Dense Vectors vs. One-hot Representations
Combining Dense Vectors
Window-based Features
Variable Number of Features: Continuous Bag of Words
Relation Between One-hot and Dense Vectors
Odds and Ends
Distance and Position Features
Padding, Unknown Words, and Word Dropout
Feature Combinations
Vector Sharing
Dimensionality
Embeddings Vocabulary
Network's Output
Example: Part-of-Speech Tagging
Example: Arc-factored Parsing
Language Modeling
The Language Modeling Task
Evaluating Language Models: Perplexity
Traditional Approaches to Language Modeling
Further Reading
Limitations of Traditional Language Models
Neural Language Models
Using Language Models for Generation
Byproduct: Word Representations
Pre-trained Word Representations
Random Initialization
Supervised Task-specific Pre-training
Unsupervised Pre-training
Using Pre-trained Embeddings
Word Embedding Algorithms
Distributional Hypothesis and Word Representations
From Neural Language Models to Distributed Representations
Connecting the Worlds
Other Algorithms
The Choice of Contexts
Window Approach
Sentences, Paragraphs, or Documents
Syntactic Window
Multilingual
Character-based and Sub-word Representations
Dealing with Multi-word Units and Word Inflections
Limitations of Distributional Methods
Using Word Embeddings
Obtaining Word Vectors
Word Similarity
Word Clustering
Finding Similar Words
Similarity to a Group of Words
Odd-one Out
Short Document Similarity
Word Analogies
Retrofitting and Projections
Practicalities and Pitfalls
Case Study: A Feed-Forward Architecture for Sentence Meaning Inference
Natural Language Inference and the SNLI Dataset
A Textual Similarity Network
Part III: Specialized Architectures
Ngram Detectors - Convolutional NNs
Basic Convolution + Pooling
1D Convolutions Over Text
Vector Pooling
Variations
Alternative: Feature Hashing
Hierarchical Convolutions
Recurrent NNs - Modeling Sequences & Stacks
The RNN Abstraction
RNN Training
Common RNN Usage-patterns
Acceptor
Encoder
Transducer
Bidirectional RNNs (biRNN)
Multi-layer (stacked) RNNs
RNNs for Representing Stacks
A Note on Reading the Literature
Concrete Recurrent NN Architectures
CBOW as an RNN
Simple RNN
Gated Architectures
LSTM
GRU
Other Variants
Dropout in RNNs
Modeling with Recurrent Networks
Acceptors
Sentiment Classification
Subject-verb Agreement Grammaticality Detection
RNNs as Feature Extractors
Part-of-speech Tagging
RNN–CNN Document Classification
Arc-factored Dependency Parsing
Conditioned Generation
RNN Generators
Training Generators
Conditioned Generation (Encoder-Decoder)
Sequence to Sequence Models
Applications
Other Conditioning Contexts
Unsupervised Sentence Similarity
Conditioned Generation with Attention
Computational Complexity
Interpretability
Attention-based Models in NLP
Machine Translation
Morphological Inflection
Syntactic Parsing
Part IV: Additional Topics
Modeling Trees with Recursive NNs
Formal Definition
Extensions and Variations
Training Recursive Neural Networks
A Simple Alternative: Linearized Trees
Outlook
Structured Output Prediction
Search-based Structured Prediction
Structured Prediction with Linear Models
Nonlinear Structured Prediction
Probabilistic Objective (CRF)
Approximate Search
Reranking
See Also
Greedy Structured Prediction
Conditional Generation as Structured Output Prediction
Examples
Search-based Structured Prediction: First-order Dependency Parsing
Neural-CRF for Named Entity Recognition
Approximate NER-CRF With Beam-Search
Cascaded, Multi-task & Semi-supervised Learning
Model Cascading
Multi-task Learning
Training in a Multi-task Setup
Selective Sharing
Word-embeddings Pre-training as Multi-task Learning
Multi-task Learning in Conditioned Generation
Multi-task Learning as Regularization
Caveats
Semi-supervised Learning
Examples
Gaze-prediction and Sentence Compression
Arc Labeling and Syntactic Parsing
Preposition Sense Disambiguation and Preposition Translation Prediction
Conditioned Generation: Multilingual Machine Translation, Parsing, and Image Captioning
Outlook
Conclusion
What Have We Seen?
The Challenges Ahead
Bibliography
Neural Network Methods for Natural Language Processing
Yoav Goldberg, Bar Ilan University
Synthesis Lectures on Human Language Technologies #37
Morgan & Claypool Publishers
Copyright © 2017 by Morgan & Claypool
Neural Network Methods for Natural Language Processing
Yoav Goldberg
www.morganclaypool.com
ISBN: 9781627052986 (paperback)
ISBN: 9781627052955 (ebook)
DOI: 10.2200/S00762ED1V01Y201703HLT037
A Publication in the Morgan & Claypool Publishers series
Synthesis Lectures on Human Language Technologies, Lecture #37
Series Editor: Graeme Hirst, University of Toronto
Series ISSN: 1947-4040 (print), 1947-4059 (electronic)
ABSTRACT
Neural networks are a family of powerful machine learning models. This book focuses on the application of neural network models to natural language data. The first half of the book (Parts I and II) covers the basics of supervised machine learning and feed-forward neural networks, the basics of working with machine learning over language data, and the use of vector-based rather than symbolic representations for words. It also covers the computation-graph abstraction, which allows one to easily define and train arbitrary neural networks, and is the basis behind the design of contemporary neural network software libraries.
The second part of the book (Parts III and IV) introduces more specialized neural network architectures, including 1D convolutional neural networks, recurrent neural networks, conditioned-generation models, and attention-based models. These architectures and techniques are the driving force behind state-of-the-art algorithms for machine translation, syntactic parsing, and many other applications. Finally, the book also discusses tree-shaped networks, structured prediction, and the prospects of multi-task learning.

KEYWORDS
natural language processing, machine learning, supervised learning, deep learning, neural networks, word embeddings, recurrent neural networks, sequence to sequence models
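
The abstract highlights the computation-graph abstraction as the basis of contemporary neural network libraries. As a purely illustrative aside (not code from the book), the minimal Python sketch below shows the idea: each node records its inputs and a local backward rule during the forward pass, and backpropagation is a reverse walk over the resulting graph. All names here (Node, sigmoid, backward) are hypothetical, chosen only for this sketch.

    # Minimal sketch of a computation graph with reverse-mode automatic
    # differentiation over scalars. Illustrative only; not the book's code.
    import math

    class Node:
        # A scalar value plus a record of how it was computed.
        def __init__(self, value, parents=(), backward_fn=lambda grad: ()):
            self.value = value
            self.parents = parents          # nodes this value was computed from
            self.backward_fn = backward_fn  # maps d(output)/d(self) to parent grads
            self.grad = 0.0

        def __add__(self, other):
            return Node(self.value + other.value, (self, other),
                        lambda g: (g, g))

        def __mul__(self, other):
            return Node(self.value * other.value, (self, other),
                        lambda g: (g * other.value, g * self.value))

    def sigmoid(x):
        s = 1.0 / (1.0 + math.exp(-x.value))
        return Node(s, (x,), lambda g: (g * s * (1.0 - s),))

    def backward(output):
        # Topologically order the graph, then accumulate gradients in reverse.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node.parents:
                    visit(p)
                order.append(node)
        visit(output)
        output.grad = 1.0
        for node in reversed(order):
            for parent, g in zip(node.parents, node.backward_fn(node.grad)):
                parent.grad += g

    # Tiny example: y = sigmoid(w * x + b); gradients flow back to w and b.
    w, x, b = Node(0.5), Node(2.0), Node(-1.0)
    y = sigmoid(w * x + b)
    backward(y)
    print(y.value, w.grad, b.grad)   # prints 0.5 0.5 0.25

Real frameworks generalize this pattern from scalars to tensors and supply many more operations, but the define-the-graph-then-differentiate workflow is the same one the book builds on.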