Natural Language Processing with Python

Table of Contents
Preface
Audience
Emphasis
What You Will Learn
Organization
Why Python?
Software Requirements
Natural Language Toolkit (NLTK)
For Instructors
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Royalties
Chapter 1. Language Processing and Python
1.1  Computing with Language: Texts and Words
Getting Started with Python
Getting Started with NLTK
Searching Text
Counting Vocabulary
1.2  A Closer Look at Python: Texts as Lists of Words
Lists
Indexing Lists
Variables
Strings
1.3  Computing with Language: Simple Statistics
Frequency Distributions
Fine-Grained Selection of Words
Collocations and Bigrams
Counting Other Things
1.4  Back to Python: Making Decisions and Taking Control
Conditionals
Operating on Every Element
Nested Code Blocks
Looping with Conditions
1.5  Automatic Natural Language Understanding
Word Sense Disambiguation
Pronoun Resolution
Generating Language Output
Machine Translation
Spoken Dialogue Systems
Textual Entailment
Limitations of NLP
1.6  Summary
1.7  Further Reading
1.8  Exercises
Chapter 2. Accessing Text Corpora and Lexical Resources
2.1  Accessing Text Corpora
Gutenberg Corpus
Web and Chat Text
Brown Corpus
Reuters Corpus
Inaugural Address Corpus
Annotated Text Corpora
Corpora in Other Languages
Text Corpus Structure
Loading Your Own Corpus
2.2  Conditional Frequency Distributions
Conditions and Events
Counting Words by Genre
Plotting and Tabulating Distributions
Generating Random Text with Bigrams
2.3  More Python: Reusing Code
Creating Programs with a Text Editor
Functions
Modules
2.4  Lexical Resources
Wordlist Corpora
A Pronouncing Dictionary
Comparative Wordlists
Shoebox and Toolbox Lexicons
2.5  WordNet
Senses and Synonyms
The WordNet Hierarchy
More Lexical Relations
Semantic Similarity
2.6  Summary
2.7  Further Reading
2.8  Exercises
Chapter 3. Processing Raw Text
3.1  Accessing Text from the Web and from Disk
Electronic Books
Dealing with HTML
Processing Search Engine Results
Processing RSS Feeds
Reading Local Files
Extracting Text from PDF, MSWord, and Other Binary Formats
Capturing User Input
The NLP Pipeline
3.2  Strings: Text Processing at the Lowest Level
Basic Operations with Strings
Printing Strings
Accessing Individual Characters
Accessing Substrings
More Operations on Strings
The Difference Between Lists and Strings
3.3  Text Processing with Unicode
What Is Unicode?
Extracting Encoded Text from Files
Using Your Local Encoding in Python
3.4  Regular Expressions for Detecting Word Patterns
Using Basic Metacharacters
Ranges and Closures
3.5  Useful Applications of Regular Expressions
Extracting Word Pieces
Doing More with Word Pieces
Finding Word Stems
Searching Tokenized Text
3.6  Normalizing Text
Stemmers
Lemmatization
3.7  Regular Expressions for Tokenizing Text
Simple Approaches to Tokenization
NLTK’s Regular Expression Tokenizer
Further Issues with Tokenization
3.8  Segmentation
Sentence Segmentation
Word Segmentation
3.9  Formatting: From Lists to Strings
From Lists to Strings
Strings and Formats
Lining Things Up
Writing Results to a File
Text Wrapping
3.10  Summary
3.11  Further Reading
3.12  Exercises
Chapter 4. Writing Structured Programs
4.1  Back to the Basics
Assignment
Equality
Conditionals
4.2  Sequences
Operating on Sequence Types
Combining Different Sequence Types
Generator Expressions
4.3  Questions of Style
Python Coding Style
Procedural Versus Declarative Style
Some Legitimate Uses for Counters
4.4  Functions: The Foundation of Structured Programming
Function Inputs and Outputs
Parameter Passing
Variable Scope
Checking Parameter Types
Functional Decomposition
Documenting Functions
4.5  Doing More with Functions
Functions As Arguments
Accumulative Functions
Higher-Order Functions
Named Arguments
4.6  Program Development
Structure of a Python Module
Multimodule Programs
Sources of Error
Debugging Techniques
Defensive Programming
4.7  Algorithm Design
Recursion
Space-Time Trade-offs
Dynamic Programming
4.8  A Sample of Python Libraries
Matplotlib
NetworkX
csv
NumPy
Other Python Libraries
4.9  Summary
4.10  Further Reading
4.11  Exercises
Chapter 5. Categorizing and Tagging Words
5.1  Using a Tagger
5.2  Tagged Corpora
Representing Tagged Tokens
Reading Tagged Corpora
A Simplified Part-of-Speech Tagset
Nouns
Verbs
Adjectives and Adverbs
Unsimplified Tags
Exploring Tagged Corpora
5.3  Mapping Words to Properties Using Python Dictionaries
Indexing Lists Versus Dictionaries
Dictionaries in Python
Defining Dictionaries
Default Dictionaries
Incrementally Updating a Dictionary
Complex Keys and Values
Inverting a Dictionary
5.4  Automatic Tagging
The Default Tagger
The Regular Expression Tagger
The Lookup Tagger
Evaluation
5.5  N-Gram Tagging
Unigram Tagging
Separating the Training and Testing Data
General N-Gram Tagging
Combining Taggers
Tagging Unknown Words
Storing Taggers
Performance Limitations
Tagging Across Sentence Boundaries
5.6  Transformation-Based Tagging
5.7  How to Determine the Category of a Word
Morphological Clues
Syntactic Clues
Semantic Clues
New Words
Morphology in Part-of-Speech Tagsets
5.8  Summary
5.9  Further Reading
5.10  Exercises
Chapter 6. Learning to Classify Text
6.1  Supervised Classification
Gender Identification
Choosing the Right Features
Document Classification
Part-of-Speech Tagging
Exploiting Context
Sequence Classification
Other Methods for Sequence Classification
6.2  Further Examples of Supervised Classification
Sentence Segmentation
Identifying Dialogue Act Types
Recognizing Textual Entailment
Scaling Up to Large Datasets
6.3  Evaluation
The Test Set
Accuracy
Precision and Recall
Confusion Matrices
Cross-Validation
6.4  Decision Trees
Entropy and Information Gain
6.5  Naive Bayes Classifiers
Underlying Probabilistic Model
Zero Counts and Smoothing
Non-Binary Features
The Naivete of Independence
The Cause of Double-Counting
6.6  Maximum Entropy Classifiers
The Maximum Entropy Model
Maximizing Entropy
Generative Versus Conditional Classifiers
6.7  Modeling Linguistic Patterns
What Do Models Tell Us?
6.8  Summary
6.9  Further Reading
6.10  Exercises
Chapter 7. Extracting Information from Text
7.1  Information Extraction
Information Extraction Architecture
7.2  Chunking
Noun Phrase Chunking
Tag Patterns
Chunking with Regular Expressions
Exploring Text Corpora
Chinking
Representing Chunks: Tags Versus Trees
7.3  Developing and Evaluating Chunkers
Reading IOB Format and the CoNLL-2000 Chunking Corpus
Simple Evaluation and Baselines
Training Classifier-Based Chunkers
7.4  Recursion in Linguistic Structure
Building Nested Structure with Cascaded Chunkers
Trees
Tree Traversal
7.5  Named Entity Recognition
7.6  Relation Extraction
7.7  Summary
7.8  Further Reading
7.9  Exercises
Chapter 8. Analyzing Sentence Structure
8.1  Some Grammatical Dilemmas
Linguistic Data and Unlimited Possibilities
Ubiquitous Ambiguity
8.2  What’s the Use of Syntax?
Beyond n-grams
8.3  Context-Free Grammar
A Simple Grammar
Writing Your Own Grammars
Recursion in Syntactic Structure
8.4  Parsing with Context-Free Grammar
Recursive Descent Parsing
Shift-Reduce Parsing
The Left-Corner Parser
Well-Formed Substring Tables
8.5  Dependencies and Dependency Grammar
Valency and the Lexicon
Scaling Up
8.6  Grammar Development
Treebanks and Grammars
Pernicious Ambiguity
Weighted Grammar
8.7  Summary
8.8  Further Reading
8.9  Exercises
Chapter 9. Building Feature-Based Grammars
9.1  Grammatical Features
Syntactic Agreement
Using Attributes and Constraints
Terminology
9.2  Processing Feature Structures
Subsumption and Unification
9.3  Extending a Feature-Based Grammar
Subcategorization
Heads Revisited
Auxiliary Verbs and Inversion
Unbounded Dependency Constructions
Case and Gender in German
9.4  Summary
9.5  Further Reading
9.6  Exercises
Chapter 10. Analyzing the Meaning of Sentences
10.1  Natural Language Understanding
Querying a Database
Natural Language, Semantics, and Logic
10.2  Propositional Logic
10.3  First-Order Logic
Syntax
First-Order Theorem Proving
Summarizing the Language of First-Order Logic
Truth in Model
Individual Variables and Assignments
Quantification
Quantifier Scope Ambiguity
Model Building
10.4  The Semantics of English Sentences
Compositional Semantics in Feature-Based Grammar
The λ-Calculus
Quantified NPs
Transitive Verbs
Quantifier Ambiguity Revisited
10.5  Discourse Semantics
Discourse Representation Theory
Discourse Processing
10.6  Summary
10.7  Further Reading
10.8  Exercises
Chapter 11. Managing Linguistic Data
11.1  Corpus Structure: A Case Study
The Structure of TIMIT
Notable Design Features
Fundamental Data Types
11.2  The Life Cycle of a Corpus
Three Corpus Creation Scenarios
Quality Control
Curation Versus Evolution
11.3  Acquiring Data
Obtaining Data from the Web
Obtaining Data from Word Processor Files
Obtaining Data from Spreadsheets and Databases
Converting Data Formats
Deciding Which Layers of Annotation to Include
Standards and Tools
Special Considerations When Working with Endangered Languages
11.4  Working with XML
Using XML for Linguistic Structures
The Role of XML
The ElementTree Interface
Using ElementTree for Accessing Toolbox Data
Formatting Entries
11.5  Working with Toolbox Data
Adding a Field to Each Entry
Validating a Toolbox Lexicon
11.6  Describing Language Resources Using OLAC Metadata
What Is Metadata?
OLAC: Open Language Archives Community
11.7  Summary
11.8  Further Reading
11.9  Exercises
Afterword: The Language Challenge
Language Processing Versus Symbol Processing
Contemporary Philosophical Divides
NLTK Roadmap
Envoi...
Bibliography
NLTK Index
General Index
Natural Language Processing with Python
Natural Language Processing with Python

Steven Bird, Ewan Klein, and Edward Loper

Beijing • Cambridge • Farnham • Köln • Sebastopol • Taipei • Tokyo
Natural Language Processing with Python
by Steven Bird, Ewan Klein, and Edward Loper

Copyright © 2009 Steven Bird, Ewan Klein, and Edward Loper. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editor: Julie Steele
Production Editor: Loranah Dimant
Copyeditor: Genevieve d’Entremont
Proofreader: Loranah Dimant
Indexer: Ellen Troutman Zaig
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
June 2009: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Natural Language Processing with Python, the image of a right whale, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-0-596-51649-9
Table of Contents

Preface  ix

1. Language Processing and Python  1
   1.1 Computing with Language: Texts and Words  1
   1.2 A Closer Look at Python: Texts as Lists of Words  10
   1.3 Computing with Language: Simple Statistics  16
   1.4 Back to Python: Making Decisions and Taking Control  22
   1.5 Automatic Natural Language Understanding  27
   1.6 Summary  33
   1.7 Further Reading  34
   1.8 Exercises  35

2. Accessing Text Corpora and Lexical Resources  39
   2.1 Accessing Text Corpora  39
   2.2 Conditional Frequency Distributions  52
   2.3 More Python: Reusing Code  56
   2.4 Lexical Resources  59
   2.5 WordNet  67
   2.6 Summary  73
   2.7 Further Reading  73
   2.8 Exercises  74

3. Processing Raw Text  79
   3.1 Accessing Text from the Web and from Disk  80
   3.2 Strings: Text Processing at the Lowest Level  87
   3.3 Text Processing with Unicode  93
   3.4 Regular Expressions for Detecting Word Patterns  97
   3.5 Useful Applications of Regular Expressions  102
   3.6 Normalizing Text  107
   3.7 Regular Expressions for Tokenizing Text  109
   3.8 Segmentation  112
   3.9 Formatting: From Lists to Strings  116
   3.10 Summary  121
   3.11 Further Reading  122
   3.12 Exercises  123

4. Writing Structured Programs  129
   4.1 Back to the Basics  130
   4.2 Sequences  133
   4.3 Questions of Style  138
   4.4 Functions: The Foundation of Structured Programming  142
   4.5 Doing More with Functions  149
   4.6 Program Development  154
   4.7 Algorithm Design  160
   4.8 A Sample of Python Libraries  167
   4.9 Summary  172
   4.10 Further Reading  173
   4.11 Exercises  173

5. Categorizing and Tagging Words  179
   5.1 Using a Tagger  179
   5.2 Tagged Corpora  181
   5.3 Mapping Words to Properties Using Python Dictionaries  189
   5.4 Automatic Tagging  198
   5.5 N-Gram Tagging  202
   5.6 Transformation-Based Tagging  208
   5.7 How to Determine the Category of a Word  210
   5.8 Summary  213
   5.9 Further Reading  214
   5.10 Exercises  215

6. Learning to Classify Text  221
   6.1 Supervised Classification  221
   6.2 Further Examples of Supervised Classification  233
   6.3 Evaluation  237
   6.4 Decision Trees  242
   6.5 Naive Bayes Classifiers  245
   6.6 Maximum Entropy Classifiers  250
   6.7 Modeling Linguistic Patterns  254
   6.8 Summary  256
   6.9 Further Reading  256
   6.10 Exercises  257

7. Extracting Information from Text  261
   7.1 Information Extraction  261