
Text Mining with R: A Tidy Approach [True PDF].pdf

Cover
Copyright
Table of Contents
Preface
Outline
Topics This Book Does Not Cover
About This Book
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgements
Chapter 1. The Tidy Text Format
Contrasting Tidy Text with Other Data Structures
The unnest_tokens Function
Tidying the Works of Jane Austen
The gutenbergr Package
Word Frequencies
Summary
Chapter 2. Sentiment Analysis with Tidy Data
The sentiments Dataset
Sentiment Analysis with Inner Join
Comparing the Three Sentiment Dictionaries
Most Common Positive and Negative Words
Wordclouds
Looking at Units Beyond Just Words
Summary
Chapter 3. Analyzing Word and Document Frequency: tf-idf
Term Frequency in Jane Austen’s Novels
Zipf’s Law
The bind_tf_idf Function
A Corpus of Physics Texts
Summary
Chapter 4. Relationships Between Words: N-grams and Correlations
Tokenizing by N-gram
Counting and Filtering N-grams
Analyzing Bigrams
Using Bigrams to Provide Context in Sentiment Analysis
Visualizing a Network of Bigrams with ggraph
Visualizing Bigrams in Other Texts
Counting and Correlating Pairs of Words with the widyr Package
Counting and Correlating Among Sections
Examining Pairwise Correlation
Summary
Chapter 5. Converting to and from Nontidy Formats
Tidying a Document-Term Matrix
Tidying DocumentTermMatrix Objects
Tidying dfm Objects
Casting Tidy Text Data into a Matrix
Tidying Corpus Objects with Metadata
Example: Mining Financial Articles
Summary
Chapter 6. Topic Modeling
Latent Dirichlet Allocation
Word-Topic Probabilities
Document-Topic Probabilities
Example: The Great Library Heist
LDA on Chapters
Per-Document Classification
By-Word Assignments: augment
Alternative LDA Implementations
Summary
Chapter 7. Case Study: Comparing Twitter Archives
Getting the Data and Distribution of Tweets
Word Frequencies
Comparing Word Usage
Changes in Word Use
Favorites and Retweets
Summary
Chapter 8. Case Study: Mining NASA Metadata
How Data Is Organized at NASA
Wrangling and Tidying the Data
Some Initial Simple Exploration
Word Co-occurrences and Correlations
Networks of Description and Title Words
Networks of Keywords
Calculating tf-idf for the Description Fields
What Is tf-idf for the Description Field Words?
Connecting Description Fields to Keywords
Topic Modeling
Casting to a Document-Term Matrix
Ready for Topic Modeling
Interpreting the Topic Model
Connecting Topic Modeling with Keywords
Summary
Chapter 9. Case Study: Analyzing Usenet Text
Preprocessing
Preprocessing Text
Words in Newsgroups
Finding tf-idf Within Newsgroups
Topic Modeling
Sentiment Analysis
Sentiment Analysis by Word
Sentiment Analysis by Message
N-gram Analysis
Summary
Bibliography
Index
About the Authors
Colophon
Text Mining with R: A Tidy Approach
Julia Silge and David Robinson
Beijing · Boston · Farnham · Sebastopol · Tokyo
Text Mining with R
by Julia Silge and David Robinson

Copyright © 2017 Julia Silge, David Robinson. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Sonia Saruba
Proofreader: Charles Roumeliotis
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

June 2017: First Edition

Revision History for the First Edition
2017-06-08: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491981658 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Text Mining with R, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-98165-8
[LSI]
Preface

If you work in analytics or data science, like we do, you are familiar with the fact that data is being generated all the time at ever faster rates. (You may even be a little weary of people pontificating about this fact.) Analysts are often trained to handle tabular or rectangular data that is mostly numeric, but much of the data proliferating today is unstructured and text-heavy. Many of us who work in analytical fields are not trained in even simple interpretation of natural language.

We developed the tidytext (Silge and Robinson 2016) R package because we were familiar with many methods for data wrangling and visualization, but couldn’t easily apply these same methods to text. We found that using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Treating text as data frames of individual words allows us to manipulate, summarize, and visualize the characteristics of text easily, and integrate natural language processing into effective workflows we were already using.

This book serves as an introduction to text mining using the tidytext package and other tidy tools in R. The functions provided by the tidytext package are relatively simple; what is important are the possible applications. Thus, this book provides compelling examples of real text mining problems.

Outline

We start by introducing the tidy text format, and some of the ways dplyr, tidyr, and tidytext allow informative analyses of this structure:

• Chapter 1 outlines the tidy text format and the unnest_tokens() function. It also introduces the gutenbergr and janeaustenr packages, which provide useful literary text datasets that we’ll use throughout this book.
• Chapter 2 shows how to perform sentiment analysis on a tidy text dataset using the sentiments dataset from tidytext and inner_join() from dplyr.
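To make the two ideas in this outline concrete, here is a minimal sketch in R, assuming the dplyr and tidytext packages are installed. The two input sentences and the three-word `toy_lexicon` are made up for illustration; the lexicon is a hypothetical stand-in for a real sentiment dictionary such as the sentiments dataset, not the book's own example.

```r
library(dplyr)
library(tidytext)

# Toy input: two short "documents" (invented for this sketch)
docs <- tibble(
  doc  = c(1, 2),
  text = c("The weather is lovely today", "A gloomy, miserable morning")
)

# Tidy text format: one token (here, one word) per row.
# By default unnest_tokens() lowercases words and strips punctuation.
tidy_words <- docs %>%
  unnest_tokens(word, text)

# Hypothetical mini lexicon standing in for a real sentiment dictionary
toy_lexicon <- tibble(
  word      = c("lovely", "gloomy", "miserable"),
  sentiment = c("positive", "negative", "negative")
)

# inner_join() keeps only the words present in the lexicon;
# count() then tallies sentiment labels per document
sentiment_counts <- tidy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(doc, sentiment)

print(sentiment_counts)
```

Because the tokens live in an ordinary data frame, the sentiment step is just a join followed by a count, which is the kind of reuse of existing dplyr workflows that the preface describes.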