Text Mining with R: A Tidy Approach [True PDF].pdf

Published: 2022-05-31 | Posted by: admin | Category: Manuals | File size: 9.74M | Format: pdf | Pages: 193

Cover

Table of Contents

Preface

Outline

Topics This Book Does Not Cover

About This Book

Conventions Used in This Book

Using Code Examples

O’Reilly Safari

How to Contact Us

Acknowledgements

Chapter 1. The Tidy Text Format

Contrasting Tidy Text with Other Data Structures

The unnest_tokens Function

Tidying the Works of Jane Austen

The gutenbergr Package

Word Frequencies

Summary

Chapter 2. Sentiment Analysis with Tidy Data

The sentiments Dataset

Sentiment Analysis with Inner Join

Comparing the Three Sentiment Dictionaries

Most Common Positive and Negative Words

Wordclouds

Looking at Units Beyond Just Words

Summary

Chapter 3. Analyzing Word and Document Frequency: tf-idf

Term Frequency in Jane Austen’s Novels

Zipf’s Law

The bind_tf_idf Function

A Corpus of Physics Texts

Summary

Chapter 4. Relationships Between Words: N-grams and Correlations

Tokenizing by N-gram

Counting and Filtering N-grams

Analyzing Bigrams

Using Bigrams to Provide Context in Sentiment Analysis

Visualizing a Network of Bigrams with ggraph

Visualizing Bigrams in Other Texts

Counting and Correlating Pairs of Words with the widyr Package

Counting and Correlating Among Sections

Examining Pairwise Correlation

Summary

Chapter 5. Converting to and from Nontidy Formats

Tidying a Document-Term Matrix

Tidying DocumentTermMatrix Objects

Tidying dfm Objects

Casting Tidy Text Data into a Matrix

Tidying Corpus Objects with Metadata

Example: Mining Financial Articles

Summary

Chapter 6. Topic Modeling

Latent Dirichlet Allocation

Word-Topic Probabilities

Document-Topic Probabilities

Example: The Great Library Heist

LDA on Chapters

Per-Document Classification

By-Word Assignments: augment

Alternative LDA Implementations

Summary

Chapter 7. Case Study: Comparing Twitter Archives

Getting the Data and Distribution of Tweets

Word Frequencies

Comparing Word Usage

Changes in Word Use

Favorites and Retweets

Summary

Chapter 8. Case Study: Mining NASA Metadata

How Data Is Organized at NASA

Wrangling and Tidying the Data

Some Initial Simple Exploration

Word Co-occurrences and Correlations

Networks of Description and Title Words

Networks of Keywords

Calculating tf-idf for the Description Fields

What Is tf-idf for the Description Field Words?

Connecting Description Fields to Keywords

Topic Modeling

Casting to a Document-Term Matrix

Ready for Topic Modeling

Interpreting the Topic Model

Connecting Topic Modeling with Keywords

Summary

Chapter 9. Case Study: Analyzing Usenet Text

Preprocessing

Preprocessing Text

Words in Newsgroups

Finding tf-idf Within Newsgroups

Topic Modeling

Sentiment Analysis

Sentiment Analysis by Word

Sentiment Analysis by Message

N-gram Analysis

Summary

Bibliography

Index

About the Authors

Colophon

Text Mining with R: A Tidy Approach

Julia Silge and David Robinson

Beijing, Boston, Farnham, Sebastopol, Tokyo

Text Mining with R
by Julia Silge and David Robinson

Copyright © 2017 Julia Silge, David Robinson. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Sonia Saruba
Proofreader: Charles Roumeliotis
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

June 2017: First Edition

Revision History for the First Edition
2017-06-08: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491981658 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Text Mining with R, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-98165-8
[LSI]


Preface

If you work in analytics or data science, like we do, you are familiar with the fact that data is being generated all the time at ever faster rates. (You may even be a little weary of people pontificating about this fact.) Analysts are often trained to handle tabular or rectangular data that is mostly numeric, but much of the data proliferating today is unstructured and text-heavy. Many of us who work in analytical fields are not trained in even simple interpretation of natural language.

We developed the tidytext (Silge and Robinson 2016) R package because we were familiar with many methods for data wrangling and visualization, but couldn’t easily apply these same methods to text. We found that using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Treating text as data frames of individual words allows us to manipulate, summarize, and visualize the characteristics of text easily, and integrate natural language processing into effective workflows we were already using.

This book serves as an introduction to text mining using the tidytext package and other tidy tools in R. The functions provided by the tidytext package are relatively simple; what is important are the possible applications. Thus, this book provides compelling examples of real text mining problems.

Outline

We start by introducing the tidy text format, and some of the ways dplyr, tidyr, and tidytext allow informative analyses of this structure:

• Chapter 1 outlines the tidy text format and the unnest_tokens() function. It also introduces the gutenbergr and janeaustenr packages, which provide useful literary text datasets that we’ll use throughout this book.

• Chapter 2 shows how to perform sentiment analysis on a tidy text dataset using the sentiments dataset from tidytext and inner_join() from dplyr.
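To make the idea of "text as data frames of individual words" concrete, here is a minimal sketch of the two steps the outline names: tokenizing with unnest_tokens() and attaching sentiment with inner_join(). It is an illustration, not code taken from the book. It assumes the dplyr and tidytext packages are installed; the two sample sentences and the object names (text_df, tidy_words, word_counts, word_sentiment) are made up; and it reaches a lexicon through get_sentiments("bing"), which is one possible choice and may differ from what Chapter 2 does.

```r
library(dplyr)
library(tidytext)

# Two made-up lines of text (illustrative only, not from the book)
text_df <- tibble(
  line = 1:2,
  text = c("The garden was lovely and bright",
           "The storm was terrible and the night was cold")
)

# One row per word; by default unnest_tokens() lowercases the text
# and strips punctuation
tidy_words <- text_df %>%
  unnest_tokens(word, text)

# Ordinary dplyr verbs now apply, for example counting word frequencies
word_counts <- tidy_words %>%
  count(word, sort = TRUE)

# Sentiment by inner join: only words present in the lexicon are kept
# (assumption: the Bing lexicon, accessed via get_sentiments("bing"))
word_sentiment <- tidy_words %>%
  inner_join(get_sentiments("bing"), by = "word")

print(word_counts)
print(word_sentiment)
```

Because the result of each step is an ordinary data frame, the same pipeline extends naturally to grouping, summarizing, and plotting with other tidy tools.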
