Text Analysis with R for Students of Literature.pdf

Preface
Acknowledgments
Contributors
Contents
Part I Microanalysis
1 R Basics
1.1 Introduction
1.2 R and RStudio
1.3 Download and Install R
1.4 Download and Install RStudio
1.5 Download the Supporting Materials
1.6 RStudio
1.7 Let's Get Started
Practice
2 First Foray into Text Analysis with R
2.1 Loading the First Text File
2.2 Separate Content from Metadata
2.3 Reprocessing the Content
2.4 Beginning the Analysis
Practice
3 Accessing and Comparing Word Frequency Data
3.1 Accessing Word Data
3.2 Recycling
Practice
4 Token Distribution Analysis
4.1 Dispersion Plots
4.2 Searching with grep
4.2.1 Cleaning the Workspace
4.2.2 Identify the chapter break positions in the vector using the grep function
4.3 The for Loop and if Conditional
4.4 Accessing and Processing List Items
4.4.1 rbind
4.4.2 More Recycling
4.4.3 apply
4.4.4 do.call (Do Dot Call)
4.4.5 cbind
Practice
5 Correlation
5.1 Introduction
5.2 Correlation Analysis
5.3 A Word About Data Frames
5.4 Testing Correlation with Randomization
Practice
Part II Mesoanalysis
6 Measures of Lexical Variety
6.1 Lexical Variety and the Type-Token Ratio
6.2 Mean Word Frequency
6.3 Extracting Word Usage Means
6.4 Ranking the Values
6.5 Calculating the TTR Inside lapply
6.6 A Further Use of Correlation
Practice
7 Hapax Richness
7.1 Introduction
7.2 sapply
7.3 A Mini-Conditional Function
Practice
8 Do It KWIC
8.1 Introduction
8.2 Custom Functions
8.3 A Word List Making Function
8.4 Finding Words and Their Neighbors
Practice
9 Do It KWIC (Better)
9.1 Getting Organized
9.2 Separating Functions for Reuse
9.3 User Interaction
9.4 readline
9.5 Building a Better KWIC Function
9.6 Fixing a Problem
Practice
10 Text Quality, Text Variety, and Parsing XML
10.1 Introduction
10.2 The Text Encoding Initiative (TEI)
10.3 Parsing XML with R
10.4 Installing R Packages
10.5 Loading and Using the XML Package
10.6 Metadata
Practice
Part III Macroanalysis
11 Clustering
11.1 Introduction
11.2 Review
11.3 Some Oddities in R
11.4 Corpus Ingestion
11.5 Another Function
11.6 Unsupervised Clustering and the Euclidean Metric
11.7 Converting an R List into a Data Matrix
11.8 Preparing Data for Clustering
11.9 Clustering Data
Practice
12 Classification
12.1 Introduction
12.2 A Small Authorship Experiment
12.3 Text Segmentation
12.4 Converting an R List into a Matrix
12.5 Organizing the Data
12.6 Cross Tabulation
12.7 Mapping the Data to the Metadata
12.8 Reducing the Feature Set
12.9 Performing the Classification with SVM
Practice
13 Topic Modeling
13.1 Introduction
13.2 R and Topic Modeling
13.3 Text Segmentation and Preparation
13.4 The R mallet Package
13.5 Simple Topic Modeling with a Standard Stop List
13.6 Unpacking the Model
13.7 Topic Visualization
13.8 Topic Coherence and Topic Probability
13.9 Pre-processing with a POS Tagger
Practice
A Variable Scope Example
B The LDA Buffet
C Start up Code
C.1 Chapter 3
C.2 Chapter 4
C.3 Chapter 5
C.4 Chapter 6
C.5 Chapter 7
D R Resources for Further Reading
Practice Exercise Solutions
Index
Quantitative Methods in the Humanities and Social Sciences Matthew L. Jockers Text Analysis with R for Students of Literature
Quantitative Methods in the Humanities and Social Sciences

Editorial Board
Thomas DeFanti, Anthony Grafton, Thomas E. Levy, Lev Manovich, Alyn Rockwood

Quantitative Methods in the Humanities and Social Sciences is a book series designed to foster research-based conversation with all parts of the university campus from buildings of ivy-covered stone to technologically savvy walls of glass. Scholarship from international researchers and the esteemed editorial board represents the far-reaching applications of computational analysis, statistical models, computer-based programs, and other quantitative methods. Methods are integrated in a dialogue that is sensitive to the broader context of humanistic study and social science research. Scholars, including among others historians, archaeologists, classicists and linguists, promote this interdisciplinary approach. These texts teach new methodological approaches for contemporary research. Each volume exposes readers to a particular research method. Researchers and students then benefit from exposure to subtleties of the larger project or corpus of work in which the quantitative methods come to fruition.

For further volumes: http://www.springer.com/series/11748
Matthew L. Jockers
Department of English
University of Nebraska
Lincoln, Nebraska, USA

ISBN 978-3-319-03163-7
ISBN 978-3-319-03164-4 (eBook)
DOI 10.1007/978-3-319-03164-4
Springer Cham Heidelberg New York Dordrecht London
Library of Congress Control Number: 2014935151

© Springer International Publishing Switzerland 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
For my mother, who prefers to follow the instructions
Preface

This book provides an introduction to computational text analysis using the open source programming language R. Unlike other very good books on the use of R for the statistical analysis of linguistic data¹ or for conducting quantitative corpus linguistics,² this book is meant for students and scholars of literature and then, more generally, for humanists wishing to extend their methodological toolkit to include quantitative and computational approaches to the study of text.

This book is also meant to be short and to the point. R is a complex program that no single textbook can demystify. The focus here is on making the technical palatable and, more importantly, making the technical useful and immediately rewarding! Here I mean rewarding not in the sense of the satisfaction one gets from mastering a programming language, but rewarding specifically in the sense of a quick return on your investment. You will begin analyzing and processing text right away, and each chapter will walk you through a new technique or process.

Computation provides access to information in texts that we simply cannot gather using our traditionally qualitative methods of close reading and human synthesis. The reward comes in being able to access that information at both the micro and macro scale.

If this book succeeds, you will finish it with a foundation, with a broad exposure to core techniques and a basic understanding of the possibilities. The real learning will begin when you put this book aside and build a project of your own. My aim is to give you enough background so that you can begin that project comfortably and so that you’ll be able to continue to learn and educate yourself.

When discussing my work as a computing humanist, I am frequently asked whether the methods and approaches I advocate succeed in bringing new knowledge to our study of literature. My answer is a strong and resounding yes. At the same time, that strong yes must be qualified a bit; not everything that text analysis reveals is a breakthrough discovery. A good deal of computational work is specifically aimed

¹ Baayen, H. A. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge UP, 2008.
² Gries, Stefan Th. Quantitative Corpus Linguistics with R: A Practical Introduction. New York: Routledge, 2009.