Text Analysis with R for Students of Literature.pdf

Published: 2022-06-24 | Uploaded by: admin | Category: Manuals | File size: 2.16M | Format: pdf

Preface

Acknowledgments

Contributors

Contents

Part I Microanalysis

1 R Basics

1.1 Introduction

1.2 R and RStudio

1.3 Download and Install R

1.4 Download and Install RStudio

1.5 Download the Supporting Materials

1.6 RStudio

1.7 Let's Get Started

Practice

2 First Foray into Text Analysis with R

2.1 Loading the First Text File

2.2 Separate Content from Metadata

2.3 Reprocessing the Content

2.4 Beginning the Analysis

Practice

3 Accessing and Comparing Word Frequency Data

3.1 Accessing Word Data

3.2 Recycling

Practice

4 Token Distribution Analysis

4.1 Dispersion Plots

4.2 Searching with grep

4.2.1 Cleaning the Workspace

4.2.2 Identify the chapter break positions in the vector using the grep function

4.3 The for Loop and if Conditional

4.4 Accessing and Processing List Items

4.4.1 rbind

4.4.2 More Recycling

4.4.3 apply

4.4.4 do.call (Do Dot Call)

4.4.5 cbind

Practice

5 Correlation

5.1 Introduction

5.2 Correlation Analysis

5.3 A Word About Data Frames

5.4 Testing Correlation with Randomization

Practice

Part II Mesoanalysis

6 Measures of Lexical Variety

6.1 Lexical Variety and the Type-Token Ratio

6.2 Mean Word Frequency

6.3 Extracting Word Usage Means

6.4 Ranking the Values

6.5 Calculating the TTR Inside lapply

6.6 A Further Use of Correlation

Practice

7 Hapax Richness

7.1 Introduction

7.2 sapply

7.3 A Mini-Conditional Function

Practice

8 Do It KWIC

8.1 Introduction

8.2 Custom Functions

8.3 A Word List Making Function

8.4 Finding Words and Their Neighbors

Practice

9 Do It KWIC (Better)

9.1 Getting Organized

9.2 Separating Functions for Reuse

9.3 User Interaction

9.4 readline

9.5 Building a Better KWIC Function

9.6 Fixing a Problem

Practice

10 Text Quality, Text Variety, and Parsing XML

10.1 Introduction

10.2 The Text Encoding Initiative (TEI)

10.3 Parsing XML with R

10.4 Installing R Packages

10.5 Loading and Using the XML Package

10.6 Metadata

Practice

Part III Macroanalysis

11 Clustering

11.1 Introduction

11.2 Review

11.3 Some Oddities in R

11.4 Corpus Ingestion

11.5 Another Function

11.6 Unsupervised Clustering and the Euclidean Metric

11.7 Converting an R List into a Data Matrix

11.8 Preparing Data for Clustering

11.9 Clustering Data

Practice

12 Classification

12.1 Introduction

12.2 A Small Authorship Experiment

12.3 Text Segmentation

12.4 Converting an R List into a Matrix

12.5 Organizing the Data

12.6 Cross Tabulation

12.7 Mapping the Data to the Metadata

12.8 Reducing the Feature Set

12.9 Performing the Classification with SVM

Practice

13 Topic Modeling

13.1 Introduction

13.2 R and Topic Modeling

13.3 Text Segmentation and Preparation

13.4 The R mallet Package

13.5 Simple Topic Modeling with a Standard Stop List

13.6 Unpacking the Model

13.7 Topic Visualization

13.8 Topic Coherence and Topic Probability

13.9 Pre-processing with a POS Tagger

Practice

A Variable Scope Example

B The LDA Buffet

C Start up Code

C.1 Chapter 3

C.2 Chapter 4

C.3 Chapter 5

C.4 Chapter 6

C.5 Chapter 7

D R Resources for Further Reading

Practice Exercise Solutions

Index

Quantitative Methods in the Humanities and Social Sciences

Matthew L. Jockers

Text Analysis with R for Students of Literature

Quantitative Methods in the Humanities and Social Sciences

Editorial Board: Thomas DeFanti, Anthony Grafton, Thomas E. Levy, Lev Manovich, Alyn Rockwood

Quantitative Methods in the Humanities and Social Sciences is a book series designed to foster research-based conversation with all parts of the university campus from buildings of ivy-covered stone to technologically savvy walls of glass. Scholarship from international researchers and the esteemed editorial board represents the far-reaching applications of computational analysis, statistical models, computer-based programs, and other quantitative methods. Methods are integrated in a dialogue that is sensitive to the broader context of humanistic study and social science research. Scholars, including among others historians, archaeologists, classicists and linguists, promote this interdisciplinary approach. These texts teach new methodological approaches for contemporary research. Each volume exposes readers to a particular research method. Researchers and students then benefit from exposure to subtleties of the larger project or corpus of work in which the quantitative methods come to fruition.

For further volumes: http://www.springer.com/series/11748

Matthew L. Jockers
Department of English
University of Nebraska
Lincoln, Nebraska, USA

ISBN 978-3-319-03163-7
ISBN 978-3-319-03164-4 (eBook)
DOI 10.1007/978-3-319-03164-4
Springer Cham Heidelberg New York Dordrecht London
Library of Congress Control Number: 2014935151

© Springer International Publishing Switzerland 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

For my mother, who prefers to follow the instructions

Preface

This book provides an introduction to computational text analysis using the open source programming language R. Unlike other very good books on the use of R for the statistical analysis of linguistic data [1] or for conducting quantitative corpus linguistics, [2] this book is meant for students and scholars of literature and then, more generally, for humanists wishing to extend their methodological toolkit to include quantitative and computational approaches to the study of text.

This book is also meant to be short and to the point. R is a complex program that no single textbook can demystify. The focus here is on making the technical palatable and, more importantly, making the technical useful and immediately rewarding! Here I mean rewarding not in the sense of the satisfaction one gets from mastering a programming language, but rewarding specifically in the sense of a quick return on your investment. You will begin analyzing and processing text right away, and each chapter will walk you through a new technique or process.

Computation provides access to information in texts that we simply cannot gather using our traditionally qualitative methods of close reading and human synthesis. The reward comes in being able to access that information at both the micro and macro scale.

If this book succeeds, you will finish it with a foundation, with a broad exposure to core techniques and a basic understanding of the possibilities. The real learning will begin when you put this book aside and build a project of your own. My aim is to give you enough background so that you can begin that project comfortably and so that you’ll be able to continue to learn and educate yourself.

When discussing my work as a computing humanist, I am frequently asked whether the methods and approaches I advocate succeed in bringing new knowledge to our study of literature. My answer is a strong and resounding yes. At the same time, that strong yes must be qualified a bit; not everything that text analysis reveals is a breakthrough discovery. A good deal of computational work is specifically aimed

[1] Baayen, H. A. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge UP, 2008.

[2] Gries, Stefan Th. Quantitative Corpus Linguistics with R: A Practical Introduction. New York: Routledge, 2009.
