Mining Text Data
Charu C. Aggarwal • ChengXiang Zhai
Editors
Library of Congress Control Number: 2012930923
Editors
Charu C. Aggarwal
IBM T.J. Watson Research Center
Yorktown Heights, NY, USA
charu@us.ibm.com

ChengXiang Zhai
University of Illinois at Urbana-Champaign
Urbana, IL, USA
czhai@cs.uiuc.edu

ISBN 978-1-4614-3222-7
e-ISBN 978-1-4614-3223-4
DOI 10.1007/978-1-4614-3223-4
Springer New York Dordrecht Heidelberg London

© Springer Science+Business Media, LLC 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
Contents

1  An Introduction to Text Mining  1
Charu C. Aggarwal and ChengXiang Zhai
  1. Introduction  1
  2. Algorithms for Text Mining  4
  3. Future Directions  8
  References  10

2  Information Extraction from Text  11
Jing Jiang
  1. Introduction  11
  2. Named Entity Recognition  15
    2.1 Rule-based Approach  16
    2.2 Statistical Learning Approach  17
  3. Relation Extraction  22
    3.1 Feature-based Classification  23
    3.2 Kernel Methods  26
    3.3 Weakly Supervised Learning Methods  29
  4. Unsupervised Information Extraction  30
    4.1 Relation Discovery and Template Induction  31
    4.2 Open Information Extraction  32
  5. Evaluation  33
  6. Conclusions and Summary  34
  References  35

3  A Survey of Text Summarization Techniques  43
Ani Nenkova and Kathleen McKeown
  1. How do Extractive Summarizers Work?  44
  2. Topic Representation Approaches  46
    2.1 Topic Words  46
    2.2 Frequency-driven Approaches  48
    2.3 Latent Semantic Analysis  52
    2.4 Bayesian Topic Models  53
    2.5 Sentence Clustering and Domain-dependent Topics  55
  3. Influence of Context  56
    3.1 Web Summarization  57
    3.2 Summarization of Scientific Articles  58
    3.3 Query-focused Summarization  58
    3.4 Email Summarization  59
  4. Indicator Representations and Machine Learning for Summarization  60
    4.1 Graph Methods for Sentence Importance  60
    4.2 Machine Learning for Summarization  62
  5. Selecting Summary Sentences  64
    5.1 Greedy Approaches: Maximal Marginal Relevance  64
    5.2 Global Summary Selection  65
  6. Conclusion  66
  References  66

4  A Survey of Text Clustering Algorithms  77
Charu C. Aggarwal and ChengXiang Zhai
  1. Introduction  77
  2. Feature Selection and Transformation Methods for Text Clustering  81
    2.1 Feature Selection Methods  81
    2.2 LSI-based Methods  84
    2.3 Non-negative Matrix Factorization  86
  3. Distance-based Clustering Algorithms  89
    3.1 Agglomerative and Hierarchical Clustering Algorithms  90
    3.2 Distance-based Partitioning Algorithms  92
    3.3 A Hybrid Approach: The Scatter-Gather Method  94
  4. Word and Phrase-based Clustering  99
    4.1 Clustering with Frequent Word Patterns  100
    4.2 Leveraging Word Clusters for Document Clusters  102
    4.3 Co-clustering Words and Documents  103
    4.4 Clustering with Frequent Phrases  105
  5. Probabilistic Document Clustering and Topic Models  107
  6. Online Clustering with Text Streams  110
  7. Clustering Text in Networks  115
  8. Semi-Supervised Clustering  118
  9. Conclusions and Summary  120
  References  121

5  Dimensionality Reduction and Topic Modeling  129
Steven P. Crain, Ke Zhou, Shuang-Hong Yang and Hongyuan Zha
  1. Introduction  130
    1.1 The Relationship Between Clustering, Dimension Reduction and Topic Modeling  131
    1.2 Notation and Concepts  132
  2. Latent Semantic Indexing  133
    2.1 The Procedure of Latent Semantic Indexing  134
    2.2 Implementation Issues  135
    2.3 Analysis  137
  3. Topic Models and Dimension Reduction  139
    3.1 Probabilistic Latent Semantic Indexing  140
    3.2 Latent Dirichlet Allocation  142
  4. Interpretation and Evaluation  148
    4.1 Interpretation  148
    4.2 Evaluation  149
    4.3 Parameter Selection  150
    4.4 Dimension Reduction  150
  5. Beyond Latent Dirichlet Allocation  151
    5.1 Scalability  151
    5.2 Dynamic Data  151
    5.3 Networked Data  152
    5.4 Adapting Topic Models to Applications  154
  6. Conclusion  155
  References  156

6  A Survey of Text Classification Algorithms  163
Charu C. Aggarwal and ChengXiang Zhai
  1. Introduction  163
  2. Feature Selection for Text Classification  167
    2.1 Gini Index  168
    2.2 Information Gain  169
    2.3 Mutual Information  169
    2.4 χ²-Statistic  170
    2.5 Feature Transformation Methods: Supervised LSI  171
    2.6 Supervised Clustering for Dimensionality Reduction  172
    2.7 Linear Discriminant Analysis  173
    2.8 Generalized Singular Value Decomposition  175
    2.9 Interaction of Feature Selection with Classification  175
  3. Decision Tree Classifiers  176
  4. Rule-based Classifiers  178
  5. Probabilistic and Naive Bayes Classifiers  181
    5.1 Bernoulli Multivariate Model  183
    5.2 Multinomial Distribution  188
    5.3 Mixture Modeling for Text Classification  190
  6. Linear Classifiers  193
    6.1 SVM Classifiers  194
    6.2 Regression-Based Classifiers  196
    6.3 Neural Network Classifiers  197
    6.4 Some Observations about Linear Classifiers  199
  7. Proximity-based Classifiers  200
  8. Classification of Linked and Web Data  203
  9. Meta-Algorithms for Text Classification  209
    9.1 Classifier Ensemble Learning  209
    9.2 Data Centered Methods: Boosting and Bagging  210
    9.3 Optimizing Specific Measures of Accuracy  211
  10. Conclusions and Summary  213
  References  213

7  Transfer Learning for Text Mining  223
Weike Pan, Erheng Zhong and Qiang Yang
  1. Introduction  224
  2. Transfer Learning in Text Classification  225
    2.1 Cross Domain Text Classification  225