Weka companion textbook: Data_Mining.pdf

Front cover
Data Mining: Practical Machine Learning Tools and Techniques
Copyright page
Table of contents
List of Figures
List of Tables
Preface
Updated and revised content
Acknowledgments
About the Authors
PART I: Introduction to Data Mining
Chapter 1: What’s It All About?
Data mining and machine learning
Simple examples: the weather and other problems
Fielded applications
Machine learning and statistics
Generalization as search
Data mining and ethics
Further reading
Chapter 2: Input: Concepts, Instances, and Attributes
What’s a concept?
What’s in an example?
What’s in an attribute?
Preparing the input
Further reading
Chapter 3: Output: Knowledge Representation
Tables
Linear models
Trees
Rules
Instance-based representation
Clusters
Further reading
Chapter 4: Algorithms: The Basic Methods
Inferring rudimentary rules
Statistical modeling
Divide-and-conquer: constructing decision trees
Covering algorithms: constructing rules
Mining association rules
Linear models
Instance-based learning
Clustering
Multi-instance learning
Further reading
Weka implementations
Chapter 5: Credibility: Evaluating What’s Been Learned
Training and testing
Predicting performance
Cross-validation
Other estimates
Comparing data mining schemes
Predicting probabilities
Counting the cost
Evaluating numeric prediction
Minimum description length principle
Applying the MDL principle to clustering
Further reading
PART II: Advanced Data Mining
Chapter 6: Implementations: Real Machine Learning Schemes
Decision trees
Classification rules
Association rules
Extending linear models
Instance-based learning
Numeric prediction with local linear models
Bayesian networks
Clustering
Semisupervised learning
Multi-instance learning
Weka implementations
Chapter 7: Data Transformations
Attribute selection
Discretizing numeric attributes
Projections
Sampling
Cleansing
Transforming multiple classes to binary ones
Calibrating class probabilities
Further reading
Weka implementations
Chapter 8: Ensemble Learning
Combining multiple models
Bagging
Randomization
Boosting
Additive regression
Interpretable ensembles
Stacking
Further reading
Weka implementations
Chapter 9: Moving on: Applications and Beyond
Applying data mining
Learning from massive datasets
Data stream learning
Incorporating domain knowledge
Text mining
Web mining
Adversarial situations
Ubiquitous data mining
Further reading
PART III: The Weka Data Mining Workbench
Chapter 10: Introduction to Weka
What’s in Weka?
How do you use it?
What else can you do?
How do you get it?
Chapter 11: The Explorer
Getting started
Exploring the explorer
Filtering algorithms
Learning algorithms
Metalearning algorithms
Clustering algorithms
Association-rule learners
Attribute selection
Chapter 12: The Knowledge Flow Interface
Getting started
Components
Configuring and connecting the components
Incremental learning
Chapter 13: The Experimenter
Getting started
Simple setup
Advanced setup
The analyze panel
Distributing processing over several machines
Chapter 14: The Command-Line Interface
Getting started
The structure of Weka
Command-line options
Chapter 15: Embedded Machine Learning
A simple data mining application
Chapter 16: Writing New Learning Schemes
An example classifier
Conventions for implementing classifiers
Chapter 17: Tutorial Exercises for the Weka Explorer
Introduction to the explorer interface
Nearest-neighbor learning and decision trees
Classification boundaries
Preprocessing and parameter tuning
Document classification
Mining association rules
References
Index
Data Mining Third Edition
Data Mining
Practical Machine Learning Tools and Techniques
Third Edition

Ian H. Witten
Eibe Frank
Mark A. Hall

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann Publishers is an imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA

This book is printed on acid-free paper.

Copyright © 2011 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data

Witten, I. H. (Ian H.)
Data mining : practical machine learning tools and techniques. -- 3rd ed. / Ian H. Witten, Eibe Frank, Mark A. Hall.
p. cm. -- (The Morgan Kaufmann series in data management systems)
ISBN 978-0-12-374856-0 (pbk.)
1. Data mining. I. Hall, Mark A. II. Title.
QA76.9.D343W58 2011
006.3′12--dc22
2010039827

British Library Cataloguing-in-Publication Data

A catalogue record for this book is available from the British Library.

For information on all Morgan Kaufmann publications, visit our website at www.mkp.com or www.elsevierdirect.com

Printed in the United States
11 12 13 14 15   10 9 8 7 6 5 4 3 2 1

Working together to grow libraries in developing countries
www.elsevier.com | www.bookaid.org | www.sabre.org
Contents

LIST OF FIGURES
LIST OF TABLES
PREFACE
    Updated and Revised Content
        Second Edition
        Third Edition
ACKNOWLEDGMENTS
ABOUT THE AUTHORS

PART I  INTRODUCTION TO DATA MINING

CHAPTER 1  What’s It All About?
    1.1 Data Mining and Machine Learning
        Describing Structural Patterns
        Machine Learning
        Data Mining
    1.2 Simple Examples: The Weather Problem and Others
        The Weather Problem
        Contact Lenses: An Idealized Problem
        Irises: A Classic Numeric Dataset
        CPU Performance: Introducing Numeric Prediction
        Labor Negotiations: A More Realistic Example
        Soybean Classification: A Classic Machine Learning Success
    1.3 Fielded Applications
        Web Mining
        Decisions Involving Judgment
        Screening Images
        Load Forecasting
        Diagnosis
        Marketing and Sales
        Other Applications
    1.4 Machine Learning and Statistics
    1.5 Generalization as Search
    1.6 Data Mining and Ethics
        Reidentification
        Using Personal Information
        Wider Issues
    1.7 Further Reading

CHAPTER 2  Input: Concepts, Instances, and Attributes
    2.1 What’s a Concept?
    2.2 What’s in an Example?
        Relations
        Other Example Types
    2.3 What’s in an Attribute?
    2.4 Preparing the Input
        Gathering the Data Together
        ARFF Format
        Sparse Data
        Attribute Types
        Missing Values
        Inaccurate Values
        Getting to Know Your Data
    2.5 Further Reading

CHAPTER 3  Output: Knowledge Representation
    3.1 Tables
    3.2 Linear Models
    3.3 Trees
    3.4 Rules
        Classification Rules
        Association Rules
        Rules with Exceptions
        More Expressive Rules
    3.5 Instance-Based Representation
    3.6 Clusters
    3.7 Further Reading

CHAPTER 4  Algorithms: The Basic Methods
    4.1 Inferring Rudimentary Rules
        Missing Values and Numeric Attributes
        Discussion
    4.2 Statistical Modeling
        Missing Values and Numeric Attributes
        Naïve Bayes for Document Classification
        Discussion
    4.3 Divide-and-Conquer: Constructing Decision Trees
        Calculating Information
        Highly Branching Attributes
        Discussion
    4.4 Covering Algorithms: Constructing Rules
        Rules versus Trees
        A Simple Covering Algorithm
        Rules versus Decision Lists
    4.5 Mining Association Rules
        Item Sets
        Association Rules
        Generating Rules Efficiently
        Discussion
    4.6 Linear Models
        Numeric Prediction: Linear Regression
        Linear Classification: Logistic Regression
        Linear Classification Using the Perceptron
        Linear Classification Using Winnow
    4.7 Instance-Based Learning
        Distance Function
        Finding Nearest Neighbors Efficiently
        Discussion
    4.8 Clustering
        Iterative Distance-Based Clustering
        Faster Distance Calculations
        Discussion
    4.9 Multi-Instance Learning
        Aggregating the Input
        Aggregating the Output
        Discussion
    4.10 Further Reading
    4.11 Weka Implementations

CHAPTER 5  Credibility: Evaluating What’s Been Learned
    5.1 Training and Testing
    5.2 Predicting Performance
    5.3 Cross-Validation
    5.4 Other Estimates
        Leave-One-Out Cross-Validation
        The Bootstrap
    5.5 Comparing Data Mining Schemes
    5.6 Predicting Probabilities
        Quadratic Loss Function
        Informational Loss Function
        Discussion