Weka Companion Textbook - Data_Mining.pdf

Published: 2022-06-20; Uploaded by: admin; Category: Manuals; File size: 6.94M; Format: pdf

Page 1 of 665

Front cover

Data Mining: Practical Machine Learning Tools and Techniques

Table of contents

List of Figures

List of Tables

Preface

Updated and revised content

Acknowledgments

About the Authors

PART I: Introduction to Data Mining

Chapter 1: What’s It All About?

Data mining and machine learning

Simple examples: the weather and other problems

Fielded applications

Machine learning and statistics

Generalization as search

Data mining and ethics

Further reading

Chapter 2: Input: Concepts, Instances, and Attributes

What’s a concept?

What’s in an example?

What’s in an attribute?

Preparing the input

Further reading

Chapter 3: Output: Knowledge Representation

Tables

Linear models

Trees

Rules

Instance-based representation

Clusters

Further reading

Chapter 4: Algorithms: The Basic Methods

Inferring rudimentary rules

Statistical modeling

Divide-and-conquer: constructing decision trees

Covering algorithms: constructing rules

Mining association rules

Linear models

Instance-based learning

Clustering

Multi-instance learning

Further reading

Weka implementations

Chapter 5: Credibility: Evaluating What’s Been Learned

Training and testing

Predicting performance

Cross-validation

Other estimates

Comparing data mining schemes

Predicting probabilities

Counting the cost

Evaluating numeric prediction

Minimum description length principle

Applying the MDL principle to clustering

Further reading

PART II: Advanced Data Mining

Chapter 6: Implementations: Real Machine Learning Schemes

Decision trees

Classification rules

Association rules

Extending linear models

Instance-based learning

Numeric prediction with local linear models

Bayesian networks

Clustering

Semisupervised learning

Multi-instance learning

Weka implementations

Chapter 7: Data Transformations

Attribute selection

Discretizing numeric attributes

Projections

Sampling

Cleansing

Transforming multiple classes to binary ones

Calibrating class probabilities

Further reading

Weka implementations

Chapter 8: Ensemble Learning

Combining multiple models

Bagging

Randomization

Boosting

Additive regression

Interpretable ensembles

Stacking

Further reading

Weka implementations

Chapter 9: Moving on: Applications and Beyond

Applying data mining

Learning from massive datasets

Data stream learning

Incorporating domain knowledge

Text mining

Web mining

Adversarial situations

Ubiquitous data mining

Further reading

PART III: The Weka Data Mining Workbench

Chapter 10: Introduction to Weka

What’s in Weka?

How do you use it?

What else can you do?

How do you get it?

Chapter 11: The Explorer

Getting started

Exploring the Explorer

Filtering algorithms

Learning algorithms

Metalearning algorithms

Clustering algorithms

Association-rule learners

Attribute selection

Chapter 12: The Knowledge Flow Interface

Getting started

Components

Configuring and connecting the components

Incremental learning

Chapter 13: The Experimenter

Getting started

Simple setup

Advanced setup

The Analyze panel

Distributing processing over several machines

Chapter 14: The Command-Line Interface

Getting started

The structure of Weka

Command-line options

Chapter 15: Embedded Machine Learning

A simple data mining application

Chapter 16: Writing New Learning Schemes

An example classifier

Conventions for implementing classifiers

Chapter 17: Tutorial Exercises for the Weka Explorer

Introduction to the Explorer interface

Nearest-neighbor learning and decision trees

Classification boundaries

Preprocessing and parameter tuning

Document classification

Mining association rules

References

Index

Data Mining Third Edition

Data Mining
Practical Machine Learning Tools and Techniques
Third Edition

Ian H. Witten
Eibe Frank
Mark A. Hall

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann Publishers is an imprint of Elsevier

Morgan Kaufmann Publishers is an imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA

This book is printed on acid-free paper.

Copyright © 2011 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data

Witten, I. H. (Ian H.)
Data mining : practical machine learning tools and techniques. -- 3rd ed. / Ian H. Witten, Eibe Frank, Mark A. Hall.
p. cm. -- (The Morgan Kaufmann series in data management systems)
ISBN 978-0-12-374856-0 (pbk.)
1. Data mining. I. Hall, Mark A. II. Title.
QA76.9.D343W58 2011
006.3′12--dc22
2010039827

British Library Cataloguing-in-Publication Data

A catalogue record for this book is available from the British Library.

For information on all Morgan Kaufmann publications, visit our website at www.mkp.com or www.elsevierdirect.com

Printed in the United States
11 12 13 14 15 10 9 8 7 6 5 4 3 2 1

Working together to grow libraries in developing countries
www.elsevier.com | www.bookaid.org | www.sabre.org

Contents

List of Figures
List of Tables
Preface
  Updated and Revised Content
    Second Edition
    Third Edition
Acknowledgments
About the Authors

PART I: Introduction to Data Mining

Chapter 1: What’s It All About?
1.1 Data Mining and Machine Learning
  Describing Structural Patterns
  Machine Learning
  Data Mining
1.2 Simple Examples: The Weather Problem and Others
  The Weather Problem
  Contact Lenses: An Idealized Problem
  Irises: A Classic Numeric Dataset
  CPU Performance: Introducing Numeric Prediction
  Labor Negotiations: A More Realistic Example
  Soybean Classification: A Classic Machine Learning Success
1.3 Fielded Applications
  Web Mining
  Decisions Involving Judgment
  Screening Images
  Load Forecasting
  Diagnosis
  Marketing and Sales
  Other Applications
1.4 Machine Learning and Statistics
1.5 Generalization as Search
1.6 Data Mining and Ethics
  Reidentification
  Using Personal Information
  Wider Issues
1.7 Further Reading

Chapter 2: Input: Concepts, Instances, and Attributes
2.1 What’s a Concept?
2.2 What’s in an Example?
  Relations
  Other Example Types
2.3 What’s in an Attribute?
2.4 Preparing the Input
  Gathering the Data Together
  ARFF Format
  Sparse Data
  Attribute Types
  Missing Values
  Inaccurate Values
  Getting to Know Your Data
2.5 Further Reading

Chapter 3: Output: Knowledge Representation
3.1 Tables
3.2 Linear Models
3.3 Trees
3.4 Rules
  Classification Rules
  Association Rules
  Rules with Exceptions
  More Expressive Rules
3.5 Instance-Based Representation
3.6 Clusters
3.7 Further Reading

Chapter 4: Algorithms: The Basic Methods
4.1 Inferring Rudimentary Rules
  Missing Values and Numeric Attributes
  Discussion
4.2 Statistical Modeling
  Missing Values and Numeric Attributes
  Naïve Bayes for Document Classification
  Discussion
4.3 Divide-and-Conquer: Constructing Decision Trees
  Calculating Information
  Highly Branching Attributes
  Discussion
4.4 Covering Algorithms: Constructing Rules
  Rules versus Trees
  A Simple Covering Algorithm
  Rules versus Decision Lists
4.5 Mining Association Rules
  Item Sets
  Association Rules
  Generating Rules Efficiently
  Discussion
4.6 Linear Models
  Numeric Prediction: Linear Regression
  Linear Classification: Logistic Regression
  Linear Classification Using the Perceptron
  Linear Classification Using Winnow
4.7 Instance-Based Learning
  Distance Function
  Finding Nearest Neighbors Efficiently
  Discussion
4.8 Clustering
  Iterative Distance-Based Clustering
  Faster Distance Calculations
  Discussion
4.9 Multi-Instance Learning
  Aggregating the Input
  Aggregating the Output
  Discussion
4.10 Further Reading
4.11 Weka Implementations

Chapter 5: Credibility: Evaluating What’s Been Learned
5.1 Training and Testing
5.2 Predicting Performance
5.3 Cross-Validation
5.4 Other Estimates
  Leave-One-Out Cross-Validation
  The Bootstrap
5.5 Comparing Data Mining Schemes
5.6 Predicting Probabilities
  Quadratic Loss Function
  Informational Loss Function
  Discussion
