
Software for Data Analysis - Programming with R.pdf

Statistics and Computing Series Editors: J. Chambers D. Hand W. Härdle
Statistics and Computing

Brusco/Stahl: Branch and Bound Applications in Combinatorial Data Analysis
Chambers: Software for Data Analysis: Programming with R
Dalgaard: Introductory Statistics with R
Gentle: Elements of Computational Statistics
Gentle: Numerical Linear Algebra for Applications in Statistics
Gentle: Random Number Generation and Monte Carlo Methods, 2nd ed.
Härdle/Klinke/Turlach: XploRe: An Interactive Statistical Computing Environment
Hörmann/Leydold/Derflinger: Automatic Nonuniform Random Variate Generation
Krause/Olson: The Basics of S-PLUS, 4th ed.
Lange: Numerical Analysis for Statisticians
Lemmon/Schafer: Developing Statistical Software in Fortran 95
Loader: Local Regression and Likelihood
Ó Ruanaidh/Fitzgerald: Numerical Bayesian Methods Applied to Signal Processing
Pannatier: VARIOWIN: Software for Spatial Data Analysis in 2D
Pinheiro/Bates: Mixed-Effects Models in S and S-PLUS
Unwin/Theus/Hofmann: Graphics of Large Datasets: Visualizing a Million
Venables/Ripley: Modern Applied Statistics with S, 4th ed.
Venables/Ripley: S Programming
Wilkinson: The Grammar of Graphics, 2nd ed.
John M. Chambers Software for Data Analysis Programming with R
John Chambers
Department of Statistics–Sequoia Hall
390 Serra Mall
Stanford University
Stanford, CA 94305-4065
USA
jmc@r-project.org

Series Editors:

John Chambers
Department of Statistics–Sequoia Hall
390 Serra Mall
Stanford University
Stanford, CA 94305-4065
USA

W. Härdle
Institut für Statistik und Ökonometrie
Humboldt-Universität zu Berlin
Spandauer Str. 1
D-10178 Berlin
Germany

David Hand
Department of Mathematics
South Kensington Campus
Imperial College London
London, SW7 2AZ
United Kingdom

Java™ is a trademark or registered trademark of Sun Microsystems, Inc. in the United States and other countries. Mac OS® X - Operating System software - is a registered trademark of Apple Computer, Inc. MATLAB® is a trademark of The MathWorks, Inc. MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. S-PLUS® is a registered trademark of Insightful Corporation. UNIX® is a registered trademark of The Open Group. Windows® and/or other Microsoft products referenced herein are either registered trademarks or trademarks of Microsoft Corporation in the U.S. and/or other countries. Star Trek and related marks are trademarks of CBS Studios, Inc.

ISBN: 978-0-387-75935-7
e-ISBN: 978-0-387-75936-4
DOI: 10.1007/978-0-387-75936-4
Library of Congress Control Number: 2008922937

© 2008 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper.

9 8 7 6 5 4 3 2 1

springer.com
Preface

This is a book about Software for Data Analysis: using computer software to extract information from some source of data by organizing, visualizing, modeling, or performing any other relevant computation on the data. We all seem to be swimming in oceans of data in the modern world, and tasks ranging from scientific research to managing a business require us to extract meaningful information from the data using computer software. This book is aimed at those who need to select, modify, and create software to explore data. In a word, programming.

Our programming will center on the R system. R is an open-source software project widely used for computing with data and giving users a huge base of techniques. Hence, Programming with R.

R provides a general language for interactive computations, supported by techniques for data organization, graphics, numerical computations, model-fitting, simulation, and many other tasks. The core system itself is greatly supplemented and enriched by a huge and rapidly growing collection of software packages built on R and, like R, largely implemented as open-source software. Furthermore, R is designed to encourage learning and developing, with easy starting mechanisms for programming and also techniques to help you move on to more serious applications. The complete picture (the R system, the language, the available packages, and the programming environment) constitutes an unmatched resource for computing with data.

At the same time, the “with” word in Programming with R is important. No software system is sufficient for exploring data, and we emphasize interfaces between systems to take advantage of their respective strengths.

Is it worth taking time to develop or extend your skills in such programming? Yes, because the investment can pay off both in the ability to ask questions and in the trust you can have in the answers. Exploring data with the right questions and providing trustworthy answers to them are the key to analyzing data, and the twin principles that will guide us.
What’s in the book?

A sequence of chapters in the book takes the reader on successive steps from user to programmer to contributor, in the gradual progress that R encourages. Specifically: using R; simple programming; packages; classes and methods; inter-system interfaces (Chapters 2; 3; 4; 9 and 10; 11 and 12). The order reflects a natural progression, but the chapters are largely independent, with many cross references to encourage browsing.

Other chapters explore computational techniques needed at all stages: basic computations; graphics; computing with text (Chapters 6; 7; 8). Lastly, a chapter (13) discusses how R works and the appendix covers some topics in the history of the language.

Woven throughout are a number of reasonably serious examples, ranging from a few paragraphs to several pages, some of them continued elsewhere as they illustrate different techniques. See “Examples” in the index. I encourage you to explore these as leisurely as time permits, thinking about how the computations evolve, and how you would approach these or similar examples.

The book has a companion R package, SoDA, obtainable from the main CRAN repository, as described in Chapter 4. A number of the functions and classes developed in the book are included in the package. The package also contains code for most of the examples; see the documentation for "Examples" in the package.

Even at five hundred pages, the book can only cover a fraction of the relevant topics, and some of those receive a pretty condensed treatment. Spending time alternately on reading, thinking, and interactive computation will help clarify much of the discussion, I hope. Also, the final word is with the online documentation and especially with the software; a substantial benefit of open-source software is the ability to drill down and see what’s really happening.

Who should read this book?

I’ve written this book with three overlapping groups of readers generally in mind.
First, “data analysts”; that is, anyone with an interest in exploring data, especially in serious scientific studies. This includes statisticians, certainly, but increasingly others in a wide range of disciplines where data-rich studies now require such exploration. Helping to enable exploration is our mission here. I hope and expect that you will find that working with R and related software enhances your ability to learn from the data relevant to your interests.

If you have not used R or S-Plus before, you should precede this book (or at least supplement it) with a more basic presentation. There are a number of books and an even larger number of Web sites. Try searching with a combination of “introduction” or “introductory” along with “R”. Books by W. John Braun and Duncan J. Murdoch [2], Michael Crawley [11], Peter Dalgaard [12], and John Verzani [24], among others, are general introductions (both to R and to statistics). Other books and Web sites are beginning to appear that introduce R or S-Plus with a particular area of application in mind; again, some Web searching with suitable terms may find a presentation attuned to your interests.

A second group of intended readers are people involved in research or teaching related to statistical techniques and theory. R and other modern software systems have become essential in the research itself and in communicating its results to the community at large. Most graduate-level programs in statistics now provide some introduction to R. This book is intended to guide you on the followup, in which your software becomes more important to your research, and often a way to share results and techniques with the community. I encourage you to push forward and organize your software to be reusable and extendible, including the prospect of creating an R package to communicate your work to others. Many of the R packages now available derive from such efforts.

The third target group are those more directly interested in software and programming, particularly software for data analysis. The efforts of the R community have made it an excellent medium for “packaging” software and providing it to a large community of users.
R is maintained on all the widely used operating systems for computing with data and is easy for users to install. Its package mechanism is similarly well maintained, both in the central CRAN repository and in other repositories. Chapter 4 covers both using packages and creating your own. R can also incorporate work done in other systems, through a wide range of inter-system interfaces (discussed in Chapters 11 and 12).

Many potential readers in the first and second groups will have some experience with R or other software for statistics, but will view their involvement as doing only what’s absolutely necessary to “get the answers”. This book will encourage moving on to think of the interaction with the software as an important and valuable part of your activity. You may feel inhibited by not having done much programming before. Don’t be. Programming with