R Programming for
Bioinformatics
Chapman & Hall/CRC
Computer Science and Data Analysis Series
The interface between the computer and statistical sciences is increasing, as each discipline
seeks to harness the power and resources of the other. This series aims to foster the integration
between the computer sciences and statistical, numerical, and probabilistic methods by
publishing a broad range of reference works, textbooks, and handbooks.
SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London
Proposals for the series should be sent directly to one of the series editors above, or submitted to:
Chapman & Hall/CRC
4th Floor, Albert House
1-4 Singer Street
London EC2A 4BQ
UK
Published Titles
Bayesian Artificial Intelligence
Kevin B. Korb and Ann E. Nicholson
Computational Statistics Handbook with MATLAB®, Second Edition
Wendy L. Martinez and Angel R. Martinez
Pattern Recognition Algorithms for Data Mining
Sankar K. Pal and Pabitra Mitra
Exploratory Data Analysis with MATLAB®
Wendy L. Martinez and Angel R. Martinez
Clustering for Data Mining: A Data Recovery Approach
Boris Mirkin
Correspondence Analysis and Data Coding with Java and R
Fionn Murtagh
Design and Modeling for Computer Experiments
Kai-Tai Fang, Runze Li, and Agus Sudjianto
Introduction to Machine Learning and Bioinformatics
Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis
R Graphics
Paul Murrell
R Programming for Bioinformatics
Robert Gentleman
Semisupervised Learning for Computational Linguistics
Steven Abney
Statistical Computing with R
Maria L. Rizzo
R Programming for
Bioinformatics
Robert Gentleman
Fred Hutchinson Cancer Research Center
Seattle, Washington, U.S.A.
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2009 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number-13: 978-1-4200-6367-7 (Hardcover)
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher
cannot assume responsibility for the validity of all materials or the consequences of their use. The
Authors and Publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC),
222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that
provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Gentleman, Robert, 1959-
R programming for bioinformatics / Robert Gentleman.
p. cm. -- (Chapman & Hall/CRC computer science and data analysis series)
Bibliographical references (p. ) and index.
ISBN 978-1-4200-6367-7
1. Bioinformatics. 2. R (Computer program language) I. Title. II. Series.
QH324.2.G46 2008
572.80285’5133--dc22
2008011352
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
1 Introducing R
1.1 Introduction
1.2 Motivation
1.3 A note on the text
1.4 Acknowledgments
2 R Language Fundamentals
2.1 Introduction
2.1.1 A brief introduction to R
2.1.2 Attributes
2.1.3 A very brief introduction to OOP in R
2.1.4 Some special values
2.1.5 Types of objects
2.1.6 Sequence generating and vector subsetting
2.1.7 Types of functions
2.2 Data structures
2.2.1 Atomic vectors
2.2.2 Numerical computing
2.2.3 Factors
2.2.4 Lists, environments and data frames
2.3 Managing your R session
2.3.1 Finding out more about an object
2.4 Language basics
2.4.1 Operators
2.5 Subscripting and subsetting
2.5.1 Vector and matrix subsetting
2.6 Vectorized computations
2.6.1 The recycling rule
2.7 Replacement functions
2.8 Functional programming
2.9 Writing functions
2.10 Flow control
2.10.1 Conditionals
2.11 Exception handling
2.12 Evaluation
2.12.1 Standard evaluation
2.12.2 Non-standard evaluation
2.12.3 Function evaluation
2.12.4 Indirect function invocation
2.12.5 Evaluation on exit
2.12.6 Other topics
2.12.7 Name spaces
2.13 Lexical scope
2.13.1 Likelihoods
2.13.2 Function optimization
2.14 Graphics
3 Object-Oriented Programming in R
3.1 Introduction
3.2 The basics of OOP
3.2.1 Inheritance
3.2.2 Dispatch
3.2.3 Abstract data types
3.2.4 Self-describing data
3.3 S3 OOP
3.3.1 Implicit classes
3.3.2 Expression data example
3.3.3 S3 generic functions and methods
3.3.4 Details of dispatch
3.3.5 Group generics
3.3.6 S3 replacement methods
3.4 S4 OOP
3.4.1 Classes
3.4.2 Types of classes
3.4.3 Attributes
3.4.4 Class unions
3.4.5 Accessor functions
3.4.6 Using S3 classes with S4 classes
3.4.7 S4 generic functions and methods
3.4.8 The syntax of method declaration
3.4.9 The semantics of method invocation
3.4.10 Replacement methods
3.4.11 Finding methods
3.4.12 Advanced topics
3.5 Using classes and methods in packages
3.6 Documentation
3.6.1 Finding documentation
3.6.2 Writing documentation
3.7 Debugging
3.8 Managing S3 and S4 together
3.8.1 Getting and setting the class attribute
3.8.2 Mixing S3 and S4 methods
3.9 Navigating the class and method hierarchy
4 Input and Output in R
4.1 Introduction
4.2 Basic file handling
4.2.1 Viewing files
4.2.2 File manipulation
4.2.3 Working with R’s binary format
4.3 Connections
4.3.1 Text connections
4.3.2 Interprocess communications
4.3.3 Seek
4.4 File input and output
4.4.1 Reading rectangular data
4.4.2 Writing data
4.4.3 Debian Control Format (DCF)
4.4.4 FASTA Format
4.5 Source and sink: capturing R output
4.6 Tools for accessing files on the Internet
5 Working with Character Data
5.1 Introduction
5.2 Builtin capabilities
5.2.1 Modifying text
5.2.2 Sorting and comparing
5.2.3 Matching a set of alternatives
5.2.4 Formatting text and numbers
5.2.5 Special characters and escaping
5.2.6 Parsing and deparsing
5.2.7 Plotting with text
5.2.8 Locale and font encoding
5.3 Regular expressions
5.3.1 Regular expression basics
5.3.2 Matching
5.3.3 Using regular expressions
5.3.4 Globbing and regular expressions
5.4 Prefixes, suffixes and substrings
5.5 Biological sequences
5.5.1 Encoding genomes
5.6 Matching patterns
5.6.1 Matching single query sequences
5.6.2 Matching many query sequences
5.6.3 Palindromes and paired matches
5.6.4 Alignments