Report Writing for Data Science
in R
Roger D. Peng
This book is for sale at http://leanpub.com/reportwriting
This version was published on 2016-07-20
This is a Leanpub book. Leanpub empowers authors and
publishers with the Lean Publishing process. Lean
Publishing is the act of publishing an in-progress ebook
using lightweight tools and many iterations to get reader
feedback, pivot until you have the right book and build
traction once you do.
© 2015 - 2016 Roger D. Peng
Also By Roger D. Peng
R Programming for Data Science
The Art of Data Science
Exploratory Data Analysis with R
Executive Data Science
Conversations On Data Science
Contents
What is Reproducible Reporting? .
The Data Science Pipeline .
.
.
.
.
.
Literate Statistical Programming .
Reproducibility Check List .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
8
12
15
What is Reproducible
Reporting?
Watch a video of this chapter.1
This chapter will be about reproducible reporting, and I
want to take the opportunity to cover some basic concepts
and ideas that are related to reproducible reporting, just in
case you haven’t heard about it or don’t know what it is.
Before we get to reproducibility, we need to cover a little
background with respect to how science works (even if
you’re not a scientist, this is important). The basic idea is
that in science, replication is the most important element
of verifying and validating findings. So if you claim that
X causes Y, or that Vitamin C improves disease, or that
something causes a problem, what happens is that other
scientists that are independent of you will try to investigate
that same question and see if they come up with a similar
result. If lots of different people come up with the same
result and replicate the original finding, then we tend to
think that the original finding was probably true and that
this is a real relationship or real finding.
The ultimate standard in strengthening scientific evidence
is replication. The goal is to have independent people to do
independent things with different data, different methods,
and different laboratories and see if you get the same result.
There’s a sense that if a relationship in nature is truly there,
then it should be robust to having different people discover
it in different ways. Replication is particularly important
1https://www.youtube.com/watch?v=4rBX6r5emgQ
1
What is Reproducible Reporting?
2
in areas where findings can have big policy impacts or can
influence regulatory types of decisions.
What’s Wrong with Replication?
What’s wrong with replication? There’s really nothing wrong
with it. This is what science has been doing for a long time,
through hundreds of years. And there’s nothing wrong with
it today. But the problem is that it’s becoming more and
more challenging to do replication or to replicate other
studies. Part of the reason is because studies are getting
bigger and bigger.
In order to do big studies you need a lot of money and so,
well, there’s a lot of money involved! If you want to do ten
versions of the same study, you need ten times as much
money and there’s not as much money around as there used
to be. Sometimes it’s difficult to replicate a study because if
the original study took 20 years to do, it’s difficult to wait
around another 20 years for replication. Some studies are
just plain unique, such as studying the impact of a massive
earthquake in a very specific location and time. If you’re
looking at a unique situation in time or a unique population,
you can’t readily replicate that situation.
There are a lot of good reasons why you can’t replicate
a study. If you can’t replicate a study, is the alternative
just to do nothing, just let that study stand by itself? The
idea behind a reproducible reporting is to create a kind of
minimum standard or a middle ground where we won’t
be replicating a study, but maybe we can do something
in between. The basic problem is that you have the gold
standard, which is replication, and then you have the worst
standard which is doing nothing. What can we do that’s
in between the gold standard and diong nothing? That is
What is Reproducible Reporting?
3
where reproducibility comes in. That’s how we can kind of
bridge the gap between replication and nothing.
In non-research settings, often full replication isn’t even the
point. Often the goal is to preserve something to the point
where anybody in an organization can repeat what you did
(for example, after you leave the organization). In this case,
reproducibility is key to maintaining the history of a project
and making sure that every step along the way is clear.
Reproducibility to the Rescue
Why do we need this kind of middle ground? I haven’t
clearly defined reproducibility yet, but the basic idea is that
you need to make the data available for the original study
and the computational methods available so that other
people can look at your data and run the kind of analysis
that you’ve run, and come to the same findings that you
found.
What reproducible reporting is about is a validation of the
data analysis. Because you’re not collecting independent
data using independent methods, it’s a little bit more dif-
ficult to validate the scientific question itself. But if you can
take someone’s data and reproduce their findings, then you
can, in some sense, validate the data analysis. This involves
having the data and the code because more likely than not,
the analysis will have been done on the computer using
some sort of programming language, like R. So you can take
their code and their data and reproduce the findings that
they come up with. Then you can at least have confidence
that the analysis was done appropriately and that the correct
methods were used.
Recently, there’s been a lot of discussion of reproducibility
in the media and in the scientific literature. The journal
What is Reproducible Reporting?
4
Science had a special issue on reproducibility and data repli-
cation. Other journals of updated policies on publication
to encourage reproducibility. In 2012, a feature on the TV
show 60 minutes looked at a major incident at Duke Univer-
sity where many results involving a promising cancer test
were found to be not reproducible. This led to a number of
studies and clinical trials having to be stopped, followed by
an investigation which is still ongoing.
Finally, the Institute of Medicine, in response to a lot of
recent events involving reproducibility of scientific studies,
issued a report saying that best practices should be done
to promote and encourage reproducibility, particularly in
what’s called ‘omics based research, such as genomics, pro-
teomics, other similar areas involving high-throughput bi-
ological measurements. This was a very important report.
Of the many recommendations that the IOM made, the key
ones were that
• Data and metadata need to be made available;
• Computer code should be fully specified, so that peo-
ple can examine it to see what was done;
• All the steps of the computational analysis, including
any preprocessing of data, should be fully described so
that people can study it and reproduce it.
From “X” to “Computational X”
What is driving this need for a “reproducibility middle
ground” between replication and doing nothing? For starters,
there are a lot of new technologies on the scene and in
many different fields of study including, biology, chemistry
and environmental science. These technologies allow us to
collect data at a much higher throughput so we end up