Tukey’s Biweight Correlation and the Breakdown
Mary Owen
April 2, 2010
Contents

1 Introduction
2 Background
  2.1 Theory of M-estimation
    2.1.1 M-Estimators of Location
    2.1.2 The Sample Mean and Tukey's Biweight
    2.1.3 W-estimators
  2.2 Qualities of Estimators
  2.3 Characteristics of the Biweight
3 Simulation and Computation
4 Conclusion
A Appendix: Tables
Chapter 1
Introduction
DNA microarray experiments have become increasingly popular in recent years as a method of uncovering
patterns of gene expression and relationships between genes. Though sometimes problematic, it is possible to
divide genes into groups based on the similarity of their patterns of expression, measured with any of several
different measures of similarity, the Pearson correlation being the most popular [4]. With the measure
of similarity determined, a clustering algorithm or gene network analysis can give more insight into the
relationship between the genes.
However, the similarity measure will have a large impact on the clustering algorithm. If the similarity
measure is not resistant enough to outliers and noise in the data, dissimilar genes may be clustered as
though they are co-expressed, or vice versa. Microarray data tend to contain a lot of noise, which can
come from many different sources over the course of a microarray experiment and may remain in the data
despite normalization or filtering [7, 6]. For accurate results, a robust measure of similarity is needed. In a
2007 paper, Hardin et al. proposed a resistant correlation measure for use in clustering and gene network
algorithms. This new measure is based on Tukey's biweight; in addition to its use in clustering and gene
network algorithms, comparing it with the Pearson correlation provides a method of flagging
questionable data points.
It is more resistant than the Pearson correlation and more efficient than the
Spearman correlation [4].
Although the biweight correlation has performed promisingly so far, there is still much to learn about it. It
is somewhat unusual in that its breakdown point, defined in section 2.3, can be set to a specified value. In this
paper, we will explore the performance of the biweight correlation under a variety of different experimental
conditions in order to determine how the biweight correlation varies as its breakdown point changes. For
different breakdown points, underlying distributions of the data, and sample sizes, we will compute the
efficiency and bias. We will compare the performance of the biweight to that of the Pearson correlation
under identical circumstances with the goal of finding the best similarity measure.
Chapter 2
Background
2.1 Theory of M-estimation
2.1.1 M-Estimators of Location
The original development of M-estimators of location was based on creating robust estimators similar to
Maximum Likelihood Estimators (MLEs). Because some M-estimators are robust, that is, not overly affected
by outliers, they are well suited to the analysis of microarray data, which are often full of outliers
and noise. M-estimators of location are independent of the underlying probability distribution function of
the data because they minimize an objective function that is dependent on the distances from the observed
values to the estimate. Since microarray data are multivariate (that is, for each person we know about the
expression of multiple genes), we will consider each observation to be a vector, X. Most of the theory that
applies to M-estimators of location in one dimension can be generalized to multivariate data. For a function
ρ(Xi, t), called the objective function, and a sample X1, . . . , Xn, the M-estimator of location Tn(X1, . . . , Xn)
is the value of t that minimizes
$$\sum_{i=1}^{n} \rho(X_i, t) \qquad (2.1)$$
When ρ is continuous and has a derivative with respect to t, we call that derivative the M-estimator's
ψ-function and simplify the calculation of the M-estimator by computing a value of t such that
$$\sum_{i=1}^{n} \psi(X_i, t) = 0 \qquad (2.2)$$
For the M-estimators of location with which we are concerned, ρ is independent of the underlying distribution
of the data. If ρ = −log f, where f is the pdf of the distribution, then the M-estimator is the MLE [5].
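To make the definition concrete, the following sketch finds a one-dimensional M-estimate by directly minimizing the objective in (2.1). The function names and the golden-section search are illustrative choices for this sketch, not a method from the paper:

```python
import math

def m_estimate_location(xs, rho, tol=1e-8):
    """Find the t minimizing sum(rho(x, t)) over t, as in equation (2.1).

    Uses golden-section search on [min(xs), max(xs)], assuming the
    objective is unimodal in t. Illustrative only.
    """
    phi = (math.sqrt(5) - 1) / 2
    obj = lambda t: sum(rho(x, t) for x in xs)
    a, b = min(xs), max(xs)
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if obj(c) < obj(d):
            # Minimum lies in [a, d]; shrink from the right.
            b, d = d, c
            c = b - phi * (b - a)
        else:
            # Minimum lies in [c, b]; shrink from the left.
            a, c = c, d
            d = a + phi * (b - a)
    return (a + b) / 2

data = [1.0, 2.0, 3.0, 10.0]
# With rho(x, t) = (x - t)^2 the M-estimate is the sample mean (4.0 here).
t_sq = m_estimate_location(data, lambda x, t: (x - t) ** 2)
# With rho(x, t) = |x - t| the M-estimate is a sample median, which is
# far less affected by the outlier at 10.
t_abs = m_estimate_location(data, lambda x, t: abs(x - t))
```

The contrast between the two ρ choices previews the robustness discussion below: the squared-error estimate is pulled toward the outlier, while the absolute-error estimate stays near the bulk of the data.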
We would like any M-estimator of location to be location-and-scale-equivariant, that is, we would like it to
respond in a reasonable manner to linear changes across the sample. An M-estimator is location equivariant
if, when every observation is shifted by some amount C, the Tn of that shifted sample also shifts by C.
An M-estimator of location is scale equivariant if, when every observation is multiplied by some nonzero
constant d, the Tn of that altered sample is d times the Tn of the unaltered sample. An M-estimator of
location that possesses both these characteristics is location-and-scale equivariant. In other words, we want
the following to hold:
$$T_n(dX_1 + C, dX_2 + C, \ldots, dX_n + C) = d\,T_n(X_1, X_2, \ldots, X_n) + C \qquad (2.3)$$
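As a quick numerical illustration of the equivariance property above (the helper names here are hypothetical, introduced only for this sketch), one can check the sample mean directly:

```python
def sample_mean(xs):
    """The sample mean, the least-squares M-estimator of location."""
    return sum(xs) / len(xs)

def is_location_scale_equivariant(tn, xs, d, c, tol=1e-9):
    """Check equation (2.3): Tn(d*X + C) should equal d*Tn(X) + C."""
    shifted = [d * x + c for x in xs]
    return abs(tn(shifted) - (d * tn(xs) + c)) < tol

xs = [0.5, 1.5, 2.0, 4.0]
print(is_location_scale_equivariant(sample_mean, xs, d=3.0, c=-2.0))  # True
```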
In order for an M-estimator of location to be location-and-scale-equivariant, it is often necessary to scale the
observations when computing ρ and ψ. This is often done whether rescaling is necessary or not because it
makes notation simpler. The matter of location-equivariance is fairly simple; it can be satisfied by making
the input of the form X − Tn. Adjusting for scale is not as straightforward. We must pick some estimator
of the scale of the sample, called Sn, which is a function of the observations X1, X2, . . . , Xn. Sn is scale-
equivariant and location invariant, meaning that it is unaffected by a shift in the sample as described above
[5]. For our multivariate microarray data, the observations are transformed to
$$u_i = (X_i - T_n)^t (cS_n)^{-1} (X_i - T_n) \qquad (2.4)$$
where Tn is a location vector and Sn is a shape matrix. In one dimension, the observations are transformed
to
$$u_i = \frac{x_i - T_n}{cS_n} \qquad (2.5)$$
where Tn and Sn are location and shape scalars, and c is a tuning constant. Note that regardless of the
dimension of the data, the ui will be one-dimensional. Equation (2.1) now becomes
$$\sum_{i=1}^{n} \rho(u_i) \qquad (2.6)$$
and Equation (2.2) becomes
$$\sum_{i=1}^{n} \psi(u_i) = 0 \qquad (2.7)$$
where the Tn and Sn that solve equation (2.7) are the resulting M-estimates of location and shape, respectively.
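A minimal one-dimensional sketch of the transformation in (2.5), assuming the sample median and MAD as the location and scale estimates (a common pairing with the biweight, though not a choice fixed by the text here):

```python
import statistics

def scaled_residuals(xs, c=9.0):
    """One-dimensional version of equation (2.5): u_i = (x_i - Tn) / (c * Sn).

    The median and MAD stand in for Tn and Sn; c = 9 is a tuning
    constant often paired with the biweight. Illustrative assumptions.
    """
    tn = statistics.median(xs)
    sn = statistics.median([abs(x - tn) for x in xs])  # the MAD
    return [(x - tn) / (c * sn) for x in xs]

us = scaled_residuals([1.0, 2.0, 3.0, 4.0, 100.0])
print(us)  # the outlier 100 yields a u_i far from the rest
```

Because the median and MAD are themselves resistant, the u_i of ordinary observations stay small while an outlier is clearly separated, which is what lets a redescending ψ downweight it.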
2.1.2 The Sample Mean and Tukey’s Biweight
The simplest example of an M-estimator is the least squares estimator of the sample mean in one dimension.
In this case, we wish to minimize the sum of the squared residuals, the distances between the observations
and the estimator. Thus,
$$\rho(x, t) = \left(\frac{x - t}{cS}\right)^2$$
and we minimize
$$\sum_{i=1}^{n} \left(\frac{x_i - t}{cS}\right)^2$$
or, after differentiating to simplify the calculation, solve the equation
$$\sum_{i=1}^{n} \frac{x_i - t}{cS} = 0$$
The value of t that solves this is the sample mean, $T_n = \sum_{i=1}^{n} x_i / n$. It is straightforward to
show that the sample mean is location-and-scale equivariant.
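The defining property of the sample mean, that the residuals sum to zero at t = Tn, is easy to verify numerically; a minimal check:

```python
xs = [2.0, 4.0, 9.0]
t = sum(xs) / len(xs)                    # sample mean, Tn = sum(x_i)/n
residual_sum = sum(x - t for x in xs)    # should vanish at t = Tn
print(residual_sum)                      # 0.0 up to floating-point error
```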
Unfortunately, the sample mean is not robust. Outliers can skew the estimate of the sample mean enough
so that it is no longer helpful as an analysis tool. To find a robust estimator, we turn to a different family
of ρ functions. Statisticians have developed many robust M-estimators, but in this paper we are concerned
with one in particular: Tukey’s biweight. The biweight estimate of correlation is produced by first iteratively
calculating the biweight estimate of shape, ˜S. The (i, j)th element of ˜S, ˜sij, can be thought of as a resistant
estimate of the covariance between two vectors, Xi and Xj. The biweight correlation of these two vectors is
calculated as follows:
$$\tilde{r}_{ij} = \frac{\tilde{s}_{ij}}{\sqrt{\tilde{s}_{ii}\,\tilde{s}_{jj}}} \qquad (2.8)$$
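A sketch of how equation (2.8) converts a shape matrix into a correlation. For brevity, the shape entries below come from the ordinary sample covariance rather than the iteratively computed biweight estimate of shape, so this illustrates only the final normalization step, not the resistant estimation itself:

```python
def correlation_from_shape(s, i, j):
    """Equation (2.8): r_ij = s_ij / sqrt(s_ii * s_jj)."""
    return s[i][j] / (s[i][i] * s[j][j]) ** 0.5

def sample_covariance(x, y):
    """Ordinary sample covariance, standing in for a biweight shape entry."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
s = [[sample_covariance(x, x), sample_covariance(x, y)],
     [sample_covariance(y, x), sample_covariance(y, y)]]
r = correlation_from_shape(s, 0, 1)
print(r)  # 1.0 for perfectly linear data, up to floating-point error
```

Substituting a resistant shape matrix for `s` would yield the biweight correlation; the normalization in (2.8) is unchanged either way.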