Tukey’s Biweight Correlation and the Breakdown

Mary Owen

April 2, 2010
Contents

1 Introduction
2 Background
  2.1 Theory of M-estimation
    2.1.1 M-Estimators of Location
    2.1.2 The Sample Mean and Tukey’s Biweight
    2.1.3 W-estimators
  2.2 Qualities of Estimators
  2.3 Characteristics of the Biweight
3 Simulation and Computation
4 Conclusion
A Appendix: Tables
Chapter 1: Introduction

DNA microarray experiments have become increasingly popular in recent years as a method of uncovering patterns of gene expression and relationships between genes. Though sometimes problematic, it is possible to divide genes into groups based on the similarity of their patterns of expression, measured with any of several different measures of similarity, the Pearson correlation being the most popular [4]. With the measure of similarity determined, a clustering algorithm or gene network analysis can give more insight into the relationships between genes. However, the similarity measure has a large impact on the clustering algorithm: if it is not resistant enough to outliers and noise in the data, dissimilar genes may be clustered as though they are co-expressed, or vice versa. Microarray data tend to contain a lot of noise, which can come from many different sources over the course of a microarray experiment and may remain in the data despite normalization or filtering [7, 6]. For accurate results, a robust measure of similarity is needed.

In a 2007 paper, Hardin et al. proposed a resistant correlation measure for use in clustering and gene network algorithms. This new correlation measure is based on Tukey’s biweight and can be used both in clustering and gene network algorithms and, by comparison with the Pearson correlation, as a method of flagging questionable data points. It is more resistant than the Pearson correlation and more efficient than the Spearman correlation [4]. Although the biweight correlation has performed promisingly so far, there is still much to learn about it. It is somewhat unusual in that its breakdown point, defined in section 2.3, can be set to a specified value. In this paper, we will explore the performance of the biweight correlation under a variety of experimental conditions in order to determine how it varies as its breakdown point changes.
For different breakdown points, underlying distributions of the data, and sample sizes, we will compute the efficiency and bias. We will compare the performance of the biweight to that of the Pearson correlation under identical circumstances with the goal of finding the best similarity measure.
Chapter 2: Background

2.1 Theory of M-estimation

2.1.1 M-Estimators of Location

The original development of M-estimators of location was based on creating robust estimators similar to Maximum Likelihood Estimators (MLEs). Because some M-estimators are robust, that is, not overly affected by outliers, they are excellent tools for the analysis of microarray data, which is often full of outliers and noise. M-estimators of location are independent of the underlying probability distribution of the data because they minimize an objective function that depends only on the distances from the observed values to the estimate. Since microarray data are multivariate (that is, for each person we know the expression of multiple genes), we will consider each observation to be a vector, X. Most of the theory that applies to M-estimators of location in one dimension can be generalized to multivariate data.

For a function ρ(X_i, t), called the objective function, and a sample X_1, ..., X_n, the M-estimator of location T_n(X_1, ..., X_n) is the value of t that minimizes

    \sum_{i=1}^{n} \rho(X_i, t)    (2.1)

When ρ is continuous and has a derivative with respect to t, we call that derivative the M-estimator’s ψ-function and simplify the calculation of the M-estimator by computing a value of t such that

    \sum_{i=1}^{n} \psi(X_i, t) = 0    (2.2)

For the M-estimators of location with which we are concerned, ρ is independent of the underlying distribution of the data. If ρ = −log f, where f is the pdf of the distribution, then the M-estimator is the MLE [5].

We would like any M-estimator of location to be location-and-scale equivariant, that is, we would like it to respond in a reasonable manner to linear changes across the sample. An M-estimator is location equivariant if, when every observation is shifted by some amount C, the T_n of that shifted sample also shifts by C.
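As a concrete illustration of solving equation (2.2) numerically, the sketch below computes a robust location estimate by iterative reweighting, using the weight function associated with Tukey’s biweight (introduced later in this chapter). The function name, the median/MAD starting values, and the tuning constant c = 6 are illustrative assumptions, not the paper’s exact procedure.

```python
import numpy as np

def biweight_location(x, c=6.0, tol=1e-8, max_iter=100):
    """Iteratively reweighted location estimate using Tukey's biweight weights.

    Starts from the median and scales residuals by the MAD (a resistant S_n).
    Hypothetical sketch: names and defaults are illustrative.
    """
    x = np.asarray(x, dtype=float)
    t = np.median(x)                      # resistant starting value
    s = np.median(np.abs(x - t))          # MAD as the scale estimate S_n
    if s == 0:
        return t
    for _ in range(max_iter):
        u = (x - t) / (c * s)             # scaled residuals, as in (2.5)
        # Biweight weights: (1 - u^2)^2 inside |u| < 1, zero outside,
        # so gross outliers get weight zero
        w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)
        t_new = np.sum(w * x) / np.sum(w)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 50.0])  # one gross outlier
print(biweight_location(data))  # ≈ 10, while the ordinary mean is ≈ 16.7
```

The outlier at 50 receives weight zero after scaling, so the estimate stays near the bulk of the data, unlike the sample mean.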
An M-estimator of location is scale equivariant if, when every observation is multiplied by some nonzero constant d, the T_n of that altered sample is d times the T_n of the unaltered sample. An M-estimator of location that possesses both these characteristics is location-and-scale equivariant. In other words, we want the following to hold:

    T_n(dX_1 + C, dX_2 + C, \ldots, dX_n + C) = d\,T_n(X_1, X_2, \ldots, X_n) + C    (2.3)

In order for an M-estimator of location to be location-and-scale equivariant, it is often necessary to scale the observations when computing ρ and ψ. This is often done whether rescaling is necessary or not because it makes notation simpler. The matter of location equivariance is fairly simple; it can be satisfied by making the input of the form X − T_n. Adjusting for scale is not as straightforward. We must pick some estimator of the scale of the sample, called S_n, which is a function of the observations X_1, X_2, ..., X_n. S_n is scale equivariant and location invariant, meaning that it is unaffected by a shift in the sample as described above [5].

For our multivariate microarray data, the observations are transformed to

    u_i = (X_i - T_n)^t (cS_n)^{-1} (X_i - T_n)    (2.4)

where T_n is a location vector and S_n is a shape matrix. In one dimension, the observations are transformed to

    u_i = \frac{x_i - T_n}{cS_n}    (2.5)

where T_n and S_n are location and shape scalars, and c is a tuning constant. Note that regardless of the dimension of the data, the u_i will be one-dimensional. Equation (2.1) now becomes

    \sum_{i=1}^{n} \rho(u_i)    (2.6)

and Equation (2.2) becomes

    \sum_{i=1}^{n} \psi(u_i) = 0    (2.7)

where the T_n and S_n that solve equation (2.7) are the resulting M-estimates of location and shape, respectively.

2.1.2 The Sample Mean and Tukey’s Biweight

The simplest example of an M-estimator is the least squares estimator of the sample mean in one dimension. In this case, we wish to minimize the sum of the squared residuals, the distances between the observations and the estimator. Thus,

    \rho(x, t) = \left( \frac{x - t}{cS} \right)^2

and we minimize

    \sum_{i=1}^{n} \left( \frac{x_i - t}{cS} \right)^2

or, after differentiating to simplify the calculation, solve the equation

    \sum_{i=1}^{n} \frac{x_i - t}{cS} = 0

The value of t that solves this is the sample mean, T_n = \sum_{i=1}^{n} x_i / n. It is straightforward to show that the sample mean is location-and-scale equivariant.

Unfortunately, the sample mean is not robust. Outliers can skew the estimate of the sample mean enough so that it is no longer helpful as an analysis tool. To find a robust estimator, we turn to a different family of ρ functions. Statisticians have developed many robust M-estimators, but in this paper we are concerned with one in particular: Tukey’s biweight. The biweight estimate of correlation is produced by first iteratively calculating the biweight estimate of shape, \tilde{S}. The (i, j)th element of \tilde{S}, \tilde{s}_{ij}, can be thought of as a resistant estimate of the covariance between two vectors, X_i and X_j. The biweight correlation of these two vectors is calculated as follows:

    \tilde{r}_{ij} = \frac{\tilde{s}_{ij}}{\sqrt{\tilde{s}_{ii}\,\tilde{s}_{jj}}}    (2.8)
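Equation (2.8) is the ordinary covariance-to-correlation normalization applied to the robust shape matrix. A minimal sketch, assuming a shape estimate S̃ has already been produced by the iterative biweight procedure (not implemented here); the function name is hypothetical:

```python
import numpy as np

def shape_to_correlation(S):
    """Normalize a shape (covariance-like) matrix into a correlation matrix:
    r_ij = s_ij / sqrt(s_ii * s_jj), as in equation (2.8)."""
    S = np.asarray(S, dtype=float)
    d = np.sqrt(np.diag(S))        # square roots of the diagonal entries
    return S / np.outer(d, d)      # divide each s_ij by sqrt(s_ii * s_jj)

# A toy 2x2 shape matrix standing in for a biweight estimate S-tilde
S = np.array([[4.0, 2.0],
              [2.0, 9.0]])
print(shape_to_correlation(S))  # off-diagonal entries: 2 / sqrt(4 * 9) = 1/3
```

The diagonal of the result is identically 1, and each off-diagonal entry lies in [−1, 1] whenever S is positive semidefinite.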