Package clusterCrit for R
and its general term is defined as:
$$ w_{ij}^{\{k\}} = {}^t\big(V_i^{\{k\}} - \mu_i^{\{k\}}\big)\,\big(V_j^{\{k\}} - \mu_j^{\{k\}}\big) \tag{12} $$

In terms of variance and covariance, by analogy with the relations (4) and (5), the coefficients of the matrix $WG^{\{k\}}$ can also be written as:

$$ w_{ij}^{\{k\}} = n_k \times \mathrm{Cov}\big(V_i^{\{k\}}, V_j^{\{k\}}\big) \tag{13} $$

$$ w_{ii}^{\{k\}} = n_k \times \mathrm{Var}\big(V_i^{\{k\}}\big) \tag{14} $$

The matrices $WG^{\{k\}}$ are square symmetric matrices of size $p \times p$. Let us denote by $WG$ their sum for all the clusters:

$$ WG = \sum_{k=1}^{K} WG^{\{k\}} $$
As was the case with the matrix $T$ seen in section 1.1.1, the matrices $WG^{\{k\}}$ represent positive semi-definite quadratic forms $Q_k$ and, in particular, their eigenvalues and their determinants are greater than or equal to 0.
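As an illustration, the construction of the per-cluster scatter matrices and of their sum can be sketched in plain Python (not the package's R code). The toy data set, the labels and the helper name `scatter_matrix` are assumptions made for this example only.

```python
# Hypothetical toy data: N = 5 observations, p = 2 variables, K = 2 clusters.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0), (8.0, 8.0), (10.0, 6.0)]
labels = [1, 1, 1, 2, 2]
p = 2
K = 2


def scatter_matrix(rows):
    """Return the p x p matrix tX X, where X holds the rows centered on their mean."""
    n = len(rows)
    mu = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mu[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) for j in range(p)]
            for i in range(p)]


# One within-group scatter matrix per cluster, then their sum WG.
wg_k = {k: scatter_matrix([pt for pt, lab in zip(points, labels) if lab == k])
        for k in range(1, K + 1)}
wg = [[sum(wg_k[k][i][j] for k in wg_k) for j in range(p)] for i in range(p)]

# For a symmetric 2 x 2 matrix, positive semi-definiteness is equivalent to
# a non-negative trace together with a non-negative determinant.
for k, m in wg_k.items():
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    print(k, m, trace, det)
print(wg)
```

On this toy data the second cluster has only two points, so its scatter matrix is singular: its determinant is exactly 0, which is allowed for a positive semi-definite matrix.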
The within-cluster dispersion, noted $WGSS^{\{k\}}$ or $WGSS_k$, is the trace of the scatter matrix $WG^{\{k\}}$:

$$ WGSS^{\{k\}} = \mathrm{Tr}\big(WG^{\{k\}}\big) = \sum_{i \in I_k} \big\|M_i^{\{k\}} - G^{\{k\}}\big\|^2 \tag{15} $$
The within-cluster dispersion is the sum of the squared distances between the observations $M_i^{\{k\}}$ and the barycenter $G^{\{k\}}$ of the cluster.
Finally, the pooled within-cluster sum of squares $WGSS$ is the sum of the within-cluster dispersions for all the clusters:

$$ WGSS = \sum_{k=1}^{K} WGSS^{\{k\}} \tag{16} $$
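The two definitions above translate directly into a few lines of code. The following plain Python sketch uses a hypothetical toy data set and an illustrative helper name (`wgss_of_cluster`); it computes each within-cluster dispersion as a sum of squared distances to the barycenter, then pools them.

```python
# Hypothetical toy data: 5 observations in 2 dimensions, 2 clusters.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0), (8.0, 8.0), (10.0, 6.0)]
labels = [1, 1, 1, 2, 2]


def wgss_of_cluster(rows):
    """Sum of the squared distances from the rows to their barycenter."""
    n = len(rows)
    p = len(rows[0])
    g = [sum(r[j] for r in rows) / n for j in range(p)]
    return sum(sum((r[j] - g[j]) ** 2 for j in range(p)) for r in rows)


# Group the observations by cluster label.
clusters = {}
for pt, lab in zip(points, labels):
    clusters.setdefault(lab, []).append(pt)

# Within-cluster dispersions, then the pooled within-cluster sum of squares.
wgss_by_cluster = {k: wgss_of_cluster(rows) for k, rows in clusters.items()}
wgss = sum(wgss_by_cluster.values())
print(wgss_by_cluster, wgss)
```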
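The abovementioned geometric interpretation remains true at the level of each group: in each cluster $C_k$, the sum of the squared distances from the points of the cluster to their barycenter is also the sum of the squared distances between all the pairs of points in the cluster, divided by $n_k$. In other words:

$$ WGSS^{\{k\}} = \frac{1}{n_k} \sum_{i<j \in I_k} \big\|M_i^{\{k\}} - M_j^{\{k\}}\big\|^2 \tag{17} $$

Inverting the formula, one gets:

\begin{align}
\sum_{i \neq j \in I_k} \big\|M_i^{\{k\}} - M_j^{\{k\}}\big\|^2 &= 2 \sum_{i<j \in I_k} \big\|M_i^{\{k\}} - M_j^{\{k\}}\big\|^2 \tag{18} \\
&= 2\, n_k\, WGSS^{\{k\}} \tag{19}
\end{align}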
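The identity between distances to the barycenter and pairwise distances is easy to check numerically. The sketch below, in plain Python, uses a hypothetical three-point cluster; the names `sq_dist`, `pair_sum` and `ordered_sum` are illustrative only.

```python
from itertools import combinations, permutations

# Hypothetical cluster of n_k = 3 points in dimension 2.
cluster = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0)]
n_k = len(cluster)


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Barycenter of the cluster and within-cluster dispersion.
g = tuple(sum(coords) / n_k for coords in zip(*cluster))
wgss_k = sum(sq_dist(m, g) for m in cluster)

# Sum over unordered pairs (i < j) and over ordered pairs (i != j).
pair_sum = sum(sq_dist(a, b) for a, b in combinations(cluster, 2))
ordered_sum = sum(sq_dist(a, b) for a, b in permutations(cluster, 2))

print(wgss_k, pair_sum / n_k)         # both sides of the identity
print(ordered_sum, 2 * n_k * wgss_k)  # the inverted formula
```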
1.1.3 Between-group scatter
The between-group dispersion measures how the clusters are dispersed relative to one another. More precisely, it is defined as the dispersion of the barycenters $G^{\{k\}}$ of the clusters with respect to the barycenter $G$ of the whole data set.
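Let us denote by $B$ the matrix formed in rows by the vectors $\mu^{\{k\}} - \mu$, each one being reproduced $n_k$ times ($1 \le k \le K$). The between-group scatter matrix is the matrix

$$ BG = {}^tB\,B. \tag{20} $$

The general term of this matrix is:

$$ b_{ij} = \sum_{k=1}^{K} n_k \big(\mu_i^{\{k\}} - \mu_i\big)\big(\mu_j^{\{k\}} - \mu_j\big) \tag{21} $$

The between-group dispersion $BGSS$ is the trace of this matrix:

\begin{align}
BGSS = \mathrm{Tr}(BG) &= \sum_{k=1}^{K} n_k\, {}^t\big(\mu^{\{k\}} - \mu\big)\big(\mu^{\{k\}} - \mu\big) \\
&= \sum_{k=1}^{K} n_k\, \big\|\mu^{\{k\}} - \mu\big\|^2 \\
&= \sum_{k=1}^{K} n_k \sum_{j=1}^{p} \big(\mu_j^{\{k\}} - \mu_j\big)^2 \tag{22}
\end{align}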
Geometrically, this sum is the weighted sum of the squared distances between the $G^{\{k\}}$ and $G$, the weight being the number $n_k$ of elements in the cluster $C_k$:

$$ BGSS = \sum_{k=1}^{K} n_k\, \big\|G^{\{k\}} - G\big\|^2. \tag{23} $$
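To make the link between the matrix $BG$ and its geometric reading concrete, the following plain Python sketch builds $BG$ from its general term and compares its trace with the weighted sum of squared distances between barycenters. The toy data and the names `mean`, `bg`, `bgss_trace` and `bgss_geom` are assumptions introduced for this example.

```python
# Hypothetical toy data: 5 observations in 2 dimensions, 2 clusters.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0), (8.0, 8.0), (10.0, 6.0)]
labels = [1, 1, 1, 2, 2]
p = 2


def mean(rows):
    """Barycenter (vector of column means) of a list of points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(p)]


g = mean(points)  # barycenter of the whole data set

clusters = {}
for pt, lab in zip(points, labels):
    clusters.setdefault(lab, []).append(pt)

bg = [[0.0] * p for _ in range(p)]  # between-group scatter matrix
bgss_geom = 0.0                     # weighted sum of squared distances
for rows in clusters.values():
    n_k = len(rows)
    d = [a - b for a, b in zip(mean(rows), g)]  # G{k} - G
    for i in range(p):
        for j in range(p):
            bg[i][j] += n_k * d[i] * d[j]       # general term of BG
    bgss_geom += n_k * sum(x * x for x in d)

bgss_trace = sum(bg[i][i] for i in range(p))
print(bg, bgss_trace, bgss_geom)
```

Up to floating-point rounding, the trace and the geometric sum coincide.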
1.1.4 Pairs of points
The observations (rows of the matrix A) can be represented by points in the
space $\mathbb{R}^p$. Several quality indices defined in section 1.2 consider the distances
between these points. One is led to distinguish between pairs made of points
belonging to the same cluster and pairs made of points belonging to different
clusters.
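In the cluster $C_k$, there are $n_k(n_k - 1)/2$ pairs of distinct points (the order of the points does not matter). Let us denote by $N_W$ the total number of such pairs:

\begin{align}
N_W &= \sum_{k=1}^{K} \frac{n_k(n_k - 1)}{2} \tag{24} \\
&= \frac{1}{2} \Big( \sum_{k=1}^{K} n_k^2 - \sum_{k=1}^{K} n_k \Big) \tag{25} \\
&= \frac{1}{2} \Big( \sum_{k=1}^{K} n_k^2 - N \Big) \tag{26}
\end{align}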
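The count of within-cluster pairs can be checked on a small partition. This plain Python sketch uses a hypothetical label vector and illustrative variable names; it compares the per-cluster formula, the sum-of-squares formula and a brute-force enumeration.

```python
from itertools import combinations

# Hypothetical partition of N = 5 observations into K = 2 clusters.
labels = [1, 1, 1, 2, 2]
n_total = len(labels)

# Cluster sizes n_k.
sizes = {}
for lab in labels:
    sizes[lab] = sizes.get(lab, 0) + 1

# Sum over clusters of n_k (n_k - 1) / 2.
nw_pairs = sum(n * (n - 1) // 2 for n in sizes.values())

# Half of (sum of squared sizes minus N).
nw_squares = (sum(n * n for n in sizes.values()) - n_total) // 2

# Brute force: enumerate all unordered pairs and keep those sharing a label.
nw_brute = sum(1 for a, b in combinations(labels, 2) if a == b)

print(nw_pairs, nw_squares, nw_brute)
```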
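The total number of pairs of distinct points in the data set is

$$ N_T = \frac{N(N - 1)}{2} \tag{27} $$

Since $N = \sum_{k=1}^{K} n_k$, one can write:

$$ N_T = \frac{N(N - 1)}{2} = \frac{1}{2} \Big( \sum_{k=1}^{K} n_k \Big)^2 - \frac{1}{2} \sum_{k=1}^{K} n_k $$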
Index               Name in R           Ref.   Date
Ball-Hall           Ball_Hall           [2]    1965
Banfeld-Raftery     Banfeld_Raftery     [3]    1974
C index             C_index             [15]   1976
Calinski-Harabasz   Calinski_Harabasz   [5]    1974
Davies-Bouldin      Davies_Bouldin      [6]    1979
|T|/|W|             Det_Ratio           [24]   1971
Dunn                Dunn                [7]    1974
Dunn generalized    GDImn               [4]    1998
Gamma               Gamma               [1]    1975
G+                  G_plus              [23]   1974
k²|W|               Ksq_DetW            [16]   1975
log(|T|/|W|)        Log_Det_Ratio       [24]   1971
log(BGSS/WGSS)      Log_SS_Ratio        [14]   1975
McClain-Rao         McClain_Rao         [17]   2001
PBM                 PBM                 [19]   2004
Point biserial      Point_biserial      [18]   1981
Ratkowsky-Lance     Ratkowsky_Lance     [21]   1978
Ray-Turi            Ray_Turi            [22]   1999
Scott-Symons        Scott_Symons        [24]   1971
SD                  SD_Scat             [13]   2001
SD                  SD_Dis              [13]   2001
S_Dbw               S_Dbw               [12]   2001
Silhouette          Silhouette          [20]   1987
Tr(W)               Trace_W             [8]    1965
Tr(W⁻¹B)            Trace_WiB           [10]   1967
Wemmert-Gançarski   Wemmert_Gancarski
Xie-Beni            Xie_Beni            [25]   1991

Table 1: Index names in the package clusterCrit for R and bibliographic references.