IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 3, MAY 2005

Survey of Clustering Algorithms

Rui Xu, Student Member, IEEE, and Donald Wunsch II, Fellow, IEEE

Abstract—Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure and cluster validation, are also discussed.

Index Terms—Adaptive resonance theory (ART), clustering, clustering algorithm, cluster validation, neural networks, proximity, self-organizing feature map (SOFM).

Manuscript received March 31, 2003; revised September 28, 2004. This work was supported in part by the National Science Foundation and in part by the M. K. Finley Missouri Endowment. The authors are with the Department of Electrical and Computer Engineering, University of Missouri-Rolla, Rolla, MO 65409 USA (e-mail: rxu@umr.edu; dwunsch@ece.umr.edu). Digital Object Identifier 10.1109/TNN.2005.845141

I. INTRODUCTION

WE ARE living in a world full of data. Every day, people encounter a large amount of information and store or represent it as data, for further analysis and management. One of the vital means in dealing with these data is to classify or group them into a set of categories or clusters. Actually, as one of the most primitive activities of human beings [14], classification plays an important and indispensable role in the long history of human development. In order to learn a new object or understand a new phenomenon, people always try to seek the features that can describe it, and further compare it with other known objects or phenomena, based on the similarity or dissimilarity, generalized as proximity, according to some certain standards or rules. Basically, classification systems are either supervised or unsupervised, depending on whether they assign new inputs to one of a finite number of discrete supervised classes or unsupervised categories, respectively [38], [60], [75]. In supervised classification, the mapping from a set of input data vectors ($\mathbf{x} \in \mathbb{R}^d$, where $d$ is the input space dimensionality) to a finite set of discrete class labels ($y \in \{1, \dots, C\}$, where $C$ is the total number of class types) is modeled in terms of some mathematical function $y = y(\mathbf{x}, \mathbf{w})$, where $\mathbf{w}$ is a vector of adjustable parameters. The values of these parameters are determined (optimized) by an inductive learning algorithm (also termed inducer), whose aim is to minimize an empirical risk functional (related to an inductive principle) on a finite data set of input-output examples $\{(\mathbf{x}_i, y_i),\ i = 1, \dots, N\}$, where $N$ is the finite cardinality of the available representative data set [38], [60], [167]. When the inducer reaches convergence or terminates, an induced classifier is generated [167].

In unsupervised classification, called clustering or exploratory data analysis, no labeled data are available [88], [150]. The goal of clustering is to separate a finite unlabeled data set into a finite and discrete set of "natural," hidden data structures, rather than to provide an accurate characterization of unobserved samples generated from the same probability distribution [23], [60]. This can make the task of clustering fall outside of the framework of unsupervised predictive learning problems, such as vector quantization [60] (see Section II-C), probability density function estimation [38], [60] (see Section II-D), and entropy maximization [99]. It is noteworthy that clustering differs from multidimensional scaling (perceptual maps), whose goal is to depict all the evaluated objects in a way that minimizes the topographical distortion while using as few dimensions as possible. Also note that, in practice, many (predictive) vector quantizers are also used for (nonpredictive) clustering analysis [60].
Nonpredictive clustering is a subjective process in nature, which precludes an absolute judgment as to the relative efficacy of all clustering techniques [23], [152]. As pointed out by Backer and Jain [17], "in cluster analysis a group of objects is split up into a number of more or less homogeneous subgroups on the basis of an often subjectively chosen measure of similarity (i.e., chosen subjectively based on its ability to create 'interesting' clusters), such that the similarity between objects within a subgroup is larger than the similarity between objects belonging to different subgroups."^1

^1 The preceding quote is taken verbatim from verbiage suggested by the anonymous associate editor, a suggestion which we gratefully acknowledge.

Clustering algorithms partition data into a certain number of clusters (groups, subsets, or categories). There is no universally agreed upon definition [88]. Most researchers describe a cluster by considering the internal homogeneity and the external separation [111], [124], [150], i.e., patterns in the same cluster should be similar to each other, while patterns in different clusters should not. Both the similarity and the dissimilarity should be examinable in a clear and meaningful way. Here, we give some simple mathematical descriptions of several types of clustering, based on the descriptions in [124].

Given a set of input patterns $X = \{\mathbf{x}_1, \dots, \mathbf{x}_j, \dots, \mathbf{x}_N\}$, where $\mathbf{x}_j = (x_{j1}, x_{j2}, \dots, x_{jd})^T \in \mathbb{R}^d$ and each measure $x_{ji}$ is said to be a feature (attribute, dimension, or variable):

• (Hard) partitional clustering attempts to seek a $K$-partition of $X$, $C = \{C_1, \dots, C_K\}$ ($K \le N$), such that
1) $C_i \ne \emptyset$, $i = 1, \dots, K$;
2) $\bigcup_{i=1}^{K} C_i = X$;
3) $C_i \cap C_j = \emptyset$, $i, j = 1, \dots, K$ and $i \ne j$.

• Hierarchical clustering attempts to construct a tree-like, nested structure partition of $X$, $H = \{H_1, \dots, H_Q\}$ ($Q \le N$), such that $C_i \in H_m$, $C_j \in H_l$, and $m > l$ imply $C_i \subset C_j$ or $C_i \cap C_j = \emptyset$ for all $i$, $j \ne i$, $m, l = 1, \dots, Q$.

For hard partitional clustering, each pattern only belongs to one cluster. However, a pattern may also be allowed to belong to all clusters with a degree of membership $u_{i,j} \in [0, 1]$, which represents the membership coefficient of the $j$th object in the $i$th cluster and satisfies the following two constraints:
$$\sum_{i=1}^{K} u_{i,j} = 1, \ \forall j \quad \text{and} \quad \sum_{j=1}^{N} u_{i,j} < N, \ \forall i$$
as introduced in fuzzy set theory [293]. This is known as fuzzy clustering, reviewed in Section II-G.
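To make these definitions concrete, the following minimal Python sketch (our illustration, not part of the paper; the helper names are ours) checks the three hard-partition conditions on an integer label vector, and the two fuzzy-membership constraints on a $K \times N$ membership matrix:

```python
import numpy as np

def is_hard_partition(labels, n_clusters):
    """Check the three hard K-partition conditions on integer labels
    in {0, ..., n_clusters-1}: every cluster nonempty (condition 1),
    every object assigned (condition 2); exclusivity (condition 3)
    holds by construction, since each object carries one label."""
    labels = np.asarray(labels)
    nonempty = all(np.any(labels == i) for i in range(n_clusters))
    covered = np.all((labels >= 0) & (labels < n_clusters))
    return bool(nonempty and covered)

def is_fuzzy_membership(U):
    """Check the two fuzzy constraints on a K x N matrix U:
    each column sums to 1, and no cluster absorbs all N objects."""
    U = np.asarray(U, dtype=float)
    K, N = U.shape
    cols_sum_to_one = np.allclose(U.sum(axis=0), 1.0)
    no_total_cluster = np.all(U.sum(axis=1) < N)
    in_range = np.all((U >= 0) & (U <= 1))
    return bool(cols_sum_to_one and no_total_cluster and in_range)

# Example: 6 objects, K = 2
print(is_hard_partition([0, 0, 1, 1, 1, 0], n_clusters=2))  # True
U = np.random.rand(2, 6)
U /= U.sum(axis=0, keepdims=True)  # normalize columns to satisfy constraint 1
print(is_fuzzy_membership(U))      # True
```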
Fig. 1. Clustering procedure. The typical cluster analysis consists of four steps with a feedback pathway. These steps are closely related to each other and affect the derived clusters.

Fig. 1 depicts the procedure of cluster analysis with four basic steps.

1) Feature selection or extraction. As pointed out by Jain et al. [151], [152] and Bishop [38], feature selection chooses distinguishing features from a set of candidates, while feature extraction utilizes some transformations to generate useful and novel features from the original ones. Both are very crucial to the effectiveness of clustering applications. Elegant selection of features can greatly decrease the workload and simplify the subsequent design process. Generally, ideal features should be of use in distinguishing patterns belonging to different clusters, immune to noise, and easy to extract and interpret. We elaborate the discussion on feature extraction in Section II-L, in the context of data visualization and dimensionality reduction. More information on feature selection can be found in [38], [151], and [250].

2) Clustering algorithm design or selection. This step is usually combined with the selection of a corresponding proximity measure and the construction of a criterion function. Patterns are grouped according to whether they resemble each other. Obviously, the proximity measure directly affects the formation of the resulting clusters. Almost all clustering algorithms are explicitly or implicitly connected to some definition of proximity measure. Some algorithms even work directly on the proximity matrix, as defined in Section II-A. Once a proximity measure is chosen, the construction of a clustering criterion function makes the partition of clusters an optimization problem, which is well defined mathematically and has rich solutions in the literature. Clustering is ubiquitous, and a wealth of clustering algorithms has been developed to solve different problems in specific fields. However, there is no clustering algorithm that can be universally used to solve all problems: "it has been very difficult to develop a unified framework for reasoning about it (clustering) at a technical level, and profoundly diverse approaches to clustering" exist [166], as proved through an impossibility theorem. Therefore, it is important to carefully investigate the characteristics of the problem at hand, in order to select or design an appropriate clustering strategy.

3) Cluster validation. Given a data set, each clustering algorithm can always generate a division, no matter whether the structure exists or not. Moreover, different approaches usually lead to different clusters; and even for the same algorithm, parameter identification or the presentation order of input patterns may affect the final results. Therefore, effective evaluation standards and criteria are important to provide the users with a degree of confidence for the clustering results derived from the used algorithms. These assessments should be objective and have no preferences to any algorithm. Also, they should be useful for answering questions like how many clusters are hidden in the data, whether the clusters obtained are meaningful or just an artifact of the algorithms, or why we choose some algorithm instead of another. Generally, there are three categories of testing criteria: external indices, internal indices, and relative indices. These are defined on three types of clustering structures, known as partitional clustering, hierarchical clustering, and individual clusters [150]. Tests for the situation where no clustering structure exists in the data are also considered [110], but seldom used, since users are confident of the presence of clusters. External indices are based on some prespecified structure, which is the reflection of prior information on the data, and are used as a standard to validate the clustering solutions. Internal tests are not dependent on external information (prior knowledge); on the contrary, they examine the clustering structure directly from the original data. Relative criteria place the emphasis on the comparison of different clustering structures, in order to provide a reference to decide which one may best reveal the characteristics of the objects. We will not survey the topic in depth and refer interested readers to [74], [110], and [150]. However, we will cover more details on how to determine the number of clusters in Section II-M. Some more recent discussion can be found in [22], [37], [121], [180], and [181]. Approaches for fuzzy clustering validity are reported in [71], [104], [123], and [220].

4) Results interpretation. The ultimate goal of clustering is to provide users with meaningful insights from the original data, so that they can effectively solve the problems encountered. Experts in the relevant fields interpret the data partition. Further analyses, even experiments, may be required to guarantee the reliability of the extracted knowledge.

Note that the flowchart also includes a feedback pathway. Cluster analysis is not a one-shot process. In many circumstances, it needs a series of trials and repetitions. Moreover, there are no universal and effective criteria to guide the selection of features and clustering schemes. Validation criteria provide some insights on the quality of clustering solutions, but even how to choose the appropriate criterion is still a problem requiring more effort.
Clustering has been applied in a wide variety of fields, ranging from engineering (machine learning, artificial intelligence, pattern recognition, mechanical engineering, electrical engineering) and computer sciences (web mining, spatial database analysis, textual document collection, image segmentation), to life and medical sciences (genetics, biology, microbiology, paleontology, psychiatry, clinic, pathology), earth sciences (geography, geology, remote sensing), social sciences (sociology, psychology, archeology, education), and economics (marketing, business) [88], [127]. Accordingly, clustering is also known as numerical taxonomy, learning without a teacher (or unsupervised learning), typological analysis, and partition. The diversity reflects the important position of clustering in scientific research. On the other hand, it causes confusion, due to the differing terminologies and goals. Clustering algorithms developed to solve a particular problem, in a specialized field, usually make assumptions in favor of the application of interest. These biases inevitably affect performance in other problems that do not satisfy these premises. For example, the K-means algorithm is based on the Euclidean measure and, hence, tends to generate hyperspherical clusters. But if the real clusters are in other geometric forms, K-means may no longer be effective, and we need to resort to other schemes. This situation also holds true for mixture-model clustering, in which a model is fit to data in advance.

Clustering has a long history, with lineage dating back to Aristotle [124]. General references on clustering techniques include [14], [75], [77], [88], [111], [127], [150], [161], and [259]. Important survey papers on clustering techniques also exist in the literature. Starting from a statistical pattern recognition viewpoint, Jain, Murty, and Flynn reviewed the clustering algorithms and other important issues related to cluster analysis [152], while Hansen and Jaumard described the clustering problems under a mathematical programming scheme [124]. Kolatch and He investigated applications of clustering algorithms for spatial database systems [171] and information retrieval [133], respectively. Berkhin further expanded the topic to the whole field of data mining [33]. Murtagh reported the advances in hierarchical clustering algorithms [210], and Baraldi surveyed several models for fuzzy and neural network clustering [24]. Some more survey papers can also be found in [25], [40], [74], [89], and [151]. In addition to the review papers, comparative research on clustering algorithms is also significant. Rauber, Paralic, and Pampalk presented empirical results for five typical clustering algorithms [231]. Wei, Lee, and Hsu placed the emphasis on the comparison of fast algorithms for large databases [280]. Scheunders compared several clustering techniques for color image quantization, with emphasis on computational time and the possibility of obtaining global optima [239]. Applications and evaluations of different clustering algorithms for the analysis of gene expression data from DNA microarray experiments were described in [153], [192], [246], and [271]. Experimental evaluation of document clustering techniques, based on hierarchical and K-means clustering algorithms, was summarized by Steinbach, Karypis, and Kumar [261].

In contrast to the above, the purpose of this paper is to provide a comprehensive and systematic description of the influential and important clustering algorithms rooted in statistics, computer science, and machine learning, with emphasis on new advances in recent years.

The remainder of the paper is organized as follows. In Section II, we review clustering algorithms, based on the natures of the generated clusters and the techniques and theories behind them. Furthermore, we discuss approaches for clustering sequential data, large data sets, data visualization, and high-dimensional data through dimension reduction. Two important issues on cluster analysis, including proximity measure and how to choose the number of clusters, are also summarized in the section. This is the longest section of the paper, so, for convenience, we give an outline of Section II in bullet form here:

II. Clustering Algorithms
• A. Distance and Similarity Measures (see also Table I)
• B. Hierarchical
— Agglomerative: single linkage, complete linkage, group average linkage, median linkage, centroid linkage, Ward's method, balanced iterative reducing and clustering using hierarchies (BIRCH), clustering using representatives (CURE), robust clustering using links (ROCK)
— Divisive: divisive analysis (DIANA), monothetic analysis (MONA)
• C. Squared Error-Based (Vector Quantization)
— K-means, iterative self-organizing data analysis technique (ISODATA), genetic K-means algorithm (GKA), partitioning around medoids (PAM)
• D. pdf Estimation via Mixture Densities
— Gaussian mixture density decomposition (GMDD), AutoClass
• E. Graph Theory-Based
— Chameleon, Delaunay triangulation graph (DTG), highly connected subgraphs (HCS), clustering identification via connectivity kernels (CLICK), cluster affinity search technique (CAST)
• F. Combinatorial Search Techniques-Based
— Genetically guided algorithm (GGA), TS clustering, SA clustering
• G. Fuzzy
— Fuzzy c-means (FCM), mountain method (MM), possibilistic c-means clustering algorithm (PCM), fuzzy c-shells (FCS)
• H. Neural Networks-Based
— Learning vector quantization (LVQ), self-organizing feature map (SOFM), ART, simplified ART (SART), hyperellipsoidal clustering network (HEC), self-splitting competitive learning network (SPLL)
• I. Kernel-Based
— Kernel K-means, support vector clustering (SVC)
• J. Sequential Data
— Sequence similarity
— Indirect sequence clustering
— Statistical sequence clustering
• K. Large-Scale Data Sets (see also Table II)
— CLARA, CURE, CLARANS, BIRCH, DBSCAN, DENCLUE, WaveCluster, FC, ART
• L. Data Visualization and High-Dimensional Data
— PCA, ICA, projection pursuit, Isomap, LLE, CLIQUE, OptiGrid, ORCLUS
• M. How Many Clusters?

TABLE II. Computational Complexity of Clustering Algorithms.

Applications in two benchmark data sets, the traveling salesman problem, and bioinformatics are illustrated in Section III. We conclude the paper in Section IV.
II. CLUSTERING ALGORITHMS

Different starting points and criteria usually lead to different taxonomies of clustering algorithms [33], [88], [124], [150], [152], [171]. A rough but widely agreed frame is to classify clustering techniques as hierarchical clustering and partitional clustering, based on the properties of the clusters generated [88], [152]. Hierarchical clustering groups data objects with a sequence of partitions, either from singleton clusters to a cluster including all individuals or vice versa, while partitional clustering directly divides data objects into some prespecified number of clusters without the hierarchical structure. We follow this frame in surveying the clustering algorithms in the literature. Beginning with the discussion on proximity measure, which is the basis for most clustering algorithms, we focus on hierarchical clustering and classical partitional clustering algorithms in Sections II-B-D. Starting from part E, we introduce and analyze clustering algorithms based on a wide variety of theories and techniques, including graph theory, combinatorial search techniques, fuzzy set theory, neural networks, and kernel techniques. Compared with graph theory and fuzzy set theory, which had already been widely used in cluster analysis before the 1980s, the other techniques have been finding their applications in clustering just in recent decades. In spite of the short history, much progress has been achieved. Note that these techniques can be used for both hierarchical and partitional clustering. Considering the more frequent requirement of tackling sequential data sets, large-scale data sets, and high-dimensional data sets in many current applications, we review clustering algorithms for them in the following three parts. We focus particular attention on clustering algorithms applied in bioinformatics. We offer more detailed discussion on how to identify the appropriate number of clusters, which is particularly important in cluster validity, in the last part of the section.
A. Distance and Similarity Measures

It is natural to ask what kind of standards we should use to determine the closeness, or how to measure the distance (dissimilarity) or similarity between a pair of objects, an object and a cluster, or a pair of clusters. In the next section on hierarchical clustering, we will illustrate linkage metrics for measuring proximity between clusters. Usually, a prototype is used to represent a cluster so that it can be further processed like other objects. Here, due to the previous consideration, we focus on reviewing measure approaches between individuals.

A data object is described by a set of features, usually represented as a multidimensional vector. The features can be quantitative or qualitative, continuous or binary, nominal or ordinal, which determine the corresponding measure mechanisms.

A distance or dissimilarity function on a data set is defined to satisfy the following conditions:
1) Symmetry: $D(\mathbf{x}_i, \mathbf{x}_j) = D(\mathbf{x}_j, \mathbf{x}_i)$;
2) Positivity: $D(\mathbf{x}_i, \mathbf{x}_j) \ge 0$ for all $\mathbf{x}_i$ and $\mathbf{x}_j$.
If the conditions
3) Triangle inequality: $D(\mathbf{x}_i, \mathbf{x}_j) \le D(\mathbf{x}_i, \mathbf{x}_k) + D(\mathbf{x}_k, \mathbf{x}_j)$ for all $\mathbf{x}_i$, $\mathbf{x}_j$, and $\mathbf{x}_k$; and
4) Reflexivity: $D(\mathbf{x}_i, \mathbf{x}_j) = 0$ iff $\mathbf{x}_i = \mathbf{x}_j$
also hold, it is called a metric.

Likewise, a similarity function is defined to satisfy the following conditions:
1) Symmetry: $S(\mathbf{x}_i, \mathbf{x}_j) = S(\mathbf{x}_j, \mathbf{x}_i)$;
2) Positivity: $0 \le S(\mathbf{x}_i, \mathbf{x}_j) \le 1$ for all $\mathbf{x}_i$ and $\mathbf{x}_j$.
If it also satisfies the conditions
3) $S(\mathbf{x}_i, \mathbf{x}_j) S(\mathbf{x}_j, \mathbf{x}_k) \le [S(\mathbf{x}_i, \mathbf{x}_j) + S(\mathbf{x}_j, \mathbf{x}_k)] S(\mathbf{x}_i, \mathbf{x}_k)$ for all $\mathbf{x}_i$, $\mathbf{x}_j$, and $\mathbf{x}_k$; and
4) $S(\mathbf{x}_i, \mathbf{x}_j) = 1$ iff $\mathbf{x}_i = \mathbf{x}_j$
it is called a similarity metric.

For a data set with $N$ input patterns, we can define an $N \times N$ symmetric matrix, called the proximity matrix, whose $(i, j)$th element represents the similarity or dissimilarity measure for the $i$th and $j$th patterns ($i, j = 1, \dots, N$).
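As an illustration of these definitions (our sketch, not code from the paper; the function names are ours), the following fragment computes the Minkowski distance, whose $p = 2$ case is the Euclidean metric of Table I, and assembles the symmetric proximity matrix:

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=2 gives the Euclidean metric."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def proximity_matrix(X, dist=minkowski):
    """N x N symmetric dissimilarity matrix; entry (i, j) holds
    dist(x_i, x_j), so the diagonal is zero (reflexivity)."""
    N = len(X)
    P = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            P[i, j] = P[j, i] = dist(X[i], X[j])  # symmetry
    return P

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(proximity_matrix(X))  # pairwise Euclidean distances 5, 10, and 5
```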
Typically, distance functions are used to measure continuous features, while similarity measures are more important for qualitative variables. We summarize some typical measures for continuous features in Table I. The selection of different measures is problem dependent.

TABLE I. Similarity and Dissimilarity Measures for Quantitative Features.

For binary features, a similarity measure is commonly used (dissimilarity measures can be obtained by simply using $D = 1 - S$). Suppose we use two binary subscripts to count features in two objects: $n_{11}$ and $n_{00}$ represent the number of simultaneous presences or absences of features in the two objects, and $n_{10}$ and $n_{01}$ count the features present in only one object. Then two types of commonly used similarity measures for data points $\mathbf{x}_i$ and $\mathbf{x}_j$ are illustrated in the following.

• Simple matching coefficient: $S = \dfrac{n_{11} + n_{00}}{n_{11} + n_{00} + n_{01} + n_{10}}$;
Rogers and Tanimoto measure: $S = \dfrac{n_{11} + n_{00}}{n_{11} + n_{00} + 2(n_{01} + n_{10})}$;
Gower and Legendre measure: $S = \dfrac{n_{11} + n_{00}}{n_{11} + n_{00} + (n_{01} + n_{10})/2}$.
These measures compute the match between the two objects directly. Unmatched pairs are weighted based on their contribution to the similarity.

• Jaccard coefficient: $S = \dfrac{n_{11}}{n_{11} + n_{01} + n_{10}}$;
Sokal and Sneath measure: $S = \dfrac{n_{11}}{n_{11} + 2(n_{01} + n_{10})}$;
Gower and Legendre measure: $S = \dfrac{n_{11}}{n_{11} + (n_{01} + n_{10})/2}$.
These measures focus on the co-occurrence of features while ignoring the effect of co-absence.
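The binary coefficients above follow directly from the four counts; the sketch below (ours, with illustrative function names) implements three of them:

```python
import numpy as np

def binary_counts(x, y):
    """Count the four match/mismatch cases for two 0/1 feature vectors."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    n11 = np.sum(x & y)      # feature present in both objects
    n00 = np.sum(~x & ~y)    # feature absent in both objects
    n10 = np.sum(x & ~y)     # present only in the first object
    n01 = np.sum(~x & y)     # present only in the second object
    return n11, n00, n10, n01

def simple_matching(x, y):
    n11, n00, n10, n01 = binary_counts(x, y)
    return (n11 + n00) / (n11 + n00 + n01 + n10)

def rogers_tanimoto(x, y):
    n11, n00, n10, n01 = binary_counts(x, y)
    return (n11 + n00) / (n11 + n00 + 2 * (n01 + n10))

def jaccard(x, y):
    n11, n00, n10, n01 = binary_counts(x, y)
    return n11 / (n11 + n01 + n10)  # co-absences (n00) deliberately ignored

x, y = [1, 0, 1, 1, 0], [1, 1, 1, 0, 0]
print(simple_matching(x, y), rogers_tanimoto(x, y), jaccard(x, y))
# 0.6, 0.42857..., 0.5; dissimilarities follow as D = 1 - S
```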
For nominal features that have more than two states, a simple strategy is to map them into new binary features [161], while a more effective method utilizes the matching criterion
$$S(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{d} \sum_{l=1}^{d} S_{ijl}, \quad \text{where } S_{ijl} = \begin{cases} 0 & \text{if } x_{il} \text{ and } x_{jl} \text{ do not match} \\ 1 & \text{if } x_{il} \text{ and } x_{jl} \text{ match} \end{cases}$$
[88]. Ordinal features order multiple states according to some standard and can be compared by using the continuous dissimilarity measures discussed in [161]. Edit distance for alphabetic sequences is discussed in Section II-J. More discussion on sequence and string comparisons can be found in [120] and [236].

Generally, for objects consisting of mixed variables, we can map all these variables into the interval (0, 1) and use measures like the Euclidean metric. Alternatively, we can transform them into binary variables and use binary similarity functions. The drawback of these methods is the information loss. A more powerful method was described by Gower in the form of
$$S(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{l=1}^{d} w_{ijl} S_{ijl}}{\sum_{l=1}^{d} w_{ijl}}$$
where $S_{ijl}$ indicates the similarity for the $l$th feature and $w_{ijl}$ is a 0-1 coefficient based on whether the measure of the two objects is missing [88], [112].

B. Hierarchical Clustering

Hierarchical clustering (HC) algorithms organize data into a hierarchical structure according to the proximity matrix. The results of HC are usually depicted by a binary tree or dendrogram. The root node of the dendrogram represents the whole data set, and each leaf node is regarded as a data object. The intermediate nodes thus describe the extent to which the objects are proximal to each other; and the height of the dendrogram usually expresses the distance between each pair of objects or clusters, or an object and a cluster. The ultimate clustering results can be obtained by cutting the dendrogram at different levels. This representation provides very informative descriptions and visualization of the potential data clustering structures, especially when real hierarchical relations exist in the data, like the data from evolutionary research on different species of organisms.

HC algorithms are mainly classified as agglomerative methods and divisive methods. Agglomerative clustering starts with $N$ clusters, each of which includes exactly one object. A series of merge operations then follows that finally leads all objects to the same group. Divisive clustering proceeds in the opposite way: in the beginning, the entire data set belongs to one cluster, and a procedure successively divides it until all clusters are singleton clusters. For a cluster with $N$ objects, there are $2^{N-1} - 1$ possible two-subset divisions, which is very expensive in computation [88]. Therefore, divisive clustering is not commonly used in practice. We focus on agglomerative clustering in the following discussion; some divisive clustering applications for binary data can be found in [88]. Two divisive clustering algorithms, named MONA and DIANA, are described in [161].

General agglomerative clustering can be summarized by the following procedure.
1) Start with $N$ singleton clusters. Calculate the proximity matrix for the $N$ clusters.
2) Search the minimal distance $D(C_i, C_j) = \min_{1 \le m, l \le N,\, m \ne l} D(C_m, C_l)$, where $D(\cdot, \cdot)$ is the distance function discussed before, in the proximity matrix, and combine clusters $C_i$ and $C_j$ to form a new cluster.
3) Update the proximity matrix by computing the distances between the new cluster and the other clusters.
4) Repeat steps 2)-3) until all objects are in the same cluster.
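A naive $O(N^3)$ rendering of this procedure (our sketch, not the paper's code) works directly on a precomputed proximity matrix; its linkage parameter anticipates the single and complete linkage rules discussed next:

```python
import numpy as np

def agglomerate(P, linkage=min):
    """Naive agglomerative clustering on an N x N dissimilarity
    matrix P. linkage=min gives single linkage, linkage=max complete
    linkage. Returns the merge history; ids >= N denote clusters
    produced by earlier merges."""
    clusters = {i: [i] for i in range(len(P))}   # step 1: N singletons
    D = {(i, j): P[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    while len(clusters) > 1:
        i, j = min(D, key=D.get)                 # step 2: closest pair
        new = max(clusters) + 1
        clusters[new] = clusters.pop(i) + clusters.pop(j)
        for k in list(clusters):                 # step 3: update distances
            if k != new:
                a = D.pop((min(i, k), max(i, k)))
                b = D.pop((min(j, k), max(j, k)))
                D[(k, new)] = linkage(a, b)
        del D[(i, j)]
        merges.append((i, j))
    return merges                                # step 4: one cluster left

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
P = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
print(agglomerate(P))  # [(0, 1), (2, 3), (4, 5)]: tight pairs merge first
```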
Based on the different definitions for the distance between two clusters, there are many agglomerative clustering algorithms. The simplest and most popular methods include the single linkage [256] and complete linkage techniques [258]. For the single linkage method, the distance between two clusters is determined by the two closest objects in different clusters, so it is also called the nearest neighbor method. On the contrary, the complete linkage method uses the farthest distance of a pair of objects to define the inter-cluster distance. Both the single linkage and the complete linkage methods can be generalized by the recurrence formula proposed by Lance and Williams [178] as
$$D(C_l, (C_i, C_j)) = \alpha_i D(C_l, C_i) + \alpha_j D(C_l, C_j) + \beta D(C_i, C_j) + \gamma |D(C_l, C_i) - D(C_l, C_j)|$$
where $D(\cdot, \cdot)$ is the distance function and $\alpha_i$, $\alpha_j$, $\beta$, and $\gamma$ are coefficients that take values dependent on the scheme used. The formula describes the distance between a cluster $C_l$ and a new cluster formed by the merge of two clusters $C_i$ and $C_j$. Note that when $\alpha_i = \alpha_j = 1/2$, $\beta = 0$, and $\gamma = -1/2$, the formula becomes
$$D(C_l, (C_i, C_j)) = \min(D(C_l, C_i), D(C_l, C_j))$$
which corresponds to the single linkage method. When $\alpha_i = \alpha_j = 1/2$, $\beta = 0$, and $\gamma = 1/2$, the formula is
$$D(C_l, (C_i, C_j)) = \max(D(C_l, C_i), D(C_l, C_j))$$
which corresponds to the complete linkage method.
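The recurrence is a one-line function, and the coefficient settings above recover the two classical rules; a minimal sketch (ours) for checking this numerically:

```python
def lance_williams(d_li, d_lj, d_ij, a_i, a_j, beta, gamma):
    """Lance-Williams recurrence: distance from cluster C_l to the
    cluster formed by merging C_i and C_j."""
    return a_i * d_li + a_j * d_lj + beta * d_ij + gamma * abs(d_li - d_lj)

# Coefficient choices recovering the classical linkages:
single   = dict(a_i=0.5, a_j=0.5, beta=0.0, gamma=-0.5)  # -> min(d_li, d_lj)
complete = dict(a_i=0.5, a_j=0.5, beta=0.0, gamma=+0.5)  # -> max(d_li, d_lj)

d_li, d_lj, d_ij = 2.0, 6.0, 3.0
print(lance_williams(d_li, d_lj, d_ij, **single))    # 2.0 == min(2, 6)
print(lance_williams(d_li, d_lj, d_ij, **complete))  # 6.0 == max(2, 6)
```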
Several more complicated agglomerative clustering algorithms, including group average linkage, median linkage, centroid linkage, and Ward's method, can also be constructed by selecting appropriate coefficients in the formula. A detailed table describing the coefficient values for different algorithms is offered in [150] and [210]. Single linkage, complete linkage, and average linkage consider all points of a pair of clusters when calculating their inter-cluster distance, and are also called graph methods. The others are called geometric methods, since they use geometric centers to represent clusters and determine their distances. Remarks on important features and properties of these methods are summarized in [88]. More inter-cluster distance measures, especially the mean-based ones, were introduced by Yager, with further discussion of their possible effect on controlling the hierarchical clustering process [289].

The common criticism of classical HC algorithms is that they lack robustness and are, hence, sensitive to noise and outliers. Once an object is assigned to a cluster, it will not be considered again, which means that HC algorithms are not capable of correcting possible previous misclassifications. The computational complexity of most HC algorithms is at least $O(N^2)$, and this high cost limits their application in large-scale data sets. Other disadvantages of HC include the tendency to form spherical shapes and the reversal phenomenon, in which the normal hierarchical structure is distorted.

In recent years, with the requirement for handling large-scale data sets in data mining and other fields, many new HC techniques have appeared and greatly improved the clustering performance. Typical examples include CURE [116], ROCK [117], Chameleon [159], and BIRCH [295].

The main motivations of BIRCH lie in two aspects: the ability to deal with large data sets and robustness to outliers [295]. In order to achieve these goals, a new data structure, the clustering feature (CF) tree, is designed to store the summaries of the original data. The CF tree is a height-balanced tree, with each internal vertex composed of entries defined as $[\mathrm{CF}_i, \mathrm{child}_i]$, $i = 1, \dots, B$, where $\mathrm{CF}_i$ is a representation of the cluster, defined as $\mathrm{CF}_i = (N_i, \mathbf{LS}_i, SS_i)$, where $N_i$ is the number of data objects in the cluster, $\mathbf{LS}_i$ is the linear sum of the objects, and $SS_i$ is the squared sum of the objects; $\mathrm{child}_i$ is a pointer to the $i$th child node; and $B$ is a threshold parameter that determines the maximum number of entries in the vertex. Each leaf is composed of entries in the form of $[\mathrm{CF}_i]$, $i = 1, \dots, L$, where $L$ is the threshold parameter that controls the maximum number of entries in the leaf. Moreover, the leaves must follow the restriction that the diameter of each entry in the leaf is less than a threshold $T$. The CF tree structure captures the important clustering information of the original data while reducing the required storage. Outliers are eliminated from the summaries by identifying the objects sparsely distributed in the feature space. After the CF tree is built, an agglomerative HC is applied to the set of summaries to perform global clustering. An additional step may be performed to refine the clusters. BIRCH can achieve a computational complexity of $O(N)$.
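A minimal sketch of the CF summary and its additivity (ours; BIRCH itself adds the tree machinery, the thresholds $B$, $L$, and $T$, and the outlier handling):

```python
import numpy as np

class ClusteringFeature:
    """BIRCH-style CF summary (N, LS, SS): object count, linear sum,
    and squared sum. CFs are additive, so two clusters can be merged
    without revisiting the original data points."""
    def __init__(self, points):
        pts = np.atleast_2d(np.asarray(points, dtype=float))
        self.N = len(pts)
        self.LS = pts.sum(axis=0)          # linear sum, a d-vector
        self.SS = float((pts ** 2).sum())  # squared sum, a scalar

    def merge(self, other):
        # Build an empty CF of the right width, then add componentwise.
        out = ClusteringFeature(np.empty((0, len(self.LS))))
        out.N = self.N + other.N
        out.LS = self.LS + other.LS
        out.SS = self.SS + other.SS
        return out

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # Average distance of members to the centroid, from (N, LS, SS) alone.
        return np.sqrt(max(self.SS / self.N - (self.centroid() ** 2).sum(), 0.0))

a = ClusteringFeature([[0, 0], [2, 0]])
b = ClusteringFeature([[4, 0]])
m = a.merge(b)
print(m.N, m.centroid(), round(m.radius(), 3))  # 3 [2. 0.] 1.633
```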
Noticing the restriction of centroid-based HC, which is unable to identify arbitrary cluster shapes, Guha, Rastogi, and Shim developed a HC algorithm, called CURE, to explore more sophisticated cluster shapes [116]. The crucial feature of CURE lies in the usage of a set of well-scattered points to represent each cluster, which makes it possible to find rich cluster shapes other than hyperspheres, and avoids both the chaining effect [88] of the minimum linkage method and the tendency to favor clusters with similar sizes of centroid. These representative points are further shrunk toward the cluster centroid according to an adjustable parameter, in order to weaken the effects of outliers. CURE utilizes a random sample (and partition) strategy to reduce the computational complexity. Guha et al. also proposed another agglomerative HC algorithm, ROCK, to group data with qualitative attributes [117]. They used a novel measure, "link," to describe the relation between a pair of objects and their common neighbors. Like CURE, a random sample strategy is used to handle large data sets. Chameleon is constructed from graph theory and will be discussed in Section II-E.

Relative hierarchical clustering (RHC) is another exploration that considers both the internal distance (distance between a pair of clusters which may be merged to yield a new cluster) and the external distance (distance from the two clusters to the rest), and uses the ratio of them to decide the proximities [203]. Leung et al. showed an interesting hierarchical clustering based on scale-space theory [180]. They interpreted clustering using a blurring process, in which each datum is regarded as a light point in an image, and a cluster is represented as a blob. Li and Biswas extended agglomerative HC to deal with both numeric and nominal data. The proposed algorithm, called similarity-based agglomerative clustering (SBAC), employs a mixed data measure scheme that pays extra attention to less common matches of feature values [183]. Parallel techniques for HC are discussed in [69] and [217], respectively.
C. Squared Error-Based Clustering (Vector Quantization)

In contrast to hierarchical clustering, which yields a successive level of clusters by iterative fusions or divisions, partitional clustering assigns a set of objects into $K$ clusters with no hierarchical structure. In principle, the optimal partition, based on some specific criterion, can be found by enumerating all possibilities. But this brute force method is infeasible in practice, due to the expensive computation [189]. Even for a small-scale clustering problem (organizing 30 objects into 3 groups), the number of possible partitions is $2 \times 10^{14}$. Therefore, heuristic algorithms have been developed in order to seek approximate solutions.

One of the important factors in partitional clustering is the criterion function [124]. The sum of squared error function is one of the most widely used criteria. Suppose we have a set of objects $\mathbf{x}_j$, $j = 1, \dots, N$, and we want to organize them into $K$ subsets $C = \{C_1, \dots, C_K\}$. The squared error criterion is then defined as
$$J(\Gamma, M) = \sum_{i=1}^{K} \sum_{j=1}^{N} \gamma_{ij} \|\mathbf{x}_j - \mathbf{m}_i\|^2$$
where
$\Gamma = [\gamma_{ij}]$ is a partition matrix, with $\gamma_{ij} = 1$ if $\mathbf{x}_j$ belongs to cluster $i$ and $\gamma_{ij} = 0$ otherwise, subject to $\sum_{i=1}^{K} \gamma_{ij} = 1$ for all $j$;
$M = [\mathbf{m}_1, \dots, \mathbf{m}_K]$ is the cluster prototype, or centroid (means), matrix;
$\mathbf{m}_i = \frac{1}{N_i} \sum_{j=1}^{N} \gamma_{ij} \mathbf{x}_j$ is the sample mean for the $i$th cluster;
$N_i = \sum_{j=1}^{N} \gamma_{ij}$ is the number of objects in the $i$th cluster.

Note the relation between the sum of squared error criterion and the scatter matrices defined in multiclass discriminant analysis [75],
$$S_T = S_W + S_B$$
where
$S_T = \sum_{j=1}^{N} (\mathbf{x}_j - \mathbf{m})(\mathbf{x}_j - \mathbf{m})^T$ is the total scatter matrix;
$S_W = \sum_{i=1}^{K} \sum_{j=1}^{N} \gamma_{ij} (\mathbf{x}_j - \mathbf{m}_i)(\mathbf{x}_j - \mathbf{m}_i)^T$ is the within-class scatter matrix;
$S_B = \sum_{i=1}^{K} N_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$ is the between-class scatter matrix;
and $\mathbf{m} = \frac{1}{N} \sum_{j=1}^{N} \mathbf{x}_j$ is the mean vector for the whole data set.

It is not difficult to see that the criterion based on the trace of $S_W$ is the same as the sum of squared error criterion. Minimizing the squared error criterion is equivalent to minimizing the trace of $S_W$ or maximizing the trace of $S_B$. We can obtain a rich class of criterion functions based on the characteristics of $S_W$ and $S_B$ [75].
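The equivalence is easy to verify numerically; in this sketch (ours), the sum of squared errors computed directly matches the trace of $S_W$:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors J = sum_i sum_{x in C_i} ||x - m_i||^2."""
    return sum(np.sum((X[labels == i] - m) ** 2)
               for i, m in enumerate(centroids))

def within_scatter_trace(X, labels, centroids):
    """trace(S_W), which should equal the SSE."""
    d = X.shape[1]
    S_W = np.zeros((d, d))
    for i, m in enumerate(centroids):
        diff = X[labels == i] - m
        S_W += diff.T @ diff
    return np.trace(S_W)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
labels = np.array([0] * 5 + [1] * 5)
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])
print(np.isclose(sse(X, labels, centroids),
                 within_scatter_trace(X, labels, centroids)))  # True
```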
The K-means algorithm is the best-known squared error-based clustering algorithm [94], [191].
1) Initialize a $K$-partition randomly or based on some prior knowledge. Calculate the cluster prototype matrix $M = [\mathbf{m}_1, \dots, \mathbf{m}_K]$.
2) Assign each object in the data set to the nearest cluster $C_w$, i.e., $\mathbf{x}_j \in C_w$ if $\|\mathbf{x}_j - \mathbf{m}_w\| \le \|\mathbf{x}_j - \mathbf{m}_i\|$ for $j = 1, \dots, N$, $i \ne w$, $i = 1, \dots, K$.
3) Recalculate the cluster prototype matrix based on the current partition.
4) Repeat steps 2)-3) until there is no change for each cluster.

The K-means algorithm is very simple and can be easily implemented in solving many practical problems. It can work very well for compact and hyperspherical clusters. The time complexity of K-means is $O(NKd)$. Since $K$ and $d$ are usually much less than $N$, K-means can be used to cluster large data sets. Parallel techniques for K-means have been developed that can largely accelerate the algorithm [262].
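A compact implementation of steps 1)-4) (our sketch, not the authors' code; initialization here is a random selection of $K$ objects as prototypes):

```python
import numpy as np

def kmeans(X, K, rng=np.random.default_rng(0), max_iter=100):
    """Basic K-means: alternate nearest-centroid assignment (step 2)
    and centroid recomputation (step 3) until assignments stabilize."""
    M = X[rng.choice(len(X), size=K, replace=False)]  # step 1: random init
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest prototype.
        dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # step 4: converged
            break
        labels = new_labels
        # Step 3: recompute each nonempty cluster's mean.
        for i in range(K):
            if np.any(labels == i):
                M[i] = X[labels == i].mean(axis=0)
    return labels, M

X = np.vstack([np.random.default_rng(1).normal(c, 0.3, size=(20, 2))
               for c in ((0, 0), (4, 4))])
labels, M = kmeans(X, K=2)
print(M.round(1))  # two centroids, near (0, 0) and (4, 4)
```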
The drawbacks of K-means are also well studied, and as a result, many variants of K-means have appeared in order to overcome these obstacles. We summarize some of the major disadvantages, with the proposed improvements, in the following.

1) There is no efficient and universal method for identifying the initial partitions and the number of clusters $K$. The convergence centroids vary with different initial points. A general strategy for the problem is to run the algorithm many times with random initial partitions. Peña, Lozano, and Larrañaga compared the random method with three other classical initial partition methods, by Forgy [94], Kaufman [161], and MacQueen [191], based on the effectiveness, robustness, and convergence speed criteria [227]. According to their experimental results, the random and Kaufman's methods work much better than the other two under the first two criteria; further considering the convergence speed, they recommended Kaufman's method. Bradley and Fayyad presented a refinement algorithm that first utilizes K-means $M$ times on $M$ random subsets from the original data [43]. The set formed from the union of the solutions (centroids of the clusters) of the subsets is clustered $M$ times again, setting each subset solution as the initial guess. The starting points for the whole data are obtained by choosing the solution with the minimal sum of squared distances. Likas, Vlassis, and Verbeek proposed a global K-means algorithm consisting of a series of K-means clustering procedures with the number of clusters varying from 1 to $K$ [186]. After finding the centroid for the case in which only one cluster exists, at each $k$, $k = 2, \dots, K$, the previous $k - 1$ centroids are fixed and the new centroid is selected by examining all data points. The authors claimed that the algorithm is independent of the initial partitions and provided accelerating strategies. But a problem of computational complexity exists, due to the requirement of executing K-means $N$ times for each value of $k$. An interesting technique, called ISODATA, developed by Ball and Hall [21], deals with the estimation of $K$. ISODATA can dynamically adjust the number of clusters by merging and splitting clusters according to some predefined thresholds (in this sense, the problem of identifying the initial number of clusters becomes that of parameter (threshold) tweaking). The new $K$ is used as the expected number of clusters for the next iteration.

2) The iteratively optimal procedure of K-means cannot guarantee convergence to a global optimum. Stochastic optimization techniques, like simulated annealing (SA) and genetic algorithms (see also Section II-F), can find the global optimum at the price of expensive computation. Krishna and Murty designed new operators in their hybrid scheme, GKA, in order to achieve global search and fast convergence [173]. The defined biased mutation operator is based on the Euclidean distance between an object and the centroids, and aims to avoid getting stuck in local optima. Another operator, the K-means operator (KMO), replaces the computationally expensive crossover operators and alleviates the complexities coming with them. An adaptive learning rate strategy for the online mode K-means is illustrated in [63]. The learning rate is exclusively dependent on the within-group variations and can be adjusted without involving any user activities. The proposed enhanced LBG (ELBG) algorithm adopts a roulette mechanism, typical of genetic algorithms, to become near-optimal and, therefore, is not sensitive to initialization [222].

3) K-means is sensitive to outliers and noise. Even if an object is quite far away from the cluster centroid, it is still forced into a cluster and, thus, distorts the cluster shapes. ISODATA [21] and PAM [161] both consider the effect of outliers in clustering procedures. ISODATA gets rid of clusters with few objects. The splitting operation of ISODATA eliminates the possibility of elongated clusters typical of K-means. PAM utilizes real data points (medoids) as the cluster prototypes and avoids the effect of outliers. Based on the same consideration, a K-medoids algorithm is presented in [87].