2011 IEEE International Conference on Robotics and Automation
Shanghai International Conference Center
May 9-13, 2011, Shanghai, China
978-1-61284-385-8/11/$26.00 ©2011 IEEE
Real-Time Human Detection Using Contour Cues

Jianxin Wu, Christopher Geyer, James M. Rehg

Abstract: A real-time and accurate human detector, C4, is proposed in this paper. C4 achieves 20 fps speed and state-of-the-art detection accuracy, using only one processing thread without resorting to special hardware such as a GPU. Real-time accurate human detection is made possible by two contributions. First, we show that contour is exactly what we should capture, and that the signs of comparisons among neighboring pixels are the key information for capturing contours. Second, we show that the CENTRIST visual descriptor is particularly suitable for human detection, because it encodes the sign information and can implicitly represent the global contour. When CENTRIST and a linear classifier are used, we propose a computational method that does not need to explicitly generate feature vectors. It involves no image pre-processing or feature vector normalization, and requires only O(1) steps to test an image patch. C4 is also friendly to further hardware acceleration. On a robot with an embedded 1.2 GHz CPU, we also achieved accurate human detection at 20 fps.

I. INTRODUCTION

Human detection in video is important in a wide range of applications that intersect with many aspects of our lives: surveillance systems and airport security, automatic driving and driver assistance systems in high-end cars, human-robot interaction and immersive, interactive entertainment, smart homes and assistance for senior citizens who live alone, and people-finding for military applications. The wide range of applications and the underlying intellectual challenges of human detection have attracted many researchers.

The goal of this paper is to detect humans in real time, with a high detection rate and few false positives. In particular, for human detection on board a robot, the computational efficiency of the detector is of paramount importance. Not only must human detection run at video rates, but it also can use only a small number of CPU cores (or a small percentage of CPU cycles) so that other important tasks such as path planning and navigation will not be hindered.

Recent progress in human detection has advanced the frontiers of this problem in many aspects, e.g., features, classifiers, testing speed, and occlusion handling [1], [2], [3], [4], [5], [6], [7],
[8], [9], [10], [11]. However, at least two important questions still remain open:

• Real-time detection. Speed is very important, because real-time detection is a prerequisite in most real-world applications [12], and on a robot in particular.
• Identifying the most important information source. Features like HOG [1] and LBP [8] have been successful in practice. But we do not yet know clearly what critical information is encoded in these features, or why they achieve high pedestrian detection performance in practice.

J. Wu is with the School of Computer Engineering, Nanyang Technological University, Singapore, jxwu@ntu.edu.sg. C. Geyer is with the iRobot Corporation, Bedford, MA 01730, USA, cgeyer@irobot.com. J. Rehg is with the Center for Behavior Imaging and the School of Interactive Computing at the Georgia Institute of Technology, Atlanta, GA 30332, USA, rehg@cc.gatech.edu.

In this paper we argue that these two problems are closely related, and we demonstrate that an appropriate feature choice can lead to an efficient detection architecture. In fact, feature computation is the major speed bottleneck in existing methods. Current methods can only run at about 10 fps (frames per second) [9], even when utilizing the 100+ parallel processing threads of a GPU. Most of this time is spent in computing the features (including image pre-processing, feature construction, and feature vector normalization).

This paper makes two contributions. First, we show that the contour defining the outline of the figure is the essential information in human detection, through a series of carefully designed experiments in Sec. III-A. We find that the signs of comparisons among neighboring pixels are critical to representing a contour, while the magnitudes of such comparisons are not as important. Second, we propose to detect humans using the contour cues, and show that the recently developed CENTRIST [13] feature is suitable for this purpose (Sec. III-B). In particular, it encodes the signs of local comparisons, and has the key capability to capture global (or large-scale) structures and contours. We also compare CENTRIST with other features in Sec. III-C.

CENTRIST is very appealing in terms of speed. In Sec. IV we describe a method for feature evaluation that does not involve image pre-processing or feature vector normalization. In fact, we show that it is not even necessary to explicitly compute the CENTRIST feature vector, because it is seamlessly embedded into the classifier evaluation, achieving video-rate detection speed. We use a cascade classifier, and call the proposed method C4, since we are detecting humans emphasizing the human contour using a cascade classifier and the CENTRIST visual descriptor. C4 produces an accurate detector that runs in real time using only a single CPU core (or thread), without involving a GPU or other special hardware.

We present detection results in Sec. V, in two forms of experimental evaluation. First, we present results on a standard benchmark human detection dataset. Second, we present the results of on-line, real-time testing of C4 on an iRobot PackBot, operating autonomously and untethered. Specifically, we demonstrate pedestrian following based on real-time on-board pedestrian detection and ground plane estimation. We will make our
detection system available to other researchers to facilitate progress on this topic.

II. RELATED WORK

Accurate detection is still a major interest in human detection, especially in terms of a high detection rate with low FPPI (false positives per image) [2]. Achievements have been made in two main directions: features and classifiers.

Various features have been applied to detect pedestrians, e.g., Haar features [7] and edgelets [10]. However, HOG is probably the most popular feature in human detection [1], [3], [4], [6], [8]. The distribution of edge strength in various directions seems to efficiently capture humans in images. Recently, variants of Local Binary Pattern (LBP) have also shown high potential [5], [8]. A recent trend in human detection is to combine multiple information sources, e.g., color, local texture, edge, motion, etc. [14], [6], [8], [15]. Introducing more information channels usually increases detection accuracy, at the cost of increased detection time.

In terms of classifiers, linear SVM is widely used, probably for its fast testing speed. With the fast method to evaluate the Histogram Intersection Kernel (HIK) [16], [17], HIK SVM was used to achieve higher accuracy with a slight increase in testing time [4].

Recent research has also substantially sped up human detection. Cascades (e.g., [7], [11]) and integral images (e.g., [14], [8]) are widely used to accelerate human detection. However, the detection speed is still far slower than frame rate. Thus GPUs are frequently used to distribute the computational load across hundreds of parallel threads. For example, the system in [9] achieved about 10 fps, and similarly 4 fps in [8], both using a GPU. In Sec. IV we will present a method that runs at 20 fps using only a single processing thread (and it is also very friendly to further GPU speedup). Table I compares the speed and accuracy of several fast vision-based detection methods, including C4, the method proposed in this paper.¹

There have been numerous previous works in the robotics community which developed pedestrian detection systems for mobile robot platforms [18], [19], [20], [21]. The majority of these works employ some form of ranging sensor (representative examples are [18] and [21]). 3D sensors have the advantage of leveraging 3D cues for detection and tracking (e.g., humans will protrude above the ground plane and
can often be segmented reliably in depth), and several impressive systems have been demonstrated. However, this approach has some significant disadvantages: active ranging systems can have limited resolution and range, limited temporal sampling rates, and difficulties with strong outdoor lighting; they also add to system expense and emit an energy signature. Therefore it seems useful to explore the viability of passive EO sensing technologies such as video cameras.

¹ [11] reported "up to 70x" speedup over HOG and 5-30 fps on different input images. Its speed in Table I is computed based on these numbers.

Fig. 1: Detecting humans from their contours (1b) and signs of local comparison (1c). Panels: (a) Original image, (b) Sobel image, (c) Only signs.

III. USING CENTRIST TO DETECT HUMAN CONTOUR

A. Signs of local comparisons are critical for encoding contours and human detection

We believe that contour is the most useful information for pedestrian detection, and that the signs of comparisons among neighboring pixels are key to encoding the contour. Both hypotheses are supported by experiments presented in this section.

Hypothesis 1: For pedestrian detection the most important thing is to encode the contour, and this is the information that HOG mostly focuses on. Local texture can be harmful, e.g., the prints on a person's T-shirt may confuse a human detector. In Fig. 1b, we compute the Sobel gradient of each pixel in Fig. 1a and replace the pixel with the gradient value (normalized to [0, 255]). The Sobel image smooths high-frequency local texture information, and the remaining contour in Fig. 1b clearly indicates the location of a human.

Fig. 6 in [1] also indicated that image blocks related to the human contour are important in the HOG detector. However, we do not know clearly what information captured by HOG makes it successful in human detection. We will experimentally show that contour is the important information captured by HOG. We used the original HOG detector in [1], but applied it to the Sobel version of the test images. The original HOG SVM detector was trained with features where contour and other information (e.g., fine-scale textures on the clothes) are interwoven with each other (cf. Fig. 1a). One would not expect it to detect humans, without modification, on Sobel test images where contour is the main information (cf. Fig. 1b). Surprisingly, the detection accuracy is 67%
at 1 FPPI, higher than 7 out of the 12 methods evaluated in [14].

Thus we believe that contour is the most important information captured by HOG and for pedestrian detection. One important difference between C4 and existing methods is that we explicitly detect humans from the Sobel image.

Hypothesis 2: Signs of comparisons among neighboring pixels are key to encoding the contour. We usually use image gradients to detect contours, which are computed by comparing neighboring pixels. We show that the signs of such comparisons are key to encoding contours, while the magnitudes of the comparisons are not as important.
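The Sobel image used for Hypothesis 1 can be illustrated with a minimal sketch. It assumes that "the gradient value" means the gradient magnitude, that border pixels are replicated, and that normalization divides by the largest magnitude in the image; none of these details are stated in the text, and the function name `sobel_image` is hypothetical.

```python
import math


def sobel_image(img):
    """Return Sobel gradient magnitudes of a 2-D list, scaled to 0..255.

    Assumptions (not from the paper): magnitude = hypot(gx, gy),
    replicated borders, and scaling by the maximum magnitude.
    """
    h, w = len(img), len(img[0])

    def px(r, c):
        # Clamp coordinates so border pixels are replicated.
        return img[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    mag = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Horizontal derivative: right column minus left column.
            gx = (px(r - 1, c + 1) + 2 * px(r, c + 1) + px(r + 1, c + 1)
                  - px(r - 1, c - 1) - 2 * px(r, c - 1) - px(r + 1, c - 1))
            # Vertical derivative: bottom row minus top row.
            gy = (px(r + 1, c - 1) + 2 * px(r + 1, c) + px(r + 1, c + 1)
                  - px(r - 1, c - 1) - 2 * px(r - 1, c) - px(r - 1, c + 1))
            mag[r][c] = math.hypot(gx, gy)

    peak = max(max(row) for row in mag)
    if peak == 0:
        return [[0] * w for _ in range(h)]
    return [[round(255 * v / peak) for v in row] for row in mag]


# A vertical step edge: the response is concentrated on the two
# columns next to the edge, and flat regions map to 0.
step = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_image(step)
```

On the step-edge example, flat regions vanish and only the columns adjacent to the intensity jump respond, which is the sense in which the Sobel image suppresses uniform regions and keeps contours.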
TABLE I: Speed comparison of several fast vision-based human detection methods. VGA resolution is 640x480, and qVGA is 320x240. Accuracy is at 1 FPPI (false positives per image).

| Method | GPU | qVGA speed | VGA speed | Speedup to HOG | Accuracy |
| HOG [1] | No | - | 0.075 fps [2] | 1x | 74.4% (Sec. V-B) |
| ChnFtrs [14] | No | - | 0.5 fps [14] | - | 86% |
| HOG-LBP [8] | Yes | - | about 4 fps | - | about 87% |
| HOG cascade [11] | No | 5-30 fps | - | 12-70x | - |
| GPU HOG [9] | Yes | 34 fps [9] | 10 fps [9] | 34x [9] | similar to HOG |
| C4 | No | 109 fps | 20 fps | 80x | 83.5% |

In order to verify this hypothesis, for a given image $I$, we want to create a new image $I'$ that retains the signs of local comparisons but ignores their magnitudes. In other words, we want to find an image $I'$ such that

$$\mathrm{sgn}(I(p_1) - I(p_2)) = \mathrm{sgn}(I'(p_1) - I'(p_2)), \qquad (1)$$

for any neighboring pair of pixels $p_1$ and $p_2$. An example is shown in Eqn. 2.

$$I: \begin{bmatrix} 32 & 2 & 8 \\ 38 & 96 & 64 \end{bmatrix} \qquad I': \begin{bmatrix} 1 & 0 & 1 \\ 2 & 3 & 2 \end{bmatrix} \qquad (2)$$

Note that the pixel 96 is converted to the value 3, because of the path of comparisons 2 < 32 < 38 < 96. In other words, although the magnitudes of comparisons in $I$ are ignored in $I'$, the spatial relationships among multiple comparisons in $I$ provide a "pseudo-magnitude" in $I'$. Another important observation is that gradients computed from $I$ and $I'$ will have quite different magnitudes.

Fig. 1c shows such a sign comparison image $I'$ (in which pixel values are scaled to [0, 255]) when $I$ is Fig. 1b. We can easily detect the human contour in Fig. 1c. We further verified Hypothesis 2 in human detection. Applying the original HOG detector to sign comparison test images (like Fig. 1c), we achieved 61% detection accuracy at 1 FPPI (better than 7 methods evaluated in [14]).

Although we observe lower detection rates when the Sobel images or the sign comparison images are used as test images, it is important to note that the classifier was trained using the original images. The fact that we obtain higher accuracies than many existing methods without modifying the HOG classifier is noteworthy. Thus we argue that the most useful information for human detection is the global contour information, and that the signs of comparisons among neighboring pixels are the key to encoding a contour.

B. The CENTRIST visual descriptor

We then propose to use the CENTRIST visual descriptor [13] to recognize humans, because it succinctly encodes the crucial sign information, and does not require pre- or post-processing. CENTRIST stands for CENsus TRansform hISTogram. We will show in this section why CENTRIST is suitable for this task, and compare CENTRIST with other popular descriptors in Sec. III-C.

The Census Transform (CT) was originally designed for establishing correspondence between local patches [22]. It compares the intensity value of a pixel with those of its eight neighboring pixels, as illustrated in Eqn. 3. If the center pixel is bigger than (or equal to) one of its neighbors, a bit 1 is set in the corresponding location. Otherwise a bit 0 is set.

$$\begin{matrix} 32 & 64 & 96 \\ 32 & 64 & 96 \\ 32 & 32 & 96 \end{matrix} \;\Rightarrow\; \begin{matrix} 1 & 1 & 0 \\ 1 & & 0 \\ 1 & 1 & 0 \end{matrix} \;\Rightarrow\; (11010110)_2 \;\Rightarrow\; CT = 214 \qquad (3)$$

The eight bits generated from intensity comparisons can be put together in any order (we collect bits from left to right, and top to bottom), and the result is converted to a base-10 number in [0, 255]. This is the CT value for the center pixel. The CENTRIST descriptor is a histogram of these CT values [13].

As shown in Eqn. 3, CT values succinctly encode the signs of comparisons between neighboring pixels. The only thing that seems to be missing from CENTRIST, however, is the power to capture global (or larger-scale) structures and contours beyond the small 3×3 range.

Nevertheless, if we are given an image $I$ with CENTRIST $h$, then among the small number of images $I'$ that have a matching CENTRIST descriptor, we expect that $I'$ will be similar to $I$, especially in terms of global structure or contour, which we illustrate in Fig. 2. Fig. 2a shows a 108×36 human contour. We divide this image into 12×4 blocks, thus each block has 81 pixels. For each block $I$, we want to find an image $I'$ that has the same histogram and CENTRIST descriptor as $I$.² As shown in Fig. 2b, the reconstructed image is similar to the original image. The global characteristic of the human contour is well preserved in spite of errors in the left part of the image.

Fig. 2: Reconstruct human contour from CENTRIST. Panels: (a) human contour, (b) reconstruction.

² We choose to work with small blocks with 81 pixels and binary images to make simulated annealing converge in a reasonable amount of time. Please refer to [13] for details of the reconstruction algorithm.
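The Census Transform and the CENTRIST histogram described in this subsection can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names `census_transform` and `centrist` are hypothetical, and the sketch skips the one-pixel border of the input because those pixels lack a full 3×3 neighborhood.

```python
def census_transform(img):
    """Return the CT value of every interior pixel of a 2-D list.

    Bits are collected left to right, top to bottom, as in Eqn. 3:
    a bit is 1 when the center is bigger than or equal to the neighbor.
    """
    h, w = len(img), len(img[0])
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1),           (0, 1),
               (1, -1),  (1, 0),  (1, 1)]
    ct = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            value = 0
            for dr, dc in offsets:
                bit = 1 if img[r][c] >= img[r + dr][c + dc] else 0
                value = (value << 1) | bit  # first neighbor ends up as the top bit
            row.append(value)
        ct.append(row)
    return ct


def centrist(img):
    """Return the 256-bin histogram of CT values (the CENTRIST descriptor)."""
    hist = [0] * 256
    for row in census_transform(img):
        for value in row:
            hist[value] += 1
    return hist


# The 3x3 block of Eqn. 3: the single interior pixel has CT value 214.
block = [[32, 64, 96],
         [32, 64, 96],
         [32, 32, 96]]
ct_values = census_transform(block)   # [[214]]
descriptor = centrist(block)          # one count in bin 214
```

Only comparisons enter the computation, so adding a constant to the image or applying any strictly increasing intensity mapping leaves the descriptor unchanged.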
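The sign-preserving image of Eqn. 1, used to test Hypothesis 2 in Sec. III-A, can be constructed in several ways, and the text does not say which procedure was used. The sketch below is one possibility under stated assumptions: "neighboring" is taken to mean the 8-neighborhood, connected pixels of equal intensity are merged so that they receive equal labels, and each pixel is labeled with the length of the longest chain of strictly darker neighbors leading to it. The function name `sign_image` is hypothetical.

```python
def sign_image(img):
    """Relabel a 2-D list so that only the signs of neighbor comparisons remain.

    Assumption (not from the paper): longest-chain labeling over the
    8-neighborhood, with equal-valued neighbors merged via union-find.
    """
    h, w = len(img), len(img[0])
    parent = list(range(h * w))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def neighbors(r, c):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < h and 0 <= cc < w:
                    yield rr, cc

    # Merge neighboring pixels with equal intensity (sign of comparison is 0).
    for r in range(h):
        for c in range(w):
            for rr, cc in neighbors(r, c):
                if img[rr][cc] == img[r][c]:
                    parent[find(r * w + c)] = find(rr * w + cc)

    # Visit pixels from dark to bright; every darker neighbor is already final.
    label = {}
    for value, r, c in sorted((img[r][c], r, c)
                              for r in range(h) for c in range(w)):
        root = find(r * w + c)
        best = label.get(root, 0)
        for rr, cc in neighbors(r, c):
            if img[rr][cc] < value:
                best = max(best, label[find(rr * w + cc)] + 1)
        label[root] = best

    return [[label[find(r * w + c)] for c in range(w)] for r in range(h)]


# The example of Eqn. 2: the pixel 96 receives the value 3
# because of the chain 2 < 32 < 38 < 96.
example = [[32, 2, 8],
           [38, 96, 64]]
relabeled = sign_image(example)
```

For the image of Eqn. 2 this labeling reproduces the values shown there, including the "pseudo-magnitude" 3 for the pixel 96.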
Fig. 3: Histogram and plot of similarity score differences, for CENTRIST and HOG. Panels: (a) Histogram of $s_{NN}$ (x-axis: normalized similarity score of closest same class example minus that of closest different class example; y-axis: histogram bin values), (b) Plot of $s_{NN}$ (y-axis: difference of normalized similarity values).

The fact that CENTRIST not only encodes important information (signs of local comparisons) but also implicitly encodes the global human contour makes us believe that it is a suitable representation for detecting human contours.

C. Comparing with HOG and LBP

Now we compare CENTRIST with HOG and LBP, two visual descriptors that are popular in human detection. For classification tasks, the feature vectors of examples in the same class should be similar to each other, while examples in different classes should have dissimilar descriptors. For any example $x$, we compute the similarity score between $x$ and all other examples. Let $x_{in}$ be the most similar example to $x$ within the same class. Similarly, let $x_{out}$ be the most similar example that is in a different class. Obviously we want $s_{NN} = s(x, x_{in}) - s(x, x_{out})$ to be positive and large, where $s(x, y)$ is the similarity score between $x$ and $y$. A positive $s_{NN}$ means that $x$ is correctly classified by a nearest neighbor (1-NN) rule. Thus $s_{NN}$ is an intuitive and easy-to-compute measure to determine whether a descriptor suits a certain task.

Fig. 3 compares CENTRIST (on Sobel images) and HOG (on original input images) using the INRIA dataset [1]. In Fig. 3a we use all the 2416 human examples, and randomly generate 2 non-human examples from each negative training image, which leads to 2436 non-human examples. Fig. 3a shows the distribution (histogram) of $s_{NN}$ for CENTRIST and HOG. Similarity scores are normalized to the range [0, 1], and a negative $s_{NN}$ (i.e., on the left side of the black dashed line) is an error of the 1-NN classifier. It is obvious that the CENTRIST curve resides almost entirely on the correct side (2.9% 1-NN error), while about half of the HOG curve is wrong (46% 1-NN error). Fig. 3b further shows that HOG errors are mostly in the first half of the dataset, which are human examples.

It is argued in [13] that visual descriptors such as HOG or SIFT [23] pay more attention to detailed local textural information instead of structural properties (e.g., contour) of an image. We further speculate that this is due to the fact that the magnitudes of local comparisons used in HOG pay more attention to local textures. It is also obvious that we cannot reconstruct an image from its HOG or SIFT descriptor. In Fig. 3 the HOG vectors are $\ell_2$-normalized, and we set $s(x, y) = x^T y$. For CENTRIST, the histogram intersection kernel [24] is used to compute similarity scores.

CENTRIST has a close relationship with LBP, another popular feature for human detection. If we switch all bits '1' to '0' and vice versa in Eqn. 3, the revised formula is an intermediate step to compute the LBP value for the same 3×3 region [25]. However, the more important difference is how the LBP values are utilized. Pedestrian detection methods use "uniform LBP" [5], [8], in which certain LBP values that are called "non-uniform" are lumped together. We are, however, not able to reconstruct the global contour because the non-uniform values are missing. In addition, [5] and [8] involve interpolation of pixel intensities. These procedures make their descriptors encode only a blurred version of the most important information, i.e., signs of neighboring pixel comparisons.

We computed the distribution of $s_{NN}$ for the uniform LBP descriptor. It has an error rate of 6.4% for the 1-NN classifier, more than twice the error rate of CENTRIST (2.9%). However, LBP has a better $s_{NN}$ distribution than HOG (46% 1-NN error). Our conjecture is that the incomplete and blurred local sign information in LBP is still less sensitive than HOG to noise and distractions from local textures.

IV. FAST LINEAR METHOD AND DETECTION FRAMEWORK

Given the virtues of CENTRIST, we will use it to detect humans. We use 108-by-36 as the detection window size, and divide the image patch into 9×4 blocks (so each block has 108 pixels). Following [1], we treat any adjacent 2×2 blocks as a super-block and extract a CENTRIST descriptor from each super-block. There are 8×3 = 24 super-blocks, thus the feature vector for a candidate image patch has 256×24 = 6144 dimensions. A one-pixel-wide border of each super-block is not included when computing the CENTRIST descriptor, because the Census Transform requires a 3×3 region.

A. Fast scanning using a linear classifier

Suppose we have already trained
a linear classifier $w \in \mathbb{R}^{6144}$. We can divide $w$ into smaller units corresponding to the super-blocks. In other words, $w$ is considered as a concatenation of $w_{i,j} \in \mathbb{R}^{256}$, $1 \le i \le 8$, $1 \le j \le 3$. Given an image patch with feature vector $f$ (similarly separated into $f_{i,j}$), it is classified as containing a human if

$$w^T f = \sum_{i=1}^{8} \sum_{j=1}^{3} w_{i,j}^T f_{i,j} \ge \theta. \qquad (4)$$

Inspired by [26], we propose a method that computes Eqn. 4 using a fixed number of machine instructions for each image patch, i.e., an O(1) method. We also improve the method in [26] by using only one integral image.

Let us denote the dimension of a detection window as $(h, w)$. A block has size $(h_s, w_s) = (h/9, w/4)$, and a super-block is $(2h_s, 2w_s)$. Given an image $I$, let $S$ denote its corresponding Sobel image and $C$ the CT image of $S$. For a detection
window with top-left corner $(t, l)$, it is not difficult to show that the term $w^T f$ in Eqn. 4 is equal to

$$\sum_{i=1}^{8} \sum_{j=1}^{3} \sum_{x=2}^{2h_s-1} \sum_{y=2}^{2w_s-1} w_{i,j}^{C(t+(i-1)h_s+x,\; l+(j-1)w_s+y)}, \qquad (5)$$

where $w_{i,j}^{k}$ is the $k$-th component of $w_{i,j}$, and $C(x, y)$ is a pixel in the CT image $C$. The index $x$ starts from 2 and ends at $2h_s - 1$ (and similarly for $y$) to exclude the border. We then create auxiliary images $A_{i,j}$ for $1 \le i \le 8$, $1 \le j \le 3$ with the same size as the input image $I$. The $(x, y)$ pixel of $A_{i,j}$ is set to

$$A_{i,j}^{x,y} = w_{i,j}^{C(x,y)}, \qquad (6)$$

then Eqn. 5 becomes

$$\sum_{i=1}^{8} \sum_{j=1}^{3} \left( \sum_{x=2}^{2h_s-1} \sum_{y=2}^{2w_s-1} A_{i,j}^{t+(i-1)h_s+x,\; l+(j-1)w_s+y} \right). \qquad (7)$$

Using the integral image trick, the term in the parentheses of Eqn. 7 can be computed using 3 arithmetic operations. Thus Eqn. 7 (and equivalently Eqn. 4) can be computed in O(1) steps.

The advantage of using CENTRIST is that it does not require normalization. In contrast, normalization is essential in HOG [1]. We can compute $w^T f$ as a sum of pixel-wise contributions without explicitly generating $f$.

Eqn. 7 is similar to the method for ESS with spatial pyramid matching in [26]. However, there is no need to generate multiple integral images. Instead, we define a single auxiliary image $A$, with

$$A(x, y) = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} w_{i,j}^{C((i-1)h_s+x,\; (j-1)w_s+y)}, \qquad (8)$$

where $n_x = 8$, $n_y = 3$. Then

$$w^T f = \sum_{x=2}^{2h_s-1} \sum_{y=2}^{2w_s-1} A(t+x,\; l+y). \qquad (9)$$

Only one integral image is needed to compute Eqn. 9, which saves not only a large amount of memory but also computation time. In practice, Eqn. 9 runs about 3 to 4 times faster than Eqn. 7. Please note that the technique of Eqn. 9 is general, and can be used to accelerate other computations, e.g., ESS with spatial pyramid matching.

The proposed method does not involve image pre-processing (e.g., smoothing) or feature vector normalization. In fact, the feature extraction component is seamlessly embedded into classifier evaluation. These properties together contribute to a real-time human detection system.

B. Detection framework

In the training phase, we have a set of 108×36 positive training image patches $P$ and a set of larger negative images $N$ that do not contain any pedestrian. We first randomly choose a small set of patches from the images in $N$ to form a negative training set $N_1$. Using $P \cup N_1$ we train a linear SVM classifier $H_1$. A bootstrap process is used to generate a new negative training set $N_2$: $H_1$ is applied to all patches in the images in $N$. In this bootstrapping process, we also re-scale the negative images to examine more patches. We then train $H_2$ using $P$ and $N_2$. This process is repeated until all patches in $N$ are classified as negative by at least one of $H_1, H_2, \ldots$ We then train a linear SVM classifier using $P$ and the combined negative set $\bigcup_i N_i$, which we call $H_{lin}$.

Linear classifiers ensure fast testing speed (and a fast bootstrapping process). However, it has been shown that HIK achieves higher classification accuracy on histogram features than linear SVM classifiers [17], [4]. We therefore train a second, HIK SVM classifier to achieve higher detection accuracy. We use $H_{lin}$ on $N$ to bootstrap a new negative training set $N_{final}$, and train an SVM classifier using the libHIK HIK SVM solver of [27], which we call $H_{hik}$. In the testing/detection phase, a cascade with two nodes, $H_{lin}$ and $H_{hik}$, is used.

We call the proposed method C4, as we are detecting humans based on their contour information using a cascade classifier and CENTRIST.

C. Pedestrian detection on-board a robot

We integrated the C4 pedestrian detection algorithm onto an iRobot PackBot in order to achieve on-board pedestrian detection and to enable pedestrian following. The implementation first captured images from a TYZX G2 stereo camera system and then processed the imagery using an Intel 1.2 GHz Core 2 Duo embedded in an add-on computational payload. We used the raw camera imagery to perform the detection and used the stereo range data to estimate the distance to the pedestrian. We used a particle filter to track the pedestrian between frames and to eliminate outliers. Finally, a following component was implemented to steer the robot chassis and command the neck pan axis.

We compared the basic approach described above with an optimized method that utilized the stereo data. We use the range image to provide hypotheses for where pedestrians may be standing. From the stereo data we use RANSAC to estimate a ground plane, and we sample the depths along the ground plane's horizon. With the depth and coordinates of the plane we can calculate a box that would contain a pedestrian standing on the plane at the given position on the horizon and given distance. This gives us far fewer windows to test with the detector, which reduces both computation and false positives. Figures 4a, 4b, and 4c show the raw detections from the C4 algorithm, the hypotheses generated from the stereo data, and the results of the C4 classifier evaluated only
on these hypotheses. Note that the C4 detector was tailored for the robot to a 3-layer cascade instead of a 2-layer one, for faster speed (at the cost of some accuracy).

By default the detection procedure is as follows. The detector looks for pedestrians of different projected sizes within the image. Recall that the distance to the pedestrian determines the size of the pedestrian in the image: pedestrians that are farther away appear smaller in the image, while closer pedestrians appear bigger. The detection
system searches the image at multiple scales, and for each scale runs a classifier on every single possible location of a pedestrian of that size within the image. Lacking any other information about the scene, this is the only reliable approach to detection. However, when other information is available, we should be able to use it to our advantage to decrease the number of windows on which to run the classifier, and to decrease the false positive rate. In particular, we can use information about the ground plane, which we can acquire from the stereo camera.

Fig. 4: On-board detection example. The three images are raw detection results using C4, C4+stereo, and the post-processed result of C4+stereo, respectively. The green line is the ground plane estimated using stereo. Panels: (a) C4, (b) C4+stereo, (c) C4+stereo, final.

For example, Fig. 4 shows the result of the pedestrian detection classifier being run on all possible windows (C4, Fig. 4a), and the many redundancies that are generated despite the fact that in most of the locations there cannot be a pedestrian [28]. Pedestrians are bound to the ground and we can use this fact as a prior to limit the search range (C4+stereo, Fig. 4b). The redundancies are a problem because each of the redundant windows has to be filtered out, increasing algorithmic and computational complexity. The results after post-processing are shown in Fig. 4c (cf. Sec. V-C for post-processing details).

V. RESULTS

We experimented on the INRIA pedestrian dataset [1]. We show the speed, accuracy, and related discussions of C4 in Sec. V-A to V-C. Results of human detection on-board the robot are described in Sec. V-D.

There are 2416 positive training image patches and 1218 negative images for bootstrapping in the INRIA dataset. We crop the examples to 108×36 pixels, which is a tight boundary, and remove the extra padding pixels. At testing time, a brute-force strategy is used to search image patches at all possible positions and scales. We successively down-sample the test image by a factor of 0.8, and scan on a grid with step size 2. We use the ground truth and matching criterion in [2]. A detection rectangle $R_d$ and a ground truth rectangle $R_g$ are considered a correct match if

$$\frac{\mathrm{Area}(R_d \cap R_g)}{\mathrm{Area}(R_d \cup R_g)} > 0.5. \qquad (10)$$

We also follow [2] in requiring that one ground truth rectangle can match at most one detected window.

TABLE II: Distribution of computing time (in percentage).

| Processing module | Percent of used time |
| Sobel gradients | 16.55% |
| Computing CT values | 9.36% |
| Integral image | 44.65% |
| Resizing image | 5.68% |
| Brute-force scan | 23.75% |
| Post-processing | 0.02% |

A. Detection speed

C4 achieves much faster speed than existing human detectors. On a 640×480 video, its speed is 20.0 fps, using only 1 processing core of a 2.8 GHz CPU. As far as we know, the fastest existing system (with a reasonably low false alarm rate and high detection rate) ran at about 10 fps [9], and utilized the parallel processing cores of a GPU. Detailed comparisons are available in Table I.

Real-time processing is a must-have property in most human detection applications. Our system is already applicable in some domains, e.g., robot systems. However, there is still substantial room for speed improvements, which will make C4 suitable even for the most demanding applications, e.g., automatic driver assistance. Table II is the breakdown of time spent in different components of C4. Most of these components are very amenable to acceleration using special hardware (e.g., a GPU).

The fact that we do not need to explicitly construct feature vectors for $H_{lin}$ is not the only factor that makes our system extremely fast. $H_{lin}$ is also a powerful classifier. It filters away about 99.43% of the candidate patches; fewer than 0.6% of the patches require the attention of the expensive $H_{hik}$ on the INRIA dataset.

C4 used 27.1 seconds on the INRIA dataset's test images, while the executables of the HOG detector [1] used 2167.5 seconds (i.e., an 80-fold speedup). C4 runs faster on smaller images. On a 480×360 YouTube video with many pedestrians, its speed is 36.3 fps. Its speed is 109 fps on 320×240 frames.

B. Detection accuracy on the INRIA dataset

The accuracy of the system using both the false positives per window (FPPW) and false positives per image (FPPI) metrics is shown in Fig. 5. In Fig. 5 we compare with HOG [1] (using
executables accompanying [1]). We compare with other methods directly using the accuracy numbers published in the respective papers.

Fig. 5: Comparison of C4 with HOG on the INRIA dataset. Panels: (a) FPPI results (x-axis: false positives per image; y-axis: miss rate), (b) FPPW results (DET curve for person detection; x-axis: false positives per window (FPPW); y-axis: miss rate). Curves: CENTRIST and HOG.

Fig. 6: Example of post-processing.

C4 detects 83.5% of humans at 0.96 false detections per image. It is comparable to the state-of-the-art results on the INRIA dataset, e.g., ChnFtrs [14] and HOG-LBP [8], both having around 86% detection rate at 1 FPPI. Multiple information channels were used in these methods. We could also use multiple channels to further improve C4. C4 has higher accuracy than HOG [1] (74.4% at 1 FPPI, Fig. 5a) and many other methods compared in [14], [2].

Fig. 5b shows the FPPW performance of $H_{hik}$ ($H_{lin}$ is not used when computing the FPPW curve). The FPPI and FPPW numbers are not linearly correlated but have similar trends. C4 outperforms HOG when the false positive rate is ≥ $10^{-4}$ (or 0.1 in the FPPI curve), and is not as good as HOG in the range of lower false positive rates. But they converge at the left end of both curves.

C. Importance of post-processing

The C4 and HOG curves intersect at $10^{-1}$ on the FPPI curve and at $10^{-4}$ on the FPPW curve. The non-maximal suppression (NMS) step contributes to this relatively big difference. In C4 a location is treated as a false positive if there are fewer than 3 detected windows at that location. This requirement will not hurt true positives because our step size is 2 and there are usually many detection windows around true humans. A small step size also means that NMS will greatly reduce the number of false detections. Examples are shown in Fig. 6. There are 17 and 5 false detections in these two images, respectively.³ After NMS, the first image has only 1 false positive, and the second image does not contain any remaining false detection.

³ The middle part of the second image contains 2 false detections which are very close to each other.

The HOG curve in Fig. 5a is slightly different from that in [14]. In Fig. 5a HOG detects 34% of the humans with only 1 false positive in all test images. However, HOG only detects about 10% of the humans with 1 false detection when evaluated in [14]. On the contrary, [14] reported a higher HOG detection rate at 1 FPPI (77%) than that in our experiments (74%). Although it is not totally clear what causes these differences, we believe that beyond non-maximal suppression, tightness of the detection window is also an important factor. We used a very tight 108×36 bounding box at training time. We relaxed detected patches to 120×42 during the post-processing step. It seems that the overly relaxed detection window in [2] or [1] is a disadvantage when we seek an extremely low false positive rate.

Fig. 7: ROC and Precision-Recall curves for the combined and standalone C4 detector, as well as at different cascade levels. Panels: (a) ROC (x-axis: false positives per frame; y-axis: true positive rate), (b) PR (x-axis: recall; y-axis: precision). Curves: Cascade Level 1 (8 FPS), Cascade Level 3 (8 FPS), Cascade Level 1 + Stereo (20 FPS), Cascade Level 3 + Stereo (20 FPS).

D. Detection results on-board a robot

In order to better understand the performance of the combined approach (C4+stereo) on the robot, we tested on images collected at iRobot's Bedford facility. Figures 7a and 7b show a detailed analysis, with ROC curves (Fig. 7a) and precision-recall curves (Fig. 7b) for both approaches, as well as the performance for the different cascade levels of the 3-layer C4 cascade tailored for the robot.
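Curves of this kind require a rule for deciding which detections count as correct. For the INRIA experiments that rule is the overlap criterion of Eqn. 10 together with the one-to-one matching constraint of [2]; whether the same rule was used for the robot data is not stated. The sketch below illustrates the criterion under assumptions of our own: rectangles are `(left, top, right, bottom)` tuples, detections arrive sorted by decreasing confidence, and matching is greedy. The function names are hypothetical.

```python
def area(rect):
    """Area of a (left, top, right, bottom) rectangle; empty rectangles give 0."""
    left, top, right, bottom = rect
    return max(0, right - left) * max(0, bottom - top)


def overlap_ratio(rd, rg):
    """Area(Rd ∩ Rg) / Area(Rd ∪ Rg), the left-hand side of Eqn. 10."""
    inter = (max(rd[0], rg[0]), max(rd[1], rg[1]),
             min(rd[2], rg[2]), min(rd[3], rg[3]))
    inter_area = area(inter)
    union_area = area(rd) + area(rg) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def match_detections(detections, ground_truth, threshold=0.5):
    """Greedy one-to-one matching (an assumed strategy, not from the paper).

    Returns (true_positives, false_positives, misses). Each ground truth
    rectangle can be matched by at most one detection.
    """
    unmatched = list(ground_truth)
    tp = fp = 0
    for det in detections:  # assumed sorted by decreasing confidence
        best = max(unmatched, key=lambda g: overlap_ratio(det, g), default=None)
        if best is not None and overlap_ratio(det, best) > threshold:
            unmatched.remove(best)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)


# Two windows fire on the same person; only the first counts as correct,
# and a third window elsewhere is a false positive.
truth = [(0, 0, 10, 10)]
dets = [(0, 0, 10, 10), (1, 0, 11, 10), (100, 100, 110, 110)]
counts = match_detections(dets, truth)
```

Sweeping a confidence threshold over the detections and recomputing these counts at each setting yields the points of a miss-rate or precision-recall curve.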
The ROC curves demonstrate a reduction in false positives; however, the detection rate peaks lower than that of the standalone version. This is likely because the standalone C4 was not explicitly trained on the output of the ground plane pedestrian hypothesis generator. The precision-recall curves similarly show an increase in precision with the stereo approach at higher recall rates, though again the recall rate reaches a lower maximal value with the stereo approach. We show ROC and precision-recall curves at different cascade levels to show the comparative improvements due to cascade level.

Overall, the combined approach results in fewer false positives, and is faster, at approximately 20 frames per second (50 milliseconds) on the embedded Intel 1.2 GHz Core 2 Duo. Alone, C4 runs at approximately 8 frames per second (120 milliseconds) on the same hardware. The combined approach results in a 60% reduction in computation and, depending on the detection rate, almost a factor of 5 reduction in false positives.

VI. CONCLUSIONS AND FUTURE WORK

In this paper we proposed a real-time and accurate human detector, C4, which detects humans using contour cues, a cascade classifier, and the CENTRIST visual descriptor.

First we show, through carefully designed experiments, that contour is the most important information source for human detection, and that the signs of comparisons among neighboring pixels are the key to encoding contours. We then show that CENTRIST [13] is particularly suitable for human detection, because it succinctly encodes the sign information, and is able to capture large-scale structures or contours.

A major contribution of this paper is extremely fast human detection. C4 detects humans at 20 fps on 640x480 images, using only 1 processing thread, and achieves accuracy comparable to the state of the art. Time-consuming pre-processing and feature vector normalization are not needed with CENTRIST. Furthermore, using a linear classifier and CENTRIST, we do not need to explicitly generate the CENTRIST feature vectors, and it takes only O(1) operations to evaluate an image patch.

Currently C4 has slightly lower detection accuracy than methods such as those in [8], [2]. However, similar to the techniques in [8], [2], we believe that the accuracy of C4 can be improved by using multiple information sources, e.g., incorporating color and other feature types. In particular, multiple feature channels will help C4 under very strict false positive requirements. Furthermore, the speed of C4 can be further improved by using special hardware such as a GPU.

Finally, on an iRobot PackBot with an embedded Intel Core 2 Duo 1.2 GHz CPU, we combined C4 with a stereo vision system, and achieved accurate, video-rate (approximately 20 fps) human detection without hindering other robot functionalities.

ACKNOWLEDGEMENT

We thank the anonymous reviewers for their useful comments and suggestions. We are grateful for the support of the Office of Naval Research under prime contract N00014-09-C-0101, and the NSF under RI award 0916687. J. Wu is supported by the Singapore MoE AcRF Tier 1 grant RG 34/09.

REFERENCES

[1] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, vol. 1, 2005, pp. 886–893.
[2] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: A benchmark," in CVPR, 2009.
[3] P. F. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in CVPR, 2008.
[4] S. Maji and A. C. Berg, "Max-margin additive classifiers for detection," in ICCV, 2009.
[5] Y. Mu, S. Yan, Y. Liu, T. Huang, and B. Zhou, "Discriminative local binary patterns for human detection in personal album," in CVPR, 2008.
[6] W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis, "Human detection using partial least squares analysis," in ICCV, 2009.
[7] P. Viola, M. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," in ICCV, 2003, pp. 734–741.
[8] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in ICCV, 2009.
[9] C. Wojek, G. Dorkó, A. Schulz, and B. Schiele, "Sliding-windows for rapid object class localization: A parallel technique," in DAGM-Symposium, 2008.
[10] B. Wu and R. Nevatia, "Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors," IJCV, vol. 75, no. 2, pp. 247–266, 2007.
[11] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan, "Fast human detection using a cascade of histograms of oriented gradients," in CVPR, vol. 2, 2006, pp. 1491–1498.
[12] D. Gerónimo, A. M. López, A. D. Sappa, and T. Graf, "Survey on pedestrian detection for advanced driver assistance systems," IEEE TPAMI, vol. 32, no. 7, pp. 1239–1258, 2010.
[13] J. Wu and J. M. Rehg, "CENTRIST: A visual descriptor for scene categorization," IEEE TPAMI, to appear.
[14] P. Dollár, Z. Tu, P. Perona, and S. Belongie, "Integral channel features," in BMVC, 2009.
[15] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," in CVPR, vol. I, 2005, pp. 878–885.
[16] S. Maji, A. C. Berg, and J. Malik, "Classification using intersection kernel support vector machines is efficient," in CVPR, 2008.
[17] J. Wu and J. M. Rehg, "Beyond the Euclidean distance: Creating effective visual codebooks using the histogram intersection kernel," in ICCV, 2009.
[18] K. O. Arras, Ó. M. Mozos, and W. Burgard, "Using boosted features for the detection of people in 2D range data," in ICRA, 2007.
[19] T. Nakada, S. Kagami, and H. Mizoguchi, "Pedestrian detection using 3D optical flow sequences for a mobile robot," in Sensors, 2008, pp. 776–779.
[20] D. Schulz, W. Burgard, A. Fox, and D. Cremers, "People tracking with a mobile robot using sample-based joint probabilistic data association filters," Intl. J. of Robotics Research, vol. 22, no. 2, pp. 99–116, 2003.
[21] L. Spinello, K. Arras, R. Triebel, and R. Siegwart, "A layered approach to people detection in 3D range data," in AAAI, 2010.
[22] R. Zabih and J. Woodfill, "Non-parametric local transforms for computing visual correspondence," in ECCV, vol. 2, 1994, pp. 151–158.
[23] D. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, pp. 91–110, 2004.
[24] M. J. Swain and D. H. Ballard, "Color indexing," IJCV, vol. 7, no. 1, pp. 11–32, 1991.
[25] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE TPAMI, vol. 24, no. 7, pp. 971–987, 2002.
[26] C. H. Lampert, M. B. Blaschko, and T. Hofmann, "Efficient subwindow search: A branch and bound framework for object localization," IEEE TPAMI, vol. 31, no. 12, pp. 2129–2142, 2009.
[27] J. Wu, "A fast dual method for HIK SVM learning," in ECCV, ser. LNCS 6312, 2010, pp. 552–565.
[28] D. Hoiem, A. A. Efros, and M. Hebert, "Putting objects in perspective," IJCV, vol. 80, no. 1, pp. 3–15, 2008.