Journal of Bioinformatics and Computational Biology c ○ Imperial College Press A THEORETIC

2020-06-24 来源：汇智旅游网

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

JournalofBioinformaticsandComputationalBiology󰀆

cImperialCollegePressATHEORETICALANALYSISOFTHESELECTIONOF

DIFFERENTIALLYEXPRESSEDGENES

SACHMUKHERJEE&STEPHENJ.ROBERTS

DepartmentofEngineeringScienceUniversityofOxford,OxfordOX13PJ

UnitedKingdom

{sach,sjrob}@robots.ox.ac.uk

Received(1September2004)Revised(1December2004)Accepted(DayMonthYear)

Agreatdealofrecentresearchhasfocusedonthechallengingtaskofselectingdiﬀeren-tiallyexpressedgenesfrommicroarraydata(‘geneselection’).Numerousgeneselectionalgorithmshavebeenproposedintheliterature,butitisoftenunclearexactlyhowthesealgorithmsrespondtoconditionslikesmallsamplesizesordiﬀeringvariances.Choos-inganappropriatealgorithmcanthereforebediﬃcultinmanycases.Inthispaperweproposeatheoreticalanalysisofgeneselection,inwhichtheprobabilityofsuccessfullyselectingdiﬀerentiallyexpressedgenes,usingagivenrankingfunction,isexplicitlycal-culatedintermsofpopulationparameters.Thetheorydevelopedisapplicabletoanyrankingfunctionwhichhasaknownsamplingdistribution,oronewhichcanbeapproxi-matedanalytically.Incontrasttomethodsbasedonsimulation,theapproachpresentedhereiscomputationallyeﬃcientandcanbeusedtoexaminethebehaviorofgenese-lectionalgorithmsunderawidevarietyofconditions,evenwhenthenumberofgenesinvolvedrunsintothetensofthousands.Theutilityofourapproachisillustratedbycomparingthreewidely-usedgeneselectionmethods.Keywords:microarrays,diﬀerentialanalysis,geneselection

1.Introduction

Theadventofmicroarraytechnology1,2hasmeantthattranscriptionalresponsestochangesincellularstatecannowbequantiﬁedforthousandsofgenesinasingleexperiment.Inrecentyears,anenormousamountofworkhasbeendoneinthisarea,addressingquestionsrelatingtobothnormalcellfunction3,4anddisease5,6.

Microarrayexperimentsaremostoftenperformedwiththeaimofcomparingexpressionlevelsbetweentissuesintwoormoreconditionsofinterest,suchaswild-typeandmutant,orhealthyanddiseased.Geneswhichareup-ordown-regulatedbetweenconditionsaresaidtobe‘diﬀerentiallyexpressed’.Althoughdiﬀerentiallyexpressedgenesinmicroarraystudiesarenotalwaysbiologicallyrelevant,theseanalysesplayavitalroleinnarrowingtheﬁeldforfurtherresearch.Ineﬀect,exper-imentsofthiskindprovidegeneticistswithashort-listofgenesworthcommitting

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

2MukherjeeandRoberts

hard-wonfundstowardsinvestigating.Theselectionofdiﬀerentiallyexpressedgeneshasthereforeemergedasoneofthemostimportanttasksinmicroarraydataanal-ysis.

However,themassivedimensionalityandrelativelysmallnumberofsamplesainmicroarraydatasets,coupledwithvariabilityinherentinbothexperimentalpro-cessandunderlyingbiology,maketheiranalysisstatisticallychallenging.Awidevarietyofalgorithms,includingconventionalandnon-parametrichypothesistests,aswellasBayesianandinformation-theoreticmethods7,8,9,10,11havebeenappliedtomicroarraydata.Yetitisoftenunclearhowaparticularalgorithmwillrespondtospeciﬁcstatisticalpropertiesofthedata.Forexample,inrecentmonthstherehasbeensomediscussioninthebioinformaticscommunityregardingthesuitabilityofthet-testwhenvariancesdiﬀeracrossgenesinasystematicmanner.Thisissue,whichwewillcallthe‘t-testvarianceissue’,isdistinctfromtheBehrens-Fisherproblem12ofvariancesdiﬀeringacrossclasses.Ifdiﬀerentiallyexpressedgenesareexpectedtohavehighervariancesthanothergenes,shouldthet-testbeabandonedinfavorofadiﬀerentanalysis,ordoesthet-statisticimplicitlydealwiththeissueanyway?Questionssuchasthishaveseriousimplicationsforusersofgeneselectionalgorithms,yetcanbediﬃculttoresolve.

Thispaperpresentsatheoreticalstudyofgeneselectioninwhichtheprobabilityofselectinggenescorrectly,usingagivenalgorithm,isderivedfrompopulationparametersb,samplesizesandthenumberofgenesunderconsideration.Whilepopulationparametersareneverknowninpractice,understandingexactlyhowagivenalgorithmrespondstovariousconditionsiscrucialtoidentifyingitsstrengthsandweaknesses.Knowingthattheperformanceofaparticularalgorithmbreaksdownatsmallsamplesizes,orwhenvariancesdiﬀerwidely,makesitpossibletomakeaninformedchoiceofmethod,inaccordancewithpriorknowledgeaboutaspeciﬁcexperimentalset-up.

Existingworkonthecomparisonofgeneselectionalgorithmsislargelybasedonsimulation.Simulation-basedproceduresestimateerrorratesempirically,byapply-ingagivenalgorithmtodatadrawnunderaspeciﬁedmodel.AcommonapproachistoproduceROCcurves(i.e.plotsoftruepositiveagainstfalsepositiveratesacrossarangeofthresholds)usingsimulateddatatoestimatetherequirederrorrates13.However,thesheersizeofmicroarraydatasetslimitsthescopeofstudiesofthiskind,asonlyasmallsubsetofconditionscanpossiblybeexplorediftheeﬀectsofcomparingthousandsofgenesaretobeaccountedfor.Forexample,consideraddressingthe‘t-testvarianceissue’byplottingtheareaundertheROCcurveagainsttheratioofvariancesofdiﬀerentiallyexpressedgenestonon-diﬀerentiallyexpressedgenes(‘varianceratio’).Ifatotalof10000geneswereunderconsideration,

aWe

usesampleanddimensioninthefollowingway:eachgeneisadimensionofthedata;eachtissueorarrayisasample.

bBy‘populationparameters’,wemeanthetruevaluesoftheparameterswhichspecifythestatis-ticalmodelforthedata.

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

ATheoreticalAnalysisoftheSelectionofDiﬀerentiallyExpressedGenes3

obtainingaccurateROCestimateswouldrequirethesamplingofseveralhundred10000-dimensionaldatasetsforeachvalueofthevarianceratio.Incontrast,thetheoreticalapproachpresentedinthispapercomputesmeasuresofaccuracydi-rectlyfromthesamplingdistributionsofgenerankingfunctions.Asaconsequence,theanalysisallowsuserstorapidlyassessthelikelyeﬀectsofvariousconditionsonalgorithmperformance.

Ourinitialanalysisisasimpletwo-genescenario:givendatafromtwogenes,onlyoneofwhichisdiﬀerentiallyexpressed,howlikelyisagivenalgorithmtoselecttherightgene?Wethenextendtheanalysistomultiplegenes.Anillustrativesetofresultsispresentedwhichexaminestheperformanceofthreewell-knowngeneselectionmethodsundervariousconditions.Wediscussourapproach,relatingittotheexistingliterature,andﬁnallylookatsomepossibilitiesforfuturework.2.Generankingandselection

Considermicroarrayslides(orchips)belongingtotwoclasses,e.g.healthyanddiseased.DataDforaparticulargeneconsistofmexpressionlevelsinoneclass,andnintheother:

D=[X1X2...XmY1Y2...Yn]

(1)

Xiareindependentandidenticallydistributed(i.i.d.)randomvariableswithtruemeanµX;Yjarealsoi.i.d.withtruemeanµY.Ifthetrueclassmeansaredistinct,thegeneissaidtobediﬀerentiallyexpressed.Inthelanguageofhypothesistesting,thenullandalternativehypotheses(H0andH1respectively)are:

H0→µX=µY(2)H1→µX=µY

(3)

WerefertogenessatisfyingthenullhypothesisH0asnullgenesandthosesatisfyingH1asalternativegenes.Alternativegenesarethereforesimplythosegeneswhicharetrulydiﬀerentiallyexpressedandnullgenesthosewhicharenot.

Geneselectionalgorithmsﬁrstrankgenesunderafunctionr(therankingfunc-tion)andthenselectasetofgenesastheresultoftheanalysis.Therankingfunctionmayscoreeachgeneindividually(asisthecase,forexample,withthet-statistic),orprocessallofthedatatoassignscorestothegenes(SAM11,forexample,usestheentiredatasettodeterminearegularizingterm).Therankingofgenesunderthefunctionrproducesanorderedlistofallthegenesintheexperiment.Asetofgenesisthenobtainedbyeitherimposinganexplicitthresholdonthescores,orbyselectingacertainnumberoftop-rankedgenes.ThecriticalvalueobtainedfromaP-valueisafamiliarexampleofanexplicitthreshold.3.Theory

InthisSectionwederiveprobabilitiesofsuccessingeneselection.Thematerialpresentedbelowisnecessarilysomewhattechnical,butthequestionbeingaddressed

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

4MukherjeeandRoberts

isstraightforward:wesimplywishtocalculatetheprobabilityofsuccessusingagivengenerankingfunction,intermsofpopulationparameters,samplesizeandnumbersofgenesinvolved.Inotherwords,wewishtounderstandexactlyhowwellagivenrankingfunctionisexpectedtodounderspeciﬁedconditions.3.1.BinarySelectionAccuracy

Letusstartbycomparingjusttwogenes,oneofwhichisdiﬀerentiallyexpressed(thealternativegene)andtheothernot(thenullgene).Weassumethatdataisdrawnaccordingtosomemodel,withparametersθ1forthealternativegene,andθ0forthenullgene(wefollowtheconventionthatsubscript‘1’referstothealternativegeneand‘0’tothenullgene).Forexample,themodelmightbeNormal,inwhichcaseeachparametervectorwouldcontainclassmeansandvariances.Thedataforeachgenearethenacollectionofrandomvariablesdrawnundereitherthenulloralternativemodels,andaredenotedbyD0andD1respectively.TherankingfunctionrmapsdataforthenullandalternativegenestorankingscoresT0andT1respectively:

T0=r(D0);

T1=r(D1)

(4)

Adecisionismadeinthistwo-genecasebysimplycomparingthetwoscores.Assumingthathigherabsoluterankingscorescorrespondtoagreaterchanceofdiﬀerentialexpression,theprobabilityofchoosingthecorrectgeneissimply:

P(|T0|<|T1|)

(5)

TheprobabilitygivenbyEq.(5)representstheabilityofthefunctionrtodiscriminatebetweennullandalternativegenes.WecallthisprobabilityBinarySelectionAccuracyorBSA.

NotethatrankingscoresT0andT1arefunctionsofrandomdataandarethem-selvesrandomvariables;lettheirsamplingdistributionscbep0andp1respectively,andthecorrespondingcumulativedistributionfunctionsbeφ0andφ1.ThentheprobabilitygiveninEq.(5)canbeexpressedP(|T0|<|T1|)=󰀇asfollows:

∞

P(|T1|>|u|)p0(u)du

󰀇−∞∞

(1−φ1(|u|)+φ1(−|u|))p0(u)du−∞

=󰀌1−φ1(|u|)+φ1(−|u|)󰀍p0(6)

Where󰀌·󰀍p0denotesexpectationunderthedensityp0.Forfunctionsrwhichare

validteststatistics,thetermP(|T1|>|u|)isthepowerfunctionatthresholdu,andtheprobabilitygivenaboveiseﬀectivelyanintegraloverthepowerfunction.

cThe

samplingdistributionofafunctionofrandomvariables(i.e.astatistic)issimplythedistri-butionofthestatistic.Forexample,thesamplingdistributionofarankingfunctioncanbethoughtofasthedistributionofvaluesthatwouldemergefromrepeatedlycomputingrankingscoresfromdatasampledunderthemodel,withpopulationparametersandsamplessizesremainingﬁxed.

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

ATheoreticalAnalysisoftheSelectionofDiﬀerentiallyExpressedGenes510.75Diff. of meansDiff. of meansyt−statisticyt−statisticcca0.9ra0.7urcucccAA n0.80.65oniotitcceleS0.7eleS0.6 yyrarnaiB0.6niB0.550.5681012141618200.50.511.52Total sample size (m+n)σ21/σ20Fig.1.BinarySelectionAccuracy.Theleftpanelshowsthetrueprobabilityofcorrectlyorderingtwogenes(‘BinarySelectionAccuracy’orBSA)plottedagainsttotalsamplesize,fortwodiﬀerentrankingfunctions,withbothnullandalternativevariancessetto7.5.TherightpanelshowsBSAplottedagainsttheratioofthevarianceσ21ofthediﬀerentiallyexpressedoralternativegenetothevarianceσ20ofthenullgene.Here,σ20isﬁxedat7.5,andσ21allowedtovary.Inbothpanelstotalsamplesizeis8andthetruediﬀerenceofmeansis2.Thedensitiesp0andp1playtheroleoflinkingBinarySelectionAccuracytopopulationparametersandsamplesizes.ThisbecomesclearifweexplicitlyrewritetheprobabilityinEq.(5)asafunctionofpopulationparametersandsamplesize:P(|T0|<|T1|)=f(θ0󰀇,θ1,󰀅m,󰀇n)

∞|u|

−|u|=1−

p(T1=v|θ1,m,n)dv−=u|θ󰀇p(T1=w|θ1,m,n)dw

−∞

󰀆

p(T00,m,n)du

(7)

BinarySelectionAccuracyisthusderiveddirectlyfromthesamplingdistribu-tionsp0andp1andcanberegardedasatruemeasureoftheabilityoftheranking

functiontodistinguishnullandalternativegenes,onthebasisofdatadrawnunderthemodel.Wehaveconsideredabsolutevaluesofrankingscorestomaketheanal-ysisdirectlyapplicabletocommonfunctionslikethet-statistic,Fisherratioanddiﬀerenceofmeans;however,therelevantEquationscanbeeasilymodiﬁedforthesignedcase.NotealsothatwhiletheexpectationinEq.(6)willnotingeneralbeanalytic,itcanbecalculatedeasilybysamplingfromp0.

Interestingly,itturnsoutthatBSAisequivalenttotheareaunderatrueROCcurve.By‘trueROCcurve’wemeanacurvederiveddirectlyfromthesamplingdistributionsoftherankingfunction;thefamiliarROCcurveobtainedbysimulationcanbethoughtofasanestimateofthistruecurve.SeeAppendixBfordetailsandashortproof.

Figure1showstheresultsofusingBinarySelectionAccuracytocomparethe

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

6MukherjeeandRoberts

t-statisticanddiﬀerenceofmeansmethods.Ofparticularinterestistherightpanel,whichshowsthat,atleastasfarasthetwo-genecaseisconcerned,highervariancesforthediﬀerentiallyexpressedgenereallydohandicapthet-test.3.2.MultipleSelectionAccuracy

Consideramoregeneralscenariowithatotalofggenes,ofwhichg1aretrulydiﬀerentiallyexpressed(alternativegenes)andg0arenot(nullgenes).Table1summarizestherelationshipsbetweenthesegroupsofgenes.Thequestionwewishtoaskisthis:givenpopulationparameters,samplesizesandnumbersofnullandalternativegenes,whatistheprobabilitythatoneofthealternativegenesisrankedinthetopsplacesundertherankingfunction?Togetaquantitativeviewofthepracticaldiﬀerencebetweenalgorithms,wewouldalsoliketoknowthedistributionoverthenumberofalternativegenesreturnedamongthesgenesselected.

Wetreatthenumberofgenestobeselectedasaconstant.Thisnumbertendstodependonfactorslikethescaleoftheexperiment,follow-upplansandsoon14andcanreasonablyberegardedaspartoftheexperimentalset-up,inmuchthesamewayasthetotalnumberofgenesunderstudy,orthesamplesize.Incontrast,scorethresholds,whilestatisticallyconvenient,havenoclearinterpretationinthecontextoftheexperiment.Ageneticistmayhaveanapriorinotionofthenumberofgenesshewantstoselect,butwillnotgenerallycareabouttheactualscores,beyondtheirrankorder.Itthereforemakessensetocenterananalysisofmultiplegeneselectionaroundnumberofgenesselected,ratherthanthreshold.

Table1.Summarytableformultiplegeneselection.

Alternativegenes

s1g1−s1g1Nullgenes

s0g0−s0g0(Total)sg−sg

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

ATheoreticalAnalysisoftheSelectionofDiﬀerentiallyExpressedGenes7

beaprobabilisticrelationshipbetweenthethresholdandnumberofgenesselected.Now,atagiventhresholdτ,theprobabilityofanalternativegenebeingselectedbythealgorithmissimplythetruepositiverateatτ,denotedbyβ(τ).Then,theprobabilityofanalternativegenebeingselectedamongthetopsgenesissimplytheexpectationofthetruepositiverateβ(τ),undertheconditionaldensityp(τ|s).WecallthisprobabilityMultipleSelectionAccuracyorMSA:

MSA=

󰀇∞

β(τ)p(τ|s)dτ0

=󰀌β(τ)󰀍p(τ|s)

(8)

3.2.2.Theconditionalp(τ|s)

Butwhatistheconditionaldistributionofthresholdτgivens,andhowcantheex-pectationinEq.(8)becomputedintermsofsamplingdistributionsandnumbersof

genes?Wesuggestthefollowingunnormalizeddensityfunctionasanapproximationtop(τ|s)(thefullderivationispresentedinAppendixA):

p(τ|s)∝

22πv(τ)

e−[s−m(τ)]/[2v(τ)](9)Where,m(τ)andv(τ)arefunctionsofthresholdτandaregivenby:

m(τ)=g1β(τ)+g0α(τ)

(10)v(τ)=g1β(τ)(1−β(τ))+g0α(τ)(1−α(τ))

(11)

Thetrueandfalsepositiveratesβ(τ)andα(τ)canbewrittenintermsofφ0andφ1asfollows:

β(τ)=1−φ1(τ)+φ1(−τ)(12)α(τ)=1−φ0(τ)+φ0(−τ)

(13)

Notethatiftherankingfunctionrisavalidteststatistic,β(τ)isthepower,andα(τ)theP-value,atthresholdτ.

Takentogether,Eq.(8)throughEq.(13)giveMultipleSelectionAccuracyintermsofp0,p1,φ0,φ1,g0,g1,ands;theexpectation󰀌β(τ)󰀍p(τ|s)iscomputedbysamplingfromthedensityinEq.(9).Foragivenrankingfunction,thesamplingdistributionsp0andp1andtheircorrespondingcumulativedistributionfunctionsdependonlyonpopulationparametersθ0andθ1andsamplesizesmandn.WehavethereforeexpressedMultipleSelectionAccuracyinthedesiredform.

ThedistributionoverthenumberofalternativegenesdiscoveredisBinomial,with‘probabilityofsuccess’beingMSA×max(g1/s,1),andthenumberofBernoullitrialsthesmallerofsandg1s∼B󰀃:

MSA×max

󰀃g

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

8MukherjeeandRoberts

3.2.3.Relaxingtheﬁxed-parameterassumption

Theanalysispresentedaboveassumesﬁxedparametersfornullandalternativegenes.Thisassumptioncanberelaxedbytreatingtheparametersofthemodelasquantitiestobeintegratedout:

󰀇

fMSA(π)p(π)dπ(15)

Where,πisavectorrepresentingalltheparametersandfMSArepresentsMul-tipleSelectionAccuracyasafunctionofπ.

Thedistributionp(π)wouldbechosentoreﬂectknowledgeaboutthemannerinwhichparametersofinteresttendtovary.Forexample,diﬀerencesinmeansareknowntobedistributedexponentiallyinmicroarraydatasets;anappropriatechoiceofp(π)couldcapturesuchvariation.Whilethisapproachisnotusedintheexperimentspresentedbelow,itrepresentsasimplemeansbywhichourresultscanbefurthergeneralized.4.Experiments

InthisSectionwepresenttheresultsofacasestudy,comparingthreewidely-usedgenerankingfunctionsundervariousconditions.4.1.Preliminaries

4.1.1.Modelandpopulationparameters

WeassumeaNormalmodelfor(log-transformed)expressiondata,butemphasizethattheframeworkdevelopedinthepreviousSectionisquitegeneral,andinprin-cipleallowsMultipleSelectionAccuracytobecalculatedunderanymodel.

Themultipleselectionanalysisassumesthatallalternativegeneshaveidenticallydistributedexpressionlevels,asalsoallnullgenes.Thepopulationparametersarethenasfollows:

•Means:Classmeansfornullgenesarebydeﬁnitionidentical,andfortheanalysisofthealgorithmswehavechosen,neednotbeconsideredfurther.Wethereforeonlyexplicitlyrefertoclassmeansforthealternativegenes;thesearedenotedbyµXandµYrespectively.Inallexperiments,theextentofdiﬀerentialexpression(i.e.|µX−µY|)issetto2.

•Variances:Weassumethatclassvariancesareidenticalforeithernulloralternativegenesbutmaydiﬀerbetweenthem.Thepopulationvariance

foralternativegenesisdenotedbyσ21,andfornullgenesbyσ2

0;bydefault,bothvaluesaresetto1.

•Samplesize:Asbefore,thenumberofsamplesineachclassism(forthe‘X’data)andn(forthe‘Y’data);bydefault,bothmandnaresetto4.•Numberofgenes:Weassumeatotalof6000genes,ofwhich50aretrulydiﬀerentiallyexpressed.Thenumberofgenesselectedis100.

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

ATheoreticalAnalysisoftheSelectionofDiﬀerentiallyExpressedGenes9

Theparameterset-upcanbechoseninaccordancewiththespeciﬁcquestionsbeingaddressed.Forexample,inthiscase,variancesareconﬁguredtolookintothe‘t-testvarianceissue’mentionedpreviously.4.1.2.Rankingfunctions

Thethreerankingfunctionsanalyzedherearebrieﬂyreviewedbelow.

t-statistic:Thet-testisacanonicaltwo-samplehypothesistestandhasbeenwidelyusedinthediﬀerentialanalysisofmicroarraydata.Therankingfunctionforaparticulargeneisat-statistic:

r(D)=

¯−Y¯(16)m+

12󰀈Where,X

¯andY¯representsamplemeansofdataSD

thepooledsamplestandarddeviation.󰀈SDineachofthetwoclasses,andThenullsamplingdistributionp0isat-distribution,with(m+n−2)degreesoffreedom.Thealternativedistributionp1isanon-centralt-distribution15,with(m+n−2)degreesoffreedomandnon-centralityparameterψ:

ψ=

µX−µY

1(17)2

Diﬀerenceofmeans:Thesamplemeans,i.e.r(D)=X

¯rankingfunctionissimplythediﬀerencebetween

−Y¯.Thisfunctioncanberegardedasalog-spacefoldanalysis7.Bothp0andp1areNormaldensities:

p0=N(0,σ20/m+σ2

0/n)(18)p1=N(µX−µY,σ21/m+σ21/n)

(19)

SAMstatistic:TheSAMstatistic11isgivenby:

r(D)=X

¯−Y¯12Where,󰀈SD

isthepooledstandarddeviation󰀁󰀈andK󰀂(20)

+SD+Karegularizingterm.ThetermKisdeterminedfromtheentiredataset,sofromthepointofviewofderivingasamplingdistribution,isitselfarandomvariable.ThesamplingdistributionoftheSAMstatisticcanbeapproximatedasaratioofNormals16;wemakeuseofthatapproximationhere,andrefertheinterestedreadertothereference.4.2.Results

Resultsforthethreealgorithmsunderconsiderationarepresentedbelow,withMultipleSelectionAccuraciesplottedagainstparametersofinterest.Alsoshownaredistributionsoverthenumberoftrulydiﬀerentiallyexpressedgenesdiscovered;

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

10MukherjeeandRoberts

Diff. of means0.20.2y1cSAMDiff. of meansDiff. of meansarut−statisticSAMSAMcc0.8A0.15t−statistic0.15t−statistic noit))c0.611esl(eP0.1s(P0.1S e0.4lpitluM0.20.050.0500.511.52000102030405001020304050σ2/σ210Number of differentially expressedgenes discovered (sNumber of differentially expressed1)genes discovered (s1)Fig.2.Unequalvariances.TheleftmostpanelshowsMultipleSelectionAccuracies,plottedagainsttheratioofvariancesofalternativetonullgenes.Nullvarianceσ20isﬁxedat1,whilealternativevarianceσ21isallowedtovary.Themiddlepanelshowsfulldistributionsoverthenumberofalternativegenesselected,whenalternativevarianceisrelativelylow(σ21=1December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

ATheoreticalAnalysisoftheSelectionofDiﬀerentiallyExpressedGenes11

10.2Diff. of meansySAMcaru0.80.15t−statisticccA no0.6)1itsc(0.1ePleS0.4 elpit0.05luM0.2Diff. of meansSAMt−statistic00810121416182001020304050Total sample size (m+n)Number of differentially expressedgenes discovered (s1) Fig.3.Samplesize.TheleftpanelshowsMultipleSelectionAccuracyplottedagainstsamplesize.Therightpanelshowsdistributionsoverthenumbers1oftrulydiﬀerentiallyexpressedgenesdiscovered,whentherearefourarraysineachclass(i.e.whentotalsamplesizeis8).Thetotalnumberofgenesis6000,ofwhich50aretrulydiﬀerentiallyexpressedand100areselected;bothnullandalternativevariancesaresetto1,andthetruediﬀerenceofmeansis2.below),butpersistsathighersamplesizestoo17.Thisisduetothefactthatwhennullandalternativevariancesareidentical,thestandarddeviationterminthede-nominatorofthet-statisticoﬀersnodiscriminatorypower,butmustnonethelessbeestimatedfromdata.Ineﬀect,undertheseconditions,thestandarddeviationtermisredundantintermsoftellingnullandalternativegenesapart(butnonethelessplaysaroleinmakingthefunctionavalidteststatistic).

II.Howsuccessfullycanthealgorithmscopewithsmallsamplesizes?

Figure3showsMultipleSelectionAccuracyasafunctionoftotalsamplesize.DiﬀerenceofmeansandSAMdonoticeablybetterthanthet-statistic,especiallyatsmallersamplesizes.TherightpanelofFigure3showsdistributionsoverthenumberofalternativegenesdiscoveredwhentherearefoursamplesineachclass.

Thesmallsamplebehaviorofthet-statisticisduetothestandarddeviationterminthedenominatorofthestatistic,andhasbeenmentionedpreviouslyinthecontextofgeneselection18.SAMisfarlesssensitiveinthisregardonaccountofitsregularizingterm.Indeed,SAM’sregularizingtermmakesthebehaviorofitssamplingdistributionquitediﬀerentfromthet-distribution,inparticularatsmallsamplesizesandundertheconditionsofdiﬀeringvariancediscussedabove.AstheMSAcurvesshow,theconsequenceofthisdiﬀerenceisthatSAMisconsiderablymorerobustundersuchconditionsthanthet-statistic.III.HowisMSAaﬀectedbythetotalnumberofnullgenes?

Microarraydatasetstypicallycontainordersofmagnitudemorenullgenesthanalternativeones.Howdotheperformancesofthethreealgorithmsscalewiththe

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

12MukherjeeandRoberts1)ASM(0.25 y0.8carutcA0.2c0.6SA Mn−od0.15itAcSe0.4leM0.1S elp0.2Diff. of meansit0.05luSAMMt−statistic02000400060008000100000200040006000800010000non−differentially expressed genes (gNumber of null or0)non−differentially expressed genes (gNumber of null or0)Fig.4.Totalnumberofnullgenes.TheleftpanelshowsMultipleSelectionAccuracyorMSAplottedagainstthenumberg0ofnullgenes,withthenumberofalternativegenesandthenumberofgenesselectedheldconstantat50and100respectively.TherightpanelshowsthediﬀerenceinMSAbetweendiﬀerenceofmeansandthet-statisticasafunctionofg0.Nullandalternativevariancesaresetto1,totalsamplesizeis8,andthetruediﬀerenceofmeansis2.numberofnullgenes?TheleftpanelofFigure4showsMultipleSelectionAccuracyasafunctionofthenumberg0ofnullgenes,withthenumberofalternativegenesandthenumberofgenesselectedbeingheldconstant.TherightpanelplotsthediﬀerenceinMultipleSelectionAccuracybetweendiﬀerenceofmeansandthet-statisticagainstg0.Variancesfornullandalternativegenesareidenticalintheseexperiments.AcloseexaminationoftherightpanelofFigure1revealsthatunderconditionsof

equalvariance(i.ewhentheratioσ21/σ2

0isunity)diﬀerenceofmeanshasaslightadvantageoverthet-statisticintermsofBSA.WecanseefromFigure4,however,thattheeﬀectofthisslightadvantagebecomesmorepronouncedasnullgenesbecomemorenumerous.Theseresultssuggestthatalgorithmchoicemaybeanespeciallycriticalissuewhensmallnumbersofgenesofinterestliehiddeninverylargedatasets.

5.Discussionandconclusions

Theaimofthispaperhasbeentounderstandhowgeneselectionalgorithmsbehave-anddiﬀerintheirbehavior-invariousregionsofparameterspace.Thereisnowavastliteratureonthestatisticalandcomputationalaspectsofgeneselection,andouranalysishasconnectionsto,anddiﬀerencesfrom,multipletestingandpoweranalysis.ThecloserelationshipbetweenBSAandROCcurveshasbeendescribedindetailaboveandwillnotbediscussedfurtherinthisSection.

Attheoutset,wedescribedgeneselectionasatwo-stageprocess,withgenesﬁrstbeingrankedunderafunctionandasetofgenessubsequentlyselected.Mul-tipletestingproceduresforgeneselection19seektoadjustP-valuestoproperly

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

ATheoreticalAnalysisoftheSelectionofDiﬀerentiallyExpressedGenes13

accountformultiplecomparisons.Theseproceduresarethereforeconcernedessen-tiallywiththesettingofthresholdsinthesecondstageofgeneselection.Incontrast,thispaperhassoughttoexaminethepropertiesofrankingfunctionsthemselves.Thus,althoughthenotionofMultipleSelectionAccuracyisrelatedtoFalseDiscov-eryRate(FDR)d,ouranalysisaddressesafundamentallydiﬀerentquestionfromFDRmethodsandotherP-valueadjustments.Thedistinctaimsofouranalysisleadalsotoaquitediﬀerentformulation.FDRmethodscalibratethresholdsforagivendatasetandrankingfunction,sotheygenerallyapproximatethe‘true’FDRvaluewithoutusingpopulationparameters.Incontrast,ourcalculationofMultipleSelectionAccuracyisdoneexplicitlyintermsofpopulationparameters.

Theabilityofarankingfunctionto‘tellapart’nullandalternativegenesisofcriticalimportancetoaccurategeneselection.Thisistrueregardlessofthespeciﬁcapproachtakentothesubsequentselectionofasetofgenes.Wethereforesuggestedaverysimpleanalysistoquantifythediscriminatoryabilityofrankingfunctions,resultingtheformulationofBinarySelectionAccuracy.Theabilitytodiscriminatemustdependonbothnullandalternativesamplingdistributions,andindeedthemathematicalexpressionforBinarySelectionAccuracyinEq.(6)makesuseofbothdistributions.Incontrast,P-valuesarecomputedfromthenulldistributionalone,andstatisticalpowerfromthealternativedistribution.Ourtreatmentoferrorsandalgorithmcomparisonisconsequentlyquitediﬀerentfromtheconventionalapproach.ClassicaltestingtheoryﬁrstﬁxesanacceptablelevelofTypeIerror(viatheP-value)andthenseekstomaximizepower,whereasBinarySelectionAccuracyisdeﬁnedwithoutreferencetoanyspeciﬁcthresholdorP-value.Furthermore,inouranalysisabalancebetweenTypeIandTypeIIerrorsemergesnaturally,anddoesnotneedtobeexplicitlyspeciﬁed.

Onelimitationofourapproachconcernsthevalidityoftheapproximationswehaveusedatverysmallsamplesizes.Standardapproximationsforthenon-centralt-distributiondependonmoderatesamplesizes15,asdoestheapproximationwehaveusedforthesamplingdistributionoftheSAMstatistic16.Ourmultiple-geneexperimentsthereforeassumedaminimumoffourarraysineachclass;wehaveempiricallyveriﬁedtheaccuracyoftheapproximationsusedundertheseconditions.However,inpractice,replicatedmicroarraystudiesmayinvolveasfewastwoarraysineachclass;furtherworkonsmallsampledistributionsforrankingfunctionsofinterestisthereforeneededtoextendourworktoapplytosuchconditions.WenotealsothattheNormalapproximationtotheBinomialwhichisusedinderivingtheconditionaldistributionofthresholdgivennumberofgenesselectedorP(τ|s)meansthatwecannotcurrentlyconsiderthecasewherenumberofgenesselectedfallsbelow∼25.

Ouranalysisiscenteredaroundthesamplingdistributionsofgenerankingfunc-dLet

‘TrueDiscoveryRate’orTDRbedeﬁnedinanalogytoFDR,astheexpectedproportionofalternativegenesamongthoseselected.IfTDR(s)representsthepopulationTDRwhensgenesareselected,itcanbeshownthatMSA≈TDR(s)×s

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

14MukherjeeandRoberts

tions.Apracticalissuewiththisapproachisthattheanalysisofotherrankingfunctionswillrequiretherelevantsamplingdistributionstobederived,orapprox-imatedanalytically.Whilesuchderivationsmaynotalwaysbestraightforward,itcouldbearguedthatwithouttheinformationprovidedbysamplingdistributions,itisdiﬃculttogetatruepictureofthebehaviorofanalgorithmundervariousconditions,beyondnecessarilylimitedempiricalstudies.

Inconclusion,wehavepresentedatheoreticalanalysisofgeneselection,whichcanbeusedtothoroughlyexaminethebehaviorofselectionalgorithms.Theanalysiscanhighlightthestrengthsandweaknessesofalgorithms,andthushelpbioinfor-maticianschoosemethodsappropriatetoparticularexperimentalconditions.

AppendixA.Approximationtotheconditionalp(τ|s)

Asnotedinthemaintext,thegeneralrelationshipbetweenthresholdτandnumberofgenessisprobabilistic.ApplyingBayestheorem:

p(τ|s)=

P(s|τ)p(τ)

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

ATheoreticalAnalysisoftheSelectionofDiﬀerentiallyExpressedGenes15

AppendixB.AreaundertheROCcurve

ThetrueROCcurveissimplyaparametriccurve(α(τ),β(τ)),parameterizedbyarankingthresholdτ,whereα(τ)andβ(τ)arerespectivelythefalsepositiveandtruepositiveratesgiventhresholdτ.Notethatingeneselectiontrueandfalsepositivesrefertothegenesbeingselected,notsamplesbeingclassiﬁed.Thethresholdistakentobealwaysnon-negative.Then,givensamplingdistributionsp0andp1,eachofβandαisafunctionofthresholdτonly,asshowninthemaintext(Equations12and13respectively).TheareaundertheROCcurve(AUC)isgivenAUC=

Theboundsontheintegralcanbethought󰀇by:

βdα(B.1)0

ofasgoingfrom‘strict’to‘lenient’

-thus,afalsepositiverateofzero(thelowerboundintermsofα)correspondstoathresholdof+∞,whileafalsepositiverateofone,correspondstoathresholdofzero.Sincedα=α󰀄(τ)dτwecanchange󰀇variabletoτasfollows:

AUC=β(τ)α󰀄(τ)dτ(B.2)

∞

Now,α󰀄(τ)andβ(τ)canbeexpressedintermsofp0andφ1

α󰀄(τ)=−p0(τ)−p0(−τ)(B.3)β(τ)=1−φ1(τ)+φ1(−τ)

(B.4)

SubstitutingAUC=󰀇theseexpressionsintotheintegral:

(1−φ1(τ)+φ1(−τ))(−p0(τ)−p0(−τ))dτ

∞

=−󰀇

(1−φ1(τ)+φ1(−τ))p0(τ)dτ−󰀇

(1−φ1(τ)+φ1(−τ))p0(−τ)dτ

∞∞=󰀇∞(1−φ1(τ)+φ1(−τ))p0(τ)dτ

󰀇∞+(1−φ1(τ)+φ1(−τ))p0(−τ)dτ(B.5)

Ifp0issymmetricabout0AUC=2

󰀇(asisthecaseformostcommonrankingfunctions):

∞

(1−φ1(τ)+φ1(−τ))p0(τ)dτ(B.6)0

Since(1−φ1(|u|)+φ1(−|u|))isanevenfunctionofu,BSA,asgivenbyEquation

6,canbere-writteninthefollowing=2󰀇form:

∞

BSA(1−φ1(|u|)+φ1(−|u|))p0(u)du(B.7)

0=2󰀇∞(1−φ1(u)+φ1(−u))p0(u)du(B.8)

ThislatterexpressionisidenticaltoAUC.

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

16MukherjeeandRoberts

Acknowledgments

SMgratefullyacknowledgesthesupportoftheU.K.BiotechnologyandBiologicalSciencesResearchCouncil,theRoyalAcademyofEngineeringandSt.John’sCol-legeOxford;thanksalsotoSarahGurr,NickHughesandPeterSykacekforhelpfuldiscussions.References

1.D.J.Lockhart,H.Dong,M.C.Byrne,M.T.Follettie,M.V.Gallo,M.S.Chee,M.Mittmann,C.Wang,M.Kobayashi,H.Horton,andE.L.Brown.Expressionmonitoringbyhybridizationtohigh-densityoligonucleotidearrays.NatureBiotech-nology,14:1675–80,1996.

2.M.Schena,D.Shalon,R.W.Davis,andP.O.Brown.QuantitativemonitoringofgeneexpressionpatternswithacomplementaryDNAmicroarray.Science,270(5235):467–70,1995.

3.R.J.Cho,M.J.Campbell,E.A.Winzeler,L.Steinmetz,A.Conway,L.Wodicka,T.G.Wolfsberg,A.E.Gabrielian,D.Landsman,D.J.Lockhart,andR.W.Davis.Agenome-widetranscriptionalanalysisofthemitoticcellcycle.MolCell,2:65–73,1998.

4.F.C.Holstege,E.G.Jennings,J.J.Wyrick,T.I.Lee,C.J.Hengartner,M.R.Green,T.R.Golub,E.S.Lander,andR.A.Young.Dissectingtheregulatorycircuitryofaeukaryoticgenome.Cell,95(5):717–28,1998.

5.T.R.Golub,D.K.Slonim,P.Tamayo,C.Huard,M.Gaasenbeek,J.P.Mesirov,H.Coller,M.L.Loh,J.R.Downing,M.A.Caligiuri,C.D.Bloomﬁeld,andE.SLander.Molecularclassiﬁcationofcancer:classdiscoveryandclasspredictionbygeneexpressionmonitoring.Science,286(5439):531–37,1999.

6.I.Hedenfalk,D.Duggan,Y.Chen,M.Radmacher,M.Bittner,R.Simon,P.Meltzer,B.Gusterson,M.Esteller,O.P.Kallioniemi,B.Wilfond,A.Borg,andJ.Trent.Gene-ExpressionProﬁlesinHereditaryBreastCancer.NewEnglJMed,344(8):539–48,2001.

7.P.BaldiandA.D.Long.ABayesianFrameworkfortheanalysisofmicroarrayexpres-siondata:regularizedt-testandstatisticalinferencesofgenechanges.Bioinformatics,17(6):509–19,2001.

8.A.Ben-Dor,N.Friedman,andZ.Yakhini.Scoringgenesforrelevance.TechnicalReport2000-38,SchoolofComputerScienceandEngineering,HebrewUniversity,2000.

9.W.Pan.Acomparativereviewofstatisticalmethodsfordiscoveringdiﬀerentiallyexpressedgenesinreplicatedmicroarrayexperiments.Bioinformatics,18(4):546–554,2002.

10.O.G.Troyanskaya,M.E.Garber,P.O.Brown,D.Botstein,andR.B.Altman.

Nonparametricmethodsforidentifyingdiﬀerentiallyexpressedgenesinmicroarraydata.Bioinformatics,18(11):1454–61,2002.

11.V.Tusher,R.Tibshirani,andG.Chu.Signiﬁcanceanalysisofmicroarraysappliedto

theionizingradiationresponse.PNatlAcadSciUSA,98(9):5116–21,2001.

12.M.H.DeGrootandM.J.Schervish.ProbabilityandStatistics.AddisonWesley,3rd

edition,2002.

13.P.Broberg.Statisticalmethodsforrankingdiﬀerentiallyexpressedgenes.Genome

Biol,4(6),2003.

14.I.LonnstedtandT.P.Speed.Replicatedmicroarraydata.StatSinica,12:31–46,2002.

December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised

ATheoreticalAnalysisoftheSelectionofDiﬀerentiallyExpressedGenes17

15.N.Johnson,S.Kotz,andN.Balakrishnan.ContinuousUnivariateDistributions,vol-ume1.JohnWileyandSons,NewYork,1994.

16.S.Mukherjee.SamplingDistributionoftheSAM-statistic.TechnicalReportPARG-04-01,DepartmentofEngineeringScience,UniversityofOxford,2004.Availableathttp://www.robots.ox.ac.uk/∼parg/publications.html.

17.S.MukherjeeandS.J.Roberts.ATheoreticalAnalysisofGeneSelection.InProceed-ingsoftheIEEEComputerSocietyBioinformaticsConference.IEEEPress,2004.18.X.CuiandG.A.Churchill.StatisticaltestsfordiﬀerentialexpressionincDNAmi-croarrayexperiments.GenomeBiol,4(210),2003.

19.S.Dudoit,J.P.Shaﬀer,andJ.C.Boldrick.Multiplehypothesistestinginmicroarray

experiments.StatSci,18(1):71–103,2003.

SachMukherjeeisadoctoralcandidateinInformationEngi-neeringattheDepartmentofEngineeringScience,UniversityofOxford.BeforecomingtoOxford,hetookaMastersdegreeatCambridgeUniversity,specializinginspeechrecognition,andanundergraduatedegreeinComputerScienceattheUniversityofYork,U.K.

Hisresearchinterestslieinthedevelopmentandapplicationofmachinelearningandstatisticalmethodstoproblemsinbioinformatics,andinparticulartheanalysisofgenemicroarraydata.

StephenJ.RobertsisProfessorofEngineeringScienceattheUniversityofOxfordandaFellowofSomervilleCollege.HestudiedPhysicsatOxfordasanundergraduate,andwentontotakeaD.Phil.inEngineeringSciencein1991.HewasappointedtothefacultyatImperialCollege,UniversityofLondonin1994andmovedtoOxfordin1999.

ProfRoberts’mainareaofresearchliesinmachinelearningapproachestodataanalysis,andinparticulartoproblemsinmathematicalbiology.HerunsthePat-ternAnalysisandMachineLearningResearchGroup,andhiscurrentworkfocusesonBayesianstatistics,graphicalmodels,independentcomponentanalysisandin-formationtheory.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文

全部栏目

Journal of Bioinformatics and Computational Biology c ○ Imperial College Press A THEORETIC