JournalofBioinformaticsandComputationalBiology
cImperialCollegePressATHEORETICALANALYSISOFTHESELECTIONOF
DIFFERENTIALLYEXPRESSEDGENES
SACHMUKHERJEE&STEPHENJ.ROBERTS
DepartmentofEngineeringScienceUniversityofOxford,OxfordOX13PJ
UnitedKingdom
{sach,sjrob}@robots.ox.ac.uk
Received(1September2004)Revised(1December2004)Accepted(DayMonthYear)
Agreatdealofrecentresearchhasfocusedonthechallengingtaskofselectingdifferen-tiallyexpressedgenesfrommicroarraydata(‘geneselection’).Numerousgeneselectionalgorithmshavebeenproposedintheliterature,butitisoftenunclearexactlyhowthesealgorithmsrespondtoconditionslikesmallsamplesizesordifferingvariances.Choos-inganappropriatealgorithmcanthereforebedifficultinmanycases.Inthispaperweproposeatheoreticalanalysisofgeneselection,inwhichtheprobabilityofsuccessfullyselectingdifferentiallyexpressedgenes,usingagivenrankingfunction,isexplicitlycal-culatedintermsofpopulationparameters.Thetheorydevelopedisapplicabletoanyrankingfunctionwhichhasaknownsamplingdistribution,oronewhichcanbeapproxi-matedanalytically.Incontrasttomethodsbasedonsimulation,theapproachpresentedhereiscomputationallyefficientandcanbeusedtoexaminethebehaviorofgenese-lectionalgorithmsunderawidevarietyofconditions,evenwhenthenumberofgenesinvolvedrunsintothetensofthousands.Theutilityofourapproachisillustratedbycomparingthreewidely-usedgeneselectionmethods.Keywords:microarrays,differentialanalysis,geneselection
1.Introduction
Theadventofmicroarraytechnology1,2hasmeantthattranscriptionalresponsestochangesincellularstatecannowbequantifiedforthousandsofgenesinasingleexperiment.Inrecentyears,anenormousamountofworkhasbeendoneinthisarea,addressingquestionsrelatingtobothnormalcellfunction3,4anddisease5,6.
Microarrayexperimentsaremostoftenperformedwiththeaimofcomparingexpressionlevelsbetweentissuesintwoormoreconditionsofinterest,suchaswild-typeandmutant,orhealthyanddiseased.Geneswhichareup-ordown-regulatedbetweenconditionsaresaidtobe‘differentiallyexpressed’.Althoughdifferentiallyexpressedgenesinmicroarraystudiesarenotalwaysbiologicallyrelevant,theseanalysesplayavitalroleinnarrowingthefieldforfurtherresearch.Ineffect,exper-imentsofthiskindprovidegeneticistswithashort-listofgenesworthcommitting
1
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
2MukherjeeandRoberts
hard-wonfundstowardsinvestigating.Theselectionofdifferentiallyexpressedgeneshasthereforeemergedasoneofthemostimportanttasksinmicroarraydataanal-ysis.
However,themassivedimensionalityandrelativelysmallnumberofsamplesainmicroarraydatasets,coupledwithvariabilityinherentinbothexperimentalpro-cessandunderlyingbiology,maketheiranalysisstatisticallychallenging.Awidevarietyofalgorithms,includingconventionalandnon-parametrichypothesistests,aswellasBayesianandinformation-theoreticmethods7,8,9,10,11havebeenappliedtomicroarraydata.Yetitisoftenunclearhowaparticularalgorithmwillrespondtospecificstatisticalpropertiesofthedata.Forexample,inrecentmonthstherehasbeensomediscussioninthebioinformaticscommunityregardingthesuitabilityofthet-testwhenvariancesdifferacrossgenesinasystematicmanner.Thisissue,whichwewillcallthe‘t-testvarianceissue’,isdistinctfromtheBehrens-Fisherproblem12ofvariancesdifferingacrossclasses.Ifdifferentiallyexpressedgenesareexpectedtohavehighervariancesthanothergenes,shouldthet-testbeabandonedinfavorofadifferentanalysis,ordoesthet-statisticimplicitlydealwiththeissueanyway?Questionssuchasthishaveseriousimplicationsforusersofgeneselectionalgorithms,yetcanbedifficulttoresolve.
Thispaperpresentsatheoreticalstudyofgeneselectioninwhichtheprobabilityofselectinggenescorrectly,usingagivenalgorithm,isderivedfrompopulationparametersb,samplesizesandthenumberofgenesunderconsideration.Whilepopulationparametersareneverknowninpractice,understandingexactlyhowagivenalgorithmrespondstovariousconditionsiscrucialtoidentifyingitsstrengthsandweaknesses.Knowingthattheperformanceofaparticularalgorithmbreaksdownatsmallsamplesizes,orwhenvariancesdifferwidely,makesitpossibletomakeaninformedchoiceofmethod,inaccordancewithpriorknowledgeaboutaspecificexperimentalset-up.
Existingworkonthecomparisonofgeneselectionalgorithmsislargelybasedonsimulation.Simulation-basedproceduresestimateerrorratesempirically,byapply-ingagivenalgorithmtodatadrawnunderaspecifiedmodel.AcommonapproachistoproduceROCcurves(i.e.plotsoftruepositiveagainstfalsepositiveratesacrossarangeofthresholds)usingsimulateddatatoestimatetherequirederrorrates13.However,thesheersizeofmicroarraydatasetslimitsthescopeofstudiesofthiskind,asonlyasmallsubsetofconditionscanpossiblybeexplorediftheeffectsofcomparingthousandsofgenesaretobeaccountedfor.Forexample,consideraddressingthe‘t-testvarianceissue’byplottingtheareaundertheROCcurveagainsttheratioofvariancesofdifferentiallyexpressedgenestonon-differentiallyexpressedgenes(‘varianceratio’).Ifatotalof10000geneswereunderconsideration,
aWe
usesampleanddimensioninthefollowingway:eachgeneisadimensionofthedata;eachtissueorarrayisasample.
bBy‘populationparameters’,wemeanthetruevaluesoftheparameterswhichspecifythestatis-ticalmodelforthedata.
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
ATheoreticalAnalysisoftheSelectionofDifferentiallyExpressedGenes3
obtainingaccurateROCestimateswouldrequirethesamplingofseveralhundred10000-dimensionaldatasetsforeachvalueofthevarianceratio.Incontrast,thetheoreticalapproachpresentedinthispapercomputesmeasuresofaccuracydi-rectlyfromthesamplingdistributionsofgenerankingfunctions.Asaconsequence,theanalysisallowsuserstorapidlyassessthelikelyeffectsofvariousconditionsonalgorithmperformance.
Ourinitialanalysisisasimpletwo-genescenario:givendatafromtwogenes,onlyoneofwhichisdifferentiallyexpressed,howlikelyisagivenalgorithmtoselecttherightgene?Wethenextendtheanalysistomultiplegenes.Anillustrativesetofresultsispresentedwhichexaminestheperformanceofthreewell-knowngeneselectionmethodsundervariousconditions.Wediscussourapproach,relatingittotheexistingliterature,andfinallylookatsomepossibilitiesforfuturework.2.Generankingandselection
Considermicroarrayslides(orchips)belongingtotwoclasses,e.g.healthyanddiseased.DataDforaparticulargeneconsistofmexpressionlevelsinoneclass,andnintheother:
D=[X1X2...XmY1Y2...Yn]
(1)
Xiareindependentandidenticallydistributed(i.i.d.)randomvariableswithtruemeanµX;Yjarealsoi.i.d.withtruemeanµY.Ifthetrueclassmeansaredistinct,thegeneissaidtobedifferentiallyexpressed.Inthelanguageofhypothesistesting,thenullandalternativehypotheses(H0andH1respectively)are:
H0→µX=µY(2)H1→µX=µY
(3)
WerefertogenessatisfyingthenullhypothesisH0asnullgenesandthosesatisfyingH1asalternativegenes.Alternativegenesarethereforesimplythosegeneswhicharetrulydifferentiallyexpressedandnullgenesthosewhicharenot.
Geneselectionalgorithmsfirstrankgenesunderafunctionr(therankingfunc-tion)andthenselectasetofgenesastheresultoftheanalysis.Therankingfunctionmayscoreeachgeneindividually(asisthecase,forexample,withthet-statistic),orprocessallofthedatatoassignscorestothegenes(SAM11,forexample,usestheentiredatasettodeterminearegularizingterm).Therankingofgenesunderthefunctionrproducesanorderedlistofallthegenesintheexperiment.Asetofgenesisthenobtainedbyeitherimposinganexplicitthresholdonthescores,orbyselectingacertainnumberoftop-rankedgenes.ThecriticalvalueobtainedfromaP-valueisafamiliarexampleofanexplicitthreshold.3.Theory
InthisSectionwederiveprobabilitiesofsuccessingeneselection.Thematerialpresentedbelowisnecessarilysomewhattechnical,butthequestionbeingaddressed
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
4MukherjeeandRoberts
isstraightforward:wesimplywishtocalculatetheprobabilityofsuccessusingagivengenerankingfunction,intermsofpopulationparameters,samplesizeandnumbersofgenesinvolved.Inotherwords,wewishtounderstandexactlyhowwellagivenrankingfunctionisexpectedtodounderspecifiedconditions.3.1.BinarySelectionAccuracy
Letusstartbycomparingjusttwogenes,oneofwhichisdifferentiallyexpressed(thealternativegene)andtheothernot(thenullgene).Weassumethatdataisdrawnaccordingtosomemodel,withparametersθ1forthealternativegene,andθ0forthenullgene(wefollowtheconventionthatsubscript‘1’referstothealternativegeneand‘0’tothenullgene).Forexample,themodelmightbeNormal,inwhichcaseeachparametervectorwouldcontainclassmeansandvariances.Thedataforeachgenearethenacollectionofrandomvariablesdrawnundereitherthenulloralternativemodels,andaredenotedbyD0andD1respectively.TherankingfunctionrmapsdataforthenullandalternativegenestorankingscoresT0andT1respectively:
T0=r(D0);
T1=r(D1)
(4)
Adecisionismadeinthistwo-genecasebysimplycomparingthetwoscores.Assumingthathigherabsoluterankingscorescorrespondtoagreaterchanceofdifferentialexpression,theprobabilityofchoosingthecorrectgeneissimply:
P(|T0|<|T1|)
(5)
TheprobabilitygivenbyEq.(5)representstheabilityofthefunctionrtodiscriminatebetweennullandalternativegenes.WecallthisprobabilityBinarySelectionAccuracyorBSA.
NotethatrankingscoresT0andT1arefunctionsofrandomdataandarethem-selvesrandomvariables;lettheirsamplingdistributionscbep0andp1respectively,andthecorrespondingcumulativedistributionfunctionsbeφ0andφ1.ThentheprobabilitygiveninEq.(5)canbeexpressedP(|T0|<|T1|)=asfollows:
∞
P(|T1|>|u|)p0(u)du
=
−∞∞
(1−φ1(|u|)+φ1(−|u|))p0(u)du−∞
=1−φ1(|u|)+φ1(−|u|)p0(6)
Where·p0denotesexpectationunderthedensityp0.Forfunctionsrwhichare
validteststatistics,thetermP(|T1|>|u|)isthepowerfunctionatthresholdu,andtheprobabilitygivenaboveiseffectivelyanintegraloverthepowerfunction.
cThe
samplingdistributionofafunctionofrandomvariables(i.e.astatistic)issimplythedistri-butionofthestatistic.Forexample,thesamplingdistributionofarankingfunctioncanbethoughtofasthedistributionofvaluesthatwouldemergefromrepeatedlycomputingrankingscoresfromdatasampledunderthemodel,withpopulationparametersandsamplessizesremainingfixed.
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
ATheoreticalAnalysisoftheSelectionofDifferentiallyExpressedGenes510.75Diff. of meansDiff. of meansyt−statisticyt−statisticcca0.9ra0.7urcucccAA n0.80.65oniotitcceleS0.7eleS0.6 yyrarnaiB0.6niB0.550.5681012141618200.50.511.52Total sample size (m+n)σ21/σ20Fig.1.BinarySelectionAccuracy.Theleftpanelshowsthetrueprobabilityofcorrectlyorderingtwogenes(‘BinarySelectionAccuracy’orBSA)plottedagainsttotalsamplesize,fortwodifferentrankingfunctions,withbothnullandalternativevariancessetto7.5.TherightpanelshowsBSAplottedagainsttheratioofthevarianceσ21ofthedifferentiallyexpressedoralternativegenetothevarianceσ20ofthenullgene.Here,σ20isfixedat7.5,andσ21allowedtovary.Inbothpanelstotalsamplesizeis8andthetruedifferenceofmeansis2.Thedensitiesp0andp1playtheroleoflinkingBinarySelectionAccuracytopopulationparametersandsamplesizes.ThisbecomesclearifweexplicitlyrewritetheprobabilityinEq.(5)asafunctionofpopulationparametersandsamplesize:P(|T0|<|T1|)=f(θ0,θ1,m,n)
∞|u|
−|u|=1−
p(T1=v|θ1,m,n)dv−=u|θp(T1=w|θ1,m,n)dw
−∞
−∞
−∞
p(T00,m,n)du
(7)
BinarySelectionAccuracyisthusderiveddirectlyfromthesamplingdistribu-tionsp0andp1andcanberegardedasatruemeasureoftheabilityoftheranking
functiontodistinguishnullandalternativegenes,onthebasisofdatadrawnunderthemodel.Wehaveconsideredabsolutevaluesofrankingscorestomaketheanal-ysisdirectlyapplicabletocommonfunctionslikethet-statistic,Fisherratioanddifferenceofmeans;however,therelevantEquationscanbeeasilymodifiedforthesignedcase.NotealsothatwhiletheexpectationinEq.(6)willnotingeneralbeanalytic,itcanbecalculatedeasilybysamplingfromp0.
Interestingly,itturnsoutthatBSAisequivalenttotheareaunderatrueROCcurve.By‘trueROCcurve’wemeanacurvederiveddirectlyfromthesamplingdistributionsoftherankingfunction;thefamiliarROCcurveobtainedbysimulationcanbethoughtofasanestimateofthistruecurve.SeeAppendixBfordetailsandashortproof.
Figure1showstheresultsofusingBinarySelectionAccuracytocomparethe
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
6MukherjeeandRoberts
t-statisticanddifferenceofmeansmethods.Ofparticularinterestistherightpanel,whichshowsthat,atleastasfarasthetwo-genecaseisconcerned,highervariancesforthedifferentiallyexpressedgenereallydohandicapthet-test.3.2.MultipleSelectionAccuracy
Consideramoregeneralscenariowithatotalofggenes,ofwhichg1aretrulydifferentiallyexpressed(alternativegenes)andg0arenot(nullgenes).Table1summarizestherelationshipsbetweenthesegroupsofgenes.Thequestionwewishtoaskisthis:givenpopulationparameters,samplesizesandnumbersofnullandalternativegenes,whatistheprobabilitythatoneofthealternativegenesisrankedinthetopsplacesundertherankingfunction?Togetaquantitativeviewofthepracticaldifferencebetweenalgorithms,wewouldalsoliketoknowthedistributionoverthenumberofalternativegenesreturnedamongthesgenesselected.
Wetreatthenumberofgenestobeselectedasaconstant.Thisnumbertendstodependonfactorslikethescaleoftheexperiment,follow-upplansandsoon14andcanreasonablyberegardedaspartoftheexperimentalset-up,inmuchthesamewayasthetotalnumberofgenesunderstudy,orthesamplesize.Incontrast,scorethresholds,whilestatisticallyconvenient,havenoclearinterpretationinthecontextoftheexperiment.Ageneticistmayhaveanapriorinotionofthenumberofgenesshewantstoselect,butwillnotgenerallycareabouttheactualscores,beyondtheirrankorder.Itthereforemakessensetocenterananalysisofmultiplegeneselectionaroundnumberofgenesselected,ratherthanthreshold.
Table1.Summarytableformultiplegeneselection.
Alternativegenes
s1g1−s1g1Nullgenes
s0g0−s0g0(Total)sg−sg
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
ATheoreticalAnalysisoftheSelectionofDifferentiallyExpressedGenes7
beaprobabilisticrelationshipbetweenthethresholdandnumberofgenesselected.Now,atagiventhresholdτ,theprobabilityofanalternativegenebeingselectedbythealgorithmissimplythetruepositiverateatτ,denotedbyβ(τ).Then,theprobabilityofanalternativegenebeingselectedamongthetopsgenesissimplytheexpectationofthetruepositiverateβ(τ),undertheconditionaldensityp(τ|s).WecallthisprobabilityMultipleSelectionAccuracyorMSA:
MSA=
∞
β(τ)p(τ|s)dτ0
=β(τ)p(τ|s)
(8)
3.2.2.Theconditionalp(τ|s)
Butwhatistheconditionaldistributionofthresholdτgivens,andhowcantheex-pectationinEq.(8)becomputedintermsofsamplingdistributionsandnumbersof
genes?Wesuggestthefollowingunnormalizeddensityfunctionasanapproximationtop(τ|s)(thefullderivationispresentedinAppendixA):
p(τ|s)∝
1
22πv(τ)
e−[s−m(τ)]/[2v(τ)](9)Where,m(τ)andv(τ)arefunctionsofthresholdτandaregivenby:
m(τ)=g1β(τ)+g0α(τ)
(10)v(τ)=g1β(τ)(1−β(τ))+g0α(τ)(1−α(τ))
(11)
Thetrueandfalsepositiveratesβ(τ)andα(τ)canbewrittenintermsofφ0andφ1asfollows:
β(τ)=1−φ1(τ)+φ1(−τ)(12)α(τ)=1−φ0(τ)+φ0(−τ)
(13)
Notethatiftherankingfunctionrisavalidteststatistic,β(τ)isthepower,andα(τ)theP-value,atthresholdτ.
Takentogether,Eq.(8)throughEq.(13)giveMultipleSelectionAccuracyintermsofp0,p1,φ0,φ1,g0,g1,ands;theexpectationβ(τ)p(τ|s)iscomputedbysamplingfromthedensityinEq.(9).Foragivenrankingfunction,thesamplingdistributionsp0andp1andtheircorrespondingcumulativedistributionfunctionsdependonlyonpopulationparametersθ0andθ1andsamplesizesmandn.WehavethereforeexpressedMultipleSelectionAccuracyinthedesiredform.
ThedistributionoverthenumberofalternativegenesdiscoveredisBinomial,with‘probabilityofsuccess’beingMSA×max(g1/s,1),andthenumberofBernoullitrialsthesmallerofsandg1s∼B:
MSA×max
g
11
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
8MukherjeeandRoberts
3.2.3.Relaxingthefixed-parameterassumption
Theanalysispresentedaboveassumesfixedparametersfornullandalternativegenes.Thisassumptioncanberelaxedbytreatingtheparametersofthemodelasquantitiestobeintegratedout:
fMSA(π)p(π)dπ(15)
Where,πisavectorrepresentingalltheparametersandfMSArepresentsMul-tipleSelectionAccuracyasafunctionofπ.
Thedistributionp(π)wouldbechosentoreflectknowledgeaboutthemannerinwhichparametersofinteresttendtovary.Forexample,differencesinmeansareknowntobedistributedexponentiallyinmicroarraydatasets;anappropriatechoiceofp(π)couldcapturesuchvariation.Whilethisapproachisnotusedintheexperimentspresentedbelow,itrepresentsasimplemeansbywhichourresultscanbefurthergeneralized.4.Experiments
InthisSectionwepresenttheresultsofacasestudy,comparingthreewidely-usedgenerankingfunctionsundervariousconditions.4.1.Preliminaries
4.1.1.Modelandpopulationparameters
WeassumeaNormalmodelfor(log-transformed)expressiondata,butemphasizethattheframeworkdevelopedinthepreviousSectionisquitegeneral,andinprin-cipleallowsMultipleSelectionAccuracytobecalculatedunderanymodel.
Themultipleselectionanalysisassumesthatallalternativegeneshaveidenticallydistributedexpressionlevels,asalsoallnullgenes.Thepopulationparametersarethenasfollows:
•Means:Classmeansfornullgenesarebydefinitionidentical,andfortheanalysisofthealgorithmswehavechosen,neednotbeconsideredfurther.Wethereforeonlyexplicitlyrefertoclassmeansforthealternativegenes;thesearedenotedbyµXandµYrespectively.Inallexperiments,theextentofdifferentialexpression(i.e.|µX−µY|)issetto2.
•Variances:Weassumethatclassvariancesareidenticalforeithernulloralternativegenesbutmaydifferbetweenthem.Thepopulationvariance
foralternativegenesisdenotedbyσ21,andfornullgenesbyσ2
0;bydefault,bothvaluesaresetto1.
•Samplesize:Asbefore,thenumberofsamplesineachclassism(forthe‘X’data)andn(forthe‘Y’data);bydefault,bothmandnaresetto4.•Numberofgenes:Weassumeatotalof6000genes,ofwhich50aretrulydifferentiallyexpressed.Thenumberofgenesselectedis100.
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
ATheoreticalAnalysisoftheSelectionofDifferentiallyExpressedGenes9
Theparameterset-upcanbechoseninaccordancewiththespecificquestionsbeingaddressed.Forexample,inthiscase,variancesareconfiguredtolookintothe‘t-testvarianceissue’mentionedpreviously.4.1.2.Rankingfunctions
Thethreerankingfunctionsanalyzedherearebrieflyreviewedbelow.
t-statistic:Thet-testisacanonicaltwo-samplehypothesistestandhasbeenwidelyusedinthedifferentialanalysisofmicroarraydata.Therankingfunctionforaparticulargeneisat-statistic:
r(D)=
X
¯−Y¯(16)m+
12Where,X
¯andY¯representsamplemeansofdataSD
thepooledsamplestandarddeviation.SDineachofthetwoclasses,andThenullsamplingdistributionp0isat-distribution,with(m+n−2)degreesoffreedom.Thealternativedistributionp1isanon-centralt-distribution15,with(m+n−2)degreesoffreedomandnon-centralityparameterψ:
ψ=
µX−µY
1(17)2
m+
Differenceofmeans:Thesamplemeans,i.e.r(D)=X
¯rankingfunctionissimplythedifferencebetween
−Y¯.Thisfunctioncanberegardedasalog-spacefoldanalysis7.Bothp0andp1areNormaldensities:
p0=N(0,σ20/m+σ2
0/n)(18)p1=N(µX−µY,σ21/m+σ21/n)
(19)
SAMstatistic:TheSAMstatistic11isgivenby:
r(D)=X
¯−Y¯12Where,SD
isthepooledstandarddeviationandK(20)
m
+SD+Karegularizingterm.ThetermKisdeterminedfromtheentiredataset,sofromthepointofviewofderivingasamplingdistribution,isitselfarandomvariable.ThesamplingdistributionoftheSAMstatisticcanbeapproximatedasaratioofNormals16;wemakeuseofthatapproximationhere,andrefertheinterestedreadertothereference.4.2.Results
Resultsforthethreealgorithmsunderconsiderationarepresentedbelow,withMultipleSelectionAccuraciesplottedagainstparametersofinterest.Alsoshownaredistributionsoverthenumberoftrulydifferentiallyexpressedgenesdiscovered;
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
10MukherjeeandRoberts
Diff. of means0.20.2y1cSAMDiff. of meansDiff. of meansarut−statisticSAMSAMcc0.8A0.15t−statistic0.15t−statistic noit))c0.611esl(eP0.1s(P0.1S e0.4lpitluM0.20.050.0500.511.52000102030405001020304050σ2/σ210Number of differentially expressedgenes discovered (sNumber of differentially expressed1)genes discovered (s1)Fig.2.Unequalvariances.TheleftmostpanelshowsMultipleSelectionAccuracies,plottedagainsttheratioofvariancesofalternativetonullgenes.Nullvarianceσ20isfixedat1,whilealternativevarianceσ21isallowedtovary.Themiddlepanelshowsfulldistributionsoverthenumberofalternativegenesselected,whenalternativevarianceisrelativelylow(σ21=1December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
ATheoreticalAnalysisoftheSelectionofDifferentiallyExpressedGenes11
10.2Diff. of meansySAMcaru0.80.15t−statisticccA no0.6)1itsc(0.1ePleS0.4 elpit0.05luM0.2Diff. of meansSAMt−statistic00810121416182001020304050Total sample size (m+n)Number of differentially expressedgenes discovered (s1) Fig.3.Samplesize.TheleftpanelshowsMultipleSelectionAccuracyplottedagainstsamplesize.Therightpanelshowsdistributionsoverthenumbers1oftrulydifferentiallyexpressedgenesdiscovered,whentherearefourarraysineachclass(i.e.whentotalsamplesizeis8).Thetotalnumberofgenesis6000,ofwhich50aretrulydifferentiallyexpressedand100areselected;bothnullandalternativevariancesaresetto1,andthetruedifferenceofmeansis2.below),butpersistsathighersamplesizestoo17.Thisisduetothefactthatwhennullandalternativevariancesareidentical,thestandarddeviationterminthede-nominatorofthet-statisticoffersnodiscriminatorypower,butmustnonethelessbeestimatedfromdata.Ineffect,undertheseconditions,thestandarddeviationtermisredundantintermsoftellingnullandalternativegenesapart(butnonethelessplaysaroleinmakingthefunctionavalidteststatistic).
II.Howsuccessfullycanthealgorithmscopewithsmallsamplesizes?
Figure3showsMultipleSelectionAccuracyasafunctionoftotalsamplesize.DifferenceofmeansandSAMdonoticeablybetterthanthet-statistic,especiallyatsmallersamplesizes.TherightpanelofFigure3showsdistributionsoverthenumberofalternativegenesdiscoveredwhentherearefoursamplesineachclass.
Thesmallsamplebehaviorofthet-statisticisduetothestandarddeviationterminthedenominatorofthestatistic,andhasbeenmentionedpreviouslyinthecontextofgeneselection18.SAMisfarlesssensitiveinthisregardonaccountofitsregularizingterm.Indeed,SAM’sregularizingtermmakesthebehaviorofitssamplingdistributionquitedifferentfromthet-distribution,inparticularatsmallsamplesizesandundertheconditionsofdifferingvariancediscussedabove.AstheMSAcurvesshow,theconsequenceofthisdifferenceisthatSAMisconsiderablymorerobustundersuchconditionsthanthet-statistic.III.HowisMSAaffectedbythetotalnumberofnullgenes?
Microarraydatasetstypicallycontainordersofmagnitudemorenullgenesthanalternativeones.Howdotheperformancesofthethreealgorithmsscalewiththe
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
12MukherjeeandRoberts1)ASM(0.25 y0.8carutcA0.2c0.6SA Mn−od0.15itAcSe0.4leM0.1S elp0.2Diff. of meansit0.05luSAMMt−statistic02000400060008000100000200040006000800010000non−differentially expressed genes (gNumber of null or0)non−differentially expressed genes (gNumber of null or0)Fig.4.Totalnumberofnullgenes.TheleftpanelshowsMultipleSelectionAccuracyorMSAplottedagainstthenumberg0ofnullgenes,withthenumberofalternativegenesandthenumberofgenesselectedheldconstantat50and100respectively.TherightpanelshowsthedifferenceinMSAbetweendifferenceofmeansandthet-statisticasafunctionofg0.Nullandalternativevariancesaresetto1,totalsamplesizeis8,andthetruedifferenceofmeansis2.numberofnullgenes?TheleftpanelofFigure4showsMultipleSelectionAccuracyasafunctionofthenumberg0ofnullgenes,withthenumberofalternativegenesandthenumberofgenesselectedbeingheldconstant.TherightpanelplotsthedifferenceinMultipleSelectionAccuracybetweendifferenceofmeansandthet-statisticagainstg0.Variancesfornullandalternativegenesareidenticalintheseexperiments.AcloseexaminationoftherightpanelofFigure1revealsthatunderconditionsof
equalvariance(i.ewhentheratioσ21/σ2
0isunity)differenceofmeanshasaslightadvantageoverthet-statisticintermsofBSA.WecanseefromFigure4,however,thattheeffectofthisslightadvantagebecomesmorepronouncedasnullgenesbecomemorenumerous.Theseresultssuggestthatalgorithmchoicemaybeanespeciallycriticalissuewhensmallnumbersofgenesofinterestliehiddeninverylargedatasets.
5.Discussionandconclusions
Theaimofthispaperhasbeentounderstandhowgeneselectionalgorithmsbehave-anddifferintheirbehavior-invariousregionsofparameterspace.Thereisnowavastliteratureonthestatisticalandcomputationalaspectsofgeneselection,andouranalysishasconnectionsto,anddifferencesfrom,multipletestingandpoweranalysis.ThecloserelationshipbetweenBSAandROCcurveshasbeendescribedindetailaboveandwillnotbediscussedfurtherinthisSection.
Attheoutset,wedescribedgeneselectionasatwo-stageprocess,withgenesfirstbeingrankedunderafunctionandasetofgenessubsequentlyselected.Mul-tipletestingproceduresforgeneselection19seektoadjustP-valuestoproperly
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
ATheoreticalAnalysisoftheSelectionofDifferentiallyExpressedGenes13
accountformultiplecomparisons.Theseproceduresarethereforeconcernedessen-tiallywiththesettingofthresholdsinthesecondstageofgeneselection.Incontrast,thispaperhassoughttoexaminethepropertiesofrankingfunctionsthemselves.Thus,althoughthenotionofMultipleSelectionAccuracyisrelatedtoFalseDiscov-eryRate(FDR)d,ouranalysisaddressesafundamentallydifferentquestionfromFDRmethodsandotherP-valueadjustments.Thedistinctaimsofouranalysisleadalsotoaquitedifferentformulation.FDRmethodscalibratethresholdsforagivendatasetandrankingfunction,sotheygenerallyapproximatethe‘true’FDRvaluewithoutusingpopulationparameters.Incontrast,ourcalculationofMultipleSelectionAccuracyisdoneexplicitlyintermsofpopulationparameters.
Theabilityofarankingfunctionto‘tellapart’nullandalternativegenesisofcriticalimportancetoaccurategeneselection.Thisistrueregardlessofthespecificapproachtakentothesubsequentselectionofasetofgenes.Wethereforesuggestedaverysimpleanalysistoquantifythediscriminatoryabilityofrankingfunctions,resultingtheformulationofBinarySelectionAccuracy.Theabilitytodiscriminatemustdependonbothnullandalternativesamplingdistributions,andindeedthemathematicalexpressionforBinarySelectionAccuracyinEq.(6)makesuseofbothdistributions.Incontrast,P-valuesarecomputedfromthenulldistributionalone,andstatisticalpowerfromthealternativedistribution.Ourtreatmentoferrorsandalgorithmcomparisonisconsequentlyquitedifferentfromtheconventionalapproach.ClassicaltestingtheoryfirstfixesanacceptablelevelofTypeIerror(viatheP-value)andthenseekstomaximizepower,whereasBinarySelectionAccuracyisdefinedwithoutreferencetoanyspecificthresholdorP-value.Furthermore,inouranalysisabalancebetweenTypeIandTypeIIerrorsemergesnaturally,anddoesnotneedtobeexplicitlyspecified.
Onelimitationofourapproachconcernsthevalidityoftheapproximationswehaveusedatverysmallsamplesizes.Standardapproximationsforthenon-centralt-distributiondependonmoderatesamplesizes15,asdoestheapproximationwehaveusedforthesamplingdistributionoftheSAMstatistic16.Ourmultiple-geneexperimentsthereforeassumedaminimumoffourarraysineachclass;wehaveempiricallyverifiedtheaccuracyoftheapproximationsusedundertheseconditions.However,inpractice,replicatedmicroarraystudiesmayinvolveasfewastwoarraysineachclass;furtherworkonsmallsampledistributionsforrankingfunctionsofinterestisthereforeneededtoextendourworktoapplytosuchconditions.WenotealsothattheNormalapproximationtotheBinomialwhichisusedinderivingtheconditionaldistributionofthresholdgivennumberofgenesselectedorP(τ|s)meansthatwecannotcurrentlyconsiderthecasewherenumberofgenesselectedfallsbelow∼25.
Ouranalysisiscenteredaroundthesamplingdistributionsofgenerankingfunc-dLet
‘TrueDiscoveryRate’orTDRbedefinedinanalogytoFDR,astheexpectedproportionofalternativegenesamongthoseselected.IfTDR(s)representsthepopulationTDRwhensgenesareselected,itcanbeshownthatMSA≈TDR(s)×s
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
14MukherjeeandRoberts
tions.Apracticalissuewiththisapproachisthattheanalysisofotherrankingfunctionswillrequiretherelevantsamplingdistributionstobederived,orapprox-imatedanalytically.Whilesuchderivationsmaynotalwaysbestraightforward,itcouldbearguedthatwithouttheinformationprovidedbysamplingdistributions,itisdifficulttogetatruepictureofthebehaviorofanalgorithmundervariousconditions,beyondnecessarilylimitedempiricalstudies.
Inconclusion,wehavepresentedatheoreticalanalysisofgeneselection,whichcanbeusedtothoroughlyexaminethebehaviorofselectionalgorithms.Theanalysiscanhighlightthestrengthsandweaknessesofalgorithms,andthushelpbioinfor-maticianschoosemethodsappropriatetoparticularexperimentalconditions.
AppendixA.Approximationtotheconditionalp(τ|s)
Asnotedinthemaintext,thegeneralrelationshipbetweenthresholdτandnumberofgenessisprobabilistic.ApplyingBayestheorem:
p(τ|s)=
P(s|τ)p(τ)
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
ATheoreticalAnalysisoftheSelectionofDifferentiallyExpressedGenes15
AppendixB.AreaundertheROCcurve
ThetrueROCcurveissimplyaparametriccurve(α(τ),β(τ)),parameterizedbyarankingthresholdτ,whereα(τ)andβ(τ)arerespectivelythefalsepositiveandtruepositiveratesgiventhresholdτ.Notethatingeneselectiontrueandfalsepositivesrefertothegenesbeingselected,notsamplesbeingclassified.Thethresholdistakentobealwaysnon-negative.Then,givensamplingdistributionsp0andp1,eachofβandαisafunctionofthresholdτonly,asshowninthemaintext(Equations12and13respectively).TheareaundertheROCcurve(AUC)isgivenAUC=
Theboundsontheintegralcanbethoughtby:
1
βdα(B.1)0
ofasgoingfrom‘strict’to‘lenient’
-thus,afalsepositiverateofzero(thelowerboundintermsofα)correspondstoathresholdof+∞,whileafalsepositiverateofone,correspondstoathresholdofzero.Sincedα=α(τ)dτwecanchangevariabletoτasfollows:
0
AUC=β(τ)α(τ)dτ(B.2)
∞
Now,α(τ)andβ(τ)canbeexpressedintermsofp0andφ1
α(τ)=−p0(τ)−p0(−τ)(B.3)β(τ)=1−φ1(τ)+φ1(−τ)
(B.4)
SubstitutingAUC=theseexpressionsintotheintegral:
0
(1−φ1(τ)+φ1(−τ))(−p0(τ)−p0(−τ))dτ
∞
0
=−
(1−φ1(τ)+φ1(−τ))p0(τ)dτ−
0
(1−φ1(τ)+φ1(−τ))p0(−τ)dτ
∞∞=∞(1−φ1(τ)+φ1(−τ))p0(τ)dτ
0
∞+(1−φ1(τ)+φ1(−τ))p0(−τ)dτ(B.5)
0
Ifp0issymmetricabout0AUC=2
(asisthecaseformostcommonrankingfunctions):
∞
(1−φ1(τ)+φ1(−τ))p0(τ)dτ(B.6)0
Since(1−φ1(|u|)+φ1(−|u|))isanevenfunctionofu,BSA,asgivenbyEquation
6,canbere-writteninthefollowing=2form:
∞
BSA(1−φ1(|u|)+φ1(−|u|))p0(u)du(B.7)
0=2∞(1−φ1(u)+φ1(−u))p0(u)du(B.8)
0
ThislatterexpressionisidenticaltoAUC.
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
16MukherjeeandRoberts
Acknowledgments
SMgratefullyacknowledgesthesupportoftheU.K.BiotechnologyandBiologicalSciencesResearchCouncil,theRoyalAcademyofEngineeringandSt.John’sCol-legeOxford;thanksalsotoSarahGurr,NickHughesandPeterSykacekforhelpfuldiscussions.References
1.D.J.Lockhart,H.Dong,M.C.Byrne,M.T.Follettie,M.V.Gallo,M.S.Chee,M.Mittmann,C.Wang,M.Kobayashi,H.Horton,andE.L.Brown.Expressionmonitoringbyhybridizationtohigh-densityoligonucleotidearrays.NatureBiotech-nology,14:1675–80,1996.
2.M.Schena,D.Shalon,R.W.Davis,andP.O.Brown.QuantitativemonitoringofgeneexpressionpatternswithacomplementaryDNAmicroarray.Science,270(5235):467–70,1995.
3.R.J.Cho,M.J.Campbell,E.A.Winzeler,L.Steinmetz,A.Conway,L.Wodicka,T.G.Wolfsberg,A.E.Gabrielian,D.Landsman,D.J.Lockhart,andR.W.Davis.Agenome-widetranscriptionalanalysisofthemitoticcellcycle.MolCell,2:65–73,1998.
4.F.C.Holstege,E.G.Jennings,J.J.Wyrick,T.I.Lee,C.J.Hengartner,M.R.Green,T.R.Golub,E.S.Lander,andR.A.Young.Dissectingtheregulatorycircuitryofaeukaryoticgenome.Cell,95(5):717–28,1998.
5.T.R.Golub,D.K.Slonim,P.Tamayo,C.Huard,M.Gaasenbeek,J.P.Mesirov,H.Coller,M.L.Loh,J.R.Downing,M.A.Caligiuri,C.D.Bloomfield,andE.SLander.Molecularclassificationofcancer:classdiscoveryandclasspredictionbygeneexpressionmonitoring.Science,286(5439):531–37,1999.
6.I.Hedenfalk,D.Duggan,Y.Chen,M.Radmacher,M.Bittner,R.Simon,P.Meltzer,B.Gusterson,M.Esteller,O.P.Kallioniemi,B.Wilfond,A.Borg,andJ.Trent.Gene-ExpressionProfilesinHereditaryBreastCancer.NewEnglJMed,344(8):539–48,2001.
7.P.BaldiandA.D.Long.ABayesianFrameworkfortheanalysisofmicroarrayexpres-siondata:regularizedt-testandstatisticalinferencesofgenechanges.Bioinformatics,17(6):509–19,2001.
8.A.Ben-Dor,N.Friedman,andZ.Yakhini.Scoringgenesforrelevance.TechnicalReport2000-38,SchoolofComputerScienceandEngineering,HebrewUniversity,2000.
9.W.Pan.Acomparativereviewofstatisticalmethodsfordiscoveringdifferentiallyexpressedgenesinreplicatedmicroarrayexperiments.Bioinformatics,18(4):546–554,2002.
10.O.G.Troyanskaya,M.E.Garber,P.O.Brown,D.Botstein,andR.B.Altman.
Nonparametricmethodsforidentifyingdifferentiallyexpressedgenesinmicroarraydata.Bioinformatics,18(11):1454–61,2002.
11.V.Tusher,R.Tibshirani,andG.Chu.Significanceanalysisofmicroarraysappliedto
theionizingradiationresponse.PNatlAcadSciUSA,98(9):5116–21,2001.
12.M.H.DeGrootandM.J.Schervish.ProbabilityandStatistics.AddisonWesley,3rd
edition,2002.
13.P.Broberg.Statisticalmethodsforrankingdifferentiallyexpressedgenes.Genome
Biol,4(6),2003.
14.I.LonnstedtandT.P.Speed.Replicatedmicroarraydata.StatSinica,12:31–46,2002.
December1,200417:46WSPC/INSTRUCTIONFILEjbcb˙revised
ATheoreticalAnalysisoftheSelectionofDifferentiallyExpressedGenes17
15.N.Johnson,S.Kotz,andN.Balakrishnan.ContinuousUnivariateDistributions,vol-ume1.JohnWileyandSons,NewYork,1994.
16.S.Mukherjee.SamplingDistributionoftheSAM-statistic.TechnicalReportPARG-04-01,DepartmentofEngineeringScience,UniversityofOxford,2004.Availableathttp://www.robots.ox.ac.uk/∼parg/publications.html.
17.S.MukherjeeandS.J.Roberts.ATheoreticalAnalysisofGeneSelection.InProceed-ingsoftheIEEEComputerSocietyBioinformaticsConference.IEEEPress,2004.18.X.CuiandG.A.Churchill.StatisticaltestsfordifferentialexpressionincDNAmi-croarrayexperiments.GenomeBiol,4(210),2003.
19.S.Dudoit,J.P.Shaffer,andJ.C.Boldrick.Multiplehypothesistestinginmicroarray
experiments.StatSci,18(1):71–103,2003.
SachMukherjeeisadoctoralcandidateinInformationEngi-neeringattheDepartmentofEngineeringScience,UniversityofOxford.BeforecomingtoOxford,hetookaMastersdegreeatCambridgeUniversity,specializinginspeechrecognition,andanundergraduatedegreeinComputerScienceattheUniversityofYork,U.K.
Hisresearchinterestslieinthedevelopmentandapplicationofmachinelearningandstatisticalmethodstoproblemsinbioinformatics,andinparticulartheanalysisofgenemicroarraydata.
StephenJ.RobertsisProfessorofEngineeringScienceattheUniversityofOxfordandaFellowofSomervilleCollege.HestudiedPhysicsatOxfordasanundergraduate,andwentontotakeaD.Phil.inEngineeringSciencein1991.HewasappointedtothefacultyatImperialCollege,UniversityofLondonin1994andmovedtoOxfordin1999.
ProfRoberts’mainareaofresearchliesinmachinelearningapproachestodataanalysis,andinparticulartoproblemsinmathematicalbiology.HerunsthePat-ternAnalysisandMachineLearningResearchGroup,andhiscurrentworkfocusesonBayesianstatistics,graphicalmodels,independentcomponentanalysisandin-formationtheory.
因篇幅问题不能全部显示,请点此查看更多更全内容