
MrBayes Phylogenetic Tree Analysis Tutorial.doc

Overview of MrBayes and Its Main Parameters
2012-02-20 19:33:36 | Category: Bioinformatics | Tags: software

I. Overview of MrBayes

1. About MrBayes

MrBayes is a free program for phylogenetic analysis by Bayesian inference. It requires input files in NEXUS ("*.nex") format, but it recognizes only the MrBayes block and the data matrix block of a NEXUS file. No other blocks are recognized, so blocks that PAUP needs but MrBayes does not must be deleted before running a MrBayes analysis. If the data matrix is interleaved, interleave=yes must be added to the format statement. The data can be nucleotide or amino acid sequences, restriction sites, or morphological characters coded as 0 and 1. Commands can be run as a batch or entered one at a time. To reduce errors, format your dataset strictly after the example data files distributed with the program.

2. Main MrBayes commands

To display the full list of commands, type "help" at the MrBayes prompt ">" and press Enter. To display the detailed description of a single command, type "help [command name]" and press Enter; for example, "help lset" describes the "Lset" command. Commands can be abbreviated as long as the abbreviation cannot be confused with another command; for example, "help lset" can be shortened to "hel lse".

II. Main MrBayes parameters

(1) The "delete" command

This command deletes taxa from the analysis, that is, it names the sequences in the dataset that are to be left out. The correct usage is:

    delete <name and/or number and/or taxset> ...

A list of the taxon names or taxon numbers (labelled 1 to ntax in the order in the matrix) or taxset(s) can be used. For example, the following:

    delete 1 2 Homo_sapiens

deletes taxa 1, 2, and the taxon labelled Homo_sapiens from the analysis.

(2) The "outgroup" command

This command assigns a taxon to the outgroup. The correct usage is:

    outgroup <number>/<taxon name>

For example, "outgroup 3" assigns the third taxon in the matrix to be the outgroup. Similarly, "outgroup Homo_sapiens" assigns the taxon "Homo_sapiens" to be the outgroup. Only a single taxon can be assigned to be the outgroup.

(3) The "charset" command
This command defines a character set. Note that you can use "." to stand in for the last character in the data matrix (e.g., charset 1-.\3). This option is best used not from the command line, but rather as a line in the mrbayes block of the data file. To define several character sets, use a separate "charset" line for each one.

The format for the charset command is:

    charset <name> = <character numbers>

For example, "charset first_pos = 1-720\3" defines a character set called "first_pos" that includes every third site from 1 to 720. The character set name cannot have any spaces in it. The backslash (\) followed by a number is a nifty way of telling the program to assign every third (or second, or fifth, or whatever) character to the character set.

(4) The "partition" command

This command allows you to specify a character partition. The format for this command is:

    partition <name> = <num parts>:<chars in first>, ...,<chars in last>

For example, "partition by_codon = 3:1st_pos,2nd_pos,3rd_pos" specifies a partition called "by_codon" which consists of three parts (first, second, and third codon positions). Here, we are assuming that the sites in each partition were defined using the charset command. You can specify a partition without using charset as follows:

    partition by_codon = 3:1 4 6 9 12,2 5 7 10 13,3 6 8 11 14

However, we recommend that you use the charsets to define a set of characters and then use these predefined sets when defining the partition. Also, it makes more sense to define a partition as a line in the mrbayes block than to issue the command from the command line.

(5) The "set" command

This command is used to set some general features of the model or program behavior. The correct usage is:

    set <parameter>=<value> ... <parameter>=<value>

The "set" command controls five parameters. In most cases the only one you need to set yourself is "partition".

① "partition": set partition=<partition id>

Set this option to a valid partition id, either the number or name of a defined partition, to enforce a specific partitioning of the data. When a data matrix is read in, a partition called "Default" is automatically created. It divides the data into one part for each data type. If you only have one data type, DNA for instance, the default partition will not divide up the data at all. The default partition is always the first partition, so "set partition=1" is the same as "set partition=default"; either command returns you to the original partitioning.

② "autoclose": set autoclose=<yes/no>

If autoclose is set to "yes", then the program will not prompt you during the course of executing a file. This is particularly useful when you run MrBayes in batch mode. The default is "autoclose=no".

③ "nowarnings": set nowarnings=<yes/no>

If nowarnings is set to yes, then the program will not prompt you when overwriting or appending an output file that is already present. If "nowarnings=no" (the default setting), then the program prompts the user before overwriting output files.

④ "quitonerror": set quitonerror=<yes/no>

If quitonerror is set to yes, then the program will quit when an error is encountered, after printing an error message. If quitonerror is set to no (the default setting), then the program will wait for additional commands from the command line after the error message is printed.
⑤ "autooverwrite": set autooverwrite=<yes/no>

When nowarnings is set to yes, by default MrBayes will overwrite output files. If autooverwrite=no, output will be appended if the output file already exists. The default is autooverwrite=yes.

(6) The "prset" command

This command sets the priors for the phylogenetic model. Remember that in a Bayesian analysis, you must specify a prior probability distribution for the parameters of the likelihood model. The prior distribution represents your prior beliefs about the parameter before observation of the data. This command allows you to tailor your prior assumptions to a large extent. The priors mainly cover three aspects: tree topology, branch lengths, and evolutionary rates.

Available options:

1) Tratiopr: This parameter sets the prior for the transition/transversion rate ratio (tratio).

    prset tratiopr = beta(<number>,<number>)
    prset tratiopr = fixed(<number>)

The program assumes that the transition and transversion rates are independent gamma-distributed random variables with the same scale parameter when beta is selected. If you want a diffuse prior that puts equal emphasis on transition/transversion rate ratios above 1.0 and below 1.0, then use a flat Beta, beta(1,1), which is the default. If you wish to concentrate this distribution more in the equal-rates region, then use a prior of the type beta(x,x), where the magnitude of x determines how much the prior is concentrated in the equal-rates region. For instance, a beta(20,20) puts more probability on rate ratios close to 1.0 than a beta(1,1). If you think it is likely that the transition/transversion rate ratio is 2.0, you can use a prior of the type beta(2x,x), where x determines how strongly the prior is concentrated on tratio values near 2.0. For instance, a beta(2,1) is much more diffuse than a beta(80,40) but both have the expected tratio 2.0 in the absence of data. The parameters of the Beta can be interpreted as counts: if you have observed x transitions and y transversions, then a beta(x+1,y+1) is a good representation of this information. The fixed option allows you to fix the tratio to a particular value.

2) Statefreqpr: This parameter specifies the prior on the state frequencies.

    prset statefreqpr = dirichlet(<number>)
    prset statefreqpr = dirichlet(<number>,...,<number>)
    prset statefreqpr = fixed(equal)
    prset statefreqpr = fixed(empirical)
    prset statefreqpr = fixed(<number>,...,<number>)

For the dirichlet, you can specify either a single number or as many numbers as there are states. If you specify a single number, then the prior has all states equally probable with a variance related to the single parameter passed in.

3) Shapepr: This parameter specifies the prior for the gamma shape parameter for among-site rate variation.

    prset shapepr = uniform(<number>,<number>)
    prset shapepr = exponential(<number>)
    prset shapepr = fixed(<number>)

4) Brlenspr:
This parameter specifies the prior probability distribution on branch lengths.

    prset brlenspr = unconstrained:uniform(<num>,<num>)
    prset brlenspr = unconstrained:exponential(<number>)
    prset brlenspr = clock:uniform
    prset brlenspr = clock:birthdeath
    prset brlenspr = clock:coalescence

Trees with unconstrained branch lengths are unrooted whereas clock-constrained trees are rooted. The option after the colon specifies the details of the probability density of branch lengths. If you choose a birth-death or coalescence prior, you may want to modify the details of the parameters of those processes.

5) Revmatpr: This parameter sets the prior for the substitution rates of the GTR model for nucleotide data.

    prset revmatpr = dirichlet(<number>,<number>,...,<number>)
    prset revmatpr = fixed(<number>,<number>,...,<number>)

The program assumes that the six substitution rates are independent gamma-distributed random variables with the same scale parameter when dirichlet is selected. The six numbers in brackets each correspond to a particular substitution type. Together, they determine the shape of the prior. The six rates are in the order A<->C, A<->G, A<->T, C<->G, C<->T, and G<->T. If you want an uninformative prior you can use dirichlet(1,1,1,1,1,1), also referred to as a "flat" Dirichlet. This is the default setting. If you wish a prior where the C<->T rate is 5 times and the A<->G rate 2 times higher, on average, than the transversion rates, which are all the same, then you should use a prior of the form dirichlet(x,2x,x,x,5x,x), where x determines how much the prior is focused on these particular rates. For more info, see tratiopr. The fixed option allows you to fix the substitution rates to particular values.

For amino acid data, the following two options also need to be considered:

6) Aamodelpr: This parameter sets the rate matrix for amino acid data.
You can either fix the model by specifying aamodelpr=fixed(<model name>), where <model name> is 'poisson' (a glorified Jukes-Cantor model), 'jones', 'dayhoff', 'mtrev', 'mtmam', 'wag', 'rtrev', 'cprev', 'vt', 'blosum', 'equalin' (a glorified Felsenstein 1981 model), or 'gtr'. You can also average over the first ten models by specifying aamodelpr=mixed. If you do so, the Markov chain will sample each model according to its probability. The sampled model is reported as an index: poisson(0), jones(1), dayhoff(2), mtrev(3), mtmam(4), wag(5), rtrev(6), cprev(7), vt(8), or blosum(9). The 'Sump' command summarizes the MCMC samples and calculates the posterior probability estimate for each of these models.

7) Aarevmatpr: This parameter sets the prior for the substitution rates of the GTR model for amino acid data. The options are:

    prset aarevmatpr = dirichlet(<number>,<number>,...,<number>)
    prset aarevmatpr = fixed(<number>,<number>,...,<number>)

The options are the same as those for 'Revmatpr' except that they are defined over the 190 rates of the time-reversible GTR model for amino acids instead of over the 6 rates of the GTR model for nucleotides. The rates are in the order A<->R, A<->N, etc. to Y<->V. In other words, amino acids are listed in alphabetic order based on their full name. The first amino acid (Alanine) is then combined in turn with all amino acids following it in the list, starting with amino acid 2 (Arginine) and finishing with amino acid 20 (Valine). The second amino acid (Arginine) is then combined in turn with all amino acids following it, starting with amino acid 3 (Asparagine) and finishing with amino acid 20 (Valine), and so on.

(7) The "lset" command

This command sets the parameters of the likelihood model. The likelihood function is the probability of observing the data conditional on the phylogenetic model. In order to calculate the likelihood, you must assume a model of character change. This command lets you tailor the biological assumptions made in the phylogenetic model. The correct usage is:

    lset <parameter>=<option> ... <parameter>=<option>
(8) The "mcmc" command

    mcmc <parameter>=<value> ... <parameter>=<value>

For example,

    mcmc ngen=100000 nchains=4 temp=0.5

performs an MCMCMC analysis with four chains with the temperature set to 0.5. The chains would be run for 100,000 cycles.

The options of this command are as follows:

Ngen (the number of cycles/generations): This option sets the number of cycles for the MCMC algorithm; the default is 1,000,000. This should be a big number as you want the chain to first reach stationarity, and then remain there for enough time to take lots of samples.

Printfreq (print frequency): This specifies how often information about the chain is printed to the screen, that is, after how many generations each line is printed.

Samplefreq (sample frequency): This specifies how often the Markov chain is sampled. You can sample the chain every cycle, but this results in very large output files. Thinning the chain is a way of making these files smaller and making the samples more independent.

Nchains (the number of chains to be run): How many chains are run for each analysis for the MCMCMC variant. The default is 4: 1 cold chain and 3 heated chains. If Nchains is set to 1, MrBayes will use regular MCMC sampling, without heating.

Savebrlens (save branch length information): This specifies whether branch length information is saved on the trees. The format is "savebrlens=yes" or "savebrlens=no".

Nruns (the number of independent analyses run simultaneously): How many independent analyses are started simultaneously.

(9) The "sump" command

During an MCMC analysis, MrBayes prints the sampled parameter values to a tab-delimited text file. This file has the extension ".p". The command 'Sump' summarizes the information in the parameter file. By default, the name of the parameter file is assumed to be the name of the last matrix-containing nexus file, but with a '.p' extension.
You can set 'Sump' to summarize the information in any other parameter file by setting the 'filename' option to the appropriate file name. The 'Sump' command does not require a matrix to be read in first.

When you invoke the 'Sump' command, three items are output: (1) a generation plot of the likelihood values; (2) estimates of the marginal likelihood of the model; and (3) a table with the mean, variance, and 95 percent credible interval for the sampled parameters. Each of these items can be switched on or off using the options 'Plot', 'Marglike', and 'Table'. By default, all three items are output but only to the screen. If output to a file is also desired, set 'Printtofile' to 'Yes'. The name of the output file is specified by setting the 'Outputname' option. When a new matrix is read in or when the 'Mcmc' output filename or 'Sump' input filename is changed, the 'Sump' outputname is changed as well. If you want to output to another file than the default, make sure you specify the outputname every time you invoke 'Sump'. If the specified output file already exists, you will be prompted about whether you would like to overwrite it or append to it. This behavior can be altered using 'Set nowarn=yes'; see the help for the 'Set' command.

When running 'Sump' you typically want to discard a specified number of samples from the beginning of the chain as the burn-in. In most cases 'Burnin' is the only option you need to set; the others can be left at their defaults. Note that the 'Burnin' value of the 'Sump' command is set separately from the 'Burnin' values of the 'Sumt' and 'Mcmc' commands. For example, after issuing

    sump burnin = 4000
    sumt burnin = 2000
    sump

the burnin of the last 'Sump' command is 4000 and not 2000. The burnin values are reset to 0 every time a new matrix is read in. Similarly, 'Plot', 'Marglike' and 'Table' are all set to 'Yes' and 'Printtofile' to 'No' (the default values) when a new matrix is processed.

If you have run several independent MCMC analyses, you may want to summarize and compare the samples from each of these runs. To do this, set 'Nruns' to the number of runs you want to compare and make sure that the '.p' files are named using the MrBayes convention (<filename>.run1.p, <filename>.run2.p, etc). When you run several independent analyses simultaneously in MrBayes, the 'Nruns' and 'Filename' options are automatically set such that 'Sump' will summarize all the resulting output files.

Burnin: Determines the number of samples that will be discarded from the input file before calculating summary statistics. If there are several input files, the same number of samples will be discarded from each. Note that the burnin is set separately for the 'sump', 'sumt', and 'mcmc' commands. The burnin is usually set to between 1/4 and 1/2 of the number of samples taken. For example, with 100,000 generations in total and one sample every 100 generations, burnin = (100,000 / 100) × 1/4 = 250.

Nruns: Determines how many '.p' files from independent analyses will be summarized. If Nruns > 1 then the names of the files are derived from 'Filename' by adding '.run1.p', '.run2.p', etc. If Nruns=1, then the single file name is obtained by adding '.p' to 'Filename'. The default is 2.

Filename: The name of the file to be summarized. This is the base of the file name to which endings are added according to the current setting of the 'Nruns' parameter. If 'Nruns' is 1, then only '.p' is added to the file name. Otherwise, the endings will be '.run1.p', '.run2.p', etc.
If you run "sump" right after an analysis finishes, while the results are still held in memory, you do not need to give a file name; the program summarizes all results automatically. If you closed the program after the analysis instead, you must give the file name the next time you start MrBayes and run "sump". If two or more runs were made, set 'Nruns' to match, and the '.run1.p', '.run2.p', etc. endings are added to the base file name as described above.

(10) The "sumt" command

This command is used to produce summary statistics for trees sampled during a Bayesian MCMC analysis. You can either summarize trees from one individual analysis, or trees coming from several independent analyses. In either case, all the sampled trees are read in and the proportion of the time any single taxon bipartition is found is counted. The proportion of the time that the bipartition is found is an approximation of the posterior probability of the bipartition. (Remember that a taxon bipartition is defined by removing a branch on the tree, dividing the tree into those taxa to the left and right of the removed branch. This set is called a taxon bipartition.) The branch length of the bipartition is also recorded, if branch lengths have been saved to file. The result is a list of the taxon bipartitions found, the frequency with which they were found, the posterior probability of the bipartition and, if the branch lengths were recorded, the mean and variance of the length of the branch.

The partition information is output to a file with the suffix ".parts". A consensus tree is also printed to a file with the suffix ".con" and printed to the screen as a cladogram, and as a phylogram if branch lengths have been saved. The consensus tree is either a 50 percent majority rule tree or a majority rule tree showing all compatible partitions. If branch lengths have been recorded during the run, the ".con" file will contain a consensus tree with branch lengths and interior nodes labelled with support values. This tree can be viewed in a program such as TreeView.
Finally, MrBayes produces a file with the ending ".trprobs" that contains a list of all the trees that were found during the MCMC analysis, sorted by their probabilities. This list of trees can be used to construct a credible set of trees. For example, if you want to construct a 95 percent credible set of trees, you include all of those trees whose cumulated probability is less than or equal to 0.95. You have the option of displaying the trees to the screen using the "Showtreeprobs" option. The default is to not display the trees to the screen; the number of different trees sampled by the chain can be quite large. If you are analyzing a large set of taxa, you may actually want to skip the calculation of tree probabilities entirely by setting "Calctreeprobs" to NO.

When calculating summary statistics you probably want to skip those trees that were sampled in the initial part of the run, the so-called burn-in period. The number of skipped samples is controlled by the "burnin" setting. The default is 0 but you typically want to override this setting.

If you are summarizing the trees sampled in several independent analyses, such as those resulting from setting the "Nruns" option of the "Mcmc" command to a value larger than 1, MrBayes will also calculate convergence diagnostics for the sampled topologies and branch lengths. These values can help you determine whether it is likely that your chains have converged.

The "Sumt" command expands the "Filename" according to the current values of the "Nruns" and "Ntrees" options. For instance, if both "Nruns" and "Ntrees" are set to 1, "Sumt" will try to open a file named "<Filename>.t". If "Nruns" is set to 2 and "Ntrees" to 1, then "Sumt" will open two files, "<Filename>.run1.t" and "<Filename>.run2.t", etc. By default, the "Filename" option will be set such that "Sumt" automatically summarizes all the results from your immediately preceding "Mcmc" command. You can also use the "Sumt" command to summarize tree samples in older analyses. If you want to do that, remember to first read in a matrix so that MrBayes knows what taxon names to expect in the trees. Then set the "Nruns", "Ntrees" and "Filename" options appropriately.

The three options that usually need to be set for this command are "burnin", "filename", and "contype":

Burnin: Determines the number of samples that will be discarded from the input file before calculating summary statistics. If there are several input files, the same number of samples will be discarded from each.
Note that the burnin is set separately for the 'sumt', 'sump', and 'mcmc' commands.

Filename: The name of the file(s) to be summarized. This is the base of the file name, to which endings are added according to the current settings of the 'Nruns' and 'Ntrees' options.

Contype: Type of consensus tree. 'Halfcompat' results in a 50 percent majority rule tree; 'Allcompat' adds all compatible groups to such a tree.
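
To show how the commands described in this tutorial fit together, here is a minimal sketch of a NEXUS file with a data block and a mrbayes block. It is an illustration under stated assumptions, not a recommended analysis: the taxon names other than Homo_sapiens, the sequences, the charset names, and every numeric setting are hypothetical placeholders. It also assumes the option names and burn-in behaviour described in this text; other MrBayes versions may treat some options differently.

```nexus
#NEXUS
[ Illustrative sketch only: taxa, sequences, and settings are hypothetical. ]

begin data;
  dimensions ntax=4 nchar=12;
  format datatype=dna interleave=no gap=- missing=?;
  matrix
    Homo_sapiens   ATGGCCAAGTTA
    Pan            ATGGCCAAGTTG
    Gorilla        ATGGCTAAGTTG
    Mus_musculus   ATGGCTAAATTG
  ;
end;

begin mrbayes;
  [ Batch-friendly behaviour: no prompts while the file executes. ]
  set autoclose=yes nowarnings=yes;

  [ Only a single taxon can be the outgroup. ]
  outgroup Mus_musculus;

  [ One charset line per character set; "." stands for the last character. ]
  charset first_pos  = 1-.\3;
  charset second_pos = 2-.\3;
  charset third_pos  = 3-.\3;

  [ Define a three-part partition from the charsets, then enforce it. ]
  partition by_codon = 3:first_pos,second_pos,third_pos;
  set partition=by_codon;

  [ Priors: flat Dirichlet on state frequencies, exponential branch lengths. ]
  prset statefreqpr=dirichlet(1,1,1,1) brlenspr=unconstrained:exponential(10.0);

  [ Two independent runs of four chains each; 100000/100 = 1000 samples per run. ]
  mcmc ngen=100000 samplefreq=100 printfreq=1000 nchains=4 temp=0.5 nruns=2 savebrlens=yes;

  [ Discard the first quarter of the samples: (100000/100) x 1/4 = 250. ]
  [ The burnin is set separately for sump and sumt. ]
  sump burnin=250;
  sumt burnin=250 contype=halfcompat;
end;
```

Saved under a name such as example.nex (a hypothetical file name), the file would be run by typing "execute example.nex" at the MrBayes prompt. The burn-in of 250 follows the one-quarter rule of thumb given in the "sump" section and would need to be raised if the runs take longer to reach stationarity.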