Transition Model Language
Latest revision as of 15:59, 12 March 2009
Introduction
For analyses that use parsimony, the ASSUMPTIONS block of a NEXUS file provides a flexible way to describe rules, including costs or weights for each type of substitution. However, evolutionary analyses today typically use a model of state-transition probabilities parameterized via some (more or less) mechanistic scheme that introduces terms other than observed states (e.g., hidden states like a rate parameter for a site, or an equilibrium GC content). Individual programs such as MrBayes or HyPhy have their own languages for describing these models. A common language would be highly useful for creating flexible pipelines and for comparative evaluation of methods.
Goals for the working group
At the first meeting (20-23 May 2007), the working group decided on a modest goal of doing some analysis and putting forward a proposal for support (probably to NESCent):
 clarify the problem, identify stakeholders, suggest an evaluation scheme
 propose a work session under the short-term visiting scientist program (Mark)
Assessment
Categories below provide our assessment of the problem domain. The scope of the domain is defined to some extent by the use cases and the stakeholders. The existing interfaces and terminology impose constraints and provide informative content. The section on relevant technology should prevent us from reinventing the wheel.
Existing domain-specific terminology
Are there agreedupon terms to describe models already? What are they (some of them, at least)? Where do we go to find more terms? (mining statistically improbable phrases from documents?) How much do we want to focus on *instances* of models (e.g., HKY85) vs. the general concepts needed for describing any model?
Substitution rate (R) and base frequency (<math>\pi_i</math>) parameters of the General Time Reversible (GTR) model:
<math>\mathbf{R}_\mbox{GTR}= \begin{matrix}
& \mbox{A} & \mbox{C} & \mbox{G} & \mbox{T} \\
\mbox{A} & & a & b & c \\
\mbox{C} & a & & d & e \\
\mbox{G} & b & d & & f \\
\mbox{T} & c & e & f & \end{matrix},\; \mathbf{\Pi}_\mbox{GTR} = (\pi_\mbox{A}, \pi_\mbox{C}, \pi_\mbox{G}, \pi_\mbox{T}) </math>
Other commonly used models derive from the GTR model by simplification:
 JC: <math>a = b = c = d = e = f,\, \pi_i = \pi_j</math>
 F81: <math>a = b = c = d = e = f,\, \pi_i \ne \pi_j</math>
 Kimura2P: <math>a = c = d = f = \beta, b = e = \alpha,\, \pi_i = \pi_j</math>
 HKY85: <math>a = c = d = f = \beta, b = e = \alpha,\, \pi_i \ne \pi_j</math>
 Jukes-Cantor (one parameter, the same for every type of substitution)
 F81 (Jukes-Cantor with unequal base frequencies)
 Kimura (two parameters, one for transitions, one for transversions)
 HKY (HKY85) (Kimura with unequal base frequencies)
 GTR
 equilibrium frequencies
 transition-transversion bias (don't need this if we have GTR)
 branch-specific
 site-specific
 partitioning
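To make the parameterization concrete, the GTR matrix above and its constrained special cases can be sketched in code. This is a minimal illustration, not part of the original notes; the function and variable names are ours, and the state order A, C, G, T and the parameter letters a-f follow the matrix shown earlier.

```python
def gtr_rate_matrix(a, b, c, d, e, f, pi):
    """Build the GTR instantaneous rate matrix Q.

    Q[i][j] = exchangeability * pi[j] for i != j; each diagonal entry
    is set so the row sums to zero. State order: A, C, G, T.
    """
    # Symmetric exchangeability parameters, matching the R_GTR matrix above.
    ex = [[0, a, b, c],
          [a, 0, d, e],
          [b, d, 0, f],
          [c, e, f, 0]]
    q = [[ex[i][j] * pi[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q

# JC: one rate for every substitution type, equal base frequencies.
jc = gtr_rate_matrix(1, 1, 1, 1, 1, 1, [0.25] * 4)

# HKY85: transitions (A<->G is b, C<->T is e) get alpha, transversions
# get beta, with unequal base frequencies -- i.e., a = c = d = f = beta,
# b = e = alpha, as in the constraint list above.
alpha, beta = 2.0, 1.0
hky = gtr_rate_matrix(beta, alpha, beta, beta, alpha, beta,
                      [0.3, 0.2, 0.2, 0.3])
```

Constraining parameters this way (all six equal, or transitions grouped against transversions, with equal or free base frequencies) is exactly how the simpler models in the list above derive from GTR.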
Legacy interfaces (constraints)
What data formats and software interfaces currently use substitution model descriptions? What types of conformance do we anticipate?
 Every (almost every?) phylogenetic inference program uses inputs that describe models or restrict parameter values
 Selected sites software such as PAML
 Libraries such as PAL
We decided not to fill out this table initially. The purpose is to note which types of descriptions (of models and parameter values) are used in which interfaces.
name | type | comments
HyPhy | processor | NA
PAUP* | processor | NA
MrBayes | processor | NA
BEAST | processor | NA
CIPRES | processor | NA
Mesquite | processor | NA
Garli | processor | NA
NEXUS | document | NA
(name) | (document or processor) | (comments)
Use cases
What types of analyses depend on substitution model descriptions? What motivates these analyses?
 phylogenetic inference: users want to specify a model
 comparative evaluation: compare programs (phylo inference, selected sites) to see if they have the same output
 selected sites: identify key sites in target (e.g., HIV RT protein)
 identify variability factors:
 model development: decouple models from implementations so that user can explore implications of different models
 educational: provide standard definition of terms
Stakeholders, potential partners
Who are the other stakeholders? Who will benefit (suffer) from a good (bad) solution?
 software developers
 applications software developers that process models
 developers of environments and portals
 users who do the use cases
 dataminers who want to do metaanalyses
Available Languages for Describing Mathematical Models
What other model description languages can we draw from?
Language | Syntax | Remarks
XRate | Lisp-like | XRate is a system for fitting models expressed as phylogrammars (which extend both HMMs and stochastic context-free grammars), using EM. Phylogrammars are very expressive, but not terse. The conciseness of Lisp syntax (relative to XML) helps constrain the length of grammar descriptions. Not everyone finds the Lisp syntax difficult. Note that we can change the syntax of a language while preserving its semantics; we can make it more intuitive to use, like HyperTalk or SQL.
HyPhy | Algol-like | HyPhy can express all of the models used in selection analysis (see next section). But the HyPhy language is difficult to parse; one would have to parse too much of it. We need to use a language for which parsers already exist. This is similar to the point made by Ezechukwu and Maros, 2003 on p. 2.
AML | XML | AML is an Algebraic Markup Language (drawing on SML and XML) described by Ezechukwu and Maros, 2003, which is attached. This paper is useful to read for its introduction and concrete examples. The web page cited by the paper no longer seems to exist.
GAMS | | GAMS has been around for a while and has accumulated a large library of models; here is a network flow example.
MPL | | MPL is a commercial product for Windows that provides a software environment for representing and solving problems in its MPL language.
MML | Java-like | MML was developed by an international Physiome project. Its home page gives examples of constraints and differential equations.
AMPL | | AMPL was developed at Bell Labs and is described in the linked book.
AIMMS | Graphical notation | AIMMS is a commercial software environment supporting a language; it has a graphical notation.
WinBUGS format | R/S-Plus-like | "The BUGS language allows a concise expression of the model, using the 'twiddles' symbol ~ to denote stochastic (probabilistic) relationships, and the left arrow ('<-') to denote deterministic (logical) relationships."
There was some disagreement about how much these languages could express.
Evaluation (projects)
How can we test the effectiveness of a proposed solution?
Sergei: Positive selection analyses. What models out there? Which packages implement them?
Other examples?
Requirements
We discussed the requirements a transition model language would have to satisfy in a breakout group at the 2nd meeting of the group (Nov 12-14, 2007).
 A transition model specification for an analysis assigns one or more substitution process models to a set of sites (columns) in the alignment (sometimes also called a partition) and a set of edges in a tree.
 The definition of partitions should in fact be implicit from the assignment of parameters to sites. This would also allow for a more straightforward definition of mixture models (see below).
 A substitution process model specifies a set of substitution rates from one state that a site may take to another, and a set of state frequencies giving the equilibrium frequency of each state.
 Substitution rates and state frequencies need not be given explicitly, but may instead be estimated as part of the analysis (for example, by maximum likelihood).
 Within a substitution model, substitution rates for certain state transitions may be constrained to be identical. Similarly, certain state frequencies may be constrained to be identical.
 For example, in the JukesCantor model, all substitution rates are identical, and all state frequencies are identical.
 Other models range from sets of rates constrained as identical, to all rates being allowed to be different.
 Substitution rates and state frequencies may be constrained by upper or lower bounds, or by ratio relative to another rate.
 Layering of models: A model may be a layer on top of another model.
 For example, a codon model layers probabilities for nonsynonymous and synonymous codon transitions over a nucleotide exchange model.
 Among-site rate variation: need to be able to specify a distribution from which the substitution rates within a model are drawn.
 For example, typically a <math>\Gamma</math> distribution with a shape parameter is used to model which fraction of sites evolve at the mean rate, or at a rate different from the mean rate.
 The continuous distribution (such as <math>\Gamma(k,1/k)</math>) is almost always discretized for computation by the program. There are different ways of discretization, and the method to be used needs to be specifiable.
 Often a parameter <math>I</math> is used giving the proportion of invariable sites.
 Linking rates between model instances: the rates, the ratio of rates, or the state frequencies may be constrained to be identical ('linked') across model instances applied to different partitions.
 For example, in a concatenated alignment, each aligned subsequence may constitute its own partition (with its own instance of a transition model, so that parameters may vary), but the substitution rates for sites in specific codon positions should be identical.
 Similarly, one may want to link the rate for a specific state transition or set of state transitions, for example <math>\alpha</math>, the rate for transitions, between specific instances of models.
 One may also want to link the ratio of rates within a model across model instances, such as linking <math>\kappa=\alpha/\beta</math> (the ratio of transitions over transversions) between model instances (where <math>\alpha</math> and <math>\beta</math> are from the same model instance).
 Or one may want to link the state frequencies for specific sets of sites, such as those for all intramembrane sites and for all extracellular sites.
 Parameter distribution instead of values: in a Bayesian framework, need to be able to assign prior distributions to model parameters, and an analysis will estimate the posterior distribution of model parameters.
 Evolutionary constraints: one may wish to constrain the ratio of synonymous versus nonsynonymous substitutions (which is indicative of positive or negative selection) on certain branches of the tree.
 For example, one may wish to assign the same positive selection to certain branches in the tree by constraining (linking) <math>\omega=\mbox{d}N/\mbox{d}S</math> for those branches.
 This may be outside of the scope of a transition model specification, though.
 Mixture models: almost always, only a single substitution process will be assigned to a partition. If multiple models are assigned, this is called a mixture model.
Partition | A | B | C | D
rate distribution | <math>\alpha_1, \alpha_2</math> | <math>\alpha_1, \alpha_2</math> | <math>\alpha_2</math> | <math>\alpha_2</math>
rate matrix parameter | <math>\kappa_1, \kappa_2</math> | <math>\kappa_3, \kappa_4</math> | <math>\kappa_5</math> | <math>\kappa_2</math>
 A variant of a mixture model is if the model specification for a partition has a mixture model for certain parameters, for example a mixture of a local parameter and one linked to the same parameter of the model for another partition.
 MrBayes also has a mechanism for specifying partitioned models with linked parameters between the partitions.
 The mixture parameters may be estimated as part of the analysis.
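The among-site rate variation requirement above (a discretized <math>\Gamma</math> distribution, optionally plus a proportion of invariable sites) can be sketched as follows. This is one common discretization (equal-probability categories represented by their medians, rescaled to a mean rate of 1), not necessarily the one any particular program uses, and all names are ours.

```python
import math

def _lower_regularized_gamma(s, x):
    """Regularized lower incomplete gamma function P(s, x), via series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / s
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= x / (s + n)
        total += term
    return total * math.exp(-x + s * math.log(x) - math.lgamma(s))

def discrete_gamma_rates(shape, k):
    """k equal-probability category rates for a mean-1 Gamma(shape, 1/shape).

    Each category is represented by its median quantile, and the rates
    are rescaled so their mean is exactly 1.  With a proportion I of
    invariable sites, the variable categories would each get rate
    r / (1 - I) and weight (1 - I) / k, plus one zero-rate category
    with weight I, keeping the overall mean rate at 1.
    """
    def cdf(x):
        # X ~ Gamma(shape, scale=1/shape), so CDF(x) = P(shape, x * shape)
        return _lower_regularized_gamma(shape, x * shape)

    rates = []
    for i in range(k):
        p = (2 * i + 1) / (2.0 * k)   # median quantile of category i
        lo, hi = 0.0, 1.0
        while cdf(hi) < p:            # bracket the quantile
            hi *= 2.0
        for _ in range(200):          # bisection
            mid = 0.5 * (lo + hi)
            if cdf(mid) < p:
                lo = mid
            else:
                hi = mid
        rates.append(0.5 * (lo + hi))
    mean = sum(rates) / k
    return [r / mean for r in rates]
```

For a shape parameter below 1 the resulting category rates are strongly skewed (most sites slow, a few fast), which is the usual biological regime.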
Error checking
With a transition model language flexible enough to satisfy all (or even many) of the above requirements, there are many possibilities for introducing errors when composing a model specification. How error-proof should the language be? How much enforcement of internal consistency should the language provide?
Errors may fall into the following categories:
 Syntactic errors. For a formally defined language, such as XML, syntactic correctness can be verified by a compiler, or parser.
 A syntactically correct specification may not be semantically meaningful, though.
 Semantic errors. Each model parameter has semantic meaning, and for constraining parameters relative to one another, within the same model instance or across model instances, the parameters need to be semantically compatible for such constraints to be meaningful.
 For example, linking <math>\kappa</math> in one instance to <math>\alpha</math> in a different instance is nonsensical.
 Semantically validating a model specification requires a formal ontology covering all model parameters and enumerating allowable relative constraints.
 Over- and underspecification of the model. The model specification may be syntactically and semantically correct, but not specify the model with sufficient coverage.
 For example, if there are many partitions or sets of sites, a model may erroneously not be specified for certain sites, making them unintentionally invariant, or discarded from the analysis (what would be the correct way of proceeding for a program?).
 Similarly, parameters may be erroneously shared between sites.
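The over- and underspecification errors above are mechanically checkable. A minimal sketch (our own names and data layout, not from the source) that flags unassigned and multiply-assigned sites:

```python
from collections import Counter

def check_site_assignments(num_sites, assignments):
    """Check coverage of a model specification over an alignment.

    assignments: mapping of model-instance id -> iterable of site indices.
    Returns (unassigned, multiply_assigned).  Sites with more than one
    model may be intentional (a mixture model, see above), so they are
    only reported, not rejected.
    """
    counts = Counter()
    for sites in assignments.values():
        counts.update(sites)
    unassigned = [i for i in range(num_sites) if counts[i] == 0]
    multiple = sorted(i for i, c in counts.items() if c > 1)
    return unassigned, multiple

# Six sites; site 2 is covered twice, sites 4 and 5 not at all.
missing, shared = check_site_assignments(
    6, {"gtr1": [0, 1, 2], "hky1": [2, 3]})
```

A validating parser could raise an error on unassigned sites (or apply a documented default) and require an explicit mixture declaration for multiply-assigned ones.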
Approaches
1. Model-driven: express the transition model specification based on expressing the models being instantiated
 This seems to be the more common approach currently.
 Downside: especially in more complex analyses (such as partitioned alignments) there may be a large number of model instances that need to be specified, which may not differ much except for a few parameters, and primarily to allow those parameters to vary across partitions.
2. Parameter-driven: express the transition model specification based on expressing the parameters for each combination of site (or set of sites) and edge in the tree (or set of edges).
 Would require the semantic definition of each parameter (presumably using an ontology). This could be an advantage, as the model description would be semantically anchored.
 Semantic definition would connect parameter to its larger biological context as well as its role in a particular model. For example, HKY.Kappa would be defined both as the ratio of alpha and beta in an HKY model, but (through alpha and beta?) to the biological concepts of transitions and transversions and purines and pyrimidines.
Example of a concise representation of a GTR model with different base frequency parameters for two sets of sites (sites corresponding to intramembrane regions of the protein product, and all other sites):
SubstProc=GTR:unlink all BaseFreqs=pi[2] SiteSet[1]<pi[1] SiteSet[2]<pi[2] SiteSets= SiteSet[1]=/* enumerate intramembrane sites */ SiteSet[2]=!SiteSet[1]
3. Generating language: formalize a language that allows a compact definition of the transition model specification without redundancies or model instances. The actual representation, in either a more verbose standard or programspecific format, would be accomplished by a 'compiler'.
 A 'parser' for the language would then generate a complete model specification with all model instances.
 For example, assume a partitioned alignment of n genes, and we wish to specify a GTR model for all genes, allowing different rates for each gene, but with linked base frequencies for all intramembrane regions, and for all extramembrane regions.
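The n-gene example above can be sketched as a tiny 'compiler' that expands the compact description into explicit model instances. The dictionary layout and parameter-naming convention are hypothetical illustrations, not a proposed format.

```python
def expand_compact_spec(n_genes, regions=("intramembrane", "extramembrane")):
    """Expand a compact spec into explicit model instances.

    One GTR instance per (gene, region) partition: the rate parameters
    are unique per gene (unlinked), while the base-frequency parameters
    are shared across all genes, one set per region (linked), as in the
    example above.
    """
    instances = []
    for g in range(1, n_genes + 1):
        for region in regions:
            instances.append({
                "partition": ("gene%d" % g, region),
                "model": "GTR",
                "rates": "gtr.rates.gene%d" % g,   # free per gene
                "basefreqs": "pi.%s" % region,     # linked across genes
            })
    return instances

# Three genes expand to six instances, three rate parameter sets,
# and two linked base-frequency parameter sets.
inst = expand_compact_spec(3)
```

The point of the generating-language approach is that only this compact description is authored by hand; the expanded instances are emitted in a verbose standard or program-specific format by the compiler.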
Specification
Example: BUGSbased language
Example of a model specified in the BUGS language. The model is for a random effects logistic regression of two variables, seed type (x1) and root extract (x2) to the number of germinated seeds (r) for each experiment plate i. The stochastic parameters <math>\alpha_0, \alpha_1, \alpha_2, \alpha_{12}</math>, and <math>\tau</math> are given prior distributions, while the logical expression for <math>\sigma</math> allows the standard deviation (of the random effects distribution) to be estimated.
model {
   for (i in 1:N) {
      r[i] ~ dbin(p[i], n[i])
      b[i] ~ dnorm(0.0, tau)
      logit(p[i]) <- alpha0 + alpha1 * x1[i] + alpha2 * x2[i] + alpha12 * x1[i] * x2[i] + b[i]
   }
   alpha0 ~ dnorm(0.0, 1.0E-6)
   alpha1 ~ dnorm(0.0, 1.0E-6)
   alpha2 ~ dnorm(0.0, 1.0E-6)
   alpha12 ~ dnorm(0.0, 1.0E-6)
   tau ~ dgamma(0.001, 0.001)
   sigma <- 1 / sqrt(tau)
}
The full example is in the WinBUGS 1.4 manual.
The BUGS (Bayesian inference Using Gibbs Sampling) project is concerned with flexible software for the Bayesian analysis of complex statistical models using Markov chain Monte Carlo (MCMC) methods. The BUGS language allows a concise expression of the model using an R/S-Plus-like syntax, with the tilde symbol ('~') denoting probabilistic relationships and the left arrow ('<-') denoting deterministic or logical relationships.
 Pros

 Very concise, clear representation of models
 Mathematically highly expressive
 Cons

 Basically requires an R/S-Plus language parser (and the language is not completely R-compliant)
 The cost of supporting this language in a multitude of programs developed in different languages and with different capabilities may be prohibitive.
Example: BEAST XMLbased language
Example of expressing a GTR+<math>\Gamma</math>+I model in BEAST XML. There must be parameters for exactly five of the six rates, which are expressed relative to the sixth. The sixth rate has an implied value of 1.0. The example parameterizes the rate matrix relative to the A<math>\leftrightarrow</math>G transition.
<generalDataType id="nucleotides">
  <state code="A"/>
  <state code="C"/>
  <state code="G"/>
  <state code="T"/>
  <alias code="U" state="T"/>
  <ambiguity code="R" states="AG"/>
  <ambiguity code="Y" states="CT"/>
  <ambiguity code="" states="ACGT"/>
</generalDataType>
<frequencyModel id="freqs">
  <generalDataType idref="nucleotides"/>
  <frequencies>
    <parameter id="gtr.frequencies" dimension="4"/>
  </frequencies>
</frequencyModel>
<gtrModel id="gtr1">
  <frequencies>
    <frequencyModel idref="freqs"/>
  </frequencies>
  <rateAC>
    <parameter id="rateAC1" value="0.8"/>
  </rateAC>
  <rateAT>
    <parameter id="rateAT1" value="0.8"/>
  </rateAT>
  <rateCG>
    <parameter id="rateCG1" value="0.8"/>
  </rateCG>
  <rateCT>
    <parameter id="rateCT1" value="1.0"/>
  </rateCT>
  <rateGT>
    <parameter id="rateGT1" value="0.8"/>
  </rateGT>
</gtrModel>
<siteModel id="siteModel">
  <substitutionModel>
    <gtrModel idref="gtr1"/>
  </substitutionModel>
  <gammaShape gammaCategories="4">
    <parameter id="siteModel.alpha" value="1.0" lower="0.0" upper="100.0"/>
  </gammaShape>
  <proportionInvariant>
    <parameter id="siteModel.pInv" value="0.01" lower="0.0" upper="1.0"/>
  </proportionInvariant>
</siteModel>
BEAST supports an XML format that includes specifications of substitution models, frequency models, site models, parameters, and specific elements for specifying rates or molecular clock models.
 Pros

 Requires only an XML parser to implement. BEAST supports it already, and adopting a standard that is based on its current language may have low or minimal cost.
 The language distinguishes between and specifies models for frequency, substitution, and site heterogeneity.
 Parameters can have lower and upper bounds for estimation.
 Parameters can be tied (constrained) to each other by referencing (using idref) a previously defined parameter.
 Codon models can also be specified.
 Cons

 Not clear how to express partitions.
 Not clear how to give specific (or relative) frequencies to individual states.
 There is no XML Schema definition; this would have to be created from scratch.
 The language needs some improvements (for example, the gammaShape element is directly nested within the siteModel element, rather than within an element named, e.g., distribution or parametricDistributionModel), and so a standard based on it would not be identical to it.
Proposal: BEAST XMLinspired extension to NeXML
We propose to focus on using the language for specifying and transmitting models to be used (or that were used) in analysis, rather than for the purpose of being able to specify arbitrarily complex models that a generic mathematical engine could instantiate and execute. A BUGSlike language would be desirable for the latter purpose, but for the former BEAST XML, or a similar but somewhat more generic XML specification would probably fully serve the needs, and have a relatively low barrier to support.
A 'BEASTy' XML representation would also readily integrate with the emerging NeXML standard. We propose to create this specification as a NeXML module.
Steps:
 Create Models module XSD, using BEAST XML and the previously created models.xsd as inspiration and start, respectively.
 Communicate and coordinate with Rambaut, Drummond, Midford, Holder.
 Demonstrate sufficient expressiveness (and/or limitations) using proofofconcept examples taken from papers or other sources.
 Integrate module into the NeXML container.
 Allow a model specification to be the only module in a NeXML file (for specification and transmission)
 Create XSLT for converting NeXML with model specification into BEAST XML.
Applications of a transition model language
We found the following applications of a transition model language to be the most useful, in the sense of applications that can't be easily done with existing packagespecific model specification formats.
 Reproducibility of research.
 Published analyses that used a transition model specification could publish the specification as a downloadable file that (if in a universally supported format) can be directly reproduced in the same program, or in a different program.
 If there is a digital repository for storing supporting data, the transition model specification could become one of the items to be deposited, and would then receive a persistent and globally unique identifier.
 Component in webservice discovery and matching. Webservices offering evolutionary analysis could automatically determine their ability to execute a specific transition model specification.
 For example, a PAUP* service would reject a service request with a transition model that specifies a codon model.
 Extending an analysis. If a published analysis also gives its transition model specification in a standard format, this would facilitate easy extension of such analyses with more complicated models.
 For example, a publication may have used a nucleotide transition model to analyse the evolution of a set of sequences. If the transition model is given in a standard and universally supported format, it would be straightforward to load that model (and the data) in the program of choice and layer a codon model on top of the nucleotide transition model.
Plans
Progress in achieving goals
 At the NESCent phylohackathon in December 2006, Mark Holder, Paul Lewis and Sergei Kosakovsky Pond produced a draft object model in Interface Definition Language (IDL).
 At a work meeting in Lawrence, KS, in Fall 2007, Mark Holder, Peter Midford, Jeet Sukumaran (a graduate student at KU) and Rutger Vos revisited the model description that we sketched out at the hackathon.
 Peter has worked on an OWL expression of it (user: guest, pass: guest).
 Rutger and Jeet worked on an XML syntax and schema fragment and how that would work with Rutger's nexml.
 Jeet is funded as a CIPRES RA and will be wrapping some simulation tools for CIPRES. We are hoping that he can pick up the model spec again, and start building CIPRES support for it. When he does that, we expect that he will generate questions and comments for us all to ponder.
 At the evoinfo meeting in November 2007, Sergei presented an informal overview of the requirements of a model description language. Peter introduced several legacy technologies (languages) for expressing stochastic models.
 During several breakout sessions, David Swofford, Sergei Kosakovsky Pond, Gopal Gupta, Enrico Pontelli, Hilmar Lapp, Peter Midford, Brian DeVries and Rutger Vos further discussed the requirements. Mark Holder called in via videoconference with further comments.
Next Steps
 Flesh out a transition model description format for GARLI and PAUP* that implements the more general (though most likely not all) of the requirements (Zwickl, Swofford, Holder, Lapp)
 Make a collection of examples, e.g. taken from papers (Kosakovsky Pond)
Prioritization
Added by a Substitution Model subgroup at the 3rd Working Group meeting (Xuhua Xia, Jim LeebensMack, Hilmar Lapp)
 Start with nucleotide and amino acid substitution models
 Codon-based models are often not well suited for tree reconstruction (complex, inefficient to compute, don't take into account substitution differences between sites)
 Start with phylogenetic tree reconstruction programs, for those identify the most frequently used models
 DAMBE, Mega, PAUP*
 Do evolutionary model testing (codon models, selection, ratesbased hypotheses, etc) only once the first priority ones are done.
Resources
Artefacts by group members and associates
 Document describing earlier work on expanding the NEXUS format to include model descriptions in public blocks (contributed by David Maddison)
 The wiki page of the hackathon work (Fall '06) on the same topic
 An example of a nexml model description and its associated schema
 An overview of model description languages, presented by Peter Midford at the EvoInfo Fall '07 meeting
 Requirements of an Evolutionary Model Description Language (EMDL), presented by Sergei Kosakovsky Pond at the EvoInfo Fall '07 meeting
 Subversion repository of model ontology by Peter Midford (user: guest, pass: guest)
 Mechanism by which MrBayes allows specification of partitioned models and linked parameters