Transition Model Language

From Evolutionary Informatics Working Group
Jump to: navigation, search

Introduction

For analyses that use parsimony, the ASSUMPTIONS block of a NEXUS file provides a flexible way to describe rules, including costs or weights for each type of substitution. However, evolutionary analyses today typically use a model of state-transition probabilities parameterized via some (more or less) mechanistic scheme that introduces terms other than observed states (e.g., hidden states like a rate parameter for a site or an equilibrium GC content). Individual programs such as MrBayes or HyPhy have their own languages for describing these models. A common language would be highly useful for creating flexible pipelines and for comparative evaluation of methods.

Goals for the working group

At the first meeting (20-23 May, 2007), the working group decided on a modest goal of doing some analysis and putting forward a proposal for support (probably to NESCent):

  1. clarify problem, identify stakeholders, suggest evaluation scheme
  2. propose work session under short-term visiting scientist program (Mark)

Assessment

Categories below provide our assessment of the problem domain. The scope of the domain is defined to some extent by the use cases and the stakeholders. The existing interfaces and terminology impose constraints and provide informative content. The section on relevant technology should prevent us from re-inventing the wheel.

Existing domain-specific terminology

Are there agreed-upon terms to describe models already? What are they (some of them, at least)? Where do we go to find more terms? (mining statistically improbable phrases from documents?) How much do we want to focus on *instances* of models (e.g., HKY85) vs. the general concepts needed for describing any model?

Substitution rate (R) and base frequency (<math>\pi_i</math>) parameters of the General Time Reversible (GTR) model:

<math>\mathbf{R}_\mbox{GTR}= \begin{matrix}

  & \mbox{A} & \mbox{C} &  \mbox{G} & \mbox{T} \\

\mbox{A} & - & a & b & c \\ \mbox{C} & a & - & d & e \\ \mbox{G} & b & d & - & f \\ \mbox{T} & c & e & f & - \end{matrix},\; \mathbf{\Pi}_\mbox{GTR} = (\pi_\mbox{A}, \pi_\mbox{C}, \pi_\mbox{G}, \pi_\mbox{T}) </math>

Other commonly used models derive from the GTR model by simplification:

  • JC: <math>a = b = c = d = e = f,\, \pi_i = \pi_j</math>
  • F81: <math>a = b = c = d = e = f,\, \pi_i \ne \pi_j</math>
  • Kimura2P: <math>a = c = d = f = \beta, b = e = \alpha,\, \pi_i = \pi_j</math>
  • HKY85: <math>a = c = d = f = \beta, b = e = \alpha,\, \pi_i \ne \pi_j</math>
  • Jukes-Cantor (one parameter, same for every type of substitution)
  • F81 (Jukes-Cantor with unequal base frequencies)
  • Kimura (two parameters, one for transitions, one for transversions)
  • HKY (HKY85) (Kimura with unequal base frequencies)
  • GTR
  • equilibrium frequencies
  • transition-transversion bias (don't need this if we have GTR)
  • branch-specific
  • site-specific
  • partitioning

Legacy interfaces (constraints)

What data formats and software interfaces currently use substitution model descriptions? What types of conformance do we anticipate?

  • Every (almost every?) phylogenetic inference program uses inputs that describe models or restrict parameter values
  • Selected sites software such as PAML
  • Libraries such as PAL

We decided not to fill out this table initially. The purpose is to note which types of descriptions (of models and parameter values) are used in which interfaces.

name type comments
HyPhy processor NA
PAUP* processor NA
MrBayes processor NA
BEAST processor NA
CIPRES processor NA
Mesquite processor NA
Garli processor NA
NEXUS document NA
name document or processor NA

Use cases

What types of analyses depend on substitution model descriptions? What motivates these analyses?

  • phylogenetic inference: users want to specify a model
  • comparative evaluation: compare programs (phylo inference, selected sites) to see if they have the same output
  • selected sites: identify key sites in target (e.g., HIV RT protein)
  • identify variability factors:
  • model development: de-couple models from implementations so that user can explore implications of different models
  • educational: provide standard definition of terms

Stakeholders, potential partners

Who are the other stakeholders? Who will benefit (suffer) from a good (bad) solution?

  • software developers
    • applications software developers that process models
    • developers of environments and portals
  • users who do the use cases
  • data-miners who want to do meta-analyses

Available Languages for Describing Mathematical Models

What other model description languages can we draw from?

LanguageSyntaxRemarks
XRateLisp-likeXrate is a system for fitting models expressed as Phylogrammars (that extend both HMM's and Stochastic Context Free Grammars), using EM. Phylogrammars are very expressive, but not terse. The conciseness of Lisp syntax (relative to XML) helps constrain the length of grammar descriptions. Not everyone finds the Lisp syntax difficult. Note that we can change the syntax of a language while preserving its semantics-- we can make it more intuitive to use, like HyperTalk or SQL.
HyPhyAlgol-likeHyPhy can express all of those models used in selection analysis (see next section). But the HyPhy parsing language is difficult, have to parse too much of it. Need to use a language for which parsers exist already. This is similar to the point made by Ezechukwu and Maros, 2003 on p. 2.
AMLXMLAML is an Algebraic Markup Language (drawing on SML and XML) described by Ezechukwu and Maros, 2003, which is attached. This paper is useful to read for its introduction and concrete examples. The web page cited by the paper no longer seems to exist.
GAMSGAMS has been around for a while and has accumulated a large library of models; here is a network flow example.
MPLMPL is a commercial product for Windows that provides a software environment for representing and solving problems in its MPL language.
MMLJava-likeMML was developed by an international Physiome project. Its home page gives examples of constraints and differential equations.
AMPLAMPL was developed at Bell Labs and is described in the linked book
AIMMSGraphical notationAIMMS is a commercial software environment supporting a language; has a graphical notation
WinBUGS formatR/S-plus-like"The BUGS language allows a concise expression of the model, using the 'twiddles' symbol ~ to denote stochastic (probabilistic) relationships, and the left arrow ('<' sign followed by '−' sign) to denote deterministic (logical) relationships."

There was some disagreement about how much these languages could express.

Evaluation (projects)

How can we test the effectiveness of a proposed solution?

Sergei: Positive selection analyses. What models out there? Which packages implement them?

Other examples?

Requirements

We discussed the requirements a transition model language would have to satisfy in a break-out group at the 2nd meeting of the group (Nov 12-14, 2007).

  1. A transition model specification for an analysis assigns one or more substition process models to a set of sites (columns) in the alignment (sometimes also called a partition), a set of edges in a tree.
    • The definition of partitions should in fact be implicit from the assignment of parameters to sites. This would also allow for a more straightforward definition of mixture models (see below).
  2. A substitution process model specifies a set of substitution rates from one state that a site may take to another. and a set of state frequencies giving the equilibrium frequencies for each state.
    1. Substitution rates and state frequencies may not be given absolutely, but estimated as part of the analysis (for example, by maximum likelihood).
    2. Within a substitution model, substitution rates for certain state transitions may be constrained to be identical. Similarly, certain state frequencies may be constrained to be identical.
      • For example, in the Jukes-Cantor model, all substitution rates are identical, and all state frequencies are identical.
      • Other models range from sets of rates constrained as identical, to all rates being allowed to be different.
    3. Substitution rates and state frequencies may be constrained by upper or lower bounds, or by ratio relative to another rate.
  3. Layering of models: A model may be a layer on top of another model.
    • For example, a codon model layers probabilities for non-synonymous and synonymous codon transitions over a nucleotide exchange model.
  4. Among-site rate variation: Need to be able to specify a distribution of substitution rates that the rates within a model have to be drawn from.
    • For example, typically a <math>\Gamma</math>-distribution with a shape parameter is used to model which fraction of sites evolve at the mean rate, or at a rate different from the mean rate.
    • The continuous distribution (such as <math>\Gamma(k,1/k)</math>) is almost always discretized for computation by the program. There are different ways of discretization, and the method to be used needs to be specifiable.
    • Often a parameter <math>I</math> is used giving the proportion of invariable sites.
  5. Linking rates between model instances: The rates, or the ratio of rates, or the state frequencies, may be constrained to be identical ('linked') for sites across a model instances applied to partitions.
    • For example, in a concatenated alignment, each aligned sub-sequence may constitute its own partition (with its own instance of a transition model, so that parameters may vary), but the substitution rates for sites in specific codon positions should be identical.
    • Similarly, one may want to link the rate for a specific state transition or set of state transitions, for example <math>\alpha</math>, the rate for transitions, between specific instances of models.
    • One may also want to link the ratio of rates within a model across model instances, such as linking <math>\Kappa=\alpha/\beta</math> (the ratio of transversions over transitions) between model instances (where a and b are from the same model instance).
    • Or one may want to link the state frequencies for specific sets of sites, such as those for all intra-membrane sites and for all extra-cellular sites.
  6. Parameter distribution instead of values: in a Bayesian framework, need to be able to assign prior distributions to model parameters, and an analysis will estimate the posterior distribution of model parameters.
  7. Evolutionary constraints: one may wish to constrain the ratio of synonymous versus non-synonymous substitutions (which is indicative of positive or negative selection) on certain branches of the tree.
    • For example, one may wish to assign the same positive selection to certain branches in the tree by constraining (linking) <math>\omega=\mbox{d}N/\mbox{d}S</math> for those branches.
    • This may be outside of the scope of a transition model specification, though.
  8. Mixture models: almost always only a single substitution process will be assigned to a partition. If multiple models are be assigned, this is called a mixture model.
Example for a mixture model specification:
Partition: A B C D
rate distribution <math>\alpha_1, \alpha_2</math> <math>\alpha_1, \alpha_2</math> <math>\alpha_2</math> <math>\alpha_2</math>
rate matrix parameter <math>\Kappa_1, \Kappa_2</math> <math>\Kappa_3, \Kappa_4</math> <math>\Kappa_5</math> <math>\Kappa_2</math>

Error checking

With a transition model language flexible enough to satisfy all (or even many) of the above requirements, there are many possibilities for introducing errors when composing a model specification. How much error-proof should the language be? How much enforcement of internal inconsistency should the language provide?

Errors may fall into the following categories:

  1. Syntactic errors. For a formally defined language, such as XML, syntactic correctness can be verified by a compiler, or parser.
    • A syntactically correct specification may not be semantically meaningful, though.
  2. Semantic errors. Each model parameter has semantic meaning, and for constraining parameters relatively to another, within the same model instance of across model instances, parameters need to be semantically compatible for such constraints to be meaningful.
    • For example, linking <math>\Kappa</math> in one instance to <math>\alpha</math> in a different instance is non-sensical.
    • Semantically validating a model specification requires a format ontology covering all model parameters and enumerating allowable relative constraints.
  3. Over and underspecification of the model. The model specification may be syntactically and semantically correct, but not specify the model with sufficient coverage.
    • For example, if there are many partitions or sets of sites, a model may erroneously not be specified for certain sites, making them unintentionally invariant, or discarded from the analysis (what would be the correct way of proceeding for a program?).
    • Similarly, parameters may be erroneously shared between sites.

Approaches

1. Model-driven: express the transition model specification based on expressing the models being instantiated

  • This seems to be the more common approach currently.
  • Downside: especially in more complex analyses (such as partitioned alignments) there may be large number of model instances that need to be specified, which may not differ much except for a few parameters, and primarily to allow those parameters to vary across partitions.

2. Parameter-driven: express the transition model specificaton based on expressing the parameters for each combination of site (or set of sites) and edge in tree (or set of edges).

  • Would require the semantic definition of each parameter (presumably using an ontology). This could be an advantage, as the model description would be semantically anchored.
  • Semantic definition would connect parameter to its larger biological context as well as its role in a particular model. For example, HKY.Kappa would be defined both as the ratio of alpha and beta in an HKY model, but (through alpha and beta?) to the biological concepts of transitions and transversions and purines and pyrimidines.

Example of a concise representation of a GTR model with different base frequency parameters for two sets of sites (sites corresponding to intra-membrane regions of the protein product, and all other sites):

SubstProc=GTR:unlink all
BaseFreqs=pi[2]
     SiteSet[1]<-pi[1]
     SiteSet[2]<-pi[2]
SiteSets=
     SiteSet[1]=/* enumerate intra-membrane sites */
     SiteSet[2]=!SiteSet[1]

3. Generating language: formalize a language that allows a compact definition of the transition model specification without redundancies or model instances. The actual representation, in either a more verbose standard or program-specific format, would be accomplished by a 'compiler'.

  • A 'parser' for the language would then generate a complete model specification with all model instances.
  • For example, assume a partitioned alignment of n genes, and we wish to specify a GTR model for all genes, allowing different rates for each gene, but with linked base frequencies for all intra-membrane regions, and for all extra-membrane regions.

Specification

Example: BUGS-based language

Example of a model specified in the BUGS language. The model is for a random effects logistic regression of two variables, seed type (x1) and root extract (x2) to the number of germinated seeds (r) for each experiment plate i. The stochastic parameters <math>\alpha_0, \alpha_1, \alpha_2, \alpha_{12}</math>, and <math>\tau</math> are given prior distributions, while the logical expression for <math>\sigma</math> allows the standard deviation (of the random effects distribution) to be estimated.

model {
    for (i in 1:N) {
        r[i] ~ dbin(p[i], n[i])
        b[i] ~ dnorm(0, tau)
        logit(p[i]) <- alpha0 + alpha1 * x1[i] + alpha2 * x2[i]
        + alpha12 * x1[i] * x2[i] + b[i]
    }
    alpha0 ~ dnorm(0, 1.0E-6)
    alpha1 ~ dnorm(0, 1.0E-6)
    alpha2 ~ dnorm(0, 1.0E-6)
    alpha12 ~ dnorm(0, 1.0E-6)
    tau ~ dgamma(0.001, 0.001)
    sigma <- 1 / sqrt(tau)
}

The full example is in the WinBUGS 1.4 manual.

The BUGS (Bayesian inference Using Gibbs Sampling) project is concerned with flexible software for the Bayesian analysis of complex statistical models using Markov chain Monte Carlo (MCMC) methods. The BUGS language allows a concise expression of the model using an R/S-plus like syntax, using the tilde symbol ('~') to denote probabilistic parameters, and the left arrow ('<-') to denote deterministic or logical relationships.

Pros 
  • Very concise, clear representation of models
  • Mathematically highly expressive
Cons
  • Requires basically an R/S-plus language parser (and it is not completely R-compliant)
  • The cost of supporting this language in a multitude of programs developed in different languages and with different capabilities may be prohibitive.

Example: BEAST XML-based language

Example of expressing a GTR+<math>\Gamma</math>+I model in BEAST XML. There must be parameters for exactly five of the six rates, which are expressed relative to the sixth. The sixth rate has an implied value of 1.0. The example parameterizes the rate matrix relative to the A<math>\leftrightarrow</math>G transition.

<generalDataType id="nucleotides">
	<state code="A"/>
	<state code="C"/>
	<state code="G"/>
	<state code="T"/>
	<alias code="U" state="T"/>
	<ambiguity code="R" states="AG"/>
	<ambiguity code="Y" states="CT"/>
	<ambiguity code="-" states="ACGT"/>
</generalDataType>
<frequencyModel id="freqs">
  <generalDataType idref="nucleotides"/>
  <frequencies>
    <parameter id="gtr.frequencies" dimension="4"/>
  </frequencies>
</frequencyModel>
<gtrModel id="gtr1">
	<frequencies> <frequencyModel idref="freqs"/> </frequencies>
	<rateAC> <parameter id="rateAC1" value="0.8"/> </rateAC>
	<rateAT> <parameter id="rateAT1" value="0.8"/> </rateAT>
	<rateCG> <parameter id="rateCG1" value="0.8"/> </rateCG>
	<rateCT> <parameter id="rateCT1" value="1.0"/> </rateCT>
	<rateGT> <parameter id="rateGT1" value="0.8"/> </rateGT>
</gtrModel>
<siteModel id="siteModel">
    <substitutionModel><gtrModel idref="gtr1"/></substitutionModel>
    <gammaShape gammaCategories="4">
        <parameter id="siteModel.alpha" value="1.0" lower="0.0" upper="100.0"/>
    </gammaShape>
    <proportionInvariant>
        <parameter id="siteModel.plnv" value="0.01" lower="0.0" upper="1.0"/>
    </proportionInvariant>
</siteModel>

BEAST supports an XML format that includes specifications of substitution models, frequency models, site modesl, parameters, and specific elements for specifying rates or molecular clock models.

Pros 
  • Requires only an XML parser to implement. BEAST supports it already, and adopting a standard that is based on its current language may have low or minimal cost.
  • The language distinguishes between and specifies models for frequency, substitution, and site heterogeneity.
  • Parameters can have lower and upper bounds for estimation.
  • Parameters can be tied (constrained) to each other by referencing (using idref) a previously defined parameter.
  • Codon models can also be specified.
Cons
  • Not clear how to express partitions.
  • Not clear how to give specific (or relative) frequencies to individual states.
  • There is no XML Schema definition, this would have to be created from scratch.
  • The language needs some improvements (for example, the gammaShape element is directly nested within the siteModel element, rather than within an element named, e.g., distribution, or parametricDistributionModel, and so a standard based on it would not be identical to it.

Proposal: BEAST XML-inspired extension to NeXML

We propose to focus on using the language for specifying and transmitting models to be used (or that were used) in analysis, rather than for the purpose of being able to specify arbitrarily complex models that a generic mathematical engine could instantiate and execute. A BUGS-like language would be desirable for the latter purpose, but for the former BEAST XML, or a similar but somewhat more generic XML specification would probably fully serve the needs, and have a relatively low barrier to support.

A 'BEASTy' XML representation would also readily integrate with the emerging NeXML standard. We propose to create this specification as a NeXML module.

Steps:

  1. Create Models module XSD, using BEAST XML and the previously created models.xsd as inspiration and start, respectively.
  2. Communicate and coordinate with Rambaut, Drummond, Midford, Holder.
  3. Demonstrate sufficient expressiveness (and/or limitations) using proof-of-concept examples taken from papers or other sources.
  4. Integrate module into the NeXML container.
  5. Allow model specification to be only module in a NeXML file (for specification and transmission)
  6. Create XSLT for converting NeXML with model specification into BEAST XML.

Applications of a transition model language

We found the following applications of a transition model language to be the most useful, in the sense of applications that can't be easily done with existing package-specific model specification formats.

  1. Reproducibility of research.
    • Published analyses that used a transition model specification could publish the specification as a downloadable file that (if in a universally supported format) can be directly reproduced in the same program, or in a different program.
    • If there is a digital repository for storing supporting data, the transition model specification could become one of the items to be deposited, and would then receive a persistent and globally unique identifier.
  2. Component in web-service discovery and matching. Web-services offering evolutionary analysis could automatically determine their ability to execute a specific transition model specification.
    • For example, a PAUP* service would reject a service request with a transition model that specifies a codon model.
  3. Extending an analysis. If a published analysis also gives its transition model specification on a standard format, it would facilitate easy extension of such analyses with more complicated models.
    • For example, a publication may have used a nucleotide transition model to analyse the evolution of a set of sequences. If the transition model is given in a standard and universally supported format, it would be straightforward to load that model (and the data) in the program of choice and layer a codon model on top of the nucleotide transition model.

Plans

Progress in achieving goals

  • At the NESCent phylohackathon in December 2006, Mark Holder, Paul Lewis and Sergei Kosakovsky Pond produced a draft object model in Interface Definition Language (IDL).
  • At a work meeting in Lawrence, KS, in Fall 2007, Mark Holder, Peter Midford, Jeet Sukumaran (a graduate student at KU) and Rutger Vos revisited the model description that we sketched out at the hackathon.
    • Peter has worked on an OWL expression of it (user: guest, pass: guest).
    • Rutger and Jeet worked on an XML syntax and schema fragment and how that would work with Rutger's nexml.
    • Jeet is funded as a CIPRES RA and will be wrapping some simulation tools for CIPRES. We are hoping that he can pick up the model spec again, and start building CIPRES support for it. When he does that, we expect that he will generate questions and comments for us all to ponder.
  • At the evoinfo meeting in November 2007, Sergei presented an informal overview of the requirements of a model description language. Peter introduced several legacy technologies (languages) for expressing stochastic models.
    • During several breakout sessions, David Swofford, Sergei Kosakovsky Pond, Gopal Gupta, Enrico Pontelli, Hilmar Lapp, Peter Midford, Brian DeVries and Rutger Vos further discussed the requirements. Mark Holder called in via videoconference with further comments.

Next Steps

  • Flesh out transition model description format for GARLI and PAUP that implements the more general (but most likely not all) of the requirements (Zwickl, Swofford, Holder, Lapp)
  • Make a collection of examples, e.g. taken from papers (Kosakovsky Pond)

Prioritization

Added by a Substitution Model subgroup at the 3rd Working Group meeting (Xuhua Xia, Jim Leebens-Mack, Hilmar Lapp)

  • Start with nucleotide and amino acid substitution models
    • Codon-based models are oftentimes not well suited for tree reconstruction (complex, inefficient to compute, don't take into account substitution differences between sites)
  • Start with phylogenetic tree reconstruction programs, for those identify the most frequently used models
    • DAMBE, Mega, PAUP*
  • Do evolutionary model testing (codon models, selection, rates-based hypotheses, etc) only once the first priority ones are done.

Resources

Artefacts by group members and associates