From Evolutionary Informatics Working Group
Revision as of 03:49, 12 February 2008 by Hlapp (talk)
Jump to: navigation, search

Phyloinformatics Web Services API: Overview

At present there is no standard web-service API for phylogenetic data that would allow integration of phylogenetic data and service providers into the programmable web. Hence, current approaches to integrate data and services into workflows are highly specific to the integration platform (CIPRES, BioPerl, Bio::Phylo, Kepler), and nearly unusable in other environments.

A web-service API standard would overcome this problem, and make phylogenetic data as well as services universally available to any client application that supports the API. Reference implementations of the client API could simplify and promote adoption.

Rather than proposing a particular implementation, this page is to gather requirements and use-cases that such an API would have to fulfill.


If we define phyloinformatics as the informatics of managing, querying, and manipulating phylogenetic data, we can define the scope of PhyloWS along two axes.

  1. Possible scopes of operations:
    • Managing: storing (create), updating, deleting phylogenetic data
    • Querying: retrieval only
    • Manipulating: manipulating the result (pruning, concatenating, on-the-fly super-tree)
  2. Possible scopes of phylogenetic data types:
    • Phylogenetic trees
    • Character matrices (discrete, continuous, DNA, RNA, protein)
    • Transition models
    • Node data (both internal and external) (minimal data for molecular sequence based trees: taxonomy and sequence identifier; should be able to contain/refer-to any type of data, for example geographic information)

Use Cases

Phylogenetic trees

Topological queries

  1. Find most recent common ancestor of two or more leaf nodes in the specified tree
  2. Find the minimum spanning clade for two or more leaf nodes in the specified tree
  3. Obtain trees with length shorter or longer than an input tree, or a given length
  4. Given a tip (or internal) node, find the tree with the shortest (or longest) root to tip (or node) distance
    • Alternatively, obtain distribution of root-to-tip (or node) distances

Node-based queries

  1. Obtain metadata for a given node
  2. Obtain lineage of ancestors for a given node
  3. Find trees containing a given set of OTUs, or set of taxa, or set of sequences
  4. Obtain the patristic distance(s) between two nodes

Character-based queries

  1. Find clades with all nodes having a given character
  2. Given characters X and Y, which trees support character X evolving before (or after) character Y
    • Note: This requires storing and querying reconstructed ancestral character states.

Tree and node annotation queries

  1. Obtain metadata for tree, such as name, namespace, method, author
    • Attribute/value pairs
  2. Find clades with all nodes having a given annotation
    • For example, find all Drosophila species occurring in Hawaii
  3. Find trees based on tree metadata
    • For example, find all phylogenetic trees constructed with a particular method, or with specific parameters. (Q4 in Nakhleh et al)
    • Another example: find all phylogenetic trees created by a certain author, or based on the date it was created. (Q6 in Nakhleh et al)
    • Note: satisfying this requires extending the current BioSQL PhyloDB module to capture metadata of trees
  4. Find trees based on the type of data they were built with
    • For example, find trees built on DNA alignments, protein alignments, continuous characters, or discrete characters.
    • Question: should this be a metadata query too, or rather query the linked alignment (or character data matrix)
    • Note: Currently there isn't a good way to store character data matrices in BioSQL, and even storing DNA and protein alignments hasn't been formalized yet.
  5. Find trees based on the model of evolution used
    • For example, find trees built using the HKY85 transition model, with no codon model.
    • Note: Supporting this requires storing either the model of evolution in a relational model (which BioSQL currently doesn't do), or to use a controlled vocabulary of models of evolution. The latter would only be able to support a limited number of models of evolution, whereas in reality the possibilities are combinatorial between base frequencies, substitution model, rate heterogeneity, partitioning, and constraining parameters (see Transition Model Language).


Filtering trees:

  1. Filter all (not) matching trees using some metric, where metric might be:
    • Score under some optimality criterion (-lnL, parsimony tree length, posterior probability)
    • Pure topology metrics (Colless' imbalance, I2 imbalance, Pybus gamma, stemminess measure of Fiala and Sokal (1985), stemminess measure from Rohlf et al. (1990), resolution)
    • Distance to another tree (symmetric difference metric sensu Penny and Hendy (1985)), i.e. input requires a reference topology
  2. Filter all (not) matching nodes using some metric, where metric might be:
    • Numerical score (bootstrap value, bremer value, posterior probability)
    • Topological location (distance to root, distance to tallest/shortest tip)
    • Subtended clades (monophyly)
  3. Filter trees by a calculated characteristic
    • Filter trees with size greater or smaller than a given number, or a certain ratio of internal to external nodes

Filtering nodes:

  1. Given a tree (e.g., by identifier), filter nodes by distance to root (greater or smaller than a given number). This results in multiple matching nodes.
  2. Given a tree (e.g., by identifier), retrieve node with longest path (number of nodes from root to tip), or with longest branch from direct ancestor. This results in a single node.

Functions on trees

  1. Modifying functions:
    • Pruning clades (hierarchical subsetting)
    • Rerooting trees
    • Collapsing of branches as a function of support values
    • Modifying branch lengths (exponentiate, ultrametricize, rate-smooth)
  2. Aggregrating functions:
    • Counting functions (the number of matching trees, number of nodes in matching trees, number of internal or external nodes)
    • Topology characteristics: length, height, balance, stemness, resolution
  3. Tree comparison
    • Distance calculation between two specified trees
  4. Consensus calculation
    • Obtain consensus tree from a set of 2 or more input trees
  5. Supertree functions:
    • Automate pruning-grafting super-tree method
    • Min-cut super-tree method
  6. Reconciliation functions:
    • Infer gene duplication on a gene tree given a species tree
  7. Consensus functions:
    • Strict consensus, majority rule, etc.

OTU-oriented queries

  1. For a given OTU (or node,internal or external) identifier:
    • Retrieve data associated with that id (e.g. sequences aligned/unaligned, character state sequences)
    • Retrieve taxonomic identifier(s)
    • Retrieve sequence identifier in case of gene trees

Character Data

Note: this is work in progress, needs cleaning up.

Queries based on data

  • Given an OTU, obtain all matrices, or given a matrix, all characters that have data for that OTU
  • Given a set of characters, obtain OTUs, or given a matrix, all OTUs that have data for that character
  • Given a set of OTUs, obtain a character matrix from all matching matrices that have data for those OTUs

Queries based on character evolution

  • Find characters that have been gained (lost) more or fewer times than n
    • A variation of this is a query for characters that are (or are not) supported by a a given tree (or set thereof), given a model of evolution
    • Note: This requires storing and querying reconstructed ancestral character states.

Character-based functions

  • Given a tree and a model of evolution, simulate a character matrix

PhyloWS Requirements

Phylogenetic Tree API

  1. Task: Find trees
    • Input: one or more (partial) names, or identifiers, and optionally a namespace of matching trees
    • Output: names and identifiers of matching trees
  2. Task: Find trees by nodes
    • Input: a list of labels of nodes
    • Output: names and identifiers of trees that each contain nodes with each of the labels
  3. Task: Find trees by clade
    • Input: clade specification (phylocode)
    • Output: names and identifiers of trees that each contain nodes with each of the labels
  4. Task: Retrieve tree
    • Input: identifier of tree to be retrieved
    • Output: the tree (with complete structure)
  5. Task: Retrieve subtree or root node for matching clades
    • Input:
      • clade specification (identifier or label of clade root, phylocode specification)
      • whether to only return the root of the clade (MRCA query)
      • optionally, filter by namespace and name(s) (or identifier(s)) of trees
    • Output: matching clades as subtrees (with complete structure)
  6. Task: Project tree to subtree induced by a set of nodes
    • Input: specifications of nodes (labels, identifiers) that induce a subtree
    • Output: the subtree induced by the specified nodes, with all other nodes pruned

Tree Matching

  1. Task: Find, or filter trees matching a query topology.
    • The query topology might have polytomies, of which matching trees may be a specialization.
    • Input: A database (or result set) of trees, a query tree, and a distance metric
    • Output: The matching trees (names, identifiers), or alternatively the subtrees of matching trees projected onto the query topology

Tree Functions


PhyloCode as phylogenetic clade query specification

For clade-based queries we need a syntax for specifying clades. One possibility is to adopt the PhyloCode notation for this. Section 9 in Division II gives abbreviations for clade names:

  • <A&B - the least inclusive clade containing 'A' and 'B', where 'A' and 'B' are specifiers. Also known as the minimum spanning clade of A and B, or all descendants of the most recent common ancestor of A and B, including the MRCA itself.
  • >A~B - the most inclusive clade containing all nodes sharing a more recent common ancestor with 'A' than with 'B', where 'A' and 'B' are specifiers. Also known as the maximum spanning clade (or stem query) of the earliest ancestor of A that isn't also an ancestor of B.
  • >M(A) - the most inclusive clade possessing synapomorphy (i.e., character state) M, as inherited by 'A', where 'A' is a specifier. This is a character-based definition of a clade, rather than a topology-based.
  • A specifier here would simply be a node label, a taxon name, a sequence ID, or a gene name, or another phylocode. Thus, a phylocode could potentially be recursive.
  • There needs to be a notation to express what a specifier represents. Otherwise, this metadata needs to be given as a query parameter, and would then hold for each specifier.
    • Using the namespace:specifier convention: TX:"Mus musculus", ND:"TRF1_Mouse", ID:"TRF1_MOUSE", GN:TRF1

Distinction between identifiers and specifiers

Example Resources