PhyloWS

From Evolutionary Informatics Working Group

Phyloinformatics Web Services API: Overview

At present there is no standard web-service API for phylogenetic data that would allow integration of phylogenetic data and service providers into the programmable web. Hence, current approaches to integrate data and services into workflows are highly specific to the integration platform (CIPRES, BioPerl, Bio::Phylo, Kepler), and nearly unusable in other environments.

A web-service API standard would overcome this problem, and make phylogenetic data as well as services universally available to any client application that supports the API. Reference implementations of the client API could simplify and promote adoption.

Rather than proposing a particular implementation, this page gathers the requirements and use cases that such an API would have to fulfill.

Contact

The PhyloWS standard initiative was started at the BioHackathon 2008 in Tokyo, Japan, by Hilmar Lapp and Rutger Vos.

There is a Google Group dedicated to discussing all PhyloWS-related issues and questions, as well as the development of PhyloWS-supporting tools and software.

PhyloWS has been used in several software and database projects; see Links and References.

Scope

If we define phyloinformatics as the informatics of managing, querying, and manipulating phylogenetic data, we can define the scope of PhyloWS along two axes.

  1. Possible scopes of operations:
    • Managing: storing (create), updating, deleting phylogenetic data
    • Querying: retrieval only
    • Manipulating: manipulating the result (pruning, concatenating, on-the-fly super-tree)
  2. Possible scopes of phylogenetic data types:
    • Phylogenetic trees
    • Character matrices (discrete, continuous, DNA, RNA, protein)
    • Transition models
    • Node data, for both internal and external nodes. The minimal data for trees based on molecular sequences are the taxonomy and sequence identifiers, but a node should be able to contain or refer to any type of data, for example geographic information.

Use Cases

Phylogenetic trees

Topological queries

  1. Find most recent common ancestor of two or more leaf nodes in the specified tree
  2. Find the minimum spanning clade for two or more leaf nodes in the specified tree
  3. Find trees matching a query topology
  4. Obtain trees with length shorter or longer than an input tree, or a given length
  5. Given a tip (or internal) node, find the tree with the shortest (or longest) root to tip (or node) distance
    • Alternatively, obtain distribution of root-to-tip (or node) distances
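
The first two queries can be illustrated with a small sketch. It assumes a toy tree stored as a child-to-parent mapping; the node labels and the function names (lineage, mrca, spanning_clade) are made up for this example and are not part of any PhyloWS specification.

```python
# Toy tree as a child -> parent mapping (hypothetical labels; the root has parent None).
PARENT = {
    "root": None,
    "n1": "root", "n2": "root",
    "A": "n1", "B": "n1",
    "C": "n2", "D": "n2",
}

def lineage(node, parent=PARENT):
    """Return the path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def mrca(nodes, parent=PARENT):
    """Most recent common ancestor of two or more nodes."""
    first, *rest = nodes
    common = lineage(first, parent)
    for other in rest:
        ancestors = set(lineage(other, parent))
        # Keep only the ancestors shared so far, preserving tip-to-root order.
        common = [n for n in common if n in ancestors]
    return common[0]

def spanning_clade(nodes, parent=PARENT):
    """All descendants of the MRCA, including the MRCA itself."""
    top = mrca(nodes, parent)
    return {n for n in parent if top in lineage(n, parent)}
```

With this toy tree, mrca(["A", "B"]) is the internal node n1, and spanning_clade(["A", "B"]) contains n1, A and B. A real service would run the same logic against stored trees rather than an in-memory dictionary.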

Node-based queries

  1. Obtain metadata for a given node
  2. Obtain lineage of ancestors for a given node
  3. Obtain clade(s) matching a given clade specification (by common ancestry, or by character)
  4. Find trees containing a given set of OTUs, or set of taxa, or set of sequences
  5. Obtain the patristic distance(s) between two nodes
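
As an illustration of the last query, the patristic distance is the sum of branch lengths along the path connecting two nodes. The sketch below assumes a toy tree in which every node stores its parent and the length of the branch leading to that parent; all labels, lengths, and function names are hypothetical.

```python
# Toy tree: node -> (parent, length of the branch to the parent). Hypothetical values.
TREE = {
    "root": (None, 0.0),
    "n1": ("root", 1.0),
    "A": ("n1", 2.0),
    "B": ("n1", 3.0),
    "C": ("root", 4.0),
}

def distances_to_ancestors(node, tree=TREE):
    """Map each ancestor of node (including node itself) to its distance from node."""
    out, dist = {}, 0.0
    while node is not None:
        out[node] = dist
        parent, length = tree[node]
        dist += length
        node = parent
    return out

def patristic_distance(a, b, tree=TREE):
    """Sum of branch lengths on the path between a and b."""
    up_a = distances_to_ancestors(a, tree)
    up_b = distances_to_ancestors(b, tree)
    # With non-negative branch lengths, the shared ancestor that minimizes
    # the summed distance is the most recent common ancestor.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)
```

For the toy tree, the distance between A and B is 2.0 + 3.0 = 5.0, and the distance between A and C passes through the root: 2.0 + 1.0 + 4.0 = 7.0.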

Character-based queries

  1. Find clades with all nodes having a given character
  2. Given characters X and Y, find the trees that support character X evolving before (or after) character Y
    • Note: This requires storing and querying reconstructed ancestral character states.

Tree and node annotation queries

  1. Obtain metadata for tree, such as name, namespace, method, author
    • Attribute/value pairs
  2. Find clades with all nodes having a given annotation
    • For example, find all Drosophila species occurring in Hawaii
  3. Find trees based on tree metadata
    • For example, find all phylogenetic trees constructed with a particular method, or with specific parameters. (Q4 in Nakhleh et al)
    • Another example: find all phylogenetic trees created by a certain author, or based on the date it was created. (Q6 in Nakhleh et al)
    • Note: satisfying this requires extending the current BioSQL PhyloDB module to capture metadata of trees
  4. Find trees based on the type of data they were built with
    • For example, find trees built on DNA alignments, protein alignments, continuous characters, or discrete characters.
    • Question: should this be a metadata query too, or rather query the linked alignment (or character data matrix)
    • Note: Currently there isn't a good way to store character data matrices in BioSQL, and even storing DNA and protein alignments hasn't been formalized yet.
  5. Find trees based on the model of evolution used
    • For example, find trees built using the HKY85 transition model, with no codon model.
    • Note: Supporting this requires either storing the model of evolution in a relational model (which BioSQL currently doesn't do) or using a controlled vocabulary of models of evolution. The latter would only be able to support a limited number of models of evolution, whereas in reality the possibilities are combinatorial between base frequencies, substitution model, rate heterogeneity, partitioning, and constraining parameters (see Transition Model Language).

Filtering

Filtering trees:

  1. Filter all (not) matching trees using some metric, where metric might be:
    • Score under some optimality criterion (-lnL, parsimony tree length, posterior probability)
    • Pure topology metrics (Colless' imbalance, I2 imbalance, Pybus gamma, stemminess measure of Fiala and Sokal (1985), stemminess measure from Rohlf et al. (1990), resolution)
    • Distance to another tree (symmetric difference metric sensu Penny and Hendy (1985)), i.e. input requires a reference topology
  2. Filter all (not) matching nodes using some metric, where metric might be:
    • Numerical score (bootstrap value, Bremer value, posterior probability)
    • Topological location (distance to root, distance to tallest/shortest tip)
    • Subtended clades (monophyly)
  3. Filter trees by a calculated characteristic
    • Filter trees with size greater or smaller than a given number, or a certain ratio of internal to external nodes

Filtering nodes:

  1. Given a tree (e.g., by identifier), filter nodes by distance to root (greater or smaller than a given number). This results in multiple matching nodes.
  2. Given a tree (e.g., by identifier), retrieve node with longest path (number of nodes from root to tip), or with longest branch from direct ancestor. This results in a single node.

Functions on trees

  1. Modifying functions:
    • Pruning clades (hierarchical subsetting)
    • Rerooting trees
    • Collapsing of branches as a function of support values
    • Modifying branch lengths (exponentiate, ultrametricize, rate-smooth)
  2. Aggregating functions:
    • Counting functions (the number of matching trees, number of nodes in matching trees, number of internal or external nodes)
    • Topology characteristics: length, height, balance, stemminess, resolution
  3. Tree comparison
    • Distance calculation between two specified trees
  4. Consensus calculation
    • Obtain consensus tree from a set of 2 or more input trees
  5. Supertree functions:
    • Automate pruning-grafting super-tree method
    • Min-cut super-tree method
  6. Reconciliation functions:
    • Infer gene duplication on a gene tree given a species tree
  7. Consensus functions:
    • Strict consensus, majority rule, etc.
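
One of the modifying functions, collapsing branches as a function of support values, can be sketched as follows. The nested-dictionary tree layout, the key names (name, support, children), and the choice to leave internal nodes without a support value untouched are all assumptions made for this example.

```python
def collapse(node, threshold):
    """Return a copy of node in which internal branches with support below
    threshold are removed and their children are attached to the parent."""
    new_children = []
    for child in node.get("children", []):
        child = collapse(child, threshold)
        support = child.get("support")
        is_internal = bool(child.get("children"))
        if is_internal and support is not None and support < threshold:
            # Dropping the branch turns its children into a polytomy on the parent.
            new_children.extend(child["children"])
        else:
            new_children.append(child)
    copy = {k: v for k, v in node.items() if k != "children"}
    if new_children:
        copy["children"] = new_children
    return copy

# Hypothetical tree ((A,B)X,C)root with support 0.4 on the branch leading to X.
EXAMPLE = {
    "name": "root",
    "children": [
        {"name": "X", "support": 0.4,
         "children": [{"name": "A"}, {"name": "B"}]},
        {"name": "C"},
    ],
}
```

Calling collapse(EXAMPLE, 0.5) yields a root with the three children A, B and C, while collapse(EXAMPLE, 0.3) leaves the topology unchanged. The function returns a new tree, so the stored tree is not modified.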

OTU-oriented queries

  1. For a given OTU (or node, internal or external) identifier:
    • Retrieve data associated with that id (e.g. sequences aligned/unaligned, character state sequences)
    • Retrieve taxonomic identifier(s)
    • Retrieve sequence identifier in case of gene trees

Character Data

Note: this section is a work in progress and needs cleaning up.

Queries based on data

  • Given an OTU, obtain all matrices, or given a matrix, all characters that have data for that OTU
  • Given a set of characters, obtain OTUs, or given a matrix, all OTUs that have data for that character
  • Given a set of OTUs, obtain a character matrix from all matching matrices that have data for those OTUs
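
The data-based queries above amount to lookups in either direction between OTUs and characters. A minimal sketch, assuming matrices are held as nested dictionaries (matrix name, then OTU, then character) and that a missing key means "no data"; the matrix contents and function names are invented for illustration:

```python
# matrix name -> OTU -> character -> state; a missing key means no data.
MATRICES = {
    "m1": {"Homo": {"c1": "0", "c2": "1"}, "Pan": {"c1": "1"}},
    "m2": {"Homo": {"c3": "A"}, "Gorilla": {"c3": "G"}},
}

def matrices_for_otu(otu, matrices=MATRICES):
    """Names of all matrices that contain a row for the OTU."""
    return sorted(name for name, rows in matrices.items() if otu in rows)

def characters_for_otu(matrix, otu, matrices=MATRICES):
    """Characters in one matrix that have data for the OTU."""
    return sorted(matrices[matrix].get(otu, {}))

def otus_for_character(matrix, character, matrices=MATRICES):
    """OTUs in one matrix that have data for the character."""
    return sorted(o for o, states in matrices[matrix].items() if character in states)
```

For example, matrices_for_otu("Homo") returns both matrices, whereas otus_for_character("m1", "c2") returns only Homo because Pan has no state recorded for c2.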

Queries based on character evolution

  • Find characters that have been gained (lost) more or fewer times than n
    • A variation of this is a query for characters that are (or are not) supported by a given tree (or set thereof), given a model of evolution
    • Note: This requires storing and querying reconstructed ancestral character states.

Character-based functions

  • Given a tree and a model of evolution, simulate a character matrix

PhyloWS Requirements

Scope: Phylogenetic Tree Database

Retrieve:

  1. Task: Find trees by name or identifier
    • Input: one or more (partial) names, or identifiers, and optionally a namespace of matching trees
    • Output: names and identifiers of matching trees
    • Q: Should this also return metadata for each tree?
  2. Task: Find trees by nodes
    • Input: a list of node specifiers, and a designation of what the specifiers should match (node label, sequence ID, taxon, gene name)
    • Output: names and identifiers of trees that each contain nodes matching the node specifiers
    • Q: Should this use a convention for encoding the type of specifier (such as namespace:value)?
  3. Task: Find trees by clade
    • Input: clade specification (phylocode)
    • Output: names and identifiers of trees that each contain nodes with each of the labels
  4. Task: Find trees by metadata
    • Input: metadata constraints as (attribute, operator, value) structures
    • Output: names and identifiers of matching trees
    • Q: Should this also return complete metadata for each tree? Or only the metadata element by which it matched?
    • Q: Should this borrow from or be based on SRU? Or OpenSearch?
  5. Task: Retrieve tree metadata
    • Input: list of one or more identifiers for which to retrieve metadata
    • Output: metadata of each tree, as attribute/value pairs
  6. Task: Retrieve tree
    • Input: identifier of tree to be retrieved
    • Output: the tree (with complete structure)
  7. Task: Retrieve subtree or root node for matching clades
    • Input:
      • clade specification (identifier or label of clade root, phylocode specification)
      • whether to only return the root of the clade (MRCA query)
      • optionally, filter by namespace and name(s) (or identifier(s)) of trees
    • Output: matching clades as subtrees (with complete structure)
    • Q: Should this also return all metadata of all nodes in the clade, or would that require a separate request?
  8. Task: Retrieve all metadata for one or more nodes
    • Input: a list of one or more node identifiers
    • Output: the identified nodes with all their metadata as attribute/value pairs
    • Note: see some thoughts on node metadata representation at the BioHackathon 2008 PhyloWS workgroup page.
  9. Task: Project tree to subtree induced by a set of nodes
    • Input: specifications of nodes (labels, identifiers) that induce a subtree
    • Output: the subtree induced by the specified nodes, with all other nodes pruned
    • Q: Should this also return all metadata of all nodes in the clade, or would that require a separate request?
  10. Task: Find, or filter trees matching a query topology.
    • The query topology might have polytomies, of which matching trees may be a specialization.
    • Input: A database (or result set) of trees, a query tree, and a distance metric
    • Output: The matching trees (names, identifiers), or alternatively the subtrees of matching trees projected onto the query topology
  11. Task: Aggregate (summarize) trees
    • Input: a list of identifiers of trees, and an aggregation operation (#nodes, #internal nodes, #tips, length, height, balance, stemminess, resolution)
    • Output: for each tree the requested aggregation result(s)
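
The "find trees by metadata" task takes constraints as (attribute, operator, value) structures. One possible reading of that input is sketched below; the operator symbols, the metadata records, and the function find_trees are assumptions for illustration, since this page does not fix a syntax.

```python
import operator

# Assumed operator symbols; the requirements above do not prescribe a set.
OPERATORS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}

# Hypothetical tree metadata, keyed by tree identifier.
TREE_METADATA = {
    "tree1": {"method": "ML", "author": "Smith", "ntax": 12},
    "tree2": {"method": "parsimony", "author": "Jones", "ntax": 40},
}

def find_trees(constraints, metadata=TREE_METADATA):
    """Identifiers of trees that satisfy every (attribute, operator, value) constraint."""
    matches = []
    for tree_id, attrs in metadata.items():
        if all(
            attr in attrs and OPERATORS[op](attrs[attr], value)
            for attr, op, value in constraints
        ):
            matches.append(tree_id)
    return matches
```

Here find_trees([("method", "=", "ML")]) returns tree1 only, and combining constraints narrows the result further. A tree lacking a constrained attribute is treated as not matching, which is one of several reasonable choices.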

Create: (only for databases supporting write-access)

  1. Task: Create a tree in the database
    • Input: tree with metadata, nodes, node metadata, and structure
    • Output: success status
    • Q: Should the input be in NeXML format? Or NH? Or NHX? Or all of these?

Update: (only for databases supporting write-access)

  1. Task: Prune clade from tree in the database
    • Input: identifier of root node of clade to be pruned, optionally identifier of node where to graft the clade
    • Output: success status
  2. Task: Reroot tree
    • Input: identifier of node that is to become the new root of its tree
    • Output: success status

Delete: (only for databases supporting write-access)

  1. Task: Delete tree from database
    • Input: list of 1 or more identifier(s) of trees to be deleted
    • Output: success status

Scope: Phylogenetic Data Conversion

Create: (submit data for conversion)

  1. Task: submit a tree/matrix/taxa block etc. for conversion
    • Input: a phylogenetic datum
    • Output: success status, created resource location

Retrieve: (retrieve converted data)

  1. Task: retrieve converted data
    • Input: a resource identifier
    • Output: success status, body with data

Update:

  1. Task: modify input (??)
    • Input: a resource identifier and a body with data
    • Output: success status

Delete: (delete a resource)

  1. Task: clean up after conversion (input files, output files)
    • Input: a resource identifier
    • Output: success status

Scope: Phylogenetic Analysis

For phylogenetic analyses there are two issues that might make this more complicated than database CRUD operations: i) analyses often need many additional parameters (which optimality criterion? which branch-swapping algorithm?); ii) analyses are often long-running, so some form of asynchronicity needs to be implemented.

The letters in parentheses after the task titles are to indicate which tasks are logically related, e.g. all tasks (A) have to do with creating, retrieving or deleting an analysis that modifies a tree topology.

Create: (submit data for an analysis)

  1. Task: submit a tree for topology modifications (A)
    • Input: a tree description
    • Output: success status, created resource location
  2. Task: submit a tree for topology metric calculations (B)
    • Input: a tree description
    • Output: success status, created resource location
  3. Task: submit a pair of trees for pairwise comparisons (C)
    • Input: two tree descriptions
    • Output: success status, created resource location
  4. Task: submit a set of trees for consensus or supertree calculations (D)
    • Input: a set of tree descriptions
    • Output: success status, created resource location
  5. Task: submit a pair of OTUs for pairwise comparisons (E)
    • Input: a pair of OTUs
    • Output: success status, created resource location
  6. Task: submit a pair of nodes for pairwise comparisons (F)
    • Input: a pair of nodes
    • Output: success status, created resource location
  7. Task: submit a pair of sequences for alignment (G)
    • Input: a pair of unaligned molecular sequences
    • Output: success status, created resource location
  8. Task: submit a pair of sequences for divergence calculations (H)
    • Input: a pair of aligned character state sequences
    • Output: success status, created resource location
  9. Task: submit a set of sequences for MSA (I)
    • Input: a set of unaligned molecular sequences
    • Output: success status, created resource location
  10. Task: submit a data matrix for tree inference (J)
    • Input: an aligned character state matrix
    • Output: success status, created resource location
  11. Task: submit a data matrix for divergence calculations (K)
    • Input: an aligned character state matrix
    • Output: success status, created resource location

Retrieve: (retrieve analysis results)

  1. Task: retrieve a modified topology (A)
    • Input: a resource identifier
    • Output: success status, body with tree description
  2. Task: retrieve a topology metric (B)
    • Input: a resource identifier
    • Output: success status, body with a number (int or float, depending on metric)
  3. Task: retrieve a tree-to-tree distance (C)
    • Input: a resource identifier
    • Output: success status, body with a number (int or float, depending on distance metric)
  4. Task: retrieve a consensus or supertree result (D)
    • Input: a resource identifier
    • Output: success status, body with a tree description
  5. Task: retrieve an OTU pairwise comparison result (E)
    • Input: a resource identifier
    • Output: success status, body with distance (int or float, depending on tree type)
  6. Task: retrieve a node pairwise comparison result (F)
    • Input: a resource identifier
    • Output: success status, body with distance (int or float, depending on tree type)
  7. Task: retrieve a pairwise sequence alignment (G)
    • Input: a resource identifier
    • Output: success status, body with character state matrix
  8. Task: retrieve a pairwise sequence divergence calculation result (H)
    • Input: a resource identifier
    • Output: success status, a number (int or float, depending on divergence metric)
  9. Task: retrieve an MSA result (I)
    • Input: a resource identifier
    • Output: success status, a character state matrix
  10. Task: retrieve a tree inference result (J)
    • Input: a resource identifier
    • Output: success status, a tree or a set of trees (depending on inference method)
  11. Task: retrieve a divergence calculation result (K)
    • Input: a resource identifier
    • Output: success status, a number (int or float, depending on divergence metric)

Update:

  1. Task: modify analysis input (??)
    • Input: a resource identifier and a body with data
    • Output: success status

Delete: (delete a resource)

  1. Task: clean up after analysis (input files, output files) (A..K)
    • Input: a resource identifier
    • Output: success status
  2. Task: cancel analysis (implies deletion of input files, output files?) (A..K)
    • Input: a resource identifier
    • Output: success status
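
The create, retrieve, and delete tasks above, together with the need for asynchronicity, suggest a submit-then-poll lifecycle. The following in-memory sketch is only one way to picture it: the class AnalysisStore, the /analysis/N resource locations, and the use of HTTP-style status codes (202 while pending, 200 when done, 204 after deletion, 404 when unknown) are assumptions, not decisions recorded on this page.

```python
import itertools

class AnalysisStore:
    """In-memory stand-in for a service that runs analyses asynchronously."""

    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def submit(self, payload):
        """Create: accept the input and return (status, resource location)."""
        location = f"/analysis/{next(self._ids)}"
        self._jobs[location] = {"input": payload, "result": None}
        return 202, location  # accepted, but not finished yet

    def finish(self, location, result):
        """Called by a (hypothetical) worker once the analysis is done."""
        self._jobs[location]["result"] = result

    def retrieve(self, location):
        """Retrieve: return (status, body); the body is None while still running."""
        job = self._jobs.get(location)
        if job is None:
            return 404, None
        if job["result"] is None:
            return 202, None
        return 200, job["result"]

    def delete(self, location):
        """Delete: cancel or clean up a resource; returns a status only."""
        return 204 if self._jobs.pop(location, None) is not None else 404
```

A client would call submit once, poll retrieve on the returned location until the status changes from 202 to 200, and finally call delete to clean up. Cancelling a running analysis and cleaning up a finished one both map onto delete in this sketch.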


PhyloWS and CQL

To effectively search phylogenetic data, a CQL profile must be defined. A profile specifies the names of indexes (searchable fields) and how they should be used to search for various types of content. The PhyloWS profile will reference both existing context sets (lists of CQL indexes, such as the Dublin Core set), and introduce a new context set with indexes specific to phylogenetics.

PhyloWS Profile

PhyloWS Context Set

Short name for the context set, which will be used in queries:

  • phylo
  • any other suggestions?

Indexes used for the initial version of Phylr:

  • treeid = globally unique id of the tree
  • otus = labels of otus present in the tree
  • authors = author of the tree or associated publication
  • abstract = description of the tree, or abstract of a publication associated with the tree
  • datatype = (needs to be a vocabulary) the type of data used to generate the tree
  • keywords = concatenation of all above fields
  • nexml = complete nexml document (for fast return of results)
  • treesize = total number of nodes in the tree (including internal nodes)
  • hasbranchlengths = true or false
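
To show how such indexes might appear in a query, the sketch below assembles a CQL string from individual search clauses and URL-encodes it. It assumes the suggested short name phylo is adopted as the context set prefix; the helper names cql_clause and cql_and are hypothetical, and no particular endpoint is implied.

```python
from urllib.parse import quote

def cql_clause(index, relation, term, context_set="phylo"):
    """Build one CQL search clause of the form: set.index relation "term"."""
    # Escape backslashes and double quotes so the term stays one quoted string.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'{context_set}.{index} {relation} "{escaped}"'

def cql_and(*clauses):
    """Join clauses with the CQL boolean operator 'and'."""
    return " and ".join(clauses)

# Trees containing both OTU labels and having branch lengths.
query = cql_and(
    cql_clause("otus", "all", "Homo Pan"),
    cql_clause("hasbranchlengths", "=", "true"),
)

# Percent-encode everything that is not URL-safe before placing it in a URL.
encoded = quote(query, safe="")
```

The resulting query reads phylo.otus all "Homo Pan" and phylo.hasbranchlengths = "true". Whether a given server accepts the all relation on the otus index would depend on the profile that is eventually defined.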

Other indexes suggested during Phylr development:

  • gene = the label on the sequences used (as opposed to the otu labels)
  • alignment type = how the seqs were aligned (clustal, muscle, not aligned)
  • alignment score
  • sequence length
  • topology resolution score = i.e. is tree binary, or does it contain polytomies
  • orthology = did the sequence cluster pass or fail the test for paralogous sequences
  • treesize = number of taxa and / or sites

Phylota suggests these searches should be available:

  1. name / id search: return all subtrees of this node
  2. list of names: return trees that contain any / all of these labels
  3. list of names (lca version): return trees that are subtrees of the most recent common ancestor of the labels

Phylota suggests these searches as possibly useful:

  1. gene = the label on the sequences used (as opposed to the otu labels)
  2. alignment type = how the seqs were aligned (clustal, muscle, not aligned)
  3. alignment score
  4. sequence length
  5. resolution score = i.e. is tree binary, or does it contain polytomies
  6. orthology = did the sequence cluster pass or fail the test for paralogous sequences
  7. treesize

Hilmar suggests these indexes:

  • studyName
  • taxonIdentifier

TreeBASE would like these indexes:

  • taxon_label (any string attached to a leaf node of a tree or row of a matrix)
  • taxon_name (names explicitly listed in a tree/matrix)
  • higher_taxon_name (explicitly asking for names in a tree/matrix that are descendants of a given name)

Specification

An actual specification of the API as web services, in either a RESTful or a SOAP binding, is actively being worked on:

PhyloCode

PhyloCode as phylogenetic clade query specification

For clade-based queries we need a syntax for specifying clades. One possibility is to adopt the PhyloCode notation for this. Section 9 in Division II gives abbreviations for clade names:

  • <A&B - the least inclusive clade containing 'A' and 'B', where 'A' and 'B' are specifiers. Also known as the minimum spanning clade of A and B, or all descendants of the most recent common ancestor of A and B, including the MRCA itself.
  • >A~B - the most inclusive clade containing all nodes sharing a more recent common ancestor with 'A' than with 'B', where 'A' and 'B' are specifiers. Also known as the maximum spanning clade (or stem query) of the earliest ancestor of A that isn't also an ancestor of B.
  • >M(A) - the most inclusive clade possessing synapomorphy (i.e., character state) M, as inherited by 'A', where 'A' is a specifier. This is a character-based definition of a clade, rather than a topology-based one.
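
A service accepting these abbreviations would first need to recognize which of the three forms it was given. The sketch below does this with regular expressions for the two-specifier forms shown above; the kind names (min_spanning, stem, character) and the function parse_clade_spec are made up for this example, and nested (recursive) specifications are not handled.

```python
import re

# One pattern per abbreviation listed above; kind names are illustrative only.
PATTERNS = [
    ("min_spanning", re.compile(r"^<(?P<a>[^&]+)&(?P<b>.+)$")),    # <A&B
    ("stem", re.compile(r"^>(?P<a>[^~]+)~(?P<b>.+)$")),            # >A~B
    ("character", re.compile(r"^>(?P<m>[^()]+)\((?P<a>.+)\)$")),   # >M(A)
]

def parse_clade_spec(spec):
    """Return (kind, parts) for a clade abbreviation, or raise ValueError."""
    text = spec.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.groupdict()
    raise ValueError(f"unrecognized clade specification: {spec!r}")
```

For instance, parse_clade_spec("<A&B") gives the kind min_spanning with specifiers A and B, and parse_clade_spec(">M(A)") gives the kind character with synapomorphy M and specifier A. Evaluating the parsed specification against a tree is a separate step.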

Query specification for OTU-based queries

A specifier here would simply be a node label, a taxon name, a sequence ID, or a gene name, or another phylocode (thus, a phylocode could potentially be recursive). There needs to be a notation to express what a specifier represents. Otherwise, this metadata needs to be given as a query parameter, and would then hold for each specifier.

  • Using the tag:specifier convention: We could reuse the NHX format conventions here (which incidentally coincide with the EMBL and UniProt line tags for sequence accession and gene name, respectively).
    • Examples: S=Mus musculus, T=NCBI:10090, ND=TRF1 Mouse, AC=TRF1_MOUSE, GN=TRF1. Note that when passing these in a URL, spaces and other reserved characters will need to be URL-encoded.
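
A short sketch of how tagged specifiers like these could be parsed and URL-encoded follows. The tag set is taken from the examples above, the informal descriptions are our reading of the NHX conventions, and the function names are hypothetical. Encoding the "=" and ":" characters as well is a conservative choice for use inside a query parameter value.

```python
from urllib.parse import quote, unquote

# Tags from the examples above, with informal descriptions (assumed meanings).
KNOWN_TAGS = {
    "S": "species name",
    "T": "taxonomy identifier",
    "ND": "node identifier",
    "AC": "sequence accession",
    "GN": "gene name",
}

def parse_specifier(text):
    """Split 'TAG=value' into (tag, value), rejecting unknown tags."""
    tag, sep, value = text.partition("=")
    if not sep or tag not in KNOWN_TAGS or not value:
        raise ValueError(f"not a tagged specifier: {text!r}")
    return tag, value

def encode_specifier(tag, value):
    """Percent-encode a tagged specifier for use as a URL parameter value."""
    return quote(f"{tag}={value}", safe="")
```

With these helpers, encode_specifier("S", "Mus musculus") produces S%3DMus%20musculus, and decoding with unquote followed by parse_specifier recovers the original tag and value.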

Example Resources

Links and References

Links to activities or projects using or supporting PhyloWS:

Other references: