Taxonomic Intelligence Subgroup

From Evolutionary Informatics Working Group

Taxonomic Intelligence for Phyloinformatics

Members

  1. Bill Piel
  2. Karen Cranston
  3. Matt Kosnik

Motivation

  1. Taxonomic names are the primary currency in phyloinformatics.
  2. Communication protocols need to be improved and services defined to ensure that taxonomic objects are communicated with an acceptable level of ambiguity.
  3. Taxonomic concepts are inadequately informaticized, so the solution has to be an imperfect but adequate one. One solution assumes a "nominal concept" for each name string within each TNS (taxonomic name service, e.g. ITIS, NCBI, Species2000, etc.).
  4. The main challenge is to communicate the meaning of a name by associating it with a TNS.
  5. An important service is translating the meaning of a request between names that belong to different TNSes.
  6. Expand this service to cover the use case of requesting generic phylogenetic hypotheses.

Activities

  1. Review Available Technologies
    1. NeXML
    2. PhyloWS
    3. Seek API Taxon Object and TDWG Taxon Concept
  2. Expand Syntax, Protocols, and Requirements
    1. Develop desired requirements and example specification for Taxon Name Services: Taxon Name Service I/O
    2. Develop syntax for OTU specification within NeXML to associate an OTU with a specific name service: NeXML OTU Decoration
  3. Build Proof-of-Concept with Dummy
    1. Client
    2. TNS (Taxon Name Service)
    3. Tree Database: TreeBASE Data

Namespace Resolver

Use cases

synonymy

A user searching for all data on taxon x should also get data on taxon y when y is a synonym of x, even if the tree is labeled only with y.

Example A:

  • A search for "Pagophilus groenlandicus" should also find anything labeled "Phoca groenlandica"
  • ITIS records this synonymy (Pagophilus groenlandicus - valid - Harp Seal / Phoca groenlandica - invalid).
  • NCBI has no results for Pagophilus, but returns results for Phoca...

Example B:

  • A search for "Argopecten" should also get anything labeled "Plagioctenium".
  • Paleobiology database records this synonymy (Argopecten - valid / Plagioctenium - invalid).
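The synonymy use case can be sketched as a name-expansion step that runs before the tree search. This is only an illustration: the hard-coded table holds the two examples above, and the function names `expand_name` and `search_labels` are hypothetical. A real resolver would query a taxonomic name service rather than a dict.

```python
# Toy synonym table built from Examples A and B (invalid name -> valid name).
# A real resolver would look these up in a taxonomic name service.
VALID_NAME = {
    "Phoca groenlandica": "Pagophilus groenlandicus",
    "Plagioctenium": "Argopecten",
}


def expand_name(name):
    """Return every name string that should match a search for `name`."""
    valid = VALID_NAME.get(name, name)
    names = {valid}
    # Add every recorded synonym that maps to the same valid name.
    names.update(syn for syn, v in VALID_NAME.items() if v == valid)
    return names


def search_labels(name, labels):
    """Return the labels that equal `name` or one of its recorded synonyms."""
    wanted = expand_name(name)
    return [label for label in labels if label in wanted]
```

With this table, a search for "Argopecten" over labels that contain only "Plagioctenium" still finds that label, and a search in the opposite direction works the same way.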

homonyms

A user searching for all data on taxon x could get wildly divergent trees if the name "x" refers to two distinct taxa.

Products

PhyloWS Interface to TreeBASE Data

The TreeBASE data is now served on dbhack1, though very slowly. The interface is a flavor of PhyloWS that returns the tree in NeXML by default.

Basic URNs to a Tree

GET /phylows/tree/<identifier>/[{format}=<format>]

The pointer to a tree can use either a TreeBASE integer (e.g. "TB:2853") or a legacy ID published via the treebase.org website (e.g. "LTB:Tree3586").

That does mean that two different URNs each point to the same thing, e.g.:

If you want a tree in NEXUS, do this:

Basic URNs to a Clade

GET /phylows/tree/<identifier>/clade/<nodeID>/[{format}=<format>]

In this case the <nodeID> is a serially-generated integer starting from the root of the tree. We may redesign this to use a unique nodeID number.

This example returns the fifth node in the tree with ID 2853.
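As a rough sketch, the two path templates can be filled in with a pair of small helpers. The function names are hypothetical, and two things are assumed here rather than confirmed: that the identifier appears verbatim in the path (for instance "TB:2853"), and that the fifth node is addressed as node ID 5. The optional format parameter is left out.

```python
from urllib.parse import quote


def tree_path(identifier):
    """Build the path for a tree, e.g. identifier "TB:2853"."""
    # Assumption: the colon in the identifier is kept as-is in the path.
    return "/phylows/tree/" + quote(identifier, safe=":")


def clade_path(identifier, node_id):
    """Build the path for a clade, addressed by its serial node ID."""
    return f"{tree_path(identifier)}/clade/{int(node_id)}"


print(tree_path("TB:2853"))
print(clade_path("TB:2853", 5))
```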

Queries using SRU/CQL Syntax

GET /phylows/find/tree/?[query=<CQL query>]&[recordSchema=<format>]&[operation=searchRetrieve]&[version=1.2]

The query statement should be written in Contextual Query Language [CQL]. The "index" keys (as they are known in CQL) were picked arbitrarily for the convenience of TreeBASE's data model and namespace.

  • taxon_name is a string such as "Homo sapiens", "Homo sapiens Linnaeus, 1758", "Mammalia" etc...
  • taxon_label is a string attached to the node of a tree
  • ncbi_taxid is an integer used by NCBI to track its taxonomic names
  • ubio_namebankid is an integer used by uBio to track taxonomic names
  • taxon_id is an integer for TreeBASE's own taxonomic names
  • h.ncbi_taxid is a higher classification search based on NCBI's classification and using the NCBI's taxid
  • h.taxon_name is a higher classification search based on NCBI's classification and using a taxon name string

For example:

  • Search for all trees that have both a taxon starting with Homo sapiens and a node linked to NCBI taxid 9593 (which happens to be Gorilla gorilla): query=taxon_name+any+%22Homo+sapiens%25%22+and+ncbi_taxid+%3D+9593
  • Search for all trees that have either any kind of Primates OR any kind of Aves: query=h%2Etaxon_name+any+Primates+or+h%2Etaxon_name+any+Aves
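The percent-encoded query strings above do not need to be written by hand. The sketch below uses Python's standard `urllib.parse.urlencode` to build them from a plain CQL statement. The helper name `find_tree_url` is hypothetical, and no particular `recordSchema` value is assumed, so that parameter is only sent when the caller supplies one.

```python
from urllib.parse import urlencode


def find_tree_url(cql, record_schema=None):
    """Build a /phylows/find/tree/ request path from a plain CQL statement."""
    params = {"query": cql}
    if record_schema is not None:
        params["recordSchema"] = record_schema
    params["operation"] = "searchRetrieve"
    params["version"] = "1.2"
    # urlencode escapes quotes, "%" and "=" and turns spaces into "+".
    return "/phylows/find/tree/?" + urlencode(params)


print(find_tree_url('taxon_name any "Homo sapiens%" and ncbi_taxid = 9593'))
print(find_tree_url("h.taxon_name any Primates or h.taxon_name any Aves"))
```

The first call reproduces the encoded query from the first example. In the second, `urlencode` leaves the "." in h.taxon_name unescaped, which decodes to the same string as the %2E form.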

Taxon Name Service

Given a name (language optional), a number (language specified), or an LSID (no language required), return a collection of names or numbers in the other languages the TNS knows about.
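A minimal sketch of that contract, using an in-memory table in place of a real service, might look like the following. The function `translate` and the record layout are hypothetical. The NCBI taxids are real (9593 comes from the query example, 9606 is Homo sapiens), but the `taxon_id` values are made-up placeholders, and LSID lookup is not covered.

```python
# Toy cross-reference records. The taxon_id values are made-up placeholders.
RECORDS = [
    {"name": "Gorilla gorilla", "ncbi_taxid": 9593, "taxon_id": 1001},
    {"name": "Homo sapiens", "ncbi_taxid": 9606, "taxon_id": 1002},
]


def translate(value, language=None):
    """Return the identifiers for `value` in every other known language."""
    if language is None:
        # A bare name may omit its language; a bare number may not.
        if not isinstance(value, str):
            raise ValueError("a numeric identifier needs its language")
        language = "name"
    for record in RECORDS:
        if record.get(language) == value:
            return {k: v for k, v in record.items() if k != language}
    return {}
```

Here `translate("Gorilla gorilla")` returns the NCBI and placeholder TreeBASE identifiers, while `translate(9593, "ncbi_taxid")` returns the name and the placeholder TreeBASE identifier.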

Standup Reports

Tuesday

Delivered by Bill. Notes by Dave.

  • Exploring different options for cross referencing OTUs.
  • Approaches
    • use CDAO - this may not be flexible enough to point to all the kinds of resources we want to point to.
    • Create resolver PURLs (persistent URLs)
  • Will be looking at NeXML and PhyloWS.
  • Get round trip exercise.

Questions:

  • Karen:
    • Do ontology people have any comments about getting this type of information into NeXML?
  • Roger:
    • We need to parse exactly what taxonomy group is putting in.
  • Hilmar argues against putting human readable content in these XMLs. Want IDs.
  • Bill:
    • When people ask TreeBASE for a tree, it sends them that tree, ideally with as much decoration on each node as we have.
  • Roger:
    • Going to have to actively think about how we would digest the DICT elements.
  • Matt K:
    • We don't care how it gets in there, as long as it gets there.
  • Arlin:
    • Concerned that people want to make the XML be human readable.

Wednesday

Delivered by Bill and Karen. Notes by Dave.

  • Building proof of concept
  • Put TreeBASE dump on DBHack repository
  • Have a Perl script that takes GUID numbers and builds NeXML documents.
  • Will probably build some REST services for that.
  • Develop a better idea of what we want.
  • Playing with uBio, chatting with Rod Page.
  • What kind of data would we like, and what can't we get from uBio?

Thursday

Delivered by Bill, notes by Dave.

  • TreeBASE REST API Documentation
  • GET so far, no POST.
  • Find resources by ID #
    • /tree/tb:1234
    • Get this tree from TreeBASE.
  • XML currently has additional data that could be made available, buried in well formatted comments.
  • Roger: want to label what the author has told you, versus what you are saying.
  • But that is there just to show you what could be rendered in the XML if people are interested.
  • Clade should be a globally scoped ID. If someone asks for a clade from a tree, and the clade exists, but not in that tree, generate a 404 error.
  • Allows spec to support resources that don't have global clade IDs.
  • Arlin: Why do you want to reference a clade in a tree?
  • Hilmar/Roger: How do you tell what is the same thing?

References