Phylr Subgroup

From Evolutionary Informatics Working Group

The Phylr Subgroup will create a publicly accessible database of phylogenetic data, available through PhyloWS. This database will aggregate data available from other locations, including TreeBASE and MorphoBank. The source code that runs the database will be configurable and extensible, allowing other databases to easily implement the PhyloWS standard.

Participants

The code

All code will be available from the Phylr code repository.

The plan

  1. Fix DendroPy (Jeet/Hilmar)
  2. Convert TreeBASE to NeXML (Jeet/Hilmar)
  3. Index the Nexml data into Lucene (Jeet/Hilmar)
  4. Determine what PhyloWS features exist beyond basic SRU
  5. Install OCLC SRU server
  6. Configure OCLC SRU server to communicate with Lucene
  7. Build full support for PhyloWS by expanding OCLC SRU server
  8. Build an adapter for MorphoBank (or generic SQL?)
  9. Build an adapter for PaleoDB
  10. Build an adapter for BioSQL

(we're expanding this elsewhere)


Design decisions

  • Should the trees be associated with studies or separate?
    • Both should be available, via different result formats. It is probably best for the default to be the "full" Nexml document.
  • How static is PhyloWS?
    • Since this is the first real implementation, we can modify the standard when necessary.
  • What is returned from each query type?
    • Always return records that each consist of a single tree, possibly annotated with study information.
  • Do we need to create a client? Or just use the default interface from the OCLC server?
    • A client is not in our immediate scope. Some of the other hackathon projects may use our server.


Feature List

Below are the URL formats listed in the PhyloWS spec, along with notes on items we will not implement.

Retrieve by ID:

  • /phylows/tree/<ID>
    • note that this is equivalent to /phylows/find/tree/?query=dc.identifier%3D<ID>
    • we will not implement the metadata option (default to true)
    • we will not implement the topology option (default to true)
  • /phylows/clade/<ID>
    • we will not implement this
  • /phylows/node/<ID>
    • we will not implement this

Search:

  • /phylows/find/tree/?query=dc.title%3DPrimates
  • /phylows/node/<identifier>/?[{recordSchema|format}=<format>]
    • we will not implement this
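The retrieval and search URL formats above can be sketched with a few helper functions. This is only an illustration: the function names (`tree_url`, `find_tree_url`, `equivalent_find_url`) and the sample ID `T1234` are made up for this example and are not part of the PhyloWS spec or the Phylr code.

```python
from urllib.parse import quote

BASE = "/phylows"  # path prefix used in the URL formats listed above


def tree_url(tree_id: str) -> str:
    """Direct retrieval URL for a single tree (hypothetical helper)."""
    return BASE + "/tree/" + quote(tree_id, safe="")


def find_tree_url(index: str, term: str) -> str:
    """Search URL; the whole 'index=term' clause is percent-encoded,
    so '=' becomes '%3D' as in the examples above."""
    clause = index + "=" + term
    return BASE + "/find/tree/?query=" + quote(clause, safe="")


def equivalent_find_url(tree_id: str) -> str:
    """The search URL that is equivalent to retrieving a tree by ID."""
    return find_tree_url("dc.identifier", tree_id)


if __name__ == "__main__":
    print(tree_url("T1234"))
    print(equivalent_find_url("T1234"))
    print(find_tree_url("dc.title", "Primates"))
```

Percent-encoding the whole query clause keeps the `=` inside the search expression from being confused with the `=` that separates the `query` parameter from its value.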

Database description:

  • /phylows/provider/metadata/tree
  • /phylows/provider/metadata/node
  • /phylows/provider/formats/tree

Other notes:

  • Initially we will not support alternate result formats (will only return nexml)
  • We're not implementing create, update, or delete. We don't want to mess with authentication issues.


Resources

Lucene structure

These fields will be stored in the Lucene index:

  • treeid = globally unique id of the tree
  • otus = labels of otus present in the tree
  • authors = author of the tree or associated publication
  • abstract = description of the tree, or abstract of a publication associated with the tree
  • datatype = (needs to be a vocabulary) the type of data used to generate the tree
  • keywords = concatenation of all above fields
  • nexml = complete nexml document (for fast return of results)
  • treesize = total number of nodes in the tree (including internal nodes)
  • hasbranchlengths = true or false
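As a rough sketch of how a record with these fields might be assembled before it is handed to the indexer, the example below derives the fields from a small document. Several things are assumptions made for illustration: the sample is a simplified, NeXML-like snippet rather than a validated NeXML file, the document is assumed to hold a single tree, the `build_record` function is hypothetical, and the record is a plain dictionary rather than a real Lucene document.

```python
import xml.etree.ElementTree as ET

# Simplified, NeXML-like sample (an assumption; not a validated NeXML file).
SAMPLE = """<nexml xmlns="http://www.nexml.org/2009">
  <otus id="tax1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <trees id="trees1" otus="tax1">
    <tree id="tree1">
      <node id="n1"/>
      <node id="n2" otu="t1"/>
      <node id="n3" otu="t2"/>
      <edge id="e1" source="n1" target="n2" length="0.1"/>
      <edge id="e2" source="n1" target="n3" length="0.2"/>
    </tree>
  </trees>
</nexml>"""


def local(tag):
    """Strip any '{namespace}' prefix so matching ignores the namespace."""
    return tag.rsplit("}", 1)[-1]


def build_record(nexml_text, treeid, authors="", abstract="", datatype=""):
    """Build a dict with the index fields listed above (hypothetical helper)."""
    elems = list(ET.fromstring(nexml_text).iter())
    labels = [e.get("label") for e in elems
              if local(e.tag) == "otu" and e.get("label")]
    nodes = [e for e in elems if local(e.tag) == "node"]
    edges = [e for e in elems if local(e.tag) == "edge"]

    record = {
        "treeid": treeid,
        "otus": " ".join(labels),
        "authors": authors,
        "abstract": abstract,
        "datatype": datatype,
    }
    # keywords = concatenation of all fields above it (empty ones skipped)
    record["keywords"] = " ".join(v for v in record.values() if v)
    record["nexml"] = nexml_text       # stored whole for fast return
    record["treesize"] = len(nodes)    # counts internal nodes too
    record["hasbranchlengths"] = any(e.get("length") is not None for e in edges)
    return record


if __name__ == "__main__":
    rec = build_record(SAMPLE, "tree1", authors="Doe")
    for field in ("treeid", "otus", "keywords", "treesize", "hasbranchlengths"):
        print(field, "=", rec[field])
```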

Here are some comments from the perspective of a potential data provider (the Phylota browser, http://loco.biosci.arizona.edu/pb/). At a minimum, we would like to preserve in web services the functionality we already have for manual searching through the web site:

  1. name / id search: return all subtrees of this node
  2. list of names: return trees that contain any / all of these labels
  3. list of names (lca version): return trees that are subtrees of the most recent common ancestor of the labels
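The three search modes above could be expressed roughly as follows. This is a toy sketch under stated assumptions: the taxonomy is a hand-written child-to-parent dictionary, and all function names are hypothetical. It does not reflect how the Phylota browser is actually implemented.

```python
def contains_labels(tree_labels, query, mode="any"):
    """Search 2: does the tree contain any / all of the query labels?"""
    tree_set, wanted = set(tree_labels), set(query)
    if mode == "any":
        return bool(tree_set & wanted)
    if mode == "all":
        return wanted <= tree_set
    raise ValueError("mode must be 'any' or 'all'")


def lineage(parent, name):
    """Path from a node up to the root, starting with the node itself."""
    path = [name]
    while name in parent:
        name = parent[name]
        path.append(name)
    return path


def lca(parent, names):
    """Most recent common ancestor of the given names."""
    names = list(names)
    common = lineage(parent, names[0])
    for other in names[1:]:
        seen = set(lineage(parent, other))
        # keep deepest-first order while dropping non-shared ancestors
        common = [node for node in common if node in seen]
    return common[0]


def within(parent, node, ancestor):
    """Searches 1 and 3: is `node` equal to or below `ancestor`?"""
    return ancestor in lineage(parent, node)


if __name__ == "__main__":
    # Toy child -> parent taxonomy (an assumption for this example).
    parent = {"Homo": "Hominidae", "Pan": "Hominidae",
              "Hominidae": "Primates", "Macaca": "Primates"}
    print(contains_labels(["Homo", "Pan"], ["Homo", "Macaca"], mode="any"))
    print(lca(parent, ["Homo", "Pan"]))
    print(within(parent, "Macaca", lca(parent, ["Homo", "Pan"])))
```

For the LCA version, a tree rooted at some taxonomy node would be returned when `within(parent, root, lca(parent, labels))` is true.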

Other information/ modifiers that a user could search by if we enabled it:

  • gene = the label on the sequences used (as opposed to the otu labels)
  • alignment type = how the seqs were aligned (clustal, muscle, not aligned)
  • alignment score
  • sequence length
  • topology resolution score = i.e. is tree binary, or does it contain polytomies
  • orthology = did the sequence cluster pass or fail the test for paralogous sequences
  • treesize = number of taxa and / or sites

Gene seems to be an important one, e.g. return trees constructed from gene="rbcl".

Code Structure

Ryan's brief notes about the structure of the OCLC code: SRWServlet

  • doGet
    • SRWServletInfo reads the config files and sets up properties for all "known" databases.
    • passes off to processMethodRequest
      • SRWServletInfo.setSRWStuff parses the DB out of the request string,
        • SRWDatabase initializes a database object
          • Instantiates the proper class (as listed in SRWServer.props)
          • Reads in properties from the DB's config file
      • looks up some DB details, and sets them as properties of msgContext
    • Builds an SRW SOAP query out of the URL query and invokes it with AxisEngine.invoke()
    • when it returns, strips the SOAP stuff out
  • doPost
    • very similar to doGet, resulting in an AxisEngine.invoke()

srw_bindings.SRWSoapBindingImpl

  • searchRetrieveOperation is invoked by Axis
  • eventually calls SRWDatabase.doRequest() on the proper database class

Databases derive from SRWDatabaseImpl. This class implements the doRequest() method. It is best not to override this method, as it takes care of caching result sets and other useful administrative stuff. It is best to only override the abstract methods.
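The advice above (leave the request-handling method alone, override only the abstract hooks) is the template-method pattern. The sketch below illustrates the idea in Python for brevity; it is an analogy, not the OCLC Java API. The class names and the `run_query` hook are hypothetical, since the actual abstract methods of SRWDatabaseImpl are not listed here.

```python
from abc import ABC, abstractmethod


class DatabaseBase(ABC):
    """Hypothetical stand-in for a base class such as SRWDatabaseImpl."""

    def __init__(self):
        self._cache = {}

    def do_request(self, query):
        """Administrative work (here, result-set caching) lives in the
        base class, which is why subclasses should not override it."""
        if query not in self._cache:
            self._cache[query] = self.run_query(query)
        return self._cache[query]

    @abstractmethod
    def run_query(self, query):
        """Back-end specific search; the only method subclasses implement."""


class InMemoryTrees(DatabaseBase):
    """Toy back end that searches tree titles held in a dictionary."""

    def __init__(self, titles):
        super().__init__()
        self.titles = titles
        self.calls = 0  # counts how often the back end is really queried

    def run_query(self, query):
        self.calls += 1
        return [tree_id for tree_id, title in self.titles.items()
                if query.lower() in title.lower()]


if __name__ == "__main__":
    db = InMemoryTrees({"T1": "Primates phylogeny", "T2": "Fungi"})
    print(db.do_request("primates"))
    print(db.do_request("primates"))  # served from the cache
    print(db.calls)
```

A new back end (for example an SQL adapter) would only need its own `run_query`, and would inherit the caching behaviour unchanged.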

Products

  1. Fixed DendroPy to parse and write legal Nexml
  2. Import/export Nexml from BioSQL
  3. Added support for matrices to BioSQL
  4. Improved the PhyloWS spec
  5. System to index Nexml files into Lucene
  6. Phylr -- A modular implementation of the PhyloWS find API (currently supports Lucene, but can be easily expanded for other back-end storage systems)

Daily Standup Reports

Tuesday

Delivered by Ryan, notes by Dave

  • The big challenge has been working out what PhyloWS really means.
  • What portions are we trying to address this week?
  • Settled on
    • Every time you get data from a PhyloWS query you'll get a single file with embedded NeXML trees back.
  • Not tackling character matrix now, probably not this week.
  • Plans for today: Identify what we'll do with the rest of the week

Wednesday

Delivered by Ryan, notes by Dave

  • Clarified some PhyloWS spec issues.
  • Ran through spec focusing on basic retrieval of trees, searching over basic tree content
  • Tackle matrix issues towards end of work, hopefully
  • Convert TreeBASE to NeXML - have script
  • Parse these trees and get indexable content out. Read that index and make it available via PhyloWS.
  • By end of today have TreeBASE content available in searchable form.

Questions:

  • Rutger:
    • Noticed extra tags in your NeXML. This hints that we don't yet know how to add metadata.
    • Can the Semantic API for CDAO Subgroup verify that the XML-RDF-XML transition works? (???)
    • Metadata in NeXML still an open issue.
    • Arlin:
      • Can we try separate approaches and evaluate each?
      • Rutger:
        • Are dicts going away in favor of RDFa structures?
  • Today: experiment a little more. Arlin will make examples, and the groups will see if they can transform them.

Thursday

Given by Ryan, noted by Dave

  • Spent time in PhyloWS Bootcamp arguing over concepts.
  • Throughout week have been modifying the PhyloWS spec
  • TreeBASE into NeXML is done
  • Nearly done parsing those files into an index.
  • Now working out kinks and making it nice and pretty.
  • Somewhere along the line Hilmar and Jeet will work on SQL adaptors.