Difference between revisions of "Database Interop Hackathon/Implementations"

From Evolutionary Informatics Working Group
{{HackHead}}
 
= Summary =
 
This page discusses potential target implementations that demonstrate the goals of the Evolutionary [[Database Interop Hackathon]], to be held at NESCent in March 2009. The overall goal is to expose evolutionary data resources on the web in a machine-readable architecture so that they can be integrated into complex workflows and mash-ups. To this end, we suggest that these resources implement the stack produced by the evolutionary informatics working group:
 
  
 
</table>
 
= Suggested hackathon deliverables =
 
== Data retrieval, based on an identifier ==
 
This simple reference service returns phylogenetic data that is identifiable by some GUID, such as a [http://tolweb.org ToLWeb] accession number. The service is implemented following the [[PhyloWS/REST]] proposal, has a [[CDAO]] annotated service description and emits [[Future_Data_Exchange_Standard|NeXML]].
 
 
=== Describing the interface ===
 
The first step is to formally describe the interface. In general terms, PhyloWS/REST proposes that data retrieval services are exposed using a URL API like this: '''<code>/phylows/${dataType}/${nameSpace}:${identifier}</code>''', where '''<code>${dataType}</code>''' is something like "Tree", "Matrix", etc. '''<code>${nameSpace}</code>''' is a naming authority such as ToLWeb, and '''<code>${identifier}</code>''' is unique within '''<code>${nameSpace}</code>''' (and consequently globally unique). This implies URLs such as '''<code>/phylows/Tree/ToLWeb:16299</code>''', which, when accessed using the [http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods GET HTTP method] returns a representation of tree [http://tolweb.org/16299 16299].
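As a minimal sketch of how a client or server might take such a URL apart, the following Python splits the path into its three parts. The function name and the exact character classes in the pattern are assumptions made for illustration; the PhyloWS/REST proposal itself does not prescribe them.

```python
import re

# Pattern for /phylows/${dataType}/${nameSpace}:${identifier}.
# Assumption: none of the three parts contains "/" and the first two contain no ":".
PHYLOWS_PATH = re.compile(
    r"^/phylows/(?P<data_type>[^/:]+)/(?P<name_space>[^/:]+):(?P<identifier>[^/]+)$"
)


def parse_phylows_path(path):
    """Split a PhyloWS/REST path into (dataType, nameSpace, identifier)."""
    match = PHYLOWS_PATH.match(path)
    if match is None:
        raise ValueError("not a PhyloWS data retrieval path: %r" % path)
    return (
        match.group("data_type"),
        match.group("name_space"),
        match.group("identifier"),
    )


print(parse_phylows_path("/phylows/Tree/ToLWeb:16299"))
# prints ('Tree', 'ToLWeb', '16299')
```

Rejecting anything that does not match the pattern keeps the data type, naming authority and identifier cleanly separated before any lookup is attempted.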
 
 
A standard way to describe this behaviour is to express it in [http://en.wikipedia.org/wiki/Web_Services_Description_Language WSDL 2.0]; there's a nifty example of WSDL generation and annotation [http://www.keith-chapman.org/2008/09/restfull-mashup-with-wsdl-20-wso2.html here]. At the time of writing, the best free editor for WSDL files comes with the WTP extension for [http://eclipse.org Eclipse]. The end result is a file such as [http://nexml.org/nexml/phylows/tolweb?wsdl this one].
 
<div style="border: 1px solid silver">
 
[[Image:Tolrest.jpg]]
 
 
''Graphical representation of [http://nexml.org/nexml/phylows/tolweb?wsdl this] service description.''
 
</div>
 
 
=== Implementing the service ===
 
A service for the interface described in the previous section can be implemented as an MVC-like application. The '''controller''' part of the service needs to find out what the requested Tree ID is. Depending on the implementation language (and whether some advanced web application programming framework is used) the Tree ID is either part of the '''<code>${PATH_INFO}</code>''' environment variable, or encapsulated in some kind of request object. However the Tree ID is retrieved from the request, the next step is to look up the record that the ID refers to. Typically this would be done with a database query. The goal here is to collect all information needed to populate a '''model''' object (in this case a tree) that can be serialized to the right return format. Assuming that the return format is NeXML, libraries that supply model objects are available for Perl, Python, Java and C++.

Once the model objects are populated, the controller creates a '''view''' from them. In the simplest case, for web applications, this boils down to printing out the XML string representations of the model objects, preceded by the correct [http://en.wikipedia.org/wiki/List_of_HTTP_status_codes response code], e.g. <code>200 OK</code>, and [http://en.wikipedia.org/wiki/MIME mime-type], e.g. <code>application/xml</code>. In more complex web application architectures, the string representations of the model objects may be passed to a response object (which in turn is serialized and returned to the client), or the objects may be passed into a template (JSP, Template Toolkit, PHP) where they are stringified.
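The controller, model and view steps described above can be sketched as a small WSGI application in Python. This is only an illustration under several assumptions: the in-memory <code>TREES</code> dictionary stands in for a real database, the NeXML string is a placeholder rather than output of a NeXML library, the application is assumed to receive the full <code>/phylows/...</code> path in <code>PATH_INFO</code>, and the <code>404 Not Found</code> response is just one possible answer to the open question of error handling.

```python
import re
from wsgiref.util import setup_testing_defaults

# Hypothetical stand-in for a database: (nameSpace, identifier) -> NeXML text.
TREES = {
    ("ToLWeb", "16299"): "<nexml><!-- serialized tree would go here --></nexml>",
}

TREE_PATH = re.compile(r"^/phylows/Tree/(?P<ns>[^/:]+):(?P<id>[^/]+)$")


def lookup_tree(name_space, identifier):
    """Model step: fetch the record; a real service would query a database."""
    return TREES.get((name_space, identifier))


def application(environ, start_response):
    """Controller: read the Tree ID from PATH_INFO, then emit the view."""
    match = TREE_PATH.match(environ.get("PATH_INFO", ""))
    document = lookup_tree(match.group("ns"), match.group("id")) if match else None
    if document is None:
        body = b"No such tree\n"
        start_response("404 Not Found", [("Content-Type", "text/plain"),
                                         ("Content-Length", str(len(body)))])
        return [body]
    # View step: print the XML preceded by the response code and mime-type.
    body = document.encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/xml"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call(path):
    """Invoke the application once, without opening a network socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(application(environ, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call("/phylows/Tree/ToLWeb:16299")
print(status, headers["Content-Type"])
# prints: 200 OK application/xml
```

To serve it over HTTP for local testing, the same <code>application</code> callable could be passed to <code>wsgiref.simple_server.make_server</code>.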
 
=== Outstanding issues ===
 
* Dearth of support for PHP
* How to deal with errors (e.g. response codes)
* Query interface
* CDAO integration
 
 
== Metadata support ==
 
 
NeXML has ways of representing character data and trees, whose semantics are implicitly tied to CDAO (through references to ontology terms using [http://www.w3.org/2002/ws/sawsdl/ SAWSDL] syntax). However, many use cases exist where users want to attach other information to these objects. NeXML has a free-form facility to allow this, but mass adoption of this feature would lead to a soup of annotations lacking clearly defined semantics. Hence, we need to define how ontology-mediated annotations ('metadata' in our definition) can be used. During discussions at the NESCent pre-meeting we fleshed out some examples of how this would work.
 
 
=== Attaching a concept to an element ===
 
 
In addition to having the NeXML schema (which defines XML schema classes) define the semantics by reference to CDAO classes, use cases exist where we need to tie a NeXML instance (in a document) to a CDAO instance. The solution we came up with is to enclose RDF inside an "any" value of a dictionary attachment. The RDF specifies the CDAO class and creates a uniquely identifiable instance on the fly.
 
 
''Assigning an XML element to a type means creating an instance of that type.''
 
 
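<xml>
<tree>
  <node id="foo">
    <dict xmlns:cdao="http://evolutionaryontology.org/cdao"
          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <any id="bar">
        <!-- typing the element creates a cdao:Node instance identified as baz -->
        <cdao:Node rdf:ID="baz"/>
      </any>
    </dict>
  </node>
</tree>
</xml>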
 
 
=== Attaching a taxon identifier to an OTU through a relation ===
 
 
An example of why you would want to tie a NeXML instance in a document to a CDAO instance of a concept is shown below. It extends the previous example to satisfy a common use case: specifying a taxon identifier from some external resource for an otu element. Notice how this uses the CDAO concept "cdao:has_Taxonomy_Reference" to link the otu element (whose ephemeral id is foo) first to a cdao:TU instance (with id baz), which is subsequently linked to an entry in the Teleost taxonomy (whose record id is 1030219).
 
 
<xml>
  <otu id="foo">
    <dict xmlns:cdao="http://evolutionaryontology.org/cdao"
          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <any id="bar">
        <cdao:TU rdf:ID="baz">
          <!-- here is where we identify the relation and assign a value -->
          <!-- which is here a concept from another ontology -->
          <cdao:has_Taxonomy_Reference
              rdf:resource="http://purl.org/OBO/TTO:1030219"/>
        </cdao:TU>
      </any>
    </dict>
  </otu>
</xml>
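To illustrate how a consumer might read such an annotation back, here is a minimal Python sketch using the standard library's ElementTree. The embedded document is a copy of the otu example with the <code>rdf</code> namespace declared explicitly so that a namespace-aware parser accepts it; the helper name <code>taxonomy_references</code> is hypothetical, and real NeXML libraries may expose this information differently.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as used in the example; the prefixes are local to this script.
NS = {
    "cdao": "http://evolutionaryontology.org/cdao",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

OTU_XML = """
<otu id="foo">
  <dict xmlns:cdao="http://evolutionaryontology.org/cdao"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <any id="bar">
      <cdao:TU rdf:ID="baz">
        <cdao:has_Taxonomy_Reference
            rdf:resource="http://purl.org/OBO/TTO:1030219"/>
      </cdao:TU>
    </any>
  </dict>
</otu>
"""


def taxonomy_references(otu):
    """Return the rdf:resource URI of every has_Taxonomy_Reference under otu."""
    resource = "{%s}resource" % NS["rdf"]
    path = "dict/any/cdao:TU/cdao:has_Taxonomy_Reference"
    return [ref.get(resource) for ref in otu.findall(path, NS)]


otu = ET.fromstring(OTU_XML)
print(otu.get("id"), taxonomy_references(otu))
# prints: foo ['http://purl.org/OBO/TTO:1030219']
```

Because the relation is named by a CDAO term rather than by a free-form key, the same lookup should work for any resource that follows this convention.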
 
 
=== Attaching a concept to an element through a relation ===
 
 
Another use case along similar syntactical lines as the previous example would be to tie a node in a tree to an inferred gene function from the Gene Ontology. Here we use the CDAO construct "has_function" to specify the semantics of the reference to an external resource.
 
 
<xml>
<tree>
  <node id="foo">
    <dict xmlns:cdao="http://evolutionaryontology.org/cdao"
          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <any id="bar">
        <cdao:Node rdf:ID="baz">
          <!-- here is where we identify the relation and assign a value -->
          <!-- which is here a concept from another ontology -->
          <cdao:has_function rdf:resource="http://purl.org/OBO/GO:034"/>
        </cdao:Node>
      </any>
    </dict>
  </node>
</tree>
</xml>
 
 
=== Specimens within collections ===
 
 
Another common use case for external references is one where a NeXML otu element is to be defined as a specimen in a museum, i.e. we want to specify an identifiable collection and the number of the specimen within it. In this case we suggest using the TDWG Darwin Core syntax, which has constructs for institutions and catalog numbers. A search of the TDWG ontology activities turned up the [http://wiki.tdwg.org/twiki/bin/view/TAG/CoreOntology TDWG core ontology]. However, we are unclear about the status and direction of the CoreOntology (and how to use it), so we're leaving that out for now, choosing to mix the DarwinCore syntax with semantics.
 
 
<xml>
<otu id="foo">
  <dict xmlns:dwc="http://rs.tdwg.org/dwc/dwcore"
        xmlns:cdao="http://evolutionaryontology.org/cdao"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <any id="watever">
      <cdao:TU rdf:ID="baz">
        <cdao:has_Specimen_Reference rdf:parseType="Literal">
          <dwc:InstitutionCode rdf:datatype="xsd:anyURI">http://purl.org/obo/COLLECTION:0000403</dwc:InstitutionCode>
          <dwc:CatalogNumber rdf:datatype="xsd:string">207388</dwc:CatalogNumber>
        </cdao:has_Specimen_Reference>
      </cdao:TU>
    </any>
  </dict>
</otu>
</xml>
 
 
=== Literature References ===
 
 
Use cases exist where the user would want to attach literature citation records to a phylogenetic object, for example if the user wants to track the provenance of data in a meta-analysis. The syntax of the record itself could simply be the widely used Dublin Core standard. Yes, this does mean mixing syntax and semantics to some extent, but we concluded that it's a reasonable solution because DC at least implies some semantics (albeit overloaded in some cases, regrettably), and it's syntactically concise.
 
 
The Dublin Core [http://dublincore.org/documents/dc-citation-guidelines/ guidelines for DC citations] recommend providing authors (creators), title and publisher, along with a string giving the bibliographic citation. However, ''the Dublin Core does not define what a reference is''. Therefore, minimally, we need a term that describes the concept of a reference.
 
 
==== Example 1: associate a reference with a tree (or other) element ====
 
 
In this example, we are just associating a reference with a tree, by placing it in the tree element.  In XML using nexml conventions, this is what a literature reference would look like:
 
 
<xml>
<tree id="foo">
  <dict xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:cdao="http://evolutionaryontology.org/cdao"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <any id="foo235">
      <cdao:Tree rdf:ID="bar">
        <cdao:has_Reference rdf:parseType="Resource">
          <dc:creator>Hill, R. V.</dc:creator>
          <dc:title>Integration of Morphological Data Sets for Phylogenetic Analysis of Amniota:
                    The Importance of Integumentary Characters and Increased Taxonomic Sampling</dc:title>
          <dcterms:bibliographicCitation>Systematic Biology 54(4):530-547, 2005</dcterms:bibliographicCitation>
        </cdao:has_Reference>
      </cdao:Tree>
    </any>
  </dict>
</tree>
</xml>
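A data provider would normally generate this annotation rather than type it by hand. The following Python sketch shows one way to do so with the standard library's ElementTree. The helper names <code>q</code> and <code>attach_reference</code> are hypothetical, the namespace URIs are written out as assumptions (Dublin Core with its trailing slash, the RDF syntax namespace, and the CDAO URI used in the examples), and the title is shortened for brevity.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
CDAO = "http://evolutionaryontology.org/cdao"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Ask the serializer to use the same prefixes as the examples on this page.
for prefix, uri in (("dc", DC), ("dcterms", DCTERMS), ("cdao", CDAO), ("rdf", RDF)):
    ET.register_namespace(prefix, uri)


def q(namespace, local_name):
    """Build the '{namespace}local' form ElementTree uses for qualified names."""
    return "{%s}%s" % (namespace, local_name)


def attach_reference(tree, any_id, instance_id, creators, title, citation):
    """Append a dict/any annotation holding a Dublin Core citation to tree."""
    any_elt = ET.SubElement(ET.SubElement(tree, "dict"), "any", {"id": any_id})
    instance = ET.SubElement(any_elt, q(CDAO, "Tree"), {q(RDF, "ID"): instance_id})
    reference = ET.SubElement(
        instance, q(CDAO, "has_Reference"), {q(RDF, "parseType"): "Resource"}
    )
    for creator in creators:
        ET.SubElement(reference, q(DC, "creator")).text = creator
    ET.SubElement(reference, q(DC, "title")).text = title
    ET.SubElement(reference, q(DCTERMS, "bibliographicCitation")).text = citation
    return reference


tree = ET.Element("tree", {"id": "foo"})
attach_reference(
    tree,
    any_id="foo235",
    instance_id="bar",
    creators=["Hill, R. V."],
    title="Integration of Morphological Data Sets for Phylogenetic Analysis of Amniota",
    citation="Systematic Biology 54(4):530-547, 2005",
)
print(ET.tostring(tree, encoding="unicode"))
```

ElementTree writes the namespace declarations on the outermost serialized element (here <code>tree</code>) rather than on <code>dict</code>; the resulting document is equivalent for a namespace-aware parser.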
 
 
==== Example 2: associate a reference with a record ====
 
 
 
<xml>
<nexml> <!-- imagine that this is the top level of the nexml doc -->
<dict xmlns:obi="http://purl.obofoundry.org/obo/obi.owl" xmlns:cdao="http://evolutionaryontology.org/cdao/cdao.owl"
      xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <any id="foo235">
    <!-- IAO_0000100 is a "data set" concept -->
    <obi:IAO_0000100 rdf:ID="foo234">
      <cdao:has_Reference rdf:parseType="Resource">
        <dc:creator>Hill, R. V.</dc:creator>
        <dc:title>Integration of Morphological Data Sets for Phylogenetic Analysis of Amniota:
                  The Importance of Integumentary Characters and Increased Taxonomic Sampling</dc:title>
        <dcterms:bibliographicCitation>Systematic Biology 54(4):530-547, 2005</dcterms:bibliographicCitation>
      </cdao:has_Reference>
    </obi:IAO_0000100>
  </any>
</dict>
</nexml>
</xml>
 
 
==== Example 3: specifying that a reference represents "supporting evidence" for a nexml element ====
 
 
Work in progress
 
 
<!--
 
Simply deciding on a syntax to represent citation records does not mean that the semantics of the citation attachment are clear.  The record might be the citation in which the object was first published, or it might be a paper with supporting evidence, a review, etc.
 
 
These different uses would be defined in CDAO. Here's what we suggest: we assign an RDF term label (making it human readable) and an RDF term identifier (about) to the dict element (making it globally unique) from CDAO or some other ontology; and the content of the dict element comes from a namespace that also is clearly identified.
 
 
In  this example we define two references to published  articles, then later, in the tree element, we associate these references with the tree as "supporting evidence":
 
 
<xml>
 
  <dict rdf:label="CDAO Reference" rdf:about="http://purl.org/CDAO/CDAO:723" xsi:type="nex:DC">
 
    <any id="foo234">
 
      <dc:creator>Gauthier, J.</dc:creator>
 
      <dc:creator>Kluge, A.G.</dc:creator>
 
      <dc:creator>Rowe, T.</dc:creator>
 
      <dc:title>Amniote phylogeny and the importance of fossils</dc:title>
 
      <dcterms:bibliographicCitation>Cladistics, 4:105-209, 1988</dcterms:bibliographicCitation>
 
    </any>
 
    <any id="foo235">
 
      <dc:creator>Hill, R. V.</dc:creator>
 
      <dc:title>Integration of Morphological Data Sets for Phylogenetic Analysis of Amniota: The Importance of Integumentary Characters and Increased Taxonomic Sampling</dc:title>
 
      <dcterms:bibliographicCitation>Systematic Biology 54(4):530-547, 2005</dcterms:bibliographicCitation>
 
    </any>
 
  </dict>
 
 
  . . .
 
 
<tree>
 
  <dict rdf:label="supporting evidence" rdf:about="http://purl.org/CDAO/CDAO:334" xsi:type="nex:DC">
 
    <valueref any="foo234"/>
 
    <valueref any="foo235"/>
 
  </dict>
 
</tree>
 
</xml>
 
-->
 
 
=== OBO  phenotype ===
 
 
=== Segments of character data ===
 
 
=== the next case ===
 
 
=== Specimens within collections ===
 
=== segments of character data in a composite OTU ===
 
  
 
[[Category:DB Interop Hackathon]]
[[Category:Interoperability]]
[[Category:Nexml]]
[[Category:CDAO]]
[[Category:Data Resources]]

Latest revision as of 15:42, 12 March 2009

Summary

This page discusses potential target implementations that demonstrate the goals of the Evolutionary Database Interop Hackathon, to be held at NESCent in March 2009. The overall goal is to expose evolutionary data resources on the web in a machine-readable architecture so that they can be integrated into complex workflows and mash-ups. To this end, we suggest that these resources implement the stack produced by the evolutionary informatics working group:

  • Syntax - the NeXML data exchange standard
  • Interface/transport - the PhyloWS API for web services
  • Semantics - the CDAO character data analysis ontology

By combining these three components into a single, integrated stack, online data resources will produce output that is easy to parse and validate, whose semantics are well-defined, and whose interface is uniform across data resources.

Current implementations

Some of the online data resources the hackathon seeks to invite for an implementation drive are listed in the table below. The semantics of the output produced by these resources is diverse, including species trees, taxonomies, gene trees, character state matrices and alignments. Also, the programming languages used to implement these services are diverse, including Java, PHP, Python and Perl. This situation goes a long way to explain why standard industry practices (write a WSDL, generate client and server bindings, implement service) have not seen wide adoption in the evolutionary informatics community: different resources have, semantically and syntactically, different inputs and outputs, and are implemented in languages whose support for web services (especially WS-*) is sometimes incomplete.

<table>
<tr><th>Resource</th><th>Exportable objects</th><th>Implementation language</th></tr>
<tr><td>GMOD</td><td>ConceptGlossary#Natural Variation, ConceptGlossary#Polymorphism, Syntenic regions</td><td>Java, Perl, C, ActionScript</td></tr>
<tr><td>Tree of Life</td><td>ConceptGlossary#Species_Tree</td><td>Java</td></tr>
<tr><td>pPOD</td><td>??</td><td>Java</td></tr>
<tr><td>PhyTome</td><td>ConceptGlossary#Family_alignment, ConceptGlossary#Phylogenetic_Tree</td><td>PHP</td></tr>
<tr><td>uBio</td><td>ConceptGlossary#Taxon, ConceptGlossary#Taxonomic_Rank, ConceptGlossary#Organismal_Taxonomy</td><td>PHP</td></tr>
<tr><td>TimeTree</td><td>ConceptGlossary#Species_Tree</td><td>PHP</td></tr>
<tr><td>PhyloFacts</td><td>ConceptGlossary#Transition_Model, ConceptGlossary#Phylogenetic_Tree, ConceptGlossary#Family_alignment</td><td>PHP</td></tr>
<tr><td>MorphBank</td><td>ConceptGlossary#Character-State_Data_Matrix</td><td>PHP, Java, JavaScript</td></tr>
<tr><td>MorphoBank</td><td>ConceptGlossary#Character-State_Data_Matrix</td><td>PHP, Flash/ActionScript</td></tr>
<tr><td>PhylomeDB</td><td>ConceptGlossary#Phylogenetic_Tree, ConceptGlossary#Family_alignment</td><td>JavaScript, Python</td></tr>
<tr><td>PhyloTA</td><td>ConceptGlossary#Phylogenetic_Tree, ConceptGlossary#Family_alignment</td><td>Perl</td></tr>
<tr><td>TreeFam</td><td>ConceptGlossary#Phylogenetic_Tree, ConceptGlossary#Family_alignment</td><td>Perl</td></tr>
<tr><td>Pandit</td><td>ConceptGlossary#Phylogenetic_Tree, ConceptGlossary#Family_alignment</td><td>Perl</td></tr>
<tr><td>MicrobesOnline</td><td>ConceptGlossary#Phylogenetic_Tree, ConceptGlossary#Family_alignment</td><td>CGI (=Perl?)</td></tr>
<tr><td>HoVerGen</td><td>ConceptGlossary#Phylogenetic_Tree, ConceptGlossary#Family_alignment</td><td>CGI (=Perl?), PHP</td></tr>
<tr><td>PaleoDB</td><td>ConceptGlossary#Taxon, ConceptGlossary#Taxonomic_Rank, ConceptGlossary#Organismal_Taxonomy</td><td>CGI (=Perl)</td></tr>
</table>