Database Interop Hackathon/Metadata Support

From Evolutionary Informatics Working Group


Semantics, data and metadata

NeXML focuses on the representation syntax for basic types of information relevant to evolutionary analysis, such as character data and inferred trees. The NeXML schema provides clearly defined ways of representing such data. Furthermore, the schema ties the semantics of data to CDAO via references to ontology terms (using SAWSDL syntax).

However, many use cases exist in which users want to attach other information to the representation of basic types of data, e.g.,

  • linking an OTU, or a node, to a taxonomic concept
  • linking a study to a literature reference
  • linking a data element to an ontology concept
  • linking character data to phenotype annotations

One way to deal with this would be to expand the NeXML schema to cover anything that a user might wish to say, but that would lead to a bloated schema redundant with other artifacts. Another approach would be to allow free-form annotations of anything. This would allow a wide range of expressions but would lead to a soup of annotations lacking clearly defined semantics.

The solution is to allow metadata annotations, but to provide for ways to make the semantics clear, using ontologies. During discussions at NESCent at the pre-meeting we fleshed out some examples of how this would work, although we did not settle on a final representation scheme.

Science, linked data and the semantic web

Strategy and desired outcomes

The examples on this page are being reworked as a result of developments after the hackathon. Hopefully we will come out with some more formal statement on this soon. Our plan is to:

  1. revise the examples to be consistent with emerging nexml practice (use of "meta" tag)
  2. link each example to a test file that
    • parses as valid nexml
    • generates the correct RDF triples
    • has a documented relation to use-cases and external vocabularies
  3. issue a request for comment and address any concerns

Some kinds of metadata that we want to support:

  • add a literature reference to a study (or to a tree, or a taxon)
  • link an OTU to an externally defined taxon
  • attach one or more sequences to a taxon
  • attach one or more specimens to a taxon
  • attach a text or image-referencing annotation to just about anything (including branches)
  • attach a reference to an ontology to a character or character state

The basic scheme of nexml plus RDFa

Triples, RDF and RDFa

Let's suppose we want to say that a particular person is responsible for a document. "John authored this document" is a way that we might say it. Inside the document, we might just say "John authored this", or "author: John", because it would be implicit that a statement within the document refers to the root element of the document if it does not seem to refer to anything else.

An English sentence in subject-predicate-object form would be "this_document has_author John". We might expect a formal language to say something like { subject=this_document, predicate=has_author, object=John }.

RDF subject-predicate-object "triples" can be used to make a wide range of statements (some people believe that anything can be said with triples). (We need a link here to the W3C standard.)
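As a rough illustration of the subject-predicate-object idea, a triple can be modelled as a three-field record. This is only a sketch in Python, not an RDF library; the "has_title" predicate and its value are made up for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    object: str


# The statement from the text: "this_document has_author John".
statement = Triple("this_document", "has_author", "John")

# A collection of triples forms a small graph. The second triple is
# hypothetical and only there to give the query something to skip.
graph = [statement, Triple("this_document", "has_title", "Metadata Support")]

# Ask the graph who authored the document.
authors = [t.object for t in graph if t.predicate == "has_author"]
print(authors)
```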

RDFa is a way of embedding RDF statements into XHTML documents. The most common use of RDFa is to embed semantics in HTML documents to create a "semantic web". The semantics are hidden from view in a typical web browser, but can be extracted with semantics-aware tools. The EvoInfo Stack developers have settled on RDFa as the way to express metadata in nexml (which, like XHTML, is an XML format). RDFa identifies the subject of a triple by the value assigned to the "about" attribute. The predicate is identified by an attribute such as "rel" or "property".

RDFa (there is a delightful primer on youtube) makes a syntactic link between xhtml docs and RDF:

  • RDFa is a way to specify triples using xhtml syntax
  • the subject is identified by the "about" attribute of the nearest ancestor; if there is no "about", the subject is the root of the document
  • there is some flexibility for different types of predicates and objects
    • the predicate is a "rel" or "property" attribute
    • the object is specified as an attribute or as a tag-enclosed literal value
    • a "rel" predicate goes with an href object
    • a "property" predicate goes with a literal object
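The rules in the list above can be sketched as a small triple extractor. This is a simplified illustration using Python's standard ElementTree module, not a conforming RDFa parser: it ignores namespaces and CURIE expansion, and the "ex:" predicates and example.org URL are invented for the example.

```python
import xml.etree.ElementTree as ET


def extract_triples(element, subject=""):
    """Yield (subject, predicate, object) tuples from RDFa-style attributes.

    Simplified on purpose: the subject is the nearest enclosing "about"
    value, and "" stands for the document itself when none is found.
    """
    subject = element.get("about", subject)
    if "rel" in element.attrib and "href" in element.attrib:
        # A "rel" predicate pairs with an href (resource) object.
        yield (subject, element.get("rel"), element.get("href"))
    elif "property" in element.attrib:
        # A "property" predicate pairs with a literal: the "content"
        # attribute if present, otherwise the enclosed text.
        literal = element.get("content", (element.text or "").strip())
        yield (subject, element.get("property"), literal)
    for child in element:
        yield from extract_triples(child, subject)


# Hypothetical XHTML-like fragment; names and URL are placeholders.
SAMPLE = """
<div about="#doc1">
  <span property="ex:author">John</span>
  <a rel="ex:source" href="http://example.org/original">original</a>
</div>
"""

triples = list(extract_triples(ET.fromstring(SAMPLE.strip())))
for triple in triples:
    print(triple)
```

Passing the current subject down the recursion keeps the sketch short; a real parser would also have to handle blank nodes, @typeof, and prefix resolution.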

Embedding RDFa in nexml

When RDFa is used in HTML documents, it is embedded in various other tags using attributes, or sometimes it is used with "" elements. The manner in which to use RDFa and to integrate it into nexml, an XML document type, has been a subject of discussion. Some consensus is emerging now:

  • to represent metadata in nexml, use the "meta" tag
  • the meta tag is in current nexml schema, but is not fully implemented
  • the meta tag is used (and so far only used) for metadata to be serialized into RDF

In the Java API for nexml (for those who are interested), "annotation" is the object (internal representation) that is written out as a meta statement. Trees, for instance, can have annotations, and each annotation is written out as a meta statement.

In order to allow more restrictive validation of files, nexml has two different xsi:types depending on the object of an RDFa triple:

  • ResourceMeta
  • LiteralMeta
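To make the distinction concrete, here is a hedged Python sketch that builds a meta element of either type. The attribute patterns (rel plus href for ResourceMeta, property plus datatype plus text for LiteralMeta) follow the tree example further down this page; the helper name make_meta, the "ex:" predicates, and the example.org URL are assumptions made for illustration, and this is not the nexml Java API.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema instance namespace, used for the xsi:type attribute.
XSI = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("xsi", XSI)


def make_meta(predicate, value, is_resource):
    """Build a meta element whose xsi:type matches the kind of object."""
    meta = ET.Element("meta")
    if is_resource:
        # The object is a resource: rel + href, typed as ResourceMeta.
        meta.set("rel", predicate)
        meta.set("href", value)
        meta.set(f"{{{XSI}}}type", "nex:ResourceMeta")
    else:
        # The object is a literal: property + text, typed as LiteralMeta.
        meta.set("property", predicate)
        meta.set("datatype", "xsd:string")
        meta.set(f"{{{XSI}}}type", "nex:LiteralMeta")
        meta.text = value
    return meta


resource_meta = make_meta("ex:hasReference", "http://example.org/paper", True)
literal_meta = make_meta("ex:hasNote", "My literal value", False)
print(ET.tostring(resource_meta, encoding="unicode"))
print(ET.tostring(literal_meta, encoding="unicode"))
```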


With apologies for a delay of several months, I'm finally getting round to updating this. Please make revisions and add comments, or communicate concerns to me. arlin 10:06, 26 October 2009 (EDT)

The general template procedure is to

  1. Instantiate the concept forming the subject of the association, i.e., the concept corresponding to the element to which an annotation or resource is to be attached.
  2. Nested within that instance, specify the relation through which the annotation or resource is to be attached, i.e., the semantics of the attachment.
  3. As the target (object) of the relation, specify the annotation or resource you wish to attach, in one of four ways:
    • as a resource (using the rdf:resource attribute)
    • as a plain string as element value (specifying the attribute rdf:datatype="xsd:string")
    • as a set of RDF statements as the element value (specifying the attribute rdf:parseType="Resource")
    • as a snippet of XML as the element value (specifying the attribute rdf:parseType="Literal")

Specifying that an OTU belongs to a taxon

unfinished, in progress

Here we use the CDAO concept "cdao:has_Taxonomy_Reference" to link the otu element (whose ephemeral id is foo) first to a cdao:TU instance (with id baz), which is subsequently linked to an entry in the Teleost taxonomy (whose record id is 1030219).


<nexml xmlns:cdao="" >
    <otu id="foo">
         <meta about="#foo" property="cdao:has_Taxonomy_Reference" rdf:resource=""/>
    </otu>
</nexml>


Here we use the TDWG taxonomy concepts to do this in a more complex way.

Attaching some stuff to a tree

This is just an illustration of how it may be achievable using RDFa. For discussion only.

It may be possible to use RDFa notation to add annotations. This should be absolutely standard and available to any RDFa parser, yet simple to integrate into a non-RDF-aware parser.

Simplest case: @property gives the predicate of an assertion, @about is the subject of the assertion, and the content of the annotation is the literal object value.

<xml> <tree id="foo1" about="foo1" xmlns:bs="">

 <meta property="bs:hasLiteralTerm" xsi:type="nex:LiteralMeta" datatype="xsd:string">My Literal value</meta>
 <meta rel="bs:hasReference" href="" xsi:type="nex:ResourceMeta"/>
 <meta property="bs:hasLiteralXml" xsi:type="nex:LiteralMeta" datatype="rdf:Literal"><phen:phoo/></meta>
 <meta rel="bs:hasName" xsi:type="nex:ResourceMeta">
   <meta property="bs:firstName" xsi:type="nex:LiteralMeta" datatype="xsd:string">Rutger</meta>
   <meta property="bs:lastName" xsi:type="nex:LiteralMeta" datatype="xsd:string">Vos</meta>
 </meta>

</tree> </xml>

If we like, we can also make assertions where the object of the assertion is a web resource (any URI).

<xml> <tree id="foo1" about="foo2" xmlns:bs="">

   <annotation rel="bs:myotherterm" href=""/>

</tree> </xml>

A fuller example where we say that this OTU is an instance of a TDWG TaxonConcept with a name that is identified by an IPNI LSID. We also make some assertions about the name.


 <otu id="foo1" about="foo2" typeof="tc:TaxonConcept" xmlns:tc="" xmlns:tn="">
    <annotation rel="tc:hasName">
       <annotation typeof="tn:TaxonName" about="">
          <annotation property="tn:genusPart">Rhododendron</annotation>
          <annotation property="tn:specificEpithet">ponticum</annotation>
          <annotation property="tn:authorship">L.</annotation>
       </annotation>
    </annotation>
 </otu>


What can't we say with this annotation method?

Note that properties will be joined to the nearest parent element higher up the DOM that has an @about or @typeof, so we could, for example, make assertions about a tree from inside its nodes. It is pretty flexible.
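The ancestor lookup described in this note might be sketched as follows. It is an illustration only, using Python's ElementTree with a parent map; the function name subject_of, the tree/node fragment, and the "ex:note" predicate are hypothetical, and blank-node handling for @typeof is left out.

```python
import xml.etree.ElementTree as ET


def subject_of(element, parents):
    """Return the nearest ancestor carrying @about or @typeof, or None."""
    node = parents.get(element)
    while node is not None:
        if "about" in node.attrib or "typeof" in node.attrib:
            return node
        node = parents.get(node)
    return None


# Hypothetical fragment: an annotation placed inside a node of a tree.
SAMPLE = """
<tree id="t1" about="#t1">
  <node id="n1">
    <meta property="ex:note">rooted with an outgroup</meta>
  </node>
</tree>
"""

root = ET.fromstring(SAMPLE.strip())
# ElementTree has no parent pointers, so build a child -> parent map.
parents = {child: parent for parent in root.iter() for child in parent}

meta = root.find("./node/meta")
owner = subject_of(meta, parents)
print(owner.tag, owner.get("about"))
```

Because the node element carries neither @about nor @typeof, the annotation inside it attaches to the enclosing tree.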

Specimens within collections

Another common use case for external references is one where a NeXML otu element is to be defined as a specimen in a museum, i.e. we want to specify an identifiable collection, and the number of the specimen within it. In this case we suggest using the TDWG Darwin Core syntax, which has constructs for institutions and catalog numbers. A query of the TDWG ontological activities turned up the TDWG core ontology. However, we are unclear about the status and direction of the CoreOntology (and how to use it), so we're leaving that out for now, choosing to mix the DarwinCore syntax with semantics.

<xml> <otu id="foo">

 <dict xmlns:dwc="" id="dict1">
   <any id="watever">
     <cdao:TU rdf:id="baz">
       <cdao:has_Specimen_Reference rdf:parseType="Literal">
         <dwc:InstitutionCode rdf:datatype="xsd:string">COLLECTION:0000403</dwc:InstitutionCode>
         <dwc:CatalogNumber rdf:datatype="xsd:string">207388</dwc:CatalogNumber>
       </cdao:has_Specimen_Reference>
     </cdao:TU>
   </any>
 </dict>

</otu> </xml>

Literature References

Use cases exist where the user would want to attach literature citation records to a phylogenetic object, for example to track the provenance of data in a meta-analysis. The syntax of the record itself could simply be the widely used Dublin Core standard. Yes, this does mean mixing syntax and semantics to some extent, but we concluded that it's a reasonable solution because DC at least implies some semantics (albeit overloaded in some cases, regrettably), and it's syntactically concise.

The Dublin Core [guidelines for dc citations] recommend providing authors (creators), title and publisher, along with a string giving the bibliographic citation. However, Dublin Core does not define what a reference is. Therefore, minimally, we need a term that describes the concept of a reference.
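As a sketch of how the recommended fields could be assembled and checked, the Python below builds a small container of Dublin Core elements and reports which recommended fields are absent. The two namespace URIs are the standard Dublin Core ones, which is an assumption about what this page intends; the container element name "reference" and the function name are hypothetical. The record values are the Hill citation used in the examples on this page.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core namespaces (assumed; not stated on this page).
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Recommended field -> namespace, following the fields named above.
FIELDS = {
    "creator": DC,
    "title": DC,
    "publisher": DC,
    "bibliographicCitation": DCTERMS,
}


def build_reference(record):
    """Return (element, missing): DC child elements and the absent fields."""
    container = ET.Element("reference")  # hypothetical container name
    missing = []
    for name, namespace in FIELDS.items():
        value = record.get(name)
        if value:
            ET.SubElement(container, f"{{{namespace}}}{name}").text = value
        else:
            missing.append(name)
    return container, missing


record = {
    "creator": "Hill, R. V.",
    "title": ("Integration of Morphological Data Sets for Phylogenetic "
              "Analysis of Amniota: The Importance of Integumentary "
              "Characters and Increased Taxonomic Sampling"),
    "bibliographicCitation": "Systematic Biology 54(4):530-547, 2005",
}

reference, missing = build_reference(record)
print(missing)
```

The container stands in for whatever term is eventually chosen to describe the concept of a reference.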

Example 1: associate a reference with a tree (or other) element

In this example, we are just associating a reference with a tree, by placing it in the tree element. In XML using NeXML conventions, this is what a literature reference would look like:


<nexml xmlns:dc="">
  <tree id="foo">
     <meta about="#foo" property="cdao:has_Reference" rdf:parseType="Resource">
        <dc:creator>Hill, R. V.</dc:creator>
        <dc:title>Integration of Morphological Data Sets for Phylogenetic Analysis of Amniota: The Importance of
                 Integumentary Characters and Increased Taxonomic Sampling</dc:title>
        <dcterms:bibliographicCitation>Systematic Biology 54(4):530-547, 2005</dcterms:bibliographicCitation>
     </meta>
  </tree>
</nexml>


Example 2: associate a reference with a record

<xml> <nexml> <dict xmlns:obi="">

 <any id="foo235">
   <obi:IAO_0000100 rdf:id="foo234">
     <cdao:has_Reference rdf:parseType="Resource">
       <dc:creator>Hill, R. V.</dc:creator>
       <dc:title>Integration of Morphological Data Sets for Phylogenetic Analysis of Amniota:
                     The Importance of Integumentary Characters and Increased Taxonomic Sampling</dc:title>
       <dcterms:bibliographicCitation>Systematic Biology 54(4):530-547, 2005</dcterms:bibliographicCitation>
     </cdao:has_Reference>
   </obi:IAO_0000100>
 </any>

</dict> </nexml> </xml>

Example 3: specifying that a reference represents "supporting evidence" for a NeXML element

Work in progress

Associating an OBO phenotype with a character state

<xml> <state id="x88913" label="present" symbol="1">

 <dict xmlns:cdao=""
       xmlns:phen="" id="dict1">
   <any id="foo345">
     <cdao:Categorical rdf:id="foo456">
       <cdao:has_ExternalReference rdf:parseType="Literal">
             <phen:typeref about="TAO:0000203"/>
             <phen:typeref about="PATO:0000467"/>
       </cdao:has_ExternalReference>
     </cdao:Categorical>
   </any>
 </dict>

</state> </xml>

Segments of character data

Case of composite OTUs: create a container of characters and refer to it.

Resources and further reading