Semantic API for CDAO Subgroup

From Evolutionary Informatics Working Group

This subgroup aims to get NeXML and CDAO data into an RDF-based triple store.

Participants

Goal & Objectives

The main goal for this group during the hackathon is to create a basic API for CDAO using a semantic framework. We will focus on phylogenetic trees initially and build a system that can be extended to other types of evolutionary data.

The specific objectives are:

  • Create a NeXML to CDAO converter script using XSLT
  • Create a test script to validate the CDAO file
  • Implement the XSLT script in a data source (Brandon's project) so that it can be accessed through the web

Google Code Repository

admin: Mark

Products

Cdao-group-flow.png

Files

Project source: http://code.google.com/p/dbhack1/ or http://dbhack1.googlecode.com/svn/trunk/cdao-api

  • Get your copy of the command-line NeXML validator: nexvl.pl (they're going fast).
  • Get the Java example code for loading NeXML direct to a triple store (does not use the preprocessing script - porting that to Java is an exercise left to the reader): Loader.java
  • Get the Java example code for comparing model generated from NeXML with existing model for testing (this was a dumb approach as the blank node ids prevent the comparison working but code gives useful examples of how to handle the models and conversions): NeXML2OwlTest.java

There are a number of sample NeXML files here, that "mostly validate" except for recent dict element changes.

https://www.nescent.org/wg_evoinfo/Phenoscape

the schema: http://nexml-dev.nescent.org/1.0/nexml.xsd (Use dev server for now. --Rvos@interchange.ubc.ca 01:24, 11 March 2009 (EDT))

Chat About OWL

The outcome of this work is two ways of converting NeXML into triples (RDF graphs). One of the conversions binds the triples to the CDAO ontology, i.e. it creates CDAO objects from the NeXML. For example, CDAO has the notion of an Edge and NeXML has the notion of an edge, so instances of CDAO Edge are created to represent the edges in the NeXML file being parsed.
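
As a rough illustration of the CDAO-bound conversion described above, the following Python sketch turns each edge element of a NeXML-like fragment into triples. It is not the group's XSLT stylesheet. The sample fragment, the ex: property names and the function names are assumptions made for illustration; only the Edge class comes from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical NeXML-like fragment; real files carry a default namespace
# and many more elements.
SAMPLE = """
<tree id="t1">
  <node id="n1"/>
  <node id="n2"/>
  <node id="n3"/>
  <edge id="e1" source="n1" target="n2"/>
  <edge id="e2" source="n1" target="n3"/>
</tree>
"""


def local_name(tag):
    """Strip an XML namespace of the form {uri}name, if present."""
    return tag.rsplit("}", 1)[-1]


def edges_to_triples(xml_text):
    """Return (subject, predicate, object) triples for every edge element."""
    triples = []
    for elem in ET.fromstring(xml_text.strip()).iter():
        if local_name(elem.tag) != "edge":
            continue
        edge = "#" + elem.get("id")
        # Each NeXML edge becomes an instance of the CDAO Edge class.
        triples.append((edge, "rdf:type", "cdao:Edge"))
        # ex:source / ex:target are placeholder property names, not CDAO terms.
        triples.append((edge, "ex:source", "#" + elem.get("source")))
        triples.append((edge, "ex:target", "#" + elem.get("target")))
    return triples


if __name__ == "__main__":
    for triple in edges_to_triples(SAMPLE):
        print(triple)
```

Matching on the local element name means the sketch behaves the same whether or not the input declares a default namespace.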

The other conversion is experimental. It translates the NeXML file in terms of the TDWG TaxonConcept vocabulary. Trees, tree nodes and OTUs are treated as OWL classes subclassing TaxonConcept or each other. This is a very different approach, but it may be useful for comparing trees in NeXML files with existing, synonymised taxonomies. It is slightly easier to make inferences across this subclass hierarchy than across the object graph created by the CDAO conversion.

Every flavour asserted Every flavour inferred


Documentation

How to Transform a NeXML File into CDAO

  • To transform using xsltproc use the following command: $ xsltproc nexml2cdao.xsl nexml-instance-document.xml > cdao-instance-document.rdf
  • Common Problems
    • Default namespace not set in the NeXML document
    • Sequences not transformed
      • Use the expand_cells.pl script to convert sequences into cells before processing. Note: a sequence cannot be longer than the number of characters defined in the <format> block; sequences longer than the number of declared characters will be truncated to that number. The expand_cells.pl script can be run as follows:
        $ ./expand_cells.pl < nexml_with_sequences.xml > nexml_with_cells.xml
        Alternatively, as a filter:
        $ cat nexml_with_sequences.xml | ./expand_cells.pl > nexml_with_cells.xml
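
The truncation rule described above can be pictured with a small Python sketch. This is not the expand_cells.pl script; the function name, its arguments and the example character ids are hypothetical, and the sketch only shows the idea of pairing each state symbol with a declared character and dropping any surplus.

```python
def expand_sequence(seq, char_ids, tokenized=False):
    """Pair each state symbol in a sequence with a declared character id.

    Symbols beyond the number of declared characters are dropped, which
    mirrors the truncation rule described in the text (an assumption about
    how the real script behaves internally).
    """
    if tokenized:
        # Space-separated tokens, e.g. "0 1 2".
        symbols = seq.split()
    else:
        # One symbol per character, ignoring whitespace.
        symbols = [c for c in seq if not c.isspace()]
    # zip() stops at the shorter input, so surplus symbols are discarded.
    return list(zip(char_ids, symbols))


if __name__ == "__main__":
    declared = ["c1", "c2", "c3", "c4"]  # hypothetical character ids
    for char_id, state in expand_sequence("ACGTAC", declared):
        print(char_id, state)
```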

How to Transform a batch of NeXML files into CDAO

The do_transforms.pl script can be used to translate a batch of NeXML documents into CDAO.

The do_transforms.pl script expects the following arguments: --input-dir, --output-dir, --xslt-path. In addition the following optional arguments may be supplied: --xslt-processor and --xslt-script.

  • --input-dir specifies the path to a directory where the NeXML documents are stored.
  • --output-dir specifies the path to a directory where the CDAO documents will be saved.
  • --xslt-path specifies the path to a directory where the stylesheet is located.
  • --xslt-processor allows the user to specify an alternate XSLT processor. The default is xsltproc.
  • --xslt-script allows the user to specify an alternate stylesheet. The default is nexml2cdao.xsl.

The script applies the stylesheet to each of the NeXML files in the input directory and saves the results as CDAO files in the output directory, using the following naming convention: if the NeXML file is named nexml-data.xml, then the transformed file will be named nexml-data.xml.rdf.
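
For readers who want to see the batch logic spelled out, here is a minimal Python sketch of the behaviour described above. It is not a port of do_transforms.pl: the function names are hypothetical, the parameters simply mirror the documented options, and whether the real script filters input files by extension is not stated here, so the sketch takes every regular file in the input directory.

```python
import subprocess
from pathlib import Path


def plan_transforms(input_dir, output_dir, xslt_path,
                    xslt_processor="xsltproc", xslt_script="nexml2cdao.xsl"):
    """Return (command, output_file) pairs, one per file in input_dir."""
    stylesheet = Path(xslt_path) / xslt_script
    plan = []
    for src in sorted(Path(input_dir).iterdir()):
        if not src.is_file():
            continue
        # Naming convention from the text: nexml-data.xml -> nexml-data.xml.rdf
        out = Path(output_dir) / (src.name + ".rdf")
        plan.append(([xslt_processor, str(stylesheet), str(src)], out))
    return plan


def run_transforms(plan):
    """Run each planned command, writing its standard output to the target."""
    for command, out in plan:
        with open(out, "wb") as handle:
            subprocess.run(command, stdout=handle, check=True)
```

Separating planning from execution makes it possible to inspect the commands and output names before any XSLT processor is invoked.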

FAQ

What is the difference between reasoning and querying?

A simple query only matches triple patterns syntactically; it does not compute the consequences of asserted relations such as class equivalence. For example, say that according to some ontology the class Foo is asserted to be equivalent to some other class Bar. When querying for individuals of type Foo in data containing some Foo individuals and some Bar individuals, the result set would contain only the Foo individuals. However, when a reasoner is used to interpret the same query over the same hypothetical data set, the result set would contain both the Foo and the Bar individuals.
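
The Foo/Bar example can be made concrete with a toy Python sketch. It uses a plain set of tuples rather than a real triple store or reasoner, and every name in it (the ex: identifiers and the function names) is hypothetical; it only illustrates why the two result sets differ.

```python
# Toy data set: two Foo individuals, one Bar individual, and an
# asserted equivalence between the two classes.
TRIPLES = {
    ("ex:a", "rdf:type", "ex:Foo"),
    ("ex:b", "rdf:type", "ex:Foo"),
    ("ex:c", "rdf:type", "ex:Bar"),
    ("ex:Foo", "owl:equivalentClass", "ex:Bar"),
}


def query_instances(triples, cls):
    """Plain query: match (?x rdf:type cls) exactly as asserted."""
    return {s for (s, p, o) in triples if p == "rdf:type" and o == cls}


def equivalent_classes(triples, cls):
    """Collect cls plus every class reachable via owl:equivalentClass."""
    found, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if p == "owl:equivalentClass" and current in (s, o):
                other = o if s == current else s
                if other not in found:
                    found.add(other)
                    todo.append(other)
    return found


def reasoned_instances(triples, cls):
    """Query with a minimal 'reasoner' that expands equivalent classes."""
    result = set()
    for c in equivalent_classes(triples, cls):
        result |= query_instances(triples, c)
    return result


if __name__ == "__main__":
    print(sorted(query_instances(TRIPLES, "ex:Foo")))
    print(sorted(reasoned_instances(TRIPLES, "ex:Foo")))
```

A real OWL reasoner handles far more than class equivalence; this sketch covers only the single inference used in the example.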

What if the type of character I want to describe is not defined by CDAO?

CDAO contains place-holder classes, such as Standard for standard characters, to which you can attach additional classes that have been defined externally. To accomplish this, define your class and then declare it to be a subclass of the desired parent class in CDAO. The relationships defined for that CDAO class can then be applied automatically to your new class by reasoners that encounter individuals of its type. For example, suppose that you have a standard character type Foo that is not defined in CDAO. Create the class Foo and, when creating it, assert that Foo is a subclass of cdao:Standard. Finally, annotate your data, marking which characters are of type Foo, to create instances of the class. Reasoners processing these instances will be able to treat them as they would other cdao:Standard characters.
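
The effect of declaring Foo as a subclass can be sketched in Python with plain dictionaries standing in for an ontology and a data set. The ex: names, the dictionary layout and the function names are assumptions for illustration only, and the sketch assumes each class has at most one parent.

```python
# Hypothetical ontology fragment: ex:Foo is declared a subclass of cdao:Standard.
SUBCLASS_OF = {"ex:Foo": "cdao:Standard"}

# Hypothetical data: one character typed with the new class, one with the CDAO class.
TYPES = {"ex:char1": "ex:Foo", "ex:char2": "cdao:Standard"}


def superclasses(cls, subclass_of):
    """Return cls followed by its ancestors (single-parent hierarchy)."""
    chain = [cls]
    # Stop at the root, or if a cycle would revisit a class already seen.
    while chain[-1] in subclass_of and subclass_of[chain[-1]] not in chain:
        chain.append(subclass_of[chain[-1]])
    return chain


def instances_of(cls, types, subclass_of):
    """Individuals whose asserted type is cls or any subclass of cls."""
    return {ind for ind, t in types.items() if cls in superclasses(t, subclass_of)}


if __name__ == "__main__":
    # Both characters are found when asking for cdao:Standard.
    print(sorted(instances_of("cdao:Standard", TYPES, SUBCLASS_OF)))
```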

Note: Our XSLT stylesheet will automatically generate some of these characters based on the state definitions provided in a matrix's <states> block.

What if I need some other kind of term not described by CDAO?

Please use our term suggestion form to make your suggestion.

Links

Other Tools
  • RDF Visualization: RDF Gravity allows one to load and flexibly view RDF graphs.
  • OWL Editing: Protege is an integrated development environment for working with ontologies.
  • RDF Query Tool: Twinkle is a simple graphical environment for running SPARQL queries.
  • OWL Reasoner: Pellet is an open-source OWL reasoner.
  • RDF Framework: Jena is an open-source Java framework for working with semantic web data.
  • OWL Documentation Generator: OwlDoc is a JavaDoc-style documentation generator for OWL.
Learning SPARQL

Daily Status

Arlin's Notes from Monday

  • extracting NeXML semantics in triple stores
    • convert NeXML files into triples & triple store
    • create SPARQL query
    • put PhyloWS on top of triple store
    • demonstrate binding together of nexml files
    • link out to taxonomies using URIs
  • API

Tuesday Standup

Notes by Dave.

  • Roger's working on the style sheet to produce RDF from NeXML.
    • Can then load it into RDFGravity.
  • Get as much in as we need for test cases. Focusing on Trees, may get to character
  • Matt Y is working on test documents
  • Vivek looking at API for questions we want to be able to answer.

Arlin sees collaboration between Visualization Subgroup and this group in parsing and representing trees.

Thursday Standup

Roger stood up. Dave took notes.

  • Brandon and Enrico are working on XSLT transform. Almost done
    • Problem was sequence element in NeXML is just a string.
    • Preprocess it with Perl to break it into fully expanded thing that XSLT can deal with.
  • Matt Y working on NeXML parser in Ruby.
  • Vivek making good progress on generating API on top of OWL
  • We were talking about RDF bucket.
    • Wrote basic loader to load XML
    • Mapping NeXML to TDWG taxon something or other.
    • If get this done, both TDWG and NeXML can be in same triple store.