Hackathon Report

From Evolutionary Informatics Working Group

Group leaders: Arlin Stoltzfus and Rutger Vos
Release date: June 12, 2009

Executive summary

The progressive nature of science rests on the ability of researchers to revisit, repeat, and build on each other's work. This, in turn, requires access to results, to contextual information or "metadata" (e.g., sources and methods), and to operations (e.g., computing services). Providing computer-based access to data, metadata, and services is a problem of interoperability. The tools for solving interoperability problems rely on formalizations of the semantics (meaning) and syntax (form) of data, metadata and services. Because the mandate of NESCent's EvoInfo (Evolutionary Informatics) Working Group is to improve interoperability, the group has focused on "glue technologies" that formalize the syntax and semantics of data and services:

  • a future phylogenetic data exchange format (NeXML)
  • an ontology resulting in formal and machine-interpretable semantics of evolutionary data and metadata (CDAO)
  • a programmable web-service based interface for phylogenetic data providers (PhyloWS)

The NESCent Database Interoperability Hackathon, held March 9th-13th 2009, provided a critical evaluation of the usefulness of these glue technologies.

Considerable effort went into planning the hackathon. After identifying 32 phylogenetics-related Data Resources, the organizers recruited 21 key individuals, most with no prior association with the EvoInfo working group. The scope and scale of these data resources, many of them expertly curated, show that researchers are committed to making data available, in response to a perceived demand for data re-use from the relevant research communities (phylogenetics, systematics, comparative genomics, diversity studies). However, making data available online is not enough to make the data re-usable. Barriers to re-use remain substantial when holdings are only available in incompatible formats lacking explicit semantics, and when programmable APIs are not provided. Before the hackathon, the organizers prepared the participants by providing information and by holding 2 teleconferences to entertain questions and discuss possible projects.

At the hackathon, participants self-organized using an open-space approach, resulting in 5 sub-groups. Each subgroup generated a working software product to demonstrate desired interoperability improvements:

  • using NeXML and CDAO, the Semantic processing group applied advanced language technologies to generate a logically queryable representation of phylogenetic data
  • using NeXML and PhyloWS, the Phylogenetic visualization group expanded the capabilities of PhyloWidget to represent tree annotations and to retrieve and display images
  • the Java API for NeXML group created a programmable Java interface to NeXML
  • using NeXML and PhyloWS, the Taxonomic Intelligence group implemented a web services interface to TreeBASE
  • using NeXML and PhyloWS, the Phylr group created a modular system to generate databases whose phylogenetic content can be accessed via PhyloWS

The hackathon was successful in raising the profile of NeXML, CDAO and PhyloWS, and in demonstrating that, with a modest amount of training and effort, data providers can improve interoperability, with benefits for data providers and for end users. A larger-scale effort would be required to have a more substantial impact on production systems.

Scope of this report

The working group meets twice per year and issues a report on its activities a few weeks after each meeting. This report covers work accomplished subsequent to the previous reports, including work done before, during and after the Hackathon. That is, this Fourth Working Period report covers group activities from 11 July 2008 (when the report on the third working period was released) to the release date for this report, June 12, 2009. The report closes by outlining the strategies participants will follow to maintain continuity of the projects started at EvoInfo, and the anticipated tangible outcomes from this.

Project leaders and participants

The individuals marked in bold are members of the organizing group. For more information about each participant's background, affiliations, and motivation click the participant's name.

Participant | Affiliation | Email | Subgroup
Jim Balhoff | NESCent, Phenoscape | balhoff@nescent.org |
Lucie Chan | SDSC, MorphoBank, CIPRES | lcchan@sdsc.edu |
Brandon Chisham | New Mexico State University, CDAO | bchisham@cs.nmsu.edu |
Dave Clements | NESCent, GMOD | clements@nescent.org |
Karen Cranston | U. Arizona, Encyclopedia of Life, PhyLoTA, Tree of Life | cranston@email.arizona.edu |
Sam Donnelly | U. Pennsylvania, pPOD | samd@seas.upenn.edu |
Vladimir Gapeyev | NESCent | vg34@duke.edu |
Karla Gendler | U. Arizona, iPlant Collaborative | gendlerk@gmail.com |
Vivek Gopalan | BCBB, NIAID, NIH, Nexplorer | gopalanv@niaid.nih.gov |
Roger Hyam | NHM London, RBG Edinburgh, BCI, PESI, SpeciesIndex | rogerhyam@googlemail.com |
Mark Jensen | Fortinbras Research, BioPerl | maj@fortinbras.us |
Greg Jordan | EBI, PhyloWidget, PANDIT | greg@ebi.ac.uk |
Matt Kosnik | Smithsonian, PaleoDB | mkosnik@gmail.com |
Hilmar Lapp | NESCent, PhyloWS, Phenoscape, BioPerl | hlapp@nescent.org |
Sheldon McKay | CSHL, iPlant, modENCODE, BioPerl | mckays@cshl.edu |
Peter Midford | University of Kansas, Mesquite, Phenoscape | peter.midford@gmail.com |
Bill Piel | Yale, TreeBASE, PhyloDB | william.piel@yale.edu |
Enrico Pontelli | New Mexico State University, CDAO | epontell@cs.nmsu.edu |
Ryan Scherle | NESCent, Dryad | rscherle@nescent.org |
Katja Seltmann | MorphBank, Florida State University, HAO, mx | seltmann@scs.fsu.edu |
Arlin Stoltzfus | NIST, U. Maryland, CDAO | arlin.stoltzfus@nist.gov |
Jeet Sukumaran | University of Kansas, NeXML | jeet@ku.edu |
Todd Vision | NESCent, Phenoscape | tjv@bio.unc.edu |
Rutger Vos | University of British Columbia, NeXML, PhyloWS, TreeBASE | rutgeraldo@gmail.com |
Matt Yoder | Ohio State, HAO, Hymenoptera On Line (Norm Johnson), mx | diapriid@gmail.com |

Associate | Affiliation | Email | Subgroup
Jonathon Cummings | Duke Business School, the Science of Science | jonathon.cummings@duke.edu |

[Group photograph of hackathon participants (IMG 1647.JPG), with numbered key (IMG 1647 key.JPG)]

1. McKay; 2. Jordan; 3. Chisham; 4. Hyam; 5. Stoltzfus; 6. Lapp; 7. Kosnik; 8. Piel; 9. Vos; 10. Gapeyev; 11. Jensen; 12. Sukumaran;

13. Gendler; 14. Donnelly; 15. Clements; 16. Balhoff; 17. Scherle; 18. Seltmann; 19. Midford; 20. Gopalan; 21. Chan; 22. Cranston.


The mandate of the working group is to improve interoperability in evolutionary analysis. The original proposal had the narrow focus of better supporting existing technologies. However, over the course of its first, second and third meetings, the group adopted a more forward-looking goal of developing a "Central Unifying Artefact" to serve as the Rosetta stone for interoperability. This goal was pursued using two main strategies, focusing respectively on syntactic and on semantic interoperability. In addition, some work was done on a web services standard. The work done by EvoInfo participants has yielded significant deliverables, including:

Goals for the period prior to the fourth meeting

The goals and specific aims for this period, described in greater detail in the third report, were to establish a web presence for CDAO[1] and NeXML[2] and to expand software support and documentation (including manuscripts for publication) for the standards. This was in preparation for an opportunity (at the time codenamed CarrotBase, and "to be organized") to put the standards to a practical test. The opportunity materialized when the fourth meeting was organized as a hackathon, along the lines of the PhyloInformatics and R meetings previously held at NESCent.

Goals for the hackathon

The following broad objectives were identified. Participants of the hackathon refined these and distilled concrete work targets from them in advance of and at the event.

  1. Unify the data format using NeXML:
    • Define and implement a transformation path from the native data format of the participating data providers to NeXML.
    • Document mappings, gaps, and ambiguities, and resolve those at the event as much as possible, or lay out ways for future resolution.
  2. Unify the data semantics using CDAO
    • Define comprehensive mappings from the metadata of the participating data providers to CDAO terms.
    • Extend CDAO with (possibly provisional) terms as much as possible.
    • Identify and document a procedure for other data providers with semantics not currently represented within CDAO.
  3. Unify programmable data provider API
    • Complete the PhyloWS specification for RESTful data access and querying.
    • Document NeXML and CDAO needs for specifying metadata queries through PhyloWS.
  4. Create demonstration projects that take advantage of the unified data formats and/or semantics.
    • Database that integrates all participating data providers.
    • PhyloWS implementation on top of an integrated database.
    • Interactive tool that visualizes and navigates across the breadth of data.

The hackathon concentrated on writing code. All code and documentation were made available immediately and freely to the community under an open-source (OSI-approved) license.

For a more specific list of possible deliverables, see Database Interop Deliverables. We also collected use-cases.


Accomplishments of the EvoInfo working group in the interim period

The goals of the working group have advanced considerably since Meeting 3 in May of 2008, due to the efforts of individuals working on the CDAO and NeXML projects. A brief summary of these accomplishments is given below.

NeXML project advances

  • NeXML now has a mature and stable home at www.nexml.org, hosted by NESCent.
  • This site hosts the most recent version of the schema[3] and a validation service against this schema[4]
  • Supporting services include a nexus-to-nexml[5] and a nexml-to-json[6] translation service.
  • As demo services the NeXML project has produced a nexml translation[7] of the Tree of Life web project and one[8] of the TimeTree web project
  • In addition, the NeXML project has an active mailing list[9] and a subversion repository[10] with 1026 commits at time of writing.

CDAO project advances

  • CDAO has a mature and stable home at www.evolutionaryontology.org, hosted by NESCent.
  • The CDAO project has produced an OWL ontology[11] for download
  • A CDAO manuscript was written, submitted, and accepted in March
  • The project has started on version 2 of CDAO[12]
  • Participants have submitted another proposal for funding, ranked at 17.7% (funding is possible but unlikely)

Other project advances

Advance preparation for the hackathon

Initial planning, invitations, and applications

Planning for the hackathon began early in the fall of 2008, roughly inspired by the CarrotBase plan developed by the working group to create "carrots", i.e., tangible rewards for utilizing a standard (as opposed to "sticks", i.e., punishments for not using the standard). The organizing team used an email list and occasional teleconferences to communicate; they shared documents via Google docs; and they occasionally made use of a non-public wiki space.

In the course of planning, the goal shifted away from organizing a large demonstration project, toward facilitating a more open-ended hackathon in which many small projects might emerge depending on the interests of the participants. The organizers began by identifying possible targets (see the Data Resources list). Project leaders for data resources were asked to nominate suitable hackathon participants. The organizers invited designated individuals to apply, and also issued an open call. All prospective participants were required to apply by responding to a set of questions about their interests, qualifications, and expected outcomes. The organizers evaluated the applications and chose participants. Originally the intention was for the hackathon to take place in December 2008 or January 2009. However, by the time the complete participant team was recruited, it was necessary to re-schedule the event for March.

Pre-meeting and conference calls

The core developers of NeXML, CDAO, and PhyloWS held a pre-meeting (detailed notes) from February 20 to 22, two weeks prior to the hackathon, to prepare the emerging standards for practical application at the hackathon. This meeting focused on metadata and produced a draft document on the expression of metadata and metadata semantics in NeXML.

In order to prepare the participants, two conference calls were held (follow links for detailed notes):

Documentation of use cases and resources

Accepted participants were asked to provide further information on data resources and to articulate needs, hopes and expectations in the form of use-cases:

Some of the use-cases relate directly to problems tackled at the hackathon, while others do not. However, even where the use-cases were not pursued further, the task of developing them may have helped to prepare participants to think about interoperability problems.

Activities and discussion at the Hackathon

The following sections summarize the activities at the Hackathon. (Also see: detailed notes.)

Day 1

Day 1 began with an introductory session, then proceeded with an Open-Space meeting. During the afternoon, the open-space meeting was interrupted by a boot camp session on NeXML presented by Rutger.

Introductory session
  • welcome (Hilmar Lapp, NESCent) and introduction to NESCent activities (Todd Vision, NESCent)
  • other NESCent staff introductions
  • Jonathon Cummings' Evaluation Survey
  • participant introductions
  • refreshers on PhyloWS (Hilmar, Rutger), CDAO (Arlin, Enrico, Brandon) and NeXML (Rutger)
Open Space Pitches

Day 1 continued with an Open Space session[13] where several participants pitched ideas:

  • Hilmar: integrated database of trees
  • Rutger: Java API library to NeXML
  • Vivek: API for CDAO using semantic framework
  • Bill: clarify OTU concept to specify search interface
  • Sheldon: high-level UI to mash-up tree metadata
  • Ryan: middleware to expose PhyloWS
  • Mark: mashup of LANL HIV data
  • Greg: visualizing trees and metadata
  • Roger: integrating classifications with phylogenetic data
  • Arlin: taxonomy resolution service
  • Matt Y: taxonomy mapping methods
  • Matt K: defining what data users want and how to deliver it

These pitches then self-organized and coalesced into the following subgroups:

Subgroup | 6 words or less (description) | Members | Doc
Taxonomic Intelligence Subgroup | Syntax and protocols for resolving taxonomies | Bill Piel, Karen Cranston, Matt K, Katja Seltmann | Karla
Java API Library to NeXML Subgroup | Lightweight, High level API for NeXML | Peter, Sam, Jim, Rutger | Dave
Phylr Subgroup | PhyloWS high level and adaptors | Ryan, Hilmar, Jeet, Lucie, Vladimir | Dave
Semantic API for CDAO Subgroup | extracting NeXML semantics in triple stores | Roger, Vivek, Brandon, Enrico, Mark, Matt Y | Dave
Visualization Subgroup | Web GUI for tree viewing and editing | Greg, Sheldon, Katja, (Matt Y, Arlin) | Karla

Days 2 through 4

Day 5

Products of the hackathon

The organizers created a Google project space for source code at http://code.google.com/p/dbhack1/source/browse/#svn/trunk. Each of the 5 subgroups generated computer code, and in most cases this was applied to data in a demonstration project.

Semantic processing

The Semantic API for CDAO Subgroup focused on providing access to semantics of data in NeXML files. The plan was to focus on the phylogenetic tree initially and build a system that should be extendable to other types of evolutionary data. The approach was to use XSLT technology to convert content from NeXML syntax into RDF expressions in terms of CDAO concepts, or in terms of an on-the-fly TDWG taxonomy concept ontology. The results were used to populate an RDF triple store accessed by a SPARQL query interface.
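The conversion idea can be illustrated with a minimal sketch. It is written in Python rather than XSLT, and it is not the subgroup's code. The XML fragment is a simplified, namespace-free stand-in for a NeXML tree, the predicate ex:hasParent is a placeholder rather than a confirmed CDAO term, and the match function is only a toy substitute for a SPARQL triple pattern over a real triple store.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment loosely modeled on a NeXML tree.
# Hypothetical sample data: real NeXML carries namespaces and more attributes.
SAMPLE = """
<tree id="tree1">
  <node id="n1" root="true"/>
  <node id="n2" label="Homo sapiens"/>
  <node id="n3" label="Gorilla gorilla"/>
  <edge id="e1" source="n1" target="n2"/>
  <edge id="e2" source="n1" target="n3"/>
</tree>
"""


def tree_to_triples(xml_text):
    """Return a set of (subject, predicate, object) triples for one tree."""
    tree = ET.fromstring(xml_text)
    triples = set()
    for node in tree.findall("node"):
        triples.add((node.get("id"), "rdf:type", "ex:Node"))
        if node.get("label"):
            triples.add((node.get("id"), "rdfs:label", node.get("label")))
    for edge in tree.findall("edge"):
        # ex:hasParent is a placeholder predicate, not a confirmed CDAO term.
        triples.add((edge.get("target"), "ex:hasParent", edge.get("source")))
    return triples


def match(triples, s=None, p=None, o=None):
    """Toy stand-in for a SPARQL triple pattern: None acts as a variable."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


triples = tree_to_triples(SAMPLE)
# "Which nodes have n1 as their parent?"
children_of_root = [t[0] for t in match(triples, p="ex:hasParent", o="n1")]
print(children_of_root)
```

Once the tree is expressed as triples, structural questions become pattern matches, which is what makes the representation queryable.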

A basic Java API was also developed using the Jena semantic web framework, with the Pellet reasoner for reasoning over SPARQL queries. This Java API has tree and node methods that extract data from rooted trees in the RDF files and convert them to Newick format. Only the basic framework of this library was developed during the hackathon; new API methods would need to be written to handle generic network and character data. The key feature of the library is that it uses the CDAO ontology and performs logical reasoning with the Pellet reasoner through SPARQL queries to extract specific tree- or node-related properties (see the methods in CDAOUtils.java for the SPARQL queries, or Node.java for the Jena API methods that extract data from RDF graphs). In addition, the group evaluated automatic code-generation tools for developing a Java API for CDAO using Jastor.
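As a rough illustration of the last step, turning parent-child relations from a rooted tree into Newick, the following Python sketch may help. It is not the Jena-based library. The function name to_newick and the dictionary inputs are hypothetical, and labels are assumed to need no Newick quoting.

```python
def to_newick(parent_of, labels, root):
    """Serialize a rooted tree to Newick.

    parent_of maps each child node id to its parent id; labels maps
    node ids to names (assumed free of characters that need quoting).
    """
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    def render(node):
        kids = sorted(children.get(node, []))  # sorted for stable output
        name = labels.get(node, "")
        if not kids:
            return name
        return "(" + ",".join(render(k) for k in kids) + ")" + name

    return render(root) + ";"


# Hypothetical example: root n1 with a leaf n2 and an internal node n3.
parent_of = {"n2": "n1", "n3": "n1", "n4": "n3", "n5": "n3"}
labels = {"n2": "A", "n4": "B", "n5": "C"}
print(to_newick(parent_of, labels, "n1"))  # prints (A,(B,C));
```

The recursion is adequate for small trees; a very deep tree would call for an iterative traversal instead.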


Detailed information on the project is available from the Semantic_API_for_CDAO_Subgroup page. Code generated by the group is available from https://dbhack1.googlecode.com/svn/trunk/cdao-api .

Phylogenetic visualization

The Visualization Subgroup added new functionality to PhyloWidget and developed a web-based application using PhyloWidget to facilitate database queries and user interaction. The group also implemented a web services interface to MorphBank images and generated NeXML Test Files for visualization use-cases.


[Figure: Screenshot of the new web interface (PhylowidgetWeb.png)]

This new interface implements some notable new ideas:

  • Intuitive editing of visual parameters, such as branch width, color, and node shape
  • Simple searching of an evolutionary database, with taxonomic auto-completion
    • Showcases: PhyloWS for generic communication between phylogenetic data clients and servers
  • "Tree transformations," or one-click remote services that perform some analysis or transformation of a tree
    • Showcases: NeXML as a standards-based format for transmitting phylogenetic data and associated metadata

Java API for NeXML

After giving initial consideration to the possibility of adding a NeXML API to an existing Java library such as BioJava or Jebl, the Java API Library to NeXML Subgroup (Peter, Jim, Sam, Rutger) decided to develop a standalone, lightweight Java 5 API for reading and writing NeXML on a stored DOM tree.

The group produced 31 classes and 27 interfaces, with 7 test classes, and gave a live demo of the test classes on the last day of the hackathon. To address jar-dependency concerns raised in a pre-hackathon email discussion and by the hackathon group as a whole, the library has no jar dependencies outside of a 1.5 JRE.

After the hackathon, Rutger and others continued to work on this library, currently available from https://nexml.svn.sourceforge.net/svnroot/nexml/trunk/nexml/java/src/org/nexml/ (SourceForge). Recently the library has been expanded to process semantics embedded in NeXML files using RDFa.

Taxonomic Intelligence

The Taxonomic Intelligence Subgroup implemented a web service for TreeBASE that follows the PhyloWS specification and returns data in NeXML format. The TreeBASE data are served from NESCent via dbhack1. The interface is a flavor of PhyloWS. Some examples are:

  • Basic URNs to a Tree using the syntax GET /phylows/tree/<identifier>/[{format}=<format>]
  • Basic URNs to a Clade using the syntax GET /phylows/tree/<identifier>/clade/<nodeID>/[{format}=<format>] . In this case the <nodeID> is a serially-generated integer starting from the root of the tree. We may redesign this to use a unique nodeID number. For example, the following request is for the fifth node in tree with ID 2853: http://purl.org/phylo/treebase/phylows/tree/TB:2853/clade/5
  • Queries using SRU/CQL Syntax. The query statement should be written in Contextual Query Language [CQL].
    • The CQL "index" keys (e.g., taxon_name, ncbi_taxid, ubio_namebankid, etc.) were picked arbitrarily for the convenience of TreeBASE's data model and namespace.
    • The syntax is GET /phylows/find/tree/?[query=<CQL query>]&[recordSchema=<format>]&[operation=searchRetrieve]&[version=1.2]
    • For example, search for all trees that have both a taxon starting with Homo sapiens and a node linked to NCBI taxid 9593 (which happens to be Gorilla gorilla): query=taxon_name+any+%22Homo+sapiens%25%22+and+ncbi_taxid+%3D+9593
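A client might assemble these requests as in the following Python sketch. The base URL is taken from the clade example above; the helper names and the default recordSchema value "nexml" are assumptions for illustration, and the sketch only builds URLs without checking that the service responds.

```python
from urllib.parse import quote_plus

# Base taken from the clade example in the text.
BASE = "http://purl.org/phylo/treebase/phylows"


def tree_url(tree_id):
    """URL for a whole tree, e.g. tree_url("TB:2853")."""
    return f"{BASE}/tree/{tree_id}"


def clade_url(tree_id, node_id):
    """URL for the clade rooted at a serially numbered node."""
    return f"{BASE}/tree/{tree_id}/clade/{node_id}"


def find_trees_url(cql, record_schema="nexml"):
    """URL for an SRU/CQL search; the record_schema default is an assumption."""
    return (
        f"{BASE}/find/tree/?query={quote_plus(cql)}"
        f"&recordSchema={record_schema}"
        "&operation=searchRetrieve&version=1.2"
    )


# The CQL statement from the example, before URL encoding.
CQL = 'taxon_name any "Homo sapiens%" and ncbi_taxid = 9593'

print(clade_url("TB:2853", 5))
print(quote_plus(CQL))
print(find_trees_url(CQL))
```

Encoding the readable CQL statement with quote_plus reproduces the encoded query string shown in the example, which makes it easier to write queries by hand and encode them only at request time.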


Phylr

The Phylr Subgroup focused on middleware for adding PhyloWS capabilities to existing databases. They developed tools to import data from NeXML files into a database system that responds to PhyloWS requests. The system used BioSQL, with content indexed by Lucene.

  1. Fixed DendroPy to parse and write legal NeXML
  2. Import/export NeXML from BioSQL
  3. Added support for matrices to BioSQL
  4. Improved the PhyloWS spec
  5. System to index NeXML files into Lucene
  6. Phylr -- A modular implementation of the PhyloWS find API (currently supports Lucene, but can be easily expanded for other back-end storage systems)
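The modular design in the last item can be pictured with a small Python sketch. This is not Phylr's code and does not use the Lucene API. InMemoryIndex is a hypothetical stand-in for a search back end, and find is a hypothetical front end that works with any object offering the same search method. The field name taxon_name follows the CQL index mentioned for TreeBASE, and the file names are invented.

```python
from collections import defaultdict


class InMemoryIndex:
    """Toy search back end (a stand-in for Lucene, not its API)."""

    def __init__(self):
        # (field, token) -> set of document ids containing that token
        self._postings = defaultdict(set)

    def add(self, doc_id, field, text):
        """Index the whitespace-separated tokens of text under a field."""
        for token in text.lower().split():
            self._postings[(field, token)].add(doc_id)

    def search(self, field, text):
        """Return ids of documents containing every token of text."""
        tokens = text.lower().split()
        if not tokens:
            return set()
        hits = [self._postings.get((field, t), set()) for t in tokens]
        return set.intersection(*hits)


def find(index, field, text):
    """Front end: works with any back end that offers search(field, text)."""
    return sorted(index.search(field, text))


# Hypothetical documents standing in for indexed NeXML files.
index = InMemoryIndex()
index.add("tree1.xml", "taxon_name", "Homo sapiens")
index.add("tree1.xml", "taxon_name", "Gorilla gorilla")
index.add("tree2.xml", "taxon_name", "Gorilla gorilla")

print(find(index, "taxon_name", "gorilla"))
print(find(index, "taxon_name", "homo sapiens"))
```

Because find depends only on the search method, a different storage system could be substituted by supplying another class with that method.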


Documentation

A documentation team consisting of Karla Gendler (iPlant Collaborative) and Dave Clements (GMOD Help Desk) focused on developing documentation during and after the hackathon. The documentation for the hackathon consists primarily of wiki pages hosted at NESCent, as described in the documentation plan. A special hackathon sidebar was created to organize content. Tagging was found to be the most convenient way to access content.

Evaluation and impact assessment

Hackathon planning and logistics

Advance planning by the organizers. Planning for the hackathon took longer than expected, which delayed the scheduling of the event. This was largely because some of the organizers were overcommitted and could not devote sufficient time during key periods in the planning process. The organizers did not realize until rather late in the process how important metadata standards would turn out to be. At the organizers' pre-meeting two weeks before the hackathon, the organizers attempted to develop an RDFa-like standard, but the issue remained unresolved throughout the hackathon. Advance planning was probably the weakest aspect of the project.

Choice of participants and projects. Rather than simply advertising the hackathon and broadcasting a request for participation, the organizers identified and targeted data resources and participants in advance. A large amount of time was spent identifying data resources with interop potential, identifying candidates for participation, and evaluating applications. Many applicants were passed over because they did not seem to have the right skills, or they seemed unlikely to collaborate due to low scientific and technical overlap with other participants. At the hackathon, most participants contributed in an interactive way, although some were productive without working closely with a group. The choice of participants and projects seemed successful in retrospect; however, other choices might have been successful, too.

Participant training and preparation. Past experience with hackathons and working groups has shown that, although participants expect to work hard during a meeting, they do not expect to expend significant efforts to prepare before the meeting (nor to expend effort after the meeting in order to follow up). Therefore the organizers did not expect much from participants before the meeting. At the telecon in February, only a few participants said anything at all.

Hackathon logistics. NESCent provided excellent technical support for the hackathon, including meeting rooms, whiteboards, and access to a wireless network and to shared electronic storage. IT staff helped participants to support demonstration projects.

Hackathon outputs and community impact

The outcomes of the meeting fall into several categories.

Interoperability innovations. Probably the most significant projects for showcasing the unique benefits of a hackathon were the Phylr and Semantic Processing projects. Both projects utilized two different glue technologies (NeXML and PhyloWS for Phylr; NeXML and CDAO for Semantic Processing) and required a team of people with different skills. Both projects used generalizable methods to convert a stack of separate input files into a structured database that can be queried.

Tangible accomplishments during the hackathon. As noted above, the hackathon produced a large amount of computer code, including demonstration projects (Phylr, Semantic processing), stand-alone projects (Java API for NeXML), and substantive improvements to existing resources (e.g., TreeBASE services; improvements to PhyloWidget). Most of the hackathon-generated code could be modified for use in a production setting. This is true even for demonstration projects. For instance, the Semantic Processing group clearly focused on carrying out a demonstration project, but their XSLT translation tools could be re-used in a subsequent project to translate NeXML content into CDAO RDF/XML. Hackathon-generated improvements to PhyloWidget will become part of the standard distribution. The hackathon-generated Java API for NeXML appears poised to become the preferred library for processing NeXML.

Momentum for follow-up projects. Some participants left the hackathon with vision and energy for follow-up projects (see below). For instance, the Phylr project is being continued as part of Google's Summer of Code. However, it should be noted that some of the hackathon projects or follow-up projects (e.g., Vivek's visualization interface for CDAO) were conceived well before the hackathon and might have happened anyway.

Raised awareness of common problems and solutions. Participation in the meeting raised awareness of the EvoInfo group's glue technologies, and of the ways in which these technologies can be used to improve interoperability. Many participants were surprised at the high level of sophistication that became possible by using these tools, e.g., the ability of PhyloWidget to locate and display taxon-specific images to match the taxa represented in a set of data. The group also became aware of the need for:

  • supporting metadata representation in diverse contexts (e.g., via RDFa in NeXML)
  • supporting the use of universal resource identifiers (e.g., LSIDs)

However, until the hackathon group publicizes its efforts more thoroughly, this raised awareness of problems and solutions applies only to a relatively small group of direct participants.

Presentations and publications

Future plans and possible follow-ups

NeXML, CDAO and PhyloWS

  • NeXML. NeXML continues to be developed and is becoming integrated in various real-world projects. An alternative standard, PhyloXML, is also gaining support.
  • CDAO. The CDAO group (Pontelli, Stoltzfus, Thompson, Prosdocimi, Chisham) intends to continue developing CDAO and is seeking external funding for CDAO as part of a larger project (Principal Investigators are Stoltzfus, Pontelli and Gupta). A current challenge is to respond to specific concerns (coming from the NeXML development community) for language to support metadata concepts.
  • PhyloWS. Ryan created a Google Group for the PhyloWS community at http://groups.google.com/group/phylows, and encourages those interested in PhyloWS to resume discussion of issues left hanging at the end of the hackathon, particularly the names of searchable fields (CQL indexes).

Possible follow-up hackathon

The organizers plan to hold a teleconference in June to make plans for a possible follow-up hackathon.

Possible independent follow-ups to hackathon projects

  • NeXML API in Java.
    • Rutger continues to develop this library.
    • When Sam adds NeXML support to pPOD, this library will be used and functionality added to the library as needed.
  • Phylr. The next phase of Phylr development will be undertaken as part of a GSOC project by Dazhi Jiao, supervised by Ryan. Details can be found at http://tinyurl.com/phylr-gsoc
  • PhyloWidget and phylogenetic visualization.
    • Greg and Sheldon plan to incorporate the new Javascript-based user interface into the main PhyloWidget codebase, for use at http://www.phylowidget.org/ and elsewhere within the next six months. By the end of the year, the example "tree transformation" web services will be expanded to include coloring labels by taxonomic group, adding branch lengths to a tree (based on a reference from tolweb.org), and inserting external links to various web resources. All future developments on this front will be uploaded to the PhyloWidget main site for demonstration and distribution.
  • Semantic API for CDAO
    • Vivek is planning to extend the Java API for CDAO library begun during the hackathon and to continue its development at BCBB/NIAID/NIH, with Arlin as collaborator.
    • Vivek is also planning to develop a web-based visualization application with the Java CDAO library for handling the phylogenetic data. He has already recruited a Java developer in his group at BCBB to work on this project.
      • The beta version of the web-based application and the Java CDAO library is expected to be released at the end of this summer (November 1st 2009).