Database Interop Hackathon

From Evolutionary Informatics Working Group
Revision as of 00:31, 19 September 2008 by Hlapp (talk) (Organization)
NESCent Hackathon on Evolutionary Database Interoperability

Synopsis

Even though there is a rich and meticulously curated variety of on-line databases of phylogenetic data, their holdings are only available in incompatible formats lacking explicit semantics, and programmable APIs for querying the data are often not provided. NESCent seeks to address the resulting obstacle to interoperability and data integration by sponsoring a hackathon that brings together data and metadata experts and developers from a number of data providers with the developers of the emerging NeXML and CDAO standards.

Motivation

There is a large variety of phylogenetic data resources in the form of on-line databases. The data they provide range from character state matrices (e.g., MorphBank, Morphobank), molecular sequence alignments (e.g., BAliBASE, PANDIT), phylogenetic trees (e.g., TreeBASE), gene or protein trees (e.g., TreeFAM, PhylomeDB), species trees (e.g., Tree of Life), and gene families (e.g., PhyloFacts, HOVERGEN) to species taxonomies (e.g., NCBI Taxonomy, ITIS) and analytic metadata such as divergence times (e.g., TimeTree). There is no common or unifying exchange format in which these data resources are available, and each of the resources boasts a variety of meticulously curated or computed metadata for its holdings that require expert knowledge and manual inspection to interpret. Furthermore, there is no common, predictable way of querying and obtaining the data, and in fact most of those resources don't provide any programmable on-line interface (API).

This situation presents a fundamental obstacle to integrating phylogenetic data and service providers into a network of interoperating services that consume and produce data in predictable, verifiable syntax and with explicit machine-interpretable semantics, key prerequisites to applying tools for resource discovery and for constructing or executing complex workflows. It also renders these resources resistant to large-scale data integration, for example for combining and cross-linking some of these resources with other data, such as genomic, phenotypic, or georeferenced specimen data.

Recently, the development of the NeXML data exchange format and the CDAO ontology for comparative and phylogenetic data and analysis has provided a window of opportunity to apply both of these emerging standards towards overcoming some of these obstacles, while at the same time validating their ability to satisfy real-world needs that previously used standards (such as NEXUS) have not met. Doing so would benefit the data providers by making their data more broadly useful, end-users by giving them access to a wide variety of phylogenetic data in a common, predictable format, and ultimately tool developers by defining a uniform way of giving their users instant access to a large swath of data.

NESCent seeks to take advantage of this opportunity by sponsoring a hackathon that brings together data and metadata experts and developers from several phylogenetic data providers with the developers of NeXML and CDAO, as well as with developers of phylogenetic data visualization and database integration projects.

Specific objectives

The following broad objectives have been identified. Participants in the hackathon will refine these and distill concrete work targets from them in advance of and at the event.

  1. Define and implement a transformation path from the native data formats of the participating data providers to NeXML.
  2. Define comprehensive mappings from the metadata of the participating data providers to CDAO terms.
  3. Create a database that integrates all participating data providers.
  4. Create an interactive tool that visualizes and navigates across the breadth of data.
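
As an illustration of what a transformation path under objective 1 might look like, the following minimal Python sketch converts a simple Newick tree string into a simplified NeXML-style XML document. It is only a sketch under stated assumptions: Newick stands in for a provider's native format, the function names (parse_newick, newick_to_nexml_like) are hypothetical, and the element and attribute names (otus, otu, trees, tree, node, edge, source, target, length) follow the general NeXML layout but omit namespaces and type declarations, so the output is not validated against the NeXML schema.

```python
import xml.etree.ElementTree as ET


def parse_newick(text):
    """Parse a simple Newick string (no quoted labels or comments)
    into nested dicts with 'label', 'length' and 'children' keys."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos].strip()
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1  # consume ":"
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"label": label, "length": length, "children": children}

    return parse_node()


def newick_to_nexml_like(newick):
    """Build a simplified, NeXML-style element tree (assumed layout,
    no namespaces, not schema-validated) from a Newick string."""
    root = parse_newick(newick)
    doc = ET.Element("nexml")
    otus = ET.SubElement(doc, "otus", {"id": "otus1"})
    trees = ET.SubElement(doc, "trees", {"id": "trees1", "otus": "otus1"})
    tree = ET.SubElement(trees, "tree", {"id": "tree1"})
    nodes, edges = [], []

    def visit(node, parent_id):
        node_id = f"n{len(nodes) + 1}"
        attrs = {"id": node_id}
        if node["label"]:
            attrs["label"] = node["label"]
        if parent_id is None:
            attrs["root"] = "true"
        if not node["children"]:
            # Each leaf gets its own operational taxonomic unit (OTU).
            otu_id = f"otu{len(otus) + 1}"
            ET.SubElement(otus, "otu", {"id": otu_id, "label": node["label"]})
            attrs["otu"] = otu_id
        nodes.append(attrs)
        if parent_id is not None:
            edge = {"id": f"e{len(edges) + 1}",
                    "source": parent_id, "target": node_id}
            if node["length"] is not None:
                edge["length"] = str(node["length"])
            edges.append(edge)
        for child in node["children"]:
            visit(child, node_id)

    visit(root, None)
    # All node elements are written before the edge elements.
    for attrs in nodes:
        ET.SubElement(tree, "node", attrs)
    for attrs in edges:
        ET.SubElement(tree, "edge", attrs)
    return doc


if __name__ == "__main__":
    document = newick_to_nexml_like("((A:1.0,B:2.0):0.5,C:3.0);")
    print(ET.tostring(document, encoding="unicode"))
```

A real transformation path would additionally have to carry over each provider's metadata, which is where the mappings to CDAO terms under objective 2 would come in.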

The hackathon concentrates on writing code. All code and documentation will be made available immediately and freely to the community under an open-source (OSI-approved) license.

Subgroups

Participants

Participation will be arranged by invitation and by self-nomination followed by review.

Organization

Organizing Committee: Hilmar Lapp, Arlin Stoltzfus, Todd Vision, Rutger Vos

Time & Venue: The hackathon is tentatively scheduled for Dec 8-12, 2008 at NESCent.