Database Interop Hackathon

From Evolutionary Informatics Working Group

Revision as of 11:37, 21 February 2009

NESCent Hackathon on Evolutionary Database Interoperability

Synopsis

Even though there is a rich and meticulously curated variety of on-line databases of phylogenetic data, their holdings are only available in incompatible formats lacking explicit semantics, and programmable APIs for querying the data are often not provided. NESCent seeks to address the resulting obstacle to interoperability and data integration by sponsoring a hackathon that brings together data and metadata experts and developers from a number of data providers with the developers of the emerging NeXML and CDAO standards.

Motivation

There is a large variety of phylogenetic data resources in the form of on-line databases, providing data ranging from character state matrices (e.g., MorphBank, Morphobank), molecular sequence alignments (e.g., BAliBASE, PANDIT), phylogenetic trees (e.g., TreeBASE), gene or protein trees (e.g., TreeFAM, PhylomeDB), species trees (e.g., Tree of Life), and gene families (e.g., PhyloFacts, HOVERGEN) to species taxonomies (e.g., NCBI Taxonomy, ITIS) and analytic metadata such as divergence times (e.g., TimeTree). There is no common or unifying exchange format in which these data resources are available, and each of the resources boasts a variety of meticulously curated or computed metadata for its holdings that require expert knowledge and manual inspection to interpret. Furthermore, there is no common, predictable way of querying and obtaining the data, and in fact most of those resources don't provide any programmable on-line interface (API).

This situation presents a fundamental obstacle to integrating phylogenetic data and service providers into a network of interoperating services that consume and produce data in predictable, verifiable syntax and with explicit machine-interpretable semantics. Both are key prerequisites to applying tools for resource discovery and for constructing or executing complex workflows. It also renders these resources resistant to large-scale data integration, for example combining and cross-linking some of these resources with other data, such as genomic, phenotypic, or georeferenced specimen data. The problem is exacerbated because existing, commonly used standards for phylogenetic data, such as NEXUS, cannot fully represent the different data sources and their semantics in a consistent manner. Any effort to overcome this situation depends on an exchange standard with sufficient syntactic and semantic expressivity.

Recently, the development of the NeXML data exchange format and the CDAO ontology for comparative and phylogenetic data and analysis has provided a window of opportunity to apply both of these emerging standards towards solving some of these obstacles, while at the same time validating their ability to satisfy real-world needs that previously used standards (such as NEXUS) have not. Doing so would benefit data providers by making their data more broadly useful, end-users by giving them access to a wide variety of phylogenetic data in a common, predictable format, and ultimately tool developers by defining a uniform way of giving their users instant access to a large swath of data. Furthermore, the recently started development of PhyloWS provides a first attempt at a uniform specification for a programmable phylogenetic data provider API.

NESCent seeks to take advantage of this opportunity by sponsoring a hackathon that brings together data and metadata experts and developers from several phylogenetic data providers with the developers of NeXML and CDAO. In addition, developers and end-users of phylogenetic data visualization and database integration projects will build demonstration projects and ensure the utility of the effort for research applications.

Acknowledgements

Many of the ideas for this hackathon arose from, and are a continuation of, the activities of NESCent's Evolutionary Informatics Working Group. Specifically, NeXML, CDAO, and PhyloWS are products of the group, and the motivation for this hackathon is a distillation of the CarrotBase ideas and concepts, which essentially served as a whitepaper for this event.

Specific objectives

For a list of possible deliverables, see Database Interop Implementations. We are also collecting use-cases.

The following broad objectives have been identified. Participants in the hackathon will refine these and distill concrete work targets from them in advance of and at the event.

  1. Unify the data format using NeXML:
    • Define and implement a transformation path from the native data format of the participating data providers to NeXML.
    • Document mappings, gaps, and ambiguities, and resolve those at the event as much as possible, or lay out ways for future resolution.
  2. Unify the data semantics using CDAO:
    • Define comprehensive mappings from the metadata of the participating data providers to CDAO terms.
    • Extend CDAO with (possibly provisional?) terms as much as possible.
    • Identify and document a procedure for other data providers with semantics not currently represented within CDAO.
  3. Unify the programmable data provider API:
    • Complete the PhyloWS specification for RESTful data access and querying.
    • Document NeXML and CDAO needs for specifying metadata queries through PhyloWS.
  4. Create demonstration projects that take advantage of the unified data formats and/or semantics:
    • Database that integrates all participating data providers.
    • PhyloWS implementation on top of an integrated database.
    • Interactive tool that visualizes and navigates across the breadth of data.
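
As a rough illustration of objective 1 above, the following Python sketch converts a made-up "native" tree record into NeXML-style XML. Everything in it is an assumption for illustration only: the nested-tuple input format, the function name to_nexml_like, and the generated ids are hypothetical, and the element and attribute names (otus, otu, trees, tree, node, edge, source, target, length) follow a simplified reading of NeXML that omits namespaces and has not been validated against the NeXML schema.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical "native" provider record: (label, branch_length, children).
NATIVE_TREE = ("root", None, [
    ("A", 1.0, []),
    ("inner", 0.5, [
        ("B", 2.0, []),
        ("C", 2.5, []),
    ]),
])


def to_nexml_like(native_tree):
    """Convert a nested-tuple tree into a NeXML-style element (unvalidated)."""
    root = ET.Element("nexml")
    otus = ET.SubElement(root, "otus", id="otus1")
    trees = ET.SubElement(root, "trees", id="trees1", otus="otus1")
    tree = ET.SubElement(trees, "tree", id="tree1")
    ids = count(1)
    edges = []  # collected so that all nodes are written before any edge

    def visit(record, parent_id):
        label, length, children = record
        node_id = f"n{next(ids)}"
        attrs = {"id": node_id, "label": label}
        if not children:
            # Tips get an OTU entry that the node refers to.
            otu_id = f"otu_{label}"
            ET.SubElement(otus, "otu", id=otu_id, label=label)
            attrs["otu"] = otu_id
        ET.SubElement(tree, "node", attrs)
        if parent_id is not None:
            edges.append((parent_id, node_id, length))
        for child in children:
            visit(child, node_id)

    visit(native_tree, None)
    for i, (source, target, length) in enumerate(edges, start=1):
        attrs = {"id": f"e{i}", "source": source, "target": target}
        if length is not None:
            attrs["length"] = str(length)
        ET.SubElement(tree, "edge", attrs)
    return root


doc = to_nexml_like(NATIVE_TREE)
print(ET.tostring(doc, encoding="unicode"))
```

A real transformation path would additionally need the NeXML namespace and schema declarations, character matrices, and metadata annotations, which is where the mappings, gaps, and ambiguities mentioned above would surface.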

The hackathon concentrates on writing code. All code and documentation will be made available immediately and freely to the community under an open-source (OSI-approved) license.

Activities

Pre-meeting

To best prepare for the main event, the developers of the participating standards are holding a pre-meeting on Feb 20-22, also on-site at NESCent in Durham, NC.

Conference calls

Hackathon participants


NeXML

Subgroups

Participants will split into subgroups at the event. The composition and tasks of the subgroups will be guided by the overall objectives, but will otherwise emerge and be self-determined by the participants prior to and at the event.

Documentation

The form for entering input and output data is on a separate page.

Participate

  • On-site participation was arranged by invitation and by self-nomination through an Open Call for Participation, followed by review.
  • The list of on-site participants has been finalized as of mid-February.
  • To better enable remote participation and pre-hackathon interactions, we have created an IRC channel on Freenode, named #dbhack1. There are instructions for finding and using an IRC client and for connecting in the help pages.
    • As an alternative to a desktop client, you can use a web client such as Mibbit, which requires no download or installation of software. For Mibbit, click 'Start chatting now' on the front page, then enter the connection parameters as in the screenshot (Mibbit parameters screen.png), substituting your desired nickname for 'your_nick', and hit 'Go'. On the page that opens, click the #dbhack1 tab if it isn't in the front already.
  • To enable micro-blogging of the event and as a tool for aggregating content, we have created a Friendfeed room short-named dbhack1. If you have difficulty joining the room, or if you have a content stream we should add, let me know. Being Friendfeed, the room also has an RSS feed, in case you want to add it to your favorite feed reader.
  • As listed under resources, we encourage you to use the tag dbhack1 on social tagging sites (such as Connotea, CiteULike, Del.icio.us) to tag online resources or papers relevant to the event.

Organization

Organizing Committee: Hilmar Lapp, Katja Schulz, Arlin Stoltzfus, Todd Vision, Rutger Vos

Time & Venue: The hackathon is scheduled to take place from March 9 to 13, 2009 at NESCent in Durham, North Carolina.

Agenda: The agenda of the event will be posted here once developed by the participants.

Organizers' notes. (Note: these are for organizers only.)

Suggestions

  • Protein domain databases, like InterPro, Pfam/Rfam, PROSITE and ProDom, also have some kind of tree/taxonomic information. It might be worth inviting people from these projects? - Julie
  • Database Interop Implementations

Links & Resources

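Help with editing the wiki:

  • Online tutorial (http://www.nescent.org/informatics/wiki_tutorial.php), targeted at Working Groups, but the basics all apply
  • Collection of wiki editing tips, with links to further documentation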

Past NESCent-sponsored hackathons:

Relevant reading:

You can tag online resources, such as citations, articles, or other URLs, using social tagging sites. Please use the dbhack1 tag.