Difference between revisions of "Database Interop Hackathon"

From Evolutionary Informatics Working Group
(Conference calls)
(Quick links)
 
(41 intermediate revisions by 8 users not shown)
Line 1: Line 1:
 
<center><big>'''NESCent Hackathon on Evolutionary Database Interoperability'''</big></center>
 
<center><big>'''NESCent Hackathon on Evolutionary Database Interoperability'''</big></center>
  
== Synopsis ==
 
  
Even though there is a rich and meticulously curated variety of on-line databases of phylogenetic data, their holdings are only available in incompatible formats lacking explicit semantics, and programmable APIs for querying the data are often not provided. NESCent seeks to address the resulting obstacle to interoperability and data integration by sponsoring a hackathon that brings together data and metadata experts and developers from a number of  data providers with the developers of the emerging NeXML and CDAO standards.
+
{{HackHead}}
 +
 
 +
== Quick links ==
 +
 
 +
Hackathon-specific:
 +
* [[Hackathon Report]] with notes of our daily sessions and links to resources
 +
* [[Database Interop Hackathon/Use Cases|Use Cases]] (there is also a more general phyloinformatics [https://www.nescent.org/wg_phyloinformatics/Public:UseCases use case list])
 +
* The [[Database Interop Hackathon/Agenda|agenda]] and [[Database Interop Hackathon/Participants|participants list]] for the upcoming meeting
 +
* [[Subgroups]] - Links to all hackathon subgroups.
 +
* [[Database Interop Hackathon/Implementations|Implementations descriptions]] of participating data resources
 +
* Suggested [[Database Interop Deliverables|target projects]] (ideas page)
 +
* Examples of [[Database Interop Hackathon/Metadata Support|metadata support]] in [[NeXML]] using [[CDAO]] and other ontologies
 +
 
 +
Resources and planning:
 +
* Notes on [[Database Interop Hackathon/Teleconferences|teleconferences]] and [[Database Interop Hackathon/Pre-Meeting|organizer's pre-meeting]]
 +
* Home pages for [http://www.nexml.org NeXML] and [http://www.evolutionaryontology.org CDAO]
 +
* [[Database Interop Hackathon/Documentation|Documentation]] archive
 +
* [[Media:Evolution 2009 Poster HL.pdf| Evolutionary Data Leaping to Web 3.0: Some Highlights From NESCent's Third Hackathon]] (Poster presented at the 2009 Evolution Meetings in Moscow, Idaho)
 +
* suggestions for [[Database Interop Hackathon/Provider Overviews|provider overviews]]
 +
 
 +
[[Son of Evoinfo|Proposal page for a follow-up NESCent working group]]
  
 
== Motivation ==
 
== Motivation ==
  
There is a large variety of phylogenetic data resources in the form of on-line databases, providing data ranging from character state matrices (e.g., MorphBank, Morphobank), molecular sequence alignments (e.g., BAliBASE, PANDIT), phylogenetic trees (e.g., TreeBASE), gene or protein trees (e.g., TreeFAM, PhylomeDB), species trees (e.g., Tree of Life), gene families (e.g., PhyloFacts, HOVERGEN), to species taxonomies (e.g., NCBI Taxonomy, ITIS), and to analytic metadata such as divergence times (e.g., TimeTree). There is no existing common or unifying exchange format in which these data resources are available, and each of the resources boasts a variety of meticulously curated or computed metadata for its holdings that require expert knowledge and manual inspection to interpret. Furthermore, there is no common, predictable way for querying and obtaining the data, and in fact most of those resources don't provide any programmable on-line interface (API).
+
A variety of phylogenetic data resources are available in the form of on-line databases, providing data ranging from character state matrices (e.g., [[MorphBank]], [[MorphoBank]]), molecular sequence alignments (e.g., BAliBASE, [[PANDIT]]), phylogenetic trees (e.g., [[TreeBASE]]), gene or protein trees (e.g., TreeFAM, PhylomeDB), species trees (e.g., [[Tree of Life]]), gene families (e.g., PhyloFacts, HOVERGEN), to species taxonomies (e.g., NCBI Taxonomy, ITIS, [[PaleoDB]]), and to analytic metadata such as divergence times (e.g., [[TimeTree]]).
  
This situation presents a fundamental obstacle to integrating phylogenetic data and service providers into a network of interoperating services that consume and produce data in predictable, verifiable syntax and with explicit machine-interpretable semantics, key prerequisites to applying tools for resource discovery and for constructing or executing complex workflows. It also renders these resources resistant to large-scale data integration, for example for combining and cross-linking some of these resources with other data, such as genomic, phenotypic, or georeferenced specimen data. This is further exacerbated by the fact that existing commonly used standards for phylogenetic data such as NEXUS cannot fully represent the different data sources and their semantics in a consistent manner, further hindering efforts to overcome this situation because they depend on an exchange standard with sufficient syntactical and semantic expressivity.
+
Even though there is a rich and meticulously curated variety of on-line resources, their holdings are only available in incompatible formats lacking explicit semantics, and programmable APIs for querying the data are often not provided. NESCent seeks to address the resulting obstacle to interoperability and data integration by sponsoring a hackathon that brings together data and metadata experts and developers from a number of  data providers with the developers of the emerging [[NeXML]] and [[CDAO]] standards.
  
Recently, the development of the NeXML data exchange format and the CDAO ontology for comparative and phylogenetic data and analysis have provided a window of opportunity to apply both of these emerging standards towards solving some of these obstacles, while at the same time validating their ability to satisfy real-world needs that previously used standards (such as NEXUS) have not. Doing so would benefit the data providers by making their data more broadly useful, end-users by having access to a wide variety of phylogenetic data in a common, predictable format, and ultimately tool developers by defining a uniform way for giving their users instant access to a large swath of data. Furthermore, the recently started development of PhyloWS provides a first attempt at a uniform specification for a programmable phylogenetic data provider API.
+
=== The problem ===
  
NESCent seeks to take advantage of this opportunity by sponsoring a hackathon that brings together data and metadata experts and developers from several phylogenetic data providers with the developers of NeXML and CDAO. In addition, developers and end-users of phylogenetic data visualization and database integration projects will build demonstration projects and ensure the utility of the effort for research applications.
+
There is no existing common or unifying exchange format in which these data resources are available, and each of the resources boasts a variety of meticulously curated or computed metadata for its holdings that require expert knowledge and manual inspection to interpret. Furthermore, there is no common, predictable way for querying and obtaining the data, and in fact most of those resources don't provide any programmable on-line interface (API). This situation presents a fundamental obstacle to integrating phylogenetic data and service providers into a network of interoperating services that consume and produce data in predictable, verifiable syntax and with explicit machine-interpretable semantics, key prerequisites to applying tools for resource discovery and for constructing or executing complex workflows. It also renders these resources resistant to large-scale data integration, for example for combining and cross-linking some of these resources with other data, such as genomic, phenotypic, or georeferenced specimen data. This is further exacerbated by the fact that existing commonly used standards for phylogenetic data such as NEXUS cannot fully represent the different data sources and their semantics in a consistent manner, further hindering efforts to overcome this situation because they depend on an exchange standard with sufficient syntactical and semantic expressivity.
 +
 
 +
=== Activities of the EvoInfo working group ===
 +
 
 +
Recently, the development of the [[NeXML]] data exchange format and the [[CDAO]] ontology for comparative and phylogenetic data and analysis have provided a window of opportunity to apply both of these emerging standards towards solving some of these obstacles, while at the same time validating their ability to satisfy real-world needs that previously used standards (such as NEXUS) have not. Doing so would benefit the data providers by making their data more broadly useful, end-users by having access to a wide variety of phylogenetic data in a common, predictable format, and ultimately tool developers by defining a uniform way for giving their users instant access to a large swath of data. Furthermore, the recently started development of PhyloWS provides a first attempt at a uniform specification for a programmable phylogenetic data provider API.
 +
 
 +
=== The hackathon ===
 +
 
 +
NESCent seeks to take advantage of this opportunity by sponsoring a hackathon that brings together data and metadata experts and developers from several phylogenetic data providers with the developers of [[NeXML]] and [[CDAO]]. In addition, developers and end-users of phylogenetic data visualization and database integration projects will build demonstration projects and ensure the utility of the effort for research applications.
  
 
=== Acknowledgements ===
 
=== Acknowledgements ===
Line 19: Line 46:
 
Many of the ideas for this hackathon arose from, and are a continuation of, the activities of NESCent's [[Main Page|Evolutionary Informatics Working Group]]. Specifically, [[Future Data Exchange Standard|NeXML]], [[CDAO]], and [[PhyloWS]] are products of the group, and the motivation for this hackathon is a distillation of the [[CarrotBase]] ideas and concepts, which essentially served as a whitepaper for this event.
 
Many of the ideas for this hackathon arose from, and are a continuation of, the activities of NESCent's [[Main Page|Evolutionary Informatics Working Group]]. Specifically, [[Future Data Exchange Standard|NeXML]], [[CDAO]], and [[PhyloWS]] are products of the group, and the motivation for this hackathon is a distillation of the [[CarrotBase]] ideas and concepts, which essentially served as a whitepaper for this event.
  
== Specific objectives ==
+
== Objectives ==
  
For a list of possible deliverables, see [[Database Interop Implementations]]. We are also [[Database Interop Hackathon/Use Cases|collecting use-cases]].
+
For a more specific list of possible deliverables, see [[Database Interop Deliverables]]. We are also [[Database Interop Hackathon/Use Cases|collecting use-cases]].
  
 
The following broad objectives have been identified. Participants of the hackathon will refine these and distill concrete work targets from them in advance of and at the event.
 
The following broad objectives have been identified. Participants of the hackathon will refine these and distill concrete work targets from them in advance of and at the event.
# Unify the data format using NeXML:
+
# Unify the data format using [[NeXML]]:
 
#* Define and implement a transformation path from the native data format of the participating data providers to NeXML.
 
#* Define and implement a transformation path from the native data format of the participating data providers to NeXML.
 
#* Document mappings, gaps, and ambiguities, and resolve those at the event as much as possible, or lay out ways for future resolution.
 
#* Document mappings, gaps, and ambiguities, and resolve those at the event as much as possible, or lay out ways for future resolution.
# Unify the data semantics using CDAO
+
# Unify the data semantics using [[CDAO]]
 
#* Define comprehensive mappings between the metadata of the participating data providers to CDAO terms.
 
#* Define comprehensive mappings between the metadata of the participating data providers to CDAO terms.
 
#* Extend CDAO with (possibly provisional?) terms as much as possible.
 
#* Extend CDAO with (possibly provisional?) terms as much as possible.
Line 33: Line 60:
 
# Unify programmable data provider API
 
# Unify programmable data provider API
 
#* Complete the PhyloWS specification for RESTful data access and querying.
 
#* Complete the PhyloWS specification for RESTful data access and querying.
#* Document NeXML and CDAO needs for specifying metadata queries through PhyloWS.
+
#* Document [[NeXML]] and CDAO needs for specifying metadata queries through PhyloWS.
 
# Create demonstration projects that take advantage of the unified data formats and/or semantics.
 
# Create demonstration projects that take advantage of the unified data formats and/or semantics.
 
#* Database that integrates all participating data providers.
 
#* Database that integrates all participating data providers.
Line 50: Line 77:
  
 
;Hackathon participants:
 
;Hackathon participants:
 +
* [[Database Interop Hackathon/Teleconferences#Teleconference June 29, 2009|Teleconference Jun 29, 2009]]
 +
* [[Database Interop Hackathon/Teleconferences#Teleconference June 23, 2009|Teleconference Jun 23, 2009]]
 +
* [[Database Interop Hackathon/Teleconferences#Teleconference Feb 26, 2009|Teleconference Feb 26, 2009]]
 
* [[Database Interop Hackathon/Teleconferences#Teleconference Feb 20, 2009|Teleconference Feb 20, 2009]]
 
* [[Database Interop Hackathon/Teleconferences#Teleconference Feb 20, 2009|Teleconference Feb 20, 2009]]
  
'''NOTES'''
+
;[[NeXML]]:
;PARTICIPANTS:
+
* [[Future_Data_Exchange_Standard/Teleconferences#Teleconference.2C_Feb_3.2C_2009| Teleconference Feb 3, 2009]]
* Brandon Chisham, NMSU, CDAO Project
 
* Jim Balhoff, NESCent, Phenoscape
 
* Enrico Pontelli, NMSU, CDAO Project
 
* Ryan Scherle, NESCent, DRYAD
 
* Rutger Vos, University of British Columbia, NEXML
 
* Hilmar Lapp, NESCent, Co-organizer, designer PhyloWS
 
* Arlin Stoltzfus, NIST, Co-organizer, CDAO project
 
* Jeet Sukumaran, University of Kansas, NEXUS
 
* Peter Midford, University of Kansas, Mesquite
 
* Karen Cranston, University of Arizona, Encyclopedia of Life
 
* Dam Donnelly, U. Pennsylvania, pPOD
 
* Sheldon McKay, modENCODE
 
* Mark Jensen, Fortinbras Research, clinical analysis of sequence data from pathogen
 
* Bill Piel, Yale, TreeBase, iPlant
 
* Lucie Chan, San Diego Supercomputing, MorphoBank
 
* Vivek Gopalan, NCBI,
 
  
;WELCOME MESSAGE (Arlin Stoltzfus)
+
=== Subgroups ===
* Notes will be sent out
 
* Agenda has been sent out
 
 
 
;INTRODUCTIONS (All Participants)
 
 
 
;ROADMAP (Hilmar Lapp)
 
* Kickoff teleconference,
 
* Overview standards
 
* Need to gather some information, important activity over  the next 2 weeks
 
** Data providers?
 
** Use cases to guide?
 
* Dave Clemmens has a page developed with Karla
 
* A pre-meeting is taking place to prepare standards for the hackathon
 
**More messages over weekend as the premeeting develops
 
* 2 more telecons; next one does not need to be a replication, instead discuss more technical issues and gather info for use cases (same for third)
 
* MORNING time for next one (for UK folks)
 
* QUESTIONS/SUGGESTIONS?
 
**No questions
 
 
 
;STANDARDS
 
* 3 technologies (phyloWS, NEXML, CDAO); all are outcomes of  the evoinfo working group
 
* The working group has run for 2 years and started out to address interoperability issues; it began with brainstorming for ideas (e.g., an integrated data resource); we settled on specific technologies to facilitate interop: one data standard, one ontology, one interface for web services. The hackathon is the last meeting of the working group. Thus, this is the time and place to put the technology to the test.
 
* NEXML: (Rutger Vos)
 
** New XML standard, inspired by the NEXUS format, which lots of applications use; many data resources also use NEXUS (as data input or as a serialization format)
 
** NEXUS has issues, dialects, incompatibilities; we want a new standard that is formally developed and can be validated.
 
** There is a NEXML.org website. It contains the XML schema and some I/O libraries (Java, Python, JavaScript; C++ is still in progress); on the other hand, there does not seem to be strong interest in C++.
 
** It sounds like a useful technology, more reliable exchange of data, we can use it for data exchange for web services; some advantages over previous standards.
 
** QUESTIONS?
 
*** What is the current level of support?  There are some libraries provided: Perl, included in BioPerl, thus BioPerl supports it; Java is used by Mesquite; Phenoscape uses its own; Jeet is working on a library for Python.
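
To make the idea of a structured, checkable exchange format concrete, here is a minimal sketch in Python. It is not taken from any of the NeXML libraries mentioned above. The element names (<code>otus</code>, <code>otu</code>, <code>trees</code>, <code>tree</code>, <code>node</code>, <code>edge</code>) are modeled loosely on NeXML, but the sketch declares no namespaces and is not validated against the schema, so treat the output as NeXML-like rather than as valid NeXML. The function name <code>tree_to_xml</code> and all ids are made up for illustration.

```python
import xml.etree.ElementTree as ET


def tree_to_xml(taxa, edges):
    """Build a simplified, NeXML-like document (assumption: no namespaces,
    no schema validation) from taxon labels and (source, target) node pairs."""
    root = ET.Element("nexml")
    otus = ET.SubElement(root, "otus", id="taxa1")
    for i, label in enumerate(taxa, start=1):
        ET.SubElement(otus, "otu", id=f"t{i}", label=label)

    trees = ET.SubElement(root, "trees", id="trees1", otus="taxa1")
    tree = ET.SubElement(trees, "tree", id="tree1")

    # Every node id mentioned in an edge becomes a <node> element.
    node_ids = sorted({n for edge in edges for n in edge})
    for node_id in node_ids:
        ET.SubElement(tree, "node", id=node_id)
    for i, (source, target) in enumerate(edges, start=1):
        ET.SubElement(tree, "edge", id=f"e{i}", source=source, target=target)
    return root


doc = tree_to_xml(["Homo sapiens", "Pan troglodytes"],
                  [("n1", "n2"), ("n1", "n3")])
xml_text = ET.tostring(doc, encoding="unicode")

# Re-parsing only checks well-formedness; real validation would be done
# against the XML schema published on the NeXML website.
reparsed = ET.fromstring(xml_text)
```

Unlike a free-form text dialect, a document like this can be checked mechanically before it is exchanged, which is the property the notes above emphasize.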
 
* CDAO (Arlin Stoltzfus)
 
** Ontology that addresses the application area of comparative  data analysis; implemented in OWL
 
** OWL offers good control and formal structuring for the ontology
 
** CDAO formalizes knowledge/semantics; it is useful for interoperability, to resolve ambiguities using semantics. For example, the Sequence Ontology has been used with similar objectives in the case of sequence data. Different sequence databases use the General Feature Format (GFF) but with a focus on syntax; this led to incompatible definitions of certain terms (e.g., open reading frame: in some instances it is viewed as including a stop codon, in other instances it is not; the Sequence Ontology made it possible to clarify this ambiguity by creating two separate concepts).
 
** Similar benefits can be gained in phylogenetic analysis: for example, in the problem of tree reconciliation. There are many tools, each imposing different requirements on the input tree (e.g., completely resolved or not). These distinctions on the inputs are often semantical, not based on syntax.
 
** A formal ontology also allows access to reasoners, which can be used for validation of concepts
 
** Note that formal ontologies are meant to be machine understandable, not necessarily to be used manually.
 
** QUESTIONS?
 
***Are there tools to generate it? Are there tools to formalize description of an analysis? Yes, there are formalisms to formally describe a biomedical  analysis or protocol, and they can be instantiated using a domain specific ontology. This is the case of OBI or FUGO  (as general ontologies for describing protocols) and BioMoby (as a domain specific ontology)
 
***Comment on workflow languages: there are systems that support  phylogenetic workflows; in Kepler there are mechanisms to  introduce annotations (e.g., on the inputs and outputs) and  these will be used to type check the workflow. But they are  not widely used.
 
***I am new to all these ontologies; how does one connect different ontologies together into the same application? That can be done: ontologies can import other ontologies. CDAO includes an external ontology for amino acids and allows external ontologies to describe different types of characters.
 
 
 
* PhyloWS (Hilmar Lapp)
 
** It is the youngest of the three standards; one year old
 
** Developed at the BioHackathon in Japan
 
** Focused on web services
 
** Obstacle: rich diversity of data resources (digital ones) accessible online, yet designed for human consumers; the metadata could be valuable but are not machine accessible
 
** Some people are forced to do complex tasks to extract knowledge (e.g., HTML screen scraping)
 
** There is a lack of programmable interfaces, and this is an obstacle to interoperability
 
** A programmable interface is aimed at Predictability and Interpretability, and these two aspects build on the two previously proposed standards (NEXML and CDAO)
 
** Predictability: how to access data holdings, search data holdings, query interfaces, how to access individual items and resources (e.g., one tree in TreeBase, one alignment in an Alignment database) and how are these data returned. NEXML provides a solution to some of these issues by offering a standard interchange format.
 
** Interpretability: how do I use the data returned? What is the  meaning? CDAO represents a solution to this aspect.
 
** If all these online data resources implement a standard  web interface, these tasks become easy, it is simple to write widgets to embed in other web pages or applications, or create large systems (e.g., in Kepler, Mesquite) that can pull data from resources and they know what to do with them.
 
** QUESTIONS?
 
*** Is PhyloWS implemented? Or can it be implemented but something is missing?  Yes and no. First of all, it is partially implemented: there is a prototype for Tree of Life, and you can obtain ToL trees through PhyloWS and a REST interface. However, there are parts of the specification that need to be fleshed out (and we will work on this at the pre-meeting).
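
As an illustration of the "predictability" point above, the following Python sketch builds the address of a single item (e.g., one tree) in a uniform way. It is only a sketch: the PhyloWS specification was still incomplete at the time of these notes, so the path layout (<code>base/type/identifier</code>), the <code>format</code> query parameter, the function name <code>build_resource_url</code>, and the <code>example.org</code> base URL are all assumptions made for illustration, not part of PhyloWS.

```python
from urllib.parse import quote, urlencode


def build_resource_url(base_url, resource_type, identifier, fmt="nexml"):
    """Return a URL for one item of a data resource.

    Assumed (hypothetical) layout: <base>/<resource_type>/<identifier>?format=<fmt>
    """
    path = "/".join([
        base_url.rstrip("/"),
        quote(resource_type, safe=""),
        quote(identifier, safe=""),   # percent-encode anything unsafe in the id
    ])
    return f"{path}?{urlencode({'format': fmt})}"


# Hypothetical provider and identifier, for illustration only.
url = build_resource_url("http://example.org/phylows/", "tree", "tree42")
print(url)
```

If every provider followed one such convention, a client could fetch a tree from any of them with the same few lines, which is the kind of reuse the notes describe for widgets and workflow systems.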
 
 
 
;INVENTORY (Arlin Stoltzfus)
 
* We would like to think about possibilities, as a prelude to collecting information on data and capabilities
 
* Putting together data standards and web services, we can connect data resources that are now disconnected. For example, Treebase may want to pull in other data; if there are semantic mappings between schemas, this becomes possible, perhaps through web services, with data transmitted in NEXML. Or we can describe web services to provide access to TreeFAM data sets from EMBL. Or enable existing tools to access sophisticated data matrix viewers, like mx or nexplorer, just by producing data in some standard format. We can integrate resources; e.g., Rutger used ToL queries and then went to TimeTree to get dates for the trees (this is an interactive user interface), and a service combines and integrates them.
 
* We need to think about this; we need an inventory of the inputs and outputs supported by different data resources (represented by you, the participants). We want to create a network, where nodes are data resources and links are shared data types. If you export a character matrix and someone imports a character matrix, then there is a potential link in the graph and an opportunity for interoperation. The links are possible, but they may be theoretical and not practical (there may be format compatibility issues or the lack of a robust interface). We want to propose solutions for this. Please help us to create this graph.
 
*  After the telecon we will create a form or a shared spreadsheet on googledocs, and we will summarize that in a graph. Your data will become a part of a network of data resources. You will get this after the telecon and we invite you to fill in the required information.
 
*  Another thing we want to hear about is use cases or wish lists. We have a use case wiki and Dave Clemmens has done work on it.  Please suggest use cases that would shape the hackathon to be more concrete.
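
The inventory graph described above can be sketched in a few lines of Python: resources are nodes, and a directed link exists wherever one resource exports a data type that another imports. The resource names and data types below are placeholders, not results of the actual inventory, and the function name <code>potential_links</code> is made up for this example.

```python
from itertools import product

# Hypothetical inventory entries (placeholders, not survey results).
inventory = {
    "ResourceA": {"exports": {"tree"}, "imports": {"character matrix"}},
    "ResourceB": {"exports": {"character matrix", "alignment"}, "imports": set()},
    "ResourceC": {"exports": set(), "imports": {"tree", "alignment"}},
}


def potential_links(inventory):
    """Return sorted (exporter, importer, data_type) triples.

    A triple means the exporter produces a data type the importer consumes,
    i.e. a potential (not necessarily practical) opportunity to interoperate.
    """
    links = []
    for exporter, importer in product(inventory, repeat=2):
        if exporter == importer:
            continue
        shared = inventory[exporter]["exports"] & inventory[importer]["imports"]
        links.extend((exporter, importer, data_type) for data_type in shared)
    return sorted(links)


links = potential_links(inventory)
for exporter, importer, data_type in links:
    print(f"{exporter} -> {importer} via {data_type}")
```

The links found this way are only potential ones; as noted above, format incompatibilities or a missing interface can still make a link impractical.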
 
  
 +
Participants will split into subgroups at the event. The composition and tasks of the subgroups will be guided by the overall objectives, but will otherwise emerge and be self-determined by the participants prior to and at the event.
  
;GENERAL QUESTIONS:
+
=== Information gathering ===
* I am new at this; is the focus on trees? Or data repositories? The high-level structure of trees? What is the granularity? The focus is on evolutionary data, more specifically phylogenetic data; it includes trees, taxonomies, species taxonomies, character state matrices, discrete or continuous characters, sequence alignments, maybe transition models.  In the wider context, this is only part of the picture; metadata are also important and can be linked to nodes of the tree, ranging from gene functional data and gene locations to biodiversity data. The focus is on making phylogenetic data available in standard formats so that outside users can access these metadata.  Are we going to worry about how to enable linking phylogenetic data to other data or vice versa? Is this too far removed from the hackathon? Maybe some of these are down the road, but they are in the realm of workflows, and they are in scope.
 
  
* What will the event look like? There will be lots of room for creativity, not assignments; people get together and use resources according to their interest.  Try to focus on certain activities; this is more profitable to you and the people you group with.
+
The form for entering input and output data is on a [[Database_Interop_Hackathon/ResourceIOForm|separate page]].
  
* I have not been there before: what is the typical day and how does it change over the period?  The typical day will include programming: working on a specific task that a group has determined to be worthwhile. The people at the event make tasks feasible. The event is self-organizing and self-emerging. We will try to have lots of discussion on the mailing list before the event, but this is only for coordination and information gathering. We will not be telling people what to do. Subgroups will emerge and assume charges. The first morning will be devoted to forming the groups.  Some people will be developing documentation. We may have some bootcamps; we will try to sense which ones from the mailing list discussion.  This is not a workshop where people get up and talk and brainstorm. It is very different.  Some of these goals will form over the next two weeks. You may want to start thinking about which participants you want to work with. Note that the wiki is open to everyone to edit: just request an account and you can then edit. Send email to help@nescent.org (you can get an account on the evoinfo or hackathon wiki). There is also an online tutorial for using wikis. We will put the link somewhere.
+
=== Documentation ===
  
 +
''Note: this may eventually become its own subgroup.''
  
 
+
* All [[Database Interop Hackathon/Documentation|hackathon-related documentation]]
 
+
* Documentation of [[Database Interop Hackathon/Metadata Support|metadata support]] with explicit semantics
;NeXML:
 
* [[Future_Data_Exchange_Standard/Teleconferences#Teleconference.2C_Feb_3.2C_2009| Teleconference Feb 3, 2009]]
 
 
 
=== Subgroups ===
 
 
 
Participants will split into subgroups at the event. The composition and tasks of the subgroups will be guided by the overall objectives, but will otherwise emerge and be self-determined by the participants prior to and at the event.
 
  
 
== Participate ==
 
== Participate ==
Line 158: Line 105:
 
* The [[Database Interop Hackathon/Participants|list of on-site participants]] has been finalized as of mid-February.
 
* The [[Database Interop Hackathon/Participants|list of on-site participants]] has been finalized as of mid-February.
 
* To better enable remote participation and pre-hackathon interactions, we have created an IRC channel on [http://freenode.net Freenode], named #dbhack1. There are [[Help:Contents#Internet_Relay_Chat_.28IRC.29|instructions for finding and using an IRC client and for connecting]] in the help pages.
 
* To better enable remote participation and pre-hackathon interactions, we have created an IRC channel on [http://freenode.net Freenode], named #dbhack1. There are [[Help:Contents#Internet_Relay_Chat_.28IRC.29|instructions for finding and using an IRC client and for connecting]] in the help pages.
** [[Image:Mibbit parameters screen.png|right]] Alternatively to a desktop client, you can use a web-client such as [http://mibbit.com Mibbit], which requires no download or installation of software. For Mibbit, on the front page click 'Start chatting now', and then enter the connection parameters as in the screenshot (substituting your desired nickname for 'your_nick' and hit 'Go'. On the page that opens, click the #dbhack1 tab if it isn't in the front already.
+
** [[Image:Mibbit parameters screen.dbhack1.png|right]] Alternatively to a desktop client, you can use a web-client such as [http://mibbit.com Mibbit], which requires no download or installation of software. For Mibbit, on the front page click 'Start chatting now', and then enter the connection parameters as in the screenshot (substituting your desired nickname for 'your_nick') and hit 'Go'. On the page that opens, click the #dbhack1 tab if it isn't in the front already.
* To enable micro-blogging of the event and as a tool for aggregating content, we have created a [http://friendfeed.com/rooms/dbhack1 Friendfeed room short-named dbhack1]. If you have difficulty joining the room let me know, and if you have a content stream we should add let me know. Being Friendfeed, the room also has an RSS feed, in case you want to add it to your favorite feed reader.
+
* To enable micro-blogging of the event and as a tool for aggregating content, we have created a [http://friendfeed.com/rooms/dbhack1 Friendfeed room short-named dbhack1]. If you have difficulty joining the room or have a content stream we should add, let me know. Being Friendfeed, the room also has an RSS feed, in case you want to add it to your favorite feed reader.
* As listed under [[Database_Interop_Hackathon#Links_.26_Resources|resources]], we encourage you to use the tag dbhack1 on social tagging sites ( such as [http://connotea.org/tag/dbhack1 Connotea],  [http://citeulike.org/tag/dbhack1 CiteULike],  [http://del.icio.us/tag/dbhack1 Del.icio.us]) to tag online resources or papers relevant to the event.
+
* As listed under [[Database_Interop_Hackathon#Links_.26_Resources|resources]], we encourage you to use the tag dbhack1 on social tagging sites (such as [http://connotea.org/tag/dbhack1 Connotea],  [http://citeulike.org/tag/dbhack1 CiteULike],  [http://del.icio.us/tag/dbhack1 Del.icio.us]) to tag online resources or papers relevant to the event.
  
 
== Organization ==
 
== Organization ==
Line 169: Line 116:
  
 
'''Agenda:''' The [[Database Interop Hackathon/Agenda|agenda of the event]] will be posted here once developed by the participants.
 
'''Agenda:''' The [[Database Interop Hackathon/Agenda|agenda of the event]] will be posted here once developed by the participants.
 +
 +
'''Logistics:'''
 +
* Participants stay at the [http://www.duketower.com/ Duke Tower Hotel & Condominiums] in Durham (807 West Trinity Avenue, Durham, NC  27701, telephone:  866-385-3869 or 919-687-4444).
 +
* Breakfast options are the Tower Cafe at Duke Tower (continental breakfast), [http://www.wholefoodsmarket.com/stores/durham/ Whole Foods], [http://www.madhatterbakeshop.com/menu/breakfast/ Mad Hatter's], and [http://www.beantraderscoffee.com/locations.html Bean Trader's] on 9th Street.
  
 
[[Org:Database_Interop_Hackathon/Agenda|Organizers' notes]]. (''Note: these are for organizers only.'')
 
[[Org:Database_Interop_Hackathon/Agenda|Organizers' notes]]. (''Note: these are for organizers only.'')
  
== Suggestions ==
+
== Demonstration projects, sample code, ideas ==
 +
 
 +
=== Miscellaneous ===
 +
 
 +
* There are some  suggested project ideas from our pre-meetings.
 +
* see [[Database_Interop_Hackathon/Target_Projects]]
 +
 
 +
=== Integrating ToL and TimeTree to get trees with dates ===
 +
 
 +
Rutger Vos has written a mashup showing how to use [[PhyloWS]] services to integrate data from two sources.  Specifically it gets a tree from [[Tree of Life]], and then assigns dates to the nodes using [[TimeTree]].
 +
 
 +
The code for both of these is in the nexml svn repository on
 +
sourceforge in the phylows/ subtree:
 +
 
 +
http://nexml.svn.sourceforge.net/viewvc/nexml/trunk/nexml/phylows/
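
The following Python sketch illustrates the general idea of such a mashup: take a tree from one source and attach divergence dates from another. It is not Rutger's code and calls no real service. The nested-dictionary tree, the pairwise lookup table, the function names, and the numeric ages are all placeholders made up for illustration; in the real mashup the tree and the dates would come from the two web services.

```python
def leaves(node):
    """Return the names of all leaves below (or at) this node."""
    children = node.get("children", [])
    if not children:
        return [node["name"]]
    result = []
    for child in children:
        result.extend(leaves(child))
    return result


def annotate_ages(node, divergence_times):
    """Attach an 'age' to each internal node.

    Assumption: the age of an internal node is looked up as the divergence
    time between one leaf from its first child and one from its second child.
    Nodes without a matching entry get age None.
    """
    children = node.get("children", [])
    if len(children) >= 2:
        pair = frozenset((leaves(children[0])[0], leaves(children[1])[0]))
        node["age"] = divergence_times.get(pair)
    for child in children:
        annotate_ages(child, divergence_times)
    return node


# Placeholder tree and placeholder divergence times (not real data).
tree = {"name": "root", "children": [
    {"name": "AB", "children": [{"name": "A"}, {"name": "B"}]},
    {"name": "C"},
]}
times = {frozenset(("A", "B")): 10.0, frozenset(("A", "C")): 25.0}

annotate_ages(tree, times)
```

Keying the lookup on an unordered pair of taxa means the order in which the two leaves are found does not matter.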
  
* Protein domain databases, like Interpro, Pfam/Rfam, Prosite and Prodom also have some kind of tree/taxonomic information. It might be worth inviting people from these projects? - Julie
+
<!--
* [[Database Interop Implementations]]
+
We're back, baby!
 +
Note that the TimeTree wrapper is currently broken, as it relies on
 +
screen scraping and the TimeTree website has recently changed. This is
 +
no fault of my own, it simply vindicates the argument for machine
 +
readable interfaces :)
 +
-->
  
 
== Links & Resources ==
 
== Links & Resources ==
 +
 +
Help with editing the wiki:
 +
* [http://www.nescent.org/informatics/wiki_tutorial.php Online tutorial] (targeted at Working Groups, but the basics all apply)
 +
* [[Help:Wiki|Collection of wiki editing tips]], with links to further documentation
  
 
Past NESCent-sponsored hackathons:
 
Past NESCent-sponsored hackathons:
Line 190: Line 164:
 
* [http://citeulike.org/tag/dbhack1 CiteULike]
 
* [http://citeulike.org/tag/dbhack1 CiteULike]
 
* [http://del.icio.us/tag/dbhack1 Del.icio.us]
 
* [http://del.icio.us/tag/dbhack1 Del.icio.us]
 +
 +
== Photos ==
 +
 +
<gallery>
 +
Image:IMG_1631.JPG
 +
Image:IMG_1632.JPG
 +
Image:IMG_1634.JPG
 +
Image:IMG_1635.jpg
 +
Image:IMG_1636.JPG
 +
Image:IMG_1637.JPG
 +
Image:IMG_1638.JPG
 +
Image:IMG_1639.JPG
 +
Image:IMG_1640.JPG
 +
Image:IMG_1641.JPG
 +
Image:IMG_1642.jpg
 +
Image:IMG_1643.JPG
 +
Image:IMG_1645.JPG
 +
Image:IMG_1646.JPG
 +
Image:FirstNight.JPG
 +
Image:FirstNight2.JPG
 +
Image:Meeting1.JPG
 +
Image:RogerPresenting.JPG
 +
Image:SheldonPresenting.JPG
 +
</gallery>
  
 
[[Category:DB Interop Hackathon]]
 
[[Category:DB Interop Hackathon]]
[[Category:Nexml]]
+
[[Category:NeXML]]
 
[[Category:CDAO]]
 
[[Category:CDAO]]

Latest revision as of 18:47, 5 February 2011

NESCent Hackathon on Evolutionary Database Interoperability


Quick links

Hackathon-specific:

Resources and planning:

Proposal page for a follow-up NESCent working group

Motivation

A variety of phylogenetic data resources are available in the form of on-line databases, providing data ranging from character state matrices (e.g., MorphBank, MorphoBank), molecular sequence alignments (e.g., BAliBASE, PANDIT), phylogenetic trees (e.g., TreeBASE), gene or protein trees (e.g., TreeFAM, PhylomeDB), species trees (e.g., Tree of Life), gene families (e.g., PhyloFacts, HOVERGEN), to species taxonomies (e.g., NCBI Taxonomy, ITIS, PaleoDB), and to analytic metadata such as divergence times (e.g., TimeTree).

Even though there is a rich and meticulously curated variety of on-line resources, their holdings are only available in incompatible formats lacking explicit semantics, and programmable APIs for querying the data are often not provided. NESCent seeks to address the resulting obstacle to interoperability and data integration by sponsoring a hackathon that brings together data and metadata experts and developers from a number of data providers with the developers of the emerging NeXML and CDAO standards.

The problem

There is no existing common or unifying exchange format in which these data resources are available, and each of the resources boasts a variety of meticulously curated or computed metadata for its holdings that requires expert knowledge and manual inspection to interpret. Furthermore, there is no common, predictable way of querying and obtaining the data, and in fact most of those resources don't provide any programmable on-line interface (API). This situation presents a fundamental obstacle to integrating phylogenetic data and service providers into a network of interoperating services that consume and produce data in predictable, verifiable syntax and with explicit machine-interpretable semantics, key prerequisites to applying tools for resource discovery and for constructing or executing complex workflows. It also renders these resources resistant to large-scale data integration, for example for combining and cross-linking some of these resources with other data, such as genomic, phenotypic, or georeferenced specimen data. The situation is exacerbated by the fact that existing commonly used standards for phylogenetic data, such as NEXUS, cannot fully represent the different data sources and their semantics in a consistent manner. This hinders efforts to overcome the situation, because those efforts depend on an exchange standard with sufficient syntactic and semantic expressivity.

Activities of the EvoInfo working group

Recently, the development of the NeXML data exchange format and the CDAO ontology for comparative and phylogenetic data and analysis has provided a window of opportunity to apply both of these emerging standards towards solving some of these obstacles, while at the same time validating their ability to satisfy real-world needs that previously used standards (such as NEXUS) have not. Doing so would benefit the data providers by making their data more broadly useful, end-users by giving them access to a wide variety of phylogenetic data in a common, predictable format, and ultimately tool developers by defining a uniform way of giving their users instant access to a large swath of data. Furthermore, the recently started development of PhyloWS provides a first attempt at a uniform specification for a programmable phylogenetic data provider API.

The hackathon

NESCent seeks to take advantage of this opportunity by sponsoring a hackathon that brings together data and metadata experts and developers from several phylogenetic data providers with the developers of NeXML and CDAO. In addition, developers and end-users of phylogenetic data visualization and database integration projects will build demonstration projects and ensure the utility of the effort for research applications.

Acknowledgements

Many of the ideas for this hackathon arose from, and are a continuation of, the activities of NESCent's Evolutionary Informatics Working Group. Specifically, NeXML, CDAO, and PhyloWS are products of the group, and the motivation for this hackathon is a distillation of the CarrotBase ideas and concepts, which essentially served as a whitepaper for this event.

Objectives

For a more specific list of possible deliverables, see Database Interop Deliverables. We are also collecting use-cases.

The following broad objectives have been identified. Participants of the hackathon will refine these and distill concrete work targets from them in advance of and at the event.

  1. Unify the data format using NeXML:
    • Define and implement a transformation path from the native data format of the participating data providers to NeXML.
    • Document mappings, gaps, and ambiguities, and resolve those at the event as much as possible, or lay out ways for future resolution.
  2. Unify the data semantics using CDAO:
    • Define comprehensive mappings between the metadata of the participating data providers and CDAO terms.
    • Extend CDAO with (possibly provisional?) terms as much as possible.
    • Identify and document a procedure for other data providers with semantics not currently represented within CDAO.
  3. Unify the programmable data provider API:
    • Complete the PhyloWS specification for RESTful data access and querying.
    • Document NeXML and CDAO needs for specifying metadata queries through PhyloWS.
  4. Create demonstration projects that take advantage of the unified data formats and/or semantics.
    • Database that integrates all participating data providers.
    • PhyloWS implementation on top of an integrated database.
    • Interactive tool that visualizes and navigates across the breadth of data.

The hackathon concentrates on writing code. All code and documentation will be made available immediately and freely to the community under an open-source (OSI-approved) license.
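
As a rough illustration of the first objective above (a transformation path from a provider's native format toward NeXML), the following Python sketch turns a simple parent/child edge table into an XML tree description. It is a minimal sketch under stated assumptions, not a NeXML serializer: the edge-table input format and the function name edges_to_xml are hypothetical, and the element and attribute names (tree, node, edge, source, target, length) are only loosely modeled on NeXML's node-and-edge representation and have not been validated against the NeXML schema.

```python
import xml.etree.ElementTree as ET


def edges_to_xml(tree_id, edges):
    """Build a <tree> element from (parent, child, length) tuples.

    The input format is a hypothetical stand-in for a provider's
    native tree table; the output is NeXML-like, not schema-valid NeXML.
    """
    tree = ET.Element("tree", {"id": tree_id})

    # Collect node identifiers in order of first appearance.
    seen = []
    for parent, child, _length in edges:
        for label in (parent, child):
            if label not in seen:
                seen.append(label)

    # Emit one <node> per identifier, then one <edge> per table row.
    for label in seen:
        ET.SubElement(tree, "node", {"id": label, "label": label})
    for i, (parent, child, length) in enumerate(edges, start=1):
        ET.SubElement(
            tree,
            "edge",
            {"id": f"e{i}", "source": parent, "target": child, "length": str(length)},
        )
    return tree


# Placeholder data: three nodes, two edges, made-up branch lengths.
doc = edges_to_xml("tree1", [("n1", "n2", 0.5), ("n1", "n3", 1.25)])
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

A real transformation would also need to carry the provider's metadata along, which is where the mappings to CDAO terms in the second objective come in.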

Activities

Pre-meeting

In order to best prepare for the main event, the participating standards are holding a pre-meeting on Feb 20-22, also on-site at NESCent in Durham, NC.

Conference calls

Hackathon participants
NeXML

Subgroups

Participants will split into subgroups at the event. The composition and tasks of the subgroups will be guided by the overall objectives, but will otherwise emerge and be self-determined by the participants prior to and at the event.

Information gathering

The form for entering input and output data is on a separate page.

Documentation

Note: this may eventually become its own subgroup.

Participate

  • On-site participation was arranged by invitation and by self-nomination through an Open Call for Participation, followed by review.
  • The list of on-site participants has been finalized as of mid-February.
  • To better enable remote participation and pre-hackathon interactions, we have created an IRC channel on Freenode, named #dbhack1. There are instructions for finding and using an IRC client and for connecting in the help pages.
    • Mibbit parameters screen.dbhack1.png
      As an alternative to a desktop client, you can use a web client such as Mibbit, which requires no download or installation of software. For Mibbit, click 'Start chatting now' on the front page, then enter the connection parameters as in the screenshot (substituting your desired nickname for 'your_nick') and hit 'Go'. On the page that opens, click the #dbhack1 tab if it isn't in front already.
  • To enable micro-blogging of the event and as a tool for aggregating content, we have created a Friendfeed room short-named dbhack1. If you have difficulty joining the room or have a content stream we should add, let me know. Being Friendfeed, the room also has an RSS feed, in case you want to add it to your favorite feed reader.
  • As listed under resources, we encourage you to use the tag dbhack1 on social tagging sites (such as Connotea, CiteULike, Del.icio.us) to tag online resources or papers relevant to the event.

Organization

Organizing Committee: Hilmar Lapp, Katja Schulz, Arlin Stoltzfus, Todd Vision, Rutger Vos

Time & Venue: The hackathon is scheduled to take place from March 9 to 13, 2009 at NESCent in Durham, North Carolina.

Agenda: The agenda of the event will be posted here once developed by the participants.

Logistics:

Organizers' notes. (Note: these are for organizers only.)

Demonstration projects, sample code, ideas

miscellaneous

Integrating ToL and TimeTree to get trees with dates

Rutger Vos has written a mashup showing how to use PhyloWS services to integrate data from two sources. Specifically it gets a tree from Tree of Life, and then assigns dates to the nodes using TimeTree.

The code for both of these is in the nexml svn repository on sourceforge in the phylows/ subtree:

http://nexml.svn.sourceforge.net/viewvc/nexml/trunk/nexml/phylows/
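
The general idea of the mashup (take a tree from one source, then attach dates from another) can be sketched in a few lines of Python. This is not the code in the repository above and does not call any PhyloWS service: the tree and the date table are hard-coded placeholders, the taxon names and ages are made up, and the names Node, leaf_names, and assign_dates are hypothetical. The sketch assumes that dates are looked up by the set of leaf taxa below each internal node.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    age_mya: Optional[float] = None  # divergence time, filled in later


def leaf_names(node: Node) -> FrozenSet[str]:
    """Return the set of leaf names below (or at) this node."""
    if not node.children:
        return frozenset([node.name])
    names = set()
    for child in node.children:
        names |= leaf_names(child)
    return frozenset(names)


def assign_dates(node: Node, dates: Dict[FrozenSet[str], float]) -> int:
    """Set age_mya on every internal node whose leaf set has a known date.

    Returns the number of nodes that received a date.
    """
    if not node.children:
        return 0
    count = 0
    key = leaf_names(node)
    if key in dates:
        node.age_mya = dates[key]
        count = 1
    for child in node.children:
        count += assign_dates(child, dates)
    return count


# Placeholder tree ((A,B),C), standing in for a tree fetched from a tree source.
tree = Node("root", [Node("AB", [Node("A"), Node("B")]), Node("C")])

# Placeholder ages (made-up numbers), standing in for a divergence-time source.
dates = {
    frozenset({"A", "B"}): 10.0,
    frozenset({"A", "B", "C"}): 25.0,
}

dated = assign_dates(tree, dates)
print(f"{dated} nodes dated; root age = {tree.age_mya}")
```

Keying the lookup on leaf sets keeps the two sources decoupled: the tree source does not need to know how the date source names its internal nodes, only that both agree on taxon names.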


Links & Resources

Help with editing the wiki:

Past NESCent-sponsored hackathons:

Relevant reading:

You can tag online resources, such as citations, articles, or other URLs, using social tagging sites. Please use the dbhack1 tag.

Photos