Difference between revisions of "Dbhack2 proposal"

From Evolutionary Informatics Working Group
 
= Proposal for a Phyloinformatics VoCamp =
 
  
'''Co-leaders'''. The group of organizers preliminarily consists of N. Cellinese, K. Cranston, H. Lapp, E. Pontelli, S. McKay, A. Stoltzfus, and R. Vos, and is chaired by A. Stoltzfus (corresponding PI).
The text of the proposal is at [[VoCamp1 Proposal]].
 
 
'''Synopsis'''.  Standard, interoperable, open and community-developed controlled vocabularies and ontologies that continuously evolve according to changing research demands are key to the ability to integrate and compute over data across fields, experimental systems, and protocols.  At a recent NESCent-sponsored hackathon on [[Database Interop Hackathon| Evolutionary Database Interoperability]], ontologies and vocabularies that meet the needs of diverse community resources and tools emerged as the key gap among the three components of the emerging EvoIO stack of interoperability standards.  Filling this gap on a sustainable basis requires a diverse community of domain experts, users, and stakeholders with a shared awareness of, and commitment to, knowledge-based standards.  To begin building such a community, we propose a <span class="plainlinks">[http://vocamp.org/wiki/Main_Page "VoCamp"]</span>-style meeting for investigators to create and develop ontologies and lightweight vocabularies in support of integration and semantic cross-linking of evolutionary data with its many related fields.  As ontologies are being developed in many biological sub-disciplines and the need to design, maintain, and evolve these according to best practices emerges across such efforts, there is a unique opportunity to bring together logicians, ontology developers, domain experts, and users from multiple subject domains and initiatives at different stages of maturity. To take advantage of such cross-initiative synergy, and to foster better communication and knowledge exchange among these initiatives, we propose to co-locate the event with the annual meeting of the <span class="plainlinks">[http://www.tdwg.org/ International Biodiversity Information Standards Organization (TDWG)]</span>.
 
 
 
'''Background'''.  The work of the Evolutionary Informatics Working Group at NESCent on addressing obstacles to interoperability of evolutionary data and tools has given rise to a set of emerging standards for transmission of and access to data and its semantics in a predictable and programmable manner. Specifically, the set consists of a rigorously defined and validatable syntax standard for phylogenetics data and trees with embedded and semantically rich metadata (<span class="plainlinks">[http://www.nexml.org NeXML]</span>), a standard programmable interface to online phylogenetic data resources ([[PhyloWS]]), and an ontology defining the terms and concepts used in comparative evolutionary analysis and the relationships between those (<span class="plainlinks">[http://evolutionaryontology.org/cdao CDAO]</span>).
 
 
 
[[Image:EvoIO stack.png|right|256px]] This set of interoperability standards is now forming the core of the EvoIO technology stack as [[NSF Interop Proposal| proposed to NSF in an INTEROP grant application]], which articulates how these technologies complement each other, as well as a vision for a broad, open, and collaborative community network that sustains their continued development.
 
 
 
The working group concluded its work in March 2009 with a [[Database Interop Hackathon| hackathon event focused on database interoperability]] in evolutionary biology. The event was highly successful in multiple ways. It brought together developers of the EvoIO technologies and related projects with database experts from key online data resources in evolutionary biology, significantly increasing awareness of interoperability obstacles and the work of the EvoInfo group to address them. The event also greatly raised the profile and capabilities of the EvoIO stack of standards (NeXML, CDAO and PhyloWS), and demonstrated that, with a modest amount of training and effort, data providers can improve interoperability, with benefits for data providers and for end users. Furthermore, the group of collaborators who subsequently formulated the [[NSF Interop Proposal| EvoIO community vision proposed to NSF]] emerged from this event.
 
 
 
As another key outcome of the event, participants from various subgroups repeatedly identified a critical gap on the way to fully supporting interoperability, specifically for the exchange of metadata and their semantics: the lack of a sufficiently expressive ontology, or set of ontologies and vocabularies, together with the community mechanisms and infrastructure to sustain their continued development and evolution. This was confirmed by several subsequent efforts to adopt some of the standards, for example incorporating PhyloWS and NeXML support into the next-generation version of the community resource TreeBASE.
 
 
 
'''Motivation: Need to extend ontologies'''. Online repositories of evolutionary data contain data and metadata that far exceed the scope of the current [http://www.evolutionaryontology.org CDAO ontology]. The overall goal of the EvoIO stack is to allow these data to be accessed, searched, retrieved, and repurposed programmatically. The NeXML and PhyloWS projects allow describing, sharing, and querying data from diverse online databases, but their development is being hampered by the lack of the controlled vocabularies and ontologies necessary to transport the semantics in an interoperable manner. Examples that have already been identified through the work of hackathon and EvoInfo participants include metadata on experimental protocols, ontology-based annotation of phenotypes, and taxonomic affiliations of OTUs.
 
 
 
Even if the EvoIO stack projects agree on terminology, developing sustainable solutions to these problems requires coordination and direct involvement of stakeholders from diverse communities to ensure that development efforts address community needs in a way that promotes widespread adoption.  Ultimately, there must be a community dynamic in which domain experts, users, and developers alike know the value of shared terminology, and know how to contribute to its development.
 
 
 
We propose to jump-start such a community dynamic by holding a <span class="plainlinks">[http://vocamp.org/ VoCamp]</span>-style event. A VoCamp is similar in conception to a hackathon in the sense of being an intense, hands-on working meeting with face-to-face interactions among a diverse group of people who create an intellectually fertile and stimulating atmosphere. Instead of developing software source code, it focuses on issues of vocabulary and ontology design, development, and application. VoCamps have emerged only recently (in 2008), but have since spread rapidly (see the online [http://vocamp.org/wiki/Main_Page#VoCamps list of upcoming and previous VoCamps]), and hold significant potential to foster collaborative and open development and community cohesion for ontologies, much as hackathons have been shown to do for open-source software development.
 
 
 
'''Meeting preparation'''. To maximize the productivity of the event, it is important that those who attend are committed to, and capable of, making substantive contributions to the development process, and that we choose the right mix of participants among ontology engineers, logicians, domain experts, and developers who would consume and apply the ontologies.
 
 
 
We will recruit a core group of participants from EvoIO stack developers, experts on ontology engineering (such as from the NCBO, or European groups; see below), and developers of online evolutionary data providers and aggregators.
 
 
 
We propose to hold this meeting coincident with the <span class="plainlinks">[http://www.tdwg.org/conference2009/ 2009 TDWG Annual Meeting]</span> in Montpellier, France. This would allow us to take advantage of opportunities for synergy with the recent shift of TDWG activities towards shared vocabularies and ontology development, as well as for broadening the nascent EvoIO community to biodiversity informatics practitioners, which includes information scientists working in museums, ecology, and conservation.  A European location would also be very cost-effective for some of the CDAO developers, who are based in Strasbourg, and for involving the [http://intranet.cs.man.ac.uk/bhig/ Bio-Health Informatics] and the [http://img.cs.manchester.ac.uk/ Information Management] Groups at the University of Manchester, which comprise some of the world's leading ontologists and semantic web experts. Aside from these opportunities to connect disparate communities facing similar issues, holding the event at the TDWG meeting would double as an ongoing activity of the TDWG Phylogenetics Standards Interest Group, which is co-led by two members of the organizing group (Lapp and Cellinese).
 
 
 
Additional participants will be recruited through an open call for participation disseminated to data providers, mailing lists for the EvoIO Stack, the Scientific Observations Network (<span class="plainlinks">[http://sonet.ecoinformatics.org/ SONet]</span>), TDWG mailing lists, [http://gbif.org GBIF], the [http://systbiol.org/ Society for Systematic Biology], and the [http://evol.mcmaster.ca/evoldir.html EvolDir] and [http://listserv.umd.edu/archives/ecolog-l.html EcoLog] mailing lists.
 
 
 
'''Meeting agenda'''. As successfully practiced previously for the NESCent-sponsored hackathons, the exact agenda will be developed by the participants, partly in advance through teleconferences and online electronic media, and partly on-site through an Open Space activity. We anticipate, however, that the following activities and tasks will be needed to ensure a productive event for all participants.
 
* Boot-camps on ontology design and engineering, ontology languages and their restrictions, entailment and reasoning over ontologies, and infrastructure for collaborative development of ontologies.
 
* Identification of focus areas for vocabulary development and term definition.
 
* Identification of targets of opportunity for cross-project and cross-community synergy.
 
As in hackathons, the participants will break into smaller subgroups that collaboratively work on their chosen development targets. The total event duration is envisioned to be 4 to 4.5 days. The event will conclude with a wrap-up session and, if held at the TDWG Annual Meeting, a report will be scheduled to be presented afterwards to the TDWG Conference audience.
 
 
 
'''Expected outcomes'''. We expect that the event will have a substantive impact on multiple ontology development efforts, stakeholder communities, and interoperability initiatives, by connecting previously disparate communities with shared objectives, sharing knowledge, and by actually building out existing ontology resources in terms of ontological rigor, semantic richness, and modularity that supports effective reuse. Specifically for evolutionary data interoperability, we expect the event to result not only in a much richer and better-defined CDAO that meets the immediate needs of online data providers and aggregators, but also in a much improved alignment of CDAO design principles with initiatives from ecology (such as <span class="plainlinks">[http://sonet.ecoinformatics.org/ SONet]</span>), biodiversity (such as <span class="plainlinks">[http://code.google.com/p/darwincore/ DarwinCore]</span> and the <span class="plainlinks">[http://wiki.tdwg.org/twiki/bin/view/TAG/TDWGOntology TDWG Ontology]</span>), and genetics / biomedicine (such as <span class="plainlinks">[http://www.obofoundry.org/ OBO]</span>). We anticipate that the event, if successful, will give rise to similarly structured follow-up events organized by these other communities; although subject-focused ontology development sprints have taken place for OBO ontologies (such as for specific regions of an anatomy), more inclusive VoCamp-style events have not yet taken place in any of those communities.
 

Latest revision as of 12:29, 27 August 2009

Overview

NESCent has some funds and is open to the idea of another hackathon. Past hackathons have been 15 to 20 people from outside NESCent, along with a handful from inside NESCent.

VoCamp related meeting plan

We decided to focus on developing vocabulary and ontology with a diverse group of stakeholders.

Background: drivers and issues

We don't know precisely what this group will choose to focus on, but we need to articulate some of the drivers and state some of the issues to be resolved.

For instance, deciding how we are going to designate taxonomic identifiers is critical. Data integration depends on being able to integrate across common variables. In the genomics world, these are things like GenBank accession numbers, species names (or NCBI taxon ids), and sequences. For example, my colleague John Moult integrates SNP data, function annotations, and protein structure by way of sequence matches and accession numbers. In the larger world of comparative biology, the integrating variables would include (in addition to GenBank accessions and species names) other taxon identifiers, specimen (and collection) ids, geographic coordinates, and so on.
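
As a rough illustration of what "integrating across common variables" means in practice, the following Python sketch groups records from several sources by a shared identifier. It is only a toy under stated assumptions: the source names, field names, and accession values are invented placeholders, not real GenBank accessions or any project's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Record = Dict[str, str]

def integrate(sources: Dict[str, Iterable[Record]], key: str) -> Dict[str, Dict[str, List[Record]]]:
    """Group records from several sources by the value of a shared key."""
    merged = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            # Records lacking the integrating variable cannot be linked.
            if key in record:
                merged[record[key]][source_name].append(record)
    return {k: dict(v) for k, v in merged.items()}

# Hypothetical placeholder data; identifiers and fields are made up.
sources = {
    "snps": [
        {"accession": "ACC_A", "variant": "v1"},
        {"accession": "ACC_B", "variant": "v2"},
    ],
    "annotations": [
        {"accession": "ACC_A", "function": "kinase"},
        {"note": "no accession recorded"},
    ],
}

linked = integrate(sources, key="accession")
```

Here only `ACC_A` is linked across both sources; the record without an accession cannot be integrated at all, which is why agreeing on identifiers matters.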

In development of the PhyloWS standard, we are working on specifying the available query terms. For example, a user may want to find:

  • subtrees descended from a given internal node
  • trees inferred using maximum likelihood
  • fully resolved (binary) trees

But we need terminology to define these concepts. Ideally, these terms would be in an external ontology. This would also allow the NeXML returned from a PhyloWS query to contain these concepts, linked as metadata through the ontology.
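
To make the three example queries concrete, here is a minimal Python sketch on toy data. It is not PhyloWS or NeXML: the child-map tree representation, the node names, and the free-text "method" label are all assumptions made for illustration, and the free-text label stands in for exactly the kind of term that would ideally come from a shared ontology.

```python
from typing import Dict, List

Tree = Dict[str, List[str]]  # node -> list of children (leaves map to [])

def subtree(tree: Tree, root: str) -> Tree:
    """Return the part of the tree descended from `root`, including `root`."""
    result: Tree = {}
    stack = [root]
    while stack:
        node = stack.pop()
        children = tree.get(node, [])
        result[node] = list(children)
        stack.extend(children)
    return result

def is_fully_resolved(tree: Tree) -> bool:
    """True if every internal node has exactly two children (binary tree)."""
    return all(len(children) == 2 for children in tree.values() if children)

# Hypothetical toy trees.
resolved: Tree = {"root": ["n1", "C"], "n1": ["A", "B"], "A": [], "B": [], "C": []}
polytomy: Tree = {"root": ["A", "B", "C"], "A": [], "B": [], "C": []}

# Hypothetical metadata; the "method" strings are placeholders for ontology terms.
studies = [
    {"id": "tree1", "tree": resolved, "method": "maximum likelihood"},
    {"id": "tree2", "tree": polytomy, "method": "parsimony"},
]

ml_ids = [s["id"] for s in studies if s["method"] == "maximum likelihood"]
binary_ids = [s["id"] for s in studies if is_fully_resolved(s["tree"])]
clade = subtree(resolved, "n1")
```

The string comparison on `method` only works if every data provider spells the label the same way, which is the interoperability problem a shared vocabulary is meant to solve.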

Feel free to expand anything in the list into its own subsection

  • decide on an approach to representing descriptions of studies (probably based on OBI)
  • clarify relations in CDAO, following BFO and REL principles

Meeting preparation

Soliciting and choosing participants

We need to consider how we will choose participants and prepare for the meeting. We may need to pick very carefully to achieve credibility if we want to promote standards. This looks like a diverse group.

Preparing materials for the meeting

It would be a failure if the group got together and split into tiny pieces because there were not enough common interests. We may need to assemble test cases that could be used to evaluate solutions, e.g., test cases of taxonomic disambiguation or cross-mapping.

Meeting plan

What will be the structure of the meeting? Open Space? Agenda?

Locations

Montpellier

  • Cost per person:
    • $ 900 from major US airport to Montpellier
    • $ 110 per night lodging
  • Meeting facilities:
    • essentially free
  • Synergies:

NESCent

  • Cost per person:
  • Meeting facilities:
  • Synergies:
    • NESCent: Vision, Lapp, Balhoff, Swofford, Scherle

New Mexico

  • Cost per person:
  • Meeting facilities:
  • Synergies:
    • Tucson, AZ is 3 or 4 hr away by car (iPlant Collaborative; Maddison will have moved to Oregon by Nov; Mike Worobey in EEB does viral phylogenetics)
    • Los Alamos (LANL HIV db) is 3 hr away by car
    • NMSU is home of Pontelli, Chisham
  • challenges or disadvantages
    • airport is an hour away

Hackathon-related ideas

Projects focused on consolidating our gains and serving the community of users:

  1. polish up demo projects from dbhack1. The dbhack1 projects were promising but incomplete. Solid, well-documented demonstration projects are needed to expose our interop technologies.
  2. set up an EvoIO portal (translation, data set curation, etc)
  3. develop an online course complete with information resources, demos, and assignments

Projects focused on building foundations and serving the community of developers:

  1. widen the domain by including viral phylogenetics and molecular epidemiology
  2. ontology, including sticky problems (CDAO relations and upper-level categories) and annotation support (names of programs and file formats)
  3. transition model language

phylo interop portal

The strategic focus of the portal would be interop, but the portal could support other community-building activities such as blogs, bookmarking, forums, etc.

  • objectives
    • provide users with centralized resources
    • demonstrate useful, working aspects of interop technologies
    • illustrate benefits of integrative or large-scale analyses
    • testbed for trying out new concepts and for debugging
    • increase exposure of project to increase chances of funding
  • features
    • format interconversion
    • triple store for download
    • visualization - second
    • data set integration wizard
    • annotation support for curation, metadata
    • analysis operations (implemented by reasoner)
  • how to get this done with limited resources
    • get more players involved by conceiving this broadly
    • provide hosting to some projects where mutually beneficial
    • use the hackathon mechanism to get started
    • use CREST funds to hire graduate student
    • prepare in advance to take advantage of GSoC mechanism
    • work with NSF PIs to get interop-related supplements (e.g., MrBayes)

polish up projects

Sheldon:

phylowidget/viz improvements
clean up and generalize interface; complete modularization
integrate with other projects with an outward facing PhyloWS interface
write-back capability via PhyloWS

models

transition models
