Dbhack2 proposal

From Evolutionary Informatics Working Group
Revision as of 06:42, 26 August 2009 by Hlapp (talk) (Proposal for a phyloinformatics VoCamp)


NESCent has some funds and is open to the idea of another hackathon. Past hackathons have drawn 15 to 20 people from outside NESCent, along with a handful from inside NESCent.

VoCamp related meeting plan

We decided to focus on developing vocabulary and ontology with a diverse group of stakeholders.

Background: drivers and issues

We don't know precisely what this group will choose to focus on, but we need to articulate some of the drivers and state some of the issues to be resolved.

For instance, deciding how we are going to designate taxonomic identifiers is critical. Data integration depends on shared variables that link records across sources. In the genomics world, these are things like GenBank accession numbers, species names (or NCBI taxon IDs), and sequences. For example, my colleague John Moult integrates SNP data, function annotations, and protein structure by way of sequence matches and accession numbers. In the larger world of comparative biology, the integrating variables would include (in addition to GenBank accessions and species names) other taxon identifiers, specimen (and collection) IDs, geographic coordinates, and so on.
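As a rough illustration of what integrating across a common variable means in practice, the Python sketch below joins two small record sets on a shared accession key. Everything in it is a hypothetical placeholder: the function name `join_on`, the field names, and the accession strings are invented for illustration and do not come from any real resource.

```python
from collections import defaultdict

def join_on(key, left, right):
    """Pair up records from two sources that share the same value for `key`."""
    # Index the right-hand records by the integrating variable.
    index = defaultdict(list)
    for rec in right:
        if key in rec:
            index[rec[key]].append(rec)
    joined = []
    for rec in left:
        # Records without a match (or without the key) are simply dropped.
        for match in index.get(rec.get(key), []):
            merged = dict(match)
            merged.update(rec)
            joined.append(merged)
    return joined

# Hypothetical records: the accession strings are made up.
sequences = [
    {"accession": "ACC0001", "taxon": "Homo sapiens"},
    {"accession": "ACC0002", "taxon": "Mus musculus"},
]
specimens = [
    {"accession": "ACC0001", "specimen_id": "SPEC-17", "lat": 35.99, "lon": -78.90},
]

integrated = join_on("accession", sequences, specimens)
```

The join only works because both sources agree on what an accession is and how it is written, which is the agreement that shared identifiers and vocabularies are meant to provide.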

In development of the PhyloWS standard, we are working on specifying the available query terms. For example, a user may want to find:

  • subtrees descended from a given internal node
  • trees inferred using maximum likelihood
  • fully resolved (binary) trees

But we need terminology to define these concepts. Ideally, these terms would be in an external ontology. This would also allow the NeXML returned from a PhyloWS query to contain these concepts, linked as metadata through the ontology.
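The three example queries can be sketched in a few lines of Python, which also shows where the missing terminology bites. This is only a toy under stated assumptions: the nested-dict tree layout, the `method` field, and the helper names are hypothetical and are not part of PhyloWS, NeXML, or CDAO.

```python
def find_subtree(node, label):
    """Return the subtree rooted at the node with the given label, or None."""
    if node.get("label") == label:
        return node
    for child in node.get("children", []):
        found = find_subtree(child, label)
        if found is not None:
            return found
    return None

def is_fully_resolved(node):
    """True if every internal node has exactly two children (a binary tree)."""
    children = node.get("children", [])
    if not children:
        return True  # a leaf is trivially resolved
    return len(children) == 2 and all(is_fully_resolved(c) for c in children)

def leaf_labels(node):
    """Collect the labels of all leaves below (or at) this node."""
    children = node.get("children", [])
    if not children:
        return [node["label"]]
    labels = []
    for child in children:
        labels.extend(leaf_labels(child))
    return labels

# Hypothetical tree records; "method" is a free-text string here.
trees = [
    {"method": "maximum likelihood",
     "root": {"label": "n1", "children": [
         {"label": "A"},
         {"label": "n2", "children": [{"label": "B"}, {"label": "C"}]}]}},
    {"method": "parsimony",
     "root": {"label": "n1", "children": [
         {"label": "A"}, {"label": "B"}, {"label": "C"}]}},
]

ml_trees = [t for t in trees if t["method"] == "maximum likelihood"]
resolved = [is_fully_resolved(t["root"]) for t in trees]
clade = find_subtree(trees[0]["root"], "n2")
```

The filter on `method` matches a literal string, so a provider that writes "ML" or "max. likelihood" would be missed. A term from a shared ontology would take the place of that string.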

Feel free to expand anything in the list into its own subsection

  • decide on an approach to representing descriptions of studies (probably based on OBI)
  • clarify relations in CDAO, following BFO and REL principles

Meeting preparation

Soliciting and choosing participants

We need to consider how we will choose participants and prepare for the meeting. We may need to pick very carefully to achieve credibility if we want to promote standards. This looks like a diverse group.

Preparing materials for the meeting

It would be a failure if the group got together and split into tiny pieces because there were not enough common interests. We may need to assemble test cases that could be used to evaluate solutions, e.g., test cases of taxonomic disambiguation or cross-mapping.
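One possible shape for such test cases, sketched in Python: each case pairs an input name with the identifier a solution is expected to return, and candidate solutions are scored against the same list. The case list, the resolver functions, and the `NCBITaxon:` prefix style are illustrative assumptions, not an agreed format.

```python
def evaluate(resolver, cases):
    """Return (number of cases passed, total number of cases)."""
    passed = [c for c in cases if resolver(c["name"]) == c["expected"]]
    return len(passed), len(cases)

# Illustrative cases: messy input on the left, expected identifier on the right.
CASES = [
    {"name": "Homo sapiens", "expected": "NCBITaxon:9606"},
    {"name": "homo sapiens ", "expected": "NCBITaxon:9606"},
    {"name": "Mus musculus", "expected": "NCBITaxon:10090"},
    {"name": "Unknownus nameus", "expected": None},  # should not resolve
]

EXACT = {"Homo sapiens": "NCBITaxon:9606", "Mus musculus": "NCBITaxon:10090"}
NORMALIZED = {name.lower(): taxon_id for name, taxon_id in EXACT.items()}

def exact_resolver(name):
    """Candidate 1: match the name exactly as given."""
    return EXACT.get(name)

def normalizing_resolver(name):
    """Candidate 2: trim whitespace and ignore case before matching."""
    return NORMALIZED.get(name.strip().lower())

exact_score = evaluate(exact_resolver, CASES)
normalizing_score = evaluate(normalizing_resolver, CASES)
```

A shared list of this kind would give sub-groups a common target, since any proposed disambiguation or cross-mapping approach can be run against the same cases and compared.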

Meeting plan

What will be the structure of the meeting? Open Space? Agenda?



  • Cost per person:
    • $ 900 from major US airport to Montpellier
    • $ 110 per night lodging
  • Meeting facilities:
    • essentially free
  • Synergies:


  • Cost per person:
  • Meeting facilities:
  • Synergies:
    • NESCent: Vision, Lapp, Balhoff, Swofford, Scherle

New Mexico

  • Cost per person:
  • Meeting facilities:
  • Synergies:
    • Tucson, AZ is 3 or 4 hr away by car (iPlant Collaborative; Maddison will have moved to Oregon by Nov; Mike Worobey in EEB does viral phylogenetics)
    • Los Alamos (LANL HIV db) is 3 hr away by car
    • NMSU is home of Pontelli, Chisham
  • challenges or disadvantages
    • airport is an hour away

Hackathon-related ideas

Projects focused on consolidating our gains and serving the community of users:

  1. polish up demo projects from dbhack1. The dbhack1 projects were promising but incomplete. Solid, well-documented demonstration projects are needed to expose our interop technologies.
  2. set up an EvoIO portal (translation, data set curation, etc)
  3. develop an online course complete with information resources, demos, and assignments

Projects focused on building foundations and serving the community of developers:

  1. widen the domain by including viral phylogenetics and molecular epidemiology
  2. ontology, including sticky problems (CDAO relations and upper-level categories) and annotation support (names of programs and file formats)
  3. transition model language

phylo interop portal

The strategic focus of the portal would be interop, but the portal could also support other community-building activities such as blogs, bookmarking, and forums.

  • objectives
    • provide users with centralized resources
    • demonstrate useful, working aspects of interop technologies
    • illustrate benefits of integrative or large-scale analyses
    • testbed for trying out new concepts and for debugging
    • increase exposure of project to increase chances of funding
  • features
    • format interconversion
    • triple store for download
    • visualization - second
    • data set integration wizard
    • annotation support for curation, metadata
    • analysis operations (implemented by reasoner)
  • how to get this done with limited resources
    • get more players involved by conceiving this broadly
    • provide hosting to some projects where mutually beneficial
    • use the hackathon mechanism to get started
    • use CREST funds to hire graduate student
    • prepare in advance to take advantage of GSoC mechanism
    • work with NSF PIs to get interop-related supplements (e.g., MrBayes)

polish up projects


phylowidget/viz improvements
clean up and generalize interface; complete modularization
integrate with other projects with an outward facing PhyloWS interface
write-back capability via PhyloWS


transition models

Proposal for a phyloinformatics VoCamp

Co-leaders. Do we want to list everyone? Some of { N. Cellinese, K. Cranston, H. Lapp, E. Pontelli, S. McKay, A. Stoltzfus, R. Vos }

Synopsis. Standard, interoperable, open and community-developed controlled vocabularies and ontologies that continuously evolve according to changing research demands are key to the ability to integrate and compute over data across fields, experimental systems, and protocols. At a recent NESCent-sponsored hackathon on Evolutionary Database Interoperability, ontologies and vocabularies that meet the needs of diverse community resources and tools emerged as the key gap among the three components of the emerging EvoIO stack of interoperability standards. Filling this gap on a sustainable basis requires a diverse community of domain experts, users, and stakeholders with a shared awareness of, and commitment to, knowledge-based standards. To begin building such a community, we propose a "VoCamp"-style meeting for investigators to create and develop ontologies and lightweight vocabularies in support of integration and semantic cross-linking of evolutionary data with its many related fields. As ontologies are being developed in many biological sub-disciplines and the need to design, maintain, and evolve these according to best practices emerges across such efforts, there is a unique opportunity to bring together logicians, ontology developers, domain experts, and users from multiple subject domains and initiatives at different stages of maturity. To take advantage of such cross-initiative synergy, and to foster better communication and knowledge exchange between them, we propose to co-localize the event with the annual meeting of the International Biodiversity Information Standards Organization.

Background. Through its Evolutionary Informatics working group, NESCent has developed an interoperability strategy based on the "EvoInfo stack":

  • NeXML, a next-generation file format;
  • CDAO, an ontology for comparative evolutionary analysis; and
  • PhyloWS, a web services standard.

Previous interoperability hackathon. This past spring, the working group organized a highly successful "data resource interoperability hackathon" that brought together "stack" developers with programmers representing key data resources, most of whom had no previous association with the working group. Participants self-organized into five sub-groups, each of which generated a working software product to demonstrate desired interoperability improvements:

  • using NeXML and CDAO, the Semantic processing group applied advanced language technologies to generating a logically queryable representation of phylogenetic data
  • using NeXML, the Phylogenetic visualization group expanded the capabilities of PhyloWidget to generate and represent tree annotations, to retrieve and display images, and to support remote web services for adding tree annotations
  • the Java API for NeXML group created a programmable Java interface to NeXML
  • using NeXML and PhyloWS, the Taxonomic Intelligence group implemented a web services interface to TreeBase
  • using NeXML and PhyloWS, the Phylr group created a modular system to generate databases whose phylogenetic content can be accessed via PhyloWS

The hackathon was successful in raising the profile and capabilities of NeXML, CDAO and PhyloWS, and in demonstrating that, with a modest amount of training and effort, data providers can improve interoperability, with benefits for data providers and for end users.

Clear need to extend ontologies. Online repositories of evolutionary data contain data and metadata that far exceed the scope of the current CDAO ontology. The overall goal of the EvoIO stack is to allow this data to be accessed, searched, retrieved, and repurposed programmatically. The NeXML and PhyloWS projects will allow us to describe, share and query data from diverse online databases, but their development is being hampered by a lack of controlled vocabulary. Even if these projects agree on terminology, the lack of a common online ontology prevents programmatic data sharing and large-scale interoperability.

The hackathon, as well as subsequent work by the participants, has revealed an urgent need to address vocabulary and ontology issues such as

  • metadata on experimental protocols
  • ontology-based annotations of phenotypes
  • taxonomic affiliations

Developing long-term solutions to these problems requires coordination and cannot be left to a small team of developers, no matter how responsive they try to be. Direct involvement of stakeholders from diverse communities is needed to ensure that development efforts address community needs in a way that promotes widespread adoption. Ultimately, there must be a community dynamic in which stakeholders know the value of shared terminology, and know how to contribute to its development.

Meeting preparation. A VoCamp is similar in conception to a hackathon, but focused on issues of vocabulary and ontology. In both cases it is important that those who attend the meeting are committed to, and capable of, making substantive contributions to the development process. While it is crucial to choose the right sort of participants, one does not want to restrict the meeting to a chosen few.

(more about how we will advertise, and recruit at least some of the participants. What sorts of people do we expect to attend?) The participants will include those with expert knowledge about ontology development in addition to EvoIO Stack developers and users / providers of evolutionary data. Holding this meeting alongside the TDWG meeting will allow for easy participation by experts on ontology development. Additional participants will be recruited through direct contact with data providers, mailing lists for the EvoIO Stack, the Society for Systematic Biology, and the evoldir mailing list for evolutionary biology.

Meeting plan. (how much agenda? what presentations or training about "stack" technologies? I think we want an initial ontology introduction with some of the technical and philosophical challenges) [Enrico]: we could start by inviting someone like Mungall to provide a quick training on stack technology, ontology languages, etc. It is important for people to understand the restrictions in the formal description of an ontology. How about something like this:

  • Crash course on ontology design and technology
  • Identification of focus areas [e.g., experimental protocols, phenotype annotation, taxonomic affiliations]
  • Composition of groups of people around the focus areas; each group should probably
    • identify core concepts and relations (in natural language -- e.g., creation of local concept term list)
    • motivate core concepts (use cases? existing artifacts?)
  • Cross-group discussion to identify overlaps/connections among focus areas and with existing ontologies (e.g., CDAO)

There should be some sort of loop in that process.

Meeting followup. (how will we document the outcomes of this meeting? what are the next steps?)