Third Report to NESCent

From Evolutionary Informatics Working Group
Revision as of 11:54, 9 June 2008 by Arlin.stoltzfus@nist.gov (talk) (Executive summary)

21 May, 2008

Group leaders: Arlin Stoltzfus and Rutger Vos

Executive summary

need to fill in the plans for next period

The Evolutionary Informatics working group held its third meeting at NESCent headquarters in Durham, NC, May 19 to 22, 2008. The mandate of the working group is to lower the barrier for the broader application of evolutionary methods by enabling the coordination and inter-operation of data and software resources at all levels. The group divides its effort among several projects. Some members work only during group meetings, while others devote considerable time to group projects between meetings. At this meeting, group members reported on the status of major projects before breaking up into task-specific teams. Before departing, the group took stock of its accomplishments and discussed the future trajectory of its projects.

A major goal is to develop a "Central Unifying Artefact" to serve as a Rosetta stone for interoperability. The working group pursues this goal via two strategies represented in the nexml and CDAO projects. Both projects have been productive: each has generated a working draft of the interoperability artefact; each project has attracted interest from other scientists; and each project team is writing a manuscript to be submitted for publication in the next few months.

As hoped, nexml, an XML-based successor to NEXUS (the legacy format for comparative data), appears poised to become the next-generation standard data exchange format in phylogenetics. Support for nexml IO (that is, for reading and writing nexml files) has been implemented in DAMBE, Mesquite, Bio::Phylo and pyNexml, and has been initiated for PAUP*, BioPerl and Phylobase. Based on commitments expressed by developers, we anticipate support for nexml IO in the next few months in packages such as Phycas, GARLI and HyPhy.

While the nexml project is a bottom-up approach focused on a syntax-based view of the interoperability requirements of data representation, the CDAO (Comparative Data Analysis Ontology) project begins at a high level of abstraction, by conceptualizing data and operations in terms of their semantics, expressed formally in the OWL-DL ontology language. Partly through NESCent sponsorship of visiting scientists in March, 2008, working group members built on their earlier work to complete an initial version of CDAO. At the third meeting, working group members developed a mapping between nexml and CDAO, so as to facilitate inter-conversion using knowledge engineering tools.

A third team discussed the design requirements for an evolutionary model description language, a particularly sticky sub-task in developing a "Central Unifying Artefact". This team hopes to produce a preliminary design specification soon, based on the XML representation used by BEAST but adapted to conform to NeXML conventions.

At the close of the meeting, and in the follow-up period, the group considered its trajectory and developed plans. Interoperability is a community-wide phenomenon, requiring "buy-in" from the research community. Given its progress in developing artefacts, the working group is now in a position to adopt a more outward focus on using the artefacts to respond more directly to community needs.

The working group developed plans for two projects that align with community needs. One such need, described by guest speaker Dr. James Leebens-Mack, is for a minimal reporting standard, tentatively named "MIAPA" (Minimal Information for a Phylogenetic Analysis). The working group developed a plan that would combine a community exercise in annotation with development of an ontology and user-oriented software to generate MIAPA-compliant submissions. The second project addresses data interoperability among the growing number of databases of sequence families.

Scope of this report

ADD RELEASE DATE

The working group meets twice per year and issues a report on its activities a few weeks after each meeting. This report covers work accomplished subsequent to the previous reports, including work done before, during and after the most recent meeting. That is, this Third Working Period report covers group activities from 21 December 2007 (when the report on the second working period was released) to the release date for this report, sometime soon. The report closes by outlining the strategy the working group will follow in the period up to November 2008, and the anticipated tangible outcomes from this.

Project leaders and participants

DONE

Organizers reach out to the wider community to ensure that interested individuals and organizations are aware of the latest developments and future plans, to invite key individuals to participate, and to request feedback and comments. This results in a fluctuating set of Participants and Colleagues whose involvement in various activities is summarized in the table below. Jim Leebens-Mack (MIAPA project) was a guest at our May meeting, as were graduate students Brandon Chisham and Francisco Prosdocimi. We heard talks from NESCent staff Jim Balhoff (Phenoscape project), Ryan Scherle (Dryad project), and Hilmar Lapp (PhyloWS project).

Participant | NESCent 2006 proposal | December 2006 hackathon | Prioritization exercise | May 2007 meeting | Oct 2007 NIH proposal | Nov 2007 meeting | Mar 2008 vis. sci. | May 2008 meeting
Chisham, Brandon       
Eisen, Jonathan       
Felsenstein, Joe       
Gupta, Gopal      
Holder, Mark    
Huelsenbeck, John       
Kosakovsky Pond, Sergei L.  
Kumar, Sudhir      
Leebens-Mack, Jim        
Lewis, Paul O.      
Mackey, Aaron    
Maddison, David        
Maddison, Wayne        
Piel, Bill       
Pontelli, Enrico    
Prosdocimi, Francisco       
Qiu, Weigang   
Rambaut, Andrew        
Stoltzfus, Arlin¹
Swofford, David L.   
Thompson, Julie     
Vos, Rutger¹   
Xia, Xuhua     
Zmasek, Christian    

¹ Organizers

² Virtual participant (by video link)

Goals for the Third Working Period

DONE

The mandate of the working group is to improve interoperability in evolutionary analysis. Although the original proposal was to focus narrowly on supporting current standards, the group has expanded its scope and efforts in additional directions. In particular, the group has undertaken the bolder and more forward-looking project of developing a "Central Unifying Artefact" (a db schema, a file format, or an ontology) that should serve as the Rosetta stone for interoperability, together with the development of a transition model language. The next two subsections outline the specific goals within this scope for the period leading up to the third meeting and for the third meeting itself, respectively.

Goals for the period prior to the third meeting

DONE

The goals along with specific aims for this period - as described in greater detail in the second report - were briefly as follows:

  • developing a general ontology for comparative evolutionary analysis
    • Develop initial version of the CDAO ontology in OWL-DL
    • Evaluate the initial version of CDAO
    • Write a manuscript or presentation describing the CDAO, for public use
    • Implement semantic transformation (Brian, Gopal, Enrico, Vien)
      • a web server providing translation for PHYLIP, NEXUS, and at least one other format
      • initial version of a validating error-recognizing parser for TreeBase input (NEXUS) files
    • Use-case documentation (Weigang)
      • instantiate additional use cases with data files
  • developing a transition model language
    • Generate pseudo-code (i.e. English statements) which might then be mapped to whatever notation the group deems suitable. Start by making a list of examples (just models of substitutions for now), all from literature on the wg-evoinfo wiki. The expectation is that we can probably sketch out a feature set by next May.
    • David Swofford planned to work with Derrick Zwickl on something that would be implemented in Garli and (a future version of) PAUP*.
  • future data exchange syntax standard
    • Create a web presence at http://www.nexml.org with instance document validation, binding downloads and documentation.
    • Develop a set of generic java class libraries (Rutger, Wayne, Peter) to facilitate more complete nexml writing (currently only reading is supported). This will be done by designing simple interfaces which users of the class libraries (e.g. mesquite) need to implement in their objects (Wayne, Peter), so that the class libraries can fetch the required attributes and child elements of the objects and serialize them to xml.
    • Develop a more complete Bio::Phylo nexml parser, to facilitate parsing of networks and character matrices, and develop xml writability by expanding the Bio::Phylo::XMLWritable class, along the same lines as the java interfaces.
    • Incorporate the python nexml parser libraries (for phycas) in the nexml svn repository (Jeet).
    • Clarify the output of the online nexml validator (http://nexml.org/nexml/validator), formalize difference between grammar-based and rule-based validation (Rutger).
    • Explore C(++)-parsing or data binding, for example for HyPhy (Sergei).
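
The interface-based design described above for the java class libraries can be illustrated with a minimal sketch. It is written in Python for brevity rather than Java, and the interface name `XmlWritable`, its three methods, and the element and attribute names are all hypothetical; none of them is taken from the nexml schema or from the libraries named in the list.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Protocol


class XmlWritable(Protocol):
    """Hypothetical interface that an application object would implement."""

    def tag(self) -> str: ...
    def attributes(self) -> Dict[str, str]: ...
    def children(self) -> List["XmlWritable"]: ...


def to_element(obj: XmlWritable) -> ET.Element:
    """Recursively build an XML element from any object exposing the interface."""
    element = ET.Element(obj.tag(), obj.attributes())
    for child in obj.children():
        element.append(to_element(child))
    return element


def serialize(obj: XmlWritable) -> str:
    """Serialize an object tree to an XML string."""
    return ET.tostring(to_element(obj), encoding="unicode")


class Node:
    """Stand-in for an application's own node class (illustrative only)."""

    def __init__(self, node_id: str, label: str) -> None:
        self.node_id = node_id
        self.label = label

    def tag(self) -> str:
        return "node"

    def attributes(self) -> Dict[str, str]:
        return {"id": self.node_id, "label": self.label}

    def children(self) -> List[XmlWritable]:
        return []


class Tree:
    """Stand-in for an application's own tree class (illustrative only)."""

    def __init__(self, tree_id: str, nodes: List[Node]) -> None:
        self.tree_id = tree_id
        self.nodes = nodes

    def tag(self) -> str:
        return "tree"

    def attributes(self) -> Dict[str, str]:
        return {"id": self.tree_id}

    def children(self) -> List[XmlWritable]:
        return list(self.nodes)


example_tree = Tree("t1", [Node("n1", "A"), Node("n2", "B")])
example_xml = serialize(example_tree)
```

The point of the design is that the serializer never needs to know about `Tree` or `Node`: an application such as mesquite would only have to expose the three methods on its existing objects.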

Goals for the third meeting

DONE

The purpose of the third meeting was to advance the goals of the working group by focusing on two complementary projects to develop a "central unifying artefact", the nexml project (www.nexml.org) and the CDAO project (www.evolutionaryontology.org).

The plan for the meeting was to hear talks about progress on these two projects, followed by discussion and planning on how to use our time at NESCent to build on this progress.

The remainder of the meeting would be spent on carrying out those plans in breakout groups, interspersed with talks on related topics: the MIAPA standard, phenotype ontologies, and data repositories.

Accomplishments of the Third Working Period

Rutger's slides for May '08 old material, to be edited

The accomplishments of the second working period include work done over the summer and fall outside of NESCent, the work done at the November meeting at NESCent, and a few items completed in the follow-up period ending 21 December, 2007.

Prior to the meeting

NEXML IS DONE, OTHERS TO DO

The goals of the working group were advanced considerably over the summer and fall, due to the efforts of individuals working on specific projects of interest to them. These outcomes were described in the presentations given on Day 1, starting with the brief presentation (Media:Stoltzfus.ppt) by Stoltzfus.

  • nexml The developers made extensive progress on the emerging nexml standard. See the wiki page and Rutger's presentation.
    • Peter Midford and Rutger Vos implemented nexml i/o for mesquite using xml beans
    • Rutger Vos implemented nexml writing for Bio::Phylo (reading was already implemented by Jason Caravas in previous working period)
    • Xuhua Xia implemented nexml i/o for DAMBE
    • Jeet Sukumaran implemented nexml i/o for pyNexml
    • Rutger Vos refined the schema for easier usage with c++ xml beans, and to implement requirements generated during discussions and outreach
    • NESCent now hosts nexml.org, with online validation and conversion, project news feeds and documentation
  • Ontology development Pontelli, Gupta, and Stoltzfus developed plans for a major project to develop an ontology and computing environment for evolutionary analysis, including
  • Supporting current file formats
    • a bit of progress on format translation technology (see below, semantic transformation)
    • no progress was made on documentation of current formats
    • no progress was made on collecting further format examples
    • no progress was made on assessing usage in the research community
  • Transition Model Language. Progress was described in the talks by Peter Midford and Sergei Kosakovsky Pond
    • some initial review of related technologies (GAMS, MPL, MML, AMPL, AIMMS, XRate)
    • an attempt to extract design principles
    • no progress was made on a funding plan to complete this work

Summary of activities and discussion at Meeting 3

DONE

Day 1

The meeting opened at 9:00AM with remarks from Arlin Stoltzfus and Jeff Sturkey, then continued with formal presentations:

  • The NESCent EvoInfo working group: Progress, Plans and Prospects (A. Stoltzfus) 20 min
  • Current state of the NeXML project (Rutger Vos) 40 min (slides)
  • Implementation of NEXML in DAMBE (Xuhua Xia) 20 min
  • PhyloWS project (Hilmar Lapp) 20 min
  • First Draft of the Comparative Data Analysis Ontology (Enrico Pontelli) 40 min

After these presentation sessions, Arlin Stoltzfus led a discussion on the group's near-term prospects. At 4:00PM the meeting broke out into groups to develop specific plans for the week. At 5:00PM the meeting reconvened for stand-ups to brief the other groups on their respective plans. The meeting then adjourned for a group dinner.

Day 2

The second day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for two talks:

  • Jim Leebens-Mack, U. Georgia (MIAPA) (20 min)
  • Repositories, data standards, and data reuse (Ryan Scherle, NESCent)

At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.

Day 3

The third day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for three talks:

  • Phenoscape Project (Jim Balhoff, NESCent)
  • Reminder to think about the future of the working group (Arlin)
  • Using ontologies to formalize comparative data on worm development (Arlin Stoltzfus) 20 min

At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.

Day 4

The first half of the final day was devoted to task-specific work sessions. After lunch, at 1:20PM the group convened for final reports on the week's progress and discussion on plans for the coming working period.

Immediately following the meeting

THESE ARE STILL COMING IN, E.G. CARROTBASE

  • nexml
    • Rutger set up the nexml.org site
    • Rutger put the code on SourceForge
  • Ontology
    • Arlin doubled the size of the Concept Glossary
    • Enrico, Arlin and Julie developed a visiting scientist proposal to continue work

MIAPA: a way forward

In the two weeks following the meeting, a considerable amount of time was spent on developing a plan to move forward the development of a MIAPA standard, as described on the Supporting_MIAPA page. Our plan centers on a community exercise in knowledge capture, and the tools to facilitate this exercise, as a way to acquire knowledge and to engage the research community. Ideally, we would get dozens or hundreds of volunteers to generate MIAPA reports about their phylogenetic studies, and we would use this experience to improve our conception of MIAPA and the technology to support it.

So far we have

  • identified, and developed a plan to respond to, infrastructure needs for supporting MIAPA (file format, ontology, etc)
  • implemented a preliminary ontology (miapa.owl) for metadata describing sources and methods
  • identified resources that could be used to flesh out this ontology (e.g., CDAO, mygrid services ontology)
  • developed a Phenote-based proof-of-concept application for creating workflow descriptions from the MIAPA ontology
  • developed a plan for a web site where users enter workflow descriptions and other information to yield a complete MIAPA report
  • developed a plan for using this web site in a knowledge capture experiment at Evolution 2008 (or some other meeting).

Initially we hoped to move quickly to deploy this project in time for the ontology workshop (and related buzz) at the Evolution 2008 meetings in Minneapolis. Instead, we will develop this over a longer time. The next step might be to re-factor the wiki page into a White Paper on MIAPA for NESCent. We could suggest that NESCent recruit a catalysis-like group to implement it. Another direction would be to recruit a team of collaborators and submit a grant proposal.

Strategy and plans for the Fourth Working Period

TO DO

old material, to be edited This section and its subsections outline the general strategies the members of the subproject teams will follow in the period up to May 2008, and the practical work they will do within those strategies.

Next-generation file format standard

old material, to be edited In general terms, the strategy for the nexml initiative is to generate interest in it by providing code that processes nexml (SAX parser libraries, DOM data binding libraries) and implementations that use nexml (applications and web services) thereby provoking comments and discussions on the underlying assumptions of the schema - which then is improved and extended iteratively based on community input. In order to generate enough interest, an informal estimate is that this would require support from:

  • 5 programming languages (java, perl, python, c++ and one of javascript (json) or ruby (bioruby)),
  • 5 applications or toolkits (mesquite, hyphy, phycas, Bio::Phylo and one of ncl, bioruby, paup, mega, jebl, geneious, beast)
  • 3 web services (three of tree of life, cipres, treebase, ppod, morphbank, morphobank)

In addition, nexml-related outreach must include a credible web presence and publication of the standard in a bioinformatics-related open access journal. The web presence is supported by NESCent and will continue to develop; the publication is intended to evolve from the lengthy nexml wiki page.

Transition model language

old material, to be edited A problem in this subproject is that substitution model descriptions can potentially become arbitrarily nested and complex, hence more detailed analysis and experimentation is necessary in order to identify requirements. Sergei, David Swofford and Derrick Zwickl will work on a spec for model descriptions to be implemented first in Garli, then in a future version of paup* and HyPhy. This initial spec will likely be annotated in a flat text format (pseudo-code), which will then be converted to some more formal annotation (e.g. xml, xrate, rdf).

Ontology

old material, to be edited The ontology development stages (use cases, concept glossary, related artefacts, implementation, evaluation) should be undertaken iteratively. We feel that we now are ready to undertake the first (preliminary, rough) iteration of the implementation step, as described below. Meanwhile, work will continue on other steps in development.

Use Case Documentation

old material, to be edited Further effort will be devoted to documenting more use cases (ideally, each of them) with

  • files containing instance data (inputs and outputs)
  • a UML use-case diagram

Concept Glossary

old material, to be edited Work on the glossary will continue with

  • addition of more definitions and new terms
  • attempts to advertise the glossary in order to get feedback
  • exploring technologies to maintain an active glossary linked to CDAO as it emerges

CDAO Implementation

old material, to be edited We will implement an initial version of CDAO as an OWL ontology:

  • Seek funding for a working meeting at NESCent
    • to take place in spring
    • will involve Enrico, Julie, Arlin, and two grads or post-docs
  • Collect a token set of example data to cover basic uses
    • one molecular example with protein-coding gene family data
    • one systematics example with species having discrete and continuous characters
  • Implement basic concepts and relations of CDAO using OWL-DL
  • Evaluate the implementation with regard to the token data, and other data
  • Prepare a manuscript describing the results
  • Prepare a poster or slide presentation for use at meetings

Semantic Transformation

old material, to be edited Work will continue on the semantic transformation project that Brian DeVries described, including

  • translation of a more extensive set of alignment file formats
  • TreeBase input file processing (working towards a validating NEXUS parser with error-recognition)

Anticipated Outcomes and Products

TO DO

old material, to be edited This section and its subsections outline which deliverables the group expects to produce by following the strategies outlined in the preceding section.

Next-generation file format standard

old material, to be edited The nexml team anticipates the following deliverables for the coming period:

  • A web presence at http://www.nexml.org with instance document validation, binding downloads and documentation.
  • A set of generic java class libraries (Rutger, Wayne, Peter) to facilitate more complete nexml writing (currently only reading is supported). This will be done by designing simple interfaces which users of the class libraries (e.g. mesquite) need to implement in their objects (Wayne, Peter) so that the class libraries can fetch the required attributes and child elements of the objects and serialize them to xml.
  • A more complete Bio::Phylo nexml parser, to facilitate parsing of networks and character matrices, and develop xml writability by expanding the Bio::Phylo::XMLWritable class, along the same lines as the java interfaces.
  • Incorporate the python nexml parser libraries (for phycas) in the nexml svn repository (Jeet).
  • Clarify the output of the online nexml validator (http://nexml-dev.nescent.org/validator.cgi), formalize difference between grammar-based and rule-based validation (Rutger).
  • Explore C(++)-parsing or data binding, for example for HyPhy (Sergei).
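
The distinction between grammar-based and rule-based validation mentioned in the list above can be sketched as follows. Grammar-based validation checks that elements nest as a schema permits. Rule-based validation checks constraints that span elements, such as references that must resolve to existing identifiers. This is a minimal Python sketch under stated assumptions: the toy grammar table, the element names, and the `source` and `target` attributes are hypothetical and do not describe the actual nexml schema or the online validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy grammar: which child tags each parent tag may contain.
ALLOWED_CHILDREN = {
    "tree": {"node", "edge"},
    "node": set(),
    "edge": set(),
}


def grammar_errors(root):
    """Grammar-based check: every element is known and nests as allowed."""
    errors = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            errors.append(f"unknown element <{parent.tag}>")
            continue
        for child in parent:
            if child.tag not in allowed:
                errors.append(f"<{child.tag}> not allowed inside <{parent.tag}>")
    return errors


def rule_errors(root):
    """Rule-based check: every edge endpoint refers to an existing node id."""
    node_ids = {node.get("id") for node in root.iter("node")}
    errors = []
    for edge in root.iter("edge"):
        for attr in ("source", "target"):
            ref = edge.get(attr)
            if ref not in node_ids:
                errors.append(
                    f"edge {edge.get('id')}: {attr} refers to unknown node {ref!r}"
                )
    return errors


# Well-formed and grammatically valid, but edge e1 points at a missing node.
DOC = """<tree id="t1">
  <node id="n1"/><node id="n2"/>
  <edge id="e1" source="n1" target="n3"/>
</tree>"""
doc_root = ET.fromstring(DOC)
```

Here `grammar_errors(doc_root)` finds nothing to report, while `rule_errors(doc_root)` flags the dangling `target` reference. Reporting the two kinds of error separately is one way a validator's output could be made clearer.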

Transition model language

old material, to be edited As per the informal goals outlined on the model language wiki page and in follow-up communication subsequent to the second meeting, anticipated outcomes and results include:

  • Sergei plans to generate pseudo-code (i.e. English statements) which might then be mapped to whatever notation the group deems suitable. He will start by making a list of examples (just models of substitutions for now), all from literature on the wg-evoinfo wiki. The expectation is that we can probably sketch out a feature set by next May.
  • David Swofford plans to work with Derrick Zwickl on something that would be implemented in Garli and (a future version of) PAUP*.

Ontology

old material, to be edited

The outcomes anticipated in the next working period are:

  • Ontology implementation and description (Enrico, Julie, Arlin)
    • initial version of the CDAO ontology in OWL-DL
    • results of evaluating the initial version of CDAO
    • a manuscript or presentation describing the CDAO, for public use
  • Semantic transformation (Brian, Gopal, Enrico, Vien)
    • a web server providing translation for PHYLIP, NEXUS, and at least one other format
    • initial version of a validating error-recognizing parser for TreeBase input (NEXUS) files
  • Use-case documentation (Weigang)
    • instantiate additional use cases with data files