Second Report to NESCent

From Evolutionary Informatics Working Group

21 December, 2007

Group leaders: Arlin Stoltzfus and Rutger Vos


Executive summary

The Evolutionary Informatics working group held its second meeting at NESCent headquarters in Durham, NC, November 12 to 14, 2007. The mandate of the working group is to lower the barrier for the broader application of evolutionary methods by enabling the coordination and inter-operation of data and software resources at all levels. The group divides its effort among several projects pursued by different teams, some of whom commit substantial effort to group goals in the periods between meetings. At the most recent meeting, group members reported on the status of these projects before breaking up into task-specific teams.

A major goal is to develop a "Central Unifying Artefact" to serve as a Rosetta stone for interoperability. The working group pursues two different strategies, nexml and CDAO, that reflect different views on how best to tackle problems. Considerable progress was made over the summer on "nexml", an XML-based successor to NEXUS, a legacy format for generalized comparative data. Based on commitments expressed by developers, we anticipate that (limited) support for the emerging nexml standard will appear in the next few months in packages such as Phycas, GARLI, Mesquite, and HyPhy.

At the same time, a second team pursued the development of CDAO (Comparative Data Analysis Ontology), producing a demonstration project in semantic transformation, a concept glossary available from the NESCent web site, and a detailed research plan submitted to NIH for funding. At the meeting, this team evaluated Related Artefacts (database schemas, ontologies, file formats) and made plans for the first draft of CDAO, to be developed in the spring. This team greatly expanded the concept glossary, which now represents a unique resource for researchers and educators in evolutionary analysis.

Although these two efforts are (currently) non-overlapping, they are not mutually exclusive or incompatible. The XML project is a bottom-up approach seeking to identify the requirements of interoperable data representation by defining syntax, whereas the Ontology project works top-down, by first conceptualizing data and operations in terms of their semantics. Eventually these two projects will meet by means of a mapping between the formal ontology (OWL) terms and (XML) schema data types.

A third team used the meeting period to document use cases by attaching data files. This activity has the important goal of making the use-case list much more valuable as a resource for designing and testing interoperability technologies. A fourth team discussed the design requirements for an Evolutionary Model Description Language (Transition Model Language), a particularly sticky sub-task in developing a "Central Unifying Artefact". This team hopes to produce a preliminary design specification soon.

Scope of this report

The working group meets twice per year and issues a report on its activities a few weeks after each meeting. This report covers work accomplished subsequent to the previous report, including work done before, during and after the most recent meeting. That is, this Second Working Period report covers group activities from 22 June 2007 (when the report on the first working period was released) to the release date for this report, 21 December 2007. The report closes by outlining the strategy the working group will follow in the period up to May 2008, and the anticipated tangible outcomes from this.

Project leaders and participants

Development of standards must be largely community-driven, so the working group seeks to be widely inclusive. Organizers reach out to the wider community to ensure that interested individuals and organizations are aware of the latest developments and future plans, to invite key individuals to participate, and to request feedback and comments. This results in a fluctuating set of Participants and Colleagues whose involvement in various activities is summarized in the table below.

Participant | NESCent 2006 proposal | December 2006 hackathon | Prioritization exercise | May 2007 meeting | Oct 2007 NIH proposal | Nov 2007 meeting
Eisen, Jonathan     
Felsenstein, Joe     
Gupta, Gopal    
Holder, Mark  
Huelsenbeck, John     
Kosakovsky Pond, Sergei L.
Kumar, Sudhir    
Lewis, Paul O.    
Mackey, Aaron   
Maddison, David      
Maddison, Wayne      
Piel, Bill     
Pontelli, Enrico    
Qiu, Weigang  
Rambaut, Andrew      
Stoltzfus, Arlin¹
Swofford, David L.  
Thompson, Julie     
Vos, Rutger¹  
Xia, Xuhua    
Zmasek, Christian  

¹ Organizers

² Virtual participant (by video link)

Goals for the Second Working Period

The mandate of the working group is to address issues in interoperability. Although the original proposal (File:EvoInfoWorkingGroup proposal.pdf) was to focus narrowly on supporting current standards, the group has expanded its scope and efforts in additional directions. In particular, the group has undertaken two bolder and more forward-looking projects: developing a "Central Unifying Artefact" (a db schema, a file format, or an ontology) that should serve as the Rosetta stone for interoperability, and developing a transition model language. The next two subsections outline the specific goals within this scope, first for the period leading up to the second meeting and then for the second meeting itself.

Goals for the period prior to the second meeting

The goals along with specific aims for this period (see the First_Report_to_NESCent) were briefly as follows:

  • supporting current standards
    • collect examples of current alignment formats
    • assess usage in the research community
    • begin to develop formal grammars for current alignment formats
    • develop a semantic translation server using these grammars

Goals for the second meeting

The plan for the second meeting was to hear updates on group projects and related work, and then to break out into small teams to work on tasks chosen from this list:

  • steps in developing a central unifying artefact
    • concept glossary
    • documenting use cases
    • studying related artefacts
  • developing a next-generation XML-based data exchange format
  • developing a description language for transition models
  • outreach and education

Accomplishments of the Second Working Period

The accomplishments of the second working period include work done over the summer and fall outside of NESCent, the work done at the November meeting at NESCent, and a few items completed in the follow-up period ending 21 December, 2007.

Prior to the meeting

The goals of the working group were advanced considerably over the summer and fall, due to the efforts of individuals working on specific projects of interest to them. These outcomes were described in the presentations given on Day 1, starting with the brief presentation (Media:Stoltzfus.ppt) by Stoltzfus.

  • nexml. Rutger Vos and Wayne Maddison made extensive progress on the emerging nexml standard. See the Future_Data_Exchange_Standard page and Rutger's presentation (Media:Vos.ppt).
    • In his Google Summer of Code project, Jason Caravas worked on implementing a nexml parser for Bio::Phylo.
    • David Maddison, Wayne Maddison, Mark Holder, Peter Midford and Jeet Sukumaran generated requirements for the nexml standard.
    • The schema was further refined based on the lessons learned from the generated requirements and from actually working with nexml data files.
  • Ontology development. Pontelli, Gupta, and Stoltzfus developed plans for a major project to develop an ontology and computing environment for evolutionary analysis, including
  • Supporting current file formats
    • a bit of progress on format translation technology (see below, semantic transformation)
    • no progress was made on documentation of current formats
    • no progress was made on collecting further format examples
    • no progress was made on assessing usage in the research community
  • Transition Model Language. Progress was described in the talks by Peter Midford and Sergei Kosakovsky Pond
    • some initial review of related technologies (GAMS, MPL, MML, AMPL, AIMMS, XRate)
    • an attempt to extract design principles
    • no progress was made on a funding plan to complete this work

Summary of activities and discussion at Meeting 2

(A more verbose recording of the meeting proceedings is available elsewhere.)

Day 1

The meeting opened at 9:00 AM. Participants introduced themselves to Joe Felsenstein, who was present via video link. Arlin then recapped the work to date leading up to the meeting. The morning continued with formal presentations.

Participants then broke out into three groups:

  1. ontology: documentation of use cases (Weigang, Brian)
  2. ontology: study of related artefacts (Enrico, Julie, Arlin)
  3. substitution model language and XML (Rutger, David, Gopal, Peter, Sergei)

At 16:40 the meeting reconvened for brief reporting on the breakouts. The meeting adjourned at 17:40.

Day 2

The second day of Meeting 2 was devoted entirely to task-specific work sessions. The group reconvened at 14:40, when representatives of the different breakout groups reported on work done that day. Arlin summarized work on related artefacts documented on the Related_Artefacts wiki page. Many of the projects are hypothetical; overlaps and key features were detailed by Julie and Enrico.

Weigang took over at 15:05, reporting on use cases. Weigang showed new sample data sets:

  • two homeobox data sets
  • kinase data
  • the "whippo" data, with hemoglobin, cytochrome, and other mammalian data sets
  • alignments for Lyme disease sequences
  • broadly taxon-sampled intron data sets

He then followed some of the links to show some of the data sets. Discussion ensued about whether the sample data sets should be less "finished" (e.g., unaligned), so that whole use-case workflows could be tested from start to finish. Other possible data sets were also suggested, e.g. some of Sergei's data.

Peter then reported on the model language discussion, starting at 15:24. Work on the model language at this meeting consisted primarily of discussion; however, the model language wiki page has progressed with contributions by Hilmar and some reorganization and cleanups by Rutger.

Day 3

The morning was devoted to task-specific work sessions. Derrick Zwickl (NESCent) joined the team working on a transition model language.

The participants reconvened at 13:40 for final stand-ups. The focus was on tangible outcomes and plans to generate them.

  • nexml
    • Several software packages by evoinfo members will implement or extend nexml support (Mesquite, Bio::Phylo, Phycas, and HyPhy).
    • The generic class library support (java) will be improved. Round-trip read/write support will be implemented.
    • Code will move to a public repository on SourceForge.
    • A web presence under www.nexml.org will be hosted by NESCent.
    • Development will continue, guided by the to-do list on the nexml wiki page.
  • Use case documentation
    • Weigang's results
  • Transition model language
    • Propose meeting at NESCent to implement initial version of model description language (Hilmar, David, Mark, Derrick)
    • Sergei to document a collection of models that covers typical uses
    • David and Sergei to flesh out an initial design in NEXUS syntax, which could then be ported to other notations such as NeXML
  • Ontology
    • preliminary analysis of related artefacts is finished
    • we are ready to begin with the first implementation of CDAO
    • assemble a set of token examples (protein-coding genes with rich info; a species phylogeny with discrete and continuous data, links to taxonomy)
    • CDAO will link with ontologies including GO, SO; and with the NCBI taxonomy
    • many questions to address about how we link to other artefacts, handle sequences, express developmental relationships (e.g., amino acids to codons) and so on

Immediately following the meeting

  • nexml
    • Rutger set up the nexml.org site
    • Rutger put the code on SourceForge
  • Ontology
    • Arlin doubled the size of the Concept Glossary
    • Enrico, Arlin and Julie developed a visiting scientist proposal to continue work

Strategy and plans for the Third Working Period

This section and its subsections outline the general strategies the members of the subproject teams will follow in the period up to May 2008, and the practical work they will do within those strategies.

Next-generation file format standard

In general terms, the strategy for the NeXML initiative is to generate interest by providing code that processes nexml (SAX parser libraries, DOM data binding libraries) and implementations that use nexml (applications and web services), thereby provoking comments and discussion on the underlying assumptions of the schema, which is then improved and extended iteratively based on community input. An informal estimate is that generating enough interest would require support from:

  • 5 programming languages (java, perl, python, c++ and one of javascript (json) or ruby (bioruby)),
  • 5 applications or toolkits (Mesquite, hyphy, phycas, Bio::Phylo and one of ncl, bioruby, paup, mega, jebl, geneious, beast)
  • 3 web services (three of Tree of Life, cipres, TreeBASE, pPOD, MorphBank, MorphoBank)
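
As a rough illustration of the SAX-style parsing code mentioned above, the following Python sketch reads a small tree document with the standard library's `xml.sax` module. The element and attribute names (`trees`, `tree`, `node`, `edge`, `id`, `label`, `source`, `target`) and the helper names `TreeHandler` and `parse_tree` are invented for this example; they are assumptions and are not taken from the nexml schema.

```python
import xml.sax

# Hypothetical sample document; the element names are illustrative only
# and are NOT the actual nexml schema.
SAMPLE = b"""<trees>
  <tree id="t1">
    <node id="n1"/>
    <node id="n2" label="Homo sapiens"/>
    <node id="n3" label="Pan troglodytes"/>
    <edge source="n1" target="n2"/>
    <edge source="n1" target="n3"/>
  </tree>
</trees>"""


class TreeHandler(xml.sax.ContentHandler):
    """Collects node labels and source->target edges as SAX events stream by."""

    def __init__(self):
        super().__init__()
        self.labels = {}  # node id -> label (None for unlabeled nodes)
        self.edges = []   # (source id, target id) pairs

    def startElement(self, name, attrs):
        if name == "node":
            self.labels[attrs["id"]] = attrs.get("label")
        elif name == "edge":
            self.edges.append((attrs["source"], attrs["target"]))


def parse_tree(data):
    """Parse a bytes document and return the populated handler."""
    handler = TreeHandler()
    xml.sax.parseString(data, handler)
    return handler


handler = parse_tree(SAMPLE)
tips = [label for label in handler.labels.values() if label is not None]
print(tips)
print(handler.edges)
```

An event-driven parser of this kind never holds the whole document in memory, which may matter for large trees or matrices; a DOM data binding trades that economy for random access to the parsed structure.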

In addition, nexml-related outreach must include a credible web presence and publication of the standard in a bioinformatics-related open-access journal. The web presence is supported by NESCent and will continue to develop; the publication is intended to evolve from the lengthy nexml wiki page.

Transition model language

A problem in this subproject is that substitution model descriptions can potentially become arbitrarily nested and complex, so more detailed analysis and experimentation are necessary in order to identify requirements. Sergei, David Swofford and Derrick Zwickl will work on a spec for model descriptions to be implemented first in GARLI, then in a future version of PAUP* and HyPhy. This initial spec will likely be written in a flat text format (pseudo-code), which will then be converted to some more formal notation (e.g. xml, xrate, rdf).
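
To make the nesting problem concrete, here is a minimal Python sketch of how even a simple nucleotide model might be described as nested data and then turned into a rate matrix. It uses the standard HKY85 model, in which the rate from state i to state j is the target frequency of j, multiplied by kappa when the change is a transition. The dictionary keys (`states`, `frequencies`, `rate_classes`, `parameters`), the function names and the numeric values are all invented for illustration; they are assumptions and do not reflect any specification drafted by the group.

```python
# A hypothetical nested description of the HKY85 nucleotide model.
# The keys and values are invented for this sketch and do not come
# from any working-group specification.
HKY85 = {
    "name": "HKY85",
    "states": ["A", "C", "G", "T"],
    "frequencies": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "rate_classes": {
        "transition": {"pairs": [("A", "G"), ("C", "T")], "rate": "kappa"},
        "transversion": {"rate": 1.0},  # default class for all other pairs
    },
    "parameters": {"kappa": 2.0},
}


def resolve_rate(model, value):
    """Look up a named parameter, or pass a numeric rate through."""
    if isinstance(value, str):
        return model["parameters"][value]
    return float(value)


def rate_matrix(model):
    """Build the instantaneous rate matrix Q as a dict of dicts."""
    states = model["states"]
    freqs = model["frequencies"]
    classes = model["rate_classes"]
    transitions = {frozenset(p) for p in classes["transition"]["pairs"]}
    q = {i: {} for i in states}
    for i in states:
        for j in states:
            if i == j:
                continue
            if frozenset((i, j)) in transitions:
                rate = resolve_rate(model, classes["transition"]["rate"])
            else:
                rate = resolve_rate(model, classes["transversion"]["rate"])
            # Off-diagonal entry: exchange rate times target frequency.
            q[i][j] = rate * freqs[j]
        # Diagonal entry makes each row sum to zero.
        q[i][i] = -sum(q[i].values())
    return q


Q = rate_matrix(HKY85)
for state in HKY85["states"]:
    print(state, Q[state])
```

Even this small case needs named parameters, rate classes and a default rule. Codon models, mixtures and partitioned models would add further levels, which is the kind of requirement the planned spec has to capture.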

Ontology

The ontology development stages (use cases, concept glossary, related artefacts, implementation, evaluation) should be undertaken iteratively. We feel that we now are ready to undertake the first (preliminary, rough) iteration of the implementation step, as described below. Meanwhile, work will continue on other steps in development.

Use Case Documentation

Further effort will be devoted to documenting more use cases (ideally, each of them) with

  • files containing instance data (inputs and outputs)
  • a UML use-case diagram

Concept Glossary

Work on the glossary will continue with

  • addition of more definitions and new terms
  • attempts to advertise the glossary in order to get feedback
  • exploring technologies to maintain an active glossary linked to CDAO as it emerges

CDAO Implementation

We will implement an initial version of CDAO as an OWL ontology:

  • Seek funding for a working meeting at NESCent
    • to take place in spring
    • will involve Enrico, Julie, Arlin, and two grads or post-docs
  • Collect a token set of example data to cover basic uses
    • one molecular example with protein-coding gene family data
    • one systematics example with species having discrete and continuous characters
  • Implement basic concepts and relations of CDAO using OWL-DL
  • Evaluate the implementation with regard to the token data, and other data
  • Prepare a manuscript describing the results
  • Prepare a poster or slide presentation for use at meetings

Semantic Transformation

Work will continue on the semantic transformation project that Brian DeVries described, including

  • translation of a more extensive set of alignment file formats
  • TreeBASE input file processing (working towards a validating NEXUS parser with error-recognition)

Anticipated Outcomes and Products

This section and its subsections outline the deliverables the group expects to produce by following the strategies described in the preceding section.

Next-generation file format standard

The nexml team anticipates the following deliverables for the coming period:

  • A web presence at http://www.nexml.org with instance document validation, binding downloads and documentation.
  • A set of generic java class libraries (Rutger, Wayne, Peter) to facilitate more complete nexml writing (currently only reading is supported). This will be done by designing simple interfaces which users of the class libraries (e.g. Mesquite) need to implement in their objects (Wayne, Peter) so that the class libraries can fetch the required attributes and child elements of the objects and serialize them to xml.
  • A more complete Bio::Phylo nexml parser, to facilitate parsing of networks and character matrices, and develop xml writability by expanding the Bio::Phylo::XMLWritable class, along the same lines as the java interfaces.
  • Incorporate the python nexml parser libraries (for phycas) in the nexml svn repository (Jeet).
  • Clarify the output of the online nexml validator (http://nexml-dev.nescent.org/validator.cgi), formalize difference between grammar-based and rule-based validation (Rutger).
  • Explore C(++)-parsing or data binding, for example for HyPhy (Sergei).
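
The interface-based approach to writing described above for the java class libraries can be sketched as follows, in Python rather than Java for brevity. The interface name `Writable`, its methods (`xml_tag`, `xml_attributes`, `xml_children`), the example classes and the element names `taxa` and `taxon` are all hypothetical; they are assumptions made for this illustration and are not the planned library API or the nexml schema.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class Writable(ABC):
    """Hypothetical interface that a host application's objects implement."""

    @abstractmethod
    def xml_tag(self):
        """Return the element name for this object."""

    @abstractmethod
    def xml_attributes(self):
        """Return a dict mapping attribute names to string values."""

    def xml_children(self):
        """Return child Writable objects (none by default)."""
        return []


def to_element(obj):
    """Library side: build an element using only the Writable interface."""
    element = ET.Element(obj.xml_tag(), obj.xml_attributes())
    for child in obj.xml_children():
        element.append(to_element(child))
    return element


def serialize(obj):
    """Serialize any Writable object (and its children) to an XML string."""
    return ET.tostring(to_element(obj), encoding="unicode")


# Application side: existing classes adopt the interface.
class Taxon(Writable):
    def __init__(self, ident, label):
        self.ident, self.label = ident, label

    def xml_tag(self):
        return "taxon"

    def xml_attributes(self):
        return {"id": self.ident, "label": self.label}


class TaxonSet(Writable):
    def __init__(self, ident, taxa):
        self.ident, self.taxa = ident, taxa

    def xml_tag(self):
        return "taxa"

    def xml_attributes(self):
        return {"id": self.ident}

    def xml_children(self):
        return self.taxa


taxa = TaxonSet("ts1", [Taxon("t1", "Homo sapiens"), Taxon("t2", "Pan troglodytes")])
print(serialize(taxa))
```

The point of the design is that the serializer knows nothing about the application's classes: it only calls the interface methods, so an application can add writing support without changing its own data model.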

Transition model language

As per the informal goals outlined on the model language wiki page and in follow-up communication subsequent to the second meeting, anticipated outcomes and results include:

  • Sergei plans to generate pseudo-code (i.e. English statements), which might then be mapped to whatever notation the group deems suitable. He will start by compiling a list of examples on the wg-evoinfo wiki (just models of substitution for now), all drawn from the literature. The expectation is that we can probably sketch out a feature set by next May.
  • David Swofford plans to work with Derrick Zwickl on something that would be implemented in Garli and (a future version of) PAUP*.

Ontology

The outcomes anticipated in the next working period are:

  • Ontology implementation and description (Enrico, Julie, Arlin)
    • initial version of the CDAO ontology in OWL-DL
    • results of evaluating the initial version of CDAO
    • a manuscript or presentation describing the CDAO, for public use
  • Semantic transformation (Brian, Gopal, Enrico, Vien)
    • a web server providing translation for PHYLIP, NEXUS, and at least one other format
    • initial version of a validating error-recognizing parser for TreeBASE input (NEXUS) files
  • Use-case documentation (Weigang)
    • instantiate additional use cases with data files