Third Report to NESCent

From Evolutionary Informatics Working Group

Revision as of 18:08, 29 May 2008

21 May, 2008

Group leaders: Arlin Stoltzfus and Rutger Vos

Executive summary

DONE

The Evolutionary Informatics working group held its third meeting at NESCent headquarters in Durham, NC, May 19 to 22, 2008. The mandate of the working group is to lower the barrier for the broader application of evolutionary methods by enabling the coordination and inter-operation of data and software resources at all levels. The group divides its effort among several projects pursued by different teams, some of whom commit substantial effort to group goals in the periods between meetings. At the most recent meeting, group members reported on the status of these projects before breaking up into task-specific teams.

A major goal is to develop a "Central Unifying Artefact" to serve as a Rosetta stone for interoperability. The working group pursues two subprojects, nexml and CDAO, that comprise mutually enhancing approaches and views on how best to improve interoperability, respectively in terms of syntax and semantics.

Considerable progress was made over the winter on nexml, an XML-based successor to NEXUS, a legacy format for generalized comparative data. Read/write support has been implemented in Mesquite, DAMBE, Bio::Phylo and pyNexml; and development has started for i/o support for paup, BioPerl and phylobase. In addition, based on commitments expressed by developers, we anticipate that (limited) support for the emerging nexml standard will appear in the next few months in packages such as Phycas, GARLI and HyPhy.

At the same time, a second team pursued the development of CDAO (Comparative Data Analysis Ontology), producing a demonstration project in semantic transformation, a concept glossary available from the NESCent web site, a first draft of an OWL ontology, and a detailed research plan submitted to NIH for funding. At the meeting, this team evaluated related artefacts (database schemas, ontologies, file formats) and made plans for file format and data source mappings, to be developed in the summer.

Although these two efforts currently overlap only partially, they are not mutually exclusive or incompatible. The XML project is a bottom-up approach seeking to identify the requirements of interoperable data representation by defining syntax, whereas the Ontology project works top-down, by first conceptualizing data and operations in terms of their semantics. Eventually these two projects will meet by means of a mapping between the formal ontology (OWL) terms and (XML) schema data types, a process that began during the third meeting.

A third team discussed the design requirements for an Evolutionary model description language, a particularly sticky sub-task in developing a "Central Unifying Artefact". This team hopes to produce a preliminary design specification soon, based on the XML representation used by BEAST, but adapted to conform to NeXML conventions.

Scope of this report

ADD RELEASE DATE

The working group meets twice per year and issues a report on its activities a few weeks after each meeting. This report covers work accomplished subsequent to the previous reports, including work done before, during and after the most recent meeting. That is, this Third Working Period report covers group activities from 21 December 2007 (when the report on the second working period was released) to the release date for this report, sometime soon. The report closes by outlining the strategy the working group will follow in the period up to November 2008, and the anticipated tangible outcomes from this.

Project leaders and participants

DONE

Organizers reach out to the wider community to ensure that interested individuals and organizations are aware of the latest developments and future plans, to invite key individuals to participate, and to request feedback and comments. This results in a fluctuating set of Participants and Colleagues whose involvement in various activities is summarized in the table below. Jim Leebens-Mack (MIAPA project) was a guest at our May meeting, as were graduate students Brandon Chisham and Francisco Prosdocimi. We heard talks from NESCent staff Jim Balhoff (Phenoscape project), Ryan Scherle (Dryad project), and Hilmar Lapp (PhyloWS project).

Participant | NESCent 2006 proposal | December 2006 hackathon | Prioritization exercise | May 2007 meeting | Oct 2007 NIH proposal | Nov 2007 meeting | Mar 2008 vis. sci. | May 2008 meeting
Chisham, Brandon       
Eisen, Jonathan       
Felsenstein, Joe       
Gupta, Gopal      
Holder, Mark    
Huelsenbeck, John       
Kosakovsky Pond, Sergei L.  
Kumar, Sudhir      
Leebens-Mack, Jim        
Lewis, Paul O.      
Mackey, Aaron    
Maddison, David        
Maddison, Wayne        
Piel, Bill       
Pontelli, Enrico    
Prosdocimi, Francisco       
Qiu, Weigang   
Rambaut, Andrew        
Stoltzfus, Arlin¹
Swofford, David L.   
Thompson, Julie     
Vos, Rutger¹   
Xia, Xuhua     
Zmasek, Christian    

¹ Organizers

² Virtual participant (by video link)

Goals for the Third Working Period

DONE

The mandate of the working group is to improve interoperability in evolutionary analysis. Although the original proposal was to focus narrowly on supporting current standards, the group has expanded its scope and efforts in additional directions. In particular, the group has undertaken two bolder and more forward-looking projects: developing a "Central Unifying Artefact" (a db schema, a file format, or an ontology) that should serve as the Rosetta stone for interoperability, and developing a transition model language. The next two subsections outline the specific goals within this scope, for the period leading up to the third meeting and for the third meeting itself, respectively.

Goals for the period prior to the third meeting

DONE

The goals along with specific aims for this period - as described in greater detail in the second report - were briefly as follows:

  • developing a general ontology for comparative evolutionary analysis
    • Develop initial version of the CDAO ontology in OWL-DL
    • Evaluate the initial version of CDAO
    • Write a manuscript or presentation describing the CDAO, for public use
    • Implement semantic transformation (Brian, Gopal, Enrico, Vien)
      • a web server providing translation for PHYLIP, NEXUS, and at least one other format
      • initial version of a validating error-recognizing parser for TreeBase input (NEXUS) files
    • Use-case documentation (Weigang)
      • instantiate additional use cases with data files
  • developing a transition model language
    • Generate pseudo-code (i.e. English statements) that can then be mapped to whatever notation the group deems suitable. Start by making a list of examples (just models of substitution for now), all from literature on the wg-evoinfo wiki. The expectation is that we can probably sketch out a feature set by next May.
    • David Swofford planned to work with Derrick Zwickl on something that would be implemented in Garli and (a future version of) PAUP*.
  • future data exchange syntax standard
    • Create a web presence at http://www.nexml.org with instance document validation, binding downloads and documentation.
    • Develop a set of generic Java class libraries (Rutger, Wayne, Peter) to facilitate more complete nexml writing (currently only reading is supported). This will be done by designing simple interfaces that users of the class libraries (e.g. Mesquite) implement in their objects (Wayne, Peter), so that the class libraries can fetch the required attributes and child elements of the objects and serialize them to XML.
    • Develop a more complete Bio::Phylo nexml parser, to facilitate parsing of networks and character matrices, and develop xml writability by expanding the Bio::Phylo::XMLWritable class, along the same lines as the java interfaces.
    • Incorporate the python nexml parser libraries (for phycas) in the nexml svn repository (Jeet).
    • Clarify the output of the online nexml validator (http://nexml.org/nexml/validator), formalize difference between grammar-based and rule-based validation (Rutger).
    • Explore C(++)-parsing or data binding, for example for HyPhy (Sergei).
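
The interface-based approach to nexml writing described in the list above can be illustrated with a minimal sketch. It is written in Python rather than Java for brevity, and every name in it (`XmlSerializable`, `to_element`, the toy `Tree` and `Node` classes, and the `tree`/`node` element names) is a hypothetical placeholder, not part of the nexml class libraries, Mesquite, or the nexml schema. The point is only the division of labour: application objects expose a tag, attributes and children, and the library serializes them without knowing anything else about them.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Mapping, Protocol


class XmlSerializable(Protocol):
    """Hypothetical interface that application objects would implement."""

    def xml_tag(self) -> str: ...
    def xml_attributes(self) -> Mapping[str, str]: ...
    def xml_children(self) -> Iterable["XmlSerializable"]: ...


def to_element(obj: XmlSerializable) -> ET.Element:
    """Library side: build an element using only the interface methods."""
    element = ET.Element(obj.xml_tag(), dict(obj.xml_attributes()))
    for child in obj.xml_children():
        element.append(to_element(child))
    return element


class Node:
    """Application side: a toy tree node (illustrative only)."""

    def __init__(self, node_id, label=None):
        self.node_id = node_id
        self.label = label

    def xml_tag(self):
        return "node"

    def xml_attributes(self):
        attrs = {"id": self.node_id}
        if self.label is not None:
            attrs["label"] = self.label
        return attrs

    def xml_children(self):
        return []


class Tree:
    """Application side: a toy tree holding a flat list of nodes."""

    def __init__(self, tree_id, nodes):
        self.tree_id = tree_id
        self.nodes = nodes

    def xml_tag(self):
        return "tree"

    def xml_attributes(self):
        return {"id": self.tree_id}

    def xml_children(self):
        return self.nodes


tree = Tree("t1", [Node("n1", "Homo_sapiens"), Node("n2")])
serialized = ET.tostring(to_element(tree), encoding="unicode")
print(serialized)
```

Under this design the serializer never needs to change when an application adds a new object type; the new type only has to implement the three methods.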

Goals for the third meeting

DONE

The purpose of the third meeting was to advance the goals of the working group by focusing on two complementary projects to develop a "central unifying artefact", the nexml project (www.nexml.org) and the CDAO project (www.evolutionaryontology.org).

The plan for the meeting was to hear talks about progress on these two projects, followed by discussion and planning on how to use our time at NESCent to build on this progress.

The remainder of the meeting would be spent on carrying out those plans in breakout groups, interspersed with talks on related topics: the MIAPA standard, phenotype ontologies, and data repositories.

Accomplishments of the Third Working Period

Rutger's slides for May '08 old material, to be edited

The accomplishments of the second working period include work done over the summer and fall outside of NESCent, the work done at the November meeting at NESCent, and a few items completed in the follow-up period ending 21 December, 2007.

Prior to the meeting

NEXML IS DONE, OTHERS TO DO

The goals of the working group were advanced considerably over the summer and fall, due to the efforts of individuals working on specific projects of interest to them. These outcomes were described in the presentations given on Day 1, starting with the brief presentation (Media:Stoltzfus.ppt) by Stoltzfus.

  • nexml: The developers made extensive progress on the emerging nexml standard. See the wiki page and Rutger's presentation.
    • Peter Midford and Rutger Vos implemented nexml i/o for Mesquite using xml beans
    • Rutger Vos implemented nexml writing for Bio::Phylo (reading was already implemented by Jason Caravas in the previous working period)
    • Xuhua Xia implemented nexml i/o for DAMBE
    • Jeet Sukumaran implemented nexml i/o for pyNexml
    • Rutger Vos refined the schema for easier usage with c++ xml beans, and to implement requirements generated during discussions and outreach
  • Ontology development: Pontelli, Gupta, and Stoltzfus developed plans for a major project to develop an ontology and computing environment for evolutionary analysis, including
  • Supporting current file formats
    • a bit of progress on format translation technology (see below, semantic transformation)
    • no progress was made on documentation of current formats
    • no progress was made on collecting further format examples
    • no progress was made on assessing usage in the research community
  • Transition Model Language. Progress was described in the talks by Peter Midford and Sergei Kosakovsky Pond
    • some initial review of related technologies (GAMS, MPL, MML, AMPL, AIMMS, XRate)
    • an attempt to extract design principles
    • no progress was made on a funding plan to complete this work

Summary of activities and discussion at Meeting 3

DONE

Day 1

The meeting opened at 9:00AM with remarks from Arlin Stoltzfus and Jeff Sturkey, and continued with formal presentations:

  • The NESCent EvoInfo working group: Progress, Plans and Prospects (A. Stoltzfus) 20 min
  • Current state of the NeXML project (Rutger Vos) 40 min (slides)
  • Implementation of NEXML in DAMBE (Xuhua Xia) 20 min
  • PhyloWS project (Hilmar Lapp) 20 min
  • First Draft of the Comparative Data Analysis Ontology (Enrico Pontelli) 40 min

After these presentations, Arlin Stoltzfus led a discussion on the group's near-term prospects. At 4:00PM the meeting broke into groups to develop specific plans for the week. At 5:00PM the meeting reconvened for stand-ups in which each group briefed the others on its plans. The meeting then adjourned for a group dinner.

Day 2

The second day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for two talks:

  • Jim Leebens-Mack, U. Georgia (MIAPA) (20 min)
  • Repositories, data standards, and data reuse (Ryan Scherle, NESCent)

At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.

Day 3

The third day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for three talks:

  • Phenoscape Project (Jim Balhoff, NESCent)
  • Reminder to think about the future of the working group (Arlin)
  • Using ontologies to formalize comparative data on worm development (Arlin Stoltzfus) 20 min

At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.

Day 4

The first half of the final day was devoted to task-specific work sessions. After lunch, at 1:20PM the group convened for final reports on the week's progress and discussion on plans for the coming working period.

Immediately following the meeting

THESE ARE STILL COMING IN, E.G. CARROTBASE

  • nexml
    • Rutger set up the nexml.org site
    • Rutger put the code on SourceForge
  • Ontology
    • Arlin doubled the size of the Concept Glossary
    • Enrico, Arlin and Julie developed a visiting scientist proposal to continue work

Strategy and plans for the Fourth Working Period

old material, to be edited This section and its subsections outline the general strategies the members of the subproject teams will follow in the period up to May 2008, and the practical work they will do within those strategies.

Next-generation file format standard

old material, to be edited In general terms, the strategy for the nexml initiative is to generate interest by providing code that processes nexml (SAX parser libraries, DOM data binding libraries) and implementations that use nexml (applications and web services). This should provoke comments and discussion on the underlying assumptions of the schema, which is then improved and extended iteratively based on community input. By an informal estimate, generating enough interest would require support from:

  • 5 programming languages (java, perl, python, c++ and one of javascript (json) or ruby (bioruby)),
  • 5 applications or toolkits (mesquite, hyphy, phycas, Bio::Phylo and one of ncl, bioruby, paup, mega, jebl, geneious, beast)
  • 3 web services (three of tree of life, cipres, treebase, ppod, morphbank, morphobank)

In addition, nexml-related outreach must include a credible web presence and publication of the standard in a bioinformatics-related open-access journal. The web presence is supported by NESCent and will continue to develop; the publication is intended to evolve from the lengthy nexml wiki page.

Transition model language

old material, to be edited A problem in this subproject is that substitution model descriptions can become arbitrarily nested and complex, so more detailed analysis and experimentation are needed to identify requirements. Sergei, David Swofford and Derrick Zwickl will work on a spec for model descriptions to be implemented first in Garli, then in a future version of PAUP* and HyPhy. This initial spec will likely be written in a flat text format (pseudo-code), which will then be converted to a more formal notation (e.g. xml, xrate, rdf).
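
To make the nesting problem concrete, the following Python sketch converts a small nested model description into XML by recursion. It is an illustration under stated assumptions only: the dictionary layout, the `model_to_xml` function and the `model`/`parameter` element names are invented for this example and do not represent the spec under discussion or any existing format (BEAST, XRate, nexml). The only substantive facts relied on are that HKY85 has a transition/transversion parameter (kappa) and that JC69 has no free rate parameters.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested description: a mixture of two substitution models.
model = {
    "type": "mixture",
    "components": [
        {"type": "substitution", "name": "HKY85", "parameters": {"kappa": 2.0}},
        {"type": "substitution", "name": "JC69", "parameters": {}},
    ],
}


def model_to_xml(description):
    """Recursively turn a nested model description into an XML element."""
    element = ET.Element("model", {"type": description["type"]})
    if "name" in description:
        element.set("name", description["name"])
    # Parameters become child elements of the model they belong to.
    for key, value in description.get("parameters", {}).items():
        ET.SubElement(element, "parameter", {"name": key, "value": str(value)})
    # Components may themselves contain components, to any depth.
    for component in description.get("components", []):
        element.append(model_to_xml(component))
    return element


model_root = model_to_xml(model)
print(ET.tostring(model_root, encoding="unicode"))
```

Because a component can itself be a mixture, any flat notation has to encode this recursion somehow, which is one reason the requirements are hard to pin down in advance.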

Ontology

old material, to be edited The ontology development stages (use cases, concept glossary, related artefacts, implementation, evaluation) should be undertaken iteratively. We feel that we now are ready to undertake the first (preliminary, rough) iteration of the implementation step, as described below. Meanwhile, work will continue on other steps in development.

Use Case Documentation

old material, to be edited Further effort will be devoted to documenting more use cases (ideally, each of them) with

  • files containing instance data (inputs and outputs)
  • a UML use-case diagram

Concept Glossary

old material, to be edited Work on the glossary will continue with

  • addition of more definitions and new terms
  • attempts to advertise the glossary in order to get feedback
  • exploring technologies to maintain an active glossary linked to CDAO as it emerges

CDAO Implementation

old material, to be edited We will implement an initial version of CDAO as an OWL ontology:

  • Seek funding for a working meeting at NESCent
    • to take place in spring
    • will involve Enrico, Julie, Arlin, and two grads or post-docs
  • Collect a token set of example data to cover basic uses
    • one molecular example with protein-coding gene family data
    • one systematics example with species having discrete and continuous characters
  • Implement basic concepts and relations of CDAO using OWL-DL
  • Evaluate the implementation with regard to the token data, and other data
  • Prepare a manuscript describing the results
  • Prepare a poster or slide presentation for use at meetings

Semantic Transformation

old material, to be edited Work will continue on the semantic transformation project that Brian DeVries described, including

  • translation of a more extensive set of alignment file formats
  • TreeBase input file processing (working towards a validating NEXUS parser with error-recognition)

Anticipated Outcomes and Products

old material, to be edited This section and its subsections outline which deliverables the group expects to produce by following the strategies as outlined in the preceding section.

Next-generation file format standard

old material, to be edited The nexml team anticipates the following deliverables for the coming period:

  • A web presence at http://www.nexml.org with instance document validation, binding downloads and documentation.
  • A set of generic java class libraries (Rutger, Wayne, Peter) to facilitate more complete nexml writing (currently only reading is supported). This will be done by designing simple interfaces which users of the class libraries (e.g. mesquite) need to implement in their objects (Wayne, Peter) so that the class libraries can fetch the required attributes and child elements of the objects and serialize them to xml.
  • A more complete Bio::Phylo nexml parser, to facilitate parsing of networks and character matrices, and develop xml writability by expanding the Bio::Phylo::XMLWritable class, along the same lines as the java interfaces.
  • Incorporate the python nexml parser libraries (for phycas) in the nexml svn repository (Jeet).
  • Clarify the output of the online nexml validator (http://nexml-dev.nescent.org/validator.cgi), formalize difference between grammar-based and rule-based validation (Rutger).
  • Explore C(++)-parsing or data binding, for example for HyPhy (Sergei).
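
One common reading of the difference between grammar-based and rule-based validation mentioned in the list above is sketched here in Python. Grammar-based validation checks structure (which elements may nest where, which attributes are required), whereas rule-based validation checks constraints that span the document, such as whether an identifier reference resolves. The toy `GRAMMAR` table, the `doc`/`otu`/`node` element names and both functions are assumptions made for illustration; they are not the nexml schema and not the logic of the online validator.

```python
import xml.etree.ElementTree as ET

# Toy "grammar": allowed child tags and required attributes per element.
GRAMMAR = {
    "doc": {"children": {"otu", "node"}, "required": set()},
    "otu": {"children": set(), "required": {"id"}},
    "node": {"children": set(), "required": {"id"}},
}


def grammar_errors(element):
    """Structural check: known tags, legal nesting, required attributes."""
    spec = GRAMMAR.get(element.tag)
    if spec is None:
        return [f"unknown element <{element.tag}>"]
    errors = []
    for name in sorted(spec["required"] - set(element.attrib)):
        errors.append(f"<{element.tag}> lacks required attribute '{name}'")
    for child in element:
        if child.tag not in spec["children"]:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            errors.extend(grammar_errors(child))
    return errors


def rule_errors(root):
    """Rule check: every node's 'otu' reference must match an otu id."""
    otu_ids = {otu.get("id") for otu in root.iter("otu")}
    return [
        f"node '{node.get('id')}' refers to unknown otu '{node.get('otu')}'"
        for node in root.iter("node")
        if node.get("otu") is not None and node.get("otu") not in otu_ids
    ]


SAMPLE = """
<doc>
  <otu id="o1"/>
  <node id="n1" otu="o1"/>
  <node id="n2" otu="o9"/>
</doc>
"""

sample_root = ET.fromstring(SAMPLE)
grammar_result = grammar_errors(sample_root)  # structure is fine
rule_result = rule_errors(sample_root)        # but one reference dangles
print(grammar_result, rule_result)
```

The sample document passes the structural check yet fails the reference rule, which is the kind of case where reporting the two validation stages separately helps the user.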

Transition model language

old material, to be edited As per the informal goals outlined on the model language wiki page and in follow-up communication subsequent to the second meeting, anticipated outcomes and results include:

  • Sergei plans to generate pseudo-code (i.e. English statements) that can then be mapped to whatever notation the group deems suitable. He will start by making a list of examples (just models of substitution for now), all from literature on the wg-evoinfo wiki. The expectation is that we can probably sketch out a feature set by next May.
  • David Swofford plans to work with Derrick Zwickl on something that would be implemented in Garli and (a future version of) PAUP*.

Ontology

old material, to be edited

The outcomes anticipated in the next working period are:

  • Ontology implementation and description (Enrico, Julie, Arlin)
    • initial version of the CDAO ontology in OWL-DL
    • results of evaluating the initial version of CDAO
    • a manuscript or presentation describing the CDAO, for public use
  • Semantic transformation (Brian, Gopal, Enrico, Vien)
    • a web server providing translation for PHYLIP, NEXUS, and at least one other format
    • initial version of a validating error-recognizing parser for TreeBase input (NEXUS) files
  • Use-case documentation (Weigang)
    • instantiate additional use cases with data files