Third Report to NESCent

From Evolutionary Informatics Working Group

21 May, 2008

Group leaders: Arlin Stoltzfus and Rutger Vos

Executive summary

The Evolutionary Informatics working group held its third meeting at NESCent headquarters in Durham, NC, May 19 to 22, 2008. The mandate of the working group is to lower the barrier for the broader application of evolutionary methods by enabling the coordination and inter-operation of data and software resources. During meetings, and between meetings, the group divides its effort among several projects. At this meeting, group members reported on the status of major projects before breaking up into task-specific teams to continue their work. Before departing, the group took stock of its accomplishments and developed a plan for the next working period.

A major goal is to develop a "Central Unifying Artefact" to serve as a Rosetta stone for data interoperability. The working group pursues this goal via two distinct strategies: NeXML and CDAO. Both projects have been productive: each has generated an open-source working draft of an interoperability artefact; each project has attracted interest from other scientists; and each project team is writing a manuscript to be submitted for publication in the next few months.

As hoped, NeXML appears poised to become the next-generation standard data exchange format in phylogenetics. NeXML IO (input/output) support (i.e., support for reading and writing nexml files) has been implemented in DAMBE as well as Mesquite, Bio::Phylo and pyNexml, and has been initiated for PAUP* and Phylobase. At the meeting, nexml support was implemented in R and BioPerl. Based on commitments expressed by developers, we anticipate support for nexml IO in the next few months in packages such as Phycas, GARLI and HyPhy.

While the nexml project is a bottom-up approach focused on a syntax-based view of the interoperability requirements of data representation, the CDAO (Comparative Data Analysis Ontology) project begins at a high level of abstraction, by conceptualizing data and operations in terms of their semantics, expressed formally in the OWL-DL ontology language. Partly through NESCent sponsorship of visiting scientists in March, 2008, working group members completed an initial version of CDAO. At the third meeting, working group members developed a mapping between nexml and CDAO, so as to facilitate inter-conversion using knowledge engineering tools.

A third team discussed the design requirements for an evolutionary model description language (the transition model language).

At the close of the meeting, and in the follow-up period, the group considered its trajectory and developed plans. Interoperability is a community-wide phenomenon, requiring "buy-in" from the research community. Given its progress in developing artefacts, the working group is now in a position to adopt a more outward focus. This includes publishing descriptions of the artefacts as well as applying them to needs of the research community.

Accordingly, we developed plans for two projects that align with community needs. One such need, described by guest speaker Dr. James Leebens-Mack, is for a minimal reporting standard, tentatively named "MIAPA" (Minimal Information for a Phylogenetic Analysis). The working group developed a plan (described in Supporting_MIAPA) that would combine a community exercise in annotation with development of an ontology and of user-oriented software to generate MIAPA-compliant reports. The second project is based on the idea that, just as the combination of SO (Sequence Ontology) and GFF (General Feature Format) provided a basis for interoperability among gene-entry databases, CDAO and nexml can provide for coordination and interoperability among a growing number of data resources for phylogenetics (e.g., TreeBASE, TreeFam, HOVERGEN, etc.). The project is tentatively named CarrotBase because its successful outcome would create a reward (a "carrot") for database users and developers to adopt interoperability standards. In addition, the CDAO development group plans to submit a major proposal for external funding during the coming work period.

Scope of this report

The working group meets twice per year and issues a report on its activities a few weeks after each meeting. This report covers work accomplished subsequent to the previous reports, including work done before, during and after the most recent meeting. That is, this Third Working Period report covers group activities from 21 December 2007 (when the report on the second working period was released) to the release date for this report, 11 July, 2008. The report closes by outlining the strategy the working group will follow in the period up to November 2008, and the anticipated tangible outcomes from this.

Project leaders and participants

Organizers reach out to the wider community to ensure that interested individuals and organizations are aware of the latest developments and future plans, to invite key individuals to participate, and to request feedback and comments. This results in a fluctuating set of Participants and Colleagues whose involvement in various activities is summarized in the table below. Jim Leebens-Mack (MIAPA project) was a guest at our May meeting, as were graduate students Brandon Chisham and Francisco Prosdocimi. We heard talks from NESCent staff Jim Balhoff (Phenoscape project), Ryan Scherle (Dryad project), and Hilmar Lapp (PhyloWS project).

Participant | NESCent 2006 proposal | December 2006 hackathon | Prioritization exercise | May 2007 meeting | Oct 2007 NIH proposal | Nov 2007 meeting | Mar 2008 vis. sci. | May 2008 meeting
Chisham, Brandon       
Eisen, Jonathan       
Felsenstein, Joe       
Gupta, Gopal      
Holder, Mark    
Huelsenbeck, John       
Kosakovsky Pond, Sergei L.  
Kumar, Sudhir      
Leebens-Mack, Jim        
Lewis, Paul O.      
Mackey, Aaron    
Maddison, David        
Maddison, Wayne        
Piel, Bill       
Pontelli, Enrico    
Prosdocimi, Francisco       
Qiu, Weigang   
Rambaut, Andrew        
Stoltzfus, Arlin¹
Swofford, David L.   
Thompson, Julie     
Vos, Rutger¹   
Xia, Xuhua     
Zmasek, Christian    

¹ Organizers

² Virtual participant (by video link)

Goals for the Third Working Period

The mandate of the working group is to improve interoperability in evolutionary analysis. Although the original proposal had a narrow focus of supporting existing technologies better, the group has adopted the more forward-looking goal of developing a "Central Unifying Artefact" to serve as the Rosetta stone for interoperability and the development of a transition model language. The next two subsections outline what the specific goals within this scope were, for the period leading up to the third meeting and for the third meeting itself.

Goals for the period prior to the third meeting

The goals along with specific aims for this period - as described in greater detail in the second report - were briefly as follows:

  • developing a general ontology for comparative evolutionary analysis
    • Develop initial version of the CDAO ontology in OWL-DL
    • Evaluate the initial version of CDAO and report the results
    • Write a manuscript or presentation describing the CDAO, for public use
    • Implement semantic transformation (Brian, Gopal, Enrico, Vien)
      • a web server providing translation for PHYLIP, NEXUS, and at least one other format
      • initial version of a validating error-recognizing parser for TreeBASE input (NEXUS) files
    • Use-case documentation (Weigang)
      • instantiate additional use cases with data files
  • developing a transition model language
    • Generate pseudo-code (i.e., English statements) which might then be mapped to whatever notation the group deems suitable. Start by making a list of examples (just models of substitutions for now), all from literature on the wg-evoinfo wiki. The expectation is that we can probably sketch out a feature set by next May.
    • David Swofford planned to work with Derrick Zwickl on something that would be implemented in Garli and (a future version of) PAUP*.
  • future data exchange syntax standard
    • Create a web presence at http://www.nexml.org with instance document validation, binding downloads and documentation.
    • Develop a set of generic java class libraries (Rutger, Wayne, Peter) to facilitate more complete nexml writing (currently only reading is supported). This will be done by designing simple interfaces which users of the class libraries (e.g. Mesquite) need to implement in their objects (Wayne, Peter) so that the class libraries can fetch the required attributes and child elements of the objects and serialize them to xml.
    • Develop a more complete Bio::Phylo nexml parser, to facilitate parsing of networks and character matrices, and develop xml writability by expanding the Bio::Phylo::XMLWritable class, along the same lines as the java interfaces.
    • Incorporate the python nexml parser libraries (for phycas) in the nexml svn repository (Jeet).
    • Clarify the output of the online nexml validator (http://nexml.org/nexml/validator), formalize difference between grammar-based and rule-based validation (Rutger).
    • Explore C(++)-parsing or data binding, for example for HyPhy (Sergei).

Goals for the third meeting

The purpose of the third meeting was to advance the goals of the working group by focusing on two complementary projects to develop a "central unifying artefact", the nexml project (www.nexml.org) and the CDAO project (www.evolutionaryontology.org).

The plan for the meeting was to hear talks about progress on these two projects, followed by discussion and planning on how to use our time at NESCent to build on this progress.

The remainder of the meeting would be spent on carrying out those plans in breakout groups, interspersed with talks on related topics: the MIAPA standard, phenotype ontologies, and data repositories.

Accomplishments of the Third Working Period

The accomplishments of the third working period include work done over the winter outside of NESCent, the work done at the May meeting at NESCent, and a few items completed in the follow-up period ending in June (although the report was not completed until 11 July).

Prior to the meeting

The goals of the working group were advanced considerably over the winter and spring, due to the efforts of individuals working on specific projects of interest to them. These outcomes were described in the presentations given on Day 1, starting with the brief presentation (Media:Stoltzfus.ppt) by Stoltzfus.

  • nexml The developers made extensive progress on the emerging nexml standard. See the wiki page and Rutger's presentation.
    • Peter Midford and Rutger Vos implemented nexml i/o for Mesquite using xml beans
    • Rutger Vos implemented nexml writing for Bio::Phylo (reading was already implemented by Jason Caravas in previous working period)
    • Xuhua Xia implemented nexml i/o for DAMBE
    • Jeet Sukumaran implemented nexml i/o for pyNexml
    • Rutger Vos refined the schema for easier usage with c++ xml beans, and to implement requirements generated during discussions and outreach
    • NESCent now hosts nexml.org, with online validation and conversion, project news feeds and documentation
  • Ontology development The ontology development team implemented an initial draft of CDAO
    • preliminary analysis of related artefacts was completed over the winter
    • the ConceptGlossary was expanded further
    • the core team (Pontelli, Thompson, Stoltzfus, Prosdocimi and Chisham) visited NESCent in March, developing 11 successive versions of CDAO in OWL-DL
    • Prosdocimi and Stoltzfus set up a CDAO SourceForge project
    • the core team went through another 10 revisions and annotated the ontology
    • the development team completed a rough draft of a manuscript describing CDAO
    • Weigang Qiu (working with students at Hunter College) began to formalize use-cases in UML
  • Transition Model Language.
    • no progress was made during this period

Summary of activities and discussion at Meeting 3

Day 1

The meeting opened at 9:00AM with remarks from Arlin Stoltzfus (File:Stoltzfus intro.ppt) and Jeff Sturkey. The meeting then continued with formal presentations:

  • The NESCent EvoInfo working group: Progress, Plans and Prospects (A. Stoltzfus) 20 min
  • Current state of the NeXML project (Rutger Vos) 40 min (slides)
    • problems with NEXUS, advantages of xml
    • triangle of semantics (CDAO), syntax (nexml) and "transport" (phyloWS)
    • design principles of nexml
      • re-use (property lists, graphml-like)
      • streaming-friendly (declare-before-use, meta-data first, venetian blinds, avoid deep hierarchy for trees)
    • implementation
  • Implementation of NEXML in DAMBE (Xuhua Xia) 20 min
    • implementation via visual basic xml parser
    • integrated smoothly into DAMBE interface for input or output
  • PhyloWS project (Hilmar Lapp) 20 min (see PhyloWS wiki pages)
  • First Draft of the Comparative Data Analysis Ontology (Enrico Pontelli) 40 min (see File:Enrico cdao.ppt)
    • Motivations (interoperation, reasoning)
    • Development process
    • core concepts (TUs, trees, character data)
    • some discussion of tree concepts
    • implementation details (OWL 1.1, translators, reasoners)
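The "streaming-friendly (declare-before-use)" design principle from the NeXML talk can be illustrated with a short sketch. This is not the nexml schema or any project code: the document below is a simplified, nexml-like fragment, and the element and attribute names, the `stream_tips` function and the sample labels are all assumptions made for illustration. The point is only that, when taxa are declared before the tree nodes that refer to them, a reader can resolve every reference in a single pass without holding the whole file in memory.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, nexml-like document (illustrative only, not schema-valid nexml):
# the OTUs are declared before the tree whose nodes refer to them.
DOC = """<doc>
  <otus id="tax1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <tree id="tree1">
    <node id="n1"/>
    <node id="n2" otu="t1"/>
    <node id="n3" otu="t2"/>
    <edge source="n1" target="n2" length="0.5"/>
    <edge source="n1" target="n3" length="0.7"/>
  </tree>
</doc>"""


def stream_tips(source):
    """Yield (node id, OTU label) pairs in a single pass over the XML."""
    otu_labels = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "otu":
            otu_labels[elem.get("id")] = elem.get("label")
        elif elem.tag == "node" and elem.get("otu") is not None:
            # Declare-before-use means the referenced OTU was already seen.
            yield elem.get("id"), otu_labels[elem.get("otu")]
        elem.clear()  # drop the element's contents to keep memory use low


tips = list(stream_tips(io.BytesIO(DOC.encode("utf-8"))))
print(tips)
```

If the tree came before the taxa block, the lookup in `otu_labels` would fail, and the reader would have to buffer nodes until the declarations arrived.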

After these presentation sessions, Arlin Stoltzfus led a discussion on the group's near-term prospects. At 4:00PM the meeting broke out in groups to develop specific plans for the week. At 5:00PM the meeting reconvened for stand-ups to brief the other groups on their respective plans. The meeting then adjourned for group dinner.

Day 2

The second day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for two talks. At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.

Lecture and discussion: Data standards and repositories

This session took place at 10:40.

  • Jim Leebens-Mack, U. Georgia (MIAPA) (20 min)
  • Repositories, data standards, and data reuse (Ryan Scherle, NESCent)
daily report from nexml-CDAO coordination group
  1. use appinfo field in nexml schema to specify CDAO classes
  2. coordinate names - very little progress was made on this; nexml terms are set already, difficult to change now
  3. Enrico began defining mappings by creating a third, mediating, ontology.
daily report from nexml implementation group
  1. R - nexml parsing and writing (Aaron)
  2. BioPerl (Weigang, Hilmar) - a new module was written: Bio::TreeIO::nexml, so a nexml tree file can be read as:

<perl>
use Bio::Phylo::IO qw(parse);
use Bio::TreeIO;

# Read each tree from the nexml file and print its length
my $tree_in = Bio::TreeIO->new(-file => 'trees.xml', -format => 'nexml');
while (my $tree = $tree_in->next_tree) {
    print $tree->calc_tree_length, "\n";
}
</perl>

<perl>
use Bio::TreeIO;

# Convert Newick trees to nexml; with no -file given, output goes to STDOUT
my $tree_in  = Bio::TreeIO->new(-file => 'longnames.dnd', -format => 'newick');
my $tree_out = Bio::TreeIO->new(-format => 'nexml');
while (my $tree = $tree_in->next_tree) {
    $tree_out->write_tree($tree);
}
</perl>

  • Results:
    • 2 new files: bioperl-live/Bio/TreeIO/nexml.pm; bioperl-live/t/data/trees.xml
    • 3 new tests in bioperl-live/t/TreeIO.t
  • Task 2. Use the standard BioPerl interface to write nexml character matrices.
    • Problem: Bio::Phylo reading of "characters.xml" generates exceptions
daily report from nexml transition model language group

The group focused on ways to limit the scope of the problem, e.g., by focusing on common models (HKY, F81, ...), specific packages (MEGA, DAMBE, PHYLIP, ...), and the most-used model concepts (based on TreeBASE submissions).

Day 3

The third day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for three talks. At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.

Lecture and discussion: Representing non-molecular data

This session took place at 10:40.

  • Phenoscape Project (Jim Balhoff, NESCent)
  • Using ontologies to formalize comparative data on worm development (Arlin Stoltzfus)
stand-up: nexml implementation group

Aaron's R support for nexml in phylobase package

  • implementation, round-trip tests
  • problem with internal ids getting changed every time
    • is due to only one slot for r-assigned-id-or-user-assigned-label
    • need to alter object model to add another slot
  • code will be available in phylobase on r-forge
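The round-trip problem reported above (internal ids changing because one slot holds either the assigned id or the user's label) can be sketched generically. The classes and functions below are hypothetical and are not the phylobase object model; they only contrast a single shared slot with separate id and label slots, under the assumption that the importing package assigns its own internal identifiers.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_ids = count(1)  # stand-in for the ids a package assigns on import


@dataclass
class OneSlotNode:
    name: str  # holds either the file's id or a freshly assigned one


@dataclass
class TwoSlotNode:
    id: str                      # id exactly as read from the file
    label: Optional[str] = None  # user-facing label, kept separately


def round_trip_one_slot(file_id):
    """Import overwrites the single slot with an internal id, then exports it."""
    node = OneSlotNode(name=file_id)
    node.name = f"internal{next(_ids)}"
    return node.name


def round_trip_two_slots(file_id, label=None):
    """Import keeps the file id in its own slot, so export returns it unchanged."""
    node = TwoSlotNode(id=file_id, label=label)
    return node.id
```

With one slot, `round_trip_one_slot("n1")` returns a different identifier on every call, whereas `round_trip_two_slots("n1", "Homo sapiens")` returns `"n1"` each time.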
stand-up: nexml-CDAO group
  • Arlin - discussion of what kind of ontological things are aligned sequence residue characters
  • Enrico - more results of mapping ontology from nexml to CDAO
  • Brandon - bindings to CDAO from nexml.org C++ library (based on xerces)
  • Francisco - thinking about annotations (MIAPA compliance target), joining data sets, finding correlations
stand-up: transition model language group

Hilmar

  • working on stuff from Peter Midford and Jeet
  • inspired by BEAST implementation
  • nexml part with model substitution language
  • Hilmar added to nexml developer list
post-standup discussion: nexml promulgation strategy

The idea of a NESCent nexml support hackathon

  • need to have conformance levels defined in advance, with tests based on test data
  • applications developers: Ronquist, Swofford, Beerli, Kuhner, Felsenstein, Zwickl, Kosakovsky Pond, Rod Page, Sanderson, Eulenstein, Burleigh, Zmasek, Stamatakis, Goloboff, Farris, Rambaut, Drummond, Holder, Maddison, Maddison,
  • library developers: Mackey, Paradis, Bolker, Thierer, Lewis,
  • data resource managers: need list for treefam, hovergen,
  • need for carrots: data resources, capabilities, services (id resolution, species links, visualization)

Day 4

The first half of the final day was devoted to task-specific work sessions. After lunch, at 1:20PM the group convened for final reports on the week's progress and discussion on plans for the coming working period. By this time, several members had left. Nevertheless, several other members had late or next-day flights and continued to talk into the evening. The main topics of discussion were how to build our ongoing projects, how to reach out to the community, and how to evaluate performance.

1. Creating carrots: CDAO-nexml data interop project

To take advantage of Francisco's presence, we began with a discussion of CDAO, but this quickly led into discussion of a joint project to tackle data resources as implementation targets for CDAO-based nexml support. See the CarrotBase page for the further development of this idea.

2. Outreach
  1. Manuscripts for nexml and cdao
  2. Websites
    • evolutionaryontology.org (belongs to Arlin, will map to NESCent, provide home for CDAO)
    • nexml.org
  3. Blog
    • main issue is keeping up level of activity
    • distribute among group: a rate of 2 posts per week, shared among 10 contributors, works out to about 1 post per person per month
    • Hilmar has agreed to set up a trial site
  4. Other means, such as evoldir announcements
3. Performance evaluation

Our goal is to promote interoperability. How do we know if this is working? How can we measure performance?

Our strategy for reaching this interop goal is to develop artefacts that facilitate interop, so the evaluation strategy is based on the use of these artefacts, mainly cdao and nexml.

  1. Indirect indicators, i.e., indicators of activity or interest:
    • pubs, or grants, with 3 or more group authors
    • invited talks on group projects
    • citation of project pubs
    • web site hits to cdao or nexml home sites
  2. Direct indicators, i.e., indicators of actual use
    • number (or fraction) of nexml implementations (import or export) in data resources, apps, libraries
    • number (or fraction) of nexml instances in archive submissions, service calls (translation, validation, etc)
    • number of cdao-mediated service calls (calculations, database transactions, etc)
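As a minimal sketch of the first direct indicator, the fraction of targets with a nexml implementation could be tallied as follows. The status values are taken from the executive summary of this report (implemented, initiated, or anticipated); treating these nine packages as the full denominator is an assumption made only for illustration, as are the names `STATUS` and `implementation_fraction`.

```python
from fractions import Fraction

# Status of nexml IO support as described in the executive summary.
# Using these nine packages as the denominator is an illustrative assumption.
STATUS = {
    "DAMBE": "implemented",
    "Mesquite": "implemented",
    "Bio::Phylo": "implemented",
    "pyNexml": "implemented",
    "PAUP*": "initiated",
    "Phylobase": "initiated",
    "Phycas": "anticipated",
    "GARLI": "anticipated",
    "HyPhy": "anticipated",
}


def implementation_fraction(status):
    """Return the fraction of listed targets whose nexml support is implemented."""
    done = sum(1 for state in status.values() if state == "implemented")
    return Fraction(done, len(status))


print(implementation_fraction(STATUS))  # prints 4/9
```

Tracking the same tally at each reporting period would show whether the fraction is rising, which is the kind of direct evidence of use the group is looking for.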

Immediately following the meeting

In the two weeks following the meeting, a considerable amount of time was spent on developing a plan to move forward the development of a MIAPA standard, as described on the Supporting_MIAPA page.

MIAPA: a way forward

Our plan (see Supporting_MIAPA) centers on a community exercise in knowledge capture, and the tools to facilitate this exercise, as a way to acquire knowledge and to engage the research community. Ideally, we would get dozens or hundreds of volunteers to generate MIAPA reports about their phylogenetic studies, and we would use this experience to improve our conception of MIAPA and the technology to support it.

So far we have

  • identified, and developed a plan to respond to, infrastructure needs for supporting MIAPA (file format, ontology, etc)
  • implemented a preliminary ontology (miapa.owl) for metadata describing sources and methods
  • identified resources that could be used to flesh out this ontology (e.g., CDAO, mygrid services ontology)
  • developed a Phenote-based proof-of-concept application for creating workflow descriptions from the MIAPA ontology
  • developed a plan for a web site for users to enter workflow descriptions and other info to yield a complete MIAPA report
  • developed a plan for using this web site in a knowledge capture experiment at Evolution 2008 (or some other meeting).

Initially we hoped to move quickly to deploy this project in time for the ontology workshop (and related buzz) at the Evolution 2008 meetings in Minneapolis. Instead, we will develop this over a longer time. The next step might be to re-factor the wiki page into a White Paper on MIAPA for NESCent. We could suggest to NESCent to recruit a catalysis-like group to implement it. Another direction would be to recruit a team of collaborators and submit a grant proposal.


Strategy and plans for the Fourth Working Period

Development typically goes through a cycle of design, implementation and evaluation. Over the past year the working group has focused on the initial round of design and implementation. While this process will continue in the near future, the time has come to begin evaluation, i.e., subjecting the artefacts to practical tests in the context of aiding research projects or serving community needs.

Next-generation file format standard

For NeXML to be useful and to build support in the community, we must first establish an initial presence. The Nexml team set goals in terms of i) a baseline level of penetration of Nexml into resources (programming language libraries, applications, and web services), ii) a web presence, and iii) a published description. In this period, the Nexml team will

  1. complete and submit a manuscript describing Nexml
  2. meet the baseline level of penetration as given in Second_Report_to_NESCent: Strategy and plans for the Third Working Period
    • 5 programming languages (java, perl, python, c++ and one of javascript (json) or ruby (bioruby)),
    • 5 applications or toolkits (Mesquite, hyphy, phycas, Bio::Phylo and one of ncl, bioruby, paup, mega, jebl, geneious, beast)
    • 3 web services (three of Tree of Life, cipres, TreeBASE, pPOD, MorphBank, MorphoBank)
  3. plan and initiate a practical evaluation project

Ontology

In the last period our goal was to make it through the implementation stage in the first development cycle (design, implementation, evaluation). This goal was completed on time. Our goals now are to:

  1. publish a manuscript (currently nearing completion) describing CDAO
  2. establish a better web presence
  3. participate in a practical evaluation project, either CarrotBase or Supporting_MIAPA
  4. submit a (revised) major proposal for external funding

Anticipated Outcomes and Products

The anticipated outcomes and products for the coming working period follow rather directly from the goals above.

Next-generation file format standard

  • a publication describing Nexml
  • improved software support for nexml in terms of
    • applications that use nexml as input or output
    • software libraries that provide methods for nexml
    • archives or databases with nexml output or input
    • services that use nexml input or output
  • the use of Nexml in actual research projects, as measured in terms of citations
  • increased community interest in Nexml, as measured by SourceForge downloads, citations

Ontology

  • a publication describing CDAO
  • inclusion of CDAO in OBO
  • demonstrations of CDAO representations of
    • molecular sequence data
    • comparative data on developmental and anatomical characters