Difference between revisions of "Third Report to NESCent"

From Evolutionary Informatics Working Group
(Prior to the meeting)
m (Ontology)
 
(42 intermediate revisions by 4 users not shown)
Line 5: Line 5:
 
==Executive summary==
 
==Executive summary==
  
<font color="green">'''DONE'''</font>
+
The Evolutionary Informatics working group held its third meeting at NESCent headquarters in Durham, NC, May 19 to 22, 2008.  The mandate of the working group is to lower the barrier for the broader application of evolutionary methods by enabling the coordination and inter-operation of data and software resources.  During meetings, and between meetings, the group divides its effort among several projects.  At this meeting, group members reported on the status of major projects before breaking up into task-specific teams to continue their work.  Before departing, the group took stock of its accomplishments and developed a plan for the next working period.
  
The Evolutionary Informatics working group held its third meeting at NESCent headquarters in Durham, NC, May 19 to 22, 2008.  The mandate of the working group is to lower the barrier for the broader application of evolutionary methods by enabling the coordination and inter-operation of data and software resources at all levels.  The group divides its effort among several projects pursued by different teams, some of whom commit substantial effort to group goals in the periods between meetings. At the most recent meeting, group members reported on the status of these projects before breaking up into task-specific teams.
+
A major goal is to develop a "Central Unifying Artefact" to serve as a Rosetta stone for data interoperability.  The working group pursues this goal via two distinct strategies: [[Future_Data_Exchange_Standard|NeXML]] and [[General Ontology|CDAO]].  Both projects have been productive: each has generated an open-source working draft of an interoperability artefact; each project has attracted interest from other scientists; and each project team is writing a manuscript to be submitted for publication in the next few months.
  
A major goal is to develop a "Central Unifying Artefact" to serve as a Rosetta stone for interoperability.  The working group pursues two subprojects, [[Future_Data_Exchange_Standard|nexml]] and [[General Ontology|CDAO]], that comprise mutually enhancing approaches and views on how best to improve interoperability, respectively in terms of syntax and semantics.
+
As hoped, [http://www.nexml.org NeXML] appears poised to become the next-generation standard data exchange format in phylogenetics.  [[NeXML]] IO (input/output) support (i.e., support for reading and writing nexml files) has been implemented in DAMBE as well as [[Mesquite]], Bio::Phylo and pyNexml; and has been initiated for PAUP* and Phylobase.  At the meeting, nexml support was implemented in R and BioPerl.  Based on commitments expressed by developers, we anticipate support for nexml IO in the next few months in packages such as Phycas, GARLI and HyPhy.
  
Considerable progress was made over the winter on [[Future_Data_Exchange_Standard|nexml]], an XML-based successor to [https://www.nescent.org/wg_phyloinformatics/NEXUS_Specification NEXUS], a legacy format for generalized comparative data.  Read/write support has been implemented in Mesquite, DAMBE, Bio::Phylo and pyNexml; and development has started for i/o support for paup, BioPerl and phylobase. In addition, based on commitments expressed by developers, we anticipate that (limited) support for the emerging nexml standard will appear in the next few months in packages such as Phycas, GARLI and HyPhy.
+
While the nexml project is a bottom-up approach focused on a syntax-based view of the interoperability requirements of data representation, the CDAO (Comparative Data Analysis Ontology) project begins at a high level of abstraction, by conceptualizing data and operations in terms of their semantics, expressed formally in the OWL-DL ontology language.  Partly through NESCent sponsorship of visiting scientists in March, 2008, working group members completed an initial version of [[CDAO]].  At the third meeting, working group members developed a mapping between nexml and CDAO, so as to facilitate inter-conversion using knowledge engineering tools.
  
At the same time, a second team pursued the development of [[CDAO]] (Comparative Data Analysis Ontology), producing a [http://www.ivory-tower-theorist.com/fconv/ demonstration project in semantic transformation], a [[ConceptGlossary|concept glossary]] available from the NESCent web site, a first draft of an OWL ontology, and a detailed research plan submitted to NIH for funding. At the meeting, this team evaluated [[Related Artefacts|related artefacts]] (database schemas, ontologies, file formats) and made plans for file format and data source mappings, to be developed in the summer.
+
A third team discussed the design requirements for an  [[Transition Model Language|Evolutionary model description language]].
  
Although these two efforts are currently only partially overlapping, they are not mutually exclusive or incompatible. The XML project is a bottom-up approach seeking to identify the requirements of interoperable data representation by defining syntax, whereas the Ontology project works top-down, by first conceptualizing data and operations in terms of their semantics.  Eventually these two projects will meet by means of a mapping between the formal ontology (OWL) terms and (XML) schema data types, a process that began during the third meeting.
+
At the close of the meeting, and in the follow-up period, the group considered its trajectory and developed plans. Interoperability is a community-wide phenomenon, requiring "buy-in" from the research community.  Given its progress in developing artefacts, the working group is now in a position to adopt a more outward focus.  This includes publishing descriptions of the artefacts as well as applying them to needs of the research community.
  
A third team discussed the design requirements for an  [[Transition Model Language|Evolutionary model description language]], a particularly sticky sub-task in developing a "Central Unifying Artefact".  This team hopes to produce a preliminary design specification soon, based on the XML representation used by BEAST, but adapted to conform to NeXML conventions.
+
Accordingly, we developed plans for two projects that align with community needs.  One such need, described by guest speaker Dr. James Leebens-Mack, is for a minimal reporting standard, tentatively named "MIAPA" (Minimal Information for a Phylogenetic Analysis).  The working group developed a plan (described in [[Supporting_MIAPA]]) that would combine a community exercise in annotation with development of an ontology and of user-oriented software to generate MIAPA-compliant reports. The second project is based on the idea that, just as the combination of SO (Sequence Ontology) and GFF (Gene Feature Format) provided a basis for interoperability among gene-entry databases, [[CDAO]] and nexml can provide for coordination and interoperability among a growing number of data resources for phylogenetics (e.g., [[TreeBASE]], TreeFam, HOVERGEN, etc).  The project is tentatively named [[CarrotBase]] because its successful outcome would create a reward (a "carrot") for database users and developers to adopt interoperability standards.  In addition, the CDAO development group plans to submit a major proposal for external funding during the coming work period.
  
 
==Scope of this report==
 
==Scope of this report==
  
<font color="red">'''ADD RELEASE DATE'''</font>
+
The working group meets twice per year and issues a report on its activities a few weeks after each meeting. This report covers work accomplished subsequent to the previous reports, including work done before, during and after the most recent meeting. That is, this Third Working Period report covers group activities from 21 December 2007 (when the report on the second working period was released) to the release date for this report, 11 July, 2008. The report closes by outlining the strategy the working group will follow in the period up to November 2008, and the anticipated tangible outcomes from this.
 
 
The working group meets twice per year and issues a report on its activities a few weeks after each meeting. This report covers work accomplished subsequent to the previous reports, including work done before, during and after the most recent meeting. That is, this Third Working Period report covers group activities from 21 December 2007 (when the report on the second working period was released) to the release date for this report, <font color="red">'''sometime soon'''</font>. The report closes by outlining the strategy the working group will follow in the period up to November 2008, and the anticipated tangible outcomes from this.
 
  
 
==Project leaders and participants==
 
==Project leaders and participants==
 
<font color="green">'''DONE'''</font>
 
  
 
Organizers reach out to the wider community to ensure that interested individuals and organizations are aware of the latest developments and future plans, to invite key individuals to participate, and to request feedback and comments. This results in a fluctuating set of [[Participants|Participants and Colleagues]]  whose involvement in various activities is summarized in the table below.  Jim Leebens-Mack ([http://mibbi.sourceforge.net/projects/MIAPA/ MIAPA project]) was a guest at our May meeting, as were graduate students Brandon Chisham and Francisco Prosdocimi.  We heard talks from NESCent staff Jim Balhoff ([https://www.nescent.org/phenoscape/Main_Page Phenoscape project]), Ryan Scherle ([http://datadryad.org/ Dryad project]), and Hilmar Lapp ([http://evoinfo.nescent.org/PhyloWS PhyloWS project]).
 
Organizers reach out to the wider community to ensure that interested individuals and organizations are aware of the latest developments and future plans, to invite key individuals to participate, and to request feedback and comments. This results in a fluctuating set of [[Participants|Participants and Colleagues]]  whose involvement in various activities is summarized in the table below.  Jim Leebens-Mack ([http://mibbi.sourceforge.net/projects/MIAPA/ MIAPA project]) was a guest at our May meeting, as were graduate students Brandon Chisham and Francisco Prosdocimi.  We heard talks from NESCent staff Jim Balhoff ([https://www.nescent.org/phenoscape/Main_Page Phenoscape project]), Ryan Scherle ([http://datadryad.org/ Dryad project]), and Hilmar Lapp ([http://evoinfo.nescent.org/PhyloWS PhyloWS project]).
Line 64: Line 60:
 
== Goals for the Third Working Period ==
 
== Goals for the Third Working Period ==
  
<font color="green">'''DONE'''</font>
+
The mandate of the working group is to improve interoperability in evolutionary analysis.  Although the [https://www.nescent.org/wg_evoinfo/Image:EvoInfoWorkingGroup_proposal.pdf original proposal] had a narrow focus of supporting existing technologies better, the group has adopted the more forward-looking goal of developing a "Central Unifying Artefact" to serve as the Rosetta stone for interoperability and the development of a transition model language. The next two subsections outline what the specific goals within this scope were, for the period leading up to the third meeting and for the third meeting itself.
 
 
The mandate of the working group is to improve interoperability in evolutionary analysis.  Although the [https://www.nescent.org/wg_evoinfo/Image:EvoInfoWorkingGroup_proposal.pdf original proposal] was to focus narrowly on supporting current standards, the group has expanded its scope and efforts into additional directions. In particular, the group has undertaken the bolder and more forward-looking project of developing a "Central Unifying Artefact" (a db schema, a file format, or an ontology) that should serve as the Rosetta stone for interoperability and the development of a transition model language. The next two subsections outline what the specific goals within this scope were, for the period leading up to the second meeting and for the second meeting itself respectively.
 
  
 
=== Goals for the period prior to the third meeting ===
 
=== Goals for the period prior to the third meeting ===
 
<font color="green">'''DONE'''</font>
 
  
 
The goals along with specific aims for this period - as described in greater detail in the [[Second_Report_to_NESCent|second report]] - were briefly as follows:
 
The goals along with specific aims for this period - as described in greater detail in the [[Second_Report_to_NESCent|second report]] - were briefly as follows:
  
 
* '''[[General_Ontology|developing a general ontology for comparative evolutionary analysis]]'''
 
* '''[[General_Ontology|developing a general ontology for comparative evolutionary analysis]]'''
** Develop initial version of the CDAO ontology in OWL-DL
+
** Develop initial version of the [[CDAO]] ontology in OWL-DL
 
** results of evaluating the initial version of CDAO
 
** results of evaluating the initial version of CDAO
 
** Write a manuscript or presentation describing the CDAO, for public use
 
** Write a manuscript or presentation describing the CDAO, for public use
 
** Implement semantic transformation (Brian, Gopal, Enrico, Vien)
 
** Implement semantic transformation (Brian, Gopal, Enrico, Vien)
*** a web server providing translation for PHYLIP, NEXUS, and at least one other format
+
*** a web server providing translation for PHYLIP, [[NEXUS]], and at least one other format
*** initial version of a validating error-recognizing parser for TreeBase input (NEXUS) files
+
*** initial version of a validating error-recognizing parser for [[TreeBASE]] input (NEXUS) files
 
** Use-case documentation (Weigang)
 
** Use-case documentation (Weigang)
 
*** instantiate additional use cases with data files
 
*** instantiate additional use cases with data files
Line 90: Line 82:
 
* '''[[Future_Data_Exchange_Standard|future data exchange syntax standard]]'''
 
* '''[[Future_Data_Exchange_Standard|future data exchange syntax standard]]'''
 
** Create a web presence at http://www.nexml.org with instance document validation, binding downloads and documentation.
 
** Create a web presence at http://www.nexml.org with instance document validation, binding downloads and documentation.
** Develop a set of generic Java class libraries (Rutger, Wayne, Peter) to facilitate more complete nexml writing (currently only reading is supported). This will be done by designing simple interfaces which users of the class libraries (e.g. mesquite) need to implement in their objects (Wayne, Peter) so that the class libraries can fetch the required attributes and child elements of the objects and serialize them to xml.
+
** Develop a set of generic Java class libraries (Rutger, Wayne, Peter) to facilitate more complete nexml writing (currently only reading is supported). This will be done by designing simple interfaces which users of the class libraries (e.g. [[Mesquite]]) need to implement in their objects (Wayne, Peter) so that the class libraries can fetch the required attributes and child elements of the objects and serialize them to xml.
 
** Develop a more complete Bio::Phylo nexml parser, to facilitate parsing of networks and character matrices, and develop xml writability by expanding the Bio::Phylo::XMLWritable class, along the same lines as the java interfaces.
 
** Develop a more complete Bio::Phylo nexml parser, to facilitate parsing of networks and character matrices, and develop xml writability by expanding the Bio::Phylo::XMLWritable class, along the same lines as the java interfaces.
 
** Incorporate the python nexml parser libraries (for phycas) in the nexml svn repository (Jeet).
 
** Incorporate the python nexml parser libraries (for phycas) in the nexml svn repository (Jeet).
Line 97: Line 89:
  
 
=== Goals for the third meeting ===
 
=== Goals for the third meeting ===
 
<font color="green">'''DONE'''</font>
 
  
 
The purpose of the third meeting was to advance the goals of the working group by focusing on two complementary projects to develop a "central unifying artefact", the nexml project (www.nexml.org) and the CDAO project (www.evolutionaryontology.org).
 
The purpose of the third meeting was to advance the goals of the working group by focusing on two complementary projects to develop a "central unifying artefact", the nexml project (www.nexml.org) and the CDAO project (www.evolutionaryontology.org).
Line 104: Line 94:
 
The plan for the meeting was to hear talks about progress on these two projects, followed by discussion and planning on how to use our time at NESCent to build on this progress.
 
The plan for the meeting was to hear talks about progress on these two projects, followed by discussion and planning on how to use our time at NESCent to build on this progress.
  
The remainder of the meeting would be spent on carrying out those plans in breakout groups, interspersed with talks on related topics: the MIAPE standard, phenotype ontologies, and data repositories.
+
The remainder of the meeting would be spent on carrying out those plans in breakout groups, interspersed with talks on related topics: the MIAPA standard, phenotype ontologies, and data repositories.
  
 
== Accomplishments of the Third Working Period ==
 
== Accomplishments of the Third Working Period ==
[https://www.nescent.org/wg/evoinfo/images/3/3a/Nexml_nescent_19_5_08.ppt Rutger's slides for May '08]
 
'''old material, to be edited'''
 
  
The accomplishments of the second working period include work done over the summer and fall outside of NESCent, the work done at the November meeting at NESCent, and a few items completed in the follow-up period ending 21 December, 2008.
+
The accomplishments of the third working period include work done over the winter outside of NESCent, the work done at the May meeting at NESCent, and a few items completed in the follow-up period ending in June (although the report was not completed until 11 July).
  
 
===Prior to the meeting===
 
===Prior to the meeting===
'''old material, to be edited'''
 
'''note: nexml, cdao, weigang's use cases, subst  model language'''
 
  
 
The goals of the working group were advanced considerably over the summer and fall, due to the efforts of individuals working on specific projects of interest to them.  These outcomes were described in the presentations given on Day 1, starting with the brief presentation ([[Media:Stoltzfus.ppt]]) by Stoltzfus.
 
The goals of the working group were advanced considerably over the summer and fall, due to the efforts of individuals working on specific projects of interest to them.  These outcomes were described in the presentations given on Day 1, starting with the brief presentation ([[Media:Stoltzfus.ppt]]) by Stoltzfus.
  
* '''nexml''' The [http://sourceforge.net/project/memberlist.php?group_id=209571 developers] made extensive progress on the emerging nexml standard. See the [[Future_Data_Exchange_Standard]] page and Rutger's presentation ([[Media:Vos.ppt]])
+
* '''nexml''' The [http://sourceforge.net/project/memberlist.php?group_id=209571 developers] made extensive progress on the emerging nexml standard. See the [[Future_Data_Exchange_Standard|wiki]] page and [https://www.nescent.org/wg/evoinfo/images/3/3a/Nexml_nescent_19_5_08.ppt Rutger's presentation].
** In his Google Summer of Code project, Jason Caravas worked on implementing a nexml parser for Bio::Phylo.
+
** Peter Midford and Rutger Vos implemented nexml i/o for [[Mesquite]] using xml beans
** David Maddison, Wayne Maddison, Mark Holder, Peter Midford and Jeet Sukumaran generated requirements for the nexml standard.
+
** Rutger Vos implemented nexml writing for Bio::Phylo (reading was already implemented by Jason Caravas in previous working period)
** The schema was further refined based on the lessons learned from the generated requirements and from actually working with nexml data files.
+
** Xuhua Xia implemented nexml i/o for DAMBE
 +
** Jeet Sukumaran implemented nexml i/o for pyNexml
 +
** Rutger Vos refined the schema for easier usage with c++ xml beans, and to implement requirements generated during discussions and outreach
 +
** NESCent now hosts nexml.org, with online validation and conversion, project news feeds and documentation
  
* '''Ontology development''' Pontelli, Gupta, and Stoltzfus developed plans for a major project to develop an ontology and computing environment for evolutionary analysis, including
+
* '''Ontology development''' The ontology development team implemented an initial draft of CDAO
** a staged development strategy for the Comparative Data Analysis Ontology (CDAO)
+
** preliminary analysis of related artefacts was completed over the winter
** a rudimentary [[ConceptGlossary]]
+
** the [[ConceptGlossary]] was expanded further
** demo projects including [http://www.ivory-tower-theorist.com/fconv/ ontology-based semantic transformation] (see below)
+
** the core team (Pontelli, Thompson, Stoltzfus, Prosdocimi and Chisham) visited NESCent in March, developing 11 successive versions of CDAO in OWL-DL
** a detailed funding proposal (4 years, ~1.2 M$) submitted to NIH 10 October
+
** Prosdocimi and Stoltzfus set up a [https://sourceforge.net/projects/cdao CDAO SourceForge project]
 +
** the core team went through another 10 revisions and annotated the ontology
 +
** the development team completed a rough draft of a manuscript describing CDAO
 +
** Weigang Qiu (working with students at Hunter College) began to formalize use-cases in UML
  
* '''Supporting current file formats'''
+
* '''Transition Model Language'''.
** a bit of progress on format translation technology (see below, semantic transformation)
+
** no progress was made during this period
** no progress was made on documentation of current formats
 
** no progress was made on collecting further format examples
 
** no progress was made on assessing usage in the research community
 
 
 
* '''Transition Model Language'''. Progress was described in the talks by [[Media:Midford.ppt|Peter Midford]] and [[Media:Evoinfo_SLKP.pdf|Sergei Kosakovsky Pond]]
 
** some initial review of related technologies (GAMS, MPL, MML, AMPL, AIMMS, XRate)
 
** an attempt to extract design principles
 
** no progress was made on a funding plan to complete this work
 
  
 
===Summary of activities and discussion at Meeting 3===
 
===Summary of activities and discussion at Meeting 3===
 
<font color="green">'''DONE'''</font>
 
  
 
==== Day 1 ====
 
==== Day 1 ====
  
Meeting opens at 9:00AM with remarks from Arlin Stoltzfus and Jeff Sturkey. The meeting then continues with formal presentations:
+
Meeting opens at 9:00AM with remarks from Arlin Stoltzfus [[Image:Stoltzfus_intro.ppt]] and Jeff Sturkey. The meeting then continues with formal presentations:
 
* The NESCent EvoInfo working group: Progress, Plans and Prospects (A. Stoltzfus) 20 min
 
* The NESCent EvoInfo working group: Progress, Plans and Prospects (A. Stoltzfus) 20 min
 
* Current state of the NeXML project (Rutger Vos) 40 min ([https://www.nescent.org/wg/evoinfo/images/3/3a/Nexml_nescent_19_5_08.ppt slides])
 
* Current state of the NeXML project (Rutger Vos) 40 min ([https://www.nescent.org/wg/evoinfo/images/3/3a/Nexml_nescent_19_5_08.ppt slides])
 +
 +
** problems with NEXUS, advantages of xml
 +
** triangle of semantics (CDAO), syntax (nexml) and "transport" (phyloWS)
 +
** design principles of nexml
 +
*** re-use (property lists, graphml-like)
 +
*** streaming-friendly (declare-before-use, meta-data first, venetian blinds, avoid deep hierarchy for trees)
 +
** implementation
 +
 
* Implementation of NEXML in DAMBE (Xuhua Xia) 20 min
 
* Implementation of NEXML in DAMBE (Xuhua Xia) 20 min
* PhyloWS project (Hilmar Lapp) 20 min
+
** implementation via visual basic xml parser
* First Draft of the Comparative Data Analysis Ontology (Enrico Pontelli) 40 min
+
** integrated smoothly into DAMBE interface for input or output
 +
* PhyloWS project (Hilmar Lapp) 20 min (see [https://www.nescent.org/wg_evoinfo/PhyloWS PhyloWS wiki pages])
 +
* First Draft of the Comparative Data Analysis Ontology (Enrico Pontelli) 40 min (see [[Image:Enrico_cdao.ppt]])
 +
** Motivations (interoperation, reasoning)
 +
** Development process
 +
** core concepts (TUs, trees, character data)
 +
** some discussion of tree concepts
 +
** implementation details (OWL 1.1, translators, reasoners)
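
The "streaming-friendly" design principle listed under the NeXML talk (declare-before-use, meta-data first) means that a reader can resolve every reference in a single forward pass, without holding the whole document in memory. The following is only a minimal sketch of that idea in Python. The element and attribute names (<code>doc</code>, <code>otus</code>, <code>otu</code>, <code>tree</code>, <code>node</code>) are simplified and hypothetical, loosely modelled on nexml rather than taken from the actual schema, and the function name <code>stream_node_labels</code> is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified document (NOT the real nexml schema):
# taxa are declared before the tree nodes that refer to them.
DOC = """<doc>
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <tree id="tree1">
    <node id="n1" otu="t1"/>
    <node id="n2" otu="t2"/>
    <node id="n3"/>
  </tree>
</doc>"""


def stream_node_labels(source):
    """Yield (node id, taxon label) pairs in one forward pass."""
    declared = {}  # otu id -> label, filled as declarations stream past
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "otu":
            declared[elem.get("id")] = elem.get("label")
        elif elem.tag == "node":
            ref = elem.get("otu")
            if ref is not None and ref not in declared:
                # A reference to something not yet declared would force
                # a second pass, which declare-before-use rules out.
                raise ValueError(f"node {elem.get('id')} uses undeclared otu {ref}")
            yield elem.get("id"), declared.get(ref)
        elem.clear()  # drop element content; nothing earlier is revisited


pairs = list(stream_node_labels(io.BytesIO(DOC.encode("utf-8"))))
print(pairs)
```

Because each declaration precedes its use, the only state the reader keeps is the small table of declared taxa. Internal nodes without a taxon reference simply get no label.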
 +
 
 
After these presentation sessions, Arlin Stoltzfus led a discussion on the group's near-term prospects. At 4:00PM the meeting broke out in groups to develop specific plans for the week. At 5:00PM the meeting reconvened for stand-ups to brief the other groups on their respective plans. The meeting then adjourned for group dinner.
 
After these presentation sessions, Arlin Stoltzfus lead a discussion on the group's near-term prospects. At 4:00PM the meeting broke out in groups to develop specific plans for the week. At 5:00PM the meeting reconvened for stand-ups to brief the other groups on their respective plans. The meeting then adjourned for group dinner.
  
 
==== Day 2 ====
 
==== Day 2 ====
  
The second day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for two talks:
+
The second day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for two talks. At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.
 +
 
 +
===== Lecture and discussion: Data standards and repositories =====
 +
 
 +
This session took place at 10:40.
 
* Jim Leebens-Mack, U. Georgia (MIAPA) (20 min)
 
* Jim Leebens-Mack, U. Georgia (MIAPA) (20 min)
 
* Repositories, data standards, and data reuse (Ryan Scherle, NESCent)
 
* Repositories, data standards, and data reuse (Ryan Scherle, NESCent)
At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.
+
 
 +
===== daily report from nexml-CDAO coordination group =====
 +
 
 +
# use appinfo field in nexml schema to specify CDAO classes
 +
# coordinate names - very little progress was made on this; nexml terms are set already, difficult to change now
 +
# Enrico began defining mappings by creating a third, mediating, ontology.
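
The first item above, using the schema's <code>appinfo</code> field to name CDAO classes, can be sketched as follows. This is an illustration under stated assumptions, not the mapping the group actually produced: the type names (<code>ExampleTree</code>, <code>ExampleNode</code>) and the <code>http://example.org/cdao#</code> URIs are hypothetical placeholders. Only the <code>xs:annotation</code>/<code>xs:appinfo</code> mechanism itself is standard XML Schema.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment: type names and class URIs are
# placeholders, not taken from the nexml schema or from CDAO.
SCHEMA = f"""<xs:schema xmlns:xs="{XS}">
  <xs:complexType name="ExampleTree">
    <xs:annotation>
      <xs:appinfo>http://example.org/cdao#Tree</xs:appinfo>
    </xs:annotation>
  </xs:complexType>
  <xs:complexType name="ExampleNode">
    <xs:annotation>
      <xs:appinfo>http://example.org/cdao#Node</xs:appinfo>
    </xs:annotation>
  </xs:complexType>
  <xs:complexType name="Unannotated"/>
</xs:schema>"""


def appinfo_mapping(schema_text):
    """Map each annotated complexType name to the text of its appinfo."""
    root = ET.fromstring(schema_text)
    mapping = {}
    for ctype in root.findall(f"{{{XS}}}complexType"):
        info = ctype.find(f"{{{XS}}}annotation/{{{XS}}}appinfo")
        if info is not None and info.text:
            mapping[ctype.get("name")] = info.text.strip()
    return mapping


mapping = appinfo_mapping(SCHEMA)
print(mapping)
```

Keeping the ontology reference inside the schema means a converter could look up the target class for any syntactic type without a separate mapping file. Types without an annotation are simply absent from the table.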
 +
 
 +
===== daily report from nexml implementation group =====
 +
 
 +
# R - nexml parsing and writing  (Aaron)
 +
# BioPerl (Weigang, Hilmar) - a new module was written: Bio::TreeIO::nexml, so a nexml tree file can be read as:
 +
<perl>
 +
use Bio::Phylo::IO qw(parse);
use Bio::TreeIO;
 +
my $tree_in=Bio::TreeIO->new(-file=>'trees.xml', -format=>'nexml');
 +
while(my $tree=$tree_in->next_tree){
 +
    print $tree->calc_tree_length, "\n";
 +
}
 +
</perl>
 +
 
 +
<perl>
 +
use Bio::TreeIO;
 +
my $tree_in=Bio::TreeIO->new(-file=>'longnames.dnd', -format=>'newick');
 +
while(my $tree=$tree_in->next_tree){
 +
    my $tree_out=Bio::TreeIO->new(-format=>'nexml');
 +
    $tree_out->write_tree($tree);
 +
}
 +
</perl>
 +
 
 +
* Results:
 +
** 2 new files: bioperl-live/Bio/TreeIO/nexml.pm; bioperl-live/t/data/trees.xml
 +
** 3 new tests in bioperl-live/t/TreeIO.t
 +
* Task 2. Use the standard BioPerl interface to write nexml character matrices.
 +
** Problem: Bio::Phylo reading of "characters.xml" generates exceptions
 +
 
 +
===== daily report from nexml transition model language group =====
 +
 
 +
focused on ways to limit scope of problem, e.g., focus on common models (HKY, F81, ...), specific packages (MEGA, DAMBE, PHYLIP, ...), most-used model concepts (based on [[TreeBASE]] submissions)
  
 
==== Day 3 ====
 
==== Day 3 ====
  
The third day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for three talks:
+
The third day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for three talks. At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.
 +
 
 +
===== Lecture and discussion: Representing non-molecular data =====
 +
 
 +
This session took place at 10:40.
 
* Phenoscape Project (Jim Balhoff, NESCent)
 
* Phenoscape Project (Jim Balhoff, NESCent)
* Reminder to think about the future of the working group (Arlin)
+
* Using ontologies to formalize comparative data on worm development (Arlin Stoltzfus)
* Using ontologies to formalize comparative data on worm development (Arlin Stoltzfus) 20 min
+
 
At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.
+
===== stand-up: nexml implementation group =====
 +
 
 +
Aaron's R support for nexml in phylobase package
 +
* implementation, round-trip tests
 +
* problem with internal ids getting changed every time
 +
** is due to only one slot for r-assigned-id-or-user-assigned-label
 +
** need to alter object model to add another slot
 +
* code will be available in phylobase on r-forge
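
The id problem reported above arises when a single slot must hold either a tool-assigned id or a user-assigned label. The sketch below, in Python rather than R, illustrates why a second slot makes a round trip stable. It is not the phylobase object model: the <code>Node</code> class, the <code>write_nodes</code> and <code>read_nodes</code> helpers, and the toy <code>tree</code>/<code>node</code> markup are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    id: str                      # identifier read from, and written back to, the file
    label: Optional[str] = None  # user-facing label, kept in its own slot


def write_nodes(nodes: List[Node]) -> str:
    """Serialize nodes to a toy XML string (not real nexml)."""
    root = ET.Element("tree")
    for n in nodes:
        attrs = {"id": n.id}
        if n.label is not None:
            attrs["label"] = n.label
        ET.SubElement(root, "node", attrs)
    return ET.tostring(root, encoding="unicode")


def read_nodes(text: str) -> List[Node]:
    """Parse the toy XML back into Node objects."""
    return [Node(e.get("id"), e.get("label")) for e in ET.fromstring(text).iter("node")]


original = '<tree><node id="n1" label="Homo sapiens"/><node id="n2"/></tree>'
round_tripped = write_nodes(read_nodes(original))

# Round-trip check: ids and labels survive reading and re-writing.
assert read_nodes(round_tripped) == read_nodes(original)
```

With only one slot, a node that has a label would lose its file id on reading, and the writer would have to invent a new one each time. Separate slots avoid that.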
 +
 
 +
===== stand-up: nexml-CDAO group =====
 +
 
 +
* Arlin - discussion of what kind of ontological things are aligned sequence residue characters
 +
* Enrico - more results of mapping ontology from nexml to CDAO
 +
* Brandon - bindings to CDAO from nexml.org C++ library (based on xerces)
 +
* Francisco - thinking about annotations (MIAPA compliance target), joining data sets, finding correlations
 +
 
 +
===== stand-up: transition model language group =====
 +
 
 +
Hilmar
 +
* working on stuff from Peter Midford and Jeet
 +
* inspired by BEAST implementation
 +
* nexml part with model substitution language
 +
* Hilmar added to nexml developer list
 +
 
 +
=====  post-standup discussion: nexml promulgation strategy =====
 +
 
 +
The idea of a NESCent nexml support hackathon
 +
* need to have conformance levels defined in advance, with tests based on test data
 +
* applications developers: Ronquist, Swofford, Beerli, Kuhner, Felsenstein, Zwickl, Kosakovsky Pond, Rod Page, Sanderson, Eulenstein, Burleigh, Zmasek, Stamatakis, Goloboff, Farris, Rambaut, Drummond, Holder, Maddison, Maddison,
 +
* library developers: Mackey, Paradis, Bolker, Thierer, Lewis,
 +
* data resource managers: need list for treefam, hovergen,
 +
* need for carrots: data resources, capabilities, services (id resolution, species links, visualization)
  
 
==== Day 4 ====
 
==== Day 4 ====
  
The first half of the final day was devoted to task-specific work sessions. After lunch, at 1:20PM the group convened for final reports on the week's progress and discussion on plans for the coming working period.
+
The first half of the final day was devoted to task-specific work sessions. After lunch, at 1:20PM the group convened for final reports on the week's progress and discussion on plans for the coming working period. By this time, several members had left.  Nevertheless, several other members had late or next-day flights and continued to talk into the evening.  The main topics of discussion were how to build our ongoing projects, how to reach out to the community, and how to evaluate performance.
 +
 
 +
===== 1. Creating carrots: CDAO-nexml data interop project =====
 +
 
 +
To take advantage of Francisco's presence, we began with a discussion of CDAO, but this quickly led into discussion of a joint project to tackle data resources as implementation targets for CDAO-based nexml support.  See the [[CarrotBase]] page for the further development of this idea.
 +
 
 +
===== 2. Outreach =====
 +
 
 +
# Manuscripts for nexml and cdao
 +
# Websites
 +
#* evolutionaryontology.org (belongs to  Arlin, will map to NESCent, provide home for CDAO)
 +
#* nexml.org
 +
# Blog
 +
#* main issue is keeping up level of activity
 +
#* distribute among group: a rate of 2 posts per week, shared among 10 contributors, works out to about 1 post per person per month
 +
#* Hilmar has  agreed to set up  a trial site
 +
# Other means, such as evoldir announcements
 +
 
 +
===== 3. Performance evaluation =====
 +
 
 +
Our goal is to promote interoperability. How do we know if this is working?  How can we measure performance?
 +
 
 +
Our strategy for reaching this interop goal is to develop artefacts that facilitate interop, so the evaluation strategy is based on the use of these artefacts, mainly cdao and nexml.
 +
 
 +
# Indirect indicators, i.e., indicators of activity or interest:
 +
#* pubs, or grants, with 3 or more group authors
 +
#* invited talks on group projects
 +
#* citation of project pubs
 +
#* web site hits to cdao or nexml home sites
 +
# Direct indicators, i.e., indicators of actual use
 +
#* number (or fraction) of nexml implementations (import or export) in data resources, apps, libraries
 +
#* number (or fraction) of nexml instances in archive submissions, service calls (translation, validation, etc)
 +
#* number of cdao-mediated service calls (calculations, database transactions, etc)
  
 
=== Immediately following the meeting ===
 
=== Immediately following the meeting ===
  
<font color="red">'''THESE ARE STILL COMING IN, E.G. CARROTBASE'''</font>
+
In the two weeks following the meeting, considerable time was spent developing a plan to advance the development of a MIAPA standard, as described on the [[Supporting_MIAPA]] page.
 +
 
 +
==== MIAPA: a way forward ====
 +
 
 +
Our plan (see [[Supporting_MIAPA]]) centers on a community exercise in knowledge capture, and the tools to facilitate this exercise, as a way to acquire knowledge and to engage the research community. Ideally, we would get dozens or hundreds of volunteers to generate MIAPA reports about their phylogenetic studies, and we would use this experience to improve our conception of MIAPA and the technology to support it.
 +
 
 +
So far we have
 +
* identified, and developed a plan to respond to, infrastructure needs for supporting MIAPA (file format, ontology, etc)
 +
* implemented a preliminary ontology (miapa.owl) for metadata describing sources and methods
 +
* identified resources that could be used to flesh out this ontology (e.g., CDAO, mygrid services ontology)
 +
* developed a Phenote-based proof-of-concept application for creating workflow descriptions from the MIAPA ontology
 +
* developed a plan for a web site for users to enter workflow descriptions and other info to yield a complete MIAPA report
 +
* developed a plan for using this web site in a knowledge capture experiment at Evolution 2008 (or some other meeting).
 +
 
 +
Initially we hoped to move quickly to deploy this project in time for the ontology workshop (and related buzz) at the Evolution 2008 meetings in Minneapolis.  Instead, we will develop this over a longer time.  The next step might be to re-factor the wiki page into a White Paper on MIAPA for NESCent.  We could suggest that NESCent recruit a catalysis-like group to implement it. Another direction would be to recruit a team of collaborators and submit a grant proposal.
  
* nexml
 
** Rutger set up the nexml.org site
 
** Rutger put the code on SourceForge
 
  
* Ontology
 
** Arlin doubled the size of the Concept Glossary
 
** Enrico, Arlin and Julie developed a visiting scientist proposal to continue work
 
  
 
== Strategy and plans for the Fourth Working Period ==
 
== Strategy and plans for the Fourth Working Period ==
'''old material, to be edited'''
+
 
This section and its subsections outline the general strategies the members of the subproject teams will follow in the period up to May 2008, and the practical work they will do within those strategies.
+
Development typically goes through a cycle of design, implementation and evaluation.  Over the past year the working group has focused on the initial round of design and implementation.  While this process will continue in the near future, the time has come to begin evaluation, i.e., subjecting the artefacts to practical tests in the context of aiding research projects or serving community needs.
 +
 
 
=== Next-generation file format standard ===
 
=== Next-generation file format standard ===
'''old material, to be edited'''
 
In general terms, the strategy for the nexml initiative is to generate interest in it by providing code that processes nexml (SAX parser libraries, DOM data binding libraries) and implementations that use nexml (applications and web services) thereby provoking comments and discussions on the underlying assumptions of the schema - which then is improved and extended iteratively based on community input. In order to generate enough interest, an informal estimate is that this would require support from:
 
* 5 programming languages (java, perl, python, c++ and one of javascript (json) or ruby (bioruby)),
 
* 5 applications or toolkits (mesquite, hyphy, phycas, Bio::Phylo and one of ncl, bioruby, paup, mega, jebl, geneious, beast)
 
* 3 web services (three of tree of life, cipres, treebase, ppod, morphbank, morphobank)
 
In addition, nexml-related outreach must include a credible [http://www.nexml.org web presence] and publication of the standard in a bioinformatics related open access journal. The web presence is supported by NESCent and will continue to develop; the publication is intended to evolve from the [[Future_Data_Exchange_Standard | lengthy nexml wiki page]].
 
  
=== Transition model language ===
+
For [[NeXML]] to be useful and to build support in the community, we must first establish an initial presence.  The Nexml team set goals in terms of i) a baseline level of penetration of Nexml into resources (programming language libraries, applications, and web services), ii) a web presence, and iii) a published description. In this period, the Nexml team will
'''old material, to be edited'''
+
# complete and submit a manuscript describing Nexml
A problem in this subproject is that substitution model descriptions can potentially become arbitrarily nested and complex, hence more detailed analysis and experimentation is necessary in order to identify requirements. Sergei, David Swofford and Derrick Zwickl will work on a spec for model descriptions to be implemented first in Garli, then in a future version of paup* and HyPhy. This initial spec will likely be annotated in a flat text format (pseudo-code), which will then be converted to some more formal annotation (e.g. xml, xrate, rdf).
+
# meet the baseline level of penetration as given in [[Second_Report_to_NESCent#Strategy and plans for the Third Working Period|Second_Report_to_NESCent: Strategy and plans for the Third Working Period]]
 +
#* 5 programming languages (java, perl, python, c++ and one of javascript (json) or ruby (bioruby)),
 +
#* 5 applications or toolkits ([[Mesquite]], hyphy, phycas, Bio::Phylo and one of ncl, bioruby, paup, mega, jebl, geneious, beast)
 +
#* 3 web services (three of [[Tree of Life]], cipres, [[TreeBASE]], [[pPOD]], [[MorphBank]], [[MorphoBank]])
 +
# plan and initiate a practical evaluation project
  
 
=== Ontology ===
 
=== Ontology ===
'''old material, to be edited'''
 
The ontology development stages (use cases, concept glossary, related artefacts, implementation, evaluation) should be undertaken iteratively. We feel that we now are ready to undertake the first (preliminary, rough) iteration of the implementation step, as described below.  Meanwhile, work will continue on other steps in development.
 
  
==== Use Case Documentation ====
+
In the last period our goal was to make it through the implementation stage in the first development cycle (design, implementation, evaluation).  This goal was completed on time.  Our goals now are to:
'''old material, to be edited'''
+
# publish a manuscript (currently nearing completion) describing CDAO
Further effort will be devoted to documenting more use cases (ideally, each of them) with
+
# establish a better web presence
* files containing instance data (inputs and outputs)
+
# participate in a practical evaluation project, either [[CarrotBase]] or [[Supporting_MIAPA]]
* a UML use-case diagram
+
# submit a (revised) major proposal for external funding
  
==== Concept Glossary ====
+
== Anticipated Outcomes and Products ==
'''old material, to be edited'''
 
Work on the [[ConceptGlossary|glossary]] will continue with
 
* addition of more definitions and new terms
 
* attempts to advertise the glossary in order to get feedback
 
* exploring technologies to maintain an active glossary linked to CDAO as it emerges
 
 
 
==== CDAO Implementation ====
 
'''old material, to be edited'''
 
We will implement an initial version of CDAO as an OWL ontology:
 
 
 
* Seek funding for a working meeting at NESCent
 
** to take place in spring
 
** will involve Enrico, Julie, Arlin, and two grads or post-docs
 
* Collect a token set of example data to cover basic uses
 
** one molecular example with protein-coding gene family data
 
** one systematics example with species having discrete and continuous characters
 
* Implement basic concepts and relations of CDAO using OWL-DL
 
* Evaluate the implementation with regard to the token data, and other data
 
* Prepare a manuscript describing the results
 
* Prepare a poster or slide presentation for use at meetings
 
  
==== Semantic Transformation ====
+
The anticipated outcomes and products for the coming working period follow rather directly from the goals above.
'''old material, to be edited'''
 
Work will continue on the semantic transformation project that Brian DeVries described, including
 
* translation of a more extensive set of alignment file formats
 
* TreeBase input file processing (working towards a validating NEXUS parser with error-recognition)
 
  
== Anticipated Outcomes and Products ==
 
'''old material, to be edited'''
 
This section and its subsections outline which deliverables the group expects to produce by following the strategies as outlined in the preceding section.
 
 
=== Next-generation file format standard ===
 
=== Next-generation file format standard ===
'''old material, to be edited'''
 
The [[Future Data Exchange Standard|nexml]] team anticipates the following deliverables for the coming period:
 
* A web presence at http://www.nexml.org with instance document validation, binding downloads and documentation.
 
* A set of generic java class libraries (Rutger, Wayne, Peter) to facilitate more complete nexml writing (currently only reading is supported). This will be done by designing simple interfaces which users of the class libraries (e.g. mesquite) need to implement in their objects (Wayne, Peter) so that the class libraries can fetch the required attributes and child elements of the objects and serialize them to xml.
 
* A more complete Bio::Phylo nexml parser, to facilitate parsing of networks and character matrices, and develop xml writability by expanding the Bio::Phylo::XMLWritable class, along the same lines as the java interfaces.
 
* Incorporate the python nexml parser libraries (for phycas) in the nexml svn repository (Jeet).
 
* Clarify the output of the online nexml validator (http://nexml-dev.nescent.org/validator.cgi), formalize difference between grammar-based and rule-based validation (Rutger).
 
* Explore C(++)-parsing or data binding, for example for HyPhy (Sergei).
 
  
=== Transition model language ===
+
* a publication describing Nexml
'''old material, to be edited'''
+
* improved software support for nexml in terms of
As per the informal goals outlined on [[Transition_Model_Language#Next_Steps|the model language wiki page]] and in follow-up communication subsequent to the second meeting, anticipated outcomes and results include:
+
** applications that use nexml as  input or output
* Sergei plans to generate pseudo-code (i.e. english statements) which might then be mapped to whatever notation the group deems suitable. He will start by making a list of examples (just models of substitutions for now), all from literature on the wg-evoinfo wiki. The expectation is that we can probably sketch out a feature set by next May.
+
** software libraries that provide methods for nexml
* David Swofford plans to work with Derrick Zwickl on something that would be implemented in Garli and (a future version of) PAUP*.
+
** archives or databases with nexml output or input
 +
** services that use nexml input or output
 +
* the use of Nexml in actual research projects, as measured in terms of citations
 +
* increased community interest in Nexml, as measured by SourceForge downloads, citations
  
 
=== Ontology ===
 
=== Ontology ===
'''old material, to be edited'''
 
  
The outcomes anticipated in the next working period are:
+
* a publication describing [[CDAO]]
 +
* inclusion of CDAO in OBO
 +
* demonstrations of CDAO representations of
 +
** molecular sequence data
 +
** comparative data on developmental and anatomical characters
  
* Ontology implementation and description (Enrico, Julie, Arlin)
+
[[Category:MIAPA]]
** initial version of the CDAO ontology in OWL-DL
+
[[Category: CDAO]]
** results of evaluating the initial version of CDAO
+
[[Category:Meetings]]
** a manuscript or presentation describing the CDAO, for public use
+
[[Category:NeXML]]
* Semantic transformation (Brian, Gopal, Enrico, Vien)
+
[[Category:Working Group]]
** a web server providing translation for PHYLIP, NEXUS, and at least one other format
 
** initial version of a validating error-recognizing parser for TreeBase input (NEXUS) files
 
* Use-case documentation (Weigang)
 
** instantiate additional use cases with data files
 

Latest revision as of 01:43, 3 March 2009

21 May, 2008

Group leaders: Arlin Stoltzfus and Rutger Vos

Executive summary

The Evolutionary Informatics working group held its third meeting at NESCent headquarters in Durham, NC, May 19 to 22, 2008. The mandate of the working group is to lower the barrier for the broader application of evolutionary methods by enabling the coordination and inter-operation of data and software resources. During meetings, and between meetings, the group divides its effort among several projects. At this meeting, group members reported on the status of major projects before breaking up into task-specific teams to continue their work. Before departing, the group took stock of its accomplishments and developed a plan for the next working period.

A major goal is to develop a "Central Unifying Artefact" to serve as a Rosetta stone for data interoperability. The working group pursues this goal via two distinct strategies: NeXML and CDAO. Both projects have been productive: each has generated an open-source working draft of an interoperability artefact; each project has attracted interest from other scientists; and each project team is writing a manuscript to be submitted for publication in the next few months.

As hoped, NeXML appears poised to become the next-generation standard data exchange format in phylogenetics. NeXML IO (input/output) support (i.e., support for reading and writing nexml files) has been implemented in DAMBE as well as Mesquite, Bio::Phylo and pyNexml; and has been initiated for PAUP* and Phylobase. At the meeting, Nexml support was implemented in R and BioPerl. Based on commitments expressed by developers, we anticipate support for nexml IO in the next few months in packages such as Phycas, GARLI and HyPhy.

While the nexml project is a bottom-up approach focused on a syntax-based view of the interoperability requirements of data representation, the CDAO (Comparative Data Analysis Ontology) project begins at a high level of abstraction, by conceptualizing data and operations in terms of their semantics, expressed formally in the OWL-DL ontology language. Partly through NESCent sponsorship of visiting scientists in March, 2008, working group members completed an initial version of CDAO. At the third meeting, working group members developed a mapping between nexml and CDAO, so as to facilitate inter-conversion using knowledge engineering tools.

A third team discussed the design requirements for an evolutionary model description language.

At the close of the meeting, and in the follow-up period, the group considered its trajectory and developed plans. Interoperability is a community-wide phenomenon, requiring "buy-in" from the research community. Given its progress in developing artefacts, the working group is now in a position to adopt a more outward focus. This includes publishing descriptions of the artefacts as well as applying them to needs of the research community.

Accordingly, we developed plans for two projects that align with community needs. One such need, described by guest speaker Dr. James Leebens-Mack, is for a minimal reporting standard, tentatively named "MIAPA" (Minimal Information for a Phylogenetic Analysis). The working group developed a plan (described in Supporting_MIAPA) that would combine a community exercise in annotation with development of an ontology and of user-oriented software to generate MIAPA-compliant reports. The second project is based on the idea that, just as the combination of SO (Sequence Ontology) and GFF (General Feature Format) provided a basis for interoperability among gene-entry databases, CDAO and nexml can provide for coordination and interoperability among a growing number of data resources for phylogenetics (e.g., TreeBASE, TreeFam, HOVERGEN). The project is tentatively named CarrotBase because its successful outcome would create a reward (a "carrot") for database users and developers to adopt interoperability standards. In addition, the CDAO development group plans to submit a major proposal for external funding during the coming work period.

Scope of this report

The working group meets twice per year and issues a report on its activities a few weeks after each meeting. This report covers work accomplished subsequent to the previous reports, including work done before, during and after the most recent meeting. That is, this Third Working Period report covers group activities from 21 December 2007 (when the report on the second working period was released) to the release date for this report, 11 July, 2008. The report closes by outlining the strategy the working group will follow in the period up to November 2008, and the anticipated tangible outcomes from this.

Project leaders and participants

Organizers reach out to the wider community to ensure that interested individuals and organizations are aware of the latest developments and future plans, to invite key individuals to participate, and to request feedback and comments. This results in a fluctuating set of Participants and Colleagues whose involvement in various activities is summarized in the table below. Jim Leebens-Mack (MIAPA project) was a guest at our May meeting, as were graduate students Brandon Chisham and Francisco Prosdocimi. We heard talks from NESCent staff Jim Balhoff (Phenoscape project), Ryan Scherle (Dryad project), and Hilmar Lapp (PhyloWS project).

Participant | NESCent 2006 proposal | December 2006 hackathon | Prioritization exercise | May 2007 meeting | Oct 2007 NIH proposal | Nov 2007 meeting | Mar 2008 vis. sci. | May 2008 meeting
Chisham, Brandon       
Eisen, Jonathan       
Felsenstein, Joe       
Gupta, Gopal      
Holder, Mark    
Huelsenbeck, John       
Kosakovsky Pond, Sergei L.  
Kumar, Sudhir      
Leebens-Mack, Jim        
Lewis, Paul O.      
Mackey, Aaron    
Maddison, David        
Maddison, Wayne        
Piel, Bill       
Pontelli, Enrico    
Prosdocimi, Francisco       
Qiu, Weigang   
Rambaut, Andrew        
Stoltzfus, Arlin¹
Swofford, David L.   
Thompson, Julie     
Vos, Rutger¹   
Xia, Xuhua     
Zmasek, Christian    

¹ Organizers

² Virtual participant (by video link)

Goals for the Third Working Period

The mandate of the working group is to improve interoperability in evolutionary analysis. Although the original proposal had a narrow focus of supporting existing technologies better, the group has adopted the more forward-looking goal of developing a "Central Unifying Artefact" to serve as the Rosetta stone for interoperability and the development of a transition model language. The next two subsections outline what the specific goals within this scope were, for the period leading up to the third meeting and for the third meeting itself.

Goals for the period prior to the third meeting

The goals along with specific aims for this period - as described in greater detail in the second report - were briefly as follows:

  • developing a general ontology for comparative evolutionary analysis
    • Develop initial version of the CDAO ontology in OWL-DL
    • results of evaluating the initial version of CDAO
    • Write a manuscript or presentation describing the CDAO, for public use
    • Implement semantic transformation (Brian, Gopal, Enrico, Vien)
      • a web server providing translation for PHYLIP, NEXUS, and at least one other format
      • initial version of a validating error-recognizing parser for TreeBASE input (NEXUS) files
    • Use-case documentation (Weigang)
      • instantiate additional use cases with data files
  • developing a transition model language
    • Generate pseudo-code (i.e. english statements) which might then be mapped to whatever notation the group deems suitable. Start by making a list of examples (just models of substitutions for now), all from literature on the wg-evoinfo wiki. The expectation is that we can probably sketch out a feature set by next May.
    • David Swofford planned to work with Derrick Zwickl on something that would be implemented in Garli and (a future version of) PAUP*.
  • future data exchange syntax standard
    • Create a web presence at http://www.nexml.org with instance document validation, binding downloads and documentation.
    • Develop a set of generic java class libraries (Rutger, Wayne, Peter) to facilitate more complete nexml writing (currently only reading is supported). This will be done by designing simple interfaces which users of the class libraries (e.g. Mesquite) need to implement in their objects (Wayne, Peter) so that the class libraries can fetch the required attributes and child elements of the objects and serialize them to xml.
    • Develop a more complete Bio::Phylo nexml parser, to facilitate parsing of networks and character matrices, and develop xml writability by expanding the Bio::Phylo::XMLWritable class, along the same lines as the java interfaces.
    • Incorporate the python nexml parser libraries (for phycas) in the nexml svn repository (Jeet).
    • Clarify the output of the online nexml validator (http://nexml.org/nexml/validator), formalize difference between grammar-based and rule-based validation (Rutger).
    • Explore C(++)-parsing or data binding, for example for HyPhy (Sergei).

Goals for the third meeting

The purpose of the third meeting was to advance the goals of the working group by focusing on two complementary projects to develop a "central unifying artefact", the nexml project (www.nexml.org) and the CDAO project (www.evolutionaryontology.org).

The plan for the meeting was to hear talks about progress on these two projects, followed by discussion and planning on how to use our time at NESCent to build on this progress.

The remainder of the meeting would be spent on carrying out those plans in breakout groups, interspersed with talks on related topics: the MIAPA standard, phenotype ontologies, and data repositories.

Accomplishments of the Third Working Period

The accomplishments of the third working period include work done over the winter outside of NESCent, the work done at the May meeting at NESCent, and a few items completed in the follow-up period ending in June (although the report was not completed until 11 July).

Prior to the meeting

The goals of the working group were advanced considerably over the winter and spring, due to the efforts of individuals working on specific projects of interest to them. These outcomes were described in the presentations given on Day 1, starting with the brief presentation (Media:Stoltzfus.ppt) by Stoltzfus.

  • nexml: The developers made extensive progress on the emerging nexml standard. See the wiki page and Rutger's presentation.
    • Peter Midford and Rutger Vos implemented nexml i/o for Mesquite using xml beans
    • Rutger Vos implemented nexml writing for Bio::Phylo (reading was already implemented by Jason Caravas in previous working period)
    • Xuhua Xia implemented nexml i/o for DAMBE
    • Jeet Sukumaran implemented nexml i/o for pyNexml
    • Rutger Vos refined the schema for easier usage with c++ xml beans, and to implement requirements generated during discussions and outreach
    • NESCent now hosts nexml.org, with online validation and conversion, project news feeds and documentation
  • Ontology development: The ontology development team implemented an initial draft of CDAO
    • preliminary analysis of related artefacts was completed over the winter
    • the ConceptGlossary was expanded further
    • the core team (Pontelli, Thompson, Stoltzfus, Prosdocimi and Chisham) visited NESCent in March, developing 11 successive versions of CDAO in OWL-DL
    • Prosdocimi and Stoltzfus set up a CDAO SourceForge project
    • the core team went through another 10 revisions and annotated the ontology
    • the development team completed a rough draft of a manuscript describing CDAO
    • Weigang Qiu (working with students at Hunter College) began to formalize use-cases in UML
  • Transition Model Language.
    • no progress was made during this period

Summary of activities and discussion at Meeting 3

Day 1

The meeting opened at 9:00AM with remarks from Arlin Stoltzfus (File:Stoltzfus intro.ppt) and Jeff Sturkey. The meeting then continued with formal presentations:

  • The NESCent EvoInfo working group: Progress, Plans and Prospects (A. Stoltzfus) 20 min
  • Current state of the NeXML project (Rutger Vos) 40 min (slides)
    • problems with NEXUS, advantages of xml
    • triangle of semantics (CDAO), syntax (nexml) and "transport" (phyloWS)
    • design principles of nexml
      • re-use (property lists, graphml-like)
      • streaming-friendly (declare-before-use, meta-data first, venetian blinds, avoid deep hierarchy for trees)
    • implementation
  • Implementation of NEXML in DAMBE (Xuhua Xia) 20 min
    • implementation via visual basic xml parser
    • integrated smoothly into DAMBE interface for input or output
  • PhyloWS project (Hilmar Lapp) 20 min (see PhyloWS wiki pages)
  • First Draft of the Comparative Data Analysis Ontology (Enrico Pontelli) 40 min (see File:Enrico cdao.ppt)
    • Motivations (interoperation, reasoning)
    • Development process
    • core concepts (TUs, trees, character data)
    • some discussion of tree concepts
    • implementation details (OWL 1.1, translators, reasoners)

After these presentation sessions, Arlin Stoltzfus led a discussion on the group's near-term prospects. At 4:00PM the meeting broke out into groups to develop specific plans for the week. At 5:00PM the meeting reconvened for stand-ups to brief the other groups on their respective plans. The meeting then adjourned for a group dinner.

Day 2

The second day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for two talks. At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.

Lecture and discussion: Data standards and repositories

This session took place at 10:40.

  • Jim Leebens-Mack, U. Georgia (MIAPA) (20 min)
  • Repositories, data standards, and data reuse (Ryan Scherle, NESCent)
daily report from nexml-CDAO coordination group
  1. use appinfo field in nexml schema to specify CDAO classes
  2. coordinate names - very little progress was made on this; nexml terms are set already, difficult to change now
  3. Enrico began defining mappings by creating a third, mediating, ontology.
daily report from nexml implementation group
  1. R - nexml parsing and writing (Aaron)
  2. BioPerl (Weigang, Hilmar) - a new module was written: Bio::TreeIO::nexml, so a nexml tree file can be read as:

<perl>
use Bio::TreeIO;
use Bio::Phylo::IO qw(parse);
# read each tree from a nexml file and print its total length
my $tree_in = Bio::TreeIO->new(-file => 'trees.xml', -format => 'nexml');
while (my $tree = $tree_in->next_tree) {
    print $tree->calc_tree_length, "\n";
}
</perl>

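A Newick tree file can likewise be converted to nexml:

<perl>
use Bio::TreeIO;
# read trees from a Newick file and write each one out as nexml
my $tree_in = Bio::TreeIO->new(-file => 'longnames.dnd', -format => 'newick');
while (my $tree = $tree_in->next_tree) {
    my $tree_out = Bio::TreeIO->new(-format => 'nexml');
    $tree_out->write_tree($tree);
}
</perl>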

  • Results:
    • 2 new files: bioperl-live/Bio/TreeIO/nexml.pm; bioperl-live/t/data/trees.xml
    • 3 new tests in bioperl-live/t/TreeIO.t
  • Task 2. Use the standard BioPerl interface to write nexml character matrices.
    • Problem: Bio::Phylo reading of "characters.xml" generates exceptions
daily report from nexml transition model language group

focused on ways to limit scope of problem, e.g., focus on common models (HKY, F81, ...), specific packages (MEGA, DAMBE, PHYLIP, ...), most-used model concepts (based on TreeBASE submissions)
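
As a rough illustration of why the common models are a tractable starting point, the sketch below writes HKY85 as a flat parameter record (a transition/transversion ratio and four base frequencies) and obtains F81 as the special case with the ratio set to 1. This is an assumption made for illustration, not a notation the group proposed; the function name hky_rate_matrix and the record keys kappa and freqs are hypothetical.

```perl
use strict;
use warnings;

my @BASES  = qw(A C G T);
my %PURINE = (A => 1, G => 1);   # C and T are pyrimidines

# Build an unnormalized HKY85 rate matrix from a flat parameter record.
# The off-diagonal rate i->j is kappa*pi_j for transitions and pi_j for
# transversions; each diagonal entry makes its row sum to zero.
sub hky_rate_matrix {
    my ($model) = @_;
    my ($kappa, $pi) = @{$model}{qw(kappa freqs)};
    my %q;
    for my $i (@BASES) {
        my $row_sum = 0;
        for my $j (@BASES) {
            next if $i eq $j;
            # a transition stays within purines or within pyrimidines
            my $is_transition = ($PURINE{$i} // 0) == ($PURINE{$j} // 0);
            $q{$i}{$j} = ($is_transition ? $kappa : 1) * $pi->{$j};
            $row_sum += $q{$i}{$j};
        }
        $q{$i}{$i} = -$row_sum;
    }
    return \%q;
}

my %equal = map { $_ => 0.25 } @BASES;
my $hky = hky_rate_matrix({ kappa => 2, freqs => \%equal });
my $f81 = hky_rate_matrix({ kappa => 1, freqs => \%equal });   # F81: kappa fixed at 1

printf "HKY A->G %.2f, A->C %.2f\n", $hky->{A}{G}, $hky->{A}{C};
printf "F81 A->G %.2f, A->C %.2f\n", $f81->{A}{G}, $f81->{A}{C};
```

Models of this kind need only a handful of named parameters, which is what makes them a manageable first target; arbitrarily nested models would need a richer description than this flat record.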

Day 3

The third day of Meeting 3 was devoted almost entirely to task-specific work sessions. The group convened at 10:40AM for three talks. At 4:40PM the group reconvened for stand-ups and discussion on the day's progress. The meeting adjourned at 5:40PM for dinner.

Lecture and discussion: Representing non-molecular data

This session took place at 10:40.

  • Phenoscape Project (Jim Balhoff, NESCent)
  • Using ontologies to formalize comparative data on worm development (Arlin Stoltzfus)
stand-up: nexml implementation group

Aaron's R support for nexml in phylobase package

  • implementation, round-trip tests
  • problem with internal ids getting changed every time
    • is due to only one slot for r-assigned-id-or-user-assigned-label
    • need to alter object model to add another slot
  • code will be available in phylobase on r-forge
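
The id problem described above can be sketched in a few lines. The example is written in Perl rather than R and does not use the phylobase object model; the field names id and label and the helper functions are hypothetical. It only illustrates the idea that, once the software-assigned id and the user-assigned label occupy separate slots, the id may be regenerated on every read without affecting a round-trip comparison of labels.

```perl
use strict;
use warnings;

# Hypothetical node record with two separate slots:
#   id    - assigned by the software, may change on every read
#   label - assigned by the user, must survive a round trip
sub make_node {
    my ($id, $label) = @_;
    return { id => $id, label => $label };
}

# Simulate a write/read cycle in which internal ids are regenerated
# while user labels are carried over unchanged.
sub round_trip {
    my ($nodes) = @_;
    my $counter = 0;
    return [ map { make_node('n' . ++$counter, $_->{label}) } @$nodes ];
}

my $before = [
    make_node('x17', 'Homo sapiens'),
    make_node('x42', 'Pan troglodytes'),
];
my $after = round_trip($before);

# Compare labels only; ids are allowed to differ.
my $labels_before = join '|', map { $_->{label} } @$before;
my $labels_after  = join '|', map { $_->{label} } @$after;
my $same = $labels_before eq $labels_after;

print $same ? "labels preserved\n" : "labels changed\n";
```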
stand-up: nexml-CDAO group
  • Arlin - discussion of what kind of ontological things are aligned sequence residue characters
  • Enrico - more results of mapping ontology from nexml to CDAO
  • Brandon - bindings to CDAO from nexml.org C++ library (based on xerces)
  • Francisco - thinking about annotations (MIAPA compliance target), joining data sets, finding correlations
stand-up: transition model language group

Hilmar

  • working on stuff from Peter Midford and Jeet
  • inspired by BEAST implementation
  • nexml part with model substitution language
  • Hilmar added to nexml developer list
post-standup discussion: nexml promulgation strategy

The idea of a NESCent nexml support hackathon

  • need to have conformance levels defined in advance, with tests based on test data
  • applications developers: Ronquist, Swofford, Beerli, Kuhner, Felsenstein, Zwickl, Kosakovsky Pond, Rod Page, Sanderson, Eulenstein, Burleigh, Zmasek, Stamatakis, Goloboff, Farris, Rambaut, Drummond, Holder, Maddison, Maddison,
  • library developers: Mackey, Paradis, Bolker, Thierer, Lewis,
  • data resource managers: need list for treefam, hovergen,
  • need for carrots: data resources, capabilities, services (id resolution, species links, visualization)

Day 4

The first half of the final day was devoted to task-specific work sessions. After lunch, at 1:20PM the group convened for final reports on the week's progress and discussion on plans for the coming working period. By this time, several members had left. Nevertheless, several other members had late or next-day flights and continued to talk into the evening. The main topics of discussion were how to build our ongoing projects, how to reach out to the community, and how to evaluate performance.

1. Creating carrots: CDAO-nexml data interop project

To take advantage of Francisco's presence, we began with a discussion of CDAO, but this quickly led into discussion of a joint project to tackle data resources as implementation targets for CDAO-based nexml support. See the CarrotBase page for the further development of this idea.

2. Outreach
  1. Manuscripts for nexml and cdao
  2. Websites
    • evolutionaryontology.org (belongs to Arlin, will map to NESCent, provide home for CDAO)
    • nexml.org
  3. Blog
    • main issue is keeping up level of activity
    • distribute among group: a rate of 2 posts per week, spread over 10 contributors, works out to about 1 post per person per month
    • Hilmar has agreed to set up a trial site
  4. Other means, such as evoldir announcements
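
The posting-rate estimate in the Blog item can be checked with a short calculation. This is only a sketch: the function name and the 52-week year are assumptions made for illustration.

```python
def posts_per_person_per_month(posts_per_week, contributors, weeks_per_year=52):
    """Average number of blog posts each contributor writes per month."""
    # Assumption: a 52-week year divided evenly into 12 months.
    posts_per_month = posts_per_week * weeks_per_year / 12
    return posts_per_month / contributors


rate = posts_per_person_per_month(posts_per_week=2, contributors=10)
print(f"{rate:.2f} posts per person per month")  # prints 0.87
```

At 2 posts per week the group produces a little under 9 posts per month, so each of 10 contributors writes slightly less than 1 post per month, consistent with the estimate of "about 1".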
3. Performance evaluation

Our goal is to promote interoperability. How do we know if this is working? How can we measure performance?

Our strategy for reaching this interop goal is to develop artefacts that facilitate interop, so the evaluation strategy is based on the use of these artefacts, mainly cdao and nexml.

  1. Indirect indicators, i.e., indicators of activity or interest:
    • pubs, or grants, with 3 or more group authors
    • invited talks on group projects
    • citation of project pubs
    • web site hits to cdao or nexml home sites
  2. Direct indicators, i.e., indicators of actual use
    • number (or fraction) of nexml implementations (import or export) in data resources, apps, libraries
    • number (or fraction) of nexml instances in archive submissions, service calls (translation, validation, etc)
    • number of cdao-mediated service calls (calculations, database transactions, etc)
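
The direct indicators above reduce to counts and fractions over a list of surveyed resources. The following is a minimal sketch of how such a tally might be kept. The `Resource` record, the `nexml_adoption` function, and all entries in `survey` are hypothetical placeholders, not survey results.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    kind: str                  # "data resource", "app", or "library"
    nexml_import: bool = False
    nexml_export: bool = False


def nexml_adoption(resources):
    """Return {kind: (number supporting nexml, total, fraction)}."""
    tally = {}
    for r in resources:
        supported, total = tally.get(r.kind, (0, 0))
        # A resource counts as an implementation if it imports or exports nexml.
        if r.nexml_import or r.nexml_export:
            supported += 1
        tally[r.kind] = (supported, total + 1)
    return {kind: (s, t, s / t) for kind, (s, t) in tally.items()}


# Hypothetical placeholder entries, used only to exercise the function.
survey = [
    Resource("app-A", "app", nexml_import=True),
    Resource("app-B", "app"),
    Resource("library-A", "library", nexml_import=True, nexml_export=True),
    Resource("database-A", "data resource"),
]

adoption = nexml_adoption(survey)
for kind, (supported, total, fraction) in adoption.items():
    print(f"{kind}: {supported}/{total} ({fraction:.0%})")
```

Reporting both the count and the fraction matters because the denominator (the set of resources surveyed) is itself a choice that would need to be fixed before the indicator can be compared across working periods.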

Immediately following the meeting

In the two weeks following the meeting, considerable time was spent developing a plan to advance the development of a MIAPA standard, as described on the Supporting_MIAPA page.

MIAPA: a way forward

Our plan (see Supporting_MIAPA) centers on a community exercise in knowledge capture, and the tools to facilitate this exercise, as a way to acquire knowledge and to engage the research community. Ideally, we would get dozens or hundreds of volunteers to generate MIAPA reports about their phylogenetic studies, and we would use this experience to improve our conception of MIAPA and the technology to support it.

So far we have

  • identified, and developed a plan to respond to, infrastructure needs for supporting MIAPA (file format, ontology, etc)
  • implemented a preliminary ontology (miapa.owl) for metadata describing sources and methods
  • identified resources that could be used to flesh out this ontology (e.g., CDAO, mygrid services ontology)
  • developed a Phenote-based proof-of-concept application for creating workflow descriptions from the MIAPA ontology
  • developed a plan for a web site for users to enter workflow descriptions and other info to yield a complete MIAPA report
  • developed a plan for using this web site in a knowledge capture experiment at Evolution 2008 (or some other meeting).

Initially we hoped to move quickly to deploy this project in time for the ontology workshop (and related buzz) at the Evolution 2008 meetings in Minneapolis. Instead, we will develop this over a longer time. The next step might be to re-factor the wiki page into a White Paper on MIAPA for NESCent. We could suggest that NESCent recruit a catalysis-like group to implement it. Another direction would be to recruit a team of collaborators and submit a grant proposal.


Strategy and plans for the Fourth Working Period

Development typically goes through a cycle of design, implementation and evaluation. Over the past year the working group has focused on the initial round of design and implementation. While this process will continue in the near future, the time has come to begin evaluation, i.e., subjecting the artefacts to practical tests in the context of aiding research projects or serving community needs.

Next-generation file format standard

For NeXML to be useful and to build support in the community, we must first establish an initial presence. The Nexml team set goals in terms of i) a baseline level of penetration of Nexml into resources (programming language libraries, applications, and web services), ii) a web presence, and iii) a published description. In this period, the Nexml team will

  1. complete and submit a manuscript describing Nexml
  2. meet the baseline level of penetration as given in Second_Report_to_NESCent: Strategy and plans for the Third Working Period
    • 5 programming languages (java, perl, python, c++ and one of javascript (json) or ruby (bioruby))
    • 5 applications or toolkits (Mesquite, hyphy, phycas, Bio::Phylo and one of ncl, bioruby, paup, mega, jebl, geneious, beast)
    • 3 web services (three of Tree of Life, cipres, TreeBASE, pPOD, MorphBank, MorphoBank)
  3. plan and initiate a practical evaluation project
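
The baseline in item 2 amounts to a small set of rules: some named targets are required, and the rest may be drawn from a list of alternatives. A sketch of checking progress against those rules follows, using the target names listed above. The `BASELINE` layout and the `baseline_met` function are assumptions for illustration, and the `supported_languages` set is a hypothetical status used only to exercise the check, not a report of actual support.

```python
# Targets transcribed from the baseline list; the dict layout is an assumption.
BASELINE = {
    "programming languages": {
        "required": {"java", "perl", "python", "c++"},
        "alternatives": {"javascript", "ruby"},
        "min_alternatives": 1,
    },
    "applications or toolkits": {
        "required": {"Mesquite", "hyphy", "phycas", "Bio::Phylo"},
        "alternatives": {"ncl", "bioruby", "paup", "mega", "jebl",
                         "geneious", "beast"},
        "min_alternatives": 1,
    },
    "web services": {
        "required": set(),
        "alternatives": {"Tree of Life", "cipres", "TreeBASE", "pPOD",
                         "MorphBank", "MorphoBank"},
        "min_alternatives": 3,
    },
}


def baseline_met(category, supported):
    """True if every required target and enough alternatives are supported."""
    rule = BASELINE[category]
    missing = rule["required"] - supported
    n_alternatives = len(rule["alternatives"] & supported)
    return not missing and n_alternatives >= rule["min_alternatives"]


# Hypothetical status: c++ and the fifth language are still outstanding.
supported_languages = {"java", "perl", "python"}
print(baseline_met("programming languages", supported_languages))  # prints False
```

Encoding the "one of" and "three of" clauses as a minimum count over a set of alternatives keeps the check faithful to the wording of the baseline without committing to which alternative is chosen.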

Ontology

In the last period our goal was to make it through the implementation stage in the first development cycle (design, implementation, evaluation). This goal was met on time. Our goals now are to:

  1. publish a manuscript (currently nearing completion) describing CDAO
  2. establish a better web presence
  3. participate in a practical evaluation project, either CarrotBase or Supporting_MIAPA
  4. submit a (revised) major proposal for external funding

Anticipated Outcomes and Products

The anticipated outcomes and products for the coming working period follow rather directly from the goals above.

Next-generation file format standard

  • a publication describing Nexml
  • improved software support for nexml in terms of
    • applications that use nexml as input or output
    • software libraries that provide methods for nexml
    • archives or databases with nexml output or input
    • services that use nexml input or output
  • the use of Nexml in actual research projects, as measured in terms of citations
  • increased community interest in Nexml, as measured by SourceForge downloads and citations

Ontology

  • a publication describing CDAO
  • inclusion of CDAO in OBO
  • demonstrations of CDAO representations of
    • molecular sequence data
    • comparative data on developmental and anatomical characters