First Report to NESCent

From Evolutionary Informatics Working Group

22 June, 2007

Group leaders: Arlin Stoltzfus and Rutger Vos

Executive summary

The mandate of the evolutionary informatics working group is to promote a broader and deeper application of evolutionary methods in biology by addressing relevant issues in software and data interoperability. The first meeting of the working group took place at NESCent, May 21-23, 2007, with 12 members in attendance. Prior to the meeting, group members participated in a planning exercise and also worked with NESCent to organize a phyloinformatics hackathon. The meeting consisted largely of discussions aimed at identifying the intersection between what the group can accomplish and what activities would contribute most to its goal of advancing interoperability in evolutionary analysis. The working group came up with a short list of task areas, each with specific deliverables: developing an ontology for molecular character analysis; developing a language for describing evolutionary transition models; establishing communication with interoperability stakeholders; and keeping abreast of efforts to develop a next-generation file format standard. The activities and accomplishments of the working group are documented on its wiki site.

Project leaders and participants

Because adoption of and adherence to standards must be largely community-driven, the evolutionary informatics working group seeks to be widely inclusive. To this end, working group organizers and participants continuously reach out to the wider community to keep interested individuals and organizations abreast of the latest developments and future plans, and to request feedback and comments. As a result, the group's "membership" fluctuates somewhat from one step of the project to the next, though in total many of the leading developers and theoreticians in the field of evolutionary informatics are involved, as shown below.

The following table lists the participants and the steps of the project they were involved with to date.
Participant | Original proposal | December hackathon | Prioritization exercise | May 2007 meeting
Eisen, Jonathan   
Felsenstein, Joe    
Gupta, Gopal    
Holder, Mark
Huelsenbeck, John   
Kosakovsky Pond, Sergei L.
Kumar, Sudhir  
Lewis, Paul O.  
Mackey, Aaron
Maddison, David    
Maddison, Wayne    
Pontelli, Enrico    
Qiu, Weigang
Rambaut, Andrew    
Stoltzfus, Arlin¹
Swofford, David L.
Vos, Rutger¹
Xia, Xuhua  
Zmasek, Christian

¹ Organizers

Report on the First Meeting

Pre-meeting activities

Before the working group had met, the proposal itself already had a catalytic effect, with group members Holder, Mackey, Qiu, Stoltzfus, Swofford and Vos helping NESCent’s informatics team to organize a phyloinformatics hackathon (which they attended in December 2006). These individuals contributed much of the material in the list of use-cases that became a focal point for planning, both for the hackathon and for the working group (see https://www.nescent.org/wg_phyloinformatics/Public:UseCases).

Indeed, some of the activities carried out at the hackathon were envisioned originally for the informatics working group. In particular, group members Stoltzfus, Lewis and Holder produced a substantial document on NEXUS conformance (https://www.nescent.org/wg_phyloinformatics/Supporting_NEXUS).

Prioritized goals for the first meeting

To set the agenda for the first meeting, a prioritization exercise was carried out. Group members were asked to rank the top 3 priorities from a set of 10 possible task areas. The group leaders analyzed the results and assigned four levels of priority:

  • (1) Supporting current file format standards so software can communicate data easily
  • (2) A description language for state-transition models
  • (2) Supporting a database archive for flexible storage of rich data sets and metadata
  • (2) Library of examples to serve as targets for development and testing
  • (3) Developing a next-generation data exchange standard (e.g., XML of NEXUS)
  • (3) Analysis templates so that novices can apply expert choices to their own data
  • (4) The future: identifying new demands from new types of data and analyses
  • (4) Support for validation and comparative assessment of methods
  • (4) Education and outreach to provide teaching materials, organize workshops, etc.
  • (4) Supporting special demands of Bayesian analysis

However, the “database archive” goal was downgraded because an NSF-funded collaborative project is already working on such an archive (see http://phylodata.seas.upenn.edu/cgi-bin/wiki/pmwiki.php).

The group leaders developed an agenda focused on the top priority, supporting current standards, while laying the groundwork for other high priorities. The agenda called for some initial discussions, after which the participants would break out into small task-oriented groups on the first day of the meeting. The group leaders suggested that an appropriate set of task-oriented groups would be:

  • formalization: develop grammars for current standards
  • instance library: collect sample files for common file formats
  • user practices: research user choices of software, formats, data types

Summary of activities and discussion at first meeting

At the meeting, the agenda was somewhat amended because some participants wanted to have more discussion prior to deciding on a strategy.

Day 1

Day 1 began with brief talks from NESCent officials followed by a 3-minute "lightning talk" from each participant. This was followed by a "visions of interoperability" discussion that sometimes strayed from the big picture. One participant expressed a grand vision of evolutionary methods being integrated transparently into workflows in genomics; another participant had a more pragmatic vision of getting a handful of most-used phylogenetic analysis packages to talk to each other.

After lunch, Stoltzfus presented the results of the prioritization exercise and suggested that we define specific tasks for supporting current standards. However, there was an objection to the presumption that we were ready for this, so we went back to discussions.

Day 2

During Day 1 some topics came up repeatedly, so at the end of that day the leaders recruited participants to begin Day 2 with focused presentations on four key topics:

  1. (Holder) Pragmatic gap analysis and remediation
  2. (Pond) Substitution model language as a token problem
  3. (Qiu) Dissemination and advocacy of standards and best practices
  4. (Mackey) Unifying data model for evolutionary analysis

In the afternoon of Day 2 we began to discuss the form of the data model, starting with the question of what the OTUs or "taxa" are (aggregate or monad; a reference or a container; are ancestors or fossils included?). This proceeded to a discussion of what "characters" represent, and what the unit of analysis is. By the end of Day 2, the group had a clearer idea of priorities and how to accomplish them.

Day 3

Day 3 began with the following summary:

  • Themes to which we keep returning
    • NEXUS flavors, conflicts in interpretation, ambiguities
    • the next data exchange standard format (e.g., XML-ified NEXUS)
    • maintaining quality, avoiding errors
    • a language for substitution models
    • conventions for managing mixed data sets
  • Some tentative conclusions
    • We value developing a central unifying artefact (e.g., ontology, data model)
    • We recognize an acute need for a substitution model language, but otherwise, our concerns are driven more by emerging opportunities than by current barriers
    • We value the 80:20 rule: implement the simple solution that covers 80 % of user needs (defer the far greater effort required to get the remaining 20 %).

This summary was followed by discussion to identify possible deliverables listed in the next section.

Strategy and plans for follow-up activities emerging from the meeting

In the course of the meeting, three areas of attention for follow-up activities emerged. The first is creating deliverables, both as analyses of current practices and problems and as coding artifacts such as an ontology or data model, flat file grammars for current standards, and data schemas for future ones. The second is documenting the group's activities and discussions in order to compile an informal knowledge base, for future reference, of the current status of evolutionary informatics. The third is outreach: contacting like-minded organizations, disseminating the group's activities to the wider community, and partnering with interested organizations and individuals in that community.

Deliverables

The working group seeks to produce tangible deliverables that aid the evolutionary informatics community in combating interoperability problems. These deliverables include documentation of current practices and standards "in the wild" and coding artifacts such as schemas and ontologies.

Analysis of Current Standards

Many different standards to express evolutionary data now exist. For the most part, these consist of flat file formats whose exact grammar and syntax are ambiguously defined (i.e. without recourse to a formal schema or grammar). The working group seeks to "reverse engineer" these standards and develop more formal specifications for them. This involves:

  • collecting examples of current file formats;
  • documenting pathologies and custom extensions (with particular attention to the NEXUS format);
  • developing formal grammars for these formats (with particular attention to MEGA and PHYLIP);
  • implementing a validation server for selected file formats (contingent on grammars);
  • implementing a semantic translation server for selected file formats (contingent on grammars);
  • developing additional applications using grammars (for example, syntax highlighting).
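
To illustrate how a formal description of a format could drive validation, here is a minimal sketch in Python. It is not a grammar produced by the working group. It assumes a deliberately simplified "relaxed" sequential PHYLIP layout (a header with the number of taxa and characters, then one whitespace-separated name and sequence per line), and the function name `validate_phylip` and the accepted character set are hypothetical.

```python
import re

# Hypothetical, deliberately small description of a "relaxed" sequential
# PHYLIP file: a header line with two integers, then one line per taxon
# holding a name, whitespace, and the sequence. Real PHYLIP files may be
# interleaved or use fixed-width names; this sketch ignores both.
HEADER = re.compile(r"^\s*(\d+)\s+(\d+)\s*$")
ROW = re.compile(r"^(\S+)\s+([A-Za-z?\-]+)\s*$")


def validate_phylip(text):
    """Return a list of error messages; an empty list means the text conforms."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["empty file"]
    header = HEADER.match(lines[0])
    if header is None:
        return ["header: expected '<ntax> <nchar>'"]
    ntax, nchar = int(header.group(1)), int(header.group(2))
    errors = []
    rows = lines[1:]
    if len(rows) != ntax:
        errors.append(f"expected {ntax} taxa, found {len(rows)}")
    # Rows are numbered by position among the non-blank data lines.
    for index, line in enumerate(rows, start=1):
        row = ROW.match(line)
        if row is None:
            errors.append(f"taxon row {index}: expected '<name> <sequence>'")
        elif len(row.group(2)) != nchar:
            errors.append(
                f"taxon row {index}: expected {nchar} characters, "
                f"found {len(row.group(2))}"
            )
    return errors
```

Calling `validate_phylip("2 4\nA ACGT\nB AC-T\n")` returns an empty list, while a file whose header disagrees with its contents yields one message per problem. A validation server of the kind listed above would presumably rest on a fuller grammar rather than two regular expressions.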

Ontology or data model

The working group seeks to establish an ontology or data model that defines the relations between the union of data types currently represented in selected commonly used file formats. This ontology will be used as the "super representation" to which the formal grammars described in the previous section will translate the files they operate on so that translation between formats will be facilitated. In addition, the working group envisions a situation where future data standards will serialize directly from this ontology into markup or flat file formats. To both these ends, the current commitments of the working group are:

  • generation of a UML or entity relationship diagram (ERD) of the perceived entity relations as they emerged in the discussions of the first meeting;
  • collection of publications on ontologies and semantic transformation;
  • sketching a staged evaluation strategy for the ontology.
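
The "super representation" idea can be sketched as a small in-memory model that several serializers read from. The Python below is only an illustration under strong assumptions: the class names `OTU` and `CharacterMatrix` and the two writer functions are hypothetical, an OTU is reduced to a bare label (the group had not settled what an OTU is), and the output formats are simplified.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OTU:
    """An operational taxonomic unit, identified here only by a label."""
    label: str


@dataclass
class CharacterMatrix:
    """Character states for each OTU; all rows must have equal length."""
    rows: dict = field(default_factory=dict)  # maps OTU -> state string

    @property
    def nchar(self):
        # Number of characters, taken from the first row (0 if empty).
        return len(next(iter(self.rows.values()))) if self.rows else 0

    def add(self, otu, states):
        if self.rows and len(states) != self.nchar:
            raise ValueError("all rows must have the same number of characters")
        self.rows[otu] = states


def to_phylip(matrix):
    """Serialize to a simplified, sequential PHYLIP-like text."""
    lines = [f"{len(matrix.rows)} {matrix.nchar}"]
    lines += [f"{otu.label} {states}" for otu, states in matrix.rows.items()]
    return "\n".join(lines) + "\n"


def to_fasta(matrix):
    """Serialize the same matrix to FASTA-like text."""
    return "".join(
        f">{otu.label}\n{states}\n" for otu, states in matrix.rows.items()
    )
```

With one shared representation, translating between N formats needs N readers and N writers rather than a converter for every pair of formats.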

Substitution (transition) model language

At present, no shared standard exists for expressing substitution models (for example, nucleotide substitution models such as JC69). Commonly used software packages and the file formats they operate on each have their own syntax and grammar for models. The working group seeks to develop a standard model language so that assumptions about the underlying evolutionary process for a given data set can be exchanged between programs. During the hackathon in December, first steps were taken to define such a model language in IDL. Future plans and commitments focus on extending and improving that first effort. This would involve:

  • clarification of the problem (define problem space, use cases, requirements);
  • identification of, and outreach to, further stakeholders;
  • suggestion of an evaluation scheme (possibly through inclusion in ontology, validation server);
  • further planning for completing the work done so far (e.g. through a proposal to NESCent for the visiting scientist program).
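
As a rough illustration of what a program-independent model description might carry, the sketch below declares JC69 as plain data and derives a rate matrix from it. This is not the IDL drafted at the hackathon. The dictionary layout, the field names, and the functions `rate_matrix` and `jc69_probabilities` are hypothetical, and the parameterization (symmetric exchangeabilities times equilibrium frequencies, scaled to one expected substitution per unit time) is one common convention assumed here.

```python
import math

# Hypothetical declarative description of a model; the field names are
# illustrative and are not taken from the hackathon's IDL draft.
JC69 = {
    "name": "JC69",
    "states": "ACGT",
    "frequencies": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exchangeabilities": {
        "AC": 1.0, "AG": 1.0, "AT": 1.0, "CG": 1.0, "CT": 1.0, "GT": 1.0,
    },
}


def rate_matrix(model):
    """Build an instantaneous rate matrix, keyed by (from, to) state pairs."""
    states = model["states"]
    freqs = model["frequencies"]
    rates = model["exchangeabilities"]
    q = {}
    for i in states:
        for j in states:
            if i != j:
                key = "".join(sorted(i + j))  # "CA" and "AC" share one rate
                q[i, j] = rates[key] * freqs[j]
        # Each diagonal entry makes its row sum to zero.
        q[i, i] = -sum(q[i, j] for j in states if j != i)
    # Scale so that the expected substitution rate at equilibrium is 1.
    scale = -sum(freqs[i] * q[i, i] for i in states)
    return {pair: value / scale for pair, value in q.items()}


def jc69_probabilities(t):
    """Closed-form JC69 probabilities for a branch of length t
    (expected substitutions per site): (same state, one given other state)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
```

For JC69 every off-diagonal rate comes out equal after scaling, which is exactly the assumption a receiving program would need to reconstruct. Richer models would differ only in the declared frequencies and exchangeabilities.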

Future data exchange standard

As stated in the section on the ontology or data model, the working group anticipates that the ontology will be of great use in serializing anything mapped onto it (e.g. data parsed from current file formats using formal grammars) into a future data exchange standard. However, the creation of an ontology per se leaves the semantics of that future standard undefined. Therefore, a separate effort is underway to work out what this standard would look like. Through these parallel areas of attention, the working group undertakes not just a "top down" approach - from abstract representation of data in an ontology to serialization - but also a "bottom up" approach - from serialized format to schema. The working group expects these two efforts to have synergistic effects on each other: problems or omissions in one effort may be identified by the other. To date, some effort has already gone into an XML-based data exchange standard (dubbed "nexml"). The group plans to revisit this ongoing effort during the next meeting.

Documentation

The working group is strongly committed to documenting the efforts and discussions as they emerge during - and between - meetings. This ongoing effort takes place on the working group's wiki, where members record, discuss, edit and update the status of the different areas of attention for the group.

Outreach

The working group is committed to reaching out to and partnering with like-minded, interested individuals and organizations in the wider evolutionary informatics community. To this end, we identified a number of potential partners during the first meeting. In particular, on the topic of ontology development we identified:

In addition, we concluded that scientific journals in our field should be invited to take note of our efforts, the rationale being that adoption of and adherence to data standards will be greatly enhanced if journals make them a requirement for submission.

Lastly, we identified some other institutional centers of influence or possible interest, notably NCBI and EBI.

The working group is committed to contacting these organizations and individuals to keep them abreast of the working group's efforts.

Future plans of the working group

This section needs to be written. The working group does not have a strategic multi-year plan. We are just going to go from one meeting to the next. Our goal for the next meeting is to accomplish minor goals that we set at the first meeting, and to make substantial progress on major goals.

When the next meeting comes, will we re-assess and go from there? Or will we try to plan ahead of the meeting what we need to work on, and make the meeting a work session? That depends on how much gets done in between.