First Report to NESCent

From Evolutionary Informatics Working Group
Revision as of 16:50, 22 June 2007 by Arlin.stoltzfus@nist.gov (talk)

22 June, 2007

Group leaders: Arlin Stoltzfus and Rutger Vos

Executive summary

The mandate of the evolutionary informatics working group is to facilitate advances in interoperability in evolutionary analysis. The first meeting of the working group took place at NESCent, May 21-23, with 12 members in attendance. Prior to their first meeting, group members participated in group planning and also worked with NESCent to organize a phyloinformatics hackathon. The meeting consisted largely of discussions aimed at identifying the intersection between what the group can accomplish and what activities would contribute most to its goal of advancing interoperability in evolutionary analysis. The working group came up with a short list of task areas, each with specific deliverables: developing an ontology for molecular character analysis; developing a language for describing evolutionary transition models; establishing communication with interoperability stakeholders; and remaining seized of efforts to develop a next-generation file format standard. The activities and accomplishments of the working group are documented on its wiki site.

Project leaders and participants

Name                        Original proposal  December hackathon  Prioritization exercise  May 2007 meeting
Eisen, Jonathan             1                  0                   1                        0
Felsenstein, Joe            1                  0                   0                        0
Gupta, Gopal                0                  0                   0                        1
Holder, Mark                1                  1                   1                        1
Huelsenbeck, John           1                  0                   1                        0
Kosakovsky Pond, Sergei L.  1                  1                   1                        1
Kumar, Sudhir               1                  0                   1                        1
Lewis, Paul O.              1                  1                   1                        0
Mackey, Aaron               1                  1                   1                        1
Maddison, David             1                  0                   0                        0
Maddison, Wayne             1                  0                   0                        0
Pontelli, Enrico            0                  0                   0                        1
Qiu, Weigang                1                  1                   1                        1
Rambaut, Andrew             1                  0                   0                        0
Stoltzfus, Arlin            1                  1                   1                        1
Swofford, David L.          1                  1                   1                        1
Vos, Rutger                 1                  1                   1                        1
Xia, Xuhua                  1                  0                   1                        1
Zmasek, Christian           1                  1                   1                        1

Report on the First Meeting

Pre-meeting activities

Before the working group had ever met, the proposal itself had a kind of catalytic effect, with group members Holder, Mackey, Qiu, Stoltzfus, Swofford and Vos helping NESCent’s informatics team to organize a phyloinformatics hackathon (which they attended in December 2006). These individuals contributed much of the material in the list of use-cases that became a focal point for planning, both for the hackathon and for the working group (see https://www.nescent.org/wg_phyloinformatics/Public:UseCases).

Indeed, some of the activities carried out at the hackathon were envisioned originally for the informatics working group. In particular, group members Stoltzfus, Lewis and Holder produced a substantial document on NEXUS conformance (https://www.nescent.org/wg_phyloinformatics/Supporting_NEXUS).

Goals for the first meeting

To set the agenda for the first meeting, a prioritization exercise was carried out in which group members ranked the top 3 of a set of 10 task areas. The group leaders analyzed the results, which led to four levels of priority:

  • (1) Supporting current file format standards so software can communicate data easily
  • (2) A description language for state-transition models.
  • (2) Supporting a database archive for flexible storage of rich data sets and metadata
  • (2) Library of examples to serve as targets for development and testing
  • (3) Developing a next-generation data exchange standard (e.g., XML of NEXUS)
  • (3) Analysis templates so that novices can apply expert choices to their own data
  • (4) The future: identifying new demands from new types of data and analyses.
  • (4) Support for validation and comparative assessment of methods.
  • (4) Education and outreach to provide teaching materials, organize workshops, etc.
  • (4) Supporting special demands of Bayesian analysis.

However, the “database archive” goal was downgraded because an NSF-funded collaborative project is already working on this type of archive (see http://phylodata.seas.upenn.edu/cgi-bin/wiki/pmwiki.php).

The group leaders developed an agenda focused on the top priority, supporting current standards, while laying the groundwork for other high priorities. The agenda called for some initial discussions, after which the participants would break out into small task-oriented groups on the first day of the meeting. The group leaders suggested that an appropriate set of task-oriented groups would be

  • formalization: develop grammars for current standards
  • instance library: collect sample files for common file formats
  • user practices: research user choices of software, formats, data types

Summary of activities and discussion at the meeting

At the actual meeting, the agenda was discarded because some participants wanted to have more discussion prior to deciding on a strategy.

Day 1 began with brief talks from NESCent officials followed by a 3-minute "lightning talk" from each participant. This was followed by a "visions of interoperability" discussion that sometimes strayed from the big picture. One participant expressed a grand vision of evolutionary methods being integrated transparently into workflows in genomics; another participant had a more pragmatic vision of getting a handful of most-used phylogenetic analysis packages to talk to each other.

After lunch, Stoltzfus presented the results of the prioritization exercise and suggested that we define specific tasks for supporting current standards. However, there was an objection to the presumption that we were ready for this, so we went back to discussions. Some topics came up repeatedly, so at the end of the day, the leaders recruited participants to begin Day 2 with focused presentations on four key topics:

  1. (Holder) Pragmatic gap analysis and remediation
  2. (Pond) Substitution model language as a token problem
  3. (Qiu) Dissemination and advocacy of standards and best practices
  4. (Mackey) Unifying data model for evolutionary analysis

In the afternoon of Day 2 we began to discuss the form of the data model, starting with the question of what the OTUs or "taxa" are (aggregate or monad; a reference or a container; are ancestors or fossils included?). This led to a discussion of what "characters" represent and what the unit of analysis is. By the end of Day 2, the group had a clearer idea of priorities and how to accomplish them.

Day 3 began with the following summary:

  • Themes to which we keep returning
    • NEXUS flavors, conflicts in interpretation, ambiguities
    • the next data exchange standard format (e.g., XML-ified NEXUS)
    • maintaining quality, avoiding errors
    • a language for transition models
    • conventions for managing mixed data sets
  • Some tentative conclusions
    • We value developing a central unifying artefact (e.g., ontology, data model)
    • We recognize an acute need for a transition model language
    • Other than that, our interoperability concerns are driven by emerging opportunities rather than current barriers
    • We value the 80:20 rule: implement the simple solution that covers 80% of user needs (defer the far greater effort required to get the remaining 20%).

This summary was followed by discussion to identify possible deliverables listed in the next section.

Strategy and plans for follow-up activities emerging from the meeting

Planned activities fall into three categories.

  1. Task areas. Most of them fall into the list of original aims for prioritization.
  2. Documentation
  3. Outreach, dissemination, partnering

Substitution (transition) model language

  1. clarify problem, identify stakeholders, suggest evaluation scheme
  2. propose work session under short-term visiting scientist program (Mark)

Supporting Current Standards

  1. file format examples (wiki topic), include documentation, pathological examples, NEXUS extensions
  2. formalization of MEGA, PHYLIP
  3. validation of chosen file formats, possibly via a NESCent-hosted server
  4. (possible) circumscribed demonstration of syntax highlighting to allow identification of errors
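As an illustration of what the validation deliverable above might involve, here is a minimal sketch in Python. It is not a working group product. It assumes a simplified "relaxed sequential" PHYLIP-style layout (a header line giving the taxon and character counts, then one line per taxon with a name, whitespace, and the sequence). The function name validate_relaxed_phylip and the error messages are hypothetical.

```python
def validate_relaxed_phylip(text):
    """Return a list of error messages; an empty list means no problems were found.

    Assumption: a simplified relaxed sequential layout, one taxon per line.
    Reported line numbers count non-blank lines only.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["file is empty"]

    # The header must be exactly two non-negative integers: ntax and nchar.
    header = lines[0].split()
    if len(header) != 2 or not all(tok.isdigit() for tok in header):
        return ["header must contain two integers: taxon count and character count"]
    ntax, nchar = int(header[0]), int(header[1])

    errors = []
    rows = lines[1:]
    if len(rows) != ntax:
        errors.append(f"header declares {ntax} taxa, found {len(rows)}")

    seen = set()
    for lineno, row in enumerate(rows, start=2):
        parts = row.split(None, 1)
        if len(parts) != 2:
            errors.append(f"line {lineno}: expected a name followed by a sequence")
            continue
        # Spaces inside the sequence are ignored when counting characters.
        name, seq = parts[0], "".join(parts[1].split())
        if name in seen:
            errors.append(f"line {lineno}: duplicate taxon name {name!r}")
        seen.add(name)
        if len(seq) != nchar:
            errors.append(
                f"line {lineno}: {name!r} has {len(seq)} characters, expected {nchar}"
            )
    return errors
```

A checker of this kind could be run over the "pathological examples" collected under item 1, with each example expected to trigger at least one error message.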

Ontology version 0.1

  1. generate and add UML from Tuesday session (Hilmar)
  2. link publications on ontologies and semantic transformation (Gopal)
  3. sketch a staged evaluation strategy for ontology (Gopal, Enrico, Arlin)

Future data exchange standard

  1. XML-ified NEXUS (Wayne, Rutger)

Outreach, dissemination, partnering

  1. identify possible partners and funding opportunities for ontology development
    1. National Center for Biomedical Ontology (Hilmar)
    2. pPod
    3. TreeBase
    4. Adam Goldstein (the philosopher, not the disk jockey) and his blog
    5. PosSUM (European) data standard initiative: http://www.possum-datastandard.org/possum/index.php/Main_Page
    6. INTEROP: a new NSF Program Solicitation, Community-based Data Interoperability Networks
  2. Analysis (wiki report) of interest of journals in databasing alignments and trees (Weigang)
  3. Other institutional centers of influence or possible interest
    1. NCBI (pop set), ask Lipman (Arlin)
    2. EBI, ask Ewan Birney (Aaron)

Anticipated outcomes and products

  1. Documentation. The working group is committed to documenting its deliberations and outputs on the NESCent wiki.
  2. other
  3. other

Future plans of the working group