Meeting 2 notes

From Evolutionary Informatics Working Group

Some Pre-meeting Comments

From those who could not attend

Bill Piel's comments

Since Bill decided not to attend due to a conflict, he sent me (Arlin) his thoughts on lessons from TreeBASE and what is needed for a new data format or ontology or schema.

The TreeBASE schema (ER and OM) is not especially exciting or unexpected. It's largely NEXUS with the following added on: citation info (largely compatible with myphpbib); and a basic analysis consisting of analysis steps, each with data in, data out, algorithm, software, and commands/parameters. One thing it does which neither NEXUS nor Mesquite-extended NEXUS can do is attach metadata to row segments. If one or more matrices all map to a set of trees, each matrix and tree must contain the same set of taxon labels so that all the rows match up with one another and with the leaves of the trees. Separately, however, each matrix row can be subdivided into row segments, each row segment having its own identity. For example, a leaf node called "Pan_sp." could map to a matrix row called "Pan_sp.", but that matrix row can be subdivided into two segments, one that came from the species Pan troglodytes and the other from a different species, Pan paniscus (and the annotation on a row segment extends to GenBank accession numbers, museum specimen IDs, lat/longs, etc.). This "row segment" concept doesn't exist in NEXUS or even in Mesquite-NEXUS annotation (which can only annotate one matrix element at a time, not a string of them).
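As a rough illustration of the row-segment idea (not TreeBASE's actual schema), the sketch below models a matrix row whose column ranges carry their own provenance. The class names, field names, sequence and accession strings are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class RowSegment:
    """A contiguous run of characters in one matrix row, with its own provenance."""
    start: int          # first column, 0-based, inclusive
    end: int            # last column, exclusive
    source_taxon: str   # taxon the characters were actually sampled from
    annotations: dict = field(default_factory=dict)  # e.g. accession, specimen id


@dataclass
class MatrixRow:
    label: str          # must match a tree leaf label
    states: str         # the character states for this row
    segments: list = field(default_factory=list)

    def segment_at(self, column: int):
        """Return the segment covering a column, or None if unannotated."""
        for seg in self.segments:
            if seg.start <= column < seg.end:
                return seg
        return None


# Hypothetical data: the sequence and accession strings are placeholders.
row = MatrixRow(
    label="Pan_sp.",
    states="ACGTACGTAC",
    segments=[
        RowSegment(0, 6, "Pan troglodytes", {"accession": "PLACEHOLDER_1"}),
        RowSegment(6, 10, "Pan paniscus", {"accession": "PLACEHOLDER_2"}),
    ],
)

print(row.segment_at(2).source_taxon)  # Pan troglodytes
print(row.segment_at(7).source_taxon)  # Pan paniscus
```

The point of the design is that the row label stays aligned with the tree leaf, while annotation attaches to a column range rather than to a single matrix cell.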

so... my main contribution to this meeting would be to request that new ontologies/formats/schemas include the following:

  • ability to annotate a publication handle for the origin of the data (e.g., citation & DOI)
  • ability to track origin of characters to individual specimen/tissues (e.g. GBIF LSIDs, ncbi accession numbers...)
  • ability to map different taxonomic identifications to multiple "levels" (e.g. a tree leaf label might be "Pan sp.", which maps to two matrices, one labeled "Pan paniscus" and the other labeled "Pan troglodytes"; and the latter matrix row might be divided into two row segments labeled "Pan troglodytes schweinfurthii" and "Pan troglodytes troglodytes")
  • ability to store generic evolutionary models and parameters, in addition to the parsimony models (weights, transformation matrices, etc.) that NEXUS currently supports

Finally, one thing that TreeBASE's schema does not do, but which would be useful, is the ability to provide intelligence to characters -- i.e. some sort of mapping to a character ontology.

That's about it. So.. with these thoughts my contribution is pretty much dealt with. I look forward to reading any reports that emerge after I return from Russia.

Xuhua Xia's comments

Greetings from the North!

As I am going to miss the meeting, I wonder if you may add the following questions in your agenda.

  1. What will constitute the dominant molecular data in coming years? What biological questions will motivate the collection of these data?
  2. What computational analyses will be performed by biologists on these data in coming years? What biological questions could be addressed directly by analyzing these data?
  3. With #1 and #2 in mind, what would constitute the minimum and sufficient specification of each type of data (to facilitate data exchange among software programs and platforms)?
  4. What would be the best way to coordinate the effort of software development to implement these analyses?
  5. Can we have a consensus (e.g., in the form of a review) on "The ins and outs of evolutionary computational analyses of molecular data"?


Monday, 12 November

Session 1

Joe Felsenstein joined us remotely by videolink for this session.

Opening

Arlin Stoltzfus opens session 1 at 9:00AM. Participants introduce themselves to Joe Felsenstein, who is listening in by videoconferencing. Arlin then recaps the work-to-date leading up to the meeting:

  • Work on a CDAT OO database
  • Bio::NEXUS development
  • Bio::CDAT development
  • Formation of an evolutionary informatics group
  • Organization of first phyloinformatics hackathon, Fall 2006
  • Submission of a working group proposal, resulting in:
    • First wg-evoinfo meeting in Spring 2007
    • The present meeting
  • NESCent-facilitated Google Summer of Code projects
  • NIH proposal for ontology-driven workflow development

Arlin then reiterates the prioritization exercise leading up to the first meeting, presenting the flowchart diagram. The first meeting resulted in work on several topics of interest:

  • Supporting current standards and file formats
  • Development of a substitution model language
  • Development of a new data exchange format
  • A Central Unifying Artefact open to multiple interpretations, e.g. DB schema, ontology, XML schema. Final interpretation of CUA will be driven by:
    • Richness of different technologies
    • Ease of expression
    • Tech support for:
      • Transfer
      • Storage
      • Query
      • Validation

The outcomes of the first meeting were:

  • Development of new format
  • Support for current formats, collection of (pathological) file format examples
  • IDL definition of substitution model language
  • More concrete manifestations of CUA: nexml, ontology, glossary, semantic transformation, NIH proposal
  • Outreach efforts, including to international collaborators

Further development on ontology will require:

  • Definition of domain by means of use cases on the wiki
  • From use cases, identify core concepts and relations and capture these in a wiki glossary
  • Study related artefacts, including ontologies to be integrated (MAO, PhyloXML, NEXUS, etc.)
  • Formally encode concepts and relations in an ontology language
  • Evaluate and revise the encoding.

In more general terms, to accomplish our goals, we should:

  • Complete the glossary
  • See substitution model language development through to a deliverable
  • Create a formal ontology
  • Continue supporting current standards
  • Prepare for events:
    • SSE'08
    • Workshops
    • GSoC
  • Look for funding opportunities for visitors and hackathons

To this end, this week's meeting includes:

  • Presentations on ontologies, formal reasoning and workflows, MAO, nexml, work-to-date on model language
  • Practical work on ontology: use cases, glossary, studying related artefacts
  • Collect use cases for current standards
  • Select technology for substitution model language, document in glossary

Arlin finishes at 9:27AM. slides are here

Presentation 1. SPAN: A System for Phyloinformatics ANalysis using Ontologies and Automated Reasoning

Enrico Pontelli begins at 9:28AM, giving a presentation co-authored with Arlin and Gopal Gupta about SPAN. The work is subsequent to an NIH proposal for an infrastructure for evolutionary informatics intended to enable more widespread use of EI. Current research is hampered by several problems, including a multitude of file formats and applications, the lack of formal validation and quality control and the delayed adoption of computing advancements.

SPAN will remedy some of these issues by the creation of an integrated and open architecture, following principles of interoperation, workflows design and use-case driven development. Technologies used include formal ontology, domain-specific languages and workflow infrastructure such as web services, automated reasoning and Bio* toolkits (specifically, BioPerl).

The Big Picture of this project encompasses workflow design and execution infrastructure and ontology development. The ontology development will proceed following best practices (use cases, glossary compilation, formalization). Ontology will be implemented in OWL. SPAN will meet the following requirements: cover existing file formats, integrate with OBO, provide an API to converse between ontology and legacy files.

Enrico then discusses workflows, defining them as sequences of transformation and conversion. He notes there is a growing body of literature of workflow technologies. He enumerates several design principles:

  • Creation of a workflow design environment, both graphical and textual
  • Templating (i.e. re-use and generation of workflows)
  • Automated configuration
  • Execution of workflow

Enrico concludes his talk with examples of ontology-mediated format interoperation and of PhyLOG, finishing at 9:52AM. Subsequent questions originate from Peter Midford, asking about SPAN's reasoning. Enrico explains this is based on the concept of "agents", i.e. components with inputs and outputs, capabilities and goals. Sergei Pond interjects, expressing his worries about naive WS projects (websites), fearing the constraints placed by simple portals on the web. Enrico replies that SPAN would be more expressive and customizable. Discussion then moves on to outreach and documentation, with Arlin mentioning Mesquite as an example of a project that would benefit from better outreach and documentation of its capabilities. Questions close at 9:59AM. slides are here

Presentation 2. Problem Solving in Evolutionary Analysis with Web Services and DSLs

Gopal begins at 10:00AM with a talk about a new problem solving framework - inspired by bioinformatics - a joint project with Pontelli, Stoltzfus, DeVries and Kona. He introduces his talk with a brief mention of a commercial enterprise (Interoperate LLC) that addresses some of the issues discussed in this talk. He then establishes context by defining software development as the three-step process of i) outlining the problem, ii) coming up with a solution, iii) expressing the solution in formal notation. These steps are challenging, and the goal is to reduce this challenge. In a typical scenario, problem solving is done at a high level by domain experts, and programmers then implement these solutions at a lower level, creating a "semantic gap". Someone has to close this gap, but if programmers do so, translation mistakes may be introduced.

Instead, the gap might be bridged by empowering end-users, by creating Domain Specific Languages (DSLs). The challenge then becomes the design of DSLs and the implementation of the infrastructure for DSLs. An additional pressure in this approach is the need for fast turnaround. The goal thus becomes to rapidly obtain an implementation infrastructure in a provably correct manner, which is what Gopal's group works on. Their semantics-based approach includes semantics specification and making these executable, yielding an implementation infrastructure.

The semantics-based approach is dependent on the combination of web services. But how to find those? Gopal introduces the notion of a service discovery engine that returns the correct service, or composition of services, for a task. This requires a service description with more defined semantics than WSDL, including pre- and post-processing conditions, so that conditional workflows can be generated. Gopal then notes that bioinformatics would be one application for automatic workflow generation. He concludes that, in essence, a "composite service description" (CSD) is a workflow, if it meets the following requirements:

  • CSD needs to be formulated in a DSL
  • DSL should be such that workflow can be understood and modified
  • DSL can be graphical or textual, would be "compiled" into something like BioPerl
  • DSL should cater to different levels of expertise
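To make the idea of composing services from their pre- and post-conditions concrete, here is a minimal sketch. It is not the discovery engine from the talk: the service names, the data-type labels and the use of a plain breadth-first search are all assumptions chosen for illustration.

```python
from collections import deque

# Hypothetical service descriptions: each service declares the data type it
# requires (precondition) and the data type it produces (postcondition).
SERVICES = {
    "nexus_to_fasta": ("nexus", "fasta"),
    "align": ("fasta", "alignment"),
    "infer_tree": ("alignment", "tree"),
    "nexus_to_phylip": ("nexus", "phylip"),
}


def compose(have, want, services=SERVICES):
    """Breadth-first search for the shortest chain of services from `have` to `want`.

    Returns a list of service names, or None if no chain exists.
    """
    queue = deque([(have, [])])
    seen = {have}
    while queue:
        current, path = queue.popleft()
        if current == want:
            return path
        for name, (pre, post) in services.items():
            if pre == current and post not in seen:
                seen.add(post)
                queue.append((post, path + [name]))
    return None


print(compose("nexus", "tree"))
# ['nexus_to_fasta', 'align', 'infer_tree']
```

The returned chain is, in effect, a very small workflow; a richer description language would add conditions beyond a single input and output type.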

Gopal concludes with a service composition demo LINK TO DEMO HERE and closes at 10:30AM. Session adjourns for a quick coffee break. slides are here

Session 2

Arlin opens session 2 at 10:46AM. Joe Felsenstein joined us remotely by videolink for most of this session.

Presentation 3. MAO

Julie Thompson starts by outlining the scope of MAO (DNA, RNA, AA), which is developed by a consortium whose members she goes on to acknowledge. She then presents an expanding graphical display of the ontology network, outlining the areas dealing with nucleotide sequences, residues, annotations and features, and similar areas dealing with amino acid sequences. She notes that the ontology differs from sequence ontologies by descriptions of alignment columns.

The MAO ontology links to OBO, which she goes on to introduce. She outlines the reasons for the development of the ontology, which include promotion of interoperability between alignment programs (MACSIMS) and application in data collection, information management and efficient exploitation of data. She then lists several high-throughput application examples:

  • alignment expert system ALEXSYS
  • MyoNet, interactome, transcriptome analyses extension of MAO
  • MACSIMS, move to higher level of MSA interactions
  • EvolHHuPro, reconstruction of evolutionary histories for human proteome.

She then further describes EvolHHuPro, a system for genome scale analysis used for the classification of histories, functional analysis of clusters, and better understanding of mechanisms. She concludes her talk at 11:05AM. Joe Felsenstein then asks about the translation of features from one MSA to another. Julie replies that many features are not propagated unless homology is confirmed (through a heuristic) and the feature can be translated. Joe interjects that propagation of features implies evolutionary assumptions. David Swofford mentions "clouds" of Bayesian alignments, Joe mentions similarly fuzzy "clouds" of histories. slides are here.

Presentation 4. Nexml

Rutger Vos started at 11:20AM by introducing the work-to-date on nexml. He introduced the general idea, i.e. to design a file format like nexus, but fixing some of the problems with the standard. He noted that XML gives access to data at a higher level, and that the standard would be more extensible and would expose phylogenetic data to various off-the-shelf goodies. He then enumerated the design principles followed in nexml:

  • re-use of prior art
  • follow design patterns
  • referencing
  • verbose vs compact representations

He then continued to outline the design approach, i.e. xml schema design, with lots of community feedback and testing of the format with various parsers and other xml tools. He then discussed the main elements and attributes now defined in the schema, showing code samples in eclipse. Joe Felsenstein asked about rooting of the tree description. Rutger answered that, although rooting is implicit from the in-degree of node elements, this would be better specified using an attribute which may occur on multiple nodes - which would facilitate a suggestion made by David Maddison earlier. Joe Felsenstein also asked if the compact sequence representations were aligned. Rutger replied this wasn't the case, and that the preferred approach would be to specify multiple alternative alignments separate from the raw sequence. Several annotations were discussed, with Sergei noting that gaps usually come in long rows, which might be compacted. Discussion also ensued about the re-use of GraphML, and issues of normalization of links between nodes and otu elements. Rutger ended at 12:02PM. slides are here
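The remark that rooting is implicit from the in-degree of nodes can be sketched as follows. This is a generic graph illustration, not nexml parsing code: the node names and the (source, target) edge representation are assumptions.

```python
def find_roots(nodes, edges):
    """Return the nodes with in-degree zero, given (source, target) edges."""
    in_degree = {node: 0 for node in nodes}
    for _source, target in edges:
        in_degree[target] += 1
    return [node for node in nodes if in_degree[node] == 0]


# Hypothetical three-leaf tree: n1 is the parent of n2 and C; n2 of A and B.
nodes = ["n1", "n2", "A", "B", "C"]
edges = [("n1", "n2"), ("n1", "C"), ("n2", "A"), ("n2", "B")]

print(find_roots(nodes, edges))  # ['n1']
```

An explicit root attribute, as discussed in the session, would remove the need for this inference and would also allow more than one node to be flagged.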

Presentation 5. Substitution Model Languages

Peter Midford opens at 12:02PM. He starts by outlining a continuum of languages for models, leading from general-purpose languages such as C or Java, through modeling languages and AML, to representation-focused languages such as XRate and, finally, controlled vocabularies. He then identifies several stakeholders, and notes some use cases (phylogenetic inference, identification of sites). He then gives an overview of legacy modeling languages: GAMS, MPL, MML, AMPL, AIMMS. He gives code examples of MPL, which looks somewhat like Pascal, of MML, which looks like Java, and of XRate, which looks Lisp-like. He then expands on XRate, an implementation of phylogrammars that handles feature prediction. However, there is some difficulty expressing models: his example representation of the K2P model looks verbose. Discussion then ensues about IUPAC ambiguity codes.
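For readers unfamiliar with K2P (Kimura's two-parameter model), the sketch below builds its rate matrix in a general-purpose language, the opposite end of the continuum from XRate. The function name and the nested-dict representation are assumptions; only the model itself (one rate for transitions, one for transversions) is standard.

```python
def k2p_rate_matrix(alpha, beta):
    """Build the K2P instantaneous rate matrix as a nested dict.

    alpha: transition rate (A<->G, C<->T); beta: transversion rate.
    """
    purines = {"A", "G"}
    states = "ACGT"
    q = {}
    for i in states:
        q[i] = {}
        for j in states:
            if i == j:
                continue
            # Purine<->purine or pyrimidine<->pyrimidine is a transition.
            same_class = (i in purines) == (j in purines)
            q[i][j] = alpha if same_class else beta
        # Each row of a rate matrix sums to zero.
        q[i][i] = -sum(q[i].values())
    return q


q = k2p_rate_matrix(alpha=2.0, beta=1.0)
print(q["A"]["G"])  # 2.0
print(q["A"]["C"])  # 1.0
print(q["A"]["A"])  # -4.0
```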

Peter continues with a discussion of AML, an XML schema for modeling languages. He notes that XSLT style sheets exist for GAMS, AMPL and MPL. AML avoids the use of MathML, for poorly defined reasons. All that is available is a pdf. He shows an example of AML language elements. He concludes with a mention of work-to-date on a simple OWL for models. He notes the OWL class hierarchy, other relations to define common concepts: parameters, constants and constraints and concludes by mentioning the Lawrence, KS, work on porting portions of the OWL to the nexml schema. He closes at 12:19PM. slides are here

Presentation 6. Evolutionary Model Description: Towards a description language for substitution models

Sergei starts at 12:19PM by coining an acronym for the evolutionary model description language EMDL. He notes that the main requirement to meet is the development of a complete, self-contained description of model and parameters. He notes the following components of evolutionary models:

  • Alphabet, giving examples of ambiguity mappings and encodings
  • Model parameter description
  • Character substitution process
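As an example of the ambiguity mappings an alphabet description would need to capture, the sketch below encodes the standard IUPAC nucleotide codes. The dictionary layout and the helper names are assumptions for illustration; gap and missing-data symbols are deliberately left out, since their treatment varies between programs.

```python
# Standard IUPAC nucleotide codes, mapped to the bases each one may stand for.
IUPAC_NUCLEOTIDE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def resolve(symbol):
    """Return the set of unambiguous bases a symbol may represent."""
    return set(IUPAC_NUCLEOTIDE[symbol.upper()])


def compatible(a, b):
    """Two symbols are compatible if they share at least one base."""
    return bool(resolve(a) & resolve(b))


print(sorted(resolve("R")))   # ['A', 'G']
print(compatible("R", "Y"))   # False
print(compatible("N", "a"))   # True
```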

What needs to be formalized to this end? Sergei mentions state spaces, standard mapping functions, exception handling, etc. He continues with model parameter description, noting several classes of parameters: time, rate, frequency and "other" and that these can be fixed, estimated or constrained and estimated.

Further formalization includes identifiers for parameters, parameter namespaces and references; parameter scope such as branches, trees, partitions or global; estimation policies such as fixed, estimated, stochastic/integrated; parameter constraints, e.g. frequencies sum to one.

For the substitution process, he notes that EMDL should define transition matrices and describe process properties such as time reversibility. This requires formalization of compact definition of rate/transition matrix on alphabet.
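Two of the properties mentioned above, the constraint that frequencies sum to one and time reversibility, can be checked mechanically. The sketch below is an illustration only: the function names and the nested-dict matrix layout are assumptions, and reversibility is tested through the usual detailed-balance condition, pi_i * q_ij = pi_j * q_ji.

```python
import math


def frequencies_valid(freqs, tol=1e-9):
    """Check the constraint that frequencies are non-negative and sum to one."""
    total = sum(freqs.values())
    return all(f >= 0 for f in freqs.values()) and math.isclose(total, 1.0, abs_tol=tol)


def is_time_reversible(q, freqs, tol=1e-9):
    """Detailed balance: pi_i * q_ij == pi_j * q_ji for all i != j.

    Only off-diagonal entries of q are inspected.
    """
    for i in freqs:
        for j in freqs:
            if i != j and not math.isclose(
                freqs[i] * q[i][j], freqs[j] * q[j][i], abs_tol=tol
            ):
                return False
    return True


freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}

# Rates proportional to the target state's frequency: reversible by construction.
q_rev = {i: {j: freqs[j] for j in freqs if j != i} for i in freqs}

# A one-way cycle A -> C -> G -> T -> A: not reversible.
cycle = {"A": "C", "C": "G", "G": "T", "T": "A"}
q_cyc = {i: {j: (1.0 if cycle[i] == j else 0.0) for j in freqs if j != i} for i in freqs}

print(frequencies_valid(freqs))            # True
print(is_time_reversible(q_rev, freqs))    # True
print(is_time_reversible(q_cyc, freqs))    # False
```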

Subsequently, models need to be combined with analyses, e.g. assignment to trees, partitions, define constraints, describe estimation procedures. He concludes by enumerating several test cases:

  • 4x4 nucleotide, e.g. GTR+G+I
  • site codon models
  • branch codon models
  • stem-RNA models
  • mixture amino acid models
  • covarion codon models
  • codon mixture models
  • hidden Markov models

Sergei finishes his talk at 12:40PM. Peter Midford then mentions other models, such as birth/death models. Sergei responds that they are also important, but there is such diversity in representation that these may be skipped for now, following the 80:20 rule. Arlin comments we need a killer app for EMDL. Swofford comments that a general enough description is hard. The session adjourns for lunch at 12:45PM. slides are here

Session 3

Presentation 7. A Simple NEXUS-Phylip Translator

Brian DeVries opens at 1:56PM, introducing his work on a nexus to phylip translation tool. His work-to-date supports a subset of nexus, handling sequential and interleaved data. He notes there is an online mini-demo. He took an inventory of currently existing programs, primarily examining ReadSeq and SeqVerter. His subsequent approach was to focus on nexus and phylip, employing logic programming. He outlines future work, which includes:

  • increase support
  • handle more complex data
  • handle more formats
  • create a web interface
  • build a components-based architecture

He then mentions some ideas, such as the use of formal theorem provers and support for logic programming in heterogeneous environments (e.g. integration with Bio* toolkits).

Several questions followed. Rutger Vos asked about the extent to which the translator can validate. Brian replied that the BNF approach currently allows only for pass/fail. Brian was asked about the number of test files used so far, which was two. Some discussion ensued about the internal representation of the parsed data, which is in the form of ASTs. A concluding remark was that the translator might be used as a validator for TreeBaseII. Brian closed at 2:11PM. slides are here
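To give a feel for the translation task, the sketch below converts a very simple NEXUS matrix to sequential PHYLIP. It is not the tool presented in the talk: it uses plain Python string handling rather than logic programming or a BNF grammar, and it assumes a non-interleaved matrix with no comments, no quoted names and one taxon per line. The taxon names and sequences are made-up sample data.

```python
import re


def nexus_to_phylip(nexus_text):
    """Convert a simple, non-interleaved NEXUS MATRIX to sequential PHYLIP."""
    match = re.search(r"matrix\s+(.*?);", nexus_text, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no MATRIX command found")
    rows = []
    for line in match.group(1).splitlines():
        parts = line.split()
        if len(parts) >= 2:
            rows.append((parts[0], "".join(parts[1:])))
    if not rows:
        raise ValueError("MATRIX command is empty")
    lengths = {len(seq) for _, seq in rows}
    if len(lengths) != 1:
        raise ValueError("rows differ in length")
    header = f"{len(rows)} {lengths.pop()}"
    # Strict PHYLIP pads or truncates names to exactly 10 characters.
    body = [f"{name[:10]:<10}{seq}" for name, seq in rows]
    return "\n".join([header] + body) + "\n"


example = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=3 NCHAR=8;
  FORMAT DATATYPE=DNA MISSING=? GAP=-;
  MATRIX
    Homo    ACGTACGT
    Pan     ACGTACGA
    Gorilla ACGTTCGA
  ;
END;
"""

print(nexus_to_phylip(example))
```

Note that truncating names to 10 characters can make distinct taxa collide, which is one reason a real translator needs validation beyond pass/fail.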

Breakout sessions

Leaders reiterated the list of six suggested task groups. These were written on the board, and participants were asked to put their initials by the topic that was most important to them. Most selected two or three topics. On this basis, four topics were the most popular:

  1. collection of use cases
  2. study of related artefacts
  3. substitution model language
  4. XML

David thought that the latter two groups overlapped and should stay together. Arlin thought that this group might be too big to be effective. The meeting then broke out in three groups:

  • use case documentation (Weigang, Brian)
  • related artefacts analysis (Enrico, Julie, Arlin)
  • XML and substitution model language (David, Rutger, Gopal, Peter, Hilmar)

At 4:40PM the meeting re-convened for brief reporting on the breakouts. For the related artefacts, a complete list was compiled elsewhere on the wiki. Gopal reported on the EMDL, whose requirements were explained to him by others in the group. He outlined several steps: creation of sets of sites, sets of edges and sets of processes, and naming these. Then, from these, create triplets of collections, specify substitution processes and declare a relative substitution matrix. At this point, the sketch on the whiteboard was messy enough to move on to the next group. Weigang reported on the collection of use cases. He found several potentially useful examples in a paper, though for the first one the data turned out not to be available. He went on to suggest a second example from the same paper, for which data was available, as well as examples from his own research. Discussion then ensued about refactoring of use cases, i.e. whether steps shared by multiple use cases should be normalized into smaller mini use cases documented in one location. Pros and cons were discussed fiercely. For the XML, Rutger explained the difference between DOM and SAX APIs to Dave and Sergei and updated the wiki page to link to terms in the glossary. The meeting adjourned at 5:40PM.

Tuesday, 13 November

The second day of Meeting 2 was devoted entirely to breakout sessions. The group reconvened at 2:40PM, when representatives of the different breakout groups reported on work done that day. Arlin summarized work on related artefacts. He showed the related artefacts wiki page, going through the results table. Many of the projects are actually hypothetical, but some were detailed further by Julie Thompson and Enrico Pontelli.

Weigang took over at 3:05PM, reporting on use cases. Weigang showed new sample data sets:

  • two homeobox data sets
  • kinase data
  • the "whippo" data, with hemoglobin, cytochrome, and other mammalian data sets
  • alignments for Lyme disease sequences
  • broadly taxon-sampled intron data sets

He then proceeded to follow some of the links, showing some of the data sets. Discussion then ensued whether the sample data sets shouldn't be less "finished", e.g. unaligned, so that whole use case workflows could actually be tested from start to finish. Suggestions also came in for possible other data sets, e.g. some of Sergei's data.

Peter then reported on the model language discussion, starting at 3:24PM. Work on the model language at this meeting consisted primarily of discussion; however, the model language wiki page has progressed with contributions by Hilmar and some reorganization and cleanups by Rutger.

Wednesday, 14 November

Wednesday morning was devoted to break-out sessions.

Derrick Zwickl (NESCent) joined the substitution model discussion.

The participants reconvened at 1:40PM for final stand-ups. The focus was on tangible outcomes and plans to generate them.

Next-generation file format standard

The short term plans for the nexml format are as follows:

  • Several software packages by evoinfo members and associates will implement / extend nexml support. Specifically, nexml support will be improved in Mesquite, Bio::Phylo, and phycas (or related python code) and HyPhy will start experimenting with the format.
  • The generic class library support (java) will be improved. Round-trip read/write support will be implemented.
  • Code (XML schema, Perl code and generic Java code) will move to a public repository on SourceForge.
  • A web presence under www.nexml.org will be supported by NESCent.
  • The standard itself will continue to be expanded upon following the to do list on the nexml wiki page.

Use cases

Transition model language

Propose meeting at NESCent to implement initial version of model description language (Hilmar, David, Mark, Derrick)

Sergei to document a collection of models that cover a wide range of models used in sequence analysis. David and Sergei to flesh out an initial design in nexus syntax, which could then be ported to other notations such as nexml.

Ontology

The ontology development stages (use cases, concept glossary, related artefacts, implementation, evaluation) should be undertaken iteratively. We feel that we now are ready to undertake the first (preliminary, rough) iteration of the implementation step. We will assemble a token set of examples and use them to illustrate core concepts and relations of CDAO, as well as relations to other ontologies.

Token set of examples

  • small set of protein-coding genes with phylogeny
    • codon alignment
    • aa alignment
    • intron presence-and-absence values
    • GO terms
    • some kind of continuous activity value k
  • small set of species with phylogeny
    • discrete morpho data
    • continuous morpho data
    • links to organismal taxonomy

Ontologies that we want to link with

  • GO
  • SO
  • NCBI taxonomy

Questions. How will we

  1. integrate other ontologies?
  2. handle sequences?
  3. define macromolecules biochemically (without reinventing the wheel)?
  4. respect developmental relationships (e.g., amino acids to codons)?
  5. represent ancestors distinct from OTUs?
  6. represent and integrate transformations?
    1. for enumerated states, default is any state to any other
    2. characters present in only one OTU can be treated as missing states in other OTUs
    3. common transformations (replace, insert, delete, duplicate, recombine, invert, translocate)
    4. note that some of these are not unary (e.g., recombine)