Meeting 3 notes

From Evolutionary Informatics Working Group

Meeting Notes, Third Meeting (19-23 May, 2008)

Day 4, Thursday

Discussion of goals and strategy. Three topics:

  1. ongoing projects (nexml, cdao, glossary, etc)
  2. outreach
  3. performance evaluation

1. Creating carrots: CDAO-nexml data interop project

This is just the notes page, a historical record. If you were referred here for current project ideas, please go to the new CarrotBase page.

To take advantage of Francisco's presence, we began with a discussion of CDAO, but this quickly led into discussion of a joint project to tackle data resources as implementation targets for CDAO-based nexml support.

Data resource interop project

  • rationale
    • need to develop critical mass for nexml-cdao (critical mass of community involvement, use-cases, eyes on code)
    • hard to get critical mass by targeting app developers
    • hard to show value by targeting app developers
    • strategy of targeting data resources puts data in hands of users
    • however there also are carrots for db maintainers:
      • interchange with other sources
      • common treatment of keys such as taxonomy
  • data resources to target (see Data Resources page)
    • TreeBaseII has the kind of granular schema that would be a good challenge to try to accommodate using cdao.
    • pPOD (not really a data resource; it's a db tech project led by computer scientists)
    • TreeFam
    • Pandit
    • Hovergen, hogenom
    • organism-centered gene-family databases, e.g., for plasmodium
    • others?
  • implementation strategy
    1. recruit participants
      • possible leadership from Hilmar, other suggestions (Rod Page, Bill Piel, Encyc of Life, TOL)
      • other participants drawn from projects above
    2. develop project plan
      • consider implementing a coordinated db with families from many sources
      • CDAO to provide a constraining vocabulary for the nexml schema
        • disagreement over whether this is a good idea
        • to succeed, we must provide users a way to expand nexml, thus to expand cdao
        • mechanism is the feature request, e.g., I want to represent Bremer values
          1. user asks how to represent a new type of data
          2. if a formal mechanism is desired but does not exist
          3. user is referred to cdao feature request (must provide searchable interface and front page link to feature request form)
          4. cdao developers are obliged to respond
    3. get support
      • NSF grant
      • NESCent-sponsored hackathon

2. Outreach

Manuscripts for nexml and cdao

  • ideal to make coordinated submission
  • ideal to coordinate content and discuss mapping between artefacts
  • discussion of venue
    • cdao group made choice to target bmc evol biol (open access, pubmed-indexed) instead of others
    • some people favor evolutionary bioinformatics (not pubmed-indexed)


  • (belongs to Arlin)
    • will map to NESCent server
    • will be home to CDAO project pages (currently under construction)
  • also maps to NESCent
  • should set cdao sourceforge home page to point to


  • blog about evolutionary informatics
  • distribute posting among the group (a rate of 2 posts per week, shared among 10 contributors, works out to about 1 post per person per month)
  • include some other content
  • feed this blog to other sites maintained by group members
  • Hilmar has agreed to set up a trial site


  • each month, when applicable, send an announcement to evoldir about group activities

3. Performance evaluation

Our mandate is to promote interoperability. How do we know if this is working? How can we measure performance?

If we can measure performance, this will prove to be very useful in assessing progress and in justifying continued work (good grantsmanship too).

Our strategy for promoting interop is to develop artefacts that facilitate interop, so the evaluation strategy is based on the use of these artefacts, mainly cdao and nexml.

Indirect indicators, i.e., indicators of activity or interest:

  • pubs with 3 or more group authors
  • grants with 3 or more group PIs and collaborators
  • invited talks on group projects
  • workshop attendance
  • citation of project pubs
  • web site hits to cdao or nexml home sites

Direct indicators, i.e., indicators of actual use

  • number (or fraction) of nexml implementations (import or export) in
    • relevant data resources
    • relevant applications
    • relevant bioinfo libraries (bioperl, jebl, phylobase, ...)
    • languages
    • services
  • can weight above metric by importance (heavier weight for more widely used apps)
  • number (or fraction) of nexml instances
    • in submissions to TreeBASE, Dryad
    • as data source in service calls to monitored services
      • this requires monitoring the service
      • this includes the validation service (i.e., count all validation instances)
      • this includes translation services TO nexml (and FROM, but that's not as interesting)
  • number of downloads
    • of cdao from OBO or sourceforge
    • of nexml APIs
  • number of cdao-mediated database transactions
  • number of cdao-mediated translation service calls
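
As a rough illustration of the weighted implementation metric above, the following Python sketch computes a plain and an importance-weighted fraction of targets that support nexml. The target names and weights are hypothetical placeholders, not measured data, and the weighting scheme is only one possible choice.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """One application, library, or data resource being tracked."""
    name: str             # hypothetical label
    weight: float         # relative importance, e.g. an estimate of usage
    supports_nexml: bool  # True if it can import or export nexml


def implementation_fraction(targets, weighted=True):
    """Fraction of targets that import or export nexml.

    With weighted=True each target counts in proportion to its weight,
    so widely used tools move the score more than rarely used ones.
    """
    total = sum(t.weight if weighted else 1.0 for t in targets)
    if total == 0:
        return 0.0
    done = sum((t.weight if weighted else 1.0)
               for t in targets if t.supports_nexml)
    return done / total


# Placeholder inventory: names and weights are made up for illustration.
targets = [
    Target("app_a", weight=5.0, supports_nexml=True),
    Target("app_b", weight=3.0, supports_nexml=False),
    Target("app_c", weight=2.0, supports_nexml=False),
]

plain = implementation_fraction(targets, weighted=False)
heavy = implementation_fraction(targets, weighted=True)
print(f"unweighted: {plain:.2f}, weighted: {heavy:.2f}")
```

In this made-up inventory one target in three supports nexml, but because it carries half of the total weight the weighted score is higher than the plain count.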

Day 3, Wednesday

Lecture and discussion: Representing non-molecular data

This session took place at 10:40.

  • Phenoscape Project (Jim Balhoff, NESCent)
  • Using ontologies to formalize comparative data on worm development (Arlin Stoltzfus)

nexml promotion strategy (post-standups discussion)

The idea of a NESCent nexml support hackathon

  • test data
  • need to have conformance levels defined in advance, with tests
  • applications developers: Ronquist, Swofford, Beerli, Kuhner, Felsenstein, Zwickl, Kosakovsky Pond, Rod Page, Sanderson, Eulenstein, Burleigh, Zmasek, Stamatakis, Goloboff, Farris, Rambaut, Drummond, Holder, Maddison, Maddison,
  • library developers: Mackey, Paradis, Bolker, Thierer, Lewis,
  • data resource managers: need list for treefam, hovergen,
  • need for carrots:
    • data resources: phenote-entered data on morphology, available in nexml
    • capabilities:
    • services:
      • find species names from ids (accessions) in input file, output as annotations
      • visualization ?

nexml implementation group (standup)

Aaron's R support for nexml in phylobase package

  • implementation, round-trip tests
  • problem with internal ids getting changed every time
    • is due to only one slot for r-assigned-id-or-user-assigned-label
    • need to alter object model to add another slot
  • code will be available in phylobase on r-forge
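
The id problem can be illustrated with a small Python sketch. This is not the phylobase object model; the class, field, and function names are hypothetical. When one slot has to hold either an internally assigned id or a user-assigned label, the id read from the file has nowhere to live, so it changes on every round trip. A second slot keeps the two apart.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_next_id = count(1)  # stand-in for ids assigned internally on import


@dataclass
class OneSlotNode:
    """Single slot shared by the internal id and the user label."""
    name: str


@dataclass
class TwoSlotNode:
    """Separate slots: the file's id survives, the label stays optional."""
    id: str
    label: Optional[str] = None


def import_one_slot(file_id, label=None):
    # With one slot, an unlabeled node gets a fresh internal id,
    # so the id written on export differs from the one that was read.
    return OneSlotNode(name=label if label is not None else f"n{next(_next_id)}")


def import_two_slot(file_id, label=None):
    return TwoSlotNode(id=file_id, label=label)


def export(node):
    """Return the (id, label) pair that would be written back out."""
    if isinstance(node, TwoSlotNode):
        return node.id, node.label
    return node.name, node.name


original = ("node42", None)
print(export(import_one_slot(*original)))  # id is not preserved
print(export(import_two_slot(*original)))  # id and label round-trip
```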

nexml-CDAO group (standup)

  • Arlin
    • what kind of characters are aligned sequence residue characters?
    • what are the implications of one choice or another
      • chemical: nucleotidyl group
      • informational (semantides): nucleobases
      • development or synthesis: nucleotide precursors
    • note to arlin: add rutger to sourceforge list of cdao
  • Enrico
    • mapping ontology from nexml to CDAO
  • Brandon
    • bindings to CDAO from C++ library (based on xerces)
  • Rutger
    • Phenote with Jim Balhoff
  • Francisco
    • annotations for phylogenetic methods
    • possible target: MIAPA compliance via CDAO
    • future issues for CDAO
      • joining data sets into a composite data set, e.g., as in TOL project
      • counting queries on rooted tree topology
      • find correlation between character type and evolutionary pattern

transition model language group (standup)


  • working on stuff from Peter Midford and Jeet
  • inspired by BEAST implementation
  • nexml part with model substitution language
  • Hilmar added to nexml developer list

Day 2, Tuesday

Lecture and discussion: Data standards and repositories

This session took place at 10:40.

  • Jim Leebens-Mack, U. Georgia (MIAPA) (20 min)
  • Repositories, data standards, and data reuse (Ryan Scherle, NESCent)

nexml-CDAO coordination group

  1. use appinfo field in nexml schema to specify CDAO classes
  2. coordinate names
  3. Aaron suggests defining mappings by creating a third, mediating, ontology.
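
A minimal sketch of the first idea, assuming the mapping is stored as plain text inside the standard XML Schema xs:annotation/xs:appinfo element of each type. The schema fragment, type names, and ontology term below are placeholders, not taken from the nexml schema or from CDAO.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment: the type names and the ontology reference
# inside <xs:appinfo> are placeholders for illustration only.
SCHEMA = f"""
<xs:schema xmlns:xs="{XS}">
  <xs:complexType name="ExampleTree">
    <xs:annotation>
      <xs:appinfo>cdao:ExampleTreeClass</xs:appinfo>
    </xs:annotation>
  </xs:complexType>
  <xs:complexType name="ExampleNode"/>
</xs:schema>
"""


def appinfo_map(schema_text):
    """Map each named complexType to the text of its appinfo, if any."""
    root = ET.fromstring(schema_text)
    mapping = {}
    for ctype in root.iter(f"{{{XS}}}complexType"):
        info = ctype.find(f"{{{XS}}}annotation/{{{XS}}}appinfo")
        if info is not None and info.text:
            mapping[ctype.get("name")] = info.text.strip()
    return mapping


mapping = appinfo_map(SCHEMA)
print(mapping)
```

Types without an appinfo annotation are simply left out of the mapping, which makes it easy to list the schema types that still lack a corresponding ontology class.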

nexml implementation group

Aaron, Weigang, Hilmar

  1. connect nexml to alignment and tree objects thinly via Bio::Phylo (Weigang)
  • Task 1. Use the standard BioPerl interface to read and write nexml trees.

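A new module was written: Bio::TreeIO::nexml, so a nexml tree file can be read as:

<perl>
use Bio::TreeIO;   # the nexml format module requires Bio::Phylo to be installed

my $tree_in = Bio::TreeIO->new(-file => 'trees.xml', -format => 'nexml');
while (my $tree = $tree_in->next_tree) {
    # $tree is a Bio::Phylo::Forest::Tree object
    print $tree->calc_tree_length, "\n";
}
</perl>

A Newick tree file can be read and passed to a nexml writer as:

<perl>
use Bio::TreeIO;

my $tree_in  = Bio::TreeIO->new(-file => 'longnames.dnd', -format => 'newick');
# create the writer once, outside the loop; with no -file it prints to STDOUT
my $tree_out = Bio::TreeIO->new(-format => 'nexml');
while (my $tree = $tree_in->next_tree) {
    # assumed completion: the original snippet stops before the write call
    $tree_out->write_tree($tree);
}
</perl>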

  • Note: 1. Dependency: requires Bio::Phylo; 2. Return: a Bio::Phylo::Forest::Tree object (not a Bio::Tree::TreeI)
  • Results:

    • 2 new files: bioperl-live/Bio/TreeIO/; bioperl-live/t/data/trees.xml
    • 3 new tests in bioperl-live/t/TreeIO.t

  • Task 2. Use the standard BioPerl interface to write nexml character matrices.

Problem: Bio::Phylo reading of "characters.xml" generates exceptions

  2. nexml parsing and writing to R PhyloBase (Aaron)

nexml transition model language group

substitution model language

aspects of the problem

  • need for nexml to implement instructions for processing
  • subst model is key part of processing instructions
  • current agreed names for certain models (HKY, F81, ...)
  • prior work
    • wiki resources
    • BEAST xml
  • need for ways to circumscribe problem
    • limit to support for specific packages (MEGA, DAMBE, PHYLIP, ...)
    • focus on most-used model concepts, based on TreeBASE submissions
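
To make the idea of agreed model names concrete, here is a small Python sketch that expands a declarative model description into a rate matrix. It covers only HKY (HKY85) and F81, which is HKY with kappa = 1. The dictionary keys used for the model description are hypothetical and are not part of nexml or BEAST xml.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate_matrix(freqs, kappa):
    """Unnormalized HKY85 rate matrix as a dict of dicts.

    Off-diagonal rate i->j is kappa*pi_j for transitions and pi_j for
    transversions; each diagonal entry makes its row sum to zero.
    """
    q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                scale = kappa if (i, j) in TRANSITIONS else 1.0
                q[i][j] = scale * freqs[j]
        q[i][i] = -sum(q[i].values())
    return q


def f81_rate_matrix(freqs):
    """F81 is the special case of HKY85 with kappa = 1."""
    return hky_rate_matrix(freqs, kappa=1.0)


# A declarative description of the kind a model language might carry;
# the key names and parameter values here are made up.
model = {"name": "HKY", "kappa": 2.0,
         "freqs": {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}}
q = hky_rate_matrix(model["freqs"], model["kappa"])
print(q["A"])
```

A model language limited to the most-used concepts would only need to carry the model name and a handful of parameters like these; the receiving program can rebuild the matrix itself.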

Day 1, Monday

The meeting opens with a round of introductions by all participants, opening remarks by Jeff Sturkey, and administrative housekeeping. Arlin then introduces the meeting with an overview of work to date and goals (see File:Stoltzfus intro.ppt). This leads into a discussion about (lack of) progress in the Transition_Model_Language subproject.

Current state of the NeXML project

Presentation by Rutger Vos (see File:Nexml nescent 19 5 08.ppt)

  • problems with NEXUS
  • advantages of xml
    • streaming parser (sax) and DOM parser
  • triangle of semantics (CDAO), syntax (nexml) and "transport" (phyloWS)
  • design principles of nexml
    • re-use
      • property lists
      • graphml concepts (not imported directly, just inspirational)
    • streaming-friendly
      • declare-before-use
      • meta-data first
      • venetian blinds
      • avoid deep hierarchy for trees (i.e., graphml representation with nodes and edges, not nested clades as in phyloxml)
  • implementation
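
The contrast between nested clades and a flat node-and-edge representation can be sketched in Python as follows. The nested input shape, a (label, children) tuple, and the generated ids are assumptions made for this sketch; they are not the nexml, graphml, or phyloxml formats.

```python
from itertools import count


def flatten(nested):
    """Turn a nested-clade tree into flat node and edge lists.

    `nested` is a (label, [children]) tuple; this shape is an
    assumption for the sketch, not a format from nexml or phyloxml.
    """
    ids = count(1)
    nodes, edges = [], []
    stack = [(nested, None)]  # (subtree, parent id)
    while stack:
        (label, children), parent = stack.pop()
        node_id = f"n{next(ids)}"
        nodes.append({"id": node_id, "label": label})
        if parent is not None:
            edges.append({"source": parent, "target": node_id})
        # push children reversed so they are visited left to right
        for child in reversed(children):
            stack.append((child, node_id))
    return nodes, edges


tree = ("root", [("A", []), ("inner", [("B", []), ("C", [])])])
nodes, edges = flatten(tree)
for node in nodes:
    print(node)
for edge in edges:
    print(edge)
```

Writing all nodes first and all edges second also satisfies declare-before-use: every edge refers only to ids that have already appeared, so a streaming reader never has to look ahead.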

Implementation of NEXML in DAMBE

A presentation by Xuhua Xia

  • visual basic xml parser
  • integrated into DAMBE interface
  • input and output of nexml is smooth and transparent for the user

The phyloWS Project

A presentation by Hilmar Lapp (see PhyloWS wiki pages).

Inspiration from CIPRES services examples. See Hilmar's blog for examples of access using curl.

scopes --> use cases --> requirements --> spec --> implementations

  • operations scope
  • data types scope
    1. character data
    2. trees
    3. analyses
    4. conversion

Hilmar and Rutger focused on trees as the most well circumscribed data type. See the PhyloWS wiki pages for details on use cases and requirements.


  • Simple Object Access Protocol (e.g., corba over http) was never simple
  • REST takes a different worldview, calls via url
  • soap is operation-oriented, REST is object-centric, stateless
  • REST is easier for users; for developers, depends on language
    • Java5 has containers that make it easy to generate SOAP or REST interfaces
    • BioPerl has SoapLite, not fully up to date
    • support is limited in Python, Ruby,

Project status, completion

  1. scopes 90 %
  2. use cases:
    • 79 % of phylo trees and char data
    • not much of the others
  3. requirements
    • 70 % (for subset of use cases above)
  4. specification
    • 90 % for phylo trees
    • 0 % for others
  5. implementations: none except for conversions example

First Draft of the Comparative Data Analysis Ontology

A presentation by Enrico Pontelli (see File:Enrico cdao.ppt).


  • interoperation
  • reasoning

Development process

  1. specification
  2. choice of representation
  3. repeat cycle of
    1. conceptualization
    2. implementation
    3. evaluation

core concepts

  • TUs
  • trees
    • discussion of rooted and unrooted
    • representation in terms of nodes and edges
    • parent vs. ancestor (transitive closure of parent)
    • edges, lineages, nca (nearest common ancestor, should be "mrca")
  • character data matrix
    • character, TU, datum, state
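
The distinction between parent and ancestor can be made concrete with a short Python sketch. It is an illustration of the concepts only, not of how CDAO or an OWL reasoner represents them. The tree is a hypothetical child-to-parent map, and a node is treated as a candidate for its own most recent common ancestor, which is one common convention.

```python
def ancestors(parent, node):
    """Ancestors of `node`, nearest first: the transitive closure of parent."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def mrca(parent, a, b):
    """Most recent common ancestor of a and b in a rooted tree (or None)."""
    seen = {a, *ancestors(parent, a)}
    for candidate in [b, *ancestors(parent, b)]:
        if candidate in seen:
            return candidate
    return None


# child -> parent map for a small rooted tree (labels are arbitrary)
parent = {"A": "x", "B": "x", "x": "root", "C": "root"}

print(ancestors(parent, "A"))
print(mrca(parent, "A", "B"))
print(mrca(parent, "A", "C"))
```

Both functions rely on the tree being rooted: the parent relation has a direction only once a root is chosen, which is one reason the rooted versus unrooted discussion matters for these concepts.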

implementation details

  • OWL 1.1
  • protege 4 alpha
  • translators
  • reasoning

preliminary evaluation

  • example translated from NEXUS

trajectory and strategy


  1. substitution model (Xuhua) (see Transition_Model_Language)
  2. MIAPA compliance (Jim)
  3. implementation targets
    1. services
      • translation
      • visualization and editing: Nexplorer
    2. applications
      • done: DAMBE
      • in progress: Mesquite
      • developer commitment: PAUP*, HyPhy, GARLI,
      • other targets: Geneious, Phenote (Java app to create phenotype annotations) (Jim)
    3. libraries
      • done: Bio::Phylo
      • in progress: PyNexml (Kansas), R and PhyloBase (Aaron, Hilmar)
      • developer commitment:
      • other targets: Bio::NEXUS, BioPerl (Weigang, Hilmar, Aaron)
    4. data resources
      • done: none
      • in progress: none
      • developer commitment: ?
      • other targets: TreeBASE, PANDIT, TreeFam, etc
  4. Implementation conformance policy
    • definition of levels of conformance
    • evaluation of conformance levels
  5. Outreach
  6. Support


  1. coordination with nexml (Enrico, Arlin, Rutger)
    • similar labels
    • annotate element types with corresponding ontology terms
  2. evaluation (Brandon, Arlin)
    • concrete cases for evaluation
    • minimum standards for representation
  3. how to reason with external ontology about ( need use case ) (Arlin, Aaron)