Database Interop Hackathon/Teleconferences

From Evolutionary Informatics Working Group
Revision as of 13:18, 29 June 2009 by (talk) (Next Teleconference June 28, 2009)

Teleconference June 29, 2009

The teleconference is planned for 1:00 pm EDT. Connection info will be distributed on the day of the event.




Karla, Enrico, Ryan, Sheldon, Todd, Karen, Greg, Arlin, Mark

  • NSF interop grant
 blah blah blah
  • working group proposal
Tabled till next telecon.
  • followup hackathon
Tabled till next telecon.

Teleconference June 23, 2009

The teleconference is planned for 1:00 pm EDT.

The purpose of this meeting is to plan follow-ups to the successful hackathon in March.


tentative draft of agenda

  1. Report from the hackathon
    • what worked
    • what didn't work
    • subsequent work on the projects started at the hackathon
    • your comments
  2. Publicity - how to publicize our work (manuscripts, posters, others)
  3. discussion of the big picture. What's the most that this group can do for interoperability? (Interoperability means that operations (get, process, store, query, etc.) can be controlled, combined, integrated, and automated in a hands-off way, without expert intervention, i.e., without doing things like manually editing files or manually operating a software interface.)
    • what is the most we can do for phylogenetics-systematics-diversity studies in the next 2 to 5 years?
    • what is the most we can do for comparative-genomics and molecular-evolution studies in the next 2 to 5 years?
  4. Possible followups to consider:
    1. another hackathon
    2. another working group (NESCent deadlines July 10 and December 1)
    3. an NSF interop proposal (deadline July 23). Note the following from the solicitation:
       Responsibilities of the Networks: Each Network holds dual responsibilities for: (1) enabling broad community engagement in the development of consensus and agreement on strategies, priorities, and best approaches for achieving broad interoperability; and (2) providing the technical expertise necessary to turn consensus and agreement into robust interoperability frameworks along with the appropriate tools and resources for their broad use and implementation. Proposals for activities not based on significant community engagement and consensus-building activities are not responsive to this solicitation and will be returned without review.


  1. participants: Sam, Sheldon, Karen, Rutger, Jim, Karla, Greg, Mark, Ryan, Enrico, Todd, Arlin, Hilmar, Peter Midford
  2. Participants' feedback and evaluation of the hackathon: (A) hackathon, (B) followup activities, (C) incomplete or future plans
    • Sam: (A) useful java API for NeXML; meeting/working with people; (C) rough plans to incorporate api into pPOD
    • Sheldon: (A) pushing NeXML as standard; improvements in visualization and components integration; web services; (B) code clean-up; presentation to iPlant and NSF prog officers; starting phylowidget package w/Greg; (C) roll phylowidget/vis tools into TreeBASE frontend
    • Karen: (A) h'thon met need for more [specific, actionable] information about the standards; got a good start on a names-resolution component for phylota; made decisions concerning the allowable scope of queries; (C) needed more info on stds technologies; sub-group bit off "too much" technically (but came to understand the real scope of the names-resolution issue only by getting together with the h'thon participants).
    • Jim: (A) moving forward with nexml; started the key discussions leading to NeXML incorporation of metadata; (B) continued to flesh out the NeXML metadata annotation standard needed; now "have got it", using Java API with RDFa in phenex in test-research (not production) environment; (C) bring the api up to production grade, migrate to RDFa syntax;
      • "Didn't solve the issue at the h'thon, but formed a foundation for a solution" --Arlin
    • Karla: (A) diversity of the participants, coalescing around five different ideas, and producing results; (B,C) would like to try Open Space technology ideas within the iPlant collaborative;
      • Sheldon - interoperability concepts very important in the crafting of challenge grants.
    • Greg: (A,B) reiterates Sheldon's comments on PhyloWidget; (C) more progress needed in getting data 'out there' into these standard [or accessible?] formats: e.g., protein alignments/gene trees (i.e. Pfam)
      • Karen: possible action item for next hackathon?
    • Mark: (A) opportunity to get plugged in; important for professional development, particularly in learning and finding places to contribute; (B) created XML schemas, NeXML formatting for LANL HIV data and was able to deliver data from mash-up (wants someone like LANL to bite, hasn't happened yet); (C) summer-of-code student (bioperl native modules to deal with nexml);
    • Ryan: (A) prof. networking opportunities; familiarization with standards; (B,C) continuing to hammer on PhyloWS, still some issues to resolve-- the need going in to build a generalized architecture forced a focus on resolving PhyloWS; the h'thon provided the catalyst for this effort.
    • Enrico: (A) chance to apply CDAO practically; developed tools on the spot; clarification of metadata description; got perspective on what is missing in CDAO with respect to community needs (this was the most useful thing); (B) filling in CDAO gaps based on what was learned at the h'thon.
    • Peter: (A) resolved the "competition" between the two Java APIs through consensus and agreement; agreed on importance of representing metadata; (B) continuing to flesh out the API, including the metadata support; summer-of-code student, project: display of metadata in Mesquite (at level of the char data matrix interface [?]).
    • Rutger: (A) a great success; expanded Java API to include annotations; v. interested in summer-o-code project; (B) follow-ups include additions to Perl API; json API mapping nexml to javascript object notation.
      • Hilmar: RDFa standard important, developed at hackathon; important to make data accessible. In particular, need to make it a top priority to update the schema at to comply with the RDF/a-based standard. Two SoC projects depend on it.
    • Hilmar: (A) h'thon a "huge success", impressed by the diversity of people coalescing around a few shared objectives; acceptance of a RDF/a-compliant standard for representing phylogenetically rich metadata, which "de-silos" phylogenetic data, making it accessible to off-the-shelf tools and non-specialist brains.
    • Todd: (A) always a question of how much will the participants work together after the meeting-- This time, this aspect panned out very well.
    • Arlin: (A) we've hit on a successful formula: diverse group of people+open space approach, started out rough but it worked out. Reiterate Hilmar's point that we showed how to "de-silo" phylo data with semantics-based methods to make "insider" knowledge accessible to the computing world
  3. Publicity - how to publicize our work (manuscripts, posters, others)
    • documentation support for nexml; stabilize schema; evangelize;
    • nexml manuscript needed; bioinfo application notes?
    • Hilmar's Evolution meetings poster: basis for a paper?
      • Rutger: doc support is more needed now than code support in the drive for NeXML uptake; suggestions: push for TDWG adoption, fill out and improve wiki, get out "mini-papers" and app notes. Probably requires a concerted effort (rather than an ad-hoc, free time one).
      • Hilmar: Now approaching a point where there is a standards narrative, rather than just the pieces of the puzzle. This group is at the forefront of this; the cohesion between the three legs (NeXML/CDAO/PhyloWS) makes the story--validatable syntax, rich computable semantics, and a consistent, predictably programmable interface. The technical groundwork is laid, some polishing is required, more complete doc necessary, and some compelling biological examples.
  4. Discussion of the big picture - improving interop in 1) phylo-systematics-diversity or 2) comparative-genomics-and-molevol
    • annotating trees, decorating trees is an important use-case in data integration
      • Karen: The 'big picture' in systematics/diversity: What do people in "the big projects" want to be able to do? Open up a tree from a web server/drive, add pictures, sequences, their own annotations to that tree. We are much closer to the glue to make that happen, but not ready to do that in a large scale way.
      • Hilmar - The "decorating trees" use-case is very, very significant; this is why all these processes need to be online and talking seamlessly to one another. In the future, it will be the data on the web that are significant, and not the "sites".
  5. A list of ideas or challenges that arose during the discussion
    • larger effort needed to pursue taxonomic resolution service
    • getting data into accessible formats
    • C or C++ interface to nexml, natively or via swig
    • importance of validating metadata-containing files
    • polishing and documenting examples, using this as basis for interop strategy presentation
  6. Opportunities:
    • Another working group: (see draft proposal Son of Evoinfo) July 10 NESCent proposal deadline, is doable, but need another leader, as Arlin needs to move out of the leadership role.
    • NSF INTEROP grant: July 23 deadline, $250K/3 yr, supporting a network of researchers. Requires a community cohesion component-- needs a systematic effort to promote and penetrate, bringing others in. Proposal due within next 30 days, difficult but possible, may be the last opportunity to apply. Discuss further with others who may be interested.
  7. Next steps: Arlin will arrange a follow-up telecon to this one.

Summary: We now have feedback on the hackathon: we hit on a successful formula, people are happy about what happened, and are continuing to be happy.

organizer follow-up to telecon

The organizers (Todd, Hilmar, Rutger, Arlin) talked for 10 minutes after the telecon in order to make a plan to proceed.

To do:

  1. for NSF interop
    • it's important to move quickly (e.g., requests for letters of support should go out very soon)
    • start wiki page for NSF proposal (Arlin)
    • start list of collaborators
  2. for a possible follow-up hackathon
    • Todd will ask whether NESCent has funds for hackathon on the scale of $20 K (i.e., 15 to 18 people instead of 25)
  3. for a possible NESCent working group

Teleconference Feb 26, 2009


Thursday, February 26, 2009, at 1pm Eastern US. You should have received a phone number and access code from Hilmar.


  1. Report from pre-meeting
  2. Introduction to participating standards (Arlin)
  3. Information gathering
  4. Use case gathering (Dave Clements)
  5. Setting the agenda for the hackathon
    • Activities and plan for first day
    • Development targets
    • General routine for days 2-5
    • Wrap-up on last day
    • Whole-group brainstorming about future activities
    • Brainstorming MIAPA
  6. Participant questions


  1. Introductions
  • present: Sheldon M., Matt Y., Rutger V., Mark J., Hilmar L., Vivek G., Ryan S., Karla G., Roger H., Todd V., Karen C., Greg J., Arlin S. (recording), Dave C.
  1. Pre-meeting report (Rutger)
    • telecon
    • use cases
      • need to gather information on inputs and outputs from participants
      • developed form for participants to fill out
    • PhyloWS, using SRU syntax (RESTful)
    • Combine CDAO with nexml to represent metadata
      • how to attach a GO term, taxon identifier, specimen-collection info, phenotype information
      • corrected mistakes
    • questions
      • use of nexml-CDAO
      • how to use ontology? don't need it to parse file, only to do reasoning
      • syntax for expressing statements, e.g., RDF triples
      • should be explored further how to express semantics in nexml
  2. Information gathering (Arlin)
    • coding (Mark)
    • use cases (Dave)
  3. Agenda for hackathon (Hilmar)
    • principles: self-organize; do it quickly; match interests with projects so as to get energy & commitment
    • need for people to communicate, get involved, in order for self-organization to work
    • more discussion of use case list
      • ok to add uses at any level of completeness or technical detail
      • 'targets' page also has space for ideas
    • mixer activity
    • supporting MIAPA (not important?)
  4. Open question period
    • bootcamps: nexml, cdao, syntax
  5. organizer follow-up
    • need pedagogic materials on
      • syntax and semantics
      • bootcamp about reasoning
      • example xml file with data and ontology links, used for reasoning example
        • tree-taxonomy correspondence (Rutger)
      • web services
      • integration (mash-up)
      • cdao
      • nexml
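
The pre-meeting questions about attaching a GO term, taxon identifier, or specimen information, and about expressing statements as RDF triples, can be illustrated with a small sketch. This is a minimal illustration in plain Python, not NeXML or CDAO syntax: the identifiers and predicate names (`tree1:node5`, `has_taxon_id`, and so on) are hypothetical placeholders chosen for the example.

```python
from collections import namedtuple

# A statement in triple form: subject, predicate, object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical annotations on tree nodes; every identifier and predicate
# name here is an assumption made for illustration only.
annotations = [
    Triple("tree1:node5", "has_taxon_id", "taxon:9606"),
    Triple("tree1:node5", "has_go_term", "GO:0008150"),
    Triple("tree1:node7", "has_specimen", "specimen:ABC-123"),
]


def statements_about(triples, subject):
    """Return (predicate, object) pairs for every triple with this subject."""
    return [(t.predicate, t.object) for t in triples if t.subject == subject]


if __name__ == "__main__":
    for predicate, obj in statements_about(annotations, "tree1:node5"):
        print(predicate, obj)
```

A parser only needs to read the triples; the ontology matters when one wants to reason over what the predicates mean, which matches the point made in the questions above.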

Teleconference Feb 20, 2009


Friday, February 20, 2009, at 3pm Eastern US. You should have received a phone number and access code from Hilmar.


  1. Welcome & Kick-off (Arlin)
  2. Introductions (all)
  3. Roadmap until event (Hilmar)
    • Teleconferences
    • Pre-meeting
    • Information gathering
  4. Introduction to participating standards
  5. Taking inventory (technical, semantics, purpose)
    • Vignette about the "network" (Arlin)
    • Spreadsheet requesting information input (Arlin)
  6. Use case gathering (Dave Clements)
  7. Participant questions


  • Brandon Chisham, NMSU, CDAO Project
  • Jim Balhoff, NESCent, Phenoscape
  • Enrico Pontelli, NMSU, CDAO Project
  • Ryan Scherle, NESCent, Dryad
  • Rutger Vos, University of British Columbia, NeXML
  • Hilmar Lapp, NESCent, Co-organizer, designer PhyloWS
  • Arlin Stoltzfus, NIST, Co-organizer, CDAO project
  • Jeet Sukumaran, University of Kansas, NEXUS
  • Peter Midford, University of Kansas, Mesquite
  • Karla Gendler, University of Arizona, iPlant
  • Sam Donnelly, U. Pennsylvania, pPOD
  • Sheldon McKay, modENCODE, iPlant
  • Mark Jensen, Fortinbras Research, clinical analysis of sequence data from pathogen
  • Bill Piel, Yale, TreeBASE, iPlant
  • Lucie Chan, San Diego Supercomputing, MorphoBank
  • Vivek Gopalan, NCBI
WELCOME MESSAGE (Arlin Stoltzfus)
  • Agenda has been sent out
  • Notes will be sent out after the call
INTRODUCTIONS (All Participants)
ROADMAP (Hilmar Lapp)
  • Kickoff teleconference,
  • Overview standards
  • Need to gather some information, important activity over the next 2 weeks
    • Data providers?
    • Use cases to guide?
  • Dave Clements has a page developed with Karla
  • Premeeting is taking place to prepare standards for the hackathon
    • More messages over weekend as the premeeting develops
  • 2 more telecons; next one does not need to be a replication, instead discuss more technical issues and gather info for use cases (same for third)
  • MORNING time for next one (for UK folks)
    • No questions
  • 3 technologies (phyloWS, NEXML, CDAO); all are outcomes of the evoinfo working group
  • Working Group for 2 years, started to address interoperability issues; started with brainstorming for ideas (e.g., integrated data resource); we settled on specific technologies to facilitate interop. One data standard, one ontology, one interface for web services. The hackathon is the last meeting of the working group. Thus, this is the time and place to put technology to the test.
  • NEXML: (Rutger Vos)
    • New XML standard, inspired by the NEXUS format; lots of applications use it; many data resources also use it (as data input or as serialization format)
    • NEXUS has issues, dialects, incompatibilities; we want a new standard, formally developed and that can be validated.
    • There is a website. It contains the XML schema and some I/O libraries (Java, Python, JavaScript; C++ is still in progress); on the other hand, there does not seem to be strong interest in C++.
    • It sounds like a useful technology, more reliable exchange of data, we can use it for data exchange for web services; some advantages over previous standards.
      • What is the current level of support? There are some libraries provided: Perl, included in BioPerl, thus BioPerl supports it; Java is used by Mesquite; Phenoscape uses its own; Jeet is working on a library for Python.

  • CDAO (Arlin Stoltzfus)
    • Ontology that addresses the application area of comparative data analysis; implemented in OWL
    • OWL offers good control and formal structuring for the ontology
    • CDAO formalizes knowledge/semantics; it is useful for interoperability, to resolve ambiguities using semantics. For example, the Sequence Ontology has been used with similar objectives in the case of sequence data. Different sequence databases use the General Feature Format (GFF) but with a focus on syntax; this led to incompatible definitions of certain terms (e.g., open reading frame: in some instances it is viewed as including a stop codon, in other instances it is not; the Sequence Ontology made it possible to clarify this ambiguity by creating two separate concepts).
    • Similar benefits can be gained in phylogenetic analysis: for example, in the problem of tree reconciliation. There are many tools, each imposing different requirements on the input tree (e.g., completely resolved or not). These distinctions on the inputs are often semantic, not based on syntax.
    • A formal ontology also allows access to reasoners, which can be used for validation of concepts
    • Note, that formal ontologies are meant to be machine understandable, not necessarily to be used manually.
      • Are there tools to generate it? Are there tools to formalize description of an analysis? Yes, there are formalisms to formally describe a biomedical analysis or protocol, and they can be instantiated using a domain specific ontology. This is the case of OBI or FUGO (as general ontologies for describing protocols) and BioMoby (as a domain specific ontology)
      • Comment on workflow languages: there are systems that support phylogenetic workflows; in Kepler there are mechanisms to introduce annotations (e.g., on the inputs and outputs) and these will be used to type check the workflow. But they are not widely used.
      • I am new to all these ontologies; how does one connect different ontologies together in the same application? That can be done: ontologies can import other ontologies. CDAO includes an external ontology for amino acids and allows external ontologies to describe different types of characters.
  • PhyloWS (Hilmar Lapp)
    • It is the youngest of the three standards; one year old
    • Developed at the BioHackathon in Japan
    • Focused on web services
    • Obstacle: there is a rich diversity of (digital) data resources accessible online, yet they are designed for human consumers; the metadata could be valuable but are not machine accessible
    • Some people are forced to do complex tasks to extract knowledge (e.g., HTML screen-scraping)
    • There is a lack of programmable interfaces, and this is an obstacle to interoperability
    • A programmable interface is aimed at Predictability and Interpretability, and these two aspects build on the two previously proposed standards (NEXML and CDAO)
    • Predictability: how to access data holdings, search data holdings, query interfaces, how to access individual items and resources (e.g., one tree in TreeBASE, one alignment in an Alignment database) and how are these data returned. NEXML provides a solution to some of these issues by offering a standard interchange format.
    • Interpretability: how do I use the data returned? What is the meaning? CDAO represents a solution to this aspect.
    • If all these online data resources implement a standard web interface, these tasks become easy, it is simple to write widgets to embed in other web pages or applications, or create large systems (e.g., in Kepler, Mesquite) that can pull data from resources and they know what to do with them.
      • Is PhyloWS implemented? Or can it be implemented but something is missing? Yes and no. First of all, it is partially implemented: there is a prototype for Tree of Life; you can, through PhyloWS and a REST interface, obtain ToL trees. However, there are parts of the specification that need to be fleshed out (and we will work on this at the premeeting).
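
The "predictability" idea described above (a uniform way to address individual items and to search data holdings) can be sketched as follows. This is not the PhyloWS specification, which the notes say was still being fleshed out; the base URL, the path layout, and the `find` and `query` names are all assumptions made purely for illustration.

```python
from urllib.parse import quote, urlencode

# Hypothetical service root; not a real PhyloWS endpoint.
BASE_URL = "https://example.org/phylows"


def item_url(resource_type, identifier, base=BASE_URL):
    """Build a predictable address for one item, e.g. one tree or one alignment."""
    return f"{base}/{quote(resource_type)}/{quote(identifier, safe='')}"


def search_url(resource_type, query, base=BASE_URL):
    """Build a predictable address for searching one kind of resource."""
    return f"{base}/{quote(resource_type)}/find?{urlencode({'query': query})}"


if __name__ == "__main__":
    print(item_url("tree", "T1"))
    print(search_url("tree", "Homo sapiens"))
```

The point of the sketch is only that a client which knows the pattern for one resource can construct requests for any other resource that follows the same pattern, without screen-scraping.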
INVENTORY (Arlin Stoltzfus)
  • We would like to think about possibilities, as a prelude to collecting information on data and capabilities
  • Putting together data standards and web services, we can connect data resources that are now disconnected. For example, TreeBASE may want to pull in other data; if there are semantic mappings between schemas it becomes possible, possibly through web services, with data transmitted in NEXML. Or we can describe web services to provide access to TreeFam data sets from EMBL. Or enable existing tools to access sophisticated data matrix viewers, like mx or nexplorer, just by producing data in some standard format. We can integrate resources; e.g., Rutger used ToL queries and then went to TimeTree to get dates for trees (this is an interactive user interface), and a service combines and integrates them.
  • We need to think about this; we need an inventory of inputs and outputs supported by different data resources (represented by you, the participants). We want to create a network, where nodes are data resources and links are shared data types. If you export a character matrix and someone imports a character matrix, there is a potential link in the graph and an opportunity for interoperation. The links are possible, but they may be theoretical and not practical (there may be format compatibility issues, or lack of a robust interface). We want to propose solutions for this. Please help us to create this graph.
  • After the telecon we will create a form or a shared spreadsheet on Google Docs, and we will summarize that in a graph. Your data will become a part of a network of data resources. You will get this after the telecon and we invite you to fill in the required information.
  • Another thing we want to hear about is use cases or wish lists. We have a use case wiki and Dave Clements and Karla Gendler have set up a template to fill in. Please suggest use cases that would shape the hackathon to be more concrete.
  • I am new at this; is the focus on trees? Or data repositories? The high-level structure of trees? What is the granularity? The focus is on evolutionary data, more specifically phylogenetic data; it includes trees, taxonomies, species taxonomies, character state matrices, discrete or continuous characters, sequence alignments, maybe transition models. In the wider context, this is only part of the picture; metadata are also important and can be linked to nodes of the tree, ranging from gene functional data and gene locations to biodiversity data. The focus is on making phylogenetic data available in standard forms so that outside users can access these metadata. Are we going to worry about how to enable linking phylogenetic data to other data or vice versa? Is this too far away from the hackathon? Maybe some of these are down the road, but they are in the realm of workflows, and they are in scope.
  • What will the event look like? There will be lots of room for creativity, not assignments, people get together and use resources according to their interest. Try to be focused on certain activities, more profitable to you and the people you group with.
  • I have not been there before: what is the typical day and how does it change over the period? The typical day will include programming, working on a specific task that is determined to be worthwhile by a group. The people at the event make tasks feasible. It is self-organizing and self-emerging. We will try to have lots of conversation on the mailing list before the event, but this is only to coordinate and gather information. We will not be telling people what to do. Subgroups will emerge and assume charges. The first morning will be devoted to forming the groups. Some people will be responsible for documentation. We may have some bootcamps. We will try to sense which ones are needed from the mailing list discussion. This is not a workshop where people get up and talk and brainstorm. It is very different. Some of these goals will form over the next two weeks. You may want to start thinking about which participants you want to work with. Note that the wiki is open to everyone to edit, just request an account. Send email to help{at} to request an account on evoinfo wiki. There is also an online tutorial for using wikis. We will put the link somewhere.
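
The inventory idea described above (nodes are data resources, links are shared data types) can be sketched in a few lines. This is a minimal illustration, assuming each resource simply lists the data types it exports and imports; the resource names and their capabilities below are hypothetical placeholders, not entries from the actual inventory.

```python
from itertools import product

# Hypothetical inventory: what each resource can export and import.
inventory = {
    "ResourceA": {"exports": {"tree"}, "imports": {"character matrix"}},
    "ResourceB": {"exports": {"character matrix", "alignment"}, "imports": {"tree"}},
    "ResourceC": {"exports": set(), "imports": {"alignment", "tree"}},
}


def potential_links(inventory):
    """List (exporter, importer, data_type) for every data type that one
    resource exports and a different resource imports."""
    links = []
    for src, dst in product(inventory, repeat=2):
        if src == dst:
            continue
        shared = inventory[src]["exports"] & inventory[dst]["imports"]
        for data_type in sorted(shared):
            links.append((src, dst, data_type))
    return links


if __name__ == "__main__":
    for src, dst, data_type in potential_links(inventory):
        print(f"{src} -> {dst} via {data_type}")
```

Each link found this way is only a potential one: as noted above, format incompatibilities or the lack of a robust interface may still keep it from working in practice.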