Database Interop Hackathon/Teleconferences

From Evolutionary Informatics Working Group
Revision as of 12:41, 23 June 2009 by Arlin.stoltzfus@nist.gov (talk) (Teleconference June 23, 2009)

Teleconference June 23, 2009

The teleconference is planned for 1:00 pm EST.

The purpose of this meeting is to plan follow-ups to the successful hackathon in March.

Agenda

tentative draft of agenda

  1. Report from the hackathon
    • what worked
    • what didn't work
    • subsequent work on the projects started at the hackathon
    • your comments
  2. Publicity - how to publicize our work (manuscripts, posters, others)
  3. Discussion of the big picture. What is the most that this group can do for interoperability? (Interoperability means that operations such as get, process, store, and query can be controlled, combined, integrated, and automated in a hands-off way, without expert intervention, i.e., without manually editing files or manually operating a software interface.)
    • what is the most we can do for phylogenetics-systematics-diversity studies in the next 2 to 5 years?
    • what is the most we can do for comparative-genomics and molecular-evolution studies in the next 2 to 5 years?
  4. Possible followups to consider:
    1. another hackathon
    2. another working group
    3. an NSF interop proposal

Notes

  1. participants:
  1. feedback and evaluation on the hackathon
  2. Publicity - how to publicize our work (manuscripts, posters, others)
  3. Discussion of the big picture - improving interop in 1) phylo-systematics-diversity or 2) comparative-genomics-and-molevol
  4. Possible followups to consider:

Teleconference Feb 26, 2009

Agenda

Thursday, February 26, 2009, at 1pm Eastern US. You should have received a phone number and access code from Hilmar.

Agenda:

  1. Report from pre-meeting
  2. Introduction to participating standards (Arlin)
  3. Information gathering
  4. Use case gathering (Dave Clements)
  5. Setting the agenda for the hackathon
    • Activities and plan for first day
    • Development targets
    • General routine for days 2-5
    • Wrap-up on last day
    • Whole-group brainstorming about future activities
    • Brainstorming MIAPA
  6. Participant questions

Notes

  1. Introductions
  • present: Sheldon M., Matt Y., Rutger V., Mark J., Hilmar L., Vivek G., Ryan S., Karla G., Roger H., Todd V., Karen C., Greg J., Arlin S. (recording), Dave C.
  1. Pre-meeting report (Rutger)
    • telecon
    • use cases
      • need to gather information on inputs and outputs from participants
      • developed form for participants to fill out
    • PhyloWS, using SRU syntax (RESTful)
    • Combine CDAO with nexml to represent metadata
      • how to attach a GO term, taxon identifier, specimen-collection info, phenotype information
      • corrected mistakes
    • questions
      • use of nexml-CDAO
      • how to use ontology? don't need it to parse file, only to do reasoning
      • syntax for expressing statements, e.g., RDF triples
      • should be explored further how to express semantics in nexml
  2. Information gathering (Arlin)
    • coding (Mark)
    • use cases (Dave)
  3. Agenda for hackathon (Hilmar)
    • principles: self-organize; do it quickly; match interests with projects so as to get energy & commitment
    • need for people to communicate, get involved, in order for self-organization to work
    • more discussion of use case list
      • ok to add uses at any level of completeness or technical detail
      • 'targets' page also has space for ideas
    • mixer activity
    • supporting MIAPA (not important?)
  4. Open question period
    • bootcamps: nexml, cdao, syntax
  5. organizer follow-up
    • need pedagogic materials on
      • syntax and semantics
      • bootcamp about reasoning
      • example xml file with data and ontology links, used for reasoning example
        • tree-taxonomy correspondence (Rutger)
      • web services
      • integration (mash-up)
      • cdao
      • nexml
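
As a rough illustration of the pre-meeting question about expressing statements as triples, the sketch below models subject-predicate-object statements as plain Python tuples and queries them by pattern. All identifiers (the `ex:` names for nodes, predicates, and annotation values) are hypothetical placeholders, not terms taken from CDAO or nexml. It is meant only to show that annotations such as a taxon identifier or a GO term can sit alongside the data and be queried separately, without being needed to parse the file.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical statements about tree nodes; every name here is illustrative.
TRIPLES: List[Triple] = [
    ("ex:node1", "ex:hasTaxonId", "ex:taxonA"),
    ("ex:node1", "ex:hasGoTerm", "ex:goTermX"),
    ("ex:node2", "ex:hasTaxonId", "ex:taxonB"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def objects_of(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return all objects attached to a subject through a given predicate."""
    return [o for (_, _, o) in match(triples, subject, predicate)]


if __name__ == "__main__":
    print(objects_of(TRIPLES, "ex:node1", "ex:hasTaxonId"))
    print(match(TRIPLES, predicate="ex:hasTaxonId"))
```

A real implementation would more likely use an RDF library and a reasoner; plain tuples are used here only to keep the sketch self-contained.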

Teleconference Feb 20, 2009

Agenda

Friday, February 20, 2009, at 3pm Eastern US. You should have received a phone number and access code from Hilmar.

Agenda:

  1. Welcome & Kick-off (Arlin)
  2. Introductions (all)
  3. Roadmap until event (Hilmar)
    • Teleconferences
    • Pre-meeting
    • Information gathering
  4. Introduction to participating standards
  5. Taking inventory (technical, semantics, purpose)
    • Vignette about the "network" (Arlin)
    • Spreadsheet requesting information input (Arlin)
  6. Use case gathering (Dave Clements)
  7. Participant questions

Minutes

PARTICIPANTS
  • Brandon Chisham, NMSU, CDAO Project
  • Jim Balhoff, NESCent, Phenoscape
  • Enrico Pontelli, NMSU, CDAO Project
  • Ryan Scherle, NESCent, Dryad
  • Rutger Vos, University of British Columbia, NeXML
  • Hilmar Lapp, NESCent, Co-organizer, designer PhyloWS
  • Arlin Stoltzfus, NIST, Co-organizer, CDAO project
  • Jeet Sukumaran, University of Kansas, NEXUS
  • Peter Midford, University of Kansas, Mesquite
  • Karla Gendler, University of Arizona, iPlant
  • Sam Donnelly, U. Pennsylvania, pPOD
  • Sheldon McKay, modENCODE, iPlant
  • Mark Jensen, Fortinbras Research, clinical analysis of sequence data from pathogen
  • Bill Piel, Yale, TreeBASE, iPlant
  • Lucie Chan, San Diego Supercomputing, MorphoBank
  • Vivek Gopalan, NCBI
WELCOME MESSAGE (Arlin Stoltzfus)
  • Agenda has been sent out
  • Notes will be sent out after the call
INTRODUCTIONS (All Participants)
ROADMAP (Hilmar Lapp)
  • Kickoff teleconference,
  • Overview standards
  • Need to gather some information, important activity over the next 2 weeks
    • Data providers?
    • Use cases to guide?
  • Dave Clements has a page developed with Karla
  • Premeeting is taking place to prepare standards for the hackathon
    • More messages over weekend as the premeeting develops
  • 2 more telecons; next one does not need to be a replication, instead discuss more technical issues and gather info for use cases (same for third)
  • MORNING time for next one (for UK folks)
  • QUESTIONS/SUGGESTIONS?
    • No questions
STANDARDS
  • 3 technologies (phyloWS, NEXML, CDAO); all are outcomes of the evoinfo working group
  • The working group has run for 2 years and was started to address interoperability issues. It began by brainstorming ideas (e.g., an integrated data resource), then settled on specific technologies to facilitate interop: one data standard, one ontology, one interface for web services. The hackathon is the last meeting of the working group, so this is the time and place to put the technology to the test.
  • NEXML: (Rutger Vos)
    • New XML standard, inspired by the NEXUS format; lots of applications use NEXUS, and many data resources also use it (as data input or as serialization format)
    • NEXUS has issues (dialects, incompatibilities); we want a new standard that is formally developed and can be validated.
    • There is a NEXML.org website. It contains the XML schema and some I/O libraries (Java, Python, JavaScript; C++ is still in progress, although there does not seem to be strong interest in C++).
    • It sounds like a useful technology, more reliable exchange of data, we can use it for data exchange for web services; some advantages over previous standards.
    • QUESTIONS?
      • What is the current level of support? There are some libraries provided: the Perl library is included in BioPerl, so BioPerl supports it; Java is used by Mesquite; Phenoscape uses its own; Jeet is working on a library for Python.

  • CDAO (Arlin Stoltzfus)
    • Ontology that addresses the application area of comparative data analysis; implemented in OWL
    • OWL offers good control and formal structuring for the ontology
    • CDAO formalizes knowledge/semantics; it is useful for interoperability, to resolve ambiguities using semantics. For example, the Sequence Ontology has been used with similar objectives in the case of sequence data. Different sequence databases use the General Feature Format (GFF) but with a focus on syntax; this led to incompatible definitions of certain terms (e.g., an open reading frame is in some instances viewed as including a stop codon and in other instances not; the Sequence Ontology made it possible to resolve this ambiguity by creating two separate concepts).
    • Similar benefits can be gained in phylogenetic analysis: for example, in the problem of tree reconciliation. There are many tools, each imposing different requirements on the input tree (e.g., completely resolved or not). These distinctions on the inputs are often semantic, not based on syntax.
    • A formal ontology also allows access to reasoners, which can be used for validation of concepts
    • Note that formal ontologies are meant to be machine-understandable, not necessarily to be used manually.
    • QUESTIONS?
      • Are there tools to generate it? Are there tools to formalize description of an analysis? Yes, there are formalisms to formally describe a biomedical analysis or protocol, and they can be instantiated using a domain specific ontology. This is the case of OBI or FUGO (as general ontologies for describing protocols) and BioMoby (as a domain specific ontology)
      • Comment on workflow languages: there are systems that support phylogenetic workflows; in Kepler there are mechanisms to introduce annotations (e.g., on the inputs and outputs) and these will be used to type check the workflow. But they are not widely used.
      • I am new to all these ontologies; how does one connect different ontologies together into the same application? That can be done: ontologies can import other ontologies. CDAO includes an external ontology for amino acids and allows external ontologies to describe different types of characters.
  • PhyloWS (Hilmar Lapp)
    • It is the youngest of the three standards; one year old
    • Developed at the BioHackathon in Japan
    • Focused on web services
    • Obstacle: a rich diversity of (digital) data resources is accessible online, yet designed for human consumers; the metadata could be valuable but is not machine accessible
    • Some people are forced to do complex tasks to extract knowledge (e.g., HTML screen scraping)
    • There is a lack of programmable interfaces, and this is an obstacle to interoperability
    • A programmable interface is aimed at Predictability and Interpretability, and these two aspects build on the two previously proposed standards (NEXML and CDAO)
    • Predictability: how to access data holdings, search data holdings, query interfaces, how to access individual items and resources (e.g., one tree in TreeBASE, one alignment in an Alignment database) and how are these data returned. NEXML provides a solution to some of these issues by offering a standard interchange format.
    • Interpretability: how do I use the data returned? What is the meaning? CDAO represents a solution to this aspect.
    • If all these online data resources implement a standard web interface, these tasks become easy: it is simple to write widgets to embed in other web pages or applications, or to create large systems (e.g., in Kepler, Mesquite) that can pull data from resources and know what to do with them.
    • QUESTIONS?
      • Is PhyloWS implemented? Or can it be implemented but something is missing? Yes and no. It is partially implemented: there is a prototype for Tree of Life, and you can obtain ToL trees through PhyloWS and a REST interface. However, there are parts of the specification that need to be fleshed out (and we will work on this at the premeeting)
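
To make the "predictability" point concrete, here is a minimal sketch of how a client might build request URLs if every resource exposed the same REST conventions. The base URL, the path layout (`/find`, `/<identifier>`), and the parameter names (`query`, `format`) are assumptions made up for illustration; they are not taken from the PhyloWS specification, which was still being fleshed out at the time of these notes. No network request is made.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL; no real PhyloWS endpoint is implied.
BASE_URL = "http://example.org/phylows"


def build_search_url(resource: str, query: str, fmt: str = "nexml") -> str:
    """Build a search URL for a resource type (e.g. 'tree').

    The 'query' and 'format' parameter names are illustrative assumptions.
    """
    params = urlencode({"query": query, "format": fmt})
    return f"{BASE_URL}/{resource}/find?{params}"


def build_item_url(resource: str, identifier: str) -> str:
    """Build a URL for one individual item, such as a single tree."""
    return f"{BASE_URL}/{resource}/{quote(identifier, safe='')}"


if __name__ == "__main__":
    print(build_search_url("tree", "taxon=Homo sapiens"))
    print(build_item_url("tree", "T123"))
```

The point of the sketch is that a client written once against such a convention could address any conforming resource by changing only the base URL.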
INVENTORY (Arlin Stoltzfus)
  • We would like to think about possibilities, as a prelude to collecting information on data and capabilities
  • Putting together data standards and web services, we can connect data resources that are now disconnected. For example, TreeBASE may want to pull in other data; if there are semantic mappings between schemas this becomes possible, possibly through web services, with data transmitted in NEXML. Or we can describe web services to provide access to TreeFam data sets from EMBL. Or enable existing tools to access sophisticated data matrix viewers, like mx or nexplorer, just by producing data in some standard format. We can integrate resources; e.g., Rutger used ToL queries and then went to TimeTree to get dates for trees (this is an interactive user interface), and a service combines and integrates them.
  • We need to think about this; we need an inventory of the inputs and outputs supported by different data resources (represented by you, the participants). We want to create a network, where nodes are data resources and links are shared data types. If you export a character matrix and someone imports a character matrix, there is a potential link in the graph and an opportunity for interoperation. The links are possible, but they may be theoretical rather than practical (there may be format compatibility issues, or lack of a robust interface). We want to propose solutions for this. Please help us to create this graph.
  • After the telecon we will create a form or a shared spreadsheet on Google Docs, and we will summarize that in a graph. Your data will become a part of a network of data resources. You will get this after the telecon and we invite you to fill in the required information.
  • Another thing we want to hear about is use cases or wish lists. We have a use case wiki, and Dave Clements and Karla Gendler have set up a template to fill in. Please suggest use cases that would shape the hackathon to be more concrete.
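
The network described above, with data resources as nodes and shared data types as links, can be sketched roughly as follows. The resource names and data types in `INVENTORY` are placeholders, not real survey responses, and the `exports`/`imports` layout is an assumed shape for the spreadsheet data. A link is drawn from one resource to another whenever the first exports a data type that the second imports.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

# Hypothetical inventory; names and data types are placeholders only.
INVENTORY: Dict[str, Dict[str, Set[str]]] = {
    "ResourceA": {"exports": {"tree"}, "imports": {"character matrix"}},
    "ResourceB": {"exports": {"character matrix", "alignment"}, "imports": {"tree"}},
    "ResourceC": {"exports": set(), "imports": {"alignment"}},
}


def potential_links(
    inventory: Dict[str, Dict[str, Set[str]]]
) -> List[Tuple[str, str, str]]:
    """Return (exporter, importer, data_type) for every shared data type."""
    links = []
    # Consider every ordered pair of distinct resources.
    for src, dst in permutations(sorted(inventory), 2):
        shared = inventory[src]["exports"] & inventory[dst]["imports"]
        for data_type in sorted(shared):
            links.append((src, dst, data_type))
    return links


if __name__ == "__main__":
    for src, dst, data_type in potential_links(INVENTORY):
        print(f"{src} -> {dst} via {data_type}")
```

These links are only potential ones: as noted above, a shared data type does not guarantee that formats are compatible or that a robust interface exists.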
GENERAL QUESTIONS
  • I am new at this; is the focus on trees? Or data repositories? The high-level structure of a tree? What is the granularity? The focus is on evolutionary data, more specifically phylogenetic data; it includes trees, taxonomies, species taxonomies, character state matrices, discrete or continuous characters, sequence alignments, maybe transition models. In the wider context, this is only part of the picture; metadata are also important and can be linked to nodes of the tree, ranging from gene functional data and gene locations to biodiversity data. The focus is on making phylogenetic data available as standards in order for outside users to access these metadata. Are we going to worry about how to enable linking phylogenetic data to other data or vice versa? Is this too far away from the hackathon? Maybe some of these are down the road, but they are in the realm of workflows, and they are in scope.
  • What will the event look like? There will be lots of room for creativity, not assignments, people get together and use resources according to their interest. Try to be focused on certain activities, more profitable to you and the people you group with.
  • I have not been there before: what is the typical day and how does it change over the period? The typical day will include programming, working on a specific task that a group has determined to be worthwhile. The people at the event make tasks feasible. It is self-organizing and self-emerging. We will try to have lots of conversation on the mailing list before the event, but this is only to coordinate and gather information. We will not be telling people what to do. Subgroups will emerge and assume charges. The first morning will be devoted to forming the groups. Some people will be responsible for documentation. We may have some bootcamps. We will try to sense which ones from the mailing list discussion. This is not a workshop where people get up and talk and brainstorm. It is very different. Some of these goals will form over the next two weeks. You may want to start thinking about which participants you want to work with. Note that the wiki is open to everyone to edit, just request an account. Send email to help{at}nescent.org to request an account on the evoinfo wiki. There is also an online tutorial for using wikis. We will put the link somewhere.