Database Interop Hackathon/Use Cases

From Evolutionary Informatics Working Group

Karla Gendler and Dave Clements are taking on the compilation of use cases for the upcoming hackathon. Use cases are specific problems that we could address or solutions we could implement using the hackathon's deliverables. Use cases will help the group prioritize possible deliverables, and then give developers concrete examples to address when developing solutions. (Put another way, use cases help developers deliver things that are actually useful.)

We already have

The Growing List of Use Cases

NeXML + Chado
Metadata support
Clinical annotation of pathogen evolutionary data
Visualization of arbitrary annotations along phylogenies
mx + NeXML
mx + Mesquite
Large matrices + jQuery + REST
CDAO API based on semantic framework (Vivek Gopalan)
Reconciliation of Semantic Heterogeneity in PhyloWS Requests
Connecting Tree Data Repositories with Visualization and Manipulation Tools
NeXML to TaxonConcept
Your new use case here

What's in a Use Case?

Use cases typically specify:

Use Case Name - General & Specific
  • Both a general area, and a specific scenario in that area.
Motivation
  • Explanation of why this use case is important.
Key Challenges
  • What are the key issues in implementing a solution to this use case?
Preconditions for Use
  • What must a user have to use the solution? Sequences (what kinds)? Character-state data matrices? ...
User Steps
  • What steps would a user go through to achieve this task?
Results
  • What is the final output / result of this use case?
References (optional)
  • Links to documents related to this use case.

A whole slew of example use cases are available from the Phyloinformatics Hackathon.


Use Cases

We've created an example first use case below. Please take a look at it and the discussion after it, and then ...

You are strongly encouraged to add your use cases to this list. If you don't want to update the wiki directly, contact Karla Gendler or Dave Clements and they'll post the use case for you (and they'll also encourage you to learn how to update the wiki).

Broad Data Integration: Integrating NeXML & CDAO with general purpose bio databases

Note: This example use case could also be applied to BioMart, InterMine, GUS, or any other widely used general purpose biological database schema. I picked Chado, because that's the one I know best. Dave Clements

Motivation
  • GMOD's Chado schema is very good at sequence and ontologies, supports basic phylogeny and computational analysis annotation, supports many other data types, and is widely used. If NeXML feeds could go directly into Chado, the data in them could be integrated with many other data types, such as publications, stocks, sequence annotation, and computational analysis of all kinds.
Key Challenges
  • Existing phylogeny tables may not be sophisticated enough.
  • Existing support for phenotypes is flawed and needs rewriting.
  • There is no support for natural diversity data, but this is planned for later in 2009, based on the GDPDM.
  • Existing computational analysis module is not ontology aware. The CDAO could be easily represented in Chado, but very little would use it.
  • All these would have to be adjusted, causing backward compatibility issues.
  • No current visualization tools for Chado Phylogeny
Preconditions for Use
  • User must have multiple kinds of data that can be integrated into a Chado database. This can be sequence data of any kind, microarrays, targeted gene expression results, phenotypes... Otherwise there is nothing for the NeXML to integrate with.
User Steps
  1. Install Chado and load other types of data into it using existing tools for that purpose.
  2. Either get their phylogenetic data into NeXML format, or identify NeXML sources with the data they want.
  3. Load the XML into Chado using the (newly created) NeXML Chado adaptor.
Results
  • Ability to run custom analysis on their newly integrated data, or view it in an integrated Chado-driven web site.
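The loading step above can be sketched roughly as follows. This is only an illustration, not the proposed adaptor: the sample document, its namespace URI, and the `demo_phylonode` table are assumptions made for the example (a real Chado instance runs on PostgreSQL and has its own phylogeny module), and SQLite is used only so the sketch is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tiny NeXML-like document; the namespace URI and exact layout are assumptions.
NEXML = """<nexml xmlns="http://www.nexml.org/2009" version="0.9">
  <otus id="tax1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <trees id="trees1" otus="tax1">
    <tree id="tree1">
      <node id="n1" root="true"/>
      <node id="n2" otu="t1"/>
      <node id="n3" otu="t2"/>
      <edge id="e1" source="n1" target="n2" length="0.5"/>
      <edge id="e2" source="n1" target="n3" length="0.7"/>
    </tree>
  </trees>
</nexml>"""


def local(tag):
    """Strip any '{namespace}' prefix so the code does not depend on the URI."""
    return tag.rsplit("}", 1)[-1]


def nexml_to_rows(xml_text):
    """Return (node_id, parent_id, label, distance) tuples for every tree node."""
    root = ET.fromstring(xml_text)
    otu_labels, rows = {}, []
    for el in root.iter():
        if local(el.tag) == "otu":
            otu_labels[el.get("id")] = el.get("label")
    for tree in (el for el in root.iter() if local(el.tag) == "tree"):
        parent, dist = {}, {}
        for el in tree:
            if local(el.tag) == "edge":
                parent[el.get("target")] = el.get("source")
                dist[el.get("target")] = float(el.get("length", 0))
        for el in tree:
            if local(el.tag) == "node":
                nid = el.get("id")
                rows.append((nid, parent.get(nid),
                             otu_labels.get(el.get("otu")), dist.get(nid)))
    return rows


def load(conn, rows):
    """Insert rows into a hypothetical, simplified node table (not real Chado)."""
    conn.execute("CREATE TABLE IF NOT EXISTS demo_phylonode "
                 "(node_id TEXT PRIMARY KEY, parent_id TEXT, label TEXT, distance REAL)")
    conn.executemany("INSERT INTO demo_phylonode VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, nexml_to_rows(NEXML))
```

Once the tree sits in relational tables, it can be joined against whatever other data the database holds, which is the point of this use case.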
References (optional)

Discussion

Why is this example a good use case? Why is it not a good use case?

It is certainly broadly useful. There are many installations that use Chado, BioMart, InterMine, GUS and so on, and it would be beneficial to be able to integrate NeXML data with that. However, this use case doesn't specify any particular questions that we could now answer as a result of this integration. Without the specific questions to anchor it, any solution based on this use case (as written) runs the risk of not enabling anything useful.

Please add any other comments here.

And please add your use cases below.

Metadata Support

See Metadata Support Use Case.


Clinical annotation of pathogen sequences, alignments and phylogenies

Motivation

Linking the processes of pathogen sequence evolution to pathogenesis in humans and animals has become one of the most important endeavors in modern molecular epidemiology (broadly defined), and has been a part of HIV research for many years. Identifying valid and reproducible associations between sequence changes and disease processes absolutely depends upon standardized sequence and "phylodynamic" annotations that describe clinical states within the patient at the time of sequence sampling. However, an intimate look at (for example) the Los Alamos HIV Sequence database reveals that field names and data descriptors are frequently slurped in from primary sources, misspellings, duplications and all. This greatly reduces the value of the resource, so that the most careful investigators (who generally do the best work) are unwilling to use the data in their studies.

The development of a unified, standardized protocol for the clinical annotation of pathogen sequences, alignments, and phylogenies would lead to

  • improved database interoperability
  • increased legitimacy and confidence in the data by the end users, and so to
  • improved database usage and return on investment.
Key Challenges
  • Development of computable (i.e., standard) interfaces to data resources
  • Standardization of the vocabulary of pathogenesis generally and for specific infectious diseases, and creating computable ontologies from this vocabulary
  • Some progress has been made on this front by OBO, in the infectious disease ontology for example, which already appears to have HIV-related concepts (immune deficiency, disease progression) and even evolutionary (fitness) concepts built-in.
  • Identification and coding of natural "join points" between clinical, epidemiological and microbiological ontologies and evolutionary/population genetic ontologies like CDAO
  • Development of user tools (both GUI/interactive and APIs) that accept reasonably easy-to-formulate queries, and deliver ontologically annotated sequence/alignment/phylogenetic pathogen data
  • Under the hood could be a mashup driven by WSDL/RESTful interfaces among the data providers, or (at a slightly higher level) establishment and execution of an automated workflow.
  • Protection of patient identity and compliance with HIPAA privacy regulations
  • The onus of this lies mainly with the data providers. However, the matter of identifiability and de-identification is made more complex when, as in the case of HIV but also other rapidly evolving pathogens like Helicobacter pylori and hepatitis C virus, the pathogen sequence itself (i.e., the deliverable) contains enough distinguishing information to identify the patient.
Preconditions for Use
  • Network access
...but an ideal interface might be able to use locally maintained private slices of the databases involved.
  • GUI or API query interface
To be most useful, the interface should allow the user to browse the relevant ontologies for the correct query terms, and perhaps allow the user to create personal aliases for those terms, so that the aliases can be used directly in queries.
User Steps
  • Build a query with the interface and click Submit
  • Ease and naturalness of use is key to increasing the utility of these valuable and expensive resources. Compare the following interfaces (shameless, yes, but I think a good example): HIVDB Advanced Search vs. HIVQuery.
Results
  • Sequence/alignment/phylogenetic data with associated ontologically valid clinical metadata
...in NeXML, of course!
References


See Clinical annotation of pathogen sequences and phylogenies for some further thoughts. Comments welcome! --Mark Jensen

Visualization of arbitrary annotations along phylogenies

Motivation

To demonstrate the richness and flexibility of NeXML for holding various types of phylogenetic / biological data, the goal of this "use case" would be to create an interface for "discovering" the data stored within a NeXML file by interactively visualizing the data on a phylogenetic tree. The end result would hopefully be something along the lines of Google's Motion Chart gadget (see this example), where the user can first see what types of annotation are stored within the phylogeny, and then use those different data values to modulate the color/size/shape of the nodes/edges/labels.

Key Challenges
  • Find a dataset with a rich set of categorical & numerical annotations (very important!)
  • Create an interface for choosing and tuning the visualization parameters
    • Keep it simple but powerful!
  • What to do with missing data?
  • Define a minimal communications channel between Java (which will be drawing the trees and parsing the NeXML) and Javascript (which will respond to the controls).
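As a rough illustration of the "what to do with missing data" question, the sketch below maps numeric annotations to node sizes and categorical annotations to colours, falling back to a neutral size and colour when a value is missing. The function names, default sizes, and palette are assumptions made for this example, and Python is used only for illustration; the use case itself envisions Java and JavaScript.

```python
def size_scale(values, min_size=4.0, max_size=20.0, missing_size=2.0):
    """Map numeric annotations (None = missing) to node sizes by linear scaling."""
    present = [v for v in values.values() if v is not None]
    lo, hi = (min(present), max(present)) if present else (0.0, 0.0)
    span = hi - lo
    sizes = {}
    for node, v in values.items():
        if v is None:
            sizes[node] = missing_size          # missing data gets a small, fixed size
        elif span == 0:
            sizes[node] = (min_size + max_size) / 2  # all values equal: use the midpoint
        else:
            sizes[node] = min_size + (v - lo) / span * (max_size - min_size)
    return sizes


def colour_scale(values, palette=("#1b9e77", "#d95f02", "#7570b3"), missing="#cccccc"):
    """Assign one palette colour per category, cycling if categories outnumber colours."""
    categories = sorted({v for v in values.values() if v is not None})
    lookup = {c: palette[i % len(palette)] for i, c in enumerate(categories)}
    return {node: lookup.get(v, missing) for node, v in values.items()}


sizes = size_scale({"a": 0.0, "b": 10.0, "c": None, "d": 5.0})
colours = colour_scale({"a": "marine", "b": "terrestrial", "c": None})
```

Drawing missing values in a distinct neutral style, rather than dropping those nodes, keeps the tree topology intact while making the gaps visible.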
References

Initiate a new collaboration using mx and data queried for and returned in NeXML format

Motivation

A web-based workbench to rapidly prototype and initiate preliminary phylogenetic analysis (to be expanded later) can kick-start a successful project. A team of researchers is starting a new phylogenetic analysis knowing there is existing data they will need to incorporate. They need to complete a preliminary phylogenetic analysis to submit for a grant proposal. This is largely similar to the introduction NeXML & CDAO proposal but with mx as the core.

Key Challenges
  • a Ruby NeXML parsing library needs to be written and mapped to the data structures in mx
  • Molecular tables in mx may not be flexible enough to handle incoming data; updating mx to use the Chado schema is a possible solution
  • RESTful interaction with CIPRES for phylogenetic inference and perhaps a secondary application for storage/retrieval of inference metadata
Preconditions for Use
  • previously published data, exported from Mesquite perhaps, are available for query and import; molecular data from GenBank or similar are another possible source
  • perhaps more complex - the interface queries Phenex based on some ontological criterion (characters of the head), return pertinent characters
User Steps
  • user configures and installs mx
  • user accesses the NeXML import interface, queries and/or selects data for import (interface to be designed at hackathon)
  • user creates a customized dataset based on the imported data, this includes a subset of the imported data, and several new datapoints manually entered (functionality expanded at hackathon)
  • user submits (RESTful) dataset to CIPRES using [NeXML, CDAO?] redirecting the results to a linked database (or extension of mx) which handles CIPRES results (submission process to be created)
Results
  • a portable Ruby library (BioRuby?) for handling NeXML files is available to all researchers, not just users of mx
  • a preliminary phylogenetic analysis containing morphological and molecular data is available for use as evidence of prior work for the team's grant submission
  • a rapidly prototyped web page of input is served through mx (functionality to be updated/created), and output (phylogenetic analysis) is linked to through [some other linked database that handles analysis metadata], the linkage appears as a unified view (to be created)
References (optional)

Work seamlessly between Mesquite (offline) and mx (online)

Motivation

This is a frequently requested feature for mx and a major roadblock in general to convincing researchers to adopt a web-based portal for storing and manipulating their data. "Can I edit my data offline?" "What do I do when I don't have internet access?" Meeting this challenge would increase the probability that phylogenetic data would be seamlessly "added" to the web when an analysis is completed.

Key Challenges
  • a Ruby NeXML parsing library (BioRuby?) needs to be written and then mapped to the data structures in mx
  • buffering changes (diffs) of NeXML files so that updates to the relational db in mx can be seamlessly synchronized while maintaining referential integrity
  • Mesquite/mx data model closely examined and linkages formalized
Preconditions for Use
  • mx/mesquite are installed
User Steps
  • user configures and installs mx and initiates a new matrix by importing a Mesquite generated NeXML file or creating a matrix in mx
  • remote users concurrently modify and update the data in mx while a researcher stranded in Canada is forced to work offline
  • Canadian makes it back to civilization, logs on, and uploads their work; diffs are presented and either buffered for review by admin, or added immediately as needed
Results
  • the tool set of users throughout the world who do not have or who have limited internet capabilities is greatly enhanced
  • data are seamlessly and rapidly merged and made available across the web?
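The diff-and-review step could look something like the following sketch, which treats a matrix as a mapping from (OTU, character) cells to states and performs a three-way merge against the last shared version. The function names and data layout are assumptions for illustration only; they do not reflect the actual mx or Mesquite data models.

```python
def diff(base, edited):
    """Return {cell: new_state} for cells changed, added, or deleted (None) relative to base."""
    changes = {}
    for cell in base.keys() | edited.keys():
        if base.get(cell) != edited.get(cell):
            changes[cell] = edited.get(cell)
    return changes


def merge(base, online, offline):
    """Three-way merge: apply non-conflicting offline edits, report the rest for review."""
    online_changes = diff(base, online)
    offline_changes = diff(base, offline)
    merged, conflicts = dict(online), {}
    for cell, state in offline_changes.items():
        if cell in online_changes and online_changes[cell] != state:
            # Both sides changed the same cell differently: buffer it for an admin.
            conflicts[cell] = {"online": online_changes[cell], "offline": state}
        elif state is None:
            merged.pop(cell, None)   # cell deleted offline
        else:
            merged[cell] = state     # cell added or changed offline only
    return merged, conflicts


base = {("t1", "c1"): "0", ("t1", "c2"): "1", ("t2", "c1"): "0"}
online = {("t1", "c1"): "1", ("t1", "c2"): "1", ("t2", "c1"): "0"}
offline = {("t1", "c1"): "2", ("t1", "c2"): "0", ("t2", "c1"): "0", ("t2", "c2"): "1"}
merged, conflicts = merge(base, online, offline)
```

Here the cell edited on both sides is held back as a conflict, while the offline-only edits are applied immediately, matching the "buffered for review or added immediately" behaviour described in the user steps.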

Visualize and review large matrices using a jQuery or Prototype plugin library and RESTful resources

Motivation

Large morphological matrices are more routinely being created thanks to collaborative efforts. Visualizing these data to check their status is difficult, and improvements for doing so are constantly being requested.

Key Challenges
  • write jQuery library/module/plugin that queries a RESTful interface which returns NeXML formatted data as needed, perhaps in a Google-maps tiling style (to create)
  • JavaScript library for parsing NeXML formatted data (to create)
  • write RESTful resources that return the necessary data (can be rapidly prototyped and tested in mx)
Preconditions for Use
  • user includes the jQuery library and sets some parameters on their web page
  • RESTful interfaces to NeXML are created which can return subsets, slices, partitions (whatever you want to call them) of the matrix and associated metadata
User Steps
  • user configures and installs mx (which would use the plugin) or creates a site and references the javascript library
  • library queries RESTful resource for initial dataset
  • user drags (sensu google maps) the matrix around to visually inspect the data (to create)
  • user applies various filters (confidence levels, cells with tags, cells annotated with images, text linked to Ontology) to visually alter matrix (to create)
  • advanced options could include any possible matrix manipulation (CRUD), but the core should focus on visualizing/querying
Results
  • a javascript library useful to any web application
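The tiling idea can be sketched server-side as a function that returns one rectangular window of the matrix per request. Everything here is hypothetical: the function name, its parameters, and the dictionary layout are assumptions, and serialising the tile to NeXML (as the use case proposes) is left out.

```python
def matrix_tile(otus, characters, states, row=0, col=0, height=2, width=2):
    """Return one rectangular window of the matrix as a JSON-serialisable dict."""
    tile_otus = otus[row:row + height]
    tile_chars = characters[col:col + width]
    # Unscored cells are reported as "?" so the client can draw them distinctly.
    cells = [[states.get((o, c), "?") for c in tile_chars] for o in tile_otus]
    return {
        "row": row, "col": col,
        "otus": tile_otus, "characters": tile_chars, "cells": cells,
        # Flags telling the client whether dragging further will fetch more data.
        "more_rows": row + height < len(otus),
        "more_cols": col + width < len(characters),
    }


otus = ["t1", "t2", "t3"]
characters = ["c1", "c2", "c3"]
states = {("t1", "c1"): "0", ("t1", "c2"): "1", ("t2", "c1"): "1", ("t3", "c3"): "2"}
first = matrix_tile(otus, characters, states)
corner = matrix_tile(otus, characters, states, row=1, col=1)
```

A client that drags the matrix around would request neighbouring tiles by changing `row` and `col`, much as a map client requests adjacent map tiles.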
References (optional)

Semantic application development - Development of a basic API for CDAO ontology based on semantic framework (Vivek Gopalan).

Description

User has a NeXML file or CDAO individual file containing a gene family that has OTUs across different kingdoms. Based on the annotation of the sequence features he/she would like to remove all the partial genes and extract only CDS features of all the complete (not-partial) plant genes in Newick format (I have tried to make it sound more like a logical reasoning type of problem).

or

User has a NeXML file and wants to convert the contents to CDAO individuals, then reroot the tree at a selected node and export the modified tree in Newick format.

Motivation
  • To develop an object-oriented API for CDAO ontology in order to programmatically interact with the ontology model and to build applications that create, read, update, delete and extract data to the model. (CDAO ontology as such represents comparative data model in evolutionary analysis.)
  • To build the API in such a way that it hides the generic ontology methods and provides methods understandable by the users with phylogenetic knowledge.
  • To use Dynamic Object Modeling (Objects are created based on the CDAO ontology components) methodology so that any changes to the CDAO ontology or other related ontologies should cause minimal or no change in the code (Ontologies keep evolving).
  • To use reasoning services to extract specific data based on the restriction of logical relationship among the CDAO classes at runtime (e.g. all the nodes and their relationships can be directly extracted based on the componentOf or hasChild properties using reasoner instead of writing the logic in the code).
Key Challenges
  • The Semantic framework and reasoner currently available may be very slow.
  • CDAO ontology is not stable yet.
Preconditions for Use
  • User must have a NeXML file with a tree block containing a tree with at least 30 to 50 OTUs.
  • An API to read NeXML data.
  • Availability of framework for Semantic Application Development
  • Availability of reasoners for inference of ontology.
  • Availability of API for persistence of data in the ontology (optional).
User Steps
  • User gives a NeXML file as input for the application.
  • User also provides the name that represents all the plant OTUs and the gene feature that should be extracted.
  • User gets a modified Newick tree as output.
Results
  • Evaluation of available APIs for OWL models (Jena, OWL-API, Protégé-OWL API)
  • Evaluation of available reasoners and their compatibility with the available OWL APIs.
  • A basic CDAO API will be developed with a) tree manipulation and export options and b) to convert the tree data in NeXML to CDAO ontology.
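The second scenario in the description (reroot at a selected node, then export Newick) can be illustrated without any ontology machinery. The sketch below is not the proposed CDAO API: it represents the tree as a plain edge list, and the function name and data layout are assumptions. Branch lengths and quoting of special characters in labels are omitted.

```python
from collections import defaultdict


def to_newick(edges, labels, root):
    """Serialise a tree, given as undirected edges, to Newick starting from `root`.

    Rerooting is simply a matter of choosing a different `root`: the traversal
    direction determines which neighbour counts as the parent.
    """
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    def render(node, parent):
        children = [n for n in adj[node] if n != parent]
        name = labels.get(node, "")
        if not children:
            return name
        if len(children) == 1 and not name:
            # An unlabelled node with a single child (e.g. the old root after
            # rerooting) carries no information, so it is suppressed.
            return render(children[0], node)
        return "(" + ",".join(render(c, node) for c in children) + ")" + name

    return render(root, None) + ";"


edges = [("r", "n1"), ("r", "C"), ("n1", "A"), ("n1", "B")]
labels = {"A": "A", "B": "B", "C": "C"}
original = to_newick(edges, labels, "r")
rerooted = to_newick(edges, labels, "n1")
```

In the semantic approach proposed here, the parent/child traversal would instead come from reasoning over CDAO properties, but the exported Newick string would have the same shape.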
References

Reconciliation of Semantic Heterogeneity in PhyloWS Requests using Third Party Service

Description

Two added challenges for phyloinformatics are that communicating tree information is (1) sensitive to node label strings (typically species names) whose meaning is not always clear, and (2) unable to express more generic requests, yet these generic questions are often what interest phylogeneticists the most. I propose the extension of PhyloWS to allow the specification of a third party semantic reconciliation service as an intermediary between a user and a target data source. Hackathon activities would involve specifying the syntax for the extension to PhyloWS and building a prototype semantic reconciliation service as proof-of-concept.


Motivation

The semantics of node identifiers are critical to accurate communication, yet these are not always effective.

  • One kind of inadequacy is in heterogeneity of taxon labels: a node labeled “Homo sapiens” will not map to another node labeled “Homo sapiens Linnaeus, 1758” or one labeled “Homo sapiens Linnaeus” even though all three of these forms are considered accurate scientific notations.
  • Another kind of inadequacy is in the difficulty of trying to ask generic questions, such as: “find me phylogenies that support the monophyly of bats.” Trees relevant to this question are idiosyncratic as to which taxon labels they use, yet it’s impossible for a user to query all possible relevant taxon labels.

To solve this, I propose that PhyloWS be capable of identifying an intermediary reconciliation service that can enhance phylogenetic objects before passing them on to the target data service.
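The first kind of inadequacy, label heterogeneity, can be illustrated with a deliberately naive sketch that strips authority and year from a label and compares the remaining binomials. The function names and the regular expression are assumptions for illustration; a real reconciliation service would also need to handle synonyms, homonyms, and the ambiguous cases listed under Key Challenges.

```python
import re

# Genus (capitalised) followed by a lower-case specific epithet; anything after
# that (authority, year) is treated as optional decoration.
_BINOMIAL = re.compile(r"^\s*([A-Z][a-z]+)\s+([a-z][a-z-]+)\b")


def canonical_name(label):
    """Return 'Genus epithet' for a taxon label, or None if it does not look binomial."""
    match = _BINOMIAL.match(label)
    return f"{match.group(1)} {match.group(2)}" if match else None


def same_taxon(label_a, label_b):
    """Naive comparison: two labels agree if their canonical binomials agree."""
    a, b = canonical_name(label_a), canonical_name(label_b)
    return a is not None and a == b
```

With this, the three forms of "Homo sapiens" quoted above compare as equal, whereas a plain string comparison would treat them as three different taxa.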

Key Challenges
  • How to handle instances where the meaning of a taxon label cannot be resolved unambiguously without iterative user interaction.
  • How to return the response to the user despite passing the request through a 3rd party service.
Preconditions for Use
  • A PhyloWS-compliant semantic reconciliation service must be available.
  • User needs to craft PhyloWS requests with added semantic reconciliation syntax and then parse returned NeXML.
User Steps
  • What steps would a user go through to achieve this task?
Results
  • NeXML passed through a semantic reconciliation service is marked up with added metadata for each included taxon label such that the meaning of the labels is more easily interpretable.
Interested Participants (please add your name if interested)

Connecting Tree Data Repositories with Visualization and Manipulation Tools

Description

Users wishing to employ the confederated phylogenetic data sets that will hopefully emerge with the help of NeXML, PhyloWS and other resources would benefit from a sophisticated user interface. Searching, browsing, visualizing and manipulating the trees would be facilitated by an application that uses web services to remotely query and access a centralized data resource. Further, collaborative research efforts would benefit from the ability not only to view the data but also to enter or modify tree data and save the results.

A proof-of-concept prototype was assembled using PhyloWidget integrated with a TreeBase-derived database via an AJAXy web application (using the YAHOO API framework for the JavaScript components). Follow this link for a demo video


Workflow for a high-level interface to a tree data repository. In addition to searching and viewing tree data, optional (shaded) data entry or manipulation would be supported.
Motivation

Assembling the tree of life will involve activities such as meta-analyses and manual supertree assembly, annotation and editing of existing trees, etc., that will require information such as provenance, node annotations and other types of meta-data to maintain their integrity through various handling steps.

We propose to extend an existing prototype of a web-based application for viewing and editing TreeBase data with Phylowidget to a more generalized application that 1) provides a GUI for searching, viewing, entering, or manipulating tree data, 2) uses NeXML as the data/meta-data interchange format and 3) uses PhyloWS as a means to interact with tree databases.

Key Challenges
  • Ensuring stability and universality of user interface
  • Ensuring meta-data integrity
  • Having a working database with PhyloWS/NeXML connectivity
Preconditions for Use
  • A web browser, Java, and network access
  • Optionally, tree data to enter. Unless a particular format is enforced, adapters would be needed to format data in NeXML.
User Steps
  1. Query database via a web-based application
  2. View data with Phylowidget
  3. Optionally, edit, prune and graft, annotate, export and import, etc.
  4. Optionally, save data back to the database
Results
  • The results would depend on the user's needs. They could range from visualized trees for a casual browser through to the creation, entry and archiving of new trees in the database.
References
  • The following is an example of TreeBase data integrated with PhyloWidget, with primitive meta-data tracking and tree-editing/saving capability. The website works for all Java-enabled browsers but PhyloWidget works best in IE or Safari (sorry).

iPlant tree of life pre-project web application

demo video

Interested Participants (please add your name if interested)

NeXML to TaxonConcept

This is a tentative high level Use Case - may not be detailed enough for implementation but may serve for discussion

Motivation

The Biodiversity community is resplendent with classifications and checklists that are not necessarily phylogenetically based. Examples include sites like Catalogue of Life, NCBI Taxonomy and USDA Plants. These databases largely contain hierarchies of names and synonyms rather than separating out the notion of Taxon Concepts. It would be desirable to compare NeXML files containing species trees to these classifications so as to generate a report, preferably with a visual component. Using this approach the implications of phylogenetic research can be layered onto existing systems that may be used for species occurrence and monitoring work.

Key Challenges
  • Getting existing classifications into OWL or common format.
  • NeXML into common format.
  • Finding suitable test data sets.
  • Size: Small phylogenetic studies against large classifications and vice versa.
Preconditions for Use
  • NeXML file containing a species tree.
  • Conventional classification in an accessible format (possibly NeXML or OWL).
User Steps
  • Load NeXML and classification into tool. Envisaged loading both into a single OWL ontology.
  • Explore results through interface.
  • Make assertions regarding relationships of terminals in NeXML and nodes in classification.
  • Repeated iterations of exploring and making assertions.
Results
  • Possible annotation of original NeXML file with links to classification.
  • Textual report...
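One possible shape for the textual report is sketched below: for each group in a classification, check whether the taxa sampled in the tree form a clade. This is only an illustration under simplifying assumptions: the tree is given as nested tuples of terminal names rather than parsed from NeXML, labels are assumed to match exactly, and the function names and sample data are made up for the example.

```python
def clades(tree):
    """Return (terminals, clade list) for a tree given as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    terminals, found = frozenset(), []
    for child in tree:
        child_terminals, child_clades = clades(child)
        terminals |= child_terminals
        found.extend(child_clades)
    found.append(terminals)
    return terminals, found


def compare(tree, classification):
    """Report, per classification group, whether the tree contains it as a clade."""
    terminals, tree_clades = clades(tree)
    report = {}
    for group, members in classification.items():
        sampled = frozenset(members) & terminals  # ignore taxa absent from the tree
        if len(sampled) < 2:
            report[group] = "not testable"
        elif sampled in tree_clades:
            report[group] = "supported"
        else:
            report[group] = "not supported"
    return report


tree = ((("Homo sapiens", "Pan troglodytes"), "Gorilla gorilla"),
        ("Mus musculus", "Rattus norvegicus"))
classification = {
    "Hominidae": ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"],
    "Pongidae": ["Pan troglodytes", "Gorilla gorilla", "Pongo pygmaeus"],
    "Rodentia": ["Mus musculus", "Rattus norvegicus", "Cavia porcellus"],
    "Leporidae": ["Lepus europaeus"],
}
report = compare(tree, classification)
```

Groups marked "not supported" are the places where the phylogeny and the classification disagree, which is where a user would start making the assertions described in the user steps.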
References (optional)

Standard Data Model Representation for Taxonomic Information


Your Use Case Here

Please copy this template and fill in the parts.

Motivation
  • Explanation of why this use case is important.
Key Challenges
  • What are the key issues in implementing a solution to this use case?
Preconditions for Use
  • What must a user have to use the solution? Sequences (what kinds)? Character-state data matrices? ...
User Steps
  • What steps would a user go through to achieve this task?
Results
  • What is the final output / result of this use case?
References (optional)
  • Links to documents related to this use case.
