Difference between revisions of "CarrotBase"

From Evolutionary Informatics Working Group
m (List of data resources to target)
m (Reverted edits by Dpc13 (Talk); changed back to last version by Arlin.stoltzfus@nist.gov)
For more information, and analysis of target potential, see [[Data Resources]].
* resources for systematics
** taxonomies such as NCBI; uBio; for more, see list from Rod Page paper
** [http://tolweb.org Tree of Life], note example implementation, e.g. [http://nexml.org/nexml/phylows/tolweb/16299 hominidae subtree (node 16299)]
* resources focused on sequence families
** [http://www.phylo.org/sub_sections/databases TreeBaseII] has the kind of granular schema that would be a good challenge to try to accommodate using cdao.
** [http://phylodata.seas.upenn.edu/cgi-bin/wiki/pmwiki.php pPOD] (not really a data resource; it's a db tech project led by computer scientists)
** [http://www.treefam.org/ TreeFam]
** Arkin lab's [http://microbesonline.org MicrobesOnline] server has cool [http://www.microbesonline.org/cgi-bin/treeBrowse.cgi?locus=392933 tree-based view of sequence families]
** [http://www.ebi.ac.uk/goldman-srv/pandit/ Pandit]
** [http://pbil.univ-lyon1.fr/databases/hovergen.php Hovergen], [http://pbil.univ-lyon1.fr/databases/hogenom.php hogenom]
** organism-centered gene-family databases, e.g., for plasmodium
* Other
** [http://phylomedb.bioinfo.cipf.es/ PhylomeDB]
** [http://loco.biosci.arizona.edu/pb/ PhyLoTA]
** [http://phylogenomics.berkeley.edu/phylofacts/ PhyloFacts]
** [http://www.timetree.org TimeTree]
** [http://morphobank.geongrid.org/ MorphoBank]
** [http://www.morphbank.net/ MorphBank]
== Notes on how to approach this ==

Revision as of 23:25, 2 March 2009

News flash: the 4th meeting of the working group will be a Database_Interop_Hackathon that takes on some of the interoperability challenges described below.


The goal of improving interoperability is not served merely by developing artefacts such as CDAO and nexml. To improve interoperability, these artefacts must be used in the research community and, ideally, become popular. Thus, to achieve its goals, the working group wishes to promulgate its artefacts. In particular, we wish to develop a strategy to reach a "critical mass" in which sufficiently many operations and resources rely on an artefact that the self-interested researcher (or the lazy programmer) embraces it simply because it's useful, and not just for some intangible long-term benefit of "interoperability".

Our main focus is on the "carrots" part of a "carrots and sticks" strategy, but first we should explain the "sticks" part. Journals that serve the evolution research community have indicated a desire and a willingness to impose data archiving requirements on authors as soon as the technology becomes feasible (Todd Vision, pers. comm.; see Dryad project); also, the research community has called for a minimal reporting standard, MIAPA. These two aspects of interoperability go together: the repository goal favors a minimal reporting standard such as MIAPA to ensure that archived data are re-usable. Furthermore, these two aspects of interoperability go together with a third aspect, which is a file format that is MIAPA-compliant and that is accepted by archives. The "stick" part of our nexml-CDAO strategy is that we intend to support MIAPA, which means that some users may resort to using nexml-CDAO if it is the most convenient way to achieve MIAPA-compliance or to archive data.

The tentatively named CarrotBase project is intended to provide a more positive incentive, i.e., to represent the "carrots" part of the strategy. When we constituted the working group in 2006, we clearly were targeting developers of phylogenetics applications software. This might have been the best way to start developing the technology for interoperability (artefacts such as nexml and CDAO), but now we face a different problem: promulgating these technologies. Targeting applications does not seem like the best way to generate carrots for users. Targeting data resources might be more effective.

List of data resources to target

For more information, and analysis of target potential, see Data Resources.

Notes on how to approach this

Semantic transformation, ontologies, and formats

As Gopal said at one of our earlier meetings: "There is a theory of interoperability, and it is semantic transformation". Interoperability is about expressing the same meaning in different ways, or about maintaining the meaning of information as it flows through different formats, dialects, or contexts. This information may be represented as a string of characters, a diagram, or some other kind of tokens. The meaning of tokens is formalized through ontologies and other standards.

If we want data resources, services, or software applications to interoperate in their treatment of a "coding sequence", we need to know what a "coding sequence" means for each one. If the gene_masher database assumes that "coding sequence" includes a stop codon, but the gene_whacker program assumes it doesn't, then we have a potential interoperability problem. Gene_whacker might crash if given a gene_masher coding sequence, or if gene_whacker is more carefully written software, it might refuse to validate a coding sequence from gene_masher because its length does not match the expected length.
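The stop-codon mismatch can be made concrete with a minimal sketch of a semantic transformation between the two conventions. The function names are hypothetical, gene_masher and gene_whacker are the illustrative names used above rather than real tools, and the stop codons are those of the standard genetic code.

```python
# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_terminal_stop(cds: str) -> bool:
    """Report whether the sequence ends in a stop codon."""
    cds = cds.upper()
    return len(cds) >= 3 and cds[-3:] in STOP_CODONS


def to_convention(cds: str, include_stop: bool) -> str:
    """Return the coding sequence under the requested convention.

    include_stop=True  -> the terminal stop codon is part of the sequence
    include_stop=False -> the terminal stop codon is excluded
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("length is not a multiple of 3")
    if has_terminal_stop(cds):
        return cds if include_stop else cds[:-3]
    if include_stop:
        # The stop codon cannot be restored without knowing which one it was.
        raise ValueError("sequence has no terminal stop codon")
    return cds


# A gene_masher-style record (stop included) converted for gene_whacker.
masher_cds = "ATGGCCTAA"
whacker_cds = to_convention(masher_cds, include_stop=False)
```

Here `whacker_cds` is `"ATGGCC"`. The conversion is only possible because the convention of each side is stated explicitly, which is the role the text assigns to ontologies and other standards.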

Foundations that we need prior to the hackathon

To work effectively with data providers at the hackathon, we need to develop the following foundational technology in advance, in this order:

  1. advanced data representations such as nexml and CDAO
  2. APIs for these and any other relevant data representations
  3. using the APIs, translation tools for the relevant data representations
  4. translation and validation web services

From there, before getting to the detailed phyloWS stuff, it would seem important to develop an API for basic top-level services. The top-level services would be:

  1. provide meta-data on the resource (point of contact, publication, organizations, support)
  2. provide a list of data and metadata types
  3. respond to a query for an instance of a given data type (do you have a tree with Xenopus in it?)
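As a rough illustration of the three top-level services, the sketch below models a data resource in memory. Every class, method, and field name is an assumption made for this example; it is not the phyloWS interface or any existing API, and a real service would sit behind a web endpoint.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceInfo:
    """Service 1: meta-data on the resource itself."""
    contact: str
    publication: str
    organizations: list[str]
    support: str


@dataclass
class DataResource:
    info: ResourceInfo
    # Maps a data type name (e.g. "tree") to its records; each record is a
    # dict with at least an "id" and a list of "taxa" (hypothetical layout).
    holdings: dict[str, list[dict]] = field(default_factory=dict)

    def describe(self) -> ResourceInfo:
        """Service 1: point of contact, publication, organizations, support."""
        return self.info

    def list_types(self) -> list[str]:
        """Service 2: the data and metadata types on offer."""
        return sorted(self.holdings)

    def find(self, data_type: str, taxon: str) -> list[dict]:
        """Service 3: records of a type that mention the taxon.

        A name matches if it equals the query or begins with it as a genus,
        so "Xenopus" matches "Xenopus laevis".
        """
        return [
            rec for rec in self.holdings.get(data_type, [])
            if any(name == taxon or name.startswith(taxon + " ")
                   for name in rec.get("taxa", []))
        ]


demo = DataResource(
    info=ResourceInfo("curator@example.org", "unpublished", ["Example Lab"], "none"),
    holdings={
        "tree": [{"id": "t1", "taxa": ["Xenopus laevis", "Homo sapiens"]}],
        "alignment": [],
    },
)
# "Do you have a tree with Xenopus in it?"
hits = demo.find("tree", "Xenopus")
```

With this demo data, `hits` contains the single record `t1`, and `demo.list_types()` reports the two types `alignment` and `tree`.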

Some of the work at the hackathon would be focused on working with data providers to implement support for these services.

Demonstration projects

Integrated Sequence Family Resource

As a start, rather than considering all kinds of character data, let's imagine integrating all of the sequence family resources, e.g., the ones that have trees and sequence alignments (Pandit, TreeFam, etc.).

One way to do this would be to fully instantiate all the data:

  1. download content from TreeFam, Pandit, HOVERGEN, etc., in its native format
  2. translate content from native format to a common data model (or ontology)
  3. map the data model (or ontology) to a relational schema
  4. populate the schema with all of the translated data
  5. provide a web interface with search capabilities
  6. enhance the web interface with an alignment-viewer or tree-viewer plugin
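Steps 2 through 5 of the fully instantiated approach can be sketched with Python's built-in `sqlite3` module. The two "native" record layouts, the accessions, and the two-table schema are invented for this example; the real TreeFam and Pandit formats differ, and a real common data model would be richer (for instance, derived from CDAO).

```python
import sqlite3

# Hypothetical stand-ins for downloaded native records (step 1).
treefam_like = [{"fam": "FAM_A", "newick": "((a,b),c);",
                 "seqs": {"a": "ATG-", "b": "ATGC", "c": "AT-C"}}]
pandit_like = [("FAM_B", "(x,y);", [("x", "MK-L"), ("y", "MKAL")])]


# Step 2: translate each native layout to one common in-memory model.
def from_treefam_like(rec):
    return {"source": "TreeFam", "family": rec["fam"],
            "tree": rec["newick"], "rows": dict(rec["seqs"])}


def from_pandit_like(rec):
    family, newick, rows = rec
    return {"source": "Pandit", "family": family,
            "tree": newick, "rows": dict(rows)}


# Step 3: map the common model to a relational schema.
SCHEMA = """
CREATE TABLE family (id INTEGER PRIMARY KEY, source TEXT,
                     accession TEXT, newick TEXT);
CREATE TABLE aligned_seq (family_id INTEGER REFERENCES family(id),
                          label TEXT, residues TEXT);
"""


# Step 4: populate the schema with the translated data.
def load(conn, families):
    for fam in families:
        cur = conn.execute(
            "INSERT INTO family (source, accession, newick) VALUES (?, ?, ?)",
            (fam["source"], fam["family"], fam["tree"]))
        conn.executemany(
            "INSERT INTO aligned_seq VALUES (?, ?, ?)",
            [(cur.lastrowid, label, res) for label, res in fam["rows"].items()])
    conn.commit()


# Step 5: a search that works the same way across all sources.
def families_with_label(conn, label):
    return conn.execute(
        "SELECT f.source, f.accession FROM family f "
        "JOIN aligned_seq s ON s.family_id = f.id "
        "WHERE s.label = ? ORDER BY f.id", (label,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
common = ([from_treefam_like(r) for r in treefam_like]
          + [from_pandit_like(r) for r in pandit_like])
load(conn, common)
```

Once the data share one schema, a single query such as `families_with_label(conn, "x")` searches every source at once; the per-source code is confined to the small translator functions.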

Another way to do this would be through web services. In this case, the interface would look the same as above, but everything would be done on the fly. To do this, of course, one has to work with the data provider to provide the content of the data resource via a web service.

Ultimately (when it's ready) the nexplorer-like API described below could provide the interface.

nexplorer-like API for CDAO

Nexplorer is a mouse-driven graphical web application for visualizing and manipulating character-data-and-trees. It displays data sets as a tree with its tips attached to the rows of a data matrix. The tree can be rearranged or pruned, and the view will change accordingly. Nexplorer uses Bio::NEXUS to read NEXUS files, but a CDAO visualization tool could use the Protege API for OWL described in the programming guide.
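The coupling between tree and matrix can be illustrated with a small sketch in which pruning a tip also removes its row from the displayed matrix. This is plain Python written for this page, not Nexplorer code, and it does not use Bio::NEXUS or the Protege API; the nested-tuple tree representation and the function names are assumptions.

```python
def prune(tree, drop):
    """Return the tree without the tips in `drop`, or None if nothing is left.

    A tree is either a tip label (str) or a tuple of subtrees.
    """
    if isinstance(tree, str):
        return None if tree in drop else tree
    kept = [sub for sub in (prune(child, drop) for child in tree)
            if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # collapse an internal node left with one child
    return tuple(kept)


def tips(tree):
    """List tip labels in display order (left to right)."""
    if tree is None:
        return []
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in tips(child)]


def view(tree, matrix):
    """Rows of the character matrix, ordered and filtered by the tree's tips."""
    return [(label, matrix[label]) for label in tips(tree)]


tree = (("human", "chimp"), ("mouse", "rat"))
matrix = {"human": "ACGT", "chimp": "ACGA", "mouse": "TCGT", "rat": "TCGA"}

pruned = prune(tree, {"chimp"})
rows = view(pruned, matrix)
```

After pruning, `pruned` is `("human", ("mouse", "rat"))` and `rows` holds three rows in tip order. Because the view is derived from the tree rather than stored separately, the matrix display cannot drift out of step with the tree.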

An increasingly demanding list of capabilities would be:

  1. visualize character data and trees
  2. manipulate views of the data (re-root, flip branches, scroll alignment)
  3. annotate data interactively
  4. edit data interactively (prune tree, slice alignment, etc)
  5. generate output in standard formats
  6. invoke external services or analysis operations on data
  7. more complex forms of data input and integration
    • interactive user input (e.g., user-generated tree as in MacClade)
    • uploads of data and matching to current data set
    • integration from other sources accessed via web services
    • error correction


Our notes from the working group meeting

Data resource interop project

  • rationale
    • need to develop critical mass for nexml-cdao (critical mass of community involvement, use-cases, eyes on code)
    • hard to get critical mass by targeting app developers
    • hard to show value by targeting app developers
    • strategy of targeting data resources puts data in hands of users
    • however there also are carrots for db maintainers:
      • interchange with other sources
      • common treatment of keys such as taxonomy
  • data resources to target (see Data Resources page)
  • implementation strategy
    1. recruit participants
      • possible leadership from Hilmar, other suggestions (Rod Page, Bill Piel, Encyc of Life, TOL)
      • other participants drawn from projects above
    2. develop project plan
      • consider implementing a coordinated db with families from many sources (similar philosophy to InterPro for protein family alignment dbs?)
      • CDAO to provide a constraining vocabulary for the nexml schema
        • disagreement over whether this is a good idea
        • to succeed, we must provide users a way to expand nexml, thus to expand cdao
        • mechanism is the feature request, e.g., I want to represent Bremer values
          1. user contacts nexml.org about how to represent new type of data
          2. if a formal mechanism is desired but does not exist, user is referred to cdao feature request (must provide searchable interface and front page link to feature request form)
          3. cdao developers are obliged to respond
    3. get support
      • NSF grant
      • NESCent-sponsored hackathon

More notes, from telecon with Hilmar and Arlin (5/29/08)

Outline of the basic plan

  1. expand CDAO to support more metadata
  2. map CDAO to a relational schema (CDAO already is mapped to nexml and NEXUS)
  3. develop a database interface that has some nice features
    • better query interface than TreeBase or Pandit because it has more access to semantics
    • output in rich, structured format (nexml)
    • some nice integrated tools such as ATV and jalview, iTOL
  4. pick two or three data resource managers to come work with us to develop a mapping, so that the content of their resources can be uploaded automatically into the database application via nexml-CDAO
  5. finally, hold a hackathon where we bring in other data resource managers to create their own mappings
  • the goal is to develop an integration platform
    • integrate taxonomy, phylogeny, gene family databases
  • data resource manager can take existing content, load it into db package, and automatically get the cool features
    • taxonomic links
    • standard services API
    • allows data submission, returns data in nexml or equiv
    • accepts metadata and transmits it
    • reasons over metadata, e.g., validate MIAPA compliance
  • NSF grant
    • Advances in Biological Informatics (due August 12)
    • program has explicit shift in focus away from database creation . . .
    • . . . but that's ok because we won't focus on One Database That Rules Them All, but on the infrastructure that supports them all
  • questions
    • isn't this proposing the same thing as pPOD?
    • apparent overlap with pPOD is strong; need buy-in from them
    • do we need a pilot project to get preliminary results?
    • do we need to hire interface programmers to do this? A: can do it by contract, at least initially