Related Artefacts

From Evolutionary Informatics Working Group

Rationale

We don't want to reinvent the wheel. We *do* want to draw on any existing artefacts. For instance, we should not redefine nucleotides and the relationships between them, nor should we redefine what a sequence is. Instead we should incorporate existing ontologies that define these concepts.

Identification of related artefacts

We identified related artefacts in several ways. In some cases we had personal knowledge of artefacts that were existing or in development, e.g., NeXML. We also surveyed the OBO repository, carried out web searches with key words, and scanned related journal articles. As a result of this we identified 16 related artefacts. Many of these turned out to be hypothetical or in early stages of development (see below).

Initial survey and results

During the second working group meeting, three of us (Julie Thompson, Enrico Pontelli, and Arlin Stoltzfus) carried out a superficial analysis of the related artefacts. In each case we asked the following questions:

  1. What is the specification language?
  2. What is the scope of the artefact in terms of domains, major concepts, and cases?
  3. Within its scope, does the artefact have some case-dependence or frame-dependence (e.g., SO is a genetic mapping language and does not make sense as a biochemistry language)?
  4. Is the artefact extensible? Is its development open?
  5. What is the controlling authority, if any?
  6. What is the update and versioning policy?
  7. What is the conformance policy, if any?
  8. What is the status of the artefact?
  9. How widely used (extended, modified) is the artefact?

The answers to these questions are in the following table and in the "info" pages linked from the last column.

who | artefact | scope | status | our info
EP | CHADO | sequence, genotypes, phenotypes, phylogenies | used in GMOD | info
AS | EBO | evolution in general | hypothetical? | info
JT | Genome comparison and rearrangements ontology | genomes, genome comparisons, their evolution and biological function | hypothetical? | info
EP | GraphML | markup language for graphs (general, no bio scope) | implemented/stable | info
JT | MAO | multiple alignments of DNA, RNA and protein sequences and structures | implementation | info
AS | NCBI data model | sequences, alignments, annotations | in use at NCBI | info
AS | NCBI Taxonomy | organismal classification, genetic code | used widely | info
EP | NeXML | character data, trees, models, meta-info | in development | info
EP | NEXUS format (Maddison, et al., 1997) | character data, trees, assumptions, sets, notes | legacy, in use, extended | info
EP | OBO REL | general; set of allowed relations for OBO | in use; evolving | info
AS | pPOD | processing of data related to phylogenetic analysis workflows | in development | info
JT | PATO | phenotypic and trait ontology | implementation | info
JT | Protein Ontology | protein structure, function | implementation | info
JT | PRO/PROEVO ontology | protein evolutionary families, multiple endproducts of genes | in development | info
JT | RnaO | RNA sequence, structure, motifs, alignments | in development | info
EP | TreeBASE | character data, trees, meta-info on analyses | in use, v. II in progress | info

Further analysis

Ignoring the artefacts that are not yet developed (PROEVO, EBO, RnaO), we noted the following.

overlaps in scope

  • phylogenies
    • NEXUS and TreeBASE (version I) represent trees as Newick strings
    • GraphML describes graphs in general, thus allows for tree topologies
    • NeXML uses nodes and edges as in GraphML.
    • CHADO has phylonode, phylotree
    • GCRO (if we could get it) apparently has phylogenies
  • sequence alignments
    • NEXUS, TreeBASE have blocks of character data that can hold residues, though the concept that the residues are connected together in a "sequence", or that they have a linear coordinate system, isn't really part of this
    • NeXML can represent sequence alignments
    • MAO focuses on sequence alignments composed of sub-alignments
    • CHADO may include sequence alignments
    • NCBI data model includes alignments as a special kind of annotation that maps to locations on multiple sequences
  • other character data
  • bibliographic references
    • PubMed
    • TreeBASE
  • other information
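
The overlap in tree representations listed above (Newick strings in NEXUS and TreeBASE, nodes and edges in GraphML and NeXML) can be illustrated with a small sketch. The function name newick_to_graph and the integer node ids are assumptions made for this illustration; they are not taken from any of the surveyed artefacts. The parser handles only names and nested parentheses, and ignores branch lengths, quoted labels, and comments.

```python
from itertools import count


def newick_to_graph(newick):
    """Convert a simple Newick string into (nodes, edges).

    Illustrative only: supports nested parentheses and plain labels,
    not branch lengths, quoted labels, or comments.
    """
    text = newick.strip().rstrip(";")
    ids = count()
    nodes = {}   # node id -> label ("" for unnamed internal nodes)
    edges = []   # (parent id, child id) pairs
    pos = 0

    def parse_node(parent):
        nonlocal pos
        node_id = next(ids)
        nodes[node_id] = ""
        if parent is not None:
            edges.append((parent, node_id))
        if pos < len(text) and text[pos] == "(":
            pos += 1                      # consume "("
            parse_node(node_id)           # first child
            while pos < len(text) and text[pos] == ",":
                pos += 1                  # consume ","
                parse_node(node_id)       # further children
            pos += 1                      # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1                      # read the (possibly empty) label
        nodes[node_id] = text[start:pos].strip()
        return node_id

    parse_node(None)
    return nodes, edges


nodes, edges = newick_to_graph("((human,chimp),gorilla);")
print(nodes)
print(edges)
```

The node and edge lists carry the same topology as the Newick string, which suggests that these artefacts overlap in what they can express about trees even though their syntax differs.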

features that we want

  • CHADO's ability to link arbitrary data to leaves of a tree
  • NCBI's taxonomy hierarchy, with its treatment of synonyms


features that we don't want

  • GCRO allows comparisons to be described in terms of transformations from a "source" to a "target" sequence. In reality, most actual pairwise comparisons do not involve a source and a target, but two targets that descend from a (hypothetical) common source. This is a key to the evolutionary approach. Of course we want to be able to describe transformations, but we want them to correspond to evolutionary changes, not just ways to convert one set of symbols into another.
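
The distinction can be made concrete with a toy sketch. The Change record, the changes_on_branch function, and the three short sequences are all hypothetical, invented for this illustration; in practice the ancestral sequence would be inferred rather than known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One substitution at an aligned site (hypothetical record type)."""
    site: int
    before: str
    after: str


def changes_on_branch(start, end):
    """List the sites at which two aligned sequences differ."""
    if len(start) != len(end):
        raise ValueError("sequences must be aligned to the same length")
    return [Change(i, a, b)
            for i, (a, b) in enumerate(zip(start, end)) if a != b]


ancestor = "ACGT"   # assumed common ancestor (made up for the example)
seq_x = "ACGA"      # one observed descendant
seq_y = "TCGT"      # the other observed descendant

# Evolutionary view: one change on each branch leading from the ancestor.
branch_x = changes_on_branch(ancestor, seq_x)
branch_y = changes_on_branch(ancestor, seq_y)

# Source-to-target view: a symbol conversion from seq_x to seq_y.
x_to_y = changes_on_branch(seq_x, seq_y)

print(branch_x, branch_y, x_to_y)
```

In this toy case the source-to-target description reports an A-to-T conversion at site 3, whereas under the assumed ancestor the change on that lineage was T to A. The conversion is a valid way to turn one string into the other, but it does not correspond to an evolutionary change.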

potential conflicts

Conclusions relevant to CDAO

  1. what features should be emulated?
  2. what features should be avoided?
  3. can we follow all of the OBO relations ontology rules?

We decided we were ready to carry out a preliminary design and implementation.

first draft