We don't want to reinvent the wheel; we *do* want to draw on existing artefacts. For instance, we should not redefine nucleotides and the relationships between them, nor should we redefine what a sequence is. Instead, we should incorporate existing ontologies that define these concepts.
We identified related artefacts in several ways. In some cases we had personal knowledge of artefacts that already existed or were in development, e.g., NeXML. We also surveyed the OBO repository, carried out keyword web searches, and scanned related journal articles. In this way we identified 16 related artefacts. Many of these turned out to be hypothetical or in early stages of development (see below).
Initial survey and results
During the second working group meeting, three of us (Julie Thompson, Enrico Pontelli, and Arlin Stoltzfus) carried out a superficial analysis of the related artefacts. In each case we asked the following questions:
- What is the specification language?
- What is the scope of the artefact in terms of domains, major concepts, and cases?
- Within its scope, does the artefact have some case-dependence or frame-dependence? (E.g., SO is a genetic mapping language and does not make sense as a biochemistry language.)
- Is the artefact extensible? Is its development open?
- What is the controlling authority, if any?
- What is the update and versioning policy?
- What is the conformance policy, if any?
- What is the status of the artefact?
- How widely used (extended, modified) is the artefact?
The answers to these questions are summarized in the following table and in the "info" pages linked from the last column.
|Reviewer||Artefact||Scope||Status||Info|
|EP||CHADO||sequence, genotypes, phenotypes, phylogenies||used in GMOD||info|
|AS||EBO||evolution in general||hypothetical?||info|
|JT||Genome comparison and rearrangements ontology||genomes, genome comparisons, their evolution and biological function||hypothetical?||info|
|EP||GraphML||markup language for graphs (general, no bio scope)||implemented/stable||info|
|JT||MAO||multiple alignments of DNA, RNA and protein sequences and structures||implementation||info|
|AS||NCBI data model||sequences, alignments, annotations||in use at NCBI||info|
|AS||NCBI Taxonomy||organismal classification, genetic code||used widely||info|
|EP||NeXML||character data, trees, models, meta-info||in development||info|
|EP||NEXUS format (Maddison et al., 1997)||character data, trees, assumptions, sets, notes||legacy, in use, extended||info|
|EP||OBO REL||general; set of allowed relations for OBO||in use; evolving||info|
|AS||pPOD||processing of data related to phylogenetic analysis workflows||in development||info|
|JT||PATO||phenotypic and trait ontology||implementation||info|
|JT||Protein Ontology||protein structure, function||implementation||info|
|JT||PRO/PROEVO ontology||protein evolutionary families, multiple endproducts of genes||in development||info|
|JT||RnaO||RNA sequence, structure, motifs, alignments||in development||info|
|EP||TreeBASE||character data, trees, meta-info on analyses||in use, v. II in progress||info|
The notes below ignore the artefacts that are not yet developed, such as PRO/PROEVO, EBO, and RnaO.
Overlaps in scope
- sequence alignments
- NEXUS and TreeBASE have blocks of character data that can hold residues, though neither really includes the concept that the residues are connected together in a "sequence", or that they have a linear coordinate system
- NeXML can represent sequence alignments
- MAO focuses on sequence alignments composed of sub-alignments
- CHADO may include sequence alignments
- NCBI data model includes alignments as a special kind of annotation that maps to locations on multiple sequences
- bibliographic references
- other information
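The contrast between a character-matrix view of an alignment (as in NEXUS or TreeBASE blocks) and a coordinate-based view (in the spirit of the NCBI data model, where an alignment maps to locations on several sequences) can be sketched as follows. This is only an illustration in Python: the `Segment` class and the `matrix_to_segments` function are hypothetical names, not part of any artefact surveyed here, and the two short sequences are made up.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned block (hypothetical structure, for illustration only)."""
    starts: dict  # sequence id -> 0-based start in the ungapped sequence
    length: int   # number of consecutive residues covered by the block


def matrix_to_segments(matrix):
    """Turn gapped rows into blocks with explicit per-sequence coordinates."""
    ids = list(matrix)
    n_cols = len(matrix[ids[0]])
    pos = {i: 0 for i in ids}  # next ungapped coordinate for each sequence
    segments = []
    pattern, starts, length = None, {}, 0
    for col in range(n_cols):
        # Which sequences have a residue (not a gap) in this column?
        present = tuple(i for i in ids if matrix[i][col] != "-")
        if present != pattern:
            # The gap pattern changed: close the current block, open a new one.
            if pattern:
                segments.append(Segment(starts, length))
            pattern, starts, length = present, {i: pos[i] for i in present}, 0
        length += 1
        for i in present:
            pos[i] += 1
    if pattern:
        segments.append(Segment(starts, length))
    return segments


# Character-matrix view: rows of states, with no notion of sequence coordinates.
matrix = {"seqA": "ACGT--AC", "seqB": "AC--GGAC"}

# Coordinate-based view: each block points at locations on the sequences.
segments = matrix_to_segments(matrix)
for seg in segments:
    print(seg)
```

In the matrix view a gap is just another symbol in a column; in the coordinate view a gap is implied by a block that mentions only one of the sequences.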
Features that we want
- CHADO's ability to link arbitrary data to leaves of a tree
- NCBI's taxonomy hierarchy, with its treatment of synonyms
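The two features above can be sketched in a few lines of Python. This is an assumption-laden illustration, not CHADO's schema or the NCBI Taxonomy data format: the `Taxon`, `Taxonomy`, and `Leaf` classes are hypothetical, and the sample names and synonym are chosen only as an example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Taxon:
    """A node in a taxonomy hierarchy (hypothetical structure)."""
    name: str
    parent: Optional["Taxon"] = None
    synonyms: set = field(default_factory=set)

    def lineage(self):
        """Return names from the root down to this taxon."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


class Taxonomy:
    """Index that resolves both primary names and synonyms to one taxon."""

    def __init__(self):
        self._index = {}

    def add(self, name, parent=None, synonyms=()):
        parent_taxon = self.resolve(parent) if parent is not None else None
        taxon = Taxon(name, parent_taxon, set(synonyms))
        for key in (name, *synonyms):
            self._index[key.lower()] = taxon
        return taxon

    def resolve(self, name):
        return self._index[name.lower()]


@dataclass
class Leaf:
    """A tree leaf that can carry arbitrary linked data."""
    label: str
    data: dict = field(default_factory=dict)


taxonomy = Taxonomy()
taxonomy.add("Homo")
taxonomy.add("Homo sapiens", parent="Homo", synonyms=["human"])

leaf = Leaf("sample_1")
leaf.data["taxon"] = taxonomy.resolve("Human")  # synonym lookup, case-insensitive
leaf.data["sequence"] = "ACGT"                  # any other data can be attached
print(leaf.data["taxon"].lineage())
```

Keeping leaf data in an open-ended mapping means new kinds of data can be linked to a leaf without changing the tree structure itself.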
Features that we don't want
- GCRO (the Genome comparison and rearrangements ontology) allows comparisons to be described in terms of transformations from a "source" to a "target" sequence. In reality, most pairwise comparisons do not involve a source and a target, but two targets that descend from a (hypothetical) common source. This is key to the evolutionary approach. Of course we want to be able to describe transformations, but we want them to correspond to evolutionary changes, not just ways to convert one set of symbols into another.
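The distinction can be made concrete with a small Python sketch. Everything here is hypothetical and simplified: the `Change` class and `apply_changes` function are illustrative names, only substitutions are modelled, and the sequences (including the ancestor) are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """A single substitution at a 0-based position (illustrative only)."""
    position: int
    before: str
    after: str


def apply_changes(seq, changes):
    """Apply substitutions to a sequence, checking each starting state."""
    residues = list(seq)
    for change in changes:
        if residues[change.position] != change.before:
            raise ValueError(f"unexpected state at position {change.position}")
        residues[change.position] = change.after
    return "".join(residues)


seq_a = "ACGT"
seq_b = "ATGA"

# Source-to-target view: an edit script that converts A into B.
# It says nothing about what actually happened in evolution.
edit_script = [Change(1, "C", "T"), Change(3, "T", "A")]
assert apply_changes(seq_a, edit_script) == seq_b

# Evolutionary view: both sequences descend from a hypothetical ancestor,
# and each change is placed on the branch where it is inferred to occur.
ancestor = "ACGA"
branch_to_a = [Change(3, "A", "T")]
branch_to_b = [Change(1, "C", "T")]
print(apply_changes(ancestor, branch_to_a), apply_changes(ancestor, branch_to_b))
```

The same two observed sequences are described in both cases, but only the second description assigns changes to lineages, which is the kind of transformation we want to be able to express.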
Conclusions relevant to CDAO
- What features should be emulated?
- What features should be avoided?
- Can we follow all of the OBO relations ontology rules?
We decided we were ready to carry out a preliminary design and implementation.