From Evolutionary Informatics Working Group
Revision as of 18:00, 26 November 2007

Framework for a Comparative Data Analysis Ontology

  • Enrico Pontelli, Julie Thompson, Arlin Stoltzfus
  • other authors as needed, included in alphabetical or other order


Abstract

Formal ontologies have the potential to improve the practice of biological data analysis by facilitating semantic transformation and other forms of automated reasoning; storage, transfer, re-use and integration of data; and validation and other forms of quality assurance. Here we describe the initial implementation of an ontology for Evolutionary Comparative Analysis, an analysis framework in which the similarities and differences among OTUs ("Operational Taxonomic Units") such as genes or proteins (or genomes, species, etc.) are understood to have emerged from common ancestral entities by a branching process of transitions in the states of characters. In consultation with a group of domain scientists specializing in phylogenetic analysis software, we documented use cases, developed a concept glossary, and studied related artefacts (ontologies, file formats, and database schemata) in order to identify core concepts and relations. While the NEXUS file format and the TreeBASE II database schema capture many of the core concepts and relations of comparative data analysis, no formal ontology does so. We therefore implemented a Comparative Data Analysis Ontology (CDAO) using the OWL-DL representation language. Here we describe the core concepts and relations of CDAO and an initial evaluation of its implementation.


Ontologies and Interoperability

Comparative Data Analysis

can use material from wiki for this

NESCent EvoInfo Working Group

Development Strategy

The strategy devised to develop an ontology involves five distinct operations:

  1. define domain by means of use cases
  2. develop a concept glossary
  3. study available related ontologies (and other artefacts)
  4. implement the core concepts and relations of the domain
  5. evaluate the effectiveness of the ontology

These operations do not form a strictly linear sequence, because feedback and iteration are required. Nevertheless, the ordering above is not arbitrary: an initial iteration of an earlier step must be completed before a later step can be attempted.

Use Cases

to do:

  • add some introductory comments
  • insert re-worded list here from grant proposal

Concept Glossary

A concept glossary was developed with the participation of members of the NESCent EvoInfo Working Group. The current version of the glossary, available at, contains 63 defined terms and 65 undefined terms. Where relevant, the definitions also record subclass and synonymy relationships between terms.
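As a rough illustration of what a glossary entry of this kind might hold, the following Python sketch models a term with an optional definition, an optional parent term (subclass relationship) and a list of synonyms. The names `GlossaryTerm`, `count_defined` and `canonical_name` are hypothetical, and the three entries are toy placeholders rather than content taken from the actual glossary.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GlossaryTerm:
    """One glossary entry (hypothetical structure, not the wiki's format)."""
    name: str
    definition: Optional[str] = None      # None means the term is still undefined
    subclass_of: Optional[str] = None     # name of the parent term, if any
    synonyms: List[str] = field(default_factory=list)


def count_defined(terms: List[GlossaryTerm]) -> Tuple[int, int]:
    """Return (number of defined terms, number of undefined terms)."""
    defined = sum(1 for term in terms if term.definition)
    return defined, len(terms) - defined


def canonical_name(terms: List[GlossaryTerm], name: str) -> Optional[str]:
    """Map a term name or one of its synonyms to the preferred term name."""
    for term in terms:
        if name == term.name or name in term.synonyms:
            return term.name
    return None


# Toy entries for illustration only; the definition text is a placeholder.
terms = [
    GlossaryTerm("OTU", definition="Operational Taxonomic Unit",
                 synonyms=["operational taxonomic unit"]),
    GlossaryTerm("gene", subclass_of="OTU"),
    GlossaryTerm("transition model"),
]

print(count_defined(terms))
print(canonical_name(terms, "operational taxonomic unit"))
```

Recording subclass and synonymy links explicitly, even in an informal glossary, should make it easier to carry them over later as class hierarchies and equivalences in the ontology.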

to do:

  • add specific examples of how the glossary clarifies core concepts and relations

Analysis of Related Artefacts

to do:

  • add list of related artefacts from wiki
  • draw conclusions
    • what overlaps exist
    • what design principles should be re-used
    • what design principles should be avoided
    • what artefacts should be incorporated directly

Design Principles

to do:

  • add list of design principles

Initial Implementation

Core concepts and relations

some core concepts:

  • character-state data matrix
  • phylogeny (history, tree, reconstruction)
  • transition model
  • analysis
  • publication
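
To make the first two concepts in the list above more concrete, here is a minimal Python sketch of a character-state data matrix and a rooted phylogeny. It is only an illustration under stated assumptions: the class and function names (`CharacterStateMatrix`, `TreeNode`, `varying_characters`) and the toy data are invented here and are not the classes or properties of CDAO itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CharacterStateMatrix:
    """Rows are OTUs, columns are characters, cells hold observed states."""
    characters: List[str]
    rows: Dict[str, List[str]] = field(default_factory=dict)

    def add_otu(self, otu: str, states: List[str]) -> None:
        # Every OTU must supply exactly one state per character.
        if len(states) != len(self.characters):
            raise ValueError("one state per character is required")
        self.rows[otu] = list(states)

    def state(self, otu: str, character: str) -> str:
        return self.rows[otu][self.characters.index(character)]


@dataclass
class TreeNode:
    """A node in a rooted phylogeny; leaf nodes carry an OTU label."""
    label: Optional[str] = None
    children: List["TreeNode"] = field(default_factory=list)

    def leaf_labels(self) -> List[str]:
        if not self.children:
            return [self.label] if self.label is not None else []
        labels: List[str] = []
        for child in self.children:
            labels.extend(child.leaf_labels())
        return labels


def varying_characters(matrix: CharacterStateMatrix) -> List[str]:
    """Characters whose states differ among OTUs, so that at least one
    state transition must have occurred somewhere on the tree."""
    varying = []
    for index, character in enumerate(matrix.characters):
        observed = {states[index] for states in matrix.rows.values()}
        if len(observed) > 1:
            varying.append(character)
    return varying


# Toy data for illustration only.
matrix = CharacterStateMatrix(characters=["site1", "site2"])
matrix.add_otu("human", ["A", "G"])
matrix.add_otu("chimp", ["A", "G"])
matrix.add_otu("mouse", ["A", "T"])

# ((human, chimp), mouse)
tree = TreeNode(children=[
    TreeNode(children=[TreeNode("human"), TreeNode("chimp")]),
    TreeNode("mouse"),
])

print(varying_characters(matrix))
print(tree.leaf_labels())
```

In this sketch the matrix and the tree are linked only through shared OTU labels. An ontology would presumably state that link as an explicit relation between the two, which is part of what a formal representation adds over a file format.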

to do:

  • explain core concepts
  • illustrate with parts of ontology


to do:

  • give an example with instance data

Connections with other ontologies

  • explain planned connections with other artefacts
    • cited references (pubmed biblio item?)
    • NCBI taxonomy


Evaluation Strategy

  • semantic transformation projects
  • other

Lessons Learned


Acknowledgements

  • NESCent evoinfo and phyloinformatics participants
  • NESCent informatics leadership
  • funding sources

Literature Cited