- 1 Overview
- 2 Background
- 3 Goals
- 4 Strategy for developing and evaluating an ontology
- 5 Results
- 6 Other resources
This page contains general information on the project to develop CDAO (Comparative Data Analysis Ontology), which is intended to provide a common ontological framework for evolutionary analysis regardless of the type of data involved. Ultimately the ontology will import concepts from domain-specific ontologies so as to integrate characters of a particular type (e.g., importing the Sequence Ontology to address sequence characters, or GO concepts to address protein function characters).
At our first meeting, we decided that an important goal was to develop what we called a "central unifying artefact" (CUA). Any representation in the domain of interest (evolutionary analysis), e.g., a PHYLIP file, a NEXUS file, or PAML options, can be translated into the CUA, which means that any representation can be translated into any other representation. We called it the "central unifying artefact" rather than an ontology because different group members thought of the CUA in different ways: as a file format, an ontology, or a database schema. At right is the database schema (UML diagram) Hilmar developed from the discussion at the first meeting.
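The translation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CharacterMatrix` is a hypothetical stand-in for the CUA (not the schema from the meeting), the PHYLIP reader handles only a simplified sequential layout, and FASTA is used as the second format purely for brevity (it is not one of the formats named above).

```python
from dataclasses import dataclass


@dataclass
class CharacterMatrix:
    """Hypothetical stand-in for the CUA: taxon label -> character-state string."""
    rows: dict[str, str]


def read_phylip(text: str) -> CharacterMatrix:
    """Parse simplified sequential PHYLIP: 'ntax nchar', then 'label sequence' lines."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    ntax, nchar = (int(x) for x in lines[0].split())
    rows = {}
    for ln in lines[1:1 + ntax]:
        label, seq = ln.split(None, 1)
        seq = seq.replace(" ", "")
        if len(seq) != nchar:
            raise ValueError(f"{label}: expected {nchar} characters, got {len(seq)}")
        rows[label] = seq
    return CharacterMatrix(rows)


def read_fasta(text: str) -> CharacterMatrix:
    """Parse FASTA: '>label' header lines followed by sequence lines."""
    rows = {}
    label = None
    for ln in text.strip().splitlines():
        ln = ln.strip()
        if not ln:
            continue
        if ln.startswith(">"):
            label = ln[1:].split()[0]
            rows[label] = ""
        elif label is not None:
            rows[label] += ln
    return CharacterMatrix(rows)


def write_phylip(matrix: CharacterMatrix) -> str:
    nchar = len(next(iter(matrix.rows.values()))) if matrix.rows else 0
    lines = [f"{len(matrix.rows)} {nchar}"]
    lines += [f"{label} {seq}" for label, seq in matrix.rows.items()]
    return "\n".join(lines) + "\n"


def write_fasta(matrix: CharacterMatrix) -> str:
    return "".join(f">{label}\n{seq}\n" for label, seq in matrix.rows.items())


# One reader and one writer per format; every conversion passes through the hub.
READERS = {"phylip": read_phylip, "fasta": read_fasta}
WRITERS = {"phylip": write_phylip, "fasta": write_fasta}


def translate(text: str, source: str, target: str) -> str:
    """Translate between any two registered formats via the central model."""
    return WRITERS[target](READERS[source](text))
```

With this design, supporting a new format means writing one reader and one writer against the central model, rather than a separate converter for every pair of formats.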
Which approach is best? We are not going to argue this point, other than to say that the consensus of computer scientists is that defining a formal ontology is a sound approach to the interoperability problem that takes advantage of powerful technologies now available. For instance, given current technology such as RDF-OWL, implementing a formal ontology automatically solves the problem of having a file format definition.
Therefore, a substantial portion of the group's effort will be devoted to the goal of developing a formal ontology.
First period
At the first meeting (May 20-23, 2007), we discussed the importance of an artefact such as an ontology that would play a central role in interoperability. We decided to start by applying the 80:20 rule, which in this case means finding the simplest scheme to support uncomplicated analyses of molecular sequence data. The follow-up for the meeting included two tasks:
- generate and add UML diagram and ERD for character state matrix (incl. multiple sequence alignment) and tree from Tuesday session (Hilmar)
- develop a strategy for developing and evaluating an ontology (Gopal, Enrico, Arlin)
Before the second meeting, we had not only devised a strategy but also had in hand a rough version of the first stage (use case list) and had begun the second stage (concept glossary).
Second period
At the second meeting (November 12-14, 2007), Enrico, Julie and Arlin began the third stage of analyzing related artefacts. On this basis, we identified only a small number of artefacts with important overlaps, and decided that the goal for the second period would be to sketch the initial version of the ontology, implement it, and write a manuscript describing the results, possibly using support from NESCent.
Strategy for developing and evaluating an ontology
Developing an ontology is a challenging task that is not easily resolved into a simple linear procedure. There are conceptually distinct activities (background research, design, implementation, evaluation), but the entire process involves iteration and feedback. For instance, we might redefine some terms in our concept glossary after examining other artefacts.
Nevertheless, the ordering below is not completely arbitrary. We should *start* with use cases, even if we come back to them later on; we can't start with evaluation, but we should end with it.
- define domain by means of use cases
- develop a concept glossary
- study available related ontologies (and other artefacts)
- implement the core concepts and relations of the domain
- evaluate the effectiveness of the ontology
Over the summer, group members began to put the above strategy into action.
Define domain by means of use cases
See Use Cases developed by phyloinformatics hackathon planners (need ref to hackathon paper).
- add references to research literature
- clarify language (see glossary, next section)
- link to actual instances with input and output data
The ontology is to provide coverage for the domain of evolutionary comparative analysis. Most inferences in genomics, for instance, come from comparative analysis, e.g., inferring the locations of genes and regulatory sites in the human genome by comparing it with the chimp and mouse genomes.
Evolutionary analysis (or "evolutionary comparative analysis") is comparative analysis undertaken within the framework provided by evolutionary theory. Unlike heuristic machine-learning approaches borrowed from computer science, evolutionary analysis treats the items to be compared as homologs that have evolved along paths of common descent (a tree) according to dynamics that reflect an evolutionary process of change. Ideally, this framework converts questions about interpreting similarities and differences into theoretically well-posed questions about evolutionary transitions in the states of characters along the branches of a tree.
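As a small illustration of what "transitions in the states of characters along the branches of a tree" can mean in practice, the sketch below counts the minimum number of state changes one character requires on a fixed rooted binary tree. It uses Fitch's small-parsimony algorithm, chosen here only as a familiar example and not as a method this project prescribes; the tree and the character states are made up.

```python
def fitch(tree, states):
    """Return (candidate_root_states, minimum_changes) for a rooted binary tree.

    tree   -- a leaf name (str) or a (left, right) tuple of subtrees
    states -- dict mapping each leaf name to its observed character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no change needed at this node.
        return shared, left_changes + right_changes
    # Children disagree: one transition must occur on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


# Illustrative data only.
tree = (("human", "chimp"), ("mouse", "rat"))
site = {"human": "A", "chimp": "A", "mouse": "G", "rat": "G"}

root_states, changes = fitch(tree, site)
print(changes)  # 1: a single A<->G transition explains this character
```

The question "why do primates and rodents differ at this site?" becomes "on which branch did the transition occur, and in which direction?", which is the kind of reformulation the paragraph above describes.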
Develop glossary of concepts
Once the domain of interest is defined, we can look at domain-relevant books, research articles, and software interfaces to identify the domain-specific vocabulary.
Then we define the terms informally, in a ConceptGlossary. Studying the concept glossary will reveal not just isolated definitions but also relations between concepts that eventually will become part of the ontology.
- greatly expand the number of terms
- get systematic review of content from domain experts
- start discussing relations among terms
Formalize the ontology by encoding its key concepts and relations
- use a formal language such as OWL
- use description logics (OWL-DL) to facilitate early evaluation (next section)
The current description of the ontology is on the CDAO wiki topic. Some initial results and demonstration projects are available from Brandon Chisholm's web site: http://www.cs.nmsu.edu/~bchisham/ontology/
We want to meet in early 2008 to assemble the first draft of the ontology, and to develop a presentation for publication and for meetings. We have decided to ask NESCent to support this via its short-term visitors program: here is our CDAOShortTermVisitorProposal.
Evaluate and refine the ontology in relation to its domain-specific uses
There are two kinds of evaluation. One of them is evaluation of the logical implications of the ontology using reasoning tools. The ontology that we build will imply certain relationships that are not given explicitly. We can infer these relationships and ask if they lead to logical contradictions, or to inaccuracies in a domain-specific context. Using description logics allows even more powerful evaluation of this type. We can populate an ontology with specific facts, and then ask a domain expert to interrogate the ontology about the implications of these facts, to see if the implications are accurate.
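The first kind of evaluation can be illustrated with a toy example. The sketch below infers implied subclass relationships by transitive closure and flags any class that ends up under two classes declared disjoint. The class names are hypothetical and are not actual CDAO terms, and a real evaluation would rely on an OWL-DL reasoner rather than this hand-rolled check.

```python
def superclasses(cls, subclass_of):
    """Return every asserted or inferred superclass of cls (transitive closure)."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def find_contradictions(subclass_of, disjoint):
    """Return (class, a, b) for each class that falls under disjoint classes a and b."""
    problems = []
    for cls in sorted(subclass_of):
        ancestors = superclasses(cls, subclass_of) | {cls}
        for a, b in disjoint:
            if a in ancestors and b in ancestors:
                problems.append((cls, a, b))
    return problems


# Hypothetical asserted axioms.
subclass_of = {
    "NucleotideCharacter": ["MolecularCharacter"],
    "MolecularCharacter": ["Character"],
    "AlignedSite": ["NucleotideCharacter", "Tree"],  # deliberate modelling error
}
disjoint = [("Character", "Tree")]

# A relationship that is implied but never stated explicitly:
inferred = superclasses("NucleotideCharacter", subclass_of)
print("Character" in inferred)  # True

# A logical contradiction exposed by inference:
problems = find_contradictions(subclass_of, disjoint)
print(problems)  # [('AlignedSite', 'Character', 'Tree')]
```

Inaccuracies in a domain-specific context are harder to automate: an inferred relationship may be logically consistent and still wrong biologically, which is why the domain expert's review remains part of this step.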
The second kind of evaluation takes place within a larger system of analysis. That is, the ontology can be integrated into an actual research project or, for example, into a workflow system. Use of the system may reveal weaknesses in the ontology.
Eventually we will need to come up with specific challenges that can be used to evaluate an ontology. Some properties to look for in a challenge:
- it is concrete or can be made concrete with specific examples
- it is circumscribed (limited, doesn't stretch out to include everything)
- it represents an actual challenge faced by researchers
- it is an ontology-related challenge and not some other kind of challenge
- we can do it (possibly with the help of collaborators)
- generalization of functional inference problem for proteins (integrate GO)
- do worm development characters exhibit evolutionary biases (Worm EvoDevo Case)
We want to remain aware of related projects, take advantage of opportunities for partnering, and avoid conflicts. This includes the active projects in the Related Artefacts page as well as:
A general strategy for transforming one notation to another can be found in the following two papers:
G. Gupta, H-F. Guo, A. Karshmer, E. Pontelli, et al. Semantic-Based Filtering: Logic Programming's Killer App? 4th International Symposium on Practical Aspects of Declarative Languages, LNCS 2257, Springer Verlag, pp. 82-100, Jan. 2002. http://www.utdallas.edu/~gupta/killerapp.pdf
J. R. Iglesias, E. Pontelli, G. Gupta, D. Ranjan, B. Milligan. Interoperability between Bioinformatics Tools: A Logic Programming Approach. 3rd Symposium on Practical Aspects of Declarative Languages, LNCS 1990, Springer Verlag, pp. 153-168, 2001. http://www.utdallas.edu/~gupta/interop.pdf