Difference between revisions of "CDAO"

From Evolutionary Informatics Working Group
=Comparative Data Analysis Ontology=
 
''Note: this is mostly a project page for team members, focused on current issues and ongoing work.''

The Comparative Data Analysis Ontology (CDAO) is intended to provide a framework for understanding data in the context of evolutionary-comparative analysis.  This comparative approach is used commonly in bioinformatics and other areas of biology to draw inferences from a comparison of differently evolved versions of something, such as differently evolved versions of a protein.  In this kind of analysis, the things-to-be-compared typically are classes called 'OTUs' (Operational Taxonomic Units).  The OTUs can represent biological species, but also may be drawn from higher or lower in a biological hierarchy-- anywhere from molecules to communities.  The features to be compared among OTUs are rendered in an entity-attribute-value model sometimes referred to as the 'character-state data model'.  For a given character, such as 'beak length', each OTU has a state, such as 'short' or 'long'.  The differences between states are understood to emerge by a historical process of evolutionary transitions in state, represented by a model (or rules) of transitions along with a phylogenetic tree.  CDAO provides the framework for representing OTUs, trees, transformations, and characters.  The representation of characters and transformations may depend on imported ontologies for a specific type of character.
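In code terms, the character-state data model described above is just a mapping from (OTU, character) pairs to states. A minimal illustrative sketch (all names here are invented for the example, not actual CDAO terms):

```python
# Minimal sketch of the character-state data model: OTUs are rows,
# characters are columns, and each cell holds a state.  The OTU names,
# character names, and states below are made up for illustration.
matrix = {
    ("otuA", "beak_length"): "short",
    ("otuB", "beak_length"): "long",
    ("otuA", "wing_color"): "red",
    ("otuB", "wing_color"): "red",
}

def state_of(otu, character):
    """Return the state of `character` for `otu`."""
    return matrix[(otu, character)]

print(state_of("otuB", "beak_length"))  # -> long
```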
  
 
== Quick links ==
 
* Source file [http://cdao.cvs.sourceforge.net/*checkout*/cdao/cdao/OWL/cdao.owl cdao.owl] on SourceForge
* The [[CDAOManuscript|manuscript]] describing the initial version of CDAO.

== Overview of Goals and Project Trajectory ==
 

=== develop a draft ontology ===
# initial study (done Nov 2007)
# represent in OWL (done April 2008)
# revise (in progress)
# create sourceforge project (done April 8)
# integrate a domain ontology such as amino acids (to do: Enrico)
# clean up (to do: Enrico by April 17)
## reorganize properties
## assign disjoint classes
## close the classes
# annotate (to do: Arlin major concepts by April 25)
## There is a nice summary of the meaning of the various Dublin Core Concepts at: http://www.sics.se/~preben/DC/DC_guide.html

=== find a suitable software development environment ===

We have been using Protege, which was easy to use initially but has numerous flaws:
* built-in reasoners don't always work (esp. FaCT++) and sometimes cause crashes
** FaCT++ seems stable under Windows
** On the other hand, there seems to be no way to access the full querying capabilities (e.g., of Pellet); older versions of Protege allowed this, but Protege 4 does not.
* allows invalid edits (can edit and save a file that cannot be parsed again)
** The format creates occasional confusions; for example, OWL functional syntax occasionally confuses classes and individuals
* OWL functional syntax output format apparently results in loss of data
* OWL functional syntax output not intelligible to Racer
** [Enrico] actually I was unable to load '''any''' of the formats produced by Protege 4 into Racer; some produce a parser error (e.g., functional syntax), others simply do not produce any data in Racer (e.g., OWL/XML)
* output formats not fully interchangeable with other systems
** Swoop recognizes OWL/XML but not the other formats
* apparent inability to create separate files for instance data
** It is probably possible to edit a separate ontology that contains only instances; this did not work in our case, though. I tried to create an ontology in Protege which imports CDAO (and which was meant to contain our instances), but Protege failed to import CDAO.
* very limited graphics (can't see how to show object properties or change how classes are labeled)
* apparently allows OWL-Full properties to be added to OWL-DL ontologies
  
We want to find a new software environment with the following features:
* stable implementation of OWL-DL
* validating editor
* ability to create separate instance files
* interoperable formats (especially, interoperable with Racer)
* ability to visualize more than just class names and is_a relationships

The candidates:
 
* [http://code.google.com/p/swoop/ swoop]
* [http://doddle-owl.sourceforge.net doddle]
* Altova SemanticWorks: a commercial product. It has a very nice visualization of ontologies (including properties). I am trying to understand whether it can handle OWL 1.1. It SEEMS to be able to read the RDF/XML version from Protege!
* [http://jena.sourceforge.net/index.html jena] is an API, not a stand-alone editor
* [http://www.alphaworks.ibm.com/tech/semanticstk alphaworks]
* the following projects appear to be inactive or inadequate in scope
** oil-ed
** [http://sofa.projects.semwebcentral.org/ sofa], latest updates are from 2005
** SWeDe: extension of Eclipse, not supported since 2005
** CmapTools: very nice graphical interface (you build the ontology by actually drawing classes and links), but very limited expressive power (e.g., no way to describe features of a property, e.g., transitivity or functional behavior)
 
=== evaluate the draft ontology ===

evaluation plans by performance criterion
# ontology can represent character data instances
## Brandon thinks he can automate NEXUS-to-CDAO import of data
## Arlin can supply token examples
# ontology provides computability of (some) useful queries
## list of phylo-related queries in [http://www.cs.rice.edu/~nakhleh/Papers/bibe03_final.pdf Nakhleh, et al]
## other queries that we devise
### is node X ancestor_of node Y
### return nodes in lineage from root to X
### find distance (X, Y)
### what is character type of state datum X
### what is coordinate of state datum X
### what is homolog of state datum X in otu Y
### return slice of data matrix defined by subtree( otuA, otuB)
# ontology does not duplicate existing ontologies - see table in paper
# ontology integrates related ontologies for character domains, SUCH AS (at least one of):
## amino acids from amino acid ontology of Drummond, et al
## nucleotides (is_a Nucleic_acid_base_residue in top-bio ontology of OBO)
## GO terms
## morphology, e.g., C. elegans (worm) [http://purl.org/obo/owl/WBbt gross anatomy ontology]
# ontology is normalized or modular according to [http://www.cs.man.ac.uk/%7Erector/papers/rector-modularisation-kcap-2003-distrib.pdf Rector]
## should trees be separate?
### consider SUMO (a [http://rocling.iis.sinica.edu.tw/kifb/en/ SUMO browser])
### [Enrico] Don't see the relevance of SUMO (generic, domain independent ontology).
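Several of the queries above (ancestor_of, lineage, distance) reduce to walks over a parent map once a tree is instantiated. A rough stdlib-only sketch, using an invented four-OTU tree rather than real CDAO instance data:

```python
# Sketch of the tree queries listed above (ancestor_of, lineage, distance).
# The tree and all node names are invented for illustration.
parent = {
    "otuA": "htuAB", "otuB": "htuAB",
    "otuC": "htuCD", "otuD": "htuCD",
    "htuAB": "root", "htuCD": "root",
}

def lineage(node):
    """Return nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def is_ancestor_of(x, y):
    """True if x is a proper ancestor of y."""
    return x in lineage(y)[1:]

def mrca(x, y):
    """Most recent common ancestor of x and y."""
    ancestors_of_x = lineage(x)
    for node in lineage(y):
        if node in ancestors_of_x:
            return node

def distance(x, y):
    """Number of edges on the path between x and y."""
    m = mrca(x, y)
    return lineage(x).index(m) + lineage(y).index(m)

print(is_ancestor_of("htuAB", "otuB"))  # -> True
print(mrca("otuA", "otuC"))             # -> root
print(distance("otuA", "otuB"))         # -> 2
```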
  
 
=== release the initial ontology version (May 20) ===
# submit to OBO (to do: Enrico by May 9)
# announce the ontology with an initial publication
## write manuscript (to do: Julie by May 11)
## venue choice: bmc evolutionary biology (done, April 8)
## submit manuscript (May 20)
# set up public web site domain: evolutionaryontology.org/cdao (in progress: Arlin)
# prepare presentation for scientific meetings in Summer 2008 (done: Marseille (Francisco); to do: SSE)
  
 
=== further develop an ontology to a useful version for public release ===
# subject to tests and challenges
## representation (can the ontology represent complex and diverse data sets?)
### neXML (Brandon at next evoinfo meeting?)
### NEXUS
### Pandit and other formats?
## round-trip test (formal version of representation challenge)
## validation
### try to insert wrong value
## simple reasoning in Protege
### e.g., has_Amino_Acid_State value "A"
### could use Protege API to write simple test application to run such tests
## more complex reasoning with Racer or other external reasoner
### not sure how this would work
### ideally we could test for correct computation of MRCA, lineage, and so on
# develop tools
# carry out demonstration projects (ideal properties for a demo: significant; extends what is possible; relies critically on CDAO)
## functional inference generalization
## natural language processing (via CDAO) to create literature resource (Enrico, idea for possible ASU collaborators)
## TreeBase input validator
## translation tools
## translate high-value content (Pandit, KOGs, etc)
## other
# need to be aggressive - get attention, hold workshop with interested people
 
=== completed ontology ===
# projects
## FIGENIX human proteome history project (Julie, Francisco)
## phylogenetic profiles (Julie)
 
=== ontology development framework ===

OWL-DL using Protege:
* Some slides giving a brief introduction to the use of Protege: [http://www.cs.nmsu.edu/~bchisham/ontology/protege-presentation/html/protege-presentation.html HTML] [http://www.cs.nmsu.edu/~bchisham/ontology/protege-presentation/pdf/protege-presentation.pdf PDF] [http://www.cs.nmsu.edu/~bchisham/ontology/protege-presentation/swf/protege-presentation.swf Flash]
 
====Initial test-driven development strategy====

To get started, we propose to use a test-driven strategy based on explicit tests of the basic concepts from the [[ConceptGlossary]].  Attached is the [[media:prioritized_concept_list.txt]] (1 is highest priority, 3 is lowest).  Here is how it works.  Imagine we have a *high-level test language* and this is the code for testing the ontology on its implementation of the "ancestor" concept:

  load_ontology("CDAO");
  load_data("ancestor_test.nex");
  statements = { "otuA is_a ancestor_of otuB", "htuAB is_a ancestor_of otuB" };
  truth_value = { "false", "true" };
  evaluate( statements, truth_value );
  
Here is the "ancestor_test.nex" file:

  #NEXUS
  BEGIN TAXA;
       dimensions ntax=4;
       taxlabels A B C D;
  END;
  BEGIN TREES;
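A rough Python rendering of the harness sketched above, with the reasoner faked by a lookup table (all function names here are hypothetical; a real harness would pose each statement to an OWL reasoner loaded with CDAO plus the test data):

```python
# Hypothetical shape of the high-level test harness described above.
# The "reasoner" is a set of known facts; a real version would delegate
# to an OWL reasoner after loading CDAO and ancestor_test.nex.

def load_known_facts():
    # Stand-in for load_ontology("CDAO") + load_data("ancestor_test.nex").
    return {"htuAB is_a ancestor_of otuB"}

def evaluate(statements, expected, facts):
    """Compare each statement's truth value against the expected answer."""
    results = []
    for stmt, want in zip(statements, expected):
        got = "true" if stmt in facts else "false"
        results.append(got == want)
    return results

facts = load_known_facts()
statements = ["otuA is_a ancestor_of otuB", "htuAB is_a ancestor_of otuB"]
truth_value = ["false", "true"]
print(evaluate(statements, truth_value, facts))  # -> [True, True]
```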
  
 
I'm hoping to [http://www.cs.nmsu.edu/~epontell/tests.tar attach a tar file] with tests for concepts, but the wiki does not like tar files.  I can send it via email.  The files come in pairs,

  <concept><test_number>.nex
  <concept><test_number>.tab
 
====More elaborate test data sets====

Each data set comes with a tree and a character matrix in NEXUS format.  To explore these data sets you may wish to:
* view the NEXUS files in a text editor
* view the data in a phylogenetic context with [http://www.molevol.org/nexplorer Nexplorer]
* translate the files to XML with Rutger Vos's [http://www.nexml.org/nexml/phylows NEXUS-to-neXML conversion server].

There are four different categories of character sets:
 
* morphology: discrete morphological characters with ad hoc numeric encoding (e.g., 0 = absent, 1 = present)

The DNA data are "CDS" or "coding sequence" data, meaning the sequence of nucleotide triplets in the protein-coding part of a gene.
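For illustration, splitting a CDS into its nucleotide triplets is a one-liner (the sequence below is made up):

```python
# Illustrative only: split a coding sequence into codons (nucleotide
# triplets), as described above.  Trailing bases that do not complete
# a triplet are dropped.
def codons(cds):
    """Return the list of nucleotide triplets in a coding sequence."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

print(codons("ATGGCATAA"))  # -> ['ATG', 'GCA', 'TAA']
```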
  
There are three grades of difficulty:
* Simplified: small number of OTUs and characters; unambiguous states; single bifurcating tree
* Typical: may contain many OTUs, multiple trees, polytomies, other stuff
  
 
remains to be done
* category 2 items from yesterday

Incorporated (provisionally) in Version 12:
* imported equivalence class of amino acids SpecificAminoAcid from OWL amino-acid (www.co-ode.org/ontologies/amino-acid/2005/10/11/amino-acid.owl)
* found classes for nucleotides in www.bioontology.org/files/4531/basic-vertebrate-gross-anatomy.owl; these are actually imported from http://www.co-ode.org/ontologies/basic-bio/top-bio.owl
  
 
Incorporated in Version 11:
* gap state or missing data
* homologous to
* taxonomic link as TU annotation

Incorporated in Version 10:
* lineage
* transformation types
 
The next sub-section of the ontology to be refined was the part representing the character state data matrix. Although this initially seemed to be a relatively simple structure, a number of complications were encountered because of the various data types we wanted to include (nucleotide, amino acid, continuous data, and other discrete data such as GO terms, EC numbers, anatomy, etc.) and the large number of properties attached to each class.

Two alternative representations were considered to take into account the different data types. The first tried to minimise the number of classes specific to each data type; however, this turned out to be too difficult to represent in the OWL language. The second option defined a number of type-specific classes; although this is not an ideal structure, with a number of duplications, it simplified the ontology structure. The second option also defines a set of restrictions that will allow us to check data for consistency with the ontology reasoner.
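The second option can be mimicked outside OWL as type-specific classes that each close their set of legal states; here a constructor check stands in for the reasoner's restriction checking (class names and state sets are illustrative, not the actual CDAO terms):

```python
# Sketch of type-specific state classes with restrictions, mirroring the
# second design option above.  In OWL a reasoner enforces the restrictions;
# here a constructor check plays that role.  Names are illustrative.
class StateDatum:
    allowed = None  # subclasses close the set of legal states

    def __init__(self, state):
        if self.allowed is not None and state not in self.allowed:
            raise ValueError(f"{state!r} is not a legal {type(self).__name__}")
        self.state = state

class NucleotideStateDatum(StateDatum):
    allowed = {"A", "C", "G", "T"}

class AminoAcidStateDatum(StateDatum):
    allowed = set("ACDEFGHIKLMNPQRSTVWY")

class ContinuousStateDatum(StateDatum):
    allowed = None  # continuous data: any numeric value passes

print(NucleotideStateDatum("G").state)  # -> G
```

The duplication across subclasses is the cost the text mentions; the benefit is that an inconsistent datum is rejected at the point of entry.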
  
 
At this stage the validity of the draft ontology was tested by adding instance data into Protege and running a reasoner to check for inconsistencies.

In addition, we
* generalized the edge annotation concept
* allowed for residues to have coordinates in a sequence
* learned more about the bugs/features of reasoners in Protege 4
** FaCT++ does not seem to work in the Mac version
** errors in instance data trigger a Java fault (not a reasoner error) for both FaCT++ and Pellet
** DL Queries are limited to classes and simple conjunctions with properties
  
 
1. Revisions

We added synonyms to the ontology in the needed places.  We also separated characters and their related classes and properties into a separate ontology in order to better encapsulate these elements.

2. Examples

We started work encoding the examples provided on the Wiki page.  This encoding is not yet complete, but we are making progress, and have identified and made a few necessary changes to fix earlier errors such as
 
4. Updates available

We have uploaded the current versions of our work to [http://www.cs.nmsu.edu/~bchisham/ontology/].  It's now available as both OWL and Protege Project files.

====Day 1, Monday====
 
===telecon, 14 March, 2:00 UTC===

skipped this

===Telecon, 7 March, 2007===
 
present: Francisco Prodoscimi, Julie Thompson, Enrico Pontelli, Arlin Stoltzfus

What activities to do before the meeting?  Plan for development?
# represent 4 simple test cases
## nt alignment plus tree
## prot alignment plus tree
## set and logic operations on characters and OTUs
## tree operations (clade selection, prune)
## other?
# map ontology to other representations
## NEXUS
## neXML
# start compiling list of concepts that are missing
## review Enrico's proposal
# look ahead to future challenges
  
 
Other issues for meeting and for paper
* what is the scope?
* How to integrate with other ontologies?
** table from 'related artefacts' exercise
** genetic code as a test case for integration
*** requires nt aa mapping to specify code
*** requires species taxonomy to assign code to species

Revision as of 13:53, 6 May 2008

Comparative Data Analysis Ontology

Note: this is mostly a project page for team members, focused on current issues and ongoing work.

The Comparative Data Analysis Ontology (CDAO) is intended to provide a framework for understanding data in the context of evolutionary-comparative analysis. This comparative approach is used commonly in bioinformatics and other areas of biology to draw inferences from a comparison of differently evolved versions of something, such as differently evolved versions of a protein. In this kind of analysis, the things-to-be-compared typically are classes called 'OTUs' (Operational Taxonomic Units). The OTUs can represent biological species, but also may be drawn from higher or lower in a biological hierarchy-- anywhere from molecules to communities. The features to be compared among OTUs are rendered in an entity-attribute-value model sometimes referred to as the 'character-state data model'. For a given character, such as 'beak length', each OTU has a state, such as 'short' or 'long'. The differences between states are understood to emerge by a historical process of evolutionary transitions in state, represented by a model (or rules) of transitions along with a phylogenetic tree. CDAO provides the framework for representing OTUs, trees, transformations, and characters. The representation of characters and transformations may depend on imported ontologies for a specific type of character.

Quick links

Overview of Goals and Project Trajectory

develop a draft ontology

  1. initial study (done Nov 2007)
  2. represent in OWL (done April 2008)
  3. revise (in progress)
  4. create sourceforge project (done April 8)
  5. integrate a domain ontology such as amino acids (to do: Enrico)
  6. clean up (to do: Enrico by April 17)
    1. reorganize properties
    2. assign disjoint classes
    3. close the classes
  7. annotate (to do: Arlin major concepts by April 25)
    1. There is a nice summary of the meaning of the various Dublin Core Concepts at: http://www.sics.se/~preben/DC/DC_guide.html

find a suitable software development environment

We have been using Protege, which was easy to use initially but has numerous flaws:

  • built-in reasoners don't always work (esp. FACT++) and sometimes cause crashes
    • Fact++ seems stable under Windows
    • On the other hand there seem to be no way to have access to the full querying capabilities (e.g., of Pellet); older versions of Protege allowed this, Protege 4 does not.
  • allows invalid edits (can edit and save a file that cannot be parsed again)
    • The format creates occasional confusions; for example, OWL functional syntax occasionally confuses classes and individuals
  • OWL functional syntax output format apparently results in loss of data
  • OWL functional syntax output not intelligible to racer
    • [Enrico] actually I was unable to load any of the formats produced by Protege 4 into Racer; some produce a parser error (e.g., functional syntax), others simply do not produce any data in Racer (e.g., OWL/XML)
  • output formats not fully interchangeable with other systems
    • Swoop recognizes OWL/XML but not the other formats
  • apparent inability to create separate files for instance data
    • It is probably possible to edit a separate ontology that contains only instances; this did not work in our case though; I have tried to create an ontology in Protege which imports CDAO (and which was meant to contain our instances), but Protege failed to import CDAO.
  • very limited graphics (can't see how to show object properties or change how classes are labeled)
  • apparently allows OWL-full properties to be added to OWL-DL ontologies

We want to find a new software environment with the following features:

  • stable implementation of OWL-DL
  • validating editor
  • ability to create separate instance files
  • interoperable formats (especially, interoperable with racer)
  • ability to visualize more than just class names and is_a relationships.

The candidates:

  • swoop
  • doddle
  • Altova SemanticWorks [1] It's a commercial product. It has a very nice visualization of ontologies (including properties). I am trying to understand whether it can handle OWL 1.1. It SEEMS to be able to read the RDF/XML version from Protege!
  • jena is an API, not a stand-alone editor
  • alphaworks
  • the following project appear to be inactive or inadequate in scope
    • oil-ed
    • sofa, latest updates are from 2005
    • SWeDe [2] extension of Eclipse, not supported since 2005
    • CmapTools [3] Very nice graphical interface (you build the ontology by actually drawing classes and links), but very limited expressive power (e.g., no way to describe features of a property, e.g., transitivity or functional behavior)

evaluate the draft ontology

evaluation plans by performance criterion

  1. ontology can represent character data instances
    1. Brandon thinks he can automate NEXUS-to-CDAO import of data
    2. Arlin can supply token examples
  2. ontology provides computability of (some) useful queries
    1. list of phylo-related queries in Nakleh, et al
    2. other queries that we devise
      1. is node X ancestor_of node Y
      2. return nodes in lineage from root to X
      3. find distance (X, Y)
      4. what is character type of state datum X
      5. what is coordinate of state datum X
      6. what is homolog of state datum X in otu Y
      7. return slice of data matrix defined by subtree( otuA, otuB)
  3. ontology does not duplicate existing ontologies - see table in paper
  4. ontology integrates related ontologies for character domains, SUCH AS (at least one of):
    1. amino acids from amino acid ontology of Drummond, et al
    2. nucleotides (is_a Nucleic_acid_base_residue in the top-bio ontology of OBO)
    3. GO terms
    4. morphology, e.g., C. elegans (worm) gross anatomy ontology
  5. ontology is normalized or modular according to Rector
    1. should trees be separate?
      1. consider SUMO (a SUMO browser)
      2. [Enrico] Don't see the relevance of SUMO (generic, domain independent ontology).
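
The tree queries listed under criterion 2 (ancestor_of, lineage, distance) can be prototyped independently of the ontology, which helps pin down the expected answers before a reasoner is involved. A sketch in plain Python over a child-to-parent map; the node names follow the example tree used in the tests below, and nothing here is an actual CDAO API:

```python
# Sketch of queries 2.1-2.3 over a simple rooted tree, represented as a
# child -> parent map (node names are from the ancestor-test tree, not CDAO terms).
parent = {"otuA": "htuAB", "otuB": "htuAB",
          "otuC": "htuCD", "otuD": "htuCD",
          "htuAB": "htuABCD", "htuCD": "htuABCD"}

def lineage(node):
    """Return the nodes on the path from the root down to `node`."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return list(reversed(path))

def is_ancestor(x, y):
    """Is node x a (strict) ancestor of node y?"""
    return x in lineage(y)[:-1]

def distance(x, y):
    """Number of edges between x and y (via their MRCA)."""
    lx, ly = lineage(x), lineage(y)
    common = sum(1 for a, b in zip(lx, ly) if a == b)  # shared prefix = path to MRCA
    return (len(lx) - common) + (len(ly) - common)
```

For example, is_ancestor("htuAB", "otuB") is True, and distance("otuA", "otuC") is 4.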

release the initial ontology version (May 20)

  1. submit to OBO (to do: Enrico by May 9)
  2. announce the ontology with an initial publication
    1. write manuscript (to do: Julie by May 11)
    2. venue choice: bmc evolutionary biology (done, April 8)
    3. submit manuscript (May 20)
  3. set up public web site domain: evolutionaryontology.org/cdao (in progress: Arlin)
  4. prepare presentation for scientific meetings in Summer 2008 (done: Marseille (Francisco); to do: SSE)

further develop an ontology to a useful version for public release

  1. subject to tests and challenges
    1. representation (can the ontology represent complex and diverse data sets?)
      1. neXML (Brandon at next evoinfo meeting?)
      2. NEXUS
      3. Pandit and other formats?
    2. round-trip test (formal version of representation challenge)
    3. validation
      1. try to insert wrong value
    4. simple reasoning in Protege
      1. e.g., has_Amino_Acid_State value "A"
      2. could use Protege API to write simple test application to run such tests
    5. more complex reasoning with RACER or other external reasoner
      1. not sure how this would work
      2. ideally we could test for correct computation of MRCA, lineage, and so on
  2. develop tools
  3. carry out demonstration projects (ideal properties for a demo: significant; extends what is possible; relies critically on CDAO)
    1. functional inference generalization
    2. natural language processing (via CDAO) to create literature resource (Enrico, idea for possible ASU collaborators)
    3. TreeBase input validator
    4. translation tools
    5. translate high-value content (Pandit, KOGs, etc)
    6. other
  4. further develop web resource with documentation

feedback, community involvement

  1. need to be aggressive - get attention, hold workshop with interested people

completed ontology

  1. projects
    1. FIGENIX human proteome history project (Julie, Francisco)
    2. phylogenetic profiles (Julie)

Approach

ontology development framework

OWL-DL using Protege:

  • Some slides giving a brief introduction to the use of Protege [[4]HTML] [[5]PDF] [[6]Flash]

Initial Implementation (NESCent meetings)

see notes below

Tests and Challenges

Initial test-driven development strategy

To get started, we propose to use a test-driven strategy based on explicit tests of the basic concepts from the ConceptGlossary. Attached is the media:prioritized_concept_list.txt (1 is highest priority, 3 is lowest). Here is how it works: imagine we have a *high-level test language*, and this is the code for testing the ontology's implementation of the "ancestor" concept:

load_ontology("CDAO");
load_data("ancestor_test.nex");
statements = { "otuA is_a ancestor_of otuB", "htuAB is_a ancestor_of otuB" };
truth_value = { "false", "true" };
evaluate( statements, truth_value );

Here is the "ancestor_test.nex" file:

#NEXUS
BEGIN TAXA;
      dimensions ntax=4;
      taxlabels otuA otuB otuC otuD;
END;
BEGIN TREES;
      tree bush = [&R] ((otuA,otuB)htuAB,(otuC,otuD)htuCD)htuABCD;
END;

I'm hoping to [[7]attach a tar file] with tests for concepts, but the wiki does not like tar files. I can send it via email. The files come in pairs,

<concept><test_number>.nex
<concept><test_number>.tab

The first file is a NEXUS file with the data. The second file is a table of statements for evaluation, with fields statement_number, truth_value, statement. Right now I am using a three-valued logic (true, false, and unknown/indeterminate); e.g., if the tree is not rooted, then whether an internal node is the ancestor of some other node is indeterminate.
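
A minimal checker for these statement tables might look like the following. The tab-separated layout (statement_number, truth_value, statement) is taken from the description above; the toy evaluator standing in for a real reasoner call is purely illustrative:

```python
# Sketch: compare a .tab file of expected truth values against an evaluator.
# Assumed format (from the description above): tab-separated fields
# statement_number, truth_value, statement; values are true/false/unknown.

def check(tab_text, evaluate):
    """Yield (statement, expected, got) for every disagreement."""
    for line in tab_text.splitlines():
        if not line.strip():
            continue
        _num, expected, statement = line.split("\t", 2)
        got = evaluate(statement)
        if got != expected:
            yield statement, expected, got

# Toy evaluator standing in for a real reasoner call.
def toy_evaluate(statement):
    facts = {"htuAB is_a ancestor_of otuB": "true"}
    return facts.get(statement, "false")

tab = ("1\tfalse\totuA is_a ancestor_of otuB\n"
       "2\ttrue\thtuAB is_a ancestor_of otuB\n")
```

With this table and toy evaluator, check() yields no disagreements; a real harness would replace toy_evaluate with a call into the loaded ontology.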

More elaborate test data sets

Each data set comes with a tree and a character matrix in NEXUS format. To explore these data sets you may wish to:

There are four different categories of character sets:

  • DNA: aligned nucleotides coded via IUPAC standard (T, C, G, A, and so on)
  • protein: aligned amino acids coded via IUPAC standard (A, C, D, E, F, G, H, I and so on)
  • continuous: numeric values of continuous characters (e.g., 0.001, 0.230)
  • morphology: discrete morphological characters with ad hoc numeric encoding (e.g., 0 = absent, 1 = present)
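
The IUPAC nucleotide codes mentioned above include ambiguity symbols (relevant for the 'Demanding' sets below); the standard table fits in a few lines. This is the standard IUPAC mapping, not anything CDAO-specific:

```python
# Standard IUPAC nucleotide codes, including ambiguity codes.
# (Standard table; not specific to these data sets.)
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def expand(code):
    """Return the set of concrete bases an IUPAC symbol may stand for."""
    return set(IUPAC_DNA[code.upper()])
```

For example, expand("R") gives {"A", "G"} and expand("N") gives all four bases.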

The DNA data are "CDS" or "coding sequence" data, meaning the sequence of nucleotide triplets in the protein-coding part of a gene.

There are three grades of difficulty:

  • Simplified: small number of OTUs and characters; unambiguous states; single bifurcating tree
  • Typical: may contain many OTUs, multiple trees, polytomies, other stuff
  • Demanding: may contain ambiguous characters, mixed data types, notes, assumptions, etc.


type | difficulty | description | comments | NEXUS | CDAO
CDS (DNA) | Simplified | Subset of 10 ATPase CDSs | comments | PF00137_10_cds.nex | PF00137_10_cds.owl [[8]]
CDS (DNA) | Typical | Eukaryotic cytochrome C CDSs | comments | PF00034_39_cds.nex | PF00034_39_cds.owl [[9]]
CDS (DNA) | Typical | Eukaryotic ATPase CDSs | comments | PF00137_47_cds.nex | PF00137_47_cds.owl [[10]]
CDS (DNA) | Demanding | NA | comments | NA | NA
Protein (AA) | Simplified | Subset of 10 ATPases | comments | PF00137_10_protein.nex | PF00137_10_protein.owl [[11]]
Protein (AA) | Typical | Eukaryotic cytochrome Cs | comments | PF00034_39_protein.nex | PF00034_39_protein.owl [[12]]
Protein (AA) | Typical | Eukaryotic ATPases | comments | PF00137_47_protein.nex | PF00137_47_protein.owl [[13]]
Protein (AA) | Demanding | NA | comments | NA | NA
Continuous | Simplified | NA | comments | NA | NA
Continuous | Typical | Inhibitor sensitivity data for human kinases | -log(IC50) scaled | kinase_rescaled3_sets.nex | kinase_rescaled3_sets.owl [[14]]
Continuous | Demanding | NA | comments | NA | NA
Morphological | Simplified | NA | comments | NA | NA
Morphological | Typical | Nematode vulval morphology and development | Kiontke, et al., 2007 | Kiontke_CB_fixed.nex | NA
Morphological | Demanding | NA | comments | NA | NA

Initial Implementation

  • The preliminary draft of the CDAO work done at NMSU is available here [15]. This is a current view of the content of the ontology [16]. In particular:
    • MAO-Prime: [[17]Web page] a Protege implementation of MAO, extended with descriptions of individual nucleotides, amino acids, and meta-symbols such as the gap.
    • CDAO: [[18]Web page] a fairly direct implementation of the draft ontology developed during the Fall meeting of the EvoInfo group at NESCent.
    • Transformations: [[19]Web Page] During the Fall meeting we discussed the need to include a description of possible transformations in the ontology; this is an attempt at that.
    • Tree: [[20]Web Page] a draft ontology for the description of trees, mostly drawn from NEXUS and from Chado.


Evaluation

  • Some preliminary considerations:
    • Comparison of NeXML elements with ontology concepts (Updated Feb. 18, 2008) [21]
    • Comparison of Nexus elements with ontology concepts (Updated Mar. 1, 2008) [22]
    • Comparison of CHADO (Phylogeny Module) elements with ontology concepts (Added Feb. 25, 2008) [23]

Meeting Notes

Working meeting march 24-april 4

Day 11, Thursday Apr 3, 2008

remains to be done

  • category 2 items from yesterday

Incorporated (provisionally) in Version 12:

  • imported equivalence class of amino acids SpecificAminoAcid from OWL amino-acid (www.co-ode.org/ontologies/amino-acid/2005/10/11/amino-acid.owl)
  • found classes for nucleotides in www.bioontology.org/files/4531/basic-vertebrate-gross-anatomy.owl, actually these are imported from http://www.co-ode.org/ontologies/basic-bio/top-bio.owl


Incorporated in Version 11:

  • gap state or missing data
  • homologous to
  • taxonomic link as TU annotation

Incorporated in Version 10:

  • lineage
  • transformation types

Incorporated in Version 9:

  • renamed Anotayshun to CDAO_Annotation
  • added Ancestral_Node, Common_Ancestral_Node and MRCA_Node with restriction has_Descendants > 1
  • linked TU to Node with object property represented_by_Node (inverse represents_TU)
  • compound character class

Day 10, Wednesday Apr 2, 2008

The next sub-section of the ontology to be refined was the part representing the character state data matrix. Although this initially seemed to be a relatively simple structure, a number of complications were encountered because of the various data types we wanted to include (nucleotide, amino acid, continuous data and other discrete data, such as GO terms, EC numbers, anatomy, etc.) and the large number of properties attached to each class.

Two alternative representations were considered to accommodate the different data types. The first tried to minimise the number of classes specific to each data type; however, this turned out to be too difficult to represent in the OWL language. The second option defined a number of type-specific classes; although this is not an ideal structure and introduces some duplication, it simplified the ontology. The second option also defines a set of restrictions that will allow us to check data for consistency with the ontology reasoner.
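
The kind of consistency check such restrictions enable can be mimicked outside the reasoner. A sketch with standard state alphabets; the datatype names here are invented for illustration and are not CDAO class names:

```python
# Sketch of the "reject wrong values" check that type-specific
# restrictions make possible. Alphabets are the standard one-letter
# symbol sets plus gap/missing; the datatype names are invented.
ALLOWED_STATES = {
    "nucleotide": set("ACGT-?"),
    "amino_acid": set("ACDEFGHIKLMNPQRSTVWY-?"),
}

def valid_state(datatype, state):
    """Would this state pass the datatype's restriction?"""
    return state.upper() in ALLOWED_STATES[datatype]
```

With type-specific classes, an OWL reasoner performs the analogous check automatically when an instance's state falls outside the enumerated values.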

At this stage the validity of the draft ontology was tested by adding instance data in Protege and running a reasoner to check for inconsistencies.

In addition, we

  • generalized edge annotation concept
  • allowed for residues to have coordinates in a sequence
    • learned more about the bugs and quirks of reasoners in Protege 4
    • FaCT++ does not seem to work in the Mac version
    • errors in instance data trigger a Java fault (not reasoner error) for both FaCT++ and Pellet
    • DL Queries are limited to classes and simple conjunctions with properties
Remaining issues:

1. ontology classes and properties

  • In the topology sub-section: notions of lineage, MRCA, subtree.
  • Links between tree and TU
  • In the character state data matrix sub-section: sequential character types and positional information.
  • Transformation concepts that will be included as edge_annotation in the tree topology

2. general issues

  • Linking CDAO to other ontologies: amino acid, GO, SO, NCBI taxonomy
  • Mapping to MAO concepts
  • Text annotations of classes and properties, i.e., human-readable definitions of all CDAO concepts
  • Submission to OBO

Day 9, Tuesday Apr. 1, 2008

We spent the first part of the day becoming familiar with the latest version of Protege 4 and other tools such as the DL query facility and the OWLviz visualization. Updating from version 3 of Protege caused a number of compatibility problems, but the extra features, especially the visualization tool, were considered important.

We then decided to concentrate on a number of sub-sections of the ontology, starting with the tree topology. The issues raised on day 8 were all addressed during this session. The most important decisions were the kinds of topologies to include in the ontology (rooted/unrooted trees, more general networks or graphs, ...) and the representation of direction in rooted trees. The concept of parent/child or ancestor/descendant nodes connected by an edge proved non-trivial to describe in the OWL language, given that a property can link only two classes. This was overcome using the 'property chain' facility.
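
What a property chain axiom expresses is just relation composition. A generic sketch in plain Python, with illustrative relation names rather than the actual CDAO properties:

```python
# An OWL property chain such as  R o S -> T  asserts that composing the
# relations R and S yields pairs that must also hold for T.
def compose(r, s):
    """Return {(x, z) : (x, y) in r and (y, z) in s}."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}

# Illustrative child -> parent pairs (names from the example tree, not CDAO terms)
has_parent = {("otuA", "htuAB"), ("otuB", "htuAB"), ("htuAB", "htuABCD")}
has_grandparent = compose(has_parent, has_parent)
```

Here has_grandparent contains ("otuA", "htuABCD") and ("otuB", "htuABCD"); an OWL reasoner derives the chained property in the same way.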

Day 8, Monday Mar. 31, 2008

Julie and Francisco reviewed the complete draft ontology built during the first week and compiled a list of questions to address with the full team during the next week:

  1. Do we want to differentiate between traits and characters?
  2. How do we represent the tree topology, and what do we need to differentiate between rooted and unrooted trees?
  3. What properties do we need for edges and nodes? How do we define directed edges for rooted trees?
  4. Do we want to define a minimum ontology with only basic concepts, or do we want to include other concepts that could be derived from the basic ones?

Day 6 to 7, Sat to Sun, Mar. 29 to 30, 2008

The work done by Brandon and Francisco was handed over to the team working in the second week (Julie, Enrico, Arlin).

Day 5, Friday Mar. 28, 2008

1. Adding new concepts to the ontology

Today Francisco reviewed Arlin's entire list of concepts and added to the ontology all that he could clearly understand and describe. To do for Monday: (1) take a look into this list together and (2) try to verify if the concepts are clearly and well represented or whether different representations can be suggested.

Some concepts at the bottom of the list (many of the priority 3 ones), although very relevant in evolutionary biology, seem very difficult to include in the current ontology -- specifically the ones related to population genetics. Maybe we would need another "sister ontology" to describe these, although it may be possible to link them to the concepts already present in CDAO.


2. Gathering new information to write a final version of paper

In addition, Francisco has been looking into the CDAO manuscript and gathering information from the web to flesh it out. He has made a list of references on biomedical ontologies retrieved from PubMed. These have been downloaded and saved in MS Word .doc format for use with EndNote.


3. Writing the algorithm

Brandon had another great day of coding. He has finished the overall design of the system, but we have not yet had a chance to test it. He plans to work on it more this weekend after getting back to Las Cruces, and will send an updated version by Monday morning.

Day 4, Thursday Mar. 27, 2008

1. We have now produced a more consistent version of the ontology presenting (almost) all priority 1 concepts and also some priority 2 ones -- we've missed just a couple of priority 1 concepts that we didn't understand very well and that we'll be able to add into the ontology next week, after discussion with the other members of the group.

2. We believe this version of the ontology is much clearer, and the relationships between classes are better described. Some concepts considered as classes before are now represented as object_properties or datatype_properties (and vice versa). We have also restricted some of the datatype_properties to a limited set of values, avoiding misrepresentation of data. We think we've found a good representation of some difficult inter-related concepts, such as the relationship between transformation, branch modification, character, OTU and character-state modification. I hope that we can further refine this representation during the next week.

3. Brandon has finished a preliminary version of his algorithm that reads and interprets the NEXUS files and tomorrow morning he'll be adding to it the module that reads our new ontology -- he has an idea of some modules to use and he thinks there will be no problem with this. We hope to finish the day tomorrow with the complete and validated-by-hand representation of at least 2 nexus files in an ontology XML format.

Day 3, Wednesday Mar. 26, 2008

1. Optimizing the ontology

Today, we began discussing two versions of a simplified ontology we made yesterday (each of us made our own simplified version). In the end, we realized that the original ontology made on Monday, containing all the concepts, was not very well encapsulated, and we preferred to begin another one. We took the best descriptions made by each of us and produced a cleaner and more consistent ontology. We discussed differences in data representation and came to an agreement on the best way to represent different kinds of data.

Although the new ontology is cleaner and more understandable, and the concepts are inter-related in a better way, it still lacks some synonyms and some important concepts. We'll add them by importing from the original complete ontology in a step-by-step manner, testing each concept and its relationships before adding the next one.


2. Automating the representation of test sets

Since we spent a lot of time yesterday afternoon and this morning trying to represent 3 OTUs with 5 characters in the Protégé ontology manually, in the afternoon we decided that we would need at least a very preliminary algorithm to read the input test files made by Arlin and translate them into a file to be read and checked inside Protégé.

Brandon has spent the afternoon producing this algorithm (although he hasn't finished it yet, he has made good progress). In the meantime, Francisco continued to look into the simplified ontologies we have made and to add new concepts to them. Although they still lack many of Arlin's priority 1 concepts, we think that these new ontologies, built from scratch, are more internally consistent and will allow better representation of the data than the original one produced on Monday.

Day 2, Tuesday Mar. 25, 2008

1. Revisions

We added synonyms to the ontology where needed.

We also separated characters and their related classes and properties into a separate ontology in order to better encapsulate these elements so that they could be refined in isolation without disturbing the other parts. This additionally helps to reduce confusion between properties while working.

2. Examples

We started work encoding the examples provided on the Wiki page.

This encoding is not yet complete, but we are making progress, and we have identified and made a few necessary changes to fix earlier errors, such as relating traits/characters to edges rather than OTUs.

3. Testing and Protege Training

As part of the testing process we each made simplified versions of the ontology and worked on encoding the examples, so that we could identify the critical components, transfer knowledge about Protege, and work out problem spots in a simple environment where they would most likely be easier to fix. Additionally, the import system has proven to be somewhat brittle, so while the encapsulation is desirable, until each of the sub-parts is stable it is easier to work with them as a single ontology file.

4. Updates available

We have uploaded the current versions of our work to [[24]]. They are now available as both OWL and Protege project files.

Day 1, Monday Mar. 24, 2008

We began by checking the concepts in the prioritized_concept_list, trying to make them available in the current version of the ontology. Most of the concepts were added in the tree subsection, although a number of them were shown to be redundant or better represented as other terms and relationships. We have also converted some terms that were classes to properties, and others from properties to classes -- in the context of an OWL representation.

First we sorted the terms into related groups:

  • Group 1 - TU related
    • Descendant
    • Ancestor
    • HTU
    • Hypothetical Taxonomic Unit
    • Most Recent Common Ancestor
    • MRCA
    • Operational Taxonomic Unit
    • OTU
    • Outgroup
    • Leaf node
    • Terminal node
    • Root
    • Basal
  • Group 2 - Tree related
    • Branch support
    • Tree
    • Unresolved
    • Cladogram
    • Dichotomy
    • Edge
    • Fully resolved
    • Monophyly
    • Network
    • Bifurcation
    • Phylogenetic Tree
    • Phylogenetic Tree Topology
    • Bipartition
    • Bootstrap support
    • Branch
    • Subtree
    • Lineage
    • Topology
    • Polytomy
    • Unrooted
  • Group 3 - Character related
    • Trait
    • Character
    • Character-state
    • Character-State Data Matrix
    • Derived
    • Apomorphy
    • Primitive
    • State
    • Missing data
  • Group 4 - Others
    • Gap
    • Indel
    • Homology
    • Polymorphism
    • Taxon
    • Taxonomic Rank

Then we identified synonymous usages of terms. When terms were synonymous concepts or representations, we chose just one of them to present.

  • Group 1
    • HTU = Hypothetical Taxonomic Unit = Ancestor
    • Leaf node = OTU = Operational Taxonomic Unit = Terminal node
    • Descendant = Child
    • Root
    • Outgroup
    • Most Recent Common Ancestor = MRCA
    • Basal

(These two concepts may be derived by an algorithm reading the ontology-annotated file, but they are not explicitly defined in the ontology itself. The information is there, but no specific concept is provided. If we chose to represent the MRCA of every pair of OTUs/HTUs, and which TUs are more or less basal than others, the representation file would be very big.)

  • Group 2
    • Tree = Cladogram = Network = Phylogenetic Tree
    • Dichotomy = Fully resolved = Bifurcation = Monophyly = Bipartition
    • Edge = Branch
    • Polytomy = Unresolved
    • Unrooted
    • Subtree = Lineage
    • Branch confidence level = Branch support = Bootstrap support

(here we used 'confidence level' because it can cover any confidence analysis, even though bootstrap is the most commonly used)

    • Topology = Phylogenetic Tree Topology

(the topology is something we need in order to build ontology-based representations; it is imported from the NEXUS file and can be retrieved from the ontology file through child-parent relationships)

  • Group 3 - Character related
    • Trait: Defined as any characteristic of the TU that the annotator would like to describe
    • Character: Defined as the characteristics used for evolutionary classification
    • State = Character-state
    • Derived = Apomorphy
    • Primitive
    • Missing data
    • Character-State Data Matrix

(This would be in the input file of the ontology and could also be retrieved from the ontology-annotated file by algorithms)


  • Group 4 - Others
    • Gap = defined in the transformation
    • Indel = defined in the transformation
    • Homology
    • Polymorphism = we didn't understand what it means
    • Taxon = defined as a property of an OTU
    • Taxonomic Rank


Once all these concepts were defined and added to the ontology, we began to make a simple representation of a simple hypothetical dataset. During this preliminary representation we found some errors, and changed some concepts from properties to classes (such as branch). Moreover, we had some difficulties working with Protégé, since it seems to be a very early beta release: each time we found something that would be better represented by slightly changing the concepts, we needed to rebuild and manually re-enter all the concepts in our test set.

  • Question : can synonyms be represented in Protégé? I think it would be useful for scientists to be able to choose the term they want to use.

telecon, 14 March, 2:00 UTC

skipped this

Telecon, 7 March, 2008

present: Francisco Prosdocimi, Julie Thompson, Enrico Pontelli, Arlin Stoltzfus

What activities to do before the meeting? Plan for development?

  1. represent 4 simple test cases
    1. nt alignment plus tree
    2. prot alignment plus tree
    3. kinases with inhibitor sensitivity
    4. worm morphologies
  2. carry out operations with reasoning
    1. set and logic operations on characters and OTUs
    2. tree operations (clade selection, prune)
    3. other?
  3. map ontology to other representations
    1. NEXUS
    2. neXML
  4. start compiling list of concepts that are missing
    1. review Enrico's proposal
  5. look ahead to future challenges
    1. genetic encoding of characters
    2. ambiguous, multi-dimensional, or otherwise complex characters

Other issues for meeting and for paper

  • what is the scope?
  • How to integrate with other ontologies?
    • table from 'related artefacts' exercise
    • genetic code as a test case for integration
      • requires nt-to-aa mapping to specify code
      • requires species taxonomy to assign code to species
      • requires cell ontology to assign code to compartmental genome (nuc, mito, cp)

Next meeting

  • telecon, 14 March, 2:00 pm UTC
  • agenda
    • nt and prot test data sets (arlin)
    • protege demo (brandon)

Related Work

  • we are working on a direct generation of an ontology from the Concept Glossary. We are documenting the progress on this page [25]. Note that the page is not up to date at this moment (hopefully it will be by the end of the day or tomorrow [3/18/2008]). The goal is to eventually show that CDAO can map all these concepts.