CDAO

From Evolutionary Informatics Working Group


Comparative Data Analysis Ontology - Project Page

Note: this is for historical purposes only: go to www.evolutionaryontology.org/cdao for current project news


The Comparative Data Analysis Ontology (CDAO) is intended to provide a framework for understanding data in the context of evolutionary-comparative analysis. This comparative approach is commonly used in bioinformatics and other areas of biology to draw inferences from a comparison of differently evolved versions of something, such as differently evolved versions of a protein. The entities to be compared, typically called 'OTUs' (Operational Taxonomic Units), may represent biological species, or entities drawn from higher or lower in a biological hierarchy, anywhere from molecules to communities. The features to be compared among OTUs are rendered in an entity-attribute-value model sometimes referred to as the 'character-state data model'. For a given character, such as 'beak length' (or 'position 20 of alpha hemoglobin'), each OTU has a state, such as 'short' (or 'Alanine'). The differences between states are understood to emerge by a historical process of evolutionary transitions in state, represented by a model (or rules) of transitions along with a phylogenetic tree. CDAO provides the framework for representing OTUs, trees, transformations, and characters. The representation of characters may depend on imported ontologies; e.g., character-states for amino acid characters are based on an imported ontology of amino acids.
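As a rough illustration of the entity-attribute-value view of character-state data described above, the following Python sketch pivots (OTU, character, state) triples into a matrix. The OTU names and states are made-up examples, and the function name `to_matrix` is hypothetical; none of this is CDAO vocabulary.

```python
from collections import defaultdict

# Illustrative data only: OTUs, characters, and states are invented examples.
triples = [
    ("sparrow", "beak length", "short"),
    ("heron", "beak length", "long"),
    ("sparrow", "alpha hemoglobin position 20", "Alanine"),
    ("heron", "alpha hemoglobin position 20", "Glycine"),
]

def to_matrix(eav):
    """Pivot (OTU, character, state) triples into {otu: {character: state}}."""
    matrix = defaultdict(dict)
    for otu, character, state in eav:
        matrix[otu][character] = state
    return dict(matrix)

matrix = to_matrix(triples)
print(matrix["sparrow"]["beak length"])
```

Each row of the resulting matrix corresponds to one OTU and each column to one character, which is the layout usually meant by a character-state data matrix.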

Quick links

Current priorities

Manuscript revisions

  1. What are the key points from the reviewer critique?
  2. What should we do in response to the key points?
  3. Who is going to take responsibility for each response?

Francisco and Julie went through each point raised by the two reviewers and have started answering the criticisms.

Reviewer 1, point 1

REVIEWERS' COMMENTS:
REVIEWER  1 EVALUATION>
1. Several of the ontology classifications are tree-like (e.g. PANTHER), while others
are cyclic (e.g. GO). Can the authors  comment about their preference for one type over
the other, and justify?
  • (Francisco and Julie) The type of ontology (tree-like or cyclic) depends on which real-world concepts the ontology developers want to represent. Any ontology should try to represent concepts and relations in a natural way, mirroring how humans use those concepts every day to describe the knowledge being formalized. The choice of ontology type therefore depends on the structure of knowledge in the domain to be represented. The two ontologies mentioned by the referee (PANTHER and GO) represent the same type of knowledge, i.e., molecular function and biological process. In this case, the particular application will probably determine the more suitable representation of the data, and both types of ontology have advantages and disadvantages.
  • (Francisco and Julie) I think we need to be careful – from what I’ve seen, the OWL format is considered to be a tree-like structure and doesn’t allow cyclic relations. We need to justify the choice of OWL (tree-like ?) for CDAO carefully. This seems to be one of the arguments in the debate between OBO and OWL. What aspects of CDAO are particularly suitable to OWL, that cannot be modelled in OBO format?
  • (Francisco and Julie) Regarding CDAO, it is impossible to classify it as cyclic or tree-like. CDAO is built from three main parts (or sub-ontologies): the character-state data matrix, the topology and the transformation. Although the parts interact with each other via some concepts and relations, CDAO is not a content-descriptive ontology such as GO, PANTHER or any anatomical ontology, and so it does not fit these classifications well.
  • (Arlin) We chose OWL on Enrico's advice, because of OWL-DL, the description-logics dialect that allows us to specify properties of relations such as transitivity. This is explained in the manuscript, so I think we just need to refer the reviewer to that.
  • we should make clear that CDAO is primarily an ontology of information artefacts, not biological things or processes. Evolution is a process but we don't refer to it directly. I was confused about this point when we were writing the manuscript. I still find the subject confusing but my thoughts are more organized on it.
  • (Julie) I think we still need to mention why we don't need cyclic relations (as this seems to be an important point for the reviewer).
  • (Arlin) I think the reviewer is confused. OBO allows for relations that are cyclic. This is from an introduction to OBO (http://oboedit.org/docs/html/An_Introduction_to_OBO_Ontologies.htm):

Cyclicity: If a relation is cyclic, it is legal to create a cycle of links of that relationship type. Note that a cycle of a given relation P may contain other relationship types than P; the cycle may include is_a links or sub-relations of P.
"develops_from" is an everyday example of a relation that may be cyclical. An instance of A may develop from an instance of B, and later an instance of B may develop from the instance of A. Cyclic relationships often are ones that involve some sense of change over time.

A cyclic relation is treated as an annotation when an OBO file is translated to OWL. That is all that I was able to find out. I think there is something strange about this. The instances in the "develops_from" example above are not cyclic; they form a linear array along the dimension of time. I'm not really sure what the issue is here.

  • (Enrico): this is a rather badly-phrased comment from the reviewer. I believe what he/she means with the distinction is really about a taxonomic vs. non-taxonomic organization of the ontology. It does not really have anything to do with whether OWL/OBO/... is used. OWL does not have any "tree-like" structure; we can introduce properties that link concepts recursively, e.g.,

   <!-- The namespace declarations and the example.org base are placeholders,
        added only to make the fragment well-formed RDF/XML. -->
   <rdf:RDF xmlns="http://example.org/cyc#"
            xml:base="http://example.org/cyc"
            xmlns:owl="http://www.w3.org/2002/07/owl#"
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
       <owl:Class rdf:about="#Class1">
           <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
       </owl:Class>
       <owl:ObjectProperty rdf:about="#cycrel">
           <rdfs:range rdf:resource="#Class1"/>
           <rdfs:domain rdf:resource="#Class1"/>
       </owl:ObjectProperty>
       <!-- Two individuals of Class1 that point at each other via cycrel -->
       <Class1 rdf:about="#elem1">
           <cycrel rdf:resource="#elem2"/>
       </Class1>
       <Class1 rdf:about="#elem2">
           <cycrel rdf:resource="#elem1"/>
       </Class1>
   </rdf:RDF>

Given this, we also need to clarify even further what "cyclic" means. In CDAO I don't believe we have any individual property that creates cyclic dependencies, but we do have chains of properties that create cycles. This is obvious in any situation where a property has an inverse: if x is related to y by a property, then y is related to x via the inverse property, and chaining the two relates x to itself via the connection of property + inverse property. I think the reviewer is really aiming at the issue of whether this ontology generates a taxonomic classification of concepts (i.e., essentially using only is-a relations) or whether we are capturing a more general (non-taxonomic) classification of concepts with more complex properties. We fit into the second category, not the first (though CDAO 2 does contain more taxonomic components, e.g., classifications of various classes of trees).
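The property-plus-inverse argument can be made concrete with a small Python sketch over a toy set of triples. The property names `p` and `p_inverse` and the helper functions are hypothetical; they are not CDAO terms, and a real OWL reasoner would derive the inverse assertions itself.

```python
# Toy facts: x is related to y by property "p" (names are hypothetical).
facts = {("x", "p", "y")}
inverse_of = {"p": "p_inverse"}

def add_inverses(triples, inverses):
    """Add (o, inverse(p), s) for every (s, p, o) whose property has an inverse."""
    out = set(triples)
    for s, p, o in triples:
        if p in inverses:
            out.add((o, inverses[p], s))
    return out

def chain(triples, p1, p2):
    """Return pairs (a, c) such that a -p1-> b and b -p2-> c for some b."""
    return {
        (a, c)
        for a, q1, b in triples if q1 == p1
        for b2, q2, c in triples if q2 == p2 and b2 == b
    }

closed = add_inverses(facts, inverse_of)
print(chain(closed, "p", "p_inverse"))  # {('x', 'x')}
```

No single property here is cyclic, yet the chain of `p` followed by `p_inverse` relates x to itself, which is the sense of "cycle" discussed above.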

  • (Julie) I think Enrico has done a great job of answering this point in the article and in the letter to the reviewers!

Reviewer 1, point 2

2. Probably it will be very useful to do a schematic comparison between sequence-level
and abstract-level evolutionary analysis.  The authors can do that by improvising the
Figure 2.
  • (Francisco and Julie) Figure 2 has been modified to represent better the relation between sequence-level and abstract level evolutionary analysis.
  • (Francisco and Julie) This was a point raised by Arlin’s internal reviewer I think? Can Arlin comment on this?
  • (Arlin) I don't follow the distinction that the reviewer is making between "sequence-level" and "abstract-level". If the reviewer thinks morphological characters are "abstract" whereas sequence residue characters are not, the reviewer is confused, and the revisions to Fig. 2 would address this confusion.
  • (Arlin) The NIST reviewer was Fabian Neuhaus, who has been closely associated with OBO and is truly an expert. We were lucky to get his advice. One thing he pointed out was that CDAO things are mainly in the realm of information artefacts, separate from chemicals, flesh-and-blood things, and biological processes. This puts a restriction on possible relations. A thing in the flesh-and-blood world cannot have a "part" in the information-artefact realm, and vice versa. Fabian was particularly concerned with the way that we said that OTUs are "biological entities" but that they "have" rows in a matrix, given that a data matrix is clearly an information artefact. I'm still trying to sort this all out in my head. I think we do not need to make changes to CDAO yet, but we need to change some things in the manuscript to avoid this kind of error.
  • (Enrico) I really like the new picture. Would it be worth putting in even small snippets of RDF to show where the different parts would go in a CDAO file?
  • (Julie) I think the RDF included in the text is enough to show how cdao concepts are represented, but I might be wrong?
  • (Julie) I misread this comment the first time. I thought the reviewer was talking about real-world versus abstract concepts. I think we can add a comment similar to Enrico's reply to the cyclic relations - that we are not sure about the distinction between sequence-level and abstract-level evolutionary analysis.
  • (Arlin) Julie, I think you had it right the first time. The reviewer probably thinks that non-sequence characters are "abstract", and in that case, your revisions to Fig. 2 address the comment.

Reviewer 1, point 3

3. The state transition matrix (e.g. PAM for amino acids) is likely to be different
for different type of data. Will an user be  able to use a generalized precomputed
matrix or should he/she generate one of his/her specialized own on the spot?  Can the
authors comment about it.
  • (Francisco and Julie) (Here one can realize that the referee missed completely the point on CDAO, he thinks it is some sort of algorithm, I guess... not a representation schema... that's why I say we should make overall modifications in the manuscript. Anyway, we have produced the beginning of an answer below.)
  • (Francisco and Julie) CDAO is a model to describe evolutionary analysis data, and we have tried to make this independent of the methodology used to analyze or produce the data. However, it is clear that the methodology used is very important, and we have included Annotation concepts to document these aspects. In the future, we hope that CDAO will be able to benefit from the MIAPA project (Minimum Information About a Phylogenetic Analysis; Leebens-Mack J et al, 2006), allowing a more precise annotation of the methodologies used.
  • (Francisco and Julie) I think we need to describe the annotation concepts more clearly in the Implementation section.
  • (Arlin) I agree that the reviewer is confused and I would say to ignore this one except that the reviewer specifically asked us to comment on it. In a parsimony analysis, the user can assume that all transitions have the same cost, and this scheme is generalizable to any number of characters, each with any number of states, regardless of whether the characters are molecular or morphological. In a probabilistic analysis of a model, the user can make the comparable assumption that all transitions happen at the same rate. If the user wants a more sophisticated model (either an empirical model such as PAM, or a mechanistic model), then it's up to the analysis programmer to supply that. It's just not an issue for CDAO. The thing that CDAO could do, in principle, to help with this is to supply more of a framework for model descriptions. The EvoInfo group at NESCent has struggled to define a general language for describing models, and this effort gets bogged down in complexity very quickly.
  • (Enrico) yes, the reviewer at this point has completely gone off on a tangent. We could simply emphasize that the ontology is aimed at providing a description of existing data and artifacts, in a platform and application independent fashion, and emphasizing the concepts and properties of concepts. Thus, the CDAO itself does not generate any data or matrices, it is used to describe existing data and matrices, possibly originating from diverse and heterogeneous sources of information.
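To illustrate the point that an equal-cost transition scheme applies to any character regardless of its type, here is a minimal Python sketch. It belongs to the analysis side, not to CDAO, and the function names `equal_cost_matrix` and `path_cost` are hypothetical.

```python
def equal_cost_matrix(states, cost=1):
    """Return {(from_state, to_state): cost}, with zero cost for no change."""
    return {(a, b): (0 if a == b else cost) for a in states for b in states}

def path_cost(matrix, state_path):
    """Sum the transition costs along a sequence of states."""
    return sum(matrix[(a, b)] for a, b in zip(state_path, state_path[1:]))

# The same construction works for a morphological and a molecular character.
morph = equal_cost_matrix(["short", "long"])
dna = equal_cost_matrix("ACGT")

print(path_cost(morph, ["short", "long", "long"]))  # 1
print(path_cost(dna, "AAGT"))                       # 2
```

An empirical matrix such as PAM would simply replace the uniform costs with supplied values; the data being described stay the same either way.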

Reviewer 2, summary

REVIEWER  2 EVALUATION
The authors present an initial implementation of an ontology for evolutionary biology
(CDAO), which is aimed at facilitating the application of phylogenetic and comparative
methodology for software developers and biomedical researchers. To achieve this they
adopted a common workflow for ontology development involving a specification phase,
subsequent conceptualization, implementation and evaluation.
Based on an initial discussion among the Evolutionary Informatics working group and
inspection of phylogenetic software packages and standards, a list of glossary terms was
defined. Using this glossary the authors have derived relationships and concepts which
subsequently were/will be transformed into an OWL 1.1 compliant ontology.
It was not obvious to me whether all concepts from the glossary are already implemented
in the ontology. This didn't seem to be the case when looking at it using Protégé (35
terms).
  • (Francisco and Julie) The referee is right, we have first compiled the most important and relevant concepts to describe the most basic evolutionary analysis possible. The current version of CDAO allows a clear representation of the character-state data matrix, phylogenetic tree topology and character modifications. The users are able to use our ontology terms to describe a complex pattern of characters, including sequence-based, behavioral-based and anatomical-based ones. Special attention was paid to the overall structure of the ontology, so that the other concepts described in the glossary can be easily included in CDAO without disturbing the basic structure. These concepts will be added in future versions of CDAO and one of the main objectives of the article was to invite other researchers in the field to participate in the ontology design.
  • (Francisco and Julie) The current status of CDAO has been made clearer in the manuscript (I don’t know where exactly, but maybe we should include a paragraph “current status”, number of concepts etc. I think Brandon made a table for Francisco with this info?).
  • (Enrico) actually Brandon and myself did go through the glossary and we tried to pull in as many terms as we could (and that's what led to CDAO 2). Could we somehow mention that? All the terms that we could figure out being related to the description of trees are present in CDAO 2.

I also agree that a table could be nice.

Finally this ontology was evaluated by transformation of multiple alignments and
phylogenetic trees in NEXUS format into  CDAO-RDF. These XML files were used for tests
using reasoners.
First of all, I fully support and am enthusiastic about the idea of an evolutionary
ontology! I entirely agree that developers as well as users can greatly benefit from a
well annotated ontology for phylogenetic/comparative analysis. "Gold standard" ontologies
like e.g. GO or SO nicely demonstrate how ontologies can inform (computational) biology.
It would be a tremendous step forward if the data obtained from phylogenetic analysis
could be analyzed using a common vocabulary that allowed a higher level of abstraction
comparable to how we think and discuss about evolutionary problems.
  • (Arlin) It's important to respond to this by pointing out that this is an "evolutionary ontology" in the sense that it falls into the domain of evolutionary biology, but it is not an ontology of evolution qua process; instead it is an ontology of comparative evolutionary analysis. As such it focuses on methods and info artefacts rather than on population genetics and ecology and geologic eras and so on. I would like to work in that stuff somehow, e.g., to have a way to reference evolutionary processes and geologic time periods within CDAO, but I'm not sure what it is right now.
  • (Francisco and Julie) We thank the referee for his positive comments and we hope that phylogenetic analysis software developers and users will agree with him.

Reviewer 2, point 1

However, in my opinion, the present form of the manuscript presented by Prosdocimi and
Chisham et al, is unfortunately not yet recommendable for publication.
General points of criticism:
1.) Readability:
Large parts of the manuscript are hard to read for a non-ontology developer (probably
true for most readers). Specialist terminology (e.g. artefact, inverse properties,
transitive properties, property chaining) should be explained.
  • (Francisco and Julie) We have explained these terms in more detail the first time they are used in the manuscript. (We can try to find a non-ontology developer (a biologist!) to read through the manuscript and point out specialist terminology - Francisco, Julie).
  • (Enrico) I think we can make another pass and further improve. I am not sure about the format of papers accepted by the journal, but if possible we could collect the more technical terms and provide a little glossary as an appendix to the paper.
I think the manuscript could greatly benefit from concrete (in-text) examples to
explain the concepts (especially the "Implementation" sections). I would like to see a
consecutive example which connects the different parts of the results, provides a red
thread and demonstrates possible use cases and the power of using an ontology instead of
the existing "archetypes" e.g. XML formats. It should be possible to demonstrate one
(simplified) use case, e.g. the NEXUS example, throughout the manuscript. This could be
more illustrative if in addition to the plain text NEXUS, the alignment and the tree were
to be shown.
  • (Francisco and Julie) I guess this is quite easy for the example we used in the NEXUS example? I can probably do this, if no one else wants to? (Julie)
  • (Enrico) That would be great. Thanks to the hackathon, we can now produce CDAO instances from Nexml as well.
The lack of definition and introduction of concepts is not only true for the ontology
concepts used in the manuscript, but even more so for the phylo/sequence data concepts
and tools. There are several passages where one could guess from the citation of a tool
(e.g. CHADO) what the authors probably meant but if you do not know the tool or resource
then it is likely you will miss the concept the authors intend to bring across. Please
avoid vague descriptions and be more specific!
Furthermore, citations are missing in many cases (tools/resources/models/concepts).
Minor points hindering readability are typos/sloppy typesetting ("ie" --> "i.e.", "eg"
--> "e.g.", URLs are not formatted consistently,  ) and colloquial phrases (e.g. "just
about anything").
  • (Francisco) Missing citations have been added and complete revision of the manuscript has been performed, avoiding vague descriptions.
  • (Enrico) I will also make a pass and see if there is anything else to be fixed.
  • (Julie) Thanks to Enrico for going through and correcting these points. The URLs included as footnotes (Mr Bayes, PAUP etc) should be included in the reference list though, not as footnotes. For clustalw, the correct URL is: http://www.clustal.org/

Reviewer 2, point 2

2.) Comparison/interaction with existing "artefacts":
The results section "Relevant artefacts" implies that: "It is noteworthy that none of
the artefacts, covering an acceptable range of evolutionary analysis concepts, provides
the same level of formalism as an ontology, thus preventing their effective use for
knowledge-based tasks."
In principle I agree that an extensive and deep ontology could allow more effective
knowledge-based discovery. But the current status of the CDAO does not provide more
abstraction than e.g. phyloXML or nexml. One could even go so far as to say that it
does not provide an advance or improvement, especially as no downloadable source code
is provided to transform existing formats into CDAO and for reasoning purposes. What
questions can we ask now using CDAO that we could not ask with the existing tools?
A more detailed comparison to the existing formats (especially nexml & phyloXML),
discussion of shared and distinct features and probable ways of interactions or integration
would make a stronger point why developers and users should use and contribute to CDAO.
  • (Francisco and Julie) We agree that the conceptual coverage in this initial version of CDAO is similar to NeXML and phyloXML. However, the use of ontologies and reasoners has much wider potential, illustrated by the success of GO, SO etc., as the referee pointed out. These structured data representations allow the definition of complex relationships between the different data entities that cannot be encoded in a file format, such as NEXUS.
  • (Francisco and Julie) In the future, new concepts and relations will be developed in collaboration with the community and will provide a general conceptual framework for the comparative analysis of many different types of data, not limited to a specific methodology. Moreover, specific reasoners will soon be developed to interpret CDAO instances, which will allow richer analyses...
  • (ENRICO) and in addition CDAO will gain more interactions with other existing ontologies.
  • (Francisco and Julie) Question for Enrico/Brandon: is the source code for the nexml to cdao conversion available?
    • (ENRICO) Yes! I have asked Brandon to ensure that the code is placed on the CDAO web site with minimal documentation
  • (Arlin) the reviewer is asking for the moon and stars! Developing an ontology is a long-term enterprise, not expected to pay off heavily in the first few months or even years. When the pay-off comes, what will it look like? I think the reviewer has a wrong impression of how the ontology will pay off.
    • First, it does not redefine what is possible. What can anyone do with GO that Evgeni Selkov could not have done 20 years earlier? Nothing. In general, there are no questions that better computing technology can answer that human experts can't answer. The only thing that computers can do is to make it faster and more accurate, and more automatic, so that problem-solving by non-experts becomes possible.
    • Second, Nexml and phyloXML provide syntax, not semantics. They give the appearance of having semantics because we (human experts) recognize the tags and we (human experts) fill in the blanks. But a computer can't do this. A computer does not know from a nexml file that an edge is part of a tree; all it knows is that there are some "tree" elements with "edge" elements inside them. It's that simple.
    • In the medium term (next few years if the project continues to be successful), we expect CDAO to be used in the following ways:
      1. to attach semantics to data in foreign formats. Annotators and curators of data resources have a need to express, in clear terms, the semantics of data. This is one of the use cases in the DB interop hackathon at NESCENT. (I need to provide an example here).
      2. to clarify the semantics of other artefacts. Nexml provides some mapping to CDAO in its schema using the SAWSDL standard. Thus, the nexml schema relates "Edge" elements to the "Edge" concept in CDAO, making the semantics accessible.
      3. to reason over data. "reasoning" sounds very complicated, but automated reasoning does not need to be very sophisticated in order to be very useful. It can be used for simple validation and to ask questions about what goes with what. Simple reasoning will tell us that if A, B, and C are objects, and A is part_of B, and B is part_of C, then A is a part_of C.
  • (Enrico) I think Arlin made the point. We can of course now claim the existence of tools (we have a Nexus translator, now also a Nexml one, and a term submission system). But what really bothers me is the continued lack of understanding of the difference in principle between a data format and a semantics specification. One cannot do reasoning on syntax (i.e., on Nexml); the only questions that one can ask amount to extracting pieces of syntax according to a syntactic pattern. If we attach semantics to data from heterogeneous sources, we can unify data without translation.
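The "simple reasoning" example given above (A part_of B and B part_of C implies A part_of C) can be sketched in a few lines of Python. This is only an illustration of what declaring a relation transitive buys us; an OWL-DL reasoner performs this kind of inference from the ontology's declarations rather than from hand-written code, and the function name is hypothetical.

```python
def transitive_closure(pairs):
    """Return all (a, c) pairs implied by treating the relation as transitive."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Asserted facts: A part_of B, B part_of C.
part_of = {("A", "B"), ("B", "C")}
inferred = transitive_closure(part_of)
print(("A", "C") in inferred)  # True
```

The pair (A, C) was never written down; it follows from the stated semantics of the relation, which is something a purely syntactic format cannot express.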

Reviewer 2, point 3

3.) Community:
An important prerequisite for the success of an ontology is active community
involvement. In the current version of the manuscript and the web resource it is
difficult to see how the community could contribute to the ontology. Gene and Sequence
Ontology e.g. provide mechanisms to suggest and discuss terms, which are indispensable
in ensuring an actively evolving ontology which is shaped by the community of
scientists actually using it! Scientists and programmers will not use an ontology they
cannot contribute to. As stated before, actual software or source code to help at least
developers to get started using and implementing software using CDAO would be crucial.
As a side point, it is not entirely clear, how and if the Evolutionary Informatics
working group or any other non-author participate(d) in the other stages of ontology
development (why isn't there a CDAO consortium ?).
  • (Julie and Francisco) We agree that an active community is crucial and this is one of the reasons for the manuscript. This version of CDAO is intended to be a prototype that will promote and facilitate community-wide discussions.
  • (Arlin) Again, the reviewer is asking for results that take years to develop. In fact, we have a draft version of a term request server here: http://chishamconsulting.org/suggest-term/index.php . But frankly, we don't expect people to come beating down our doors to join our project when the short-term pay-offs are low. Most people will take the attitude of this reviewer that "I don't see any short-term payoffs, therefore it's not useful, therefore I'm not interested". Apparently we aren't even important enough to get attention from the OBO people to look at our ontology to review it for submission. However, we are very tuned in to community needs via our connections to NESCent, and we expect that the project infrastructure will expand in the future. (This needs some more examples, citing things like phenoscape and the MIAPA project.)
  • (Enrico) well, this is a chicken-and-egg problem - we would like the journal to publish the paper EXACTLY TO ACCOMPLISH THIS! Several ontology communities have been created starting from a working paper that acted as a call for attention. Also, the seed of the ontology is the glossary, and the glossary was the product of a community of users. I believe we should make this point in the paper, by stating as one of the contributions of the paper the ability to develop a community around CDAO to further its development and adoption.

Reviewer 2, point 4

4.) Missing or fragmentary coverage of evolutionary concepts :
Of course, this is an intial version of CDAO and I did not expect it to be final. But it should be clear from the text what is already covered and what is not!  E.g.:
Although there is a term "Networks" and  reticulation and hybridization events are introduced as key concepts, neither the text nor the actual ontology provide a way to model these concepts. For now, binary trees, character matrices and sequence alignments can be described - is this correct?
  • (Julie and Francisco) This could be answered in the “current status” section….
How could concerted evolution be modelled using CDAO?
What will not be covered by CDAO? What additional ontologies will be needed?
As correctly stated in the outlook, a necessary extension will be the modelling of
trees as processes. Why do we need both representations in CDAO? To come back to point
2.), why do we need an application ontology, when we already have well-formed XML
formats?
  • (Enrico) The only thing that a well-defined XML format brings over a traditional data format (e.g., NEXUS) is easier parsing. Period. All the issues of extensibility and interoperability are going to be present no matter how much XML is employed. People will add extensions to XML, they will start using elements defined in different namespaces, and they will start violating the "suggested uses" of elements in order to fit into a data format components that had not been foreseen at the time of the design. Ontologies are used to address this problem; they make it possible to create a structure for the knowledge in the domain (and an XML format is just one possible concrete representation of such knowledge). This makes extensions possible without running the risk of losing interoperability, as extensions will be explained by a semantics: they will not just be a bunch of new elements that suddenly show up in an XML file, but will instead be tied to concepts in an ontology that have explicit relations to other entities.

More importantly, ontologies allow one to cross the boundaries between implicit and explicit representation of knowledge. A data format is simply an explicit representation of everything that I know. An ontology allows me to describe the implicit relations between the components of knowledge, and those relations will enable the generation of new knowledge (without having to explicitly write it down!).

   <phyloxml xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.00/phyloxml.xsd">
     <phylogeny rooted="true">
       <name>Bcl-2</name>
       <clade>
         <events>
           <duplications>1</duplications>
         </events>
         <clade>
           <branch_length>0.21409500000000004</branch_length>
           <confidence type="bootstrap">33.0</confidence>
           <events>
             <duplications>1</duplications>
           </events>
           <clade>
             <branch_length>0.22234</branch_length>
             <confidence type="bootstrap">58.0</confidence>
             <events>
               <duplications>1</duplications>
             </events>
             <clade>
               . . .
                 <clade>
                   <name>51_CHICK</name>
                   <branch_length>0.28045</branch_length>
                   <taxonomy>
                     CHICK
                   </taxonomy>
                 </clade>

Many people find it natural to have nested sets of "clade" elements like this. What does the nesting mean? Does nesting mean that one clade has the is_a subclass relation to its containing clade? Or is it a part_of relation? If nesting means "is_a", then why are there "events" and "duplications" elements inside of a "clade", implying that these are "is_a" subclasses of "clade"? Clearly, nesting does not really tell us much. The syntax is well formed, but human readers supply all of the meaning hidden in the tags. Now let's ask a few questions about "duplications" and "speciations". These elements are assigned literal values of 1 or 0. Does this mean true or false? Are these *counts* of duplications or speciations? Are they probabilities? Even if we limit the schema so that these elements have values of 0 or 1, this does not resolve the question of what the values *mean*. An ontology could specify that, and it could tell us that "speciations" and "duplications" are disjoint with respect to an "event" instance, i.e., an event instance cannot be both a speciation and a duplication. By looking at the XML file in this way, we begin to see that the semantics are left entirely to the user to figure out. Expert humans read the hidden messages in the tags and they fill in this knowledge. Computers can't do that. This is why computers need ontologies. A computer cannot even distinguish "part_of" from "is_a" relations without being told specifically what type of relation applies. To a computer, an "edge" element is not "part_of" a "tree" element in an XML file unless we have an ontology to tell us that.
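The claim that nesting carries no meaning of its own can be checked with a short Python sketch using the standard-library XML parser. The fragment is a small, closed stand-in modelled on the excerpt above, not the original file, and the helper `containment_pairs` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# A small well-formed fragment modelled on the phyloXML excerpt (abbreviated).
doc = """
<clade>
  <events><duplications>1</duplications></events>
  <clade>
    <name>51_CHICK</name>
    <branch_length>0.28045</branch_length>
  </clade>
</clade>
"""

root = ET.fromstring(doc)

def containment_pairs(elem):
    """Yield (parent_tag, child_tag) for every nesting found in the document."""
    for child in elem:
        yield (elem.tag, child.tag)
        yield from containment_pairs(child)

pairs = list(containment_pairs(root))
for parent, child in pairs:
    print(parent, "contains", child)
```

The parser reports "clade contains clade" and "clade contains events" in exactly the same way. Whether a given containment means part_of, is_a, or something else has to come from outside the XML.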

Reviewer 2, minor points

Additional specific points:
• The methods section describes tests which were performed and provides an URL, but in the results a discussion of these tests is missing.
• In paragraph "(iii) Transitions, rules and models", for instance it is not clear which concepts will be important and need to be modeled in the ontology. A description of the biological mechanisms responsible for these transitions would help further.
• Explain TU earlier in the manuscript - what is the difference to OTU?
• The generic class "CDAOAnnotation" is a rather technical definition, which as the ontology grows will span an even more diverse set of concepts. Is this intended?
  • (Enrico) Yes, that is exactly the purpose of it. An ontology will never be closed, so there are concepts that are meant to be refined by the individual users; we should perhaps indicate some of the sub-concepts that are already present in there.
• Are the amino acid classes imported correctly? Since I cannot see relationships to e.g. AminoAcidResidue.
  • (Enrico) I am not sure I understand this point.
• What is the message of these sentences:
  o "Though powerful and generalizable, evolutionary analysis is difficult to apply."
  o "Since molecular biology and bioinformatics are, in many aspects, of enormous complexity, it is important to well understand the intended use of a new ontology "
  o "The task-oriented nature of comparative analysis is apparent in the way that some concepts are represented, eg, what makes an entity an OTU is that it plays a particular role in the analysis."
• "CDAO is designed to facilitate data interoperability" --> An example or an outlook on what this could look like would be very important!

Hackathon preparation

extending CDAO

see CDAO term request

test architecture

To apply the tests below, we need to have an xml parser with extensions to handle nexml logic. Then we need to be able to construct RDF triples from this, right? Then we need to be able to insert the triples in a reasoner and reason from them over the namespaces that have been identified.
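The first two steps (parse the XML, then emit triples) can be sketched as follows. This is only an illustration under stated assumptions: the document below is a simplified, namespace-free stand-in for a nexml file, the element and attribute names (`otu`, `meta`, `rel`, `href`) and the predicate strings are placeholders, and triples are plain Python tuples rather than objects from an RDF library. Loading the triples into a reasoner is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a nexml document (no namespaces).
DOC = """
<nexml>
  <otus id="tax1">
    <otu id="otuA" label="Danio rerio">
      <meta rel="cdao:has_external_reference" href="taxonomy:7955"/>
    </otu>
    <otu id="otuB" label="Homo sapiens"/>
  </otus>
</nexml>
"""

def xml_to_triples(text):
    """Return a list of (subject, predicate, object) tuples, one group per OTU."""
    root = ET.fromstring(text)
    triples = []
    for otu in root.iter("otu"):
        subject = otu.get("id")
        # "cdao:TU" is used here as an illustrative class name only.
        triples.append((subject, "rdf:type", "cdao:TU"))
        label = otu.get("label")
        if label is not None:
            triples.append((subject, "rdfs:label", label))
        # Each <meta> child becomes one annotation triple.
        for meta in otu.findall("meta"):
            triples.append((subject, meta.get("rel"), meta.get("href")))
    return triples

triples = xml_to_triples(DOC)
for triple in triples:
    print(triple)
```

Keeping the triples as simple tuples makes the translation step easy to check by hand before any reasoner is involved.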

test plan

We need to be able to test nexml files for syntax and also to apply semantic tests, based on reasoning over nexml annotations that refer to external namespaces. Here are ideas for tests, expressed in an informal language

  1. Taxonomy
    • OTU has_a cdao:annotation:external_reference (does this OTU have an external reference)
    • OTU is_a <external_taxonomy>:teleost (is this OTU a fish?)
    • OTU is_a <external_taxonomy>:primate (is this OTU a primate)
  2. Specimen provenance
  3. Tree type
    • tree is_a cladogram
    • tree is_a cdao:fully_resolved and cdao:species_tree
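As a rough illustration of how the taxonomy tests above might be evaluated once annotations are available as triples, the sketch below answers "does this OTU have an external reference" and "is this OTU a teleost" by following subclass links. Everything in it is an assumption made for the example: the triples, the `tax:` terms, and the predicate names `has_external_reference` and `subclass_of` are invented, and a real test would delegate this to a reasoner.

```python
# Hypothetical triples: OTUs linked to an invented external taxonomy.
TRIPLES = {
    ("otuA", "has_external_reference", "tax:zebrafish"),
    ("otuB", "has_external_reference", "tax:human"),
    ("otuC", "label", "unknown sample"),
    ("tax:zebrafish", "subclass_of", "tax:teleost"),
    ("tax:teleost", "subclass_of", "tax:vertebrate"),
    ("tax:human", "subclass_of", "tax:primate"),
    ("tax:primate", "subclass_of", "tax:vertebrate"),
}

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def has_external_reference(otu):
    return bool(objects(otu, "has_external_reference"))

def is_a(otu, target):
    """True if an external reference of the OTU is the target or a subclass of it."""
    frontier = list(objects(otu, "has_external_reference"))
    seen = set()
    while frontier:
        term = frontier.pop()
        if term == target:
            return True
        if term not in seen:
            seen.add(term)
            frontier.extend(objects(term, "subclass_of"))
    return False

print(is_a("otuA", "tax:teleost"), is_a("otuB", "tax:teleost"))
```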

Accomplishments

  • initial study (done Nov 2007)
  • developed initial draft of ontology in OWL (done April 2008)
  • created sourceforge project (done April 8)
  • cleaned up (property hierarchy; close classes; disjoint axioms) (done in April)
  • annotated (done May 2, Arlin)
  • integrate a domain ontology such as amino acids (done May 5, Enrico)
  • evaluated strengths and weaknesses of Protege (see Ontology Development Software; decided to stick with Protege)
  • evaluated draft ontology
    1. ontology can represent character data instances
    2. ontology provides computability of (some) useful queries
    3. ontology does not duplicate existing ontologies - see table in paper
    4. ontology integrates related ontologies for character domains, SUCH AS (at least one of):
    5. ontology is normalized or modular according to Rector
  • released the initial ontology version (May 20, 2008)
  • set up public web site: evolutionaryontology.org/cdao (Francisco)
  • prepare presentation for scientific meetings in Summer 2008 (done: Marseille (Francisco); to do: SSE)
  • wrote manuscript; submitted; rejected; re-submitted; accepted (February, 2009)



Further development and evaluation (plans and ideas)

More challenging inference

not sure how this would work; some ideas:

  • technology for implementing tests
    • could use Jena API to write simple test application to run such tests
    • more complex reasoning with RACE or other external reasoner
  • test challenges
    • correct computation of MRCA, lineage, and so on
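For the MRCA and lineage challenges, it may help to have an independent way to compute the expected answers that a reasoner-based test is checked against. The following is a minimal sketch in plain Python (not Jena or an external reasoner), assuming the tree is rooted and stored as a child-to-parent map; the node labels are illustrative.

```python
# Child -> parent map for the rooted tree ((A,B)AB,(C,D)CD)ABCD.
PARENT = {"A": "AB", "B": "AB", "C": "CD", "D": "CD", "AB": "ABCD", "CD": "ABCD"}

def lineage(node):
    """Return the path from a node up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def mrca(nodes):
    """Return the most recent common ancestor of a non-empty collection of nodes."""
    nodes = list(nodes)
    common = set(lineage(nodes[0]))
    for node in nodes[1:]:
        common &= set(lineage(node))
    # Walking up from one node, the first shared node is the MRCA.
    for candidate in lineage(nodes[0]):
        if candidate in common:
            return candidate
    return None

print(lineage("A"))
print(mrca(["A", "B"]), mrca(["A", "C"]))
```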

representation challenges

Can the ontology represent complex and diverse data sets?

  1. NeXML (Brandon at next evoinfo meeting?)
  2. NEXUS
  3. PANDIT and other formats?
  4. round-trip test (formal version of representation challenge)

feedback, community involvement

  • further develop web resource with documentation
  • need to be aggressive - get attention, hold workshop with interested people

Demonstration project ideas

Carry out demonstration projects to demonstrate utility of CDAO. The ideal properties for a demo:

  • significant biological problem
  • technical innovation (extends what is possible)
  • relies uniquely (or at least, relies critically) on CDAO

Possible projects

  1. FIGENIX human proteome history project (Julie, Francisco)
  2. phylogenetic profiles (Julie)
  3. functional inference generalization
  4. natural language processing (via CDAO) to create literature resource (Enrico, idea for possible ASU collaborators)
  5. TreeBASE input validator
  6. translation tools
  7. translate high-value content (PANDIT, KOGs, etc)
  8. other

Acute testing

To get started, we propose to use a test-driven strategy based on explicit tests of the basic concepts from the ConceptGlossary. Attached is prioritized_concept_list.txt (1 is highest priority, 3 is lowest). Here is how it works. Imagine we have a *high-level test language* and this is the code for testing the ontology on its implementation of the "ancestor" concept:

load_ontology("CDAO");
load_data("ancestor_test.nex");
statements = { "otuA is_a ancestor_of otuB", "htuAB is_a ancestor_of otuB" };
truth_value = { "false", "true" };
evaluate( statements, truth_value );

Here is the "ancestor_test.nex" file:

#NEXUS
BEGIN TAXA;
      dimensions ntax=4;
      taxlabels otuA otuB otuC otuD;
END;
BEGIN TREES;
      tree bush = [&R] ((otuA,otuB)htuAB,(otuC,otuD)htuCD)htuABCD;
END;
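To make the idea concrete, here is a minimal Python sketch of what the imagined test language would have to do for the "ancestor" test. It is not an ontology-based implementation: it parses the tree string directly and checks ancestry by walking parent links. It assumes every internal node is labelled and there are no branch lengths, and it simplifies the statement syntax to "X ancestor_of Y". The function names are invented for the example.

```python
import re

NEWICK = "((otuA,otuB)htuAB,(otuC,otuD)htuCD)htuABCD;"

def parent_map(newick):
    """Parse a fully labelled Newick string (no branch lengths) into child -> parent."""
    tokens = re.findall(r"[(),;]|[^(),;\s]+", newick)
    parent = {}
    stack = [[]]   # one list of children per currently open group
    last = None    # a list right after ')', otherwise a label or None
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            last = stack.pop()          # children waiting for their parent's label
        elif tok in ",;":
            last = None
        else:
            if isinstance(last, list):  # a label after ')' names an internal node
                for child in last:
                    parent[child] = tok
            stack[-1].append(tok)
            last = tok
    return parent

def is_ancestor(parent, a, b):
    """True if a lies on the path from b's parent up to the root."""
    node = parent.get(b)
    while node is not None:
        if node == a:
            return True
        node = parent.get(node)
    return False

def evaluate(parent, statements, expected):
    """Return, per statement, whether the computed answer matches the expected one."""
    results = []
    for statement, truth in zip(statements, expected):
        a, relation, b = statement.split()
        assert relation == "ancestor_of"
        results.append(is_ancestor(parent, a, b) == truth)
    return results

parent = parent_map(NEWICK)
statements = ["otuA ancestor_of otuB", "htuAB ancestor_of otuB"]
print(evaluate(parent, statements, [False, True]))
```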

I'm hoping to attach a tar file with tests for concepts, but the wiki does not like tar files. I can send it via email. The files come in pairs,

<concept><test_number>.nex
<concept><test_number>.tab

The first file is a NEXUS file with the data. The second file is a table of statements for evaluation, with fields statement_number, truth_value, statement. Right now I am using a three-valued logic (true, false, and unknown or indeterminate), e.g., if the tree is not rooted, then whether an internal node is the ancestor of some other node is indeterminate.
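A sketch of how a .tab file could be read and scored under the three-valued logic is given below. The tab delimiter, the literal spellings "true", "false" and "unknown", the sample rows, and the small parent map are all assumptions for illustration; the actual test files may differ.

```python
import csv
import io

# Hypothetical contents of a <concept><test_number>.tab file (tab-delimited).
TAB = "1\tfalse\totuA ancestor_of otuB\n2\ttrue\thtuAB ancestor_of otuB\n"

# Illustrative child -> parent map matching part of the example tree.
PARENT = {"otuA": "htuAB", "otuB": "htuAB", "htuAB": "htuABCD"}

def read_statements(text):
    """Parse rows of (statement_number, truth_value, statement)."""
    rows = []
    for number, truth, statement in csv.reader(io.StringIO(text), delimiter="\t"):
        if truth not in ("true", "false", "unknown"):
            raise ValueError(f"bad truth value in row {number}: {truth}")
        rows.append((int(number), truth, statement))
    return rows

def ancestor_answer(statement, rooted=True):
    """Answer 'X ancestor_of Y' with 'true', 'false' or 'unknown'."""
    a, _, b = statement.split()
    if not rooted:
        return "unknown"  # without a root, the direction of descent is undefined
    node = PARENT.get(b)
    while node is not None:
        if node == a:
            return "true"
        node = PARENT.get(node)
    return "false"

def score(rows, answer):
    """Map statement_number -> whether the computed answer equals the expected one."""
    return {number: answer(statement) == truth for number, truth, statement in rows}

rows = read_statements(TAB)
print(score(rows, ancestor_answer))
print(score(rows, lambda s: ancestor_answer(s, rooted=False)))
```

Because answers are compared as exact strings, an "unknown" result never counts as agreeing with an expected "true" or "false".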

Resources and demonstrations for development

More elaborate test data sets

Each data set comes with a tree and a character matrix in NEXUS format. To explore these data sets you may wish to:

There are four different categories of character sets:

  • DNA: aligned nucleotides coded via IUPAC standard (T, C, G, A, and so on)
  • protein: aligned amino acids coded via IUPAC standard (A, C, D, E, F, G, H, I and so on)
  • continuous: numeric values of continuous characters (e.g., 0.001, 0.230)
  • morphology: discrete morphological characters with ad hoc numeric encoding (e.g., 0 = absent, 1 = present)

The DNA data are "CDS" or "coding sequence" data, meaning the sequence of nucleotide triplets in the protein-coding part of a gene.

There are three grades of difficulty:

  • Simplified: small number of OTUs and characters; unambiguous states; single bifurcating tree
  • Typical: may contain many OTUs, multiple trees, polytomies, other stuff
  • Demanding: may contain ambiguous characters, mixed data types, notes, assumptions, etc.


type | difficulty | description | comments | NEXUS | CDAO
CDS (DNA) | Simplified | Subset of 10 ATPase CDSs | comments | PF00137_10_cds.nex | PF00137_10_cds.owl[[2]]
CDS (DNA) | Typical | Eukaryotic cytochrome C CDSs | comments | PF00034_39_cds.nex | PF00034_39_cds.owl[[3]]
CDS (DNA) | Typical | Eukaryotic ATPase CDSs | comments | PF00137_47_cds.nex | PF00137_47_cds.owl[[4]]
CDS (DNA) | Demanding | NA | comments | NA | NA
Protein (AA) | Simplified | Subset of 10 ATPases | comments | PF00137_10_protein.nex | PF00137_10_protein.owl[[5]]
Protein (AA) | Typical | Eukaryotic cytochrome Cs | comments | PF00034_39_protein.nex | PF00034_39_protein.owl[[6]]
Protein (AA) | Typical | Eukaryotic ATPases | comments | PF00137_47_protein.nex | PF00137_47_protein.owl[[7]]
Protein (AA) | Demanding | NA | comments | NA | NA
Continuous | Simplified | NA | comments | NA | NA
Continuous | Typical | Inhibitor sensitivity data for human kinases | -log(IC50) scaled | kinase_rescaled3_sets.nex | kinase_rescaled3_sets.owl[[8]]
Continuous | Demanding | NA | comments | NA | NA
Morphological | Simplified | NA | comments | NA | NA
Morphological | Typical | Nematode vulval morphology and development | Kiontke, et al., 2007 | Kiontke_CB_fixed.nex | NA
Morphological | Demanding | NA | comments | NA | NA

Meeting Notes

electronic conversation on manuscript revisions, March, 2009

this section has been moved to #Manuscript_revisions

Telecon, 15 September, 2008

participating: Julie, Arlin, Enrico, Francisco, Brandon

  1. Paper submission
    • format: Julie has done everything but EndNote references (arlin needs to re-format these)
    • venue: bmc evol biology
    • referees: people we haven't published with.
      • ad hoc list: Fabricio Santosm, Jim Leebens-Mack, David Maddison, Wayne Maddison, Jim Balhoff, Miguel Andrade, Manolo Gouy
      • Enrico (*) to send suggestions (name, affiliation, contact info)
      • Francisco (*) to send suggestions
  2. Ontology submission to OBO
    • Enrico will do this
  3. Next evoinfo meeting
    • Rutger is planning this (Arlin too busy)
    • focus on nexml, MIAPA, CarrotBase
  4. Further development of CDAO
    • possible NESCent-sponsored group supplement meeting
      • Arlin will ask if supplement funds are still available
      • Enrico will do application for meeting funds from NESCent
    • Objectives for further development
      • enrich classes (Enrico's additions)
      • core relations and concepts
      • evaluation (see Supporting_MIAPA and CarrotBase projects on evoinfo wiki)


Telecon, 6 June, 2008

participating: Enrico, Julie, Arlin, Francisco

  1. website
    • francisco says yes
    • nescent says yes
  2. evaluation
    • worm file
  3. representation (implementation)
    • tree (arlin)
    • matrix (enrico)
    • homologizing concept, coordinate system
  4. MIAPA
    • phenote
    • web form: top-level, launch, phenote, verity, term-request

Telecon, 29 May, 2008

participating: Enrico, Julie, Arlin

  1. website
    • Arlin to ask francisco to take lead on this
    • work with NESCent a la nexml.org
    • transfer content
    • develop system to manage content
  2. nexml-CDAO
    • not much progress lately
    • Brandon working on C-based
    • Enrico working on Prolog-based
  3. sequence characters
    • practical issues: how to merge with upper and lower
    • chebi amino acid residues
    • bio-top
  4. manuscript
    • waiting for evaluation
  5. what do we need to wrap up this stage
    • groups of chars
    • evaluation

Telecon, 16 May, 2008

Web site (Francisco)

  • design criteria
    • stable home, public face for project
    • provides accessible description of CDAO
    • portal for those seeking information in this area
    • host for workflows projects
    • may host other services e.g., discussion board
  • Arlin to find out about content management
  • Home for CDAO URI?
    • not NESCent
    • could be our web site (preferred option)
    • or OBO (biggest player)
  • deadline ? not urgent

Submission to OBO

  • renewed our commitment to submit to OBO
  • Arlin will ask Julie if she could pursue this further

Evaluation (Enrico)

  • difficulties
  • many changes to ontology
  • nearest-common-ancestor
  • ran RDQL queries through Pellet external reasoner
  • racer does not like our ontology
    • doesn't like property chaining
    • purports to support OWL 1.1 but does not
  • compute cost of path across tree, can't do that directly with reasoner
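The path-cost computation noted in the last point is straightforward outside a reasoner. A minimal sketch, assuming a rooted tree stored as a child-to-(parent, branch length) map with non-negative lengths; the node names and lengths are invented:

```python
# child -> (parent, branch_length); values are invented for illustration.
EDGES = {
    "otuA": ("htuAB", 0.1),
    "otuB": ("htuAB", 0.2),
    "htuAB": ("root", 0.3),
    "otuC": ("root", 0.5),
}

def distances_to_ancestors(node):
    """Map node and each of its ancestors to the summed branch length up to it."""
    dist = {node: 0.0}
    total = 0.0
    while node in EDGES:
        node, length = EDGES[node]
        total += length
        dist[node] = total
    return dist

def path_cost(a, b):
    """Sum of branch lengths on the path between a and b via their nearest common ancestor."""
    up_a = distances_to_ancestors(a)
    up_b = distances_to_ancestors(b)
    shared = set(up_a) & set(up_b)
    if not shared:
        raise ValueError("nodes are not in the same tree")
    # With non-negative lengths, the nearest common ancestor gives the minimum.
    return min(up_a[n] + up_b[n] for n in shared)

print(path_cost("otuA", "otuB"), path_cost("otuA", "otuC"))
```

A hybrid design is one option here: let the reasoner answer the class-level questions and leave numeric aggregation such as this to ordinary code.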

Manuscript (Arlin)

  • will finish more revisions today
  • still not done
  • suggest pushing submission date back to June

Telecon, 9 May, 2008

Ongoing issues from last week:

  1. software choices:
    • arlin recommends to stick with Protege 4
      • latest build fixes several major bugs we have encountered
      • Protege has a large user base and an active developer community-- bugs being fixed
      • lots of discussion on mailing lists p4-feedback (protege 4 list) and protege-owl
    • but enrico notes problem with insufficient dl query interface
      • Brandon found free SPARQL engine called twinkle
      • Arlin notes that SPARQL (accessed via the Java API) is one of the topics discussed on protege-owl
    • our conclusion is that, at least for now, we should stick with Protege 4
  2. instance data
    • Enrico and Brandon are almost ready with their translators
      • Brandon has written a C++ program based on NCL
      • Enrico has written a Perl program based on Bio::NEXUS and prolog
        • the program reads in NEXUS and dumps out facts in CDAO RDF terms
  3. query-based evaluation (test ability to reason from representation of instance data)
    • this also is nearly ready to begin
    • depends on instance data translators described above
    • also depends on full DL queries as with SPARQL
  4. home page (we need a home page with a public orientation)
    • need to hear back from Francisco
    • as a start, Arlin added an intro paragraph to evoinfo/CDAO wiki
  5. annotation: done (Arlin)

Other issues

  1. Representation of nearest common ancestor is not good.
    • best we can do is represent-- we cannot make MRCA computable
    • current attempt to represent MRCA will not work
    • we need to have a set concept to use a binary relation
      1. OTU_set has_mrca Node
      2. Node is_mrca_of OTU_set
  2. Upper ontologies? Should we address this in manuscript?
    • Enrico says no need to worry about this now
  3. is this an application ontology or reference ontology?
    • also seems like an unimportant distinction

Telecon, 5 May, 2008

To do:

  1. find a more stable and interoperable replacement for Protege (Enrico, Arlin):
  2. develop instance data to illustrate encoding, for use in manuscript figures and supplementary data (very simple cases, e.g., based on Fig. 4)
  3. use simple queries on instance data to evaluate the ontology (Brandon, Enrico)
  4. develop a project home page that is written for other scientists and potential users, not just for project participants (Francisco)
  5. finish up ontology annotation (Arlin)

Telecon, 25 April, 2008

Telecon, 18 April, 2008

Working meeting march 24-april 4

Day 11, Thursday Apr 3, 2008

remains to be done

  • category 2 items from yesterday

Incorporated (provisionally) in Version 12:

  • imported equivalence class of amino acids SpecificAminoAcid from OWL amino-acid (www.co-ode.org/ontologies/amino-acid/2005/10/11/amino-acid.owl)
  • found classes for nucleotides in www.bioontology.org/files/4531/basic-vertebrate-gross-anatomy.owl, actually these are imported from http://www.co-ode.org/ontologies/basic-bio/top-bio.owl


Incorporated in Version 11:

  • gap state or missing data
  • homologous to
  • taxonomic link as TU annotation

Incorporated in Version 10:

  • lineage
  • transformation types

Incorporated in Version 9:

  • made Anotayshun CDAO_Annotation
  • added Ancestral_Node, Common_Ancestral_Node and MRCA_Node with restriction has_Descendants > 1
  • linked TU to Node with object property represented_by_Node (inverse represents_TU)
  • compound character class

Day 10, Wednesday Apr 2, 2008

The next sub-section of the ontology to be refined was the part representing the character state data matrix. Although this initially seemed to be a relatively simple structure, a number of complications were encountered because of the various data types we wanted to include (nucleotide, amino acid, continuous data and other discrete data, such as GO terms, EC numbers, anatomy, etc.) and the large number of properties attached to each class.

Two alternative representations were considered to take into account the different data types. The first tried to minimise the number of classes specific to each data type, however this turned out to be too difficult to represent with the OWL language. The second option defined a number of type-specific classes and although this is not an ideal structure with a number of duplications, the ontology structure was simplified. The second option defines a set of restrictions that will allow us to check data for consistency with the ontology reasoner.

At this stage the validity of the draft ontology was tested by adding instance data into Protege and running a reasoner to check for inconsistencies.

In addition, we

  • generalized edge annotation concept
  • allowed for residues to have coordinates in a sequence
  • learned more about the bugs-features of reasoners in Protege 4
    • FaCT++ does not seem to work in the Mac version
    • errors in instance data trigger a Java fault (not reasoner error) for both FaCT++ and Pellet
    • DL Queries are limited to classes and simple conjunctions with properties
Remaining issues:

1. ontology classes and properties

  • In the topology sub-section: notions of lineage, MRCA, subtree.
  • Links between tree and TU
  • In the character state data matrix sub-section: sequential character types and positional information.
  • Transformation concepts that will be included as edge_annotation in the tree topology

2. general issues

  • Linking CDAO to other ontologies: amino acid, GO, SO, NCBI taxonomy
  • Mapping to MAO concepts
  • Text annotations of classes and properties i.e. human-readable definitions of all CDAO concepts
  • Submission to OBO

Day 9, Tuesday Apr 1, 2008

We spent the first part of the day becoming familiar with the latest version of Protege 4 and other tools such as the ontology DL and the OWLviz visualization. Updating from version 3 of Protege caused a number of compatibility problems, but the extra features, especially the visualization tool, were considered important.

We then decided to concentrate on a number of sub-sections of the ontology, starting with the tree topology. The issues raised on day 8 were all addressed during this session. The most important decisions were the kinds of topologies to include in the ontology (rooted/unrooted trees, more general networks or graphs, ...) and the representation of direction in rooted trees. The concept of parent/child or ancestor/descendant nodes connected by an edge proved to be non-trivial to describe in the OWL language with the limitation of properties linking only two classes. This was overcome with the facility of 'chaining properties'.

Day 8, Monday Mar. 31, 2008

Julie and Francisco reviewed the complete draft ontology built during the first week and compiled a list of questions to address with the full team during the next week:

  1. Do we want to differentiate between traits and characters?
  2. How do we represent the tree topology and what do we need to differentiate between rooted and unrooted trees?
  3. What properties do we need for edges and nodes? How do we define directed edges for rooted trees?
  4. Do we want to define a minimum ontology with only basic concepts, or do we want to include other concepts that could be derived from the basic ones?

Day 6 to 7, Sat to Sun, Mar. 29 to 30, 2008

The work done by Brandon and Francisco was handed over to the team working in the second week (Julie, Enrico, Arlin).

Day 5, Friday Mar. 28, 2008

1. Adding new concepts to the ontology

Today Francisco reviewed Arlin's entire list of concepts and added to the ontology all that he could clearly understand and describe. To do for Monday: (1) take a look into this list together and (2) try to verify if the concepts are clearly and well represented or whether different representations can be suggested.

Some concepts at the bottom of the list (many of the priority 3 ones), although very relevant in evolutionary biology, seem very difficult to include in the current ontology, specifically the ones related to population genetics. Maybe we would need another "sister ontology" to describe these, although it may be possible to link them to the concepts already present in CDAO.


2. Gathering new information to write a final version of paper

In addition, Francisco has been looking into the CDAO manuscript and gathering information from the web to try to fill it with more information. He has made a list of references on biomedical ontologies retrieved from PubMed. These have been downloaded and saved in a MS-word .doc format for use with endnote.


3. Writing the algorithm

Brandon had another great day of coding. He has finished the overall design of the system, but we have not yet had a chance to test it. He plans to work on it more this weekend after getting back to Las Cruces, and will send an updated version by Monday morning.

Day 4, Thursday Mar. 27, 2008

1. We have now produced a more consistent version of the ontology presenting (almost) all priority 1 concepts and also some priority 2 ones -- we've missed just a couple of priority 1 concepts that we didn't understand very well and that we'll be able to add into the ontology next week, after discussion with the other members of the group.

2. We believe this version of the ontology is much clearer and the relationships between classes are better described. Some concepts considered as classes before are now represented as object_properties or datatype_properties (and vice-versa). We have also restricted some of the datatype_properties to a limited set of values, avoiding misrepresentation of data. We think we've found a good representation of some difficult inter-related concepts such as the relationship between transformation, branch modification, character, OTU and character-state modification. I hope that we can refine this representation further during the next week.

3. Brandon has finished a preliminary version of his algorithm that reads and interprets the NEXUS files and tomorrow morning he'll be adding to it the module that reads our new ontology -- he has an idea of some modules to use and he thinks there will be no problem with this. We hope to finish the day tomorrow with the complete and validated-by-hand representation of at least 2 nexus files in an ontology XML format.

Day 3, Wednesday Mar. 26, 2008

1. Optimizing the ontology

Today, we began by discussing the two simplified versions of the ontology we made yesterday (each of us made our own simplified version). In the end, we realized that the original ontology made on Monday, containing all the concepts, was not very well encapsulated, and we decided to begin a new one. We reviewed the best descriptions made by each of us and produced a cleaner and more consistent ontology from them. We've discussed differences in data representation and have come to an agreement on the best way to represent different kinds of data.

Although the new ontology is cleaner, more understandable and the concepts are inter-related in a better way, it still lacks some synonyms and some important concepts. We'll add them by importing from the original complete ontology in a step-by-step manner, testing each concept and their relationships before adding the next one.


2. Automating the representation of test sets

Since we spent lots of time yesterday afternoon and this morning trying to represent 3-OTUs with 5-characters in the Protégé ontology manually, in the afternoon we decided that we would need at least a very preliminary algorithm to read the input test files made by Arlin and translate them into a file to be read and checked inside Protégé.

Brandon has spent the afternoon producing this algorithm (although he hasn't finished it yet, he has advanced well). In the meantime, Francisco continued to look into the simplified ontologies we have made and to add new concepts into them. Although they still lack many of Arlin's concepts with priority 1, we think that these new ontologies we have made beginning from zero are more internally-consistent and they will allow better representation of the data than the original one produced on Monday.

Day 2, Tuesday Mar. 25, 2008

1. Revisions

We added synonyms to the ontology, in the needed places.

We also separated characters and their related classes and properties into a separate ontology in order to better encapsulate these elements so that they could be refined in isolation without disturbing the other parts. This additionally helps to reduce confusion between properties while working.

2. Examples

We started work encoding the examples provided on the Wiki page.

This encoding is not yet complete, but we are making progress, and have identified and made a few necessary changes to fix earlier errors such as relating traits/characters to edges rather than OTUs.

3. Testing and Protege Training

As part of the testing process we each made simplified versions of the ontology and worked on encoding the examples, so that we could identify the critical components, transfer knowledge about Protege, and also work out problem spots in a simple environment where they would most likely be easier to fix. Additionally, the import system has proven to be somewhat brittle, so while the encapsulation is desirable, until each of the sub-parts is stable it is easier to work with them as a single ontology file.

4. Updates available

We have uploaded the current versions of our work to [[9]]. It's now available as both OWL and Protege Project files.

Day 1, Monday

We began by checking the concepts in the prioritized_concept_list, trying to make them available in the current version of the ontology. Most of the concepts were added in the tree subsection, although a number of them were shown to be redundant or better represented as other terms and relationships. We have also converted some terms that were classes to properties, and others from properties to classes, in the context of an OWL representation.

First we grouped the terms in related groups:

  • Group 1 - TU related
    • Descendant
    • Ancestor
    • HTU
    • Hypothetical Taxonomic Unit
    • Most Recent Common Ancestor
    • MRCA
    • Operational Taxonomic Unit
    • OTU
    • Outgroup
    • Leaf node
    • Terminal node
    • Root
    • Basal
  • Group 2 - Tree related
    • Branch support
    • Tree
    • Unresolved
    • Cladogram
    • Dichotomy
    • Edge
    • Fully resolved
    • Monophyly
    • Network
    • Bifurcation
    • Phylogenetic Tree
    • Phylogenetic Tree Topology
    • Bipartition
    • Bootstrap support
    • Branch
    • Subtree
    • Lineage
    • Topology
    • Polytomy
    • Unrooted
  • Group 3 - Character related
    • Trait
    • Character
    • Character-state
    • Character-State Data Matrix
    • Derived
    • Apomorphy
    • Primitive
    • State
    • Missing data
  • Group 4 - Others
    • Gap
    • Indel
    • Homology
    • Polymorphism
    • Taxon
    • Taxonomic Rank

Then, we defined the synonymous usage of terms. When terms are synonymous in concept or representation, we chose just one of them to present.

  • Group 1
    • HTU = Hypothetical Taxonomic Unit = Ancestor
    • Leaf node = OTU = Operational Taxonomic Unit = Terminal node
    • Descendant = Child
    • Root
    • Outgroup
    • Most Recent Common Ancestor = MRCA
    • Basal

(These two concepts may be derived from an algorithm reading the ontology-annotated file, but they are not explicitly defined in the ontology itself. The information is there, but no specific concept is provided. If we choose to represent all the MRCA of all OTU/HTU and which TUs are more or less basal than other ones, we think the representation file would be very big.)

  • Group 2
    • Tree = Cladogram = Network = Phylogenetic Tree
    • Dichotomy = Fully resolved = Bifurcation = Monophyly = Bipartition
    • Edge = Branch
    • Polytomy = Unresolved
    • Unrooted
    • Subtree = Lineage
    • Branch confidence level = Branch support = Bootstrap support

(here we used confidence level as it can support any confidence analysis, even if bootstrap is the most used)

    • Topology = Phylogenetic Tree Topology

(the topology is something we need to have to build ontology-based representations; it is imported from the NEXUS file and it can be retrieved from the ontology file through child-parent relationships)

  • Group 3 - Character related
    • Trait: Defined as any characteristic of the TU that the annotator would like to describe
    • Character: Defined as the characteristics used for evolutionary classification
    • State = Character-state
    • Derived = Apomorphy
    • Primitive
    • Missing data
    • Character-State Data Matrix

(This would be in the input file of the ontology and could also be retrieved from the ontology-annotated file by algorithms)


  • Group 4 - Others
    • Gap = defined in the transformation
    • Indel = defined in the transformation
    • Homology
    • Polymorphism = we didn't understand what it means
    • Taxon = defined as a property of an OTU
    • Taxonomic Rank


Once all these concepts were defined and added to the ontology, we began to make a simple representation of a simple hypothetical dataset. During this preliminary representation we found some errors and changed some concepts from properties to classes (such as the branch one). Moreover, we had some difficulty working with Protégé, since it still seems to be a beta release, and each time we found something that would be better represented by changing the concepts slightly, we had to rebuild and manually re-enter all the concepts in our test set.

  • Question : can synonyms be represented in Protégé? I think it would be useful for scientists to be able to choose the term they want to use.

telecon, 14 March, 2:00 UTC

skipped this

Telecon, 7 March, 2008

present: Francisco Prosdocimi, Julie Thompson, Enrico Pontelli, Arlin Stoltzfus

What activities to do before the meeting? Plan for development?

  1. represent 4 simple test cases
    1. nt alignment plus tree
    2. prot alignment plus tree
    3. kinases with inhibitor sensitivity
    4. worm morphologies
  2. carry out operations with reasoning
    1. set and logic operations on characters and OTUs
    2. tree operations (clade selection, prune)
    3. other?
  3. map ontology to other representations
    1. NEXUS
    2. neXML
  4. start compiling list of concepts that are missing
    1. review Enrico's proposal
  5. look ahead to future challenges
    1. genetic encoding of characters
    2. ambiguous, multi-dimensional, or otherwise complex characters

Other issues for meeting and for paper

  • what is the scope?
  • How to integrate with other ontologies?
    • table from 'related artefacts' exercise
    • genetic code as a test case for integration
      • requires nt aa mapping to specify code
      • requires species taxonomy to assign code to species
      • requires cell ontology to assign code to compartmental genome (nuc, mito, cp)

Next meeting

  • telecon, 14 March, 2:00 pm UTC
  • agenda
    • nt and prot test data sets (arlin)
    • protege demo (brandon)

Related Work

  • we are working on a direct generation of an ontology from the Concept Glossary. We are documenting the progress at this page. Note that the page is not up-to-date at this moment (hopefully it will be by the end of the day or tomorrow [3/18/2008]). The goal is to eventually show that CDAO can map over all these concepts.