CDAOShortTermVisitorProposal

From Evolutionary Informatics Working Group

Here are the application contents from the instructions issued by NESCent:

  1. Title (80 characters max)
  2. Name and contact information
  3. Project Summary (200 words max) - appropriate for public distribution on the NESCent web site
  4. Introduction and Goals - What important evolutionary problem is being addressed?
  5. Proposed Activities and Rationale for NESCent Support - Why can this activity be most effectively conducted at NESCent?
  6. Proposed Collaborators (if any) and Timetable (including start date)
  7. Anticipated Results
  8. Short CV of the Applicant (2 pages)

We can pull some of this stuff from our grant proposal and from the General Ontology and CDAO pages.


Title

Initial Draft of a Comparative Data Analysis Ontology (CDAO)

Name and Contact Information

Julie Thompson, Institut de Génétique et de Biologie Moléculaire et Cellulaire, 1 rue Laurent Fries, B.P. 10142, 67404 Illkirch Cedex, France. Tel: +33 388653497 E-mail: julie.thompson@igbmc.u-strasbg.fr

Francisco Prosdocimi, Institut de Génétique et de Biologie Moléculaire et Cellulaire, 1 rue Laurent Fries, B.P. 10142, 67404 Illkirch Cedex, France. Tel: +33 388653200 E-mail: francisco.prosdocimi@igbmc.u-strasbg.fr

Project Summary

200 wds max

A primary goal of bioinformatics is to make inferences, often based on a comparative approach in which patterns of similarities and differences reveal clues to function. Evolutionary theory provides a powerful framework for comparative biology by converting similarities and differences into events reflecting causal processes. Although powerful tools exist for some applications of evolutionary analysis, they remain under-utilized because of the lack of an appropriate informatics infrastructure. In this context, one of the agreed-upon goals of the NESCent evoinfo working group was to develop a formal comparative data analysis ontology (CDAO) aimed at making it easier for biomedical researchers to apply evolutionary methods of inference to diverse types of data in fields such as genomics, proteomics and systems biology.

The short term visitor project proposed here will permit a close interaction between domain experts (evolutionary biologists) and computer scientists, in order to specify the most pertinent formalization, representation and evaluation strategies and to implement an initial draft version of the CDAO. Subsequently, the project collaborators will co-author a publication that will serve as a ‘call for participation’ to other researchers in the field who might be interested in contributing to the future development of the ontology.

Introduction and Goals

  • Introduction

The recent availability of the complete genome sequences of a large number of model organisms, together with the immense amount of data being produced by a number of new high-throughput technologies (transcriptomics, proteomics, interactomics, etc.), means that we can now begin comparative analyses to understand the mechanisms involved in the evolution of the genome and their consequences in contemporary biological systems. Phylogenetic approaches provide a unique conceptual framework for performing comparative analyses of all this data, for propagating information between different systems and for predicting or inferring new knowledge. Consequently, phylogenetic inference systems are now playing an increasingly important role in most areas of high-throughput genomics, including studies of promoters (phylogenetic footprinting), interactomes (based on the presence and degree of conservation of interacting proteins), and comparisons of transcriptomes or proteomes.

While powerful tools exist for some applications of evolutionary analysis, they remain under-utilized because the lack of an appropriate informatics infrastructure makes evolutionary approaches relatively inaccessible and difficult to use. The traditional approach to software for evolutionary studies relies on independent task-specific services and applications that use different, often idiosyncratic, input and output formats and are frequently not designed to inter-operate. To achieve interoperability, it must be possible to store, transfer, retrieve, and re-use data.

Recently, the utility of ontologies has been clearly demonstrated in several biological domains for the organisation and management of biological knowledge. Ontologies provide an ideal means of representing the fundamental concepts in a domain and the relationships that exist between them. They are used for automatic annotation of data, for the sharing of information from different resources and for the presentation of domain knowledge to researchers, in particular to non-experts in the specific field. A number of biomedical ontologies have been developed, such as the Gene Ontology or the Sequence Ontology, that provide a formal representation of the data for automatic, high-throughput data parsing by computers. These ontologies are now being exploited in the new information management systems that are being developed to allow large scale data mining, pattern discovery and knowledge inference.

In this context, one of the agreed-upon goals of the NESCent evoinfo working group was to develop a formal ontology for the domain of comparative data analyses, CDAO (Comparative Data Analysis Ontology). The project aims to make it easier for biomedical researchers, and for those who develop the software they use, to apply evolutionary methods of inference to diverse types of data, so as to integrate this more powerful framework for reasoning into their research on genomics, proteomics, cancer, systems biology and so on.

  • Goals

The general goal is to develop a formally specified ontology that provides extensive conceptual coverage, guaranteeing the ability to express concepts encountered in evolutionary analysis use cases and appearing in common data formats and application interfaces. The ontology will be formalized using a standard ontology definition and exchange language (OWL). Developing such a domain ontology will require a close interaction between domain experts (evolutionary biologists) and computer scientists. The specific aims of this short term visitor proposal are to:

  • Delineate the domain by examining use cases in evolutionary comparative analysis
  • Identify and analyze related artifacts
  • Identify core concepts and relations of the domain
  • Formally encode core concepts and relations via an ontology language
  • Define the evaluation and refinement strategy, using description logics and practical research projects

Proposed Activities and Rationale for NESCent Support

  • Proposed Activities

a. Study use cases that delineate the domain of interest. The development of robust and general methods requires consideration of a broad set of use cases. During the meeting, we plan to collect a token set of example data to cover basic uses, including one molecular example with protein-coding gene family data and one systematics example with species having discrete and continuous characters.

b. Identify related artefacts and ontologies. The identification of related ontologies is important for our goals of avoiding duplication and allowing connections with other ontologies. Some existing formalisms were considered during the last working group meeting in November 2007. During the proposed meeting, these will be further analyzed to identify key concepts and relations that should be imported into CDAO.

c. Identify core concepts and relations. The next step will be to identify core concepts, relations and operations and to represent them in a glossary of terms. An initial glossary developed by Stoltzfus for this project is available at the NESCent wiki for the evolutionary informatics working group, and will be refined and extended at the proposed meeting.

d. Formalize the ontology and implement its representation. The ontology will then be formalized in terms that are unambiguous and that can be formally investigated. Research in the field of ontology design has identified Description Logics, and their instances in OWL (OWL-DL and OWL-Lite), as suitable logical frameworks to formalize and investigate ontologies. The encoding in a formal language will allow us to apply standard tools to prove correctness and consistency properties of the ontology. CDAO will subsequently be translated into the OBO format, for compatibility with the most widely used biological ontologies, available from the OBO website (http://obofoundry.org/).
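As a rough illustration of what encoding core concepts and relations in OWL might look like, the sketch below writes a handful of classes and one object property in Turtle, one of the standard text syntaxes for OWL. The concept names (Tree, RootedTree, Node, Character, has_node) and the example.org namespace are placeholders invented for this sketch; they are not the agreed CDAO vocabulary, and the choice of Turtle and Python is an assumption rather than part of the proposal.

```python
# Minimal sketch: serialize a toy class hierarchy as OWL in Turtle syntax.
# All concept names and the namespace below are hypothetical placeholders,
# not the actual CDAO vocabulary.

PREFIXES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ex": "http://example.org/cdao-sketch#",  # placeholder namespace
}

# class name -> parent class (None for a top-level class)
CLASSES = {
    "Tree": None,
    "RootedTree": "Tree",
    "Node": None,
    "Character": None,
}

# property name -> (domain class, range class)
PROPERTIES = {
    "has_node": ("Tree", "Node"),
}


def to_turtle(classes, properties, prefixes=PREFIXES):
    """Return a Turtle document declaring the given classes and properties."""
    lines = [f"@prefix {p}: <{iri}> ." for p, iri in prefixes.items()]
    lines.append("")
    for name, parent in classes.items():
        stmt = f"ex:{name} a owl:Class"
        if parent is not None:
            stmt += f" ;\n    rdfs:subClassOf ex:{parent}"
        lines.append(stmt + " .")
    for name, (domain, range_) in properties.items():
        lines.append(
            f"ex:{name} a owl:ObjectProperty ;\n"
            f"    rdfs:domain ex:{domain} ;\n"
            f"    rdfs:range ex:{range_} ."
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_turtle(CLASSES, PROPERTIES))
```

In practice an ontology editor or an established OWL library would likely be used instead of hand-written serialization; the point here is only that each glossary term becomes a class or property with explicit, machine-checkable relations.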

e. Define the evaluation and refinement strategy. The use of formal systems (OWL) for the encoding of our ontology enables the use of axioms and inference rules from the logical system to draw conclusions concerning relations between concepts that are not directly encoded in the ontology’s representation. Ontological reasoning is also essential when dealing with alignment of different ontologies. By using actual data from use-case examples to create instances in the ontology, a reasoning engine can check whether the expected relations follow from the ontology. In addition, we intend to subject the ontology to evaluation in the context of practical research projects. Ideally the ontology will be sufficient to represent all use cases of evolutionary analysis, and would thus require evaluation against all of them. For the present project, however, we propose to perform a narrower evaluation of the ontology based on projects that will be identified during the meeting.
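The evaluation idea described above, creating instances from use-case data and checking whether expected relations are derived, can be sketched with a toy example. The code below is not a description-logic reasoner; it only computes the transitive closure of subclass assertions and propagates instance types, which is enough to show a relation being inferred that was never asserted directly. All class and instance names are hypothetical.

```python
# Toy illustration of checking expected relations by inference.
# This is NOT an OWL-DL reasoner: it handles only subclass transitivity
# and type propagation. All names are hypothetical placeholders.

SUBCLASS_OF = {
    ("RootedTree", "Tree"),
    ("Tree", "Network"),
}

INSTANCE_OF = {
    ("tree1", "RootedTree"),  # an instance created from use-case data
}


def infer_subclass_closure(pairs):
    """Return the transitive closure of the (sub, super) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def infer_types(instance_of, subclass_closure):
    """Propagate each instance's asserted type up the class hierarchy."""
    inferred = set(instance_of)
    for individual, cls in instance_of:
        for sub, sup in subclass_closure:
            if sub == cls:
                inferred.add((individual, sup))
    return inferred


def check_expected(expected, inferred):
    """Map each expected (individual, class) fact to whether it was derived."""
    return {fact: fact in inferred for fact in expected}


if __name__ == "__main__":
    closure = infer_subclass_closure(SUBCLASS_OF)
    types = infer_types(INSTANCE_OF, closure)
    report = check_expected([("tree1", "Network"), ("tree1", "Node")], types)
    for fact, ok in report.items():
        print(fact, "derived" if ok else "NOT derived")
```

A real evaluation would load the OWL encoding into an existing reasoning engine, but the workflow is the same: assert instances, run inference, and compare the derived facts with the relations domain experts expect.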

  • Rationale for NESCent Support
    • invoke videoconferencing capacity?
    • are we going to use any NESCent staff?

Proposed Collaborators and Timetable

  • how many weeks, with which participants?
    • Julie: I would suggest 10 days to 2 weeks. Good dates for us are 6-15 March, or from 24 March and anytime in April. I assume we would ask for funding for one participant from France, so Francisco would come to NESCent, and I would teleconference as required.
  • Pontelli, Stoltzfus, Thompson - at least part of the time
  • Julie's grad student (Francisco Prosdocimi)
  • Vien Tran or Brandon

Anticipated Results

By the end of the meeting, we should have defined the implementation and evaluation strategies for CDAO. A draft ontology will be available, with key concepts and the relationships between them defined. The draft ontology will then be registered with the OBO collection of ontologies, which provides a means for distribution. CDAO will also be described in a publication, co-authored by the meeting attendees, that will serve as a ‘call for participation’ to other researchers in the field who might be interested in contributing to the future development of the ontology. We will also assemble a presentation, such as a poster or slideshow, to be shown at meetings and for NESCent to use.

The development of this ontology should facilitate the efficient exploitation of evolutionary information in functional genomics and systems biology projects. Evolutionary concepts are currently underexploited; in the future, CDAO (and automatic reasoning systems based on CDAO) should facilitate automatic, large-scale phylogenetic reconstruction and functional inference strategies. Such strategies will be fundamental for the development of the new fields of systems biology and synthetic biology, with important consequences in applied fields such as biotechnology, agronomy, medicine and pharmacology.

Short CV of the Applicant