Difference between revisions of "CDAO"
Revision as of 16:25, 11 March 2008
Comparative Data Analysis Ontology
The material previously on this page has been moved to CDAOManuscript.
This page is for ongoing work and contains links to supporting docs, past work, and sub-topics.
Test Data Sets
Each data set comes with a tree and a character matrix. Where possible, the data are given in two formats, NEXUS (.nex) and NeXML (.nexml).
There are four different categories of character sets:
- cds: aligned nucleotides coded via IUPAC standard (T, C, G, A, and so on)
- protein: aligned amino acids coded via IUPAC standard (A, C, D, E, F, G, H, I and so on)
- continuous: numeric values of continuous characters (e.g., 0.001, 0.230)
- morphology: discrete morphological characters with ad hoc numeric encoding (e.g., 0 = absent, 1 = present)
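As a rough illustration of how the four categories differ, the sketch below checks one row of a character matrix against its category. The function name `invalid_states`, the gap and missing-data symbols (`-` and `?`), and the exact alphabets are assumptions made for this example; they are not taken from the test data sets or from CDAO itself.

```python
# Illustrative check of character states per category (assumed alphabets).
IUPAC_NT = set("ACGTURYSWKMBDHVN")         # IUPAC nucleotide codes, incl. ambiguity codes
IUPAC_AA = set("ACDEFGHIKLMNPQRSTVWYBZX")  # 20 standard amino acids plus B, Z, X
GAP_OR_MISSING = {"-", "?"}                # assumption: common NEXUS-style symbols

ALPHABETS = {
    "cds": IUPAC_NT | GAP_OR_MISSING,
    "protein": IUPAC_AA | GAP_OR_MISSING,
}


def invalid_states(states, category):
    """Return the set of states in one matrix row that do not fit the category.

    `states` is a string (one symbol per character) for cds, protein and
    morphology rows, or a list of strings for continuous rows.
    """
    if category in ALPHABETS:
        allowed = ALPHABETS[category]
        return {s for s in states if s.upper() not in allowed}
    if category == "continuous":
        bad = set()
        for s in states:
            try:
                float(s)
            except ValueError:
                bad.add(s)
        return bad
    if category == "morphology":
        # Ad hoc numeric encoding: accept single digits such as 0 and 1.
        return {s for s in states if not (s.isdigit() or s in GAP_OR_MISSING)}
    raise ValueError(f"unknown category: {category}")


print(invalid_states("ATGCRY-N", "cds"))
print(invalid_states(["0.001", "0.230"], "continuous"))
```

A check of this kind only covers the Simplified and Typical grades; Demanding data sets with mixed data types would need a category per character rather than per matrix.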
There are three grades of difficulty:
- Simplified: small number of OTUs and characters; unambiguous states; single bifurcating tree
- Typical: may contain many OTUs, multiple trees, polytomies, and similar complications
- Demanding: may contain ambiguous characters, mixed data types, notes, assumptions, etc.
|Data type||Grade||Description||Comments||NEXUS file||NeXML file|
|CDS (DNA)||Simplified||Subset of 10 ATPase CDSs||comments||PF00137_10_cds.nex||PF00137_10_cds.nexml|
|CDS (DNA)||Typical||Cytochrome C CDS sequences||comments||PF00034_39_cds.nex||PF00034_39_cds.nexml|
|CDS (DNA)||Typical||Full set eukaryotic ATPase CDSs||comments||PF00137_47_cds.nex||PF00137_47_cds.nexml|
|Protein (AA)||Simplified||Subset of 10 ATPase sequences||comments||PF00137_10_protein.nex||PF00137_10_protein.nexml|
|Protein (AA)||Typical||Full set eukaryotic ATPases||comments||PF00137_47_protein.nex||PF00137_47_protein.nexml|
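The file names in the table appear to follow a regular pattern, which makes it easy to index the data sets in a script. The sketch below splits a name into its parts. Reading the `PF…` prefix as a family accession and the middle number as the OTU count is an assumption based on the descriptions ("Subset of 10 ..."), and the helper name `parse_dataset_name` is hypothetical.

```python
import re

# Assumed pattern: <family accession>_<number of OTUs>_<category>.<extension>
NAME_PATTERN = re.compile(r"^(PF\d{5})_(\d+)_(cds|protein)\.(nex|nexml)$")
FORMATS = {"nex": "NEXUS", "nexml": "NeXML"}


def parse_dataset_name(filename):
    """Split a test data set file name into family, OTU count, category, format."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    family, n_otus, category, extension = match.groups()
    return {
        "family": family,
        "n_otus": int(n_otus),  # assumption: the number counts OTUs
        "category": category,
        "format": FORMATS[extension],
    }


for name in ["PF00137_10_cds.nex", "PF00034_39_cds.nexml"]:
    print(parse_dataset_name(name))
```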
Telecon, 7 March, 2007
Present: Francisco Prosdocimi, Julie Thompson, Enrico Pontelli, Arlin Stoltzfus
What activities to do before the meeting? Plan for development?
- represent 4 simple test cases
  - nt alignment plus tree
  - prot alignment plus tree
  - kinases with inhibitor sensitivity
  - worm morphologies
- carry out operations with reasoning
  - set and logic operations on characters and OTUs
  - tree operations (clade selection, prune)
- map ontology to other representations
- start compiling list of concepts that are missing
- review Enrico's proposal
- look ahead to future challenges
  - genetic encoding of characters
  - ambiguous, multi-dimensional, or otherwise complex characters
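To make the operations named in the list above concrete (set operations on characters and OTUs, clade selection, pruning), here is a minimal sketch in plain Python. It does not use the ontology or a reasoner: the nested-tuple tree, the dictionary matrix, and all function names are assumptions chosen for illustration, and serve only to show what result each operation is expected to produce.

```python
def leaves(tree):
    """Return the set of OTU labels under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def select_clade(tree, otus):
    """Return the smallest subtree that contains all of the given OTUs."""
    otus = set(otus)
    if not otus or not otus <= leaves(tree):
        raise ValueError("OTUs must be a non-empty subset of the tree's leaves")
    if not isinstance(tree, str):
        for child in tree:
            if otus <= leaves(child):
                return select_clade(child, otus)
    return tree


def prune(tree, drop):
    """Remove the OTUs in `drop`, collapsing nodes left with a single child."""
    if isinstance(tree, str):
        return None if tree in drop else tree
    kept = [p for p in (prune(child, drop) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def variable_characters(matrix, otus):
    """Return indices of characters with more than one state among `otus`."""
    rows = [matrix[otu] for otu in otus]
    return {i for i, column in enumerate(zip(*rows)) if len(set(column)) > 1}


# Hypothetical toy data: five OTUs, four nucleotide characters.
tree = (("A", "B"), ("C", ("D", "E")))
matrix = {"A": "ACGT", "B": "ACGA", "C": "TCGA", "D": "TCGA", "E": "TCCA"}

clade = select_clade(tree, {"C", "D"})
print(clade)
print(prune(tree, {"A"}))
# Set operation: characters variable in the whole tree but not within the clade.
print(variable_characters(matrix, leaves(tree)) - variable_characters(matrix, leaves(clade)))
```

An ontology-based implementation would express the same selections as queries over classes and properties; the plain functions here only fix the expected answers for small test cases.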
Other issues for meeting and for paper
- what is the scope?
- How to integrate with other ontologies?
  - table from 'related artefacts' exercise
  - genetic code as a test case for integration
    - requires nt-to-aa mapping to specify code
    - requires species taxonomy to assign code to species
    - requires cell ontology to assign code to compartmental genome (nuc, mito, cp)
- telecon, 14 March, 2:00 pm UTC
  - nt and prot test data sets (Arlin)
  - Protégé demo (Brandon)