Difference between revisions of "Current Standards"

From Evolutionary Informatics Working Group

Latest revision as of 15:37, 12 March 2009

Overview

Transparent exchange of data between computer programs requires standardized serialization schemes. Traditionally this means having a standard file format for data exchange. Supporting such standards may include clarifying or extending an existing standard, providing users and developers with software tools to use the standard, providing conversion between formats, and ensuring an upward conversion path to the next standard.

How can we increase interoperability by supporting current standards?

The place to begin is by identifying current standards and assessing available support. Current standards include file formats like NEXUS, MEGA, and PHYLIP, described on the Molecular Evolution Workshop page at Woods Hole (http://www.molecularevolution.org/resources/fileformats/). To what extent can these formats be validated? To what extent is the choice of format an irrelevant, transparent issue for the user who wishes to view, query, store, edit or analyze data?

Consider the simplest format, FASTA. We can view the data in a FASTA file using a standard text browser or word processor. Yet, how many computer programs that input FASTA files place an arbitrary limit on the length of a definition line? What about support for search, query, and edit operations? Even for a simple FASTA file, the possibility of arbitrary line breaks means that when we search for the sequence GAATTC using a standard text editor, we may fail to find it because the pattern is hard-wrapped from one line to the next. Without this kind of query capability, automatic editing is not possible.
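The hard-wrapping problem can be illustrated with a minimal Python sketch. The function names read_fasta and find_motif and the example record are hypothetical, introduced only for illustration; the sketch assumes a plain FASTA file with no comments or other extensions.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each definition line to its sequence.

    Sequence lines are concatenated, so hard line breaks disappear.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def find_motif(text, motif):
    """Return (definition line, 0-based offset) for every occurrence of motif."""
    hits = []
    motif = motif.upper()
    for name, seq in read_fasta(text).items():
        seq = seq.upper()
        start = seq.find(motif)
        while start != -1:
            hits.append((name, start))
            start = seq.find(motif, start + 1)
    return hits


# Hypothetical record in which GAATTC is split across two lines.
example = ">seq1 hypothetical example\nACGTGA\nATTCGG\n"

print("GAATTC" in example)            # the plain text search misses the site
print(find_motif(example, "GAATTC"))  # the format-aware search finds it
```

A plain substring search over the file fails here because a newline interrupts the pattern, while the format-aware search reports the site at offset 4 of the unwrapped sequence.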

The problem becomes much worse when we consider other file formats. For instance, PHYLIP files limit OTU names to 10 characters. In a NEXUS file, an OTU can be referred to by an identifier string called a "taxon label", or by a whole number that corresponds to the order in which an OTU was declared. This means that, if we wish to rename OTUs in a NEXUS file in order to achieve compatibility with PHYLIP limits, we cannot simply do a search-and-replace using the taxon label, because an OTU might be referenced by a number instead of a label. Clearly, then, in order to support these standards and hide the details of file formats from the user, we must have things like wrappers for PHYLIP programs that preserve long names, and editors for NEXUS that are aware of NEXUS semantics.
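One way a wrapper could preserve long names is to keep a translation table between the original labels and generated short names. The following Python sketch is only an illustration under stated assumptions: the function names, the "T" plus zero-padded number naming scheme, and the example labels are all hypothetical, and the labels are assumed not to need quoting.

```python
def make_translation_table(labels, width=10):
    """Map each long taxon label to a unique short name of exactly `width` characters."""
    table = {}
    for index, label in enumerate(labels, start=1):
        # Hypothetical naming scheme: "T" followed by a zero-padded counter.
        table[label] = "T" + str(index).zfill(width - 1)
    return table


def restore_labels(text, table):
    """Replace the short names in program output with the original long labels."""
    for label, short in table.items():
        text = text.replace(short, label)
    return text


labels = ["Homo_sapiens_mitochondrion", "Pan_troglodytes_mitochondrion"]
table = make_translation_table(labels)

# Hypothetical tree as a PHYLIP program might write it, using the short names.
tree_with_short_names = "(T000000001:0.1,T000000002:0.2);"
print(restore_labels(tree_with_short_names, table))
```

Because every generated name has the same fixed width and a distinct number, one short name can never be a prefix of another, so plain string replacement is safe for this scheme. It would not be safe for arbitrary truncated labels.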

Also, it may be clear from these examples that we can neither assess nor enhance support for standards without understanding the goals and practices of the user community.

Goals for the working group

At the first meeting (20-23 May, 2007), the working group decided on several deliverables relating to support for current standards. The most extensive efforts will be undertaken by Enrico, Gopal and Arlin as part of a larger project for which outside support will be sought.

Task 1: Assess current user practices

Task leaders: Sudhir, Arlin

Statement of problem: To develop support efficiently (applying the 80:20 rule), we need to know what most users are doing, and how they are doing it.

Approach: Use analyses of literature citations and inspection of software packages to determine the most common input and output formats and the most common operations.

To do:

  1. include numerical analysis of citations to different packages
  2. make a table of input and output formats used by common packages
  3. optionally, consider some additional questions
    1. how do users create data files?
    2. how do users edit data files? what are the most common edits (e.g., delete OTU)?
    3. do users archive data in a retrievable manner and, if so, how?
    4. what "dead-end" files are created?
    5. how do users visualize data in files?
    6. how do users query data in files?

Note that some parts of this will cross-fertilize with the goal of assembling a library of cases. While pursuing this goal, we can identify use cases and users who would be willing to contribute files.

Results

[Figure: Kumar-Figure 1.jpg]

Task 2: Collect examples of NEXUS flavors and pathologies

Task leader: Mark

Statement of problem: To develop support for NEXUS we need examples of different versions as well as pathologies.

Approach: Collect examples and link them to this page. Below is a template (change if desired) for a table. In deciding how to address the differences, we could take an ad hoc approach or we could build on the formal approach begun by Iglesias et al. in the attached paper ("Interoperability between bioinformatics tools: a logic programming approach").

Mark has written a small Python script to help determine when different programs interpret a NEXUS file differently. The code resides in the CIPRES svn repository, but it does not depend on the rest of CIPRES. The URL is http://nladr-cvs.sdsc.edu/svn/CIPRES/cipresdev/trunk/python-example/nexus-tool-testing, and "guest" as both username and password will work for read-only access (if you need write access, contact Mark). The script produces an XML document that summarizes the behavior of the parsers (stdin, stdout, and whether there was an error return code).
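The general idea of such a comparison can be sketched in Python as follows. This is not Mark's script: the function name, the XML element names, and the two stand-in "parsers" (small Python one-liners) are assumptions made for illustration. The sketch records stdout, stderr and the return code of each command.

```python
import os
import subprocess
import sys
import tempfile
import xml.etree.ElementTree as ET


def summarize_parsers(commands, nexus_path):
    """Run each command on nexus_path and record its output and return code as XML."""
    root = ET.Element("parser-comparison", {"file": nexus_path})
    for label, argv in commands.items():
        result = subprocess.run(argv + [nexus_path], capture_output=True, text=True)
        node = ET.SubElement(
            root, "parser", {"name": label, "returncode": str(result.returncode)}
        )
        ET.SubElement(node, "stdout").text = result.stdout
        ET.SubElement(node, "stderr").text = result.stderr
    return root


# Stand-ins for real NEXUS-reading programs (hypothetical, for illustration only).
commands = {
    "stand-in-ok": [sys.executable, "-c",
                    "import sys; print(open(sys.argv[1]).readline().strip())"],
    "stand-in-fail": [sys.executable, "-c", "import sys; sys.exit(2)"],
}

with tempfile.NamedTemporaryFile("w", suffix=".nex", delete=False) as handle:
    handle.write("#NEXUS\n")
    path = handle.name
try:
    report = summarize_parsers(commands, path)
finally:
    os.remove(path)

print(ET.tostring(report, encoding="unicode"))
```

Replacing the stand-in commands with the command lines of real programs would give one XML record per program for the same input file, which makes disagreements easy to spot.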

These files are the first set I have documented. They demonstrate the weak character typing issues mentioned on the NEXUS page. Different programs disagree about character encoding for some of these (similar) files, even in the parsimony context.

| File Link | Application Source | User Source | Description |
|---|---|---|---|
| No-equates.nex | PAUP/Mesquite | Mark Holder | |
| Dna-ext.nex | PAUP/Mesquite | Mark Holder | |
| Dna-prot.nex | PAUP/Mesquite | Mark Holder | |
| Prot-dna.nex | PAUP/Mesquite | Mark Holder | |

Task 3: Collect other file format examples

Task leaders: Enrico, Sudhir

Statement of problem: To do practical work in this area (generating validation and transformation servers), we need to have sample files.

Approach: Collect examples and link them to this page. Ideally these will be actual user files, but they could also be hypothetical test files that represent specific challenges. Eventually create a more formal library of cases.

To do:

  1. Create a list of formats
  2. Generate a table for each type of format. See the hack-a-thon "Supporting NEXUS" document (https://www.nescent.org/wg_phyloinformatics/Supporting_NEXUS_Documentation) for an example of how to create a wiki table for this.
  3. Populate the tables with examples of each type of format
  4. Examples of formatting errors (format does not fit nominal standard) and parsing errors (parser chokes on valid files): see "A list of existing formatting/parsing errors"

Results

Tree Formats

| Format Name | Synonyms | Most common use | Reference | Formal definition |
|---|---|---|---|---|
| Newick | New Hampshire | all (?) plain text formats | | Newick grammar by Gary Olsen |
| NHX | New Hampshire Extended | input to ATV | Zmasek and Eddy | NA |
| PhyloXML | | | http://www.phyloxml.org | |
| Phylip | | Tree specs from Phylip documentation | | http://www.cs.nmsu.edu/~epontell/info2.pdf |

Data matrix formats

| Format Name | Synonyms | Most common use | Reference | Formal definition |
|---|---|---|---|---|
| PHYLIP | | input for PHYLIP package | NA | http://www.cs.nmsu.edu/~epontell/info2.pdf |
| MEGA | | input for MEGA package | NA | http://www.cs.nmsu.edu/~epontell/mega.pdf |
| NEXUS | | input for PAUP* | Maddison, et al., 1997 | http://www.cs.nmsu.edu/~epontell/nexus/nexus_grammar |


Table of examples of different file formats

| File Link | File Type | Source | Comment |
|---|---|---|---|
| xxx.aln | clustalw (aln) | ClustalW | ND |
| xxx.phy | PHYLIP | ClustalW | ND |

Task 4: Common formats: formalization to support validation and translation services

Task leaders: Gopal, Enrico

Statement of problem: To support current standards by providing validation, interconversion, and storage, we need to define these standards precisely.

Approach: Develop formal grammars and evaluate them using DCG (definite clause grammar) processing on test files. Consider best practices to deal with information loss. See the attached paper "Semantics-based Filtering: Logic Programming's Killer App?" (Gupta et al.) on semantic transformation using logic programming.
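To make grammar-based validation concrete, here is a minimal sketch for a subset of the Newick tree format, written as a recursive-descent parser in Python rather than as a DCG. It is an illustration only and not one of the grammars developed for this task. The names validate_newick and NewickError are hypothetical, and quoted labels and bracketed comments are deliberately not handled. The grammar assumed is: Tree -> Subtree ";", Subtree -> [ "(" Branch { "," Branch } ")" ] [ Name ], Branch -> Subtree [ ":" Number ].

```python
import re

# A token is a single punctuation mark or a run of other non-space characters.
TOKEN = re.compile(r"[(),:;]|[^(),:;\s]+")
PUNCTUATION = set("(),:;")


class NewickError(ValueError):
    """Raised when the input does not match the Newick subset grammar."""


def validate_newick(text):
    """Return True if text matches the grammar; raise NewickError otherwise."""
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(symbol):
        nonlocal pos
        if peek() != symbol:
            raise NewickError(
                "expected %r at token %d, found %r" % (symbol, pos, peek())
            )
        pos += 1

    def name():
        # Name -> empty | label
        nonlocal pos
        if peek() is not None and peek() not in PUNCTUATION:
            pos += 1

    def length():
        # Length -> empty | ":" Number
        nonlocal pos
        if peek() == ":":
            pos += 1
            try:
                float(peek())
            except (TypeError, ValueError):
                raise NewickError("expected a branch length at token %d" % pos)
            pos += 1

    def subtree():
        # Subtree -> [ "(" Branch { "," Branch } ")" ] [ Name ]
        if peek() == "(":
            expect("(")
            branch()
            while peek() == ",":
                expect(",")
                branch()
            expect(")")
        name()

    def branch():
        # Branch -> Subtree Length
        subtree()
        length()

    subtree()
    expect(";")
    if pos != len(tokens):
        raise NewickError("unexpected tokens after ';'")
    return True


print(validate_newick("(A:0.1,(B:0.2,C:0.3)D:0.4);"))
```

Each function corresponds to one grammar rule, which is the same correspondence a DCG expresses directly; the error messages point at the token where the input departs from the grammar.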

Further design principles: Weigang developed the following description of what is needed:

  • Develop and maintain a web service for file format validation
    • flag format errors
    • flag deprecated syntax
    • suggest changes
    • provide warnings on potential incompatibility
    • make translation tables to preserve long OTU names.
  • As a higher-level service, a web-based tool could
    • make codon alignment
    • calculate sequence divergence
    • filter taxa and character sets.
  • This format validation would be based on the emerging phylo-data and phylo-model ontologies, rather than on a format mandated by a particular application
  • Such a service will depend on the initial support of application developers, but the eventual adoption will be beneficial for the whole phyloinfo community
  • Providing a file exchange service would be an added attraction of such a service.
  • The main advantage is guaranteed acceptance by all major applications (with no loss of information) when a user's data set passes the validation.
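The distinction between format errors and compatibility warnings in the list above might be represented as in the following Python sketch. Everything here is an assumption made for illustration: the Issue record, the function validate_matrix, the two checks chosen, and the example data. A real service would draw its rules from the formal grammars and ontologies rather than from hard-coded checks.

```python
from collections import namedtuple

# One finding of the validator: a level ("error" or "warning"), the taxon
# label concerned (or None for the whole matrix), and a message.
Issue = namedtuple("Issue", "level taxon message")


def validate_matrix(matrix, name_limit=10):
    """Check a list of (taxon label, sequence) pairs and return a list of Issues."""
    issues = []

    # Format error: an aligned matrix needs sequences of equal length.
    lengths = {len(seq) for _, seq in matrix}
    if len(lengths) > 1:
        issues.append(
            Issue("error", None, "sequences differ in length: %s" % sorted(lengths))
        )

    # Compatibility warning: labels over the limit may be truncated elsewhere.
    first_with_prefix = {}
    for label, _ in matrix:
        if len(label) > name_limit:
            issues.append(
                Issue("warning", label,
                      "label is longer than %d characters" % name_limit)
            )
        short = label[:name_limit]
        if short in first_with_prefix and first_with_prefix[short] != label:
            issues.append(
                Issue("error", label,
                      "label collides with %r after truncation"
                      % first_with_prefix[short])
            )
        first_with_prefix.setdefault(short, label)
    return issues


# Hypothetical data set with one short sequence and two colliding labels.
matrix = [
    ("Homo_sapiens_1", "ACGT"),
    ("Homo_sapiens_2", "ACGT"),
    ("Pan", "ACG"),
]
for issue in validate_matrix(matrix):
    print(issue.level, issue.taxon, issue.message)
```

Keeping errors and warnings in one structured list lets the same report drive both the "flag" and the "suggest changes" steps, for example by offering a translation table whenever a truncation collision is found.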


To do:

  1. table with links to grammars for NEXUS, MEGA, PHYLIP, DAMBE, ...
  2. table with test results for each grammar
  3. recommended practices for dealing with information loss

DONE:

  1. Formal specification of PHYLIP input format (easily convertible to DCG): http://www.cs.nmsu.edu/~epontell/info2.pdf
  2. Formal specification of MEGA input (sequence data): http://www.cs.nmsu.edu/~epontell/mega.pdf

See the PHYLIP-NEXUS interconversion demo (http://www.ivory-tower-theorist.com/fconv/) developed by Brian De Vries (from Gopal's group).
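For readers unfamiliar with what such a conversion involves, the following Python sketch converts a strict sequential PHYLIP matrix into a NEXUS DATA block. It is independent of the demo above and makes several assumptions: names occupy exactly the first 10 columns, each sequence sits on a single line, labels need no NEXUS quoting, and the function name phylip_to_nexus is hypothetical. The datatype must be supplied by the caller because the PHYLIP file does not record it, which is a small example of the information-loss problem in interconversion.

```python
def phylip_to_nexus(text, datatype="DNA"):
    """Convert strict sequential PHYLIP text to a NEXUS DATA block."""
    lines = [line for line in text.splitlines() if line.strip()]
    ntax, nchar = (int(field) for field in lines[0].split())

    rows = []
    for line in lines[1:ntax + 1]:
        name = line[:10].strip()          # PHYLIP names use the first 10 columns
        seq = "".join(line[10:].split())  # drop blanks inside the sequence
        if len(seq) != nchar:
            raise ValueError("%s: expected %d characters, found %d"
                             % (name, nchar, len(seq)))
        rows.append((name, seq))
    if len(rows) != ntax:
        raise ValueError("expected %d taxa, found %d" % (ntax, len(rows)))

    out = [
        "#NEXUS",
        "BEGIN DATA;",
        "  DIMENSIONS NTAX=%d NCHAR=%d;" % (ntax, nchar),
        "  FORMAT DATATYPE=%s MISSING=? GAP=-;" % datatype,
        "  MATRIX",
    ]
    for name, seq in rows:
        # In unquoted NEXUS tokens an underscore stands for a blank.
        out.append("    %s  %s" % (name.replace(" ", "_"), seq))
    out += ["  ;", "END;"]
    return "\n".join(out) + "\n"


# Hypothetical two-taxon input.
phylip = " 2 8\nHuman     ACGTACGT\nChimp     ACGTACGA\n"
nexus = phylip_to_nexus(phylip)
print(nexus)
```

The reverse direction is harder, since NEXUS labels can exceed 10 characters and NEXUS files can carry blocks with no PHYLIP counterpart.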

Relationship to other goals of the working group

We are deliberately approaching the Current Standards goal in a way that leads on to our other goals. Task 1 (Assessing user practices) will broaden and clarify our awareness of which capabilities are most needed for a Future Data Exchange Standard and a Database Archive; by clarifying the inputs and outputs used in typical analyses, it provides a foundation for Analysis Templates; it may also reveal Future Challenges. Tasks 2 and 3 address our Library of Cases goal and provide material for the future goals of Education and Outreach, Support for Evaluation, and Analysis Templates. Task 4 (Formalization) will identify minimal elements for both a Database Archive and a Future Data Exchange Standard, and will provide an upconversion path to the Future Data Exchange Standard.