Difference between revisions of "Current Standards"

From Evolutionary Informatics Working Group

Revision as of 12:51, 4 April 2007

Overview

For computer programs to exchange data transparently, they need standardized serialization schemes. Traditionally this means a standard file format for data exchange. Current standards include NEXUS, MEGA, and PHYLIP. Supporting such standards may include clarifying or extending an existing standard, providing users and developers with software tools to use the standard, providing conversion between formats, and ensuring an upward conversion path to the next standard.

Defining the problem

A variety of file formats are in use, including NEXUS, MEGA, and PHYLIP. These and several others are described on the Molecular Evolution Workshop page at Woods Hole (http://www.molecularevolution.org/resources/fileformats/).

One way to gauge how well software supports these standards is to ask to what extent files in a given format can be validated, and to what extent the choice of format is irrelevant and transparent to the user who wishes to view, search, query, store, edit, or analyze data.

Consider the simplest format, FASTA. We can obviously view the data stored in a FASTA file with a text viewer, and this becomes problematic only for extremely large data sets. But what about anything else? How many FASTA file readers place an arbitrary limit on the length of definition lines? What about support for search, query, and edit operations? Even for a simple FASTA file, the possibility of arbitrary line breaks means that when we search for the sequence GAATTC using a text editor, we may fail to find it because the pattern is wrapped from one line to the next. If we cannot search reliably in this way, we cannot do automatic editing either.
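
The line-wrapping problem described above can be illustrated with a minimal Python sketch. It is not a reference FASTA parser: the function names read_fasta and find_motif and the example record are hypothetical, and the sketch assumes that definition lines start with ">" and that each definition line is unique within the file. It shows a plain text search missing GAATTC when the motif straddles a line break, while a search over the joined sequence finds it.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {definition line: sequence}.

    Wrapped sequence lines are joined, so the line breaks in the file
    no longer affect later searches. Assumes definition lines are unique;
    a repeated definition line would overwrite the earlier record.
    """
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:]  # no arbitrary limit on definition-line length
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def find_motif(records: Dict[str, str], motif: str) -> List[Tuple[str, int]]:
    """Return (definition line, 0-based offset) for every occurrence of motif."""
    hits: List[Tuple[str, int]] = []
    motif = motif.upper()
    for name, seq in records.items():
        seq = seq.upper()
        start = seq.find(motif)
        while start != -1:
            hits.append((name, start))
            start = seq.find(motif, start + 1)  # allow overlapping matches
    return hits


# Hypothetical record in which GAATTC is split across a line break.
example = ">seq1 hypothetical example\nACGTGA\nATTCGG\n"

naive = "GAATTC" in example                 # what a text-editor search sees
records = read_fasta(example.splitlines())  # sequence lines joined per record
hits = find_motif(records, "GAATTC")

print(naive)  # False: the motif is interrupted by the newline
print(hits)   # [('seq1 hypothetical example', 4)]
```

Once the sequence has been separated from its line layout in this way, search results can be reported as offsets into the sequence rather than positions in the file, which is the kind of information an automatic editing tool would need.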

The problem is much worse for the other file formats, which are more complex. Some of the problems surrounding NEXUS (Maddison et al., 1997) are described in the Supporting NEXUS report (https://www.nescent.org/wg/phyloinformatics/index.php?title=Supporting_NEXUS_Documentation) from NESCent's recent phyloinformatics hack-a-thon.

Goals for the working group

(specific goals for this topic)

Strategy for achieving goal

(be sure to include specific deliverables or milestones)