== Overview ==
Transparent data exchange between computer programs requires standardized serialization schemes.  Traditionally this means having a standard file format for data exchange.  Supporting such standards may involve clarifying or extending an existing standard, providing users and developers with software tools to use the standard, providing conversion between formats, and ensuring an upward conversion path to the next standard.
  
== Defining the problem ==
How can we increase interoperability by supporting current standards? 
  
The place to begin is to identify current standards and assess available support.  Current standards include file formats like NEXUS, MEGA, and PHYLIP, described on the [http://www.molecularevolution.org/resources/fileformats/ Molecular Evolution Workshop] page at Woods Hole.  To what extent can these formats be validated?  To what extent is the choice of format an irrelevant, transparent issue for the user who wishes to view, query, store, edit, or analyze data?
  
Consider the simplest format, FASTA.  We can view the data in a FASTA file using a standard text browser or word processor.  Yet, how many computer programs that input FASTA files place an arbitrary limit on the length of a definition line?  What about support for search, query, and edit  operations?  Even for a simple FASTA file, the possibility of arbitrary line breaks means that when we search for the sequence GAATTC using a standard text editor, we may fail to find it because the pattern is hard-wrapped from one line to the next.  Without this kind of query capability, automatic editing is not possible.  
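To make the pitfall concrete, here is a minimal sketch in Python of a wrap-aware search (the file name is illustrative): it joins the wrapped sequence lines of each record before searching, so a motif such as GAATTC is found even when it straddles a line break.

<pre>
# Minimal sketch: find a motif in a FASTA file even when the motif
# straddles a hard line break.  File name and motif are illustrative.

def fasta_records(path):
    """Yield (definition line, sequence) pairs, joining wrapped lines."""
    defline, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if defline is not None:
                    yield defline, "".join(chunks)
                defline, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)

for defline, seq in fasta_records("example.fasta"):   # hypothetical file
    pos = seq.upper().find("GAATTC")
    if pos >= 0:
        print(defline, "has GAATTC at position", pos)
</pre>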
  
The problem becomes much worse when we consider other file formats.  For instance, PHYLIP files limit OTU names to 10 characters.  In a NEXUS file, an OTU can be referred to by an identifier string called a "taxon label", or by a whole number that corresponds to the order in which an OTU was declared.  This means that, if we wish to rename OTUs in a NEXUS file in order to achieve compatibility with PHYLIP limits, we cannot simply do a search-and-replace using the taxon label, because an OTU might be referenced by a number instead of a label.  Clearly, then, in order to support these standards and hide the details of file formats from the user, we must have things like wrappers for PHYLIP programs that preserve long names, and editors for NEXUS that are aware of NEXUS semantics.
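To illustrate the wrapper idea, here is a minimal sketch in Python.  It assumes we already have the OTU names in hand; a real wrapper would also rewrite the data files and post-process the program's output.  Each long name is mapped to a unique 10-character placeholder, and the mapping is kept so that the original names can be restored afterwards.

<pre>
# Minimal sketch of the name-preserving wrapper idea: substitute unique
# 10-character placeholders before handing data to a PHYLIP program,
# then use the mapping to restore the long names afterwards.

def make_phylip_safe(names):
    """Return ({original: placeholder}, {placeholder: original})."""
    to_safe, to_orig = {}, {}
    for i, name in enumerate(names, start=1):
        safe = f"T{i:09d}"        # exactly 10 characters, guaranteed unique
        to_safe[name] = safe
        to_orig[safe] = name
    return to_safe, to_orig

otus = ["Drosophila_melanogaster", "Homo_sapiens", "Mus_musculus"]
to_safe, to_orig = make_phylip_safe(otus)
# ... write the data with the safe names, run the PHYLIP program ...
for safe, original in to_orig.items():
    print(safe, "->", original)   # restore the long names in the output
</pre>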
  
Also, it should be clear from these examples that we can neither assess nor enhance support for standards without understanding the goals and practices of the user community.
  
 
== Goals for the working group ==
 
=== Assessing current user practices ===
Some issues that we need to address in order to assess the community's needs for support:

* list of currently widely used file formats for character data
* list of the most common inputs (i.e., what software is used to generate the files)
* list of the most common uses for the files
** how are they processed or analyzed?
** what other file formats are generated in the process?
** how are they edited? (my guess: the most common edits are manual renaming and alignment fixes)
** how are they visualized?
** how are they queried?
** how are they archived or stored?

Note that some parts of this will cross-fertilize with the goal of assembling a library of cases.  While pursuing this goal, we can identify use cases and users who would be willing to contribute files.

=== Formalization ===
In order to develop a strategy to support current standards, we need to define these standards precisely and pursue strategies for validation, interconversion, and storage:

* formal definitions of file formats
* format validation tools
* other standards to incorporate, e.g., IUPAC amino acid and DNA codes
* format conversion tools

Note that we can convert the simpler formats (PHYLIP, MEGA) into XML, which provides a way to view the data and also to edit it safely (using a validating XML editor).  We can also use this as an opportunity to test and extend the BioPerl formats.
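As a rough illustration of what a combined validation and conversion tool might look like, here is a Python sketch.  It assumes a strict sequential PHYLIP layout (one sequence per line, 10-character name field), treats the IUPAC nucleotide codes plus common gap and missing-data symbols as the alphabet, and invents the XML element names for the example.

<pre>
# Minimal sketch: convert a strict sequential PHYLIP file to XML,
# rejecting characters outside the IUPAC nucleotide codes and common
# gap/missing symbols.  Layout assumptions and element names are
# illustrative only.
import sys
import xml.etree.ElementTree as ET

IUPAC_DNA = set("ACGTURYSWKMBDHVN") | set("-?.")

def phylip_to_xml(path):
    with open(path) as handle:
        ntax, nchar = (int(x) for x in handle.readline().split())
        root = ET.Element("alignment", ntax=str(ntax), nchar=str(nchar))
        for _ in range(ntax):
            line = handle.readline()
            name, seq = line[:10].strip(), "".join(line[10:].split())
            bad = set(seq.upper()) - IUPAC_DNA
            if bad:
                sys.exit(f"{name}: non-IUPAC characters {sorted(bad)}")
            if len(seq) != nchar:
                sys.exit(f"{name}: expected {nchar} sites, found {len(seq)}")
            ET.SubElement(root, "sequence", name=name).text = seq
    return ET.tostring(root, encoding="unicode")

print(phylip_to_xml("example.phy"))   # hypothetical input file
</pre>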
=== Case library for evolutionary analysis ===
To do practical work in this area, we need sample files or test files.  Ideally these will be actual user files that represent various stages of analysis and various degrees of complexity.  For instance, if we wish to test format interconversion tools, we must have test sets that consist of ''the same data in different formats'', which means that some experts are going to have to verify that the data are indeed the same (a rough automated first pass is sketched below).

* various examples of each type of format
* sets with the same data in different formats (for testing one-way format conversion)
* pairs of input and output files from some procedural step
* pipelines with more than one step

Note that this will be an extraordinarily valuable resource for both teaching and software development.
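As a first pass at such verification, a script can parse both files with simple readers like those sketched earlier and compare the normalized records; an expert still has to resolve discrepancies such as truncated names.  The file names below are hypothetical.

<pre>
# Minimal sketch: a first-pass check that a FASTA file and a sequential
# PHYLIP file hold the same data.  Name truncation and other format
# quirks still require expert review.

def read_fasta(path):
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                records[name] = []
            elif line:
                records[name].append(line)
    return {n: "".join(parts).upper() for n, parts in records.items()}

def read_phylip(path):
    records = {}
    with open(path) as handle:
        ntax, nchar = (int(x) for x in handle.readline().split())
        for _ in range(ntax):
            line = handle.readline()
            records[line[:10].strip()] = "".join(line[10:].split()).upper()
    return records

a, b = read_fasta("same_data.fasta"), read_phylip("same_data.phy")
print("identical" if a == b else "different")
</pre>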
=== Software testing and gap analysis ===
We do not have time to do extensive software development.  However, we can assess the state of currently available software, test its performance, develop recommendations for how users can make use of available software, and assign priorities to future development goals (indeed, NESCent may be able to sponsor a hack-a-thon to address some of these goals).

* identify a reference implementation for each type of file format
* test the reference implementation
* assess conformance of current software (e.g., name length limits; see the sketch below)
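For instance, a conformance probe for the definition-line limits discussed earlier might write a FASTA record with a deliberately long definition line and check whether a candidate reader preserves it.  A minimal Python sketch follows; the file name is illustrative, and the reader under test is whatever tool the assessment targets.

<pre>
# Minimal sketch of a conformance probe: write a FASTA record with a
# deliberately long definition line, run it through the reader under
# test, and check that the line survives intact.

def write_probe(path, defline_length=2000):
    """Write one FASTA record with a definition line of the given length."""
    defline = ("probe_" + "x" * defline_length)[:defline_length]
    with open(path, "w") as handle:
        handle.write(">" + defline + "\n")
        handle.write("GAATTCGAATTC\n")
    return defline

expected = write_probe("probe.fasta")   # hypothetical file name
# Feed probe.fasta to the reader under test, capture the definition line
# it reports, and then compare:
#   assert reported == expected, "reader truncated the definition line"
</pre>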
 
  
 
== Strategy for achieving goal ==

(be sure to include specific deliverables or milestones)
