NEXUS

From Evolutionary Informatics Working Group

In spite of being a de facto standard, NEXUS (Maddison et al., 1997) is poorly supported by software, with many inconsistent or incomplete implementations. To support NEXUS means providing users and developers with the tools to read, write, and validate files, and to search, edit, store, and visualize the contents of a file. Proper support would allow for lateral transfer of data to and from other file formats, and an upward conversion path to the next standard. For more information, see the Supporting NEXUS documentation from NESCent's recent phyloinformatics hackathon.

NEXUS Problems

Supporting NEXUS in software for evolutionary analyses reveals problems that fall into 3 basic categories:

  1. Deficiencies in the standard itself
  2. Variable levels of support for standard features in current software
  3. The extremely high incidence of illegal files (a byproduct of the fact that most parsers are "forgiving" of errors).

Problems with the NEXUS Standard

NEXUS has many appealing properties, such as ease of extensibility, human-readability, and richness (see [1] for a discussion). Despite these advantages, there are some deficiencies that users should be aware of:

  1. weak typing of the characters matrix
  2. no (standardized) support for file inclusion
  3. no (standardized) support for multiple instances of the same type of block
  4. numbers as labels issue

Weak typing in the characters matrix

For a NEXUS character matrix to be valid, the symbols found in it must be declared. There is a concept of a character type (DNA, protein, "standard"), but the symbols list of the predefined types can be augmented, with implementation-defined behavior. There are mechanisms for labeling character states, assigning user-types to characters, and so on, but they are entirely optional.

Consider trying to calculate the likelihood of a morphological character using something like the Mkv model (Paul Lewis' term for the CF or JC-like model when applied to variable characters). One thing you need to know is the number of states, but you cannot get this from a NEXUS file. You can see that 3 symbols are used in a column, and perhaps 4 states are labeled using a charstatelabels command, but the NEXUS standard explicitly says that it is acceptable to assign labels to only a subset of the states. So you can really only establish a lower bound on the number of states for any character in a matrix.
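The lower bound described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each cell holds a single symbol (no polymorphism or equate expansion), "?" and "-" stand for missing data and gaps, and the function name and parameters are hypothetical rather than part of any existing NEXUS library.

```python
def state_count_lower_bound(column, labeled_states=0, missing="?", gap="-"):
    """Return a lower bound on the number of states for one character.

    column         -- the symbols observed for this character, one per taxon
    labeled_states -- how many states were given labels (e.g. via charstatelabels)
    """
    # Missing-data and gap symbols are not states, so leave them out.
    observed = {s for s in column if s not in (missing, gap)}
    # Labels may cover only a subset of the states and the matrix may not
    # show every state, so the larger of the two counts is still only a
    # lower bound on the true number of states.
    return max(len(observed), labeled_states)


# Three symbols appear in the column, but four states were labeled.
column = ["0", "1", "2", "?", "1", "0"]
print(state_count_lower_bound(column, labeled_states=4))  # prints 4
```

Nothing in the file rules out a fifth state that is neither observed nor labeled, which is why the result cannot be treated as the state count that a model like Mkv requires.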

To make matters worse, some programs do not accept multiple CHARACTERS blocks, and extending the symbols list for built-in types can result in clashes between symbols and equate keys. See File:Dna-prot.nex for an example of this.
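As a hedged illustration of how such a clash can arise (not a description of the linked file), suppose a DNA matrix has its symbols list extended with the amino acid codes so that both kinds of data fit in one block. The sketch below assumes that the IUPAC nucleotide ambiguity codes act as the predefined equate keys for DNA; the function name is hypothetical.

```python
# IUPAC nucleotide ambiguity codes. Assumption: these act as the
# predefined equate keys for DATATYPE=DNA in the parser at hand.
DNA_AMBIGUITY_CODES = "RYMKSWHBVDN"

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def symbol_equate_clashes(base_symbols, extra_symbols, equate_keys):
    """Return the sorted symbols that are also used as equate keys."""
    extended = set(base_symbols.upper()) | set(extra_symbols.upper())
    return sorted(extended & set(equate_keys.upper()))


clashes = symbol_equate_clashes("ACGT", AMINO_ACIDS, DNA_AMBIGUITY_CODES)
print(clashes)
# prints ['D', 'H', 'K', 'M', 'N', 'R', 'S', 'V', 'W', 'Y']
```

Under these assumptions, each of the listed letters could be read either as an amino acid or as a set of nucleotides, and the file alone does not say which reading is intended.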

No standardized method for dealing with blocks of the same type

This is most obvious in the case of DATA, TAXA, and CHARACTERS blocks. The standard does not say how multiple CHARACTERS blocks or TAXA blocks should be handled. PAUP and MrBayes do not accept multiple CHARACTERS blocks (in PAUP, the user can either cancel reading the second block or let it replace the first block in memory). Mesquite encourages multiple blocks.
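Because programs differ on this point, it can be useful to check whether a file repeats a block type before handing it to a particular program. The following is a minimal sketch, not a full NEXUS tokenizer: it strips bracketed comments and counts BEGIN commands, and it assumes that no quoted token contains brackets or the word BEGIN. The function names are hypothetical.

```python
import re
from collections import Counter

# Matches the start of a block, e.g. "BEGIN CHARACTERS;" (case-insensitive).
BLOCK_START = re.compile(r"\bBEGIN\s+(\w+)\s*;", re.IGNORECASE)


def strip_comments(text):
    """Remove NEXUS [comments], including nested ones."""
    out, depth = [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def repeated_blocks(text):
    """Map each block type that occurs more than once to its count."""
    names = BLOCK_START.findall(strip_comments(text))
    counts = Counter(name.upper() for name in names)
    return {name: n for name, n in counts.items() if n > 1}


example = """#NEXUS
BEGIN TAXA; DIMENSIONS NTAX=2; TAXLABELS A B; END;
BEGIN CHARACTERS; [first matrix] END;
begin characters; [second matrix] end;
"""
print(repeated_blocks(example))  # prints {'CHARACTERS': 2}
```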

To support multiple blocks, Mesquite invented a TITLE command (to assign names to blocks) and a LINK command, so that a CHARACTERS (or TREES) block can be linked to a specific TAXA block, for example. These links are not supported by other software. Mesquite also extended the CHARSET (and TAXSET) commands so that the name of the source of the labels can be identified. For example, in an ASSUMPTIONS block:

CHARSET * untitled  (CHARACTERS = sequences)  =  unord:  1 -  100;

This means that the CHARSET uses the first 100 characters of the CHARACTERS block with TITLE=sequences.

No other program (that I know of) understands this syntax, and because it is a change to a public command, it will cause PAUP to reject the file.
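A reader that wants to tolerate the extended form could treat the parenthesized link as optional. The sketch below is one possible approach using a regular expression. It is not how Mesquite or any other program actually parses the command, it handles only unquoted names, and it leaves the asterisk and the set description uninterpreted. The function name and the returned keys are hypothetical.

```python
import re

CHARSET_RE = re.compile(
    r"""^\s*CHARSET\s+
        (?P<default>\*\s*)?       # optional asterisk before the name
        (?P<name>\w+)\s*          # set name (unquoted names only)
        (?:\(\s*CHARACTERS\s*=\s*(?P<block>[^)]+?)\s*\)\s*)?  # optional link
        =\s*(?P<body>.*?)\s*;\s*$ # everything up to the closing semicolon
    """,
    re.IGNORECASE | re.VERBOSE | re.DOTALL,
)


def parse_charset(command):
    """Split a CHARSET command into its name, linked block title, and body."""
    m = CHARSET_RE.match(command)
    if m is None:
        raise ValueError(f"not a CHARSET command: {command!r}")
    return {
        "name": m.group("name"),
        "linked_block": m.group("block"),  # None when no link is given
        "body": m.group("body"),
    }


extended = "CHARSET * untitled  (CHARACTERS = sequences)  =  unord:  1 -  100;"
print(parse_charset(extended)["linked_block"])  # prints sequences

plain = "CHARSET coding = 1-300;"
print(parse_charset(plain)["linked_block"])  # prints None
```

Making the link optional lets the same code read both forms. Writing the extended form back out would still produce a file that PAUP rejects.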