Difference between revisions of "Future Data Exchange Standard"
[[Category:DB Interop Hackathon]]
Revision as of 17:30, 2 March 2009
This page covers NeXML.
- 1 Introduction
- 2 Goals for the working group
- 3 RFC: nexml
- 3.1 Preamble
- 3.2 Design
- 3.2.1 General design
- 3.2.2 Element description
- 4 Resources
- 5 Outstanding issues
- 6 Meeting notes
In the first evoinfo meeting the working group acknowledged the desirability of a formal ontology as the basis for data interoperability, and chose the development of a General Ontology as one of its goals. Such a formal ontology, for example in RDF, can be translated directly into a serialization protocol and into a file standard definition such as an XML Schema. However, the requirements of this ontology were not clearly established, and a serialization protocol for it would not automatically result in a convenient next generation file standard. Hence, some attendees argued that a more "bottom up" approach that starts with the design of a new standard will help uncover the hidden requirements of the ontology and generate the serialization protocol in the process. This page describes recent discussion and ongoing work on the design of this new standard.
Goals for the working group
At the first meeting (20-23 May, 2007), the working group decided that work on the future data exchange standard would primarily proceed under the auspices of Rutger Vos, who reported on its progress at the Fall '07 meeting.
In internetworking and computer network engineering, Request for Comments (RFC) documents are a series of memoranda encompassing new research, innovations, and methodologies applicable to Internet technologies. RFCs are meant for peer review and to transmit new concepts, and are published through the Internet Society. This document is therefore, strictly speaking, not a real RFC, but it serves the same purpose: to provoke comment and discussion, and to disseminate information about the design process of a new data exchange standard for phylogenetics, which in honor of the commonly used nexus flat file format is called nexml. At its current stage, this page is neither a formal specification or recommendation, nor formally endorsed by evoinfo.
The following subsections introduce the context within which nexml development takes place, describing the interoperability problems faced by phylogeneticists, the requirements nexml needs to meet to address these problems, and the development approach that is being taken to meet these requirements.
Current practices and ongoing developments in phylogenetics add pressure to the need for a new data standard. The working group is compiling a list of these issues, some of which include:
- Many flat file formats and dialects: in recent years, phylogeneticists have adopted a number of different file formats to store data such as trees and sequence alignments. Unfortunately, most flat file formats used for this purpose lack an exhaustive, formal specification, and even those that have one have been extended in various ways. Examples of this abounded during the meeting.
- Data hard or impossible to validate automatically: for many file formats, no validation procedure exists, and given the proliferation of mutually incompatible dialects it is difficult to see how such a procedure could decide validity unambiguously. What is valid "nexus" to one program may be rejected by another. This makes many types of compatibility issues hard to solve without extensive knowledge of file conventions "in the wild".
- Bigger data sets: meanwhile, phylogenetic analyses are continuously growing in size and complexity. Solving compatibility issues by hand is becoming too tedious.
- Increased automation: the growth in size and complexity of analyses has promoted development of automated systems for data analysis (pipelines/workflows) and data storage (databases). The purpose of a workflow is obviously defeated if it requires manual editing of data files between steps in the analysis. As well, the reproducibility of results obtained from complex workflows is in doubt if no formal facility exists for logging the steps undertaken during an analysis.
- More data types: since the introduction of the nexus file format, new types of analyses (e.g. Bayesian, ML) have gained ground, which have introduced new types of information (substitution models, priors and posteriors on branches) for which no standardized format exists.
- Nexus problems: as the nexus file format is the most commonly used format in phylogenetics, the working group is compiling a separate list of nexus problems.
To address the issues described in the preceding section, a new standard should meet at least the following requirements:
- Formalization: the next standard implements a formal ontology and is defined by a schema (as in phyloxml), such that instance documents can be validated with great granularity, including for type safety for character data (per character), referential integrity between entities, validity of tree descriptions;
- Extension: the next standard extends current standards to include model descriptions, reticulate trees, arbitrary attachments on nodes, trees, tree sets, sites, sequences, alignments, etc.;
- Data integration: the next standard allows for the integration of data from disparate sources, e.g. from multiple files, from databases, from sets of alignments and trees, from existing ontologies;
- Legacy compatibility: the next standard is organized conceptually, and can be serialized or otherwise represented in ways that are understandable for legacy software, such as being able to express the standard nexus subset;
- Analysis context description: the next standard can represent meta-information such as analysis procedures, instance document changes, schema changes;
- Parseability: the next standard is easily parsed without the need for custom tokenization, both in compact and verbose representations;
- Development of the data exchange standard is now ongoing with input from various contributors, including, in alphabetical order, Jason Caravas, Mark Holder, David Maddison, Wayne Maddison, Peter Midford, Andrew Rambaut, Jeet Sukumaran and Rutger Vos as well as other members of the evoinfo group and attendees of the pPOD kickoff meeting.
- The svn repository has migrated to sourceforge. It contains a modularized XML schema (discussed on powerpoint slides) and various instance documents demonstrating representations of common phylogenetic data.
- The domain name http://www.nexml.org has been registered, which serves to uniquely identify the nexml namespace and for which a web presence which aggregates various information streams (mailing list, wiki changes, svn commits, bugs) has been established.
- NESCent has provided a staging server from which this web presence is launched on the production server.
- To facilitate development of the ontology, concepts introduced by the design of the schema and discussed on this wiki page are crossreferenced with the glossary.
- An online validation service is under development. You can reuse it by placing this in an html document:
<form action="http://www.nexml.org/nexml/validator" enctype="multipart/form-data" method="post">
  <input type="file" name="file"/>
  <input type="submit"/>
</form>
Or use it as a simple REST service: HTTP response code 201 means the file is valid, 400 means it isn't. The validator goes through a two-step process:
- grammar-based validation: this is the part where we check an input file for sensible placement of elements, attributes and text nodes. We do this by validating against the xml schema using a Xerces-J validator written by Terri Schwartz of the CIPRES-SDSC team. Unfortunately, some constraints we would like to place on files cannot be expressed in schema language. For these we have to resort to the second step.
- rule-based validation: here we test whether chained remote references (rules such as "matrix row X is part of characters block Y, which is linked to OTUs block Z. Therefore, X can only refer to an OTU that is part of Z") make sense. We do this by parsing the file using Bio::Phylo's nexml parser.
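The chained-reference rule quoted above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Bio::Phylo implementation: the function name `check_row_references` and the namespace-free sample fragment are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment used only for illustration.
DOC = """
<nexml version="1.0">
  <otus id="tax1">
    <otu id="t1"/>
    <otu id="t2"/>
  </otus>
  <characters otus="tax1" id="m1">
    <matrix>
      <row id="r1" otu="t1"><seq>AACA</seq></row>
      <row id="r2" otu="t9"><seq>ATAC</seq></row>
    </matrix>
  </characters>
</nexml>
"""

def check_row_references(root):
    """Return error strings for rows whose otu is not in the linked otus block."""
    # Map each otus block id to the set of otu ids it contains.
    blocks = {o.get("id"): {t.get("id") for t in o.findall("otu")}
              for o in root.findall("otus")}
    errors = []
    for chars in root.findall("characters"):
        linked = blocks.get(chars.get("otus"))
        if linked is None:
            errors.append(f"characters {chars.get('id')}: unknown otus block {chars.get('otus')}")
            continue
        # A row may only refer to an OTU that is part of the linked block.
        for row in chars.iter("row"):
            if row.get("otu") not in linked:
                errors.append(f"row {row.get('id')}: otu {row.get('otu')} "
                              f"is not in otus block {chars.get('otus')}")
    return errors

errors = check_row_references(ET.fromstring(DOC))
```

Here row `r2` points at `t9`, which is not part of `tax1`, so a single error is reported. A grammar-based validator cannot catch this, because the file is structurally well formed.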
The following subsections describe the design of the developing nexml standard. These descriptions are likely to both be in flux, and lagging behind the formal (one, true) XSD description in the svn repository. The reader is therefore advised to follow the links in the subsections to the XSD fragments and XML instance document examples that define the elements.
XML in general is usually far more verbose than flat file formats that convey the same information. Nexml is designed to allow both for verbose and compact representation of data. The rationale is that some use cases (e.g. submission of morphological observations to a database) require extensive annotation, whereas others (e.g. submission to a processing engine as part of a workflow) don't require a lot of metadata, only the bare bones needed to complete the analysis step.
Elements other than xml structural placeholders (e.g. the matrix element that lumps matrix rows together) all have the following properties:
- a required id attribute. This attribute is not a standard XML id (which is file scoped), but of type xs:NCName, which has the same string format restrictions but is free to have its scope defined by the schema. In the nexml schema, all such ids are scoped within their enclosing elements. This means that nodes must have an id that is unique within the tree that contains them, matrix rows must have ids unique within the matrix, etc. The rationale is that this minimizes the risk of clashes when combining data from multiple sources, and makes it easier to solve clashes should they occur (by re-assigning the id of the enclosing element).
- some elements must refer to other elements: e.g. a matrix row element must refer to an otu element. By convention, the attribute used to reference the id of another element is named after that referenced element. For example, <row otu="t1" id="r1"/> means that the row element identified by id r1 links to an otu element whose id is t1. Referenced elements in nexml always precede their references, which is why otus elements come before characters elements, for example.
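As a rough illustration of ids that are scoped to their enclosing element, the following Python sketch reports ids that clash among the direct children of the same parent. The helper name `duplicate_ids` and the sample fragment are assumptions made for this example; the schema itself expresses the constraint declaratively.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: node n1 may occur in both trees, but not twice in one tree.
DOC = """
<trees id="ts1">
  <tree id="tree1"><node id="n1"/><node id="n2"/></tree>
  <tree id="tree2"><node id="n1"/><node id="n1"/></tree>
</trees>
"""

def duplicate_ids(parent):
    """Yield (enclosing element id, clashing id) pairs, descending recursively."""
    counts = Counter(child.get("id") for child in parent
                     if child.get("id") is not None)
    for value, n in counts.items():
        if n > 1:
            yield (parent.get("id"), value)
    for child in parent:
        yield from duplicate_ids(child)

clashes = list(duplicate_ids(ET.fromstring(DOC)))
```

Only `tree2` is flagged: reusing `n1` across `tree1` and `tree2` is allowed because each id is resolved relative to its enclosing tree.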
To allow for the preservation of arbitrary labels and annotations, nexml elements have:
- an optional label attribute, which is used to specify a human-readable label for the element. The only restriction on this attribute's value is that it is a valid xml string, which means that reserved xml tokens (such as the less-than symbol < or the ampersand &) must be escaped using xml entities.
- zero or more <dict/> elements (dictionary attachments) enclosed within them. These attachments, if present, precede any other enclosed elements (the rationale is similar to that for having referenced elements precede their references: it facilitates stream based parsing strategies).
Other common attributes
In addition to the shared attributes and structures described above, nexml elements also optionally have:
- a class attribute. This attribute has an id reference to a class defined earlier as its value. Its function is analogous to that of sets (e.g. OTU sets, character sets) in nexus, but instead of the set declaring what it contains, here the nexml elements declare what set(s) they belong to. This is the idiomatic approach for XML, and is compatible with CSS styling.
- an xml:base attribute, from the core XML namespace. This attribute has an absolute URL as value, which is the base address upon which relative links defined later in a document are built. There is a similar facility in HTML (defined in the head of a document), which in practice is used to shorten URLs in pages by defining their shared path fragments in one location. The rationale for inclusion in nexml is similar, with the additional consideration that some XInclude processors place the attribute by default in the post-include infoset.
- an xlink:href attribute, from the XLink namespace. This attribute has an absolute or relative URL as its value, which is used to create a unidirectional link from the element that has the attribute to another location.
- an xml:id attribute, from the core XML namespace. This attribute has a string of the same format as the nexml id as its value, except this id is file-scope unique. The attribute has no defined function within nexml, but some XInclude processors might include it.
- an xml:lang attribute, from the core XML namespace. This attribute has an RFC3066 compliant language code (e.g. en-US) as its value, which is used to indicate the natural language in which the text of the element or its children is written.
- In addition, attributes from any other namespace are allowed (e.g. foo:bar="baz"), provided this namespace (say, http://foo.org) is declared previously and bound to a namespace prefix (e.g. xmlns:foo="http://foo.org").
(Like all standalone xml files, nexml files start with the <?xml version="1.0" encoding="some character encoding scheme"?> processing instruction. The encoding scheme is used to specify what character set - e.g. some unicode flavor - is used. Hence, nexml files can include OTU descriptions in simplified Chinese, for example, without confusing parsers. Naturally, i18n in terms of character sets is best used in combination with xml:lang attributes where appropriate.)
The root element of the schema is called nexml. This root element has the following attributes:
- a required version attribute whose value is a decimal number indicating the nexml schema version. Until a revision occurs after the first release of nexml the value must be 1.0.
- an optional generator attribute, which is used to identify the program that generated the file. The attribute's value is a free form string.
In addition, it will commonly define a number of xml namespace prefixes. (Where it says "by convention" in the list below, the convention applies to the three-letter prefixes which are free to vary in most cases, not the namespaces themselves):
- the xml namespace prefix that identifies xml schema semantics that might be inlined in the file. By convention this is of the format xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" so that parts of schema language used inside nexml (e.g. where a concrete subclass must be specified) are identified by the xsi: prefix.
- the nexml namespace prefix, by convention of the format xmlns:nex="http://www.nexml.org/1.0", so that where nexml specific types are referenced (e.g. data type subclasses) these are identified by their nex: prefix.
- the xml namespace prefix, required to be of the format xmlns:xml="http://www.w3.org/XML/1998/namespace". This may be used, for example, to specify the base address of imported nexml snippets (using the xml:base="http://example.com/base.xml" attribute) and to specify the language in which certain element contents are written (using, say, xml:lang="en-US").
Lastly, to associate the instance document with the nexml schema, it requires an attribute to specify the nexml schema location, and the namespace it applies to. This is of the format xsi:schemaLocation="http://www.nexml.org/1.0 nexml.xsd" (in a stable release the location of the schema would not be a local path - such as nexml.xsd - but a url). Notice that this attribute is a schema language snippet (identified by the xsi: prefix) that identifies a namespace (http://www.nexml.org/1.0) and associates it with a physical schema location (nexml.xsd).
Together, this makes the root element look something like the following:
<?xml version="1.0" encoding="ISO-8859-1"?>
<nex:nexml
    version="1.0"
    generator="eclipse"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xml="http://www.w3.org/XML/1998/namespace"
    xmlns:nex="http://www.nexml.org/1.0"
    xsi:schemaLocation="http://www.nexml.org/1.0 http://www.nexml.org/1.0/nexml.xsd">
</nex:nexml>
The root element can hold:
- zero or more attachment dictionaries,
- one or more OTUs elements,
- zero or more characters elements,
- zero or more trees elements (in mixed order with characters elements),
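To make the role of the root attributes concrete, here is a small Python sketch that reads the version, generator and schema location from a root element like the one shown above. The function name `describe_root` is an assumption for this example, and the xmlns:xml declaration and the XML declaration are left out of the sample string purely to keep it short.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
NEX = "http://www.nexml.org/1.0"

ROOT = (
    '<nex:nexml version="1.0" generator="eclipse" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xmlns:nex="http://www.nexml.org/1.0" '
    'xsi:schemaLocation="http://www.nexml.org/1.0 http://www.nexml.org/1.0/nexml.xsd">'
    '</nex:nexml>'
)

def describe_root(text):
    """Collect the root-level metadata of a nexml document into a plain dict."""
    root = ET.fromstring(text)
    # schemaLocation holds whitespace-separated (namespace, location) pairs.
    tokens = root.get(f"{{{XSI}}}schemaLocation", "").split()
    locations = dict(zip(tokens[0::2], tokens[1::2]))
    return {
        "tag": root.tag,                      # namespace-qualified element name
        "version": root.get("version"),
        "generator": root.get("generator"),
        "schema": locations.get(NEX),         # schema bound to the nexml namespace
    }

info = describe_root(ROOT)
```

The parser resolves the nex: prefix to its namespace, which is why the prefix itself is free to vary while the namespace is not.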
In some phylogenetic analyses (such as Bayesian analyses that yield a set of trees) many nodes - in different trees - refer to the same sequence. On the other hand, in some other analyses (such as those that involve simulation of a set of sequence alignments) many sequences - in different alignments - refer to the same node in the generating tree. This relationship can be normalized by creating a third entity from which one-to-many links point both to nodes and sequences. In nexus files, these entities are defined in the taxa block using a set of labels that sequences and nodes later on in the file must refer to. In nexml, this same functionality is provided by otus elements. The name change is a result of the progressing integration between phylogenetics and taxonomy, which now causes concept confusion because of the overloading of the term "taxa" when it was introduced in the nexus standard.
In its simplest form, an otus element looks something like this:
<otus id="tax1">
  <otu id="t1"/>
  <otu id="t2"/>
</otus>
That is, the otus element and its contained otu elements require a (file level unique) id attribute. In addition, these elements can have an optional label attribute that defines a human readable name for the element, and can contain dictionary attachments.
The characters element is somewhat analogous to the nexus characters block: it stores data such as molecular sequences, categorical data or continuous data. The element is different from the nexus characters block in that it allows for more detailed specification of the allowed states per character, strict validation of the observed states, and annotation of characters (columns), states, rows and individual observations using dictionaries. In addition, the characters element is designed to allow for representation of non-homologized data: the element is more accurately described as a bucket of observations and the allowed parameter space for those observations. Only if the boolean attribute aligned of the matrix element is set to "1" (true) can subsequent observations be assumed to be homologous across rows.
The schema specifies the characters element to be of an abstract type, so that instance documents need to specify the concrete subclass (i.e. datatype) using the xsi:type attribute. At present, six data types are supported: DNA, protein, restriction sites, standard categorical, continuous and RNA. For each of these data types there are two subclasses, whose names are constructed as "data type"Seqs (e.g. DnaSeqs) and "data type"Cells (e.g. DnaCells), the former being a compact representation with states listed as tokens in a sequence, the latter a verbose representation with states marked up granularly.
In its most compact form, a DNA sequence alignment as expressed in a characters element would look something like this:
<characters otus="tax1" id="m1" xsi:type="nex:DnaSeqs">
  <matrix aligned="1">
    <row id="r1" otu="t1"><seq>AACATATCTC</seq></row>
    <row id="r2" otu="t2"><seq>ATACCAGCAT</seq></row>
    <row id="r3" otu="t3"><seq>GAGGGTATGG</seq></row>
    <row id="r4" otu="t4"><seq>GGTCTTAGAG</seq></row>
    <row id="r5" otu="t5"><seq>CGTCACAGTG</seq></row>
  </matrix>
</characters>
The characters elements are polymorphic in two ways:
- The xsi:type attribute defines the data type, and thus the concrete character matrix subclass. In the example, the DNA subclass restricts the allowed symbols in the matrix to the IUPAC single character nucleotide symbols, allows omission of per-character state definitions, and allows concatenation of symbols into a string inside the seq element.
- The second polymorphism is in the granularity of observation mark up. The example above shows a compact representation where characters are concatenated as a string.
The RNA subclasses are virtually identical to the equivalent DNA subclasses (only T vs U). The protein and restriction data types are similar in that they also don't require character definitions and can also be expressed as concatenated strings. This is because they both use single character symbols, whose allowed symbols are well-defined, namely as the IUPAC single character amino acid symbols and as 0 or 1, respectively. In all compact representations, gaps and missing data are encoded as separate alignments, so that multiple alignments can be associated with the same sequence.
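A minimal Python sketch of how a consumer might read the compact DNA representation is given below. The names `read_compact_dna` and `column` are hypothetical, the symbol set is the standard IUPAC nucleotide alphabet without gap or missing symbols, and the equal-length check for aligned matrices is an assumption of this sketch rather than a rule stated by the schema.

```python
import xml.etree.ElementTree as ET

# IUPAC single character nucleotide symbols (gap/missing symbols left out).
IUPAC_DNA = set("ACGTRYKMSWBDHVN")

MATRIX = """
<matrix aligned="1">
  <row id="r1" otu="t1"><seq>AACATATCTC</seq></row>
  <row id="r2" otu="t2"><seq>ATACCAGCAT</seq></row>
  <row id="r3" otu="t3"><seq>GAGGGTATGG</seq></row>
</matrix>
"""

def read_compact_dna(matrix):
    """Return {otu id: sequence string}, rejecting non-IUPAC symbols."""
    rows = {}
    for row in matrix.findall("row"):
        seq = (row.findtext("seq") or "").strip().upper()
        bad = set(seq) - IUPAC_DNA
        if bad:
            raise ValueError(f"row {row.get('id')}: illegal symbols {sorted(bad)}")
        rows[row.get("otu")] = seq
    # Assumption: an aligned matrix should have rows of equal length.
    if matrix.get("aligned") in ("1", "true") and len({len(s) for s in rows.values()}) > 1:
        raise ValueError("aligned matrix has rows of unequal length")
    return rows

def column(rows, index):
    """Return the observations at one (zero-based) column across all rows."""
    return {otu: seq[index] for otu, seq in rows.items()}

rows = read_compact_dna(ET.fromstring(MATRIX))
first_column = column(rows, 0)
```

Because each symbol is a single character, the position in the string doubles as the character (column) index, which is what makes the compact form possible for these data types.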
For other data types, the behavior of the characters element is different. For categorical data (the STANDARD data type), row elements inside the matrix element must be preceded by a format element that specifies the allowed states per character. An example is shown below:
<format>
  <states id="states1">
    <state id="s1" symbol="1"/>
    <state id="s2" symbol="2"/>
    <state id="s3" symbol="3"/>
    <state id="s4" symbol="4"/>
    <polymorphic_state_set id="s5" symbol="5">
      <mapping state="s1"/>
      <mapping state="s2"/>
    </polymorphic_state_set>
    <uncertain_state_set id="s6" symbol="6">
      <mapping state="s3"/>
      <mapping state="s4"/>
    </uncertain_state_set>
  </states>
  <char states="states1" id="c1"/>
  <char states="states1" id="c2"/>
</format>
In this case, then, the matrix holds two four-state characters. The state sets s5 and s6 function as ambiguity codes: each is a symbol that actually means a set of two other states (s1 and s2 for s5, s3 and s4 for s6). Because s5 is declared as a polymorphic_state_set, its mapping to the two other states indicates true polymorphism (i.e. both states are observed in a population). The other kind of set, uncertain_state_set (here s6), indicates uncertainty about which of the mapped states applies. In practice, "polymorphism" can be read as "AND", and "uncertain" as "OR".
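The AND/OR reading of ambiguity codes can be made explicit with a short Python sketch that expands every state in a states element to the fundamental states it stands for. The function name `state_table` and the "single"/"and"/"or" tags are conventions invented for this example.

```python
import xml.etree.ElementTree as ET

STATES = """
<states id="states1">
  <state id="s1" symbol="1"/>
  <state id="s2" symbol="2"/>
  <state id="s3" symbol="3"/>
  <state id="s4" symbol="4"/>
  <polymorphic_state_set id="s5" symbol="5">
    <mapping state="s1"/>
    <mapping state="s2"/>
  </polymorphic_state_set>
  <uncertain_state_set id="s6" symbol="6">
    <mapping state="s3"/>
    <mapping state="s4"/>
  </uncertain_state_set>
</states>
"""

# Element name -> how its members combine (labels are local to this sketch).
KINDS = {"polymorphic_state_set": "and", "uncertain_state_set": "or"}

def state_table(states):
    """Map each state id to (kind, tuple of fundamental state ids)."""
    table = {}
    for el in states:
        if el.tag == "state":
            table[el.get("id")] = ("single", (el.get("id"),))
        elif el.tag in KINDS:
            members = tuple(m.get("state") for m in el.findall("mapping"))
            table[el.get("id")] = (KINDS[el.tag], members)
    return table

table = state_table(ET.fromstring(STATES))
```

With this table, an observation of `s5` expands to "s1 AND s2", while `s6` expands to "s3 OR s4".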
The state and char elements can take dictionary attachments, which are loosely analogous to - but more powerful than - the nexus tokens CHARSTATELABELS and CHARLABELS, respectively.
The row elements contain the mappings between the defined states and the actual observations. For example:
<row id="r1" otu="t1">
  <cell char="c1" state="s2"/>
  <cell char="c2" state="s2"/>
</row>
This structure means that the entity defined by OTU "t1" has state "s2" (symbol 2) for both characters. Therefore, cell elements within different row elements are homologous if they have the same value for the char attribute.
DNA, RNA, protein and restriction data can be described in a similar, verbose way, with the "state" attribute's value being an IUPAC nucleotide symbol, an IUPAC amino acid symbol or a boolean (0/1), respectively, and the "char" attribute's value being the (zero-based) column number.
In a compact representation, the same STANDARD information is marked up like this:
<row id="r1" otu="t1"><seq>2 2</seq></row>
Notice how this is similar to the compact DNA representation, except in the allowed symbols that are used (integers instead of IUPAC symbols) and in that the symbols are space-separated (this is because STANDARD states aren't necessarily single-character symbols: integers greater than 9 are also allowed).
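The relation between the verbose and the compact STANDARD notation can be sketched as a conversion in Python. The helper `cells_to_seq` is hypothetical, and it assumes every character has exactly one observation in the row; how missing cells should be handled is not covered by this sketch.

```python
import xml.etree.ElementTree as ET

ROW = ('<row id="r1" otu="t1">'
       '<cell char="c1" state="s2"/>'
       '<cell char="c2" state="s2"/>'
       '</row>')

# State id -> symbol, as declared in the format element.
SYMBOLS = {"s1": "1", "s2": "2", "s3": "3", "s4": "4"}

def cells_to_seq(row, char_ids, symbols):
    """Render verbose cell elements as a space-separated compact string."""
    observed = {cell.get("char"): symbols[cell.get("state")]
                for cell in row.findall("cell")}
    # Characters are emitted in the order in which the format element lists them.
    return " ".join(observed[c] for c in char_ids)

compact = cells_to_seq(ET.fromstring(ROW), ["c1", "c2"], SYMBOLS)
tokens = compact.split()  # reading the compact form back is a whitespace split
```

Splitting on whitespace rather than per character is what allows multi-digit symbols such as 10 or 11.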
For continuous data, the format element defines the characters (i.e. char elements) but not their states, and observation values (i.e. either the state attribute in verbose notation, or space-separated symbols in compact notation) are arbitrary precision floating point numbers.
In some analyses, data of different types is analyzed jointly (e.g. mrbayes does this). Nexml does not currently facilitate this. It is likely that this will be implemented either using matrix sets or (less likely) using a MIXED concrete subclass.
Due to their nesting, tree descriptions as nested elements (as suggested in Joe Felsenstein's book Inferring Phylogenies) can pose special problems for xml parsers: a parser can only hand off an element once all its children have been processed and stored in memory. Large trees described using nested elements can therefore develop huge memory requirements. Hence, nexml describes trees as node and edge tables instead, following the semantics for GraphML (which is also discussed separately in the context of the study of related artefacts).
The concrete subclasses IntTree and FloatTree describe a tree shape following GraphML semantics. The classes differ in that the optional length attribute is either an integer or an IEEE 754-1985 compliant floating point number. Below is an example:
<tree id="tree1" xsi:type="nex:FloatTree" label="tree1">
  <node id="n1" label="n1" root="true"/>
  <node id="n2" label="n2" otu="t1"/>
  <node id="n3" label="n3"/>
  <node id="n4" label="n4"/>
  <node id="n5" label="n5" otu="t3"/>
  <node id="n6" label="n6" otu="t2"/>
  <node id="n7" label="n7"/>
  <node id="n8" label="n8" otu="t5"/>
  <node id="n9" label="n9" otu="t4"/>
  <rootedge target="n1" id="re1" length="0.34765"/>
  <edge source="n1" target="n3" id="e1" length="0.34534"/>
  <edge source="n1" target="n2" id="e2" length="0.4353"/>
  <edge source="n3" target="n4" id="e3" length="0.324"/>
  <edge source="n3" target="n7" id="e4" length="0.3247"/>
  <edge source="n4" target="n5" id="e5" length="0.234"/>
  <edge source="n4" target="n6" id="e6" length="0.3243"/>
  <edge source="n7" target="n8" id="e7" length="0.32443"/>
  <edge source="n7" target="n9" id="e8" length="0.2342"/>
</tree>
This is an XML representation of the newick string (((t4,t5)n7,(t2,t3)n4)n3,t1)n1;. In this representation, the root is principally identified by having in-degree of zero or one, i.e. no edge element exists with a target attribute that references that node, but a rootedge element may exist to indicate a time span leading up to the root (principally for coalescent trees). An additional root attribute is used to indicate that this tree is in fact considered truly rooted. This attribute may be used on multiple nodes, to indicate multiple rootings. Tips are identified by there being no edge elements with source attributes that reference them.
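The rules for finding the root and the tips in a node and edge table can be sketched in Python as follows. The function names `tree_tables` and `newick` are hypothetical, the sample tree omits labels and branch lengths for brevity, and the newick writer is recursive, so it is only suitable for modestly sized trees.

```python
import xml.etree.ElementTree as ET

TREE = """
<tree id="tree1">
  <node id="n1" root="true"/> <node id="n2" otu="t1"/> <node id="n3"/>
  <node id="n4"/> <node id="n5" otu="t3"/> <node id="n6" otu="t2"/>
  <node id="n7"/> <node id="n8" otu="t5"/> <node id="n9" otu="t4"/>
  <rootedge target="n1" id="re1" length="0.34765"/>
  <edge source="n1" target="n3" id="e1"/> <edge source="n1" target="n2" id="e2"/>
  <edge source="n3" target="n4" id="e3"/> <edge source="n3" target="n7" id="e4"/>
  <edge source="n4" target="n5" id="e5"/> <edge source="n4" target="n6" id="e6"/>
  <edge source="n7" target="n8" id="e7"/> <edge source="n7" target="n9" id="e8"/>
</tree>
"""

def tree_tables(tree):
    """Return (labels, children, roots, tips) derived from node and edge elements."""
    # Use the otu reference as the label where present, else the node id.
    labels = {n.get("id"): n.get("otu") or n.get("id") for n in tree.findall("node")}
    children = {node_id: [] for node_id in labels}
    targets = set()
    # Only edge elements are counted; rootedge has a different element name.
    for edge in tree.findall("edge"):
        children[edge.get("source")].append(edge.get("target"))
        targets.add(edge.get("target"))
    roots = [n for n in labels if n not in targets]   # no incoming edge
    tips = [n for n in labels if not children[n]]     # no outgoing edge
    return labels, children, roots, tips

def newick(node, labels, children):
    """Write the subtree below node as a newick fragment (recursive)."""
    kids = children[node]
    if not kids:
        return labels[node]
    return "(" + ",".join(newick(k, labels, children) for k in kids) + ")" + labels[node]

labels, children, roots, tips = tree_tables(ET.fromstring(TREE))
text = newick(roots[0], labels, children) + ";"
```

The resulting string lists children in edge order, so sister groups may appear in a different order than in the newick string quoted above while describing the same topology.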
To add additional objects to nodes or edges, such as bootstrap values, a dictionary attachment is used.
The IntNetwork and FloatNetwork subclasses only differ from the tree subclasses in that the key constraints on the in-degree of nodes are relaxed, so that a node can have multiple parents. In the example below, node n6 has an additional parent node n7, creating a reticulation:
<tree id="tree3" xsi:type="nex:IntNetwork" label="tree2">
  <node id="n1" label="n1"/>
  <node id="n2" label="n2" otu="t1"/>
  <node id="n3" label="n3"/>
  <node id="n4" label="n4"/>
  <node id="n5" label="n5" otu="t3"/>
  <node id="n6" label="n6" otu="t2"/>
  <node id="n7" label="n7"/>
  <node id="n8" label="n8" otu="t5"/>
  <node id="n9" label="n9" otu="t4"/>
  <edge source="n1" target="n3" id="e1" length="1"/>
  <edge source="n1" target="n2" id="e2" length="2"/>
  <edge source="n3" target="n4" id="e3" length="3"/>
  <edge source="n3" target="n7" id="e4" length="1"/>
  <edge source="n4" target="n5" id="e5" length="2"/>
  <edge source="n4" target="n6" id="e6" length="1"/>
  <edge source="n7" target="n6" id="e7" length="1"/>
  <edge source="n7" target="n8" id="e8" length="1"/>
  <edge source="n7" target="n9" id="e9" length="1"/>
</tree>
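Detecting reticulations in such a network amounts to finding nodes with an in-degree greater than one. The Python sketch below does this on the edges of the example, written out as (source, target) pairs; the function name `reticulations` is an assumption for this illustration.

```python
from collections import defaultdict

# (source, target) pairs taken from the network example.
EDGES = [
    ("n1", "n3"), ("n1", "n2"), ("n3", "n4"), ("n3", "n7"),
    ("n4", "n5"), ("n4", "n6"), ("n7", "n6"), ("n7", "n8"), ("n7", "n9"),
]

def reticulations(edges):
    """Return {node id: list of parents} for nodes with more than one parent."""
    parents = defaultdict(list)
    for source, target in edges:
        parents[target].append(source)
    return {node: found for node, found in parents.items() if len(found) > 1}

hybrid_nodes = reticulations(EDGES)
```

For a document typed as one of the tree subclasses the same check should come back empty, since those subclasses constrain the in-degree of every node.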
No formal design for a "set" of entities (a node set, an OTU set, etc.) exists yet. The intention is that nexml will provide a general facility to identifiably group elements, so that annotations and substitution models can be attached to these groups, or to indicate that sets (e.g. matrices with different data types) are to be analyzed jointly. A first step has been the addition of the optional class attribute on elements. This attribute has a vector of identifiers as its value, such that the element can specify to which classes (i.e. sets) it belongs. This is different from nexus, where sets specify which elements they contain, but the approach suggested here is more idiomatic for xml, and is compatible with styling (e.g. using CSS).
Future developments in phylogenetics are impossible to predict - new support values for nodes on trees may emerge, new types of annotations for DNA sequences, etc. In addition, the amount and types of metadata that can be attached to phylogenetic data are unlimited: some researchers may want to attach taxonomy database ids to otu elements, or genbank accession numbers to DNA and so on. A data exchange standard that attempts to limit the universe of "stuff" to attach to "things" is likely to be headed for immediate obsolescence. The nexml standard therefore allows for attachment of arbitrary key/value pairs to:
- the root element,
- otus and otu elements,
- the characters element and its character definitions, state definitions, matrix rows, sequences and individual observations,
- the trees, individual tree elements and nodes.
Or, in general, any element that inherits from the Annotated abstract type (i.e. everything other than placeholder elements).
The semantics of these key/value pairs follow the conventions used in Apple's Mac OS X property list format, a simple example being:
<dict>
  <key>description</key>
  <string>This is a string based description of an element</string>
</dict>
The dict element contains a sequence of key/value pairs, where the key element contains a string (scoped to be unique within the enclosing dictionary), and the value can take on a number of different types:
datevector, i.e. a restricted date/time format
base64vector, i.e. for base64 encoded binary data such as images;
dictvector, i.e. another dictionary, yielding a recursive data structure;
any, i.e. XML from another namespace such as XHTML (for marked up text), SVG (for graphs), ant build scripts, etc.
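A dictionary attachment maps naturally onto a hash-like structure, which the following Python sketch illustrates for string and nested dict values only. The function name `read_dict` and the key names `nested` and `inner` are hypothetical; values of any other type are simply handed back as unparsed elements.

```python
import xml.etree.ElementTree as ET

ATTACHMENT = """
<dict>
  <key>description</key>
  <string>This is a string based description of an element</string>
  <key>nested</key>
  <dict>
    <key>inner</key>
    <string>x</string>
  </dict>
</dict>
"""

def read_dict(elem):
    """Convert a dict element into a Python dict of its key/value pairs."""
    children = list(elem)
    if len(children) % 2:
        raise ValueError("dict must contain alternating key and value elements")
    result = {}
    for key, value in zip(children[0::2], children[1::2]):
        if key.tag != "key":
            raise ValueError(f"expected key element, found {key.tag}")
        if key.text in result:
            # Keys are scoped to be unique within the enclosing dictionary.
            raise ValueError(f"duplicate key {key.text}")
        if value.tag == "string":
            result[key.text] = value.text or ""
        elif value.tag == "dict":
            result[key.text] = read_dict(value)   # recursive data structure
        else:
            result[key.text] = value              # other types: left unparsed
    return result

attachment = read_dict(ET.fromstring(ATTACHMENT))
```

Because keys and values simply alternate, a stream based parser can handle an attachment without lookahead beyond the next element.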
The implication is that dictionaries map onto data structures much like hashtables, python dictionaries, or perl hashes but with type-safe values. In principle, this structure thus allows all kinds of things to be attached to elements, the only downside being that different researchers might use different substructures or types. For example, two attachments that carry the same information under different keys or value types would not be recognized to mean the same thing without human intervention. The solution is to create restricted subclasses such as an "accession" class, which would define a limited set of keys (e.g. genbank, dbj, ebi) and what type of value (e.g. id) is to follow. The same approach can obviously be used to refer to other databases such as those for taxonomies (e.g. ITIS) or morphology databases (e.g. Morphbank). In addition, xsi:type subclasses can be created to restrict other common types of attachments such as branch lengths or various node support values.
Extension of the nexml standard by this kind of subclassing does not require changes in the core schema (which only references the superclass) so should not confuse (well-designed) nexml parsers.
Core restricted dictionaries
<dict xsi:type="nex:RCSDate">
  <key>date</key>
  <string>$Date: 2007-07-30 18:36:51 +0200 (Mon, 30 Jul 2007) $</string>
</dict>
<dict xsi:type="nex:RCSRev">
  <key>rev</key>
  <string>$Rev: 74 $</string>
</dict>
<dict xsi:type="nex:RCSAuthor">
  <key>author</key>
  <string>$Author: rutgeraldo $</string>
</dict>
<dict xsi:type="nex:RCSURL">
  <key>url</key>
  <string>$URL: https://nexml07gsoc.googlecode.com/svn/trunk/examples/verbose.xml $</string>
</dict>
<dict xsi:type="nex:RCSId">
  <key>id</key>
  <string>$Id: verbose.xml 74 2007-07-30 16:36:51Z rutgeraldo $</string>
</dict>
<dict xsi:type="nex:RCSHeader">
  <key>header</key>
  <string>$Header: $</string>
</dict>
3rd party restricted dictionaries
The repository shows an example of how software authors can introduce their own restricted dictionaries in a separate namespace. Here's an example file, and here's an example schema that illustrates this.
Below are the most commonly used open source XML parsers. Which should I use? Apache and MIT licenses are both compatible with closed source projects, i.e. no copyleft requirement. SAX is the stream-based api (more memory-efficient, but need to keep track of context yourself), DOM is the tree-based api (easier to traverse, harder on memory). Expat is also used under the hood by perl xml parser XML::Twig. It's (supposed to be) the fastest xml parser. Gnome is sometimes hard to build on Windows, so I'd use expat for SAX or xerces for DOM. For java, the nexml class libraries use whatever lives under org.xml.sax.*
- http://xerces.apache.org is Apache's XML parser, implements the SAX (recommended), SAX2 and DOM apis in C++, Java and Perl. Released under Apache license.
- http://xmlsoft.org is Gnome's XML parser, implements SAX/SAX2/DOM, written in C. Released under MIT License.
- http://expat.sourceforge.net implements the SAX api in C. Released under MIT license.
Other XML projects
- http://hobit.gsf.de/wiki/display/wiki/XML+Schemas maintains a wiki of other bioinformatics xml projects.
These are not yet in the RFC.
The schema discussed here is to be an implementation of the ontology under development by the working group. Once formalization of the ontology begins (e.g. in OWL), the nexml project and the OWL development will keep a close eye on each other. In addition, the phylogenetic terms used in this page are to be compatible with those in the glossary.
General design issues
To meet the requirements for analysis context description, the nexml standard needs to accommodate more metadata about the contents of an entire nexml file:
- Contents processing: conceivably, a nexml file shouldn't just log how it has changed, but also how it should be processed - i.e. it might specify the steps involved in a workflow, settings used, etc. If a workflow solely consists of command line programs, it's possible these can be chained together using ant, for which the commands could be embedded as xml in a nexml file.
To meet the requirements for data integration, a higher-level "project" structure that can include multiple nexml files may be necessary. A facility called XInclude exists that allows for merging of xml files into a common Infoset. Unfortunately, this may create id clashes between included files. Some form of processing before inclusion may be necessary, e.g. as described here.
Some data type subclasses need a way to indicate reading frames.
In addition, the characters element needs to be able to provide alternative alignments, possibly using a restricted dictionary that defines, per matrix row, the mapping between location in the unaligned sequence and homologized position, probably using an integervector value type. An attractive suggestion (from MTH) is to attach a vector of integers where the value of each integer is an index in the unaligned sequence, and the number of occurrences of that value is the number of gaps to insert. For example, 0 0 0 means that three gap symbols are prepended to the sequence.
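The suggested integer vector encoding can be tried out with a few lines of Python. This is a sketch of one possible reading of the proposal: the function name `insert_gaps`, the "-" gap symbol and the use of an index equal to the sequence length for trailing gaps are all assumptions, since none of these are fixed in the text.

```python
from collections import Counter

def insert_gaps(sequence, gap_indices, gap="-"):
    """Insert gaps so that each index i gets as many gaps before position i
    as there are occurrences of i in gap_indices."""
    counts = Counter(gap_indices)
    if any(i < 0 or i > len(sequence) for i in counts):
        raise ValueError("gap index outside the unaligned sequence")
    out = []
    for i, symbol in enumerate(sequence):
        out.append(gap * counts[i])   # gaps that precede position i
        out.append(symbol)
    # Assumption: an index equal to len(sequence) denotes trailing gaps.
    out.append(gap * counts[len(sequence)])
    return "".join(out)

prepended = insert_gaps("ACGT", [0, 0, 0])
internal = insert_gaps("ACGT", [0, 2, 2])
```

Several such vectors could be attached to the same unaligned row, which is how alternative alignments of one sequence would be expressed under this proposal.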
Lastly, characters from different data types might be analyzed jointly. This might be implemented using the sets element.
Nexml needs some notion of OTU bipartitions (Concept Glossary#Bipartition) / splits.
Compact representations of trees (e.g. compressing of sets of trees by reusing subtrees, representing trees as ancestor functions in integer vectors) may be implemented. There may be a need, under some use cases, for a branch leading up to the root, so this needs to be implemented somehow.
Some way of grouping elements of the same kind needs to be implemented, e.g. node sets, tree sets, and so on.
The nexml standard needs to be able to express substitution models, and which characters/trees they apply to. Given that the work to date on this project has been implemented using IDL, there's some chance that this might be transformed directly into XML schema, then added to nexml. During a work meeting in Lawrence, KS, some contributors sketched out an example of what a model description in XML might look like.