NCBI Data Model Information

From Evolutionary Informatics Working Group
Revision as of 08:52, 14 November 2007 by (talk) (Representation, status, etc)



The NCBI data model covers two kinds of information: bibliographic information and biological information.

The biological sequences and associated information include

  • basic sequence information
  • sequence features
    • splicing
    • coding regions
    • item
  • item

Representation, status, etc

The data model is represented in ASN.1.

The implementation is supported by a programming library, the NCBI toolkit, that is used internally at NCBI. This includes a serialization library that handles mappings between ASN.1 and XML.

External users have CVS access to the library. The documentation can be browsed.

The data model is very stable. An enormous computing infrastructure at NCBI is built on top of this model. The NCBI toolkit is freely available to outside developers. However, it is not used outside of NCBI to any great extent.

Key features of interest

The basic biosequence concept is as follows (in ASN.1):

Bioseq ::= SEQUENCE {
   id SET OF Seq-id ,
   descr Seq-descr OPTIONAL ,
   inst Seq-inst ,
   annot SET OF Seq-annot OPTIONAL }

Here SEQUENCE is an ASN.1 term indicating that the listed fields occur in the order given: a set of identifiers, a description, a sequence instance, and a set of annotations of the sequence. SET OF denotes an unordered collection of elements of one type, and a field marked OPTIONAL may be omitted.
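As a rough illustration of the shape of this record, the four fields can be mirrored in a small Python data class. This is only a sketch: the class is a hypothetical stand-in, not the NCBI toolkit class, and every field type is simplified to a string or list of strings. The defaults are a convenience of the sketch, not a statement of which fields the specification marks OPTIONAL.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Bioseq:
    """Illustrative stand-in for the ASN.1 Bioseq; not the NCBI toolkit class."""
    inst: str                                        # Seq-inst, simplified to the residues as a string
    ids: List[str] = field(default_factory=list)     # SET OF Seq-id, simplified to strings
    descr: Optional[str] = None                      # Seq-descr, simplified to free text
    annot: List[str] = field(default_factory=list)   # SET OF Seq-annot, simplified to labels


# A made-up record: the identifier and description are invented for the example.
seq = Bioseq(inst="AAGGCCTTTTAG", ids=["example|100"], descr="hypothetical example record")
print(len(seq.inst), seq.ids, seq.annot)
```

A record with no annotations simply carries an empty list, which is how the sketch represents an omitted annot field.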

Sequences can be of various types: not only different chemical types, but also constructed sequences, partial sequences, and so on.

Annotations are attached to sequences by a location, called a Seq-loc in the data model. Locations on sequences can be specified in a variety of ways (points, ranges, choices, ambiguities, sets of these, and so on).
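The idea that a location can be a point, a range, or a set of other locations can be sketched as a small recursive type. The names below (Point, Interval, Mix, covered_length) and the 0-based inclusive coordinates are assumptions of this sketch, not the toolkit's API, and only three of the location kinds mentioned above are modelled.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Point:
    seq_id: str
    pos: int        # 0-based position (an assumption of this sketch)


@dataclass(frozen=True)
class Interval:
    seq_id: str
    start: int      # inclusive
    stop: int       # inclusive


@dataclass(frozen=True)
class Mix:
    parts: Tuple["Location", ...]   # a set of other locations, possibly nested


Location = Union[Point, Interval, Mix]


def covered_length(loc: Location) -> int:
    """Sum the residues named by a location (overlapping parts are counted twice)."""
    if isinstance(loc, Point):
        return 1
    if isinstance(loc, Interval):
        return loc.stop - loc.start + 1
    return sum(covered_length(part) for part in loc.parts)


# A hypothetical feature location made of two ranges and one point on the same sequence.
feature_loc = Mix(parts=(Interval("seq1", 10, 19), Interval("seq1", 40, 44), Point("seq1", 60)))
print(covered_length(feature_loc))
```

Because an annotation only needs to hold one such location value, the same mechanism serves for a single-residue feature and for a feature spread over several ranges.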

In the NCBI data model, alignments are described as a sequence of blocks of sequences. This is superficially similar to the concept of an alignment composed of sub-alignments in MAO, but the division into blocks actually has more to do with how the data are represented than with any expression of biology. To make this clear, here is an extended description from the NCBI data model documentation.

"A sequence alignment is essentially a correlation between Seq-locs, often associated with some score. An alignment is most commonly between two sequences, but it may be among many at once. In an alignment between two raw Bioseqs, a certain amount of optimization can be done in the data structure based on the knowledge that there is a one to one mapping between the residues of the sequences. So instead of recording the start and stop in Bioseq A and the start and stop in Bioseq B, it is enough to record the start in A and the start in B and the length of the aligned region. However if one is aligning a genetic map Bioseq with a physical map Bioseq, then one will wish to allow the aligned regions to distort relative one another to account for the differences from the different mapping techniques. To accommodate this most general case, there is a Seq-align type which is purely correlations between Seq-locs of any type, with no constraint that they cover exactly the same number of residues.

A Seq-align is considered to be a SEQUENCE OF segments. Each segment is an unbroken interval on a defined Bioseq, or a gap in that Bioseq. For example, let us look at the following three dimensional alignment with 6 segments:

[Figure: alignment of three Bioseqs, with a ruler marking segments 1 to 6; the sequence rows are not preserved.]

The example above is a global alignment; that is, each segment sequentially maps a region of each Bioseq to a region of the others. An alignment can also be of type "diags", which is just a collection of segments with no implication about the logic of joining one segment to the next. This is equivalent to the diagonal lines that are shown on a dot-matrix plot. The example above illustrates the most general form of a Seq-align, Std-seg, where each segment is purely a correlated set of Seq-loc. Two other forms of Seq-align allow denser packing of data for when only raw Bioseqs are aligned. These are Dense-seg, for global alignments, and Dense-diag for "diag" collections. The basic underlying model for these denser types is very similar to that shown above, but the data structure itself is somewhat different."
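The optimization described in the quotation, recording only a start in each sequence plus one shared length, can be sketched for the two-sequence case as follows. This is a minimal illustration under stated assumptions: the Segment class, the expand helper, the use of None for a gap, and the coordinates are all invented for the example and do not reproduce the actual Dense-seg layout.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Optional[Tuple[int, int]]   # (start, stop), inclusive, or None for a gap


@dataclass
class Segment:
    start_a: Optional[int]   # start in sequence A; None marks a gap in A (sketch convention)
    start_b: Optional[int]   # start in sequence B; None marks a gap in B (sketch convention)
    length: int              # shared by both rows, since residues map one to one


def expand(seg: Segment) -> Tuple[Span, Span]:
    """Recover the explicit start/stop pair for each row from the compact form."""
    def span(start: Optional[int]) -> Span:
        return None if start is None else (start, start + seg.length - 1)
    return span(seg.start_a), span(seg.start_b)


# A made-up global alignment of two raw sequences in three consecutive segments.
alignment = [
    Segment(0, 0, 12),       # 12 residues aligned one to one
    Segment(12, None, 7),    # 7 residues of A opposite a gap in B
    Segment(19, 12, 11),     # aligned region resumes
]
spans = [expand(seg) for seg in alignment]
print(spans)
```

The stop coordinates never need to be stored because they follow from the start and the shared length. That shortcut is unavailable when the aligned regions may differ in length, which is the case the more general form built from Seq-locs is meant to cover.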