NCBI Data Model Information

From Evolutionary Informatics Working Group
Revision as of 13:25, 13 November 2007 by (talk) (Representation, status, etc)

The NCBI data model covers two broad kinds of information: bibliographic information and biological information.

The biological sequences and associated information include

  • basic sequence data
  • sequence features
    • splicing
    • coding regions
    • item
  • item

Representation, status, etc

The data model is represented in ASN.1.

The implementation is supported by a programming library that is used internally at NCBI.

External users have CVS access to the library. The documentation can be browsed.

The library includes a serialization component that handles mappings between ASN.1 and XML.

There is an enormous computing infrastructure at NCBI that is built on top of this model.

Key features of interest

The basic biosequence concept is as follows (in ASN.1):

Bioseq ::= SEQUENCE {
   id SET OF Seq-id OPTIONAL,
   descr Seq-descr ,
   inst Seq-inst ,
   annot SET OF Seq-annot OPTIONAL }

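Here SEQUENCE is the ASN.1 keyword for an ordered collection: the listed fields occur in the order given, namely a set of identifiers, a description, a sequence instance, and a set of annotations. SET OF denotes an unordered collection of elements of a single type, and fields marked OPTIONAL may be omitted.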
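
As a rough illustration of how the Bioseq definition reads as a data structure, here is a minimal Python sketch. It follows the ASN.1 exactly as shown on this page. The classes SeqId and SeqInst, their fields, the FIELD_ORDER constant, the present_fields helper, and the example values are all hypothetical stand-ins and are not part of the NCBI library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeqId:
    """Hypothetical stand-in for Seq-id: a database name plus an accession."""
    db: str
    accession: str


@dataclass(frozen=True)
class SeqInst:
    """Hypothetical stand-in for Seq-inst: molecule type, length, residues."""
    mol: str
    length: int
    residues: Optional[str] = None


class Bioseq:
    # Field names in the order the ASN.1 SEQUENCE declares them.
    FIELD_ORDER = ("id", "descr", "inst", "annot")

    def __init__(self, descr, inst, id=None, annot=None):
        # The parameter name "id" mirrors the ASN.1 field name.
        self.id = id        # SET OF Seq-id OPTIONAL: unordered, may be omitted
        self.descr = descr  # Seq-descr: required
        self.inst = inst    # Seq-inst: required
        self.annot = annot  # SET OF Seq-annot OPTIONAL: unordered, may be omitted

    def present_fields(self):
        """Return the names of the fields that are present, in SEQUENCE order."""
        return [name for name in self.FIELD_ORDER if getattr(self, name) is not None]


# A made-up protein record with one identifier and no annotations.
seq = Bioseq(
    descr=["example protein"],
    inst=SeqInst(mol="aa", length=5, residues="MKVLA"),
    id=frozenset({SeqId("example-db", "X00001")}),
)
print(seq.present_fields())
```

The frozenset models SET OF (the order of identifiers carries no meaning), while FIELD_ORDER models SEQUENCE (the order of fields is fixed).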

Sequences come in various types. They differ not only in chemical type; there are also constructed sequences, partial sequences, and so on.

Annotations are attached to sequences by location. Locations on sequences can be specified in a variety of ways: as points, ranges, choices, ambiguities, sets of these, and so on.
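
To make the idea of location-based annotation concrete, here is a small Python sketch covering three of the location kinds mentioned above: a point, a range, and a set of these. The class names Point, Interval, and LocSet, the covers function, and the 0-based inclusive coordinates are assumptions made for illustration and do not reproduce the actual NCBI location types.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Point:
    seq_id: str
    pos: int    # 0-based position (an assumption of this sketch)


@dataclass(frozen=True)
class Interval:
    seq_id: str
    start: int  # 0-based, inclusive
    stop: int   # 0-based, inclusive


@dataclass(frozen=True)
class LocSet:
    parts: Tuple["Location", ...]  # any mix of points, intervals, nested sets


Location = Union[Point, Interval, LocSet]


def covers(loc, seq_id, pos):
    """Return True if the location includes position pos on sequence seq_id."""
    if isinstance(loc, Point):
        return loc.seq_id == seq_id and loc.pos == pos
    if isinstance(loc, Interval):
        return loc.seq_id == seq_id and loc.start <= pos <= loc.stop
    if isinstance(loc, LocSet):
        return any(covers(part, seq_id, pos) for part in loc.parts)
    raise TypeError(f"unsupported location type: {type(loc).__name__}")


# A coding region split across two exons, expressed as a set of intervals.
cds = LocSet((Interval("seq1", 10, 49), Interval("seq1", 100, 159)))
print(covers(cds, "seq1", 20), covers(cds, "seq1", 75))
```

Because a feature carries only a location, a spliced coding region can be attached to a sequence without copying or altering the sequence data itself.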