User talk:Mjensen

From Evolutionary Informatics Working Group
Revision as of 23:30, 20 April 2009 by (talk) (00:36, 31 March 2009 (EDT) : HIVDBSchema, NeXML, BioPerl: updated Bio::Phylo repos link)

Hello, hacker world. --Mjensen 12:18, 17 February 2009 (EST)

Thanks, NESCent and hackers!

The hackathon was a major mind expansion and inspiration. Thanks to all, esp. Roger, Rutger, & Enrico, for their patient answers to my questions. Hope to see you soon in the virtual world. --Mjensen 22:56, 13 March 2009 (EDT)

MAJ's Web Service blog

  • It might be helpful, or at least amusing, for others to watch someone in open combat with these concepts. As I attempt to wrap my head around them, I will attempt to regularly post my insights. I promise they will be extremely naive, but bold. --MAJ 23:13, 27 February 2009 (EST)
  • Comments (including but not limited to: helpful, unhelpful, humorous, and derisive) are most welcome. I would only ask that you leave the text as is, and add your comments in brackets [like this, say] or as an indented bulleted point
  • like this
which can be done by typing :* text. Cheers.

23:13, 27 February 2009 (EST): The story so far:

The typical web page is one designed to accept input from humans and deliver output readable by humans. One way to think about a web service is as a [set of] web page[s] designed to accept inputs from programs and deliver output readable by programs.

Designing a dynamic web page for humans involves "binding" controls like buttons, pulldown menus, and text fields to functions in the page's scripts, in order to provide the content the human desires. These controls have more or less standard meanings (e.g., the 'Submit' button) that have evolved into common sense by consistent usage.

Designing a web service involves binding particular request types in a given protocol to functions that provide content, meaningful to humans at some point, but upon delivery principally meaningful to the program that requested it. The protocol does not evolve common sense by the spread of memes; its consistency is completely intentional and built-in. The consistency of an existing protocol's methods is precisely why web communication between web agents and web services takes place via an established protocol.

The REST communication architecture employs the HTTP protocol to establish communication between agents and services. Here, however, there are only four "buttons" that can be "pushed": GET, POST, PUT, and DELETE (plus the content on which to act, e.g., text entered in a text field). In practice, however, we can count on HTTP servers to understand only GET and POST. Since two buttons don't provide much scope for choosing a variety of actions, in the REST model the context in which a GET or POST request is made plays a major role.
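To make the "buttons" concrete, here is roughly what the corresponding HTTP request lines look like on the wire (the resource paths are invented for illustration):

 GET /trees/42 HTTP/1.1      (retrieve the resource)
 POST /trees HTTP/1.1        (submit content, carried in the request body)
 PUT /trees/42 HTTP/1.1      (create/replace the resource with the body)
 DELETE /trees/42 HTTP/1.1   (remove the resource)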

You can imagine that if you had only two buttons (Submit and Reset for example [which might behave semantically like GET and DELETE]), you would have to float a large number of web pages and associated scripts to finally infer what content the user desires, and then provide it. The WSDL instance can be thought of (I'm thinking of it this way at this moment, anyway) as the description of these many web pages. In the WSDL, the possible (allowable) contexts for GET/PUT/POST/DELETE requests are described, and the actions that these requests should initiate in these contexts are bound to the request type.

So how does the agent "browse these pages"? By first requesting the WSDL instance. Since it is "computable" (i.e., has a defined syntax and semantics), the agent will know how to read it, and will thus inform itself of the valid kinds of requests that can be made of your service. It can then proceed to make GET/PUT/POST/DELETE requests in the correct contexts that will elicit the content the agent desires.
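Step one for a Perl agent, then, might look like this minimal sketch (the service URL is invented, and I'm assuming the standard WSDL 1.1 namespace URI for picking out operations):

<perl>
# Sketch: fetch a service's WSDL and list its declared operations
# (the URL is hypothetical)
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::LibXML;

my $wsdl_text = get('http://example.org/cgi-bin/tolweb.cgi?wsdl')
    or die "could not fetch WSDL";

# The WSDL is computable: parse it as XML and inspect the operations
my $dom = XML::LibXML->load_xml( string => $wsdl_text );
print $_->getAttribute('name'), "\n"
    for $dom->getElementsByTagNameNS(
        'http://schemas.xmlsoap.org/wsdl/', 'operation' );
</perl>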

23:44, 27 February 2009 (EST) : Deconstructing Rutger's tolweb.cgi

see the code here

Elegant and not too hard. (Well...)


Background: There are two important environment variables that are set when a CGI script is invoked by the server: PATH_INFO and QUERY_STRING. If the script was invoked like so (the host and script name here are invented):

 http://example.org/cgi-bin/script.cgi/this/is/the/path?for=bar&baz=goob

then the server sets

 PATH_INFO == /this/is/the/path
 QUERY_STRING == for=bar&baz=goob
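A toy CGI script illustrating this (a sketch, not part of tolweb.cgi):

<perl>
#!/usr/bin/perl
# Toy CGI: echo back the two environment variables discussed above
use strict;
use warnings;

my $path  = $ENV{'PATH_INFO'}    || '';
my $query = $ENV{'QUERY_STRING'} || '';
# split the query string into key/value pairs
my %param = map { split /=/, $_, 2 } split /&/, $query;

print "Content-type: text/plain\n\n";
print "PATH_INFO == $path\n";
print "QUERY_STRING == $query\n";
print "$_ => $param{$_}\n" for sort keys %param;
</perl>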


There is a functional stem URL set up as a constant:

.../app?[set of param=values]&node_id=

This is a real URL that does the work.


tolweb.cgi parses the environment variable PATH_INFO

  • the script detects "Tree/TolWeb:[id]", slurps the [id], and tacks it onto the functional URL, producing the real URL for the GET request.
  • the function parse(-format=>'tolweb',-url=>$url), imported from Bio::Phylo::IO, does the actual web work
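Putting the bullets together, the dispatch logic might be sketched roughly like this (my paraphrase, not Rutger's actual code; the stem URL is elided here just as it is above):

<perl>
# Sketch of the tolweb.cgi dispatch logic (paraphrase, not the actual code)
use strict;
use warnings;
use Bio::Phylo::IO qw(parse);

use constant URL_STEM => '.../app?[set of param=values]&node_id=';

if ( $ENV{'PATH_INFO'} =~ m{Tree/TolWeb:(\d+)} ) {
    my $url = URL_STEM . $1;   # tack the id onto the functional stem
    # parse() does the actual web work, returning Bio::Phylo objects
    my $proj = parse( -format => 'tolweb', -url => $url );
}
</perl>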


The calling agent can get the web service description by simply requesting the script URL with ?wsdl appended as the query string,
which is handled in the following snippet:

<perl>
if ( $ENV{'QUERY_STRING'} =~ /wsdl/ ) {
    my $file = $0;
    $file =~ s/\.cgi$/.wsdl/;
    open my $fh, '<', $file or die $!;
    my $wsdl = do { local $/; <$fh> };
    print "Content-type: text/xml\n\n" . $wsdl;
    exit 0;
}
</perl>

So here it is assumed that tolweb.cgi is the script providing the service, and tolweb.wsdl is the WSDL description of the bindings (the "web pages") of the service.

21:54, 1 March 2009 (EST) : Hacking the WSDL file

So, we need to create the WSDL file describing the service. Note that we "need" to do this, only because we want the site/service to be able to describe itself to the agents that hit it. I look through Rutger's tolweb.cgi, and now I the developer know how to make a node_id query of the script. If I am now a (compliant) agent, and want to determine how to perform valid actions, I need a WSDL description to look at.

W3Schools has a quick and fairly clear tutorial on the WSDL. The WSDL instance is an XML file whose root is <definitions>, the principal components of which define:

  • types - basic data types
  • messages - meaning the messages between agent and service, e.g., requests and responses
  • portTypes - which aggregate operations, which in turn aggregate messages marked as input and output
  • bindings - which map protocol actions to service operations. This is where WSDL gets down to business. For example, HTTP GET at specified URL is mapped to a service operation MyServiceQuery defined in the WSDL instance. In the REST context, this is where the pages are listed for which HTTP GET pushes the "Submit" button.
  • the service itself - which aggregates the defined bindings and associates them with a particular URL.

22:51, 1 March 2009 (EST) : A deconstruction of Rutger's tolweb.wsdl

First, call in a bunch of namespaces. The W3Schools XML tutorial sez that the namespace URI is meant only to provide a unique identifier for the namespace prefix, and does not necessarily refer to a "real" page. To quote: "Note: The namespace URI is not used by the parser to look up information." I wonder whether this is really the de facto state of things.

<xml>
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions
    xmlns:mime=""
    xmlns:tns=""
    xmlns:wsdl=""
    xmlns:xsd=""
</xml>

The name and targetNamespace attributes are also set on wsdl:definitions:

<xml>
    targetNamespace=""
</xml>

Here, the nex prefix is assigned to a namespace, but it is then truly loaded (by "The Parser") as XML Schema from an external site, specified by the attribute schemaLocation, via the import element. (True?) So this statement provides the NeXML type required in the response specification below.

<xml>
    xmlns:nex="">
<wsdl:types>
    <xsd:schema xmlns:xsd="">
        <xsd:import namespace="" schemaLocation="">
        </xsd:import>
    </xsd:schema>
</wsdl:types>
</xml>

Now the messages corresponding to request and response are specified. The TreeRequest message has one part, the ToL node id, as an integer. The TreeResponse message has one part, a NeXML document that will be returned on successful lookup.
(What about unsuccessful lookups? Currently, it appears that the agent will just get the html barf provided by tolweb.cgi on error. But within the wsdl:operation specs, a fault element can be defined, separate from the input and output elements, and assigned to a particular handler.)
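For example, a fault might be declared like this hedged sketch (the TreeFault message name is invented and is not part of Rutger's file):

<xml>
<wsdl:operation name="Tree">
    <wsdl:input message="tns:TreeRequest" />
    <wsdl:output message="tns:TreeResponse" />
    <wsdl:fault name="TreeFault" message="tns:TreeFault" />
</wsdl:operation>
</xml>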

<xml>
<wsdl:message name="TreeRequest">
    <wsdl:part name="ToLWebTreeID" type="xsd:integer"></wsdl:part>
</wsdl:message>
<wsdl:message name="TreeResponse">
    <wsdl:part name="ToLWebTreeResponse" element="nex:nexml"/>
</wsdl:message>
</xml>

The request and response messages are put together into an operation called Tree in the portType element:

<xml>
<wsdl:portType name="tolweb">
    <wsdl:operation name="Tree">
        <wsdl:input message="tns:TreeRequest" />
        <wsdl:output message="tns:TreeResponse" />
    </wsdl:operation>
</wsdl:portType>
</xml>

Finally, HTTP and service can be glued together. The value tns:tolweb of the type attribute refers to the portType defined immediately above. The binding name refers to HTTP to differentiate it from other bindings (to a SOAP interface, e.g.) that might come later. (Just mindreading here.) Recall that the prefix http is bound to the WSDL HTTP-binding namespace declared above. In this namespace, the binding element takes an attribute verb, for which you can specify GET/POST(/PUT/DELETE).

<xml>
<wsdl:binding name="tolwebHTTP" type="tns:tolweb">
    <http:binding verb="GET" />
</xml>

Then the wsdl:operation is mapped onto http:operations. When we make a TreeRequest with ToLWebTreeID specified as an integer, what do we do? We place the integer where (ToLWebTreeID) is in the attribute, and tack this on to the URL sent along with the HTTP GET request. (True?) What do we do when we get the HTTP response? We assign it to the wsdl:output part called ToLWebTreeResponse, giving it an XML MIME type.

<xml>
<wsdl:operation name="Tree">
    <http:operation location="Tree/ToLWeb:(ToLWebTreeID)" />
    <wsdl:input> <http:urlReplacement/> </wsdl:input>
    <wsdl:output>
        <mime:content type="text/xml" part="ToLWebTreeResponse" />
    </wsdl:output>
</wsdl:operation>
</wsdl:binding>
</xml>
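By my reading, then, a TreeRequest with ToLWebTreeID equal to, say, 12345 (an id invented here) becomes an HTTP GET for a URL ending in:

 Tree/ToLWeb:12345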

Now, package it all up as a service named tolweb, associated with the URL given in the http:address element:

<xml>
<wsdl:service name="tolweb">
    <wsdl:port binding="tns:tolwebHTTP" name="tolwebHTTP">
        <http:address location="" />
    </wsdl:port>
</wsdl:service>
</wsdl:definitions>
</xml>

00:05, 3 March 2009 (EST) : We give it a shot

A skeletal (but I think complete) WSDL spec for an HIVQuery service, liberally ripped from tolweb.wsdl:

<xml>
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions xmlns:wsdl=""
    xmlns:mime=""
    xmlns:xsd=""
    xmlns:tns=""
    name="hivqs"
    targetNamespace=""
    xmlns:nex="">

<wsdl:types>
    <xsd:schema xmlns:xsd="">
        <xsd:import namespace="" schemaLocation="">
        </xsd:import>
    </xsd:schema>
</wsdl:types>

<wsdl:message name="HIVQRequest">
    <wsdl:part name="HIVQstring" type="xsd:string"></wsdl:part>
</wsdl:message>

<wsdl:message name="HIVQResponse">
    <wsdl:part name="HIVQresult" element="nex:nexml"/>
</wsdl:message>

<wsdl:portType name="hivqs">
    <wsdl:operation name="HIVQuery">
        <wsdl:input message="tns:HIVQRequest" />
        <wsdl:output message="tns:HIVQResponse" />
    </wsdl:operation>
</wsdl:portType>

<wsdl:binding name="hivqsHTTP" type="tns:hivqs">
    <http:binding verb="GET" />
    <wsdl:operation name="HIVQuery">
        <http:operation location="Query/String:(HIVQstring)" />
        <wsdl:input> <http:urlReplacement/> </wsdl:input>
        <wsdl:output>
            <mime:content type="text/xml" part="HIVQresult" />
        </wsdl:output>
    </wsdl:operation>
</wsdl:binding>

<wsdl:service name="hivqs">
    <wsdl:port binding="tns:hivqsHTTP" name="hivqsHTTP">
        <http:address location="" />
    </wsdl:port>
</wsdl:service>

</wsdl:definitions>
</xml>

There is a stubby CGI script there at the location above. We can use a WSDL calling tool to check out whether the WSDL gets read correctly and is providing the expected operations, at least nominally.

There are probably other cool tools, but I found the community edition of Liquid XML Studio to work. Among many other things, it has a "Browse Web Service" tool. I browse to the service URL, which delivers the WSDL using rvos's code snippet noted in a previous entry.

LiquidXML Screenshot

(Great! Now, if it only did something...)

00:08, 14 March 2009 (EDT) : The saga continues...

While I'm still quite a ways from my dream to associate an ontology with LANL annotations, a roadmap has become much clearer in my mind thanks to the Hackathon. Not to mention, a whole load of tools was built this week that I can rip without the slightest qualm. This necessitates learning Java, something I ought to have done ages ago, but for my strict policy of never learning a new language until absolutely desperate.

But before that, I can still move towards implementing a computable web service for HIVQ. I decided that if sequences are to be returned in compact NeXML, the annotations should at least be returned in XML conforming to an XML Schema. Of course, I have to build the XSD myself, but the enterprising spirit pervading the hackathon, not to mention the euphoria that accompanies near-complete exhaustion, gave me courage.

Fortunately, having already hacked the database little-s schema for Bio::DB::HIV and represented it in custom XML with a custom interface (see Bio::DB::HIV::HIVQueryHelper), there was light at the end of the tunnel. I turned to the real XSD of the Immune Epitope Database for an example XSD, and to the XML::Writer module on CPAN for automation. I've put up the Perl LANL XSD generator script in the dbhack1 trunk under trunk/guerilla/hivqs. To actually run it requires BioPerl 1.6. The .xsd files it generates (as of!) are also checked in.

21:18, 15 March 2009 (EDT) : I've got a schema. Now what?

Now that I have a schema, what do I want to do with it? Well...
  • Return annotation data in well-formed XML that is valid with respect to the schema, so that annotation is organized in a predictable, machine-readable way;
  • Associate ontological terms with the schema (or schema instance) elements, so that they have persistent, defined, machine-readable meanings, and so that reasoning queries can be executed over the data; and
  • Promulgate the schema, so that new annotations entered into the database are consistent and predictable...

The first bullet is the next step. The second bullet will be doable with the translation mechanism developed by our subgroup. The third bullet is the sales job, where I gingerly approach the data provider to start a conversation.

So, how do you write XML instances with respect to a given schema (using Perl)?

Seems like this is all about bindings, or explicit associations between the data provider format/organization (LANL's SQL db schema as table.column.celldata, e.g.) and the XSD elements.

Given a Bio::DB::Query::HIVQuery object, from which the annotations can be accessed (but which does not contain the sequences--that requires a get_seq_by_query() on a Bio::DB::HIV object), the identifiers of the annotations must be mapped to the corresponding elements of the XSD; then the appropriate XML (based on the element definitions) must be written out (or "unparsed").

One way would be:

  • have the XSD available as a parsed XML::DOM object
  • read annotation names from the HIVQuery object
  • iterating over the HIVQuery annotations, do:
      • find the XSD element corresponding to the annotation name by searching the DOM using XPath or some other method
      • the identified XSD element (if there is one) will be the correct container for the .celldata provided by HIVQuery, so
      • 'write' an instance of this element, using the element format specified by the XSD element, with content as given by the HIVQuery
      • insert the instance element into an XML instance DOM (in the right containers and the right order as specified by the XSD)
  • unparse the whole DOM when finished

Another approach:

  • rewrite HIVQuery so that annotations are slurped directly into XSD-valid XML DOM objects, and manipulate annotations (including serializing them as XML instance documents) from there.

Presumably the unparsed XML instance should then be validated against the XSD, or (more likely) validated automatically as it's written by the helping XML module.

Perl machinery for all this?

One candidate module is almost completely documentation-free, but looks very powerful. Sort of makes me feel as though I'm approaching a crash-landed flying saucer.

XML::Compile is also very powerful, with much more documentation and schema-aware readers and writers. It requires the C libraries of libxml; XML::Compile speaks XML::LibXML objects natively.

From XML::Compile docs:

 Many (professional) applications process XML messages based on a formal
 specification, expressed in XML Schemas. XML::Compile translates between
 XML and Perl with the help of such schemas. Your Perl program only handles
 a tree of nested HASHes and ARRAYs, and does not need to understand
 namespaces and other general XML and schema administration.

Excellent! One caveat: XML::Compile::Schema doesn't validate the .xsd file itself; however, XML::LibXML::Schema will do this (by dying on schema object creation, so you need to trap constructor errors with eval or some such and read $@).
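A sketch of that trap (the file name is invented):

<perl>
# Validate the .xsd itself by trapping XML::LibXML::Schema's constructor
use strict;
use warnings;
use XML::LibXML;

my $xsd_file = 'hivqSchema.xsd';   # invented name for illustration
my $schema = eval { XML::LibXML::Schema->new( location => $xsd_file ) };
if ($@) {
    die "The schema document itself is invalid: $@";
}
# $schema can now be used to validate instance documents; validate()
# likewise dies on an invalid instance, so wrap it in eval too:
# eval { $schema->validate($doc) };
</perl>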

23:09, 17 March 2009 (EDT) : Preparing to pound the Golden Spike

Well, it appears that XML::Compile::Schema will do everything I need and more, very elegantly and (I hope) simply. With it in my back pocket, I decided to lay track from the other coast, and work on getting Bio::DB::Query::HIVQuery ready to speak XML. (How many metaphors can you use in a sentence, boys and girls?)

Recall I have my own homebrew XML version of the LANL SQL schema, called lanl-schema.xml. With the accompanying homebrew Perl interface (the HIVSchema package in Bio::DB::HIV::HIVQueryHelper), I was able to browse the schema pretty efficiently and make the necessary kludges to connect the XSD to the downstream SQL, without having to resort to the funky CGI interface. Just another plus of computable schemas, even rough-and-ready ones.

XML::Compile::Schema turns the problem of writing valid XML into one of constructing the proper hash-of-hashes-...-of-hashes data structure, which gives to a Perl lover something like the feeling of doing sudoku. However, while it feels nice, I found myself asking: if I'm going to build the entire data structure from scratch, why bother with XML::Compile? Many reasons, I decided--the main one being the staving off of carpal tunnel syndrome. Beyond this, there are many groovy features, including not having to remember the order of elements in XML sequences, the existence of various filtering hooks that can be employed post-Perl processing and pre-XML writing, and the built-in validation services.
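As a sketch of the sudoku (the schema file and element names here are invented for illustration):

<perl>
# Sketch: compile a writer once, then render any nested hash as XML
# (file and element names are invented)
use strict;
use warnings;
use XML::LibXML;
use XML::Compile::Schema;

my $sch = XML::Compile::Schema->new( [ 'lanl-schema.xsd' ] );
# '{uri}localname' is XML::Compile's type notation; empty ns here
my $wri = $sch->compile( 'WRITER', '{}lanlRecord' );

my $doc  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
my $node = $wri->( $doc, {
    pat_id      => 1234,        # hash keys mirror the schema's elements...
    risk_factor => 'unknown',   # ...in any order; the writer sequences them
} );
print $node->toString(1);
</perl>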

The hash builder routines are now contained in the HIVXMLSchemaHelper module. I hope to fold this properly into the BioPerl HIVQuery eventually, probably first on the highly-touted but currently-empty bioperl-dev branch.

23:08, 23 March 2009 (EDT) : Meeting XML::Compile in a dark alley

Hello again, hacker world. My personal hackathon continues apace.

XML::Compile is a truly amazing piece of code. Owing to its "function-orientation" and liberal use of "inside-out" objects, it is also nearly completely non-debuggable. But when has that ever stopped a hacker with borderline OCD who wants to just get the damn thing done?

I found that the schemata were compiling properly in Liquid XML (though after forcing myself finally to understand namespaces, they need updating from their current naive form to a more advanced naive form...more on that later). However, a test program designed to see if I could write a nice hash to valid HIVQ XML was throwing impenetrable errors (after I had fixed the penetrable ones). In the process of feeling my way through this, I wound up learning more about namespaces, about nillable elements, and more than anyone would EVER want to know about installing CPAN modules into ActiveState via Cygwin without the aid of a net or PPM (lawd 'a' mercy).

Some background: I made most of the sub-elements of the schema optional (i.e., minOccurs = "0"), because of the way HIVQuery allows one to query for particular annotations and leave out others. However, not every annotation field is populated for every LANL HIV DB record, so I wanted some indication to the user that the annotation field she requested was empty, rather than simply leaving out the field altogether. For this I turned to the nillable attribute.

XSD elements and attributes can be allowed to take a true NULL value, by setting their nillable attribute. NULL is represented officially in the XML instance document by a specially defined attribute in the Schema-instance namespace, xsi:nil. To wit, if in the schema (.xsd) document, we have:

<xml>
<xs:element name="myNillableElt" nillable="true"
    xmlns:xs="" />
</xml>

then, in the instance, we can represent a NULL with:

<xml>
<myNillableElt xsi:nil="true" xmlns:xsi="" />
</xml>

To make a long story short, it appears that XML::Compile::Schema gets confused if an element is both optional and nillable. Alternatively, XML::Compile::Schema could be perfectly healthy, and I am confused. Read on for the excruciating details...

Here are two toy schemas and a script. Commentary follows.

try.xsd:

<xml>
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema
    xmlns:xs=""
    xmlns:tns=""
    xmlns:try="" >
  <xs:import schemaLocation="try-sub.xsd" namespace="" />
  <xs:complexType name="myEltType">
    <xs:sequence>
      <xs:element minOccurs="0" name="myOptional"  type="try:myOptEltType"/>
      <xs:element name="myRequired" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="myElt" type="tns:myEltType" />
</xs:schema>
</xml>

try-sub.xsd:

<xml>
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema
    xmlns:xs="" >
  <xs:complexType name="myOptEltType">
    <xs:sequence>
      <xs:element minOccurs="1" name="a" type="xs:string" />
      <xs:element minOccurs="0" name="b" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>
</xml>

The tester:

<perl>
$|=1;
use strict;
use XML::LibXML;
use XML::Compile;
use XML::Compile::Schema;
use XML::Compile::Util qw( SCHEMA2001 SCHEMA2001i pack_type );
use Log::Report;

use constant XSDDIR => '~/fortinbras/hackathon/code/dbhack-live/guerilla/hivqs/test';
use constant HIVNS => '';
use constant HIVTEST => 0;

# avoid the debugger's stack dump...
# close the default PERL dispatcher (avoid die() on "fatal" xml-compile errors)
dispatcher close => 'default';

# open a FILE dispatcher, but to stdout
dispatcher 'FILE' => 'default', to => \*STDOUT, mode => 'NORMAL';

# now, the xml-compile calls apparently need to occur in eval{} blocks to
# avoid an 'undefined' error at Log::Report l.97 (apparently, $opts->{errno}
# doesn't get a default value (at least in this kludgy setup I've got) )

# instantiate an XML document with LibXML:
my $doc = XML::LibXML::Document->new('1.0','UTF-8');

# here we tell XML::Compile where to find the schemata
my ($xsds, $sch, $wri, $data);

# SCHEMA2001, SCHEMA2001i are constants containing the declaration URLs
# (defined in XML::Compile::Util)
$xsds = [SCHEMA2001, SCHEMA2001i, qw( try.xsd try-sub.xsd )];

# read in the schemata
$sch = XML::Compile::Schema->new($xsds);

# compile the XML writer; write starting at the root element
$wri = $sch->compile('WRITER', pack_type(HIVNS,'myElt'));

# here's some data: the required element is defined, the optional
# element is not...
$data = {'myRequired' => 'narb' };

eval {
    # at the command-line, just execute the line below
    # what, oh what, will happen??
    $wri->($doc, $data);
};
</perl>

What do the schemata say? They set up a structure for which the following instances are valid:

<xml>
<myElt>
  <myRequired> Hey, I'm required! </myRequired>
  <myOptional>
    <a> I need to be here </a>
    <b> but I don't </b>
  </myOptional>
</myElt>
</xml>

<xml>
<myElt>
  <myRequired> Hey, I'm still required! </myRequired>
  <myOptional>
    <a> I go with mama everywhere </a>
  </myOptional>
</myElt>
</xml>

<xml>
<myElt>
  <myRequired> Hey, where'd everybody go? </myRequired>
</myElt>
</xml>

With the schemata as defined above, XML::Compile has no problems with these. In particular,

<perl>
$data = { myRequired => " Hey! Where'd everybody go? " };
print $wri->($doc, $data)->toString;
</perl>

prints:

<xml>
<x0:myElt xmlns:try="">
  <x0:myRequired>Hey! Where\'d everybody go?</x0:myRequired>
</x0:myElt>
</xml>

(See, you even get an escaped apostrophe.)

However, when myOptional is made nillable (but still optional) in try.xsd, like so:

<xml>
...
  <xs:complexType name="myEltType">
    <xs:sequence>
      <xs:element minOccurs="0" name="myOptional" nillable="true"
                  type="try:myOptEltType"/>
      <xs:element name="myRequired" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
...
</xml>

then the above snippet causes XML::Compile::Schema to barf:

error: complex `x0:myOptional' requires data at {}myElt/myOptional

(courtesy of Log::Report, another cool module by Mark Overmeer). You can specify

<perl> $data = { myRequired => "Hey! ...", myOptional => 'NIL' }; </perl>

(where the string 'NIL' gets magically converted into the right attribute, as described above), and no more barf:

<xml>
<x0:myElt xmlns:try="">
  <x0:myOptional xsi:nil="true"/>
</x0:myElt>
</xml>

as advertised. But what about the optionality? So with XML::Compile as-is, it seems impossible to overcome the semipredicate problem in the way I had hoped.

Bummer, dude.

01:09, 26 March 2009 (EDT) : BioPerl is beautiful

BioPerl, how do I love thee? Let me

$count = scalar @ways;

We have achieved schema-valid XML in Bio::DB::HIV. Lots of i-dotting and t-crossing to do, but the basic foundation seems to be in place. The following code is now all you need to poke LANL and get XML to squirt out:

<perl>
use strict;
use Bio::DB::Query::HIVQuery;
use HIVXmlSchema; # to be rechristened Bio::DB::HIV::HIVXmlSchema

my $q;

# do an hivquery
$q = Bio::DB::Query::HIVQuery->new(
    -query => "(F)[subtype] (Env)[gene] (BR ZA)[country] { pat_id risk_factor project }"
);

# print the schema-valid XML
print $q->make_xml_with_ids( $q->ids );

# The end.
</perl>


Here's the XML output.

Isn't that elegant? It seems like it, but of course you can't see the guts. Bioperl is like people; disorder and chaos are frequently stashed away someplace relatively safe and inaccessible. However, I will open the can of worms next time, with a brief code walkthrough. Even the guts are relatively elegant-looking, thanks to the chaos encapsulated in XML::Compile::Schema and XML::LibXML. I'll go over the details of how I used XML::Compile at that time.

If you're curious about the query statement, please have a look at the HIVQuery tutorial at Fortinbras.

(BTW, regarding the nillable-optional problem, I opted for optional rather than nillable, to make life easiest with XML::Compile::Schema. There are several kludgy options that could work around it.)

The next objective will be incorporating NeXML into the schema, to contain the sequence data. After that, we connect the hose to the web service.

00:36, 31 March 2009 (EDT) : HIVDBSchema, NeXML, BioPerl

Greetings, fellow pilgrims.

Now we're getting somewhere. Tonite, the executive summary:

  • We have a means for translating nested hash structures in Perl to HIVDBSchema-valid XML in the module XML::Compile::Schema;
  • We have HIVquery object methods (defined in module Bio::DB::HIV::HIVXmlSchema) that use the above methods to squirt out a valid XML Schema instance document.

But wait; there's more. Thanks to Bio::Phylo, with help from XML::LibXML, we can also build a valid NeXML instance document that contains the sequence data returned on a query via Bio::DB::HIV.

How do we do it? Volume. (Seems like there's a helluva lot of code.) But all the user needs to do is:

<perl>
use Bio::DB::HIV;
use Bio::DB::Query::HIVQuery;
use Bio::DB::HIV::HIVXmlSchema;

my $db = Bio::DB::HIV->new();
my $q  = Bio::DB::Query::HIVQuery->new(
    -query => "(F)[subtype] (Env)[gene] (BR ZA)[country]"
             ."{ pat_id risk_factor project }"
);

my $xml_annot_doc = $q->make_xml_from_query();
my $xml_seq_doc   = $db->make_nexml_from_query( $q );
</perl>

The NeXML output is valid. Try it!

Next, a brief code-walkthrough. Having found there was a large WTF? zone between the cool idea and the actual code, I think others may find a map through this zone helpful. But, do help yourself to the links above, if you're curious.

11:05, 8 April 2009 (EDT) : Code walkthroughs

We're back.

Here I'm going to quickly walk through the HIVXmlSchema methods that provide the schema-valid annotations (make_xml_from_query() and make_nexml_from_query()).

Note that the sequences do not have to be retrieved. If a query is performed as:

<perl>
my $q = Bio::DB::Query::HIVQuery->new(
    -query => 'C[subtype] BR[country] SI[phenotype]' );
</perl>

then the annotations are retrieved, but the sequences are obtained through a Bio::DB::HIV object later (in a nice BioPerly way):

<perl>
my $db = Bio::DB::HIV->new();

# get a sequence stream based on the query:
my $seqio = $db->get_Stream_by_query( $q );
</perl>

Therefore, we create the NeXML (which holds the sequence data as a compact character matrix) from the $db object, using $db->make_nexml_from_query.

Method make_xml_from_query()

make_xml_from_query() is an instance method of class Bio::DB::Query::HIVQuery. It generates an HIVDBSchema/1.0-valid XML document that contains the LANL HIV DB annotations requested in the query.

This method wraps another called make_xml_with_ids.

<perl>
sub make_xml_from_query {
    my $self = shift;
    return $self->make_xml_with_ids( $self->ids );
}

sub make_xml_with_ids {
    my $self = shift;
    my @ids  = @_;
    my @hashes;
    unless ($self->_run_option == 2) {
        $self->warn("Method requires that query be run at level 2");
        return undef;
    }
</perl>

The following loop creates an array of hashrefs that encode the annotations as nested hashes that can be provided to the XML::Compile::Schema::Writer, which renders the XML:


   foreach (@ids) {

my $h = $self->_xml_hashref_from_id($_); next unless $h; # skip on dne push @hashes, $h;



Here we create a new XML DOM with XML::LibXML::Document, and instantiate the writer. Producing the XML output is then as easy as inserting a node into the document using the DOM. Note that we use Mark Overmeer's Log::Report try functionality to capture exceptions. It's quite cool, though a pain to install (for me anyway) due to all the dependencies.


   if (@hashes) {

my $sch = Bio::DB::HIV::HIVXmlSchema->new(); my $doc = XML::LibXML::Document->new(); my ($wri, $guts);

# use the Log::Report try block around $wri->() and check # $@; throw BP error if set. </perl>

Recall that make_writer is a HIVQuery wrapper around XML::Compile::Schema. The writer produced already contains the HIVDBSchema/1.0 root element. Note especially that the writer returns an XML::LibXML::Node object that can be manipulated as such.

<perl>
        try {
            $wri  = $sch->make_writer;
            $guts = $wri->($doc, { 'annotHivqSeq' => [@hashes] });
        };
        if ($@) {
            $@->reportAll;
            exit(0); # handle XML::Compile::Schema error
        }
        else {
            $doc->addChild($guts);
</perl>

To produce the XML string, we use an XML::LibXML method:

<perl>
            return $doc->toString(1);
        }
    }
    else {
        # dude, no data!
        $self->warn("No XML was generated for this query");
    }
}
</perl>

Method make_nexml_from_query()

make_nexml_from_query() is an instance method of class Bio::DB::HIV, and takes a Bio::DB::Query::HIVQuery object as an argument. It generates a NeXML-valid XML document that contains the LANL HIV DB sequences plus id information as requested in the query instance.

This method is more involved, but the basic idea is pretty simple: create an XML document (as a DOM) and build it up node by node. We create the new NeXML-compliant nodes by sending the query's sequence data through Rutger's Bio::Phylo package, relying heavily on its built-in Bioperl interface.

Some fiddling with NeXML dict elements is required in order to store the LANL id numbers and GenBank accessions. The NeXML @id attributes are really meant to identify the XML elements of the document internally, so these are unsafe to use as persistent places for user data (see the extensive discussion on this and other related issues in Database Interop Hackathon/Metadata Support).

<perl>
sub make_nexml_from_query {
    my ($self,@args) = @_;
    my ($q)       = @args;
    my $bpf       = Bio::Phylo::Factory->new;
    my $seqio     = $self->get_Stream_by_query( $q );
    my $dat_obj   = $bpf->create_datum();
    my $taxon_obj = $bpf->create_taxon();
    my $taxa      = $bpf->create_taxa();
    my ($xrdr, $dom);
    my $doc = XML::LibXML::Document->new();
    my ($mx, $alphabet);
    while ( my $seq = $seqio->next_seq ) {
        # check first seq and make matrix (note this won't
        # work if we have mixed data)
        if ($mx) {
            $self->throw( "Mixed data NeXML not currently implemented" )
                if $seq->alphabet ne $alphabet;
        }
        else {
            $alphabet = $seq->alphabet;
        }
</perl>

Here is the Bio::Phylo-y business. We used the Factory above to avoid having to climb back through Rutger's code to get the dependencies. (Which is why he provided the Factory -- that's no disrespect!) The only thing that is a little tricky here is that the create_... methods aren't used as constructors directly. We want to access the Bioperl functionality, so we call new_from_bioperl off the objects returned by the create_... methods...


<perl>
       $mx ||= $bpf->create_matrix( -type=>$seq->alphabet );
       my ($taxon, $datum);
       #create elements...
       $taxon = $taxon_obj->new( -name => $seq->id,
                                 -desc => $seq->annotation->get_value('Special','accession'));
       $datum = $dat_obj->new_from_bioperl($seq);
       #organize into containers...
       $taxa->insert( $taxon );
       $mx->insert( $datum );
   }
   # so if @dna != 0 and @aa != 0, we require a mixed matrix.
   # link the matrix to the taxa 'block'
   $mx->set_taxa( $taxa );
</perl>


Here we use the XML::LibXML::Reader to convert the NeXML text back into an XML::LibXML::Node so we can manipulate it within the DOM. (Recall that XML::Compile::Schema works natively with DOM nodes. Maybe this would be a convenient functionality to build into Bio::Phylo!) (Did I just volunteer? Dang!)


<perl>
   $xrdr = XML::LibXML::Reader->new( string => join('', '<fake>',
       $taxa->to_xml, $mx->to_xml(-compact => 1), '</fake>') );
</perl>



This line captures the main NeXML elements <otus> and <characters> for loading up with data nodes:


<perl>
   my ( $otus_node, $characters_node ) = $xrdr->copyCurrentNode(1)->childNodes;
   # build the DOM
   foreach my $otu_node ( $otus_node->childNodes ) {
       next unless $otu_node->nodeName eq 'otu';
       my ($lanlid, $tmp, $lanlid_node, $gbaccn_node);
</perl>


In this code and similar snippets below, we make dict elements to contain special user data (the various sequence ids). A unique internal @id for each NeXML element is required by the standard. We create one by noting that XML::LibXML objects are implemented "inside-out": the object is a blessed scalar reference, so dereferencing it to the underlying scalar gives us a guaranteed document-unique id.


<perl>
       $lanlid_node = XML::LibXML::Element->new('dict');
       $lanlid_node->setAttribute('id', "dict$$lanlid_node"); # uniquify
       $lanlid = $otu_node->getAttribute('label');
       $tmp = XML::LibXML::Element->new('string');
       $tmp->setAttribute('id', "LANLSeqId_$$lanlid_node");
       $tmp->addChild( XML::LibXML::Text->new($lanlid) );
       $gbaccn_node = XML::LibXML::Element->new('dict');
       $gbaccn_node->setAttribute('id', "dict$$gbaccn_node");
       $tmp = XML::LibXML::Element->new('string');
       $tmp->setAttribute('id', "GenBankAccn_$$gbaccn_node");
       $tmp->addChild( XML::LibXML::Text->new($q->get_accessions_by_id($lanlid)) );
   } # end foreach
</perl>
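The uniqueness claim is easy to check in plain Perl. Here is a minimal sketch using a toy inside-out class (FakeNode is invented for this illustration; a real XML::LibXML node is likewise a blessed scalar reference, whose dereference $$node yields the address of the underlying C node, unique among live nodes):

<perl>
{
    package FakeNode;
    # toy inside-out class: the blessed thing is a scalar ref,
    # here wrapping a counter rather than a C-struct address
    my $addr = 0;
    sub new {
        my $class = shift;
        my $self = \(my $scalar = ++$addr);
        return bless $self, $class;
    }
}
my $n1 = FakeNode->new;
my $n2 = FakeNode->new;
# build ids exactly as in the dict code above
print "dict$$n1 dict$$n2\n"; # two distinct ids, e.g. "dict1 dict2"
</perl>

The same "$$node" interpolation is what the setAttribute calls above rely on.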


Now we set up the NeXML header and root element...


<perl>
   # create the NexML doc
   $dom = XML::LibXML::Element->new('nexml');
   $dom->setNamespace(NEXML, 'nex');
   $dom->setAttribute('version', '0.8');
   $dom->setAttribute('generator', 'Bio::DB::HIVXmlSchema');
   $dom->setAttribute('xmlns', NEXML);
</perl>


and set our NeXML-compliant document free, as a plain string.


<perl>
   return $doc->toString(1);
}
</perl>

16:20, 12 April 2009 (EDT) : Going live with XML output

After getting XML::LibXML, XML::Compile, XML::Compile::Schema and their dependencies installed on Fortinbras (not easy when you have the el cheapo account; the standard Perl installation on Network Solutions is pretty bare-bones), I was able to get the new Bio::DB::HIV modules to work. Options for retrieving schema-valid XML epidemiological sequence annotations and NeXML-formatted sequence data are now available at HIVQuery for free. And this is not just free as in free beer, since the new modules are available by Subversion checkout at, along with the XML Schema files.

There are some memory issues in the NeXML option on Fortinbras (but not on my test machine at home). If the number of sequences is "too many", the dreaded Perl Out of memory! error appears. It's not clear where the problem resides, though the kick-out occurs during the Bio::Phylo-y code, and not during the DOM-building code. The total size of the Bio::Phylo store after processing the "large" sequence download is only about 360Kb, which suggests that my Network Solutions account is fairly ungenerous. The custom annotation schema routines don't seem to have any problem.

Still staggering toward the elusive web service, but the real machinery is now in place.

15:53, 13 April 2009 (EDT) : Re-skinning the cat

Since the out-of-memory NeXML problem appeared to happen in the Bio::Phylo part of the code, I decided to try changing the construction of the NeXML DOM object. Instead of building the entire Bio::Phylo structure first and adding the otus and characters elements to the XML::LibXML DOM in two lines (much preferred, from an esthetic standpoint), I thought I'd try to add each B:P taxon and datum object to the DOM immediately after each of those was created; i.e., on the fly, as we loop through the sequences. Doing this required going a bit deeper into Bio::Phylo, not without profit. This M.O. also suggests a way that NeXML could be "streamed" while preserving its validity. More on that later.

This change in construction order evades the memory issue, because rather than storing all the data, with ancillary object information, as a Bio::Phylo structure before squirting it into the DOM, we instead create the objects temporarily and store them in the DOM just-in-time. On each loop, the $taxon and $datum B:P objects are created locally (with my), and are destroyed by B:P for free. B:P is therefore storing only a handful of objects at a time, rather than 1000s at once.

Here's an overview of the strategy:

  • Add a $taxon to the $taxa object (B:P)
  • Add a $datum to the matrix object (B:P)
  • Create the taxon:datum linkage (B:P)
  • Convert the $taxon object to NeXML (B:P)
  • Read the NeXML as a LibXML node (X:L) (now an <otu> XML element)
  • this is really a kludge conversion, as I noted in passing above; Bio::Phylo could probably provide X:L nodes in an Unparser
  • Convert the $datum object to NeXML (B:P)
  • Read the NeXML as a LibXML node (X:L) (now a <row> XML element)
  • Add the <otu> element to the <otus> element in the DOM (X:L)
  • Add the <row> element to the matrix element (inside the <characters> element) in the DOM (X:L)
  • Destroy the $taxon and $datum objects (B:P; happens automatically by leaving the scope)

In this strategy, then, we don't use the B:P native insert() facility, but add the new (converted) taxon and datum nodes to the DOM directly. Is this a problem? Not based on a glance at the insert() method, which basically just pushes the new entity into the container. We will lose the B:P logging of this event, however.

Preserving uniqueness of ids

An issue we must solve, however, is making sure the @id attributes that form the links between the otu and row elements are unique, even though we are adding each pair of elements "by hand". We make the link by invoking a statement like <perl> $taxon->set_data($datum); </perl> which is fine, but if $taxon and $datum are subsequently destroyed, the next pair of objects created when the loop repeats may get assigned these old id numbers, since B:P reclaims the ids it has already used.
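The collision is easy to reproduce with a toy pool (ToyIDPool is an invented stand-in for this sketch; Bio::Phylo::Util::IDPool keeps a similar @reclaim array):

<perl>
{
    package ToyIDPool;
    my (@reclaim, $counter);
    # hand back a reclaimed id if one is waiting, else mint a fresh one
    sub next_id  { @reclaim ? shift @reclaim : ++$counter }
    sub _reclaim { my ($class, $id) = @_; push @reclaim, $id }
}
my $id1 = ToyIDPool->next_id;   # a fresh id
ToyIDPool->_reclaim($id1);      # owner destroyed; id goes back to the pool
my $id2 = ToyIDPool->next_id;   # the same id again -- a later object now
                                # shares the old id, breaking our otu/row links
print "id1=$id1 id2=$id2\n";
</perl>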

The first thought is to defeat the id reclamation, and make sure that the id numbers returned have never been used before. But we have to make sure this doesn't break anything. There is a discussion of object reclamation deep in B:P code (in, that says:


   # ...To avoid memory leaks (and subtle bugs, should a new object by
   # the same id appear (though that shouldn't happen)), the hash slots
   # occupied by $obj->get_id need to be reclaimed in the destructor. This
   # is done by recursively calling the $obj->_cleanup methods in all of $obj's
   # superclasses. To make that method easier to write, we create an array
   # with the local inside-out hashes here, so that we can just iterate over
   # them anonymously during destruction cleanup. Other classes do something
   # like this as well.


When we have a look at the destructor ( Bio::Phylo::DESTROY() ), it appears, however, that all the space is recovered before the id number is given back to the pool of ids. The last thing DESTROY does is call Bio::Phylo::Util::IDPool::_reclaim(), which is all of

<perl>
   sub _reclaim {
       my ( $class, $obj ) = @_;
       push @reclaim, $obj->get_id;
   }
</perl>

So (as far as I can see), I can throw my wrench in here. I kludge the following:

<perl>
   sub _reclaim {
       return; # kludge!
       my ( $class, $obj ) = @_;
       push @reclaim, $obj->get_id;
   }
</perl>

and so prevent id numbers from being reclaimed. Now whenever a new B:P object is created, its id has never been seen before, whether or not previous objects have been destroyed.
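The effect of the kludge can be sketched with a toy pool (ToyIDPool is invented for illustration): with _reclaim turned into a no-op, an id, once handed out, never comes back.

<perl>
{
    package ToyIDPool;
    my (@reclaim, $counter);
    sub next_id  { @reclaim ? shift @reclaim : ++$counter }
    sub _reclaim { return; } # kludge! ids never re-enter the pool
}
my $id1 = ToyIDPool->next_id;   # a fresh id
ToyIDPool->_reclaim($id1);      # a no-op now
my $id2 = ToyIDPool->next_id;   # another fresh id -- no collision possible
print "id1=$id1 id2=$id2\n";
</perl>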

The Code

With this jig in place, we can implement our strategy. Compare with the first try above.

<perl>
sub make_nexml_from_query_s {
   # defeated the id reclaimer in Bio::Phylo...
   my ($self,@args) = @_;
   my ($q)  = @args;
   my $bpf       = Bio::Phylo::Factory->new;
   my $seqio     = $self->get_Stream_by_query( $q );
   my $dat_obj   = $bpf->create_datum();
   my $taxon_obj = $bpf->create_taxon();
   my ($mx, $taxa, $alphabet);
   my ($xrdr, $dom, $otus_elt, $characters_elt, $matrix_elt);
   my $doc = XML::LibXML::Document->new();
</perl>


  • build the NeXML header starting out


<perl>
   # create the DOM, with NexML doc header
   $dom = XML::LibXML::Element->new('nexml');
   $dom->setNamespace(NEXML, 'nex');
   $dom->setAttribute('version', '0.8');
   $dom->setAttribute('generator', 'Bio::DB::HIVXmlSchema');
   $dom->setAttribute('xmlns', NEXML);
   my $seq1 = $seqio->next_seq; #peek
   $alphabet = $seq1->alphabet;
</perl>


  • Here we create the $matrix and $taxa objects (and the linkage between them) before anything else...they are the bags we'll scoop the downloaded data into piece by piece--


<perl>
   $mx = $bpf->create_matrix( -type=>$seq1->alphabet );
   $taxa = $bpf->create_taxa();
   # create the linkage
</perl>


  • We do the conversion of these B:P objects to DOM elements, and get handles to the internal elements that will actually hold the data--


<perl>
   # create 'otus' and 'characters' elements in the DOM, and get
   # refs to the meaty bits
   $xrdr = XML::LibXML::Reader->new( string => join('', '<fake>',
       $taxa->to_xml, $mx->to_xml(-compact => 1), '</fake>') );
   # the (1) argument gets a deep copy of the node...
   ( $otus_elt, $characters_elt ) = $xrdr->copyCurrentNode(1)->childNodes;
   ($matrix_elt) = $characters_elt->getChildrenByTagName('matrix');
</perl>


  • Put these in the DOM--


<perl>
   # add to the DOM
</perl>


  • It turns out we need to create the <char> elements required by NeXML by hand.


<perl>
   # need to kludge in the <char> elts...
   my $maxlen=0;
</perl>


  • The main sequence-grabbing loop...


<perl>
   while ( my $seq = $seq1 || $seqio->next_seq ) {
       undef $seq1;
       $maxlen = ($maxlen > $seq->length ? $maxlen : $seq->length);
       $self->throw( "Mixed data NeXML not currently implemented" )
           if $seq->alphabet ne $alphabet;
</perl>

  • Local B:P objects that will get thrown away..


<perl>
       my ($taxon, $datum);
       #create elements...
       $taxon = $taxon_obj->new( -name => $seq->id,
                                 -desc => $seq->annotation->get_value('Special','accession'));
       $datum = $dat_obj->new_from_bioperl($seq);
       #create the link
       # write the new elements into the DOM...
       $xrdr = XML::LibXML::Reader->new( string =>
           join('', '<fake>', $taxon->to_xml, $datum->to_xml(-compact => 1), '</fake>'));
       my ($otu_elt, $row_elt) = $xrdr->copyCurrentNode(1)->childNodes;
</perl>


  • Futz with the namespaces


<perl>
       # remove ns redecls
       # put the new row in the matrix
</perl>


  • Details on this construction are in the routine described previously.


<perl>
       # create the otu element, adding the LANL id and GenBank accn.
       # as 'dict' elements...
       my ($lanlid, $tmp, $lanlid_elt, $gbaccn_elt);
       $lanlid_elt = XML::LibXML::Element->new('dict');
       $lanlid_elt->setAttribute('id', "dict$$lanlid_elt"); # uniquify
       $lanlid = $otu_elt->getAttribute('label');
       $tmp = XML::LibXML::Element->new('string');
       $tmp->setAttribute('id', "LANLSeqId_$$lanlid_elt");
       $tmp->addChild( XML::LibXML::Text->new($lanlid) );
       $gbaccn_elt = XML::LibXML::Element->new('dict');
       $gbaccn_elt->setAttribute('id', "dict$$gbaccn_elt");
       $tmp = XML::LibXML::Element->new('string');
       $tmp->setAttribute('id', "GenBankAccn_$$gbaccn_elt");
       $tmp->addChild( XML::LibXML::Text->new($q->get_accessions_by_id($lanlid)) );
       # put the otu elt in the otus elt
   } # end while
</perl>


  • Produce the char elements using XML::LibXML facilities and add them to the DOM--


<perl>
   # char elements
   my ($format_elt) = $dom->getElementsByTagName('format');
   my $states_id = ($dom->getElementsByTagName('states'))[0]->getAttribute('id');
   foreach (1..$maxlen) {
       my $char_elt = XML::LibXML::Element->new('char');
       $char_elt->setAttribute('id' => "c$_");
       $char_elt->setAttribute('states' => $states_id);
   } # end foreach
</perl>


  • We're done.


<perl>
   return $doc->toString(1);
}
</perl>

Definitely more complex and uglier. But, it does the job, producing valid NeXML and not gagging at Fortinbras.

"Streaming" NeXML?

Stepping back, it looks to me like we've written a method that provides an (unbuffered) stream from one internal memory representation to another, for a rather specific use case. The reason for the conversion was to be able to take advantage of the nice internal bookkeeping that Bio::Phylo takes care of for you, but then to slough off the extra B:P stuff after doing that, storing the growing data representation in the slimmer DOM of libxml.

A more general version of this process could be useful. A Google Summer of Code project I proposed (here it is) suggests integrating Bio::Phylo with BioPerl in a way that allows BioPerl users to access B:P functionality in a BioPerl-y way. Streams are an important feature of BioPerl, and the ability to do something like this

<perl>
$fasta_io = Bio::SeqIO->new( -file=>"myseqs.fas", -format=>"fasta" );
$nexml_io = Bio::SeqIO->new( -file=>">myseqs.xml", -format=>"nexml" );
while ( my $seq = $fasta_io->next_seq ) {
   if ($seq->desc =~ /Homo/) {
       $nexml_io->write_seq($seq);
   }
}
</perl>

might really improve the uptake of NeXML as a data exchange format. In this example, we pull out the Homo sequences, and at the end of the loop, myseqs.xml is a complete and valid NeXML file. Streaming is necessary here, as is a way to add individual data elements to a growing XML representation while maintaining its validity. The ideas in the routine detailed in this post could go some way to making that a possibility.