Clinical annotation of pathogen sequences and phylogenies

From Evolutionary Informatics Working Group

Background

Linking pathogen sequence evolution to pathogenesis in humans and animals has become one of the most important endeavors in modern molecular epidemiology (broadly defined), and has been a part of HIV research for many years. Identifying valid and reproducible associations between sequence changes and disease processes depends on standardized sequence and "phylodynamic" annotations that describe the patient's clinical state at the time of sequence sampling. However, a close look at (for example) the Los Alamos HIV Sequence Database reveals that field names and data descriptors are frequently copied in from primary sources, misspellings, duplications and all. This greatly reduces the value of the resource, to the point where colleagues of mine are reluctant to use such data at all.

The development of a unified, standardized protocol for the clinical annotation of pathogen sequences would lead not only to improved database interoperability, but also to greater end-user confidence in the legitimacy of the data, improving database usage and return on investment.

Possible discussion topics

I envision discussions and brainstorming sessions on these issues that would bring participants up to date on the approaches taken in existing implementations, ideally culminating in a prototype for a standard ontology. That ontology would take into account the special requirements of sequence analysis in the context of disease research, and would provide a standard representation of the clinical data associated with sequences.

Patient de-identification

Any discussion of clinical annotation must also address electronic privacy and security when dealing with patient-identifiable data. Identifiability and de-identification become more complex when, as with HIV and other rapidly evolving pathogens such as Helicobacter pylori and hepatitis C virus, the pathogen sequence itself contains enough distinguishing information to identify the patient. I would hope for discussions about protocols that ensure patient privacy in the context of open data access.

Deliverables

Because it is an important data problem within my own experience, I would propose HIV and the LANL HIV sequence database as a prototypical example of the issues raised above. I would like to begin working out an OWL ontology containing terms and relationships specific to HIV disease, based on or taking inspiration from the more general infectious disease OBO ontology. This ontology would interact with a second, HIV-specific sequence/phylogenetic ontology derived from CDAO. The phylogenetic ontology for HIV would, for example, be able to represent dated tips, virus phenotype, and drug resistance mutations.
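To make the last point concrete, here is a minimal sketch of the information such a phylogenetic ontology would need to attach to a tree tip. It is a plain Python data structure, not OWL, and the class names, field names, and the "R5" phenotype label are illustrative assumptions rather than proposed ontology terms.

```python
import re
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ResistanceMutation:
    """One amino-acid substitution, e.g. M184V in reverse transcriptase."""
    gene: str
    wild_type: str
    position: int
    mutant: str

    @classmethod
    def parse(cls, gene: str, code: str) -> "ResistanceMutation":
        # Accepts the conventional <wild type><position><mutant> shorthand.
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
        if match is None:
            raise ValueError(f"unrecognized mutation code: {code!r}")
        return cls(gene, match.group(1), int(match.group(2)), match.group(3))


@dataclass
class AnnotatedTip:
    """Hypothetical annotation record for one tip of an HIV phylogeny."""
    label: str
    sampling_date: Optional[date] = None   # the "dated tip"
    phenotype: Optional[str] = None        # illustrative label, e.g. "R5"
    resistance_mutations: List[ResistanceMutation] = field(default_factory=list)


tip = AnnotatedTip(
    label="seq1",
    sampling_date=date(2005, 3, 1),
    phenotype="R5",
    resistance_mutations=[ResistanceMutation.parse("RT", "M184V")],
)
```

A real ontology would express these as classes and properties with defined relationships; the sketch only shows which pieces of data have to travel with each tip.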

With an HIV-specific computable ontology, I would direct my hacking efforts at producing a proof-of-principle ontological filter or adaptor that would sit atop the BioPerl modules Bio::DB::HIV and Bio::DB::Query::HIVQuery and convert user query fields based on the ontology to the actual table and field names that exist in the LANL database. Responses would be delivered through the same filter, annotated with ontologically valid terms. I suggest these modules not just because I know them intimately, but because they provide a direct, computable and unattended way to query the LANL database, and so serve as a model of a relational sequence+clinical database, with the advantage of having real data behind it.
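The filter idea can be sketched as a two-way lookup table. The example below is in Python rather than Perl and does not call BioPerl; the ontology terms and the table and field names are invented for illustration and are not taken from the LANL schema.

```python
# Hypothetical mapping from ontology terms to backend (table, field) pairs.
# None of these names come from the actual LANL database.
TERM_TO_FIELD = {
    "hiv:PatientAge": ("patient_table", "pat_age"),
    "hiv:SamplingYear": ("sample_table", "samp_year"),
    "hiv:Subtype": ("sequence_table", "subtype"),
}

# Inverse mapping, used to relabel responses on the way back out.
FIELD_TO_TERM = {field: term for term, field in TERM_TO_FIELD.items()}


def translate_query(query):
    """Rewrite {ontology term: value} as {(table, field): value}."""
    unknown = sorted(term for term in query if term not in TERM_TO_FIELD)
    if unknown:
        raise KeyError(f"terms not in the ontology mapping: {unknown}")
    return {TERM_TO_FIELD[term]: value for term, value in query.items()}


def annotate_response(row):
    """Relabel a backend row {(table, field): value} with ontology terms."""
    return {FIELD_TO_TERM[key]: value for key, value in row.items()}


backend_query = translate_query({"hiv:Subtype": "B", "hiv:SamplingYear": 2005})
```

Rejecting unknown terms outright, rather than passing them through, keeps misspelled or duplicate field names from leaking past the filter, which is the problem the ontology is meant to solve.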

Since the LANL database can provide alignments and phylogenies, I would like to see HIVQuery modified to return these objects in a standard computable form such as NexML, annotated via the HIV-specific ontologies. On the data input side, I would like to pursue an application that allows automated or semi-automated, ontologically valid clinical and other annotation of HIV/SIV sequences via scans of sequence-associated GenBank records. A prototype may follow naturally from the data accession filter described above, since examination of the LANL database suggests that GenBank sequence feature names and entries are sometimes, perhaps often, used verbatim in data records.
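A semi-automated annotation pass might look roughly like the following Python sketch. It assumes the GenBank qualifiers have already been parsed into a dictionary; the ontology terms and the mapping itself are hypothetical, and anything the mapping does not recognize is set aside for a human curator instead of being guessed at.

```python
# Hypothetical mapping from GenBank source-feature qualifier names to
# ontology terms. The "hiv:" terms are placeholders, not a real vocabulary.
QUALIFIER_TO_TERM = {
    "country": "hiv:SamplingCountry",
    "collection_date": "hiv:SamplingDate",
    "isolation_source": "hiv:SampleTissue",
}


def normalize_key(raw):
    """Fold case and separators so verbatim variants map to one key."""
    return raw.strip().lower().replace(" ", "_").replace("-", "_")


def annotate_qualifiers(qualifiers):
    """Split qualifiers into ontology-annotated and needs-review sets."""
    annotated, needs_review = {}, {}
    for raw_key, value in qualifiers.items():
        term = QUALIFIER_TO_TERM.get(normalize_key(raw_key))
        if term is None:
            needs_review[raw_key] = value  # left for manual curation
        else:
            annotated[term] = value
    return annotated, needs_review


annotated, needs_review = annotate_qualifiers(
    {"Country": "Kenya", "collection date": "2005", "note": "free text"}
)
```

Here `annotated` holds the two recognized qualifiers under their ontology terms, and `needs_review` holds the `note` entry untouched.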