Difference between revisions of "TreeBASE Data"

From Evolutionary Informatics Working Group
Jump to: navigation, search
(Dump Contents)
 
(8 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
==TreeBASE Dump==
 
==TreeBASE Dump==
  
A Postgres dump for TreeBASE can be obtained [[http://www.treebase.org/~piel/tbtrees.zip here]].
+
A Postgres dump for [[TreeBASE]] can be obtained [[http://www.treebase.org/~piel/tbtrees.zip here]].
  
 
==Dump Contents==
 
==Dump Contents==
Line 8: Line 8:
 
                 List of relations
 
                 List of relations
 
  Schema |        Name        |  Type  | Owner
 
  Schema |        Name        |  Type  | Owner
  --------+----------------------+----------+-------
+
  -------+----------------------+----------+-------
 
  public | edges                | table    | piel
 
  public | edges                | table    | piel
 
  public | ncbi_names          | table    | piel
 
  public | ncbi_names          | table    | piel
Line 29: Line 29:
  
 
For each "study" record, there are many "trees" records. Each "trees" record has many "nodes" records, which are wired to each other via the "edges" table and a transitive closure index is in the "node_path" table. Each "tb_labels" record can point to many "nodes" records -- "tb_labels" is a table of unique taxon labels that appear in all trees. Each taxon_variant record maps to zero or more tb_labels; each "taxa" record maps to one or more taxon_variant records.  Each "taxa" record represents a single, normalized taxon, usually a species, but could be a subspecies or a higher taxon. Wherever possible, each "taxa" record has an ncbi_taxid -- that is, the IDs used by ncbi in their Genbank distribution. Consequently, these taxids connect the "ncbi_names" table, which in turn uses the "ncbi_nodes" table as a hierarchical classification. This classification has been pre-indexed with left and right IDs so that hierarchical searching is possible.
 
For each "study" record, there are many "trees" records. Each "trees" record has many "nodes" records, which are wired to each other via the "edges" table and a transitive closure index is in the "node_path" table. Each "tb_labels" record can point to many "nodes" records -- "tb_labels" is a table of unique taxon labels that appear in all trees. Each taxon_variant record maps to zero or more tb_labels; each "taxa" record maps to one or more taxon_variant records.  Each "taxa" record represents a single, normalized taxon, usually a species, but could be a subspecies or a higher taxon. Wherever possible, each "taxa" record has an ncbi_taxid -- that is, the IDs used by ncbi in their Genbank distribution. Consequently, these taxids connect the "ncbi_names" table, which in turn uses the "ncbi_nodes" table as a hierarchical classification. This classification has been pre-indexed with left and right IDs so that hierarchical searching is possible.
 +
 +
==Example Queries==
 +
 +
Queries for a list of trees that contain any kind of bird:
 +
 +
<sql>
 +
SELECT DISTINCT ON (tbt.tree_id) tbt.tree_label, tbt.tree_title
 +
FROM trees tbt JOIN nodes tbn ON (tbt.tree_id = tbn.tree_id)
 +
JOIN tb_labels tbl ON (tbn.taxon_id = tbl.tb_labels_id)
 +
JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
 +
JOIN taxa txa ON (txv.taxon_id = txa.taxon_id)
 +
JOIN ncbi_names nna ON (txa.taxid = nna.tax_id)
 +
JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id),
 +
ncbi_names hna NATURAL JOIN ncbi_nodes hno
 +
WHERE nno.left_id >= hno.left_id
 +
AND nna.name_class = 'scientific name'
 +
AND nno.left_id < hno.right_id
 +
AND hna.name_txt = 'Aves';
 +
</sql>
 +
 +
Get top ten list of trees with birds in them each listed by how many birds in each tree:
 +
 +
<sql>
 +
SELECT tbt.tree_id AS "TreeBASE Tree ID", MAX(tbt.tree_label) AS "Tree Label", MAX(tbt.tree_title) AS "Tree Title", COUNT(tbn.tree_id) AS "Number of Birds In Tree"
 +
FROM trees tbt JOIN nodes tbn ON (tbt.tree_id = tbn.tree_id)
 +
JOIN tb_labels tbl ON (tbn.taxon_id = tbl.tb_labels_id)
 +
JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
 +
JOIN taxa txa ON (txv.taxon_id = txa.taxon_id)
 +
JOIN ncbi_names nna ON (txa.taxid = nna.tax_id)
 +
JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id),
 +
ncbi_names hna NATURAL JOIN ncbi_nodes hno
 +
WHERE nno.left_id >= hno.left_id
 +
AND nna.name_class = 'scientific name'
 +
AND nno.left_id < hno.right_id
 +
AND hna.name_txt = 'Aves'
 +
GROUP BY tbt.tree_id
 +
ORDER BY COUNT(tbn.tree_id) DESC
 +
LIMIT 10;
 +
</sql>
 +
 +
List the trees that contain any kind of bird, any kind of crocodyle, but not any kind of mammal:
 +
 +
<sql>
 +
SELECT t.tree_id, t.tree_label, t.tree_title
 +
FROM trees t
 +
WHERE EXISTS (
 +
    SELECT 1
 +
    FROM nodes tbn JOIN tb_labels tbl ON (tbn.taxon_id = tbl.tb_labels_id)
 +
    JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
 +
    JOIN taxa txa ON (txv.taxon_id = txa.taxon_id)
 +
    JOIN ncbi_names nna ON (txa.taxid = nna.tax_id)
 +
    JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id)
 +
    JOIN ncbi_nodes inco ON (nno.left_id BETWEEN inco.left_id AND inco.right_id),
 +
    ncbi_names inca
 +
    WHERE inco.tax_id = inca.tax_id
 +
    AND inca.name_class = 'scientific name'
 +
    AND inca.name_txt = 'Aves'
 +
    AND tbn.tree_id = t.tree_id
 +
    )
 +
AND EXISTS (
 +
    SELECT 1
 +
    FROM nodes tbn JOIN tb_labels tbl ON (tbn.taxon_id = tbl.tb_labels_id)
 +
    JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
 +
    JOIN taxa txa ON (txv.taxon_id = txa.taxon_id)
 +
    JOIN ncbi_names nna ON (txa.taxid = nna.tax_id)
 +
    JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id)
 +
    JOIN ncbi_nodes inco ON (nno.left_id BETWEEN inco.left_id AND inco.right_id),
 +
    ncbi_names inca
 +
    WHERE inco.tax_id = inca.tax_id
 +
    AND inca.name_class = 'scientific name'
 +
    AND inca.name_txt = 'Crocodylidae'
 +
    AND tbn.tree_id = t.tree_id
 +
    )
 +
AND NOT EXISTS (
 +
    SELECT 1
 +
    FROM nodes tbn JOIN tb_labels tbl ON (tbn.taxon_id = tbl.tb_labels_id)
 +
    JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
 +
    JOIN taxa txa ON (txv.taxon_id = txa.taxon_id)
 +
    JOIN ncbi_names nna ON (txa.taxid = nna.tax_id)
 +
    JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id)
 +
    JOIN ncbi_nodes inco ON (nno.left_id BETWEEN inco.left_id AND inco.right_id),
 +
    ncbi_names inca
 +
    WHERE inco.tax_id = inca.tax_id
 +
    AND inca.name_class = 'scientific name'
 +
    AND inca.name_txt = 'Mammalia'
 +
    AND tbn.tree_id = t.tree_id
 +
    )
 +
;
 +
</sql>
 +
 +
Here is an example of a topological query that takes advantage of the transitive closure (in the node_path table) for all the trees in the database. In this case we are asking for all trees that match the pattern ((Hippopotamus_amphibius, Bos_taurus), Sus_scrofa). Meaning that the hippo and the cow are more closely related than either is to the pig.
 +
 +
<sql>
 +
SELECT t.tree_id, t.tree_label, t.tree_title
 +
    FROM trees t
 +
        INNER JOIN
 +
        (SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id, inN.tree_id
 +
        FROM nodes inN JOIN tb_labels tbl ON (inN.taxon_id = tbl.tb_labels_id)
 +
        JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
 +
        JOIN taxa txa ON (txv.taxon_id = txa.taxon_id), node_path inP
 +
        WHERE txa.namestring IN ('Hippopotamus amphibius','Bos taurus')
 +
        AND inP.child_node_id = inN.node_id
 +
        GROUP BY inN.tree_id, inP.parent_node_id
 +
        HAVING COUNT(inP.child_node_id) = 2
 +
        ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
 +
        USING (tree_id)
 +
    WHERE NOT EXISTS (
 +
      -- make sure that inP.parent_node_id is not a parent to Sus scrofa
 +
        SELECT 1
 +
        FROM nodes outN JOIN tb_labels tbl ON (outN.taxon_id = tbl.tb_labels_id)
 +
        JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
 +
        JOIN taxa txa ON (txv.taxon_id = txa.taxon_id), node_path outP
 +
        WHERE txa.namestring IN ('Sus scrofa')
 +
        AND outP.child_node_id = outN.node_id
 +
        AND outP.parent_node_id = lca.parent_node_id)
 +
    AND EXISTS (
 +
      -- but still make sure that the tree contains Sus scrofa
 +
        SELECT c.tree_id
 +
        FROM trees c, nodes q JOIN tb_labels tbl ON (q.taxon_id = tbl.tb_labels_id)
 +
        JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
 +
        JOIN taxa txa ON (txv.taxon_id = txa.taxon_id)
 +
        WHERE txa.namestring IN ('Sus scrofa')
 +
        AND q.tree_id = c.tree_id
 +
        AND c.tree_id = t.tree_id
 +
        GROUP BY c.tree_id
 +
        HAVING COUNT(c.tree_id) = 1);
 +
</sql>
 +
 +
[[Category:TreeBASE]]
 +
[[Category:Data Resources]]

Latest revision as of 15:58, 12 March 2009

TreeBASE Dump

A Postgres dump for TreeBASE can be obtained [here].

Dump Contents

               List of relations
Schema |         Name         |   Type   | Owner
-------+----------------------+----------+-------
public | edges                | table    | piel
public | ncbi_names           | table    | piel
public | ncbi_nodes           | table    | piel
public | node_path            | table    | piel
public | nodes                | table    | piel
public | nodes_node_id        | sequence | piel
public | study                | table    | piel
public | study_id_seq         | sequence | piel
public | taxa                 | table    | piel
public | taxon_id_seq         | sequence | piel
public | taxon_variant_id_seq | sequence | piel
public | taxon_variants       | table    | piel
public | tb_labels            | table    | piel
public | tb_labels_id         | sequence | piel
public | trees                | table    | piel
public | trees_tree_id        | sequence | piel
(16 rows)

For each "study" record, there are many "trees" records. Each "trees" record has many "nodes" records, which are wired to each other via the "edges" table and a transitive closure index is in the "node_path" table. Each "tb_labels" record can point to many "nodes" records -- "tb_labels" is a table of unique taxon labels that appear in all trees. Each taxon_variant record maps to zero or more tb_labels; each "taxa" record maps to one or more taxon_variant records. Each "taxa" record represents a single, normalized taxon, usually a species, but could be a subspecies or a higher taxon. Wherever possible, each "taxa" record has an ncbi_taxid -- that is, the IDs used by ncbi in their Genbank distribution. Consequently, these taxids connect the "ncbi_names" table, which in turn uses the "ncbi_nodes" table as a hierarchical classification. This classification has been pre-indexed with left and right IDs so that hierarchical searching is possible.

Example Queries

Queries for a list of trees that contain any kind of bird:

<sql> SELECT DISTINCT ON (tbt.tree_id) tbt.tree_label, tbt.tree_title FROM trees tbt JOIN nodes tbn ON (tbt.tree_id = tbn.tree_id) JOIN tb_labels tbl ON (tbn.taxon_id = tbl.tb_labels_id) JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id) JOIN taxa txa ON (txv.taxon_id = txa.taxon_id) JOIN ncbi_names nna ON (txa.taxid = nna.tax_id) JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id), ncbi_names hna NATURAL JOIN ncbi_nodes hno WHERE nno.left_id >= hno.left_id AND nna.name_class = 'scientific name' AND nno.left_id < hno.right_id AND hna.name_txt = 'Aves'; </sql>

Get top ten list of trees with birds in them each listed by how many birds in each tree:

<sql> SELECT tbt.tree_id AS "TreeBASE Tree ID", MAX(tbt.tree_label) AS "Tree Label", MAX(tbt.tree_title) AS "Tree Title", COUNT(tbn.tree_id) AS "Number of Birds In Tree" FROM trees tbt JOIN nodes tbn ON (tbt.tree_id = tbn.tree_id) JOIN tb_labels tbl ON (tbn.taxon_id = tbl.tb_labels_id) JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id) JOIN taxa txa ON (txv.taxon_id = txa.taxon_id) JOIN ncbi_names nna ON (txa.taxid = nna.tax_id) JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id), ncbi_names hna NATURAL JOIN ncbi_nodes hno WHERE nno.left_id >= hno.left_id AND nna.name_class = 'scientific name' AND nno.left_id < hno.right_id AND hna.name_txt = 'Aves' GROUP BY tbt.tree_id ORDER BY COUNT(tbn.tree_id) DESC LIMIT 10; </sql>

List the trees that contain any kind of bird, any kind of crocodyle, but not any kind of mammal:

<sql> SELECT t.tree_id, t.tree_label, t.tree_title FROM trees t WHERE EXISTS (

   SELECT 1
   FROM nodes tbn JOIN tb_labels tbl ON (tbn.taxon_id = tbl.tb_labels_id)
   JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
   JOIN taxa txa ON (txv.taxon_id = txa.taxon_id)
   JOIN ncbi_names nna ON (txa.taxid = nna.tax_id)
   JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id)
   JOIN ncbi_nodes inco ON (nno.left_id BETWEEN inco.left_id AND inco.right_id),
   ncbi_names inca
   WHERE inco.tax_id = inca.tax_id
   AND inca.name_class = 'scientific name'
   AND inca.name_txt = 'Aves'
   AND tbn.tree_id = t.tree_id
   )

AND EXISTS (

   SELECT 1
   FROM nodes tbn JOIN tb_labels tbl ON (tbn.taxon_id = tbl.tb_labels_id)
   JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
   JOIN taxa txa ON (txv.taxon_id = txa.taxon_id)
   JOIN ncbi_names nna ON (txa.taxid = nna.tax_id)
   JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id)
   JOIN ncbi_nodes inco ON (nno.left_id BETWEEN inco.left_id AND inco.right_id),
   ncbi_names inca
   WHERE inco.tax_id = inca.tax_id
   AND inca.name_class = 'scientific name'
   AND inca.name_txt = 'Crocodylidae'
   AND tbn.tree_id = t.tree_id
   )

AND NOT EXISTS (

   SELECT 1
   FROM nodes tbn JOIN tb_labels tbl ON (tbn.taxon_id = tbl.tb_labels_id)
   JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
   JOIN taxa txa ON (txv.taxon_id = txa.taxon_id)
   JOIN ncbi_names nna ON (txa.taxid = nna.tax_id)
   JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id)
   JOIN ncbi_nodes inco ON (nno.left_id BETWEEN inco.left_id AND inco.right_id),
   ncbi_names inca
   WHERE inco.tax_id = inca.tax_id
   AND inca.name_class = 'scientific name'
   AND inca.name_txt = 'Mammalia'
   AND tbn.tree_id = t.tree_id
   )

</sql>

Here is an example of a topological query that takes advantage of the transitive closure (in the node_path table) for all the trees in the database. In this case we are asking for all trees that match the pattern ((Hippopotamus_amphibius, Bos_taurus), Sus_scrofa). Meaning that the hippo and the cow are more closely related than either is to the pig.

<sql> SELECT t.tree_id, t.tree_label, t.tree_title

   FROM trees t
       INNER JOIN
       (SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id, inN.tree_id
        FROM nodes inN JOIN tb_labels tbl ON (inN.taxon_id = tbl.tb_labels_id)
        JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
        JOIN taxa txa ON (txv.taxon_id = txa.taxon_id), node_path inP
        WHERE txa.namestring IN ('Hippopotamus amphibius','Bos taurus')
        AND inP.child_node_id = inN.node_id
        GROUP BY inN.tree_id, inP.parent_node_id
        HAVING COUNT(inP.child_node_id) = 2
        ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
       USING (tree_id)
   WHERE NOT EXISTS (
     -- make sure that inP.parent_node_id is not a parent to Sus scrofa
       SELECT 1
       FROM nodes outN JOIN tb_labels tbl ON (outN.taxon_id = tbl.tb_labels_id)
       JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
       JOIN taxa txa ON (txv.taxon_id = txa.taxon_id), node_path outP
       WHERE txa.namestring IN ('Sus scrofa')
       AND outP.child_node_id = outN.node_id
       AND outP.parent_node_id = lca.parent_node_id)
   AND EXISTS (
     -- but still make sure that the tree contains Sus scrofa
       SELECT c.tree_id
       FROM trees c, nodes q JOIN tb_labels tbl ON (q.taxon_id = tbl.tb_labels_id)
        JOIN taxon_variants txv ON (tbl.taxon_variant_id = txv.taxon_variant_id)
        JOIN taxa txa ON (txv.taxon_id = txa.taxon_id)
       WHERE txa.namestring IN ('Sus scrofa')
       AND q.tree_id = c.tree_id
       AND c.tree_id = t.tree_id
       GROUP BY c.tree_id
       HAVING COUNT(c.tree_id) = 1);

</sql>