Virus data included in Hepadnavirus-GLUE


This page provides background information on the virus-associated data items included in the project. Information about endogenous hepadnaviral elements (eHBVs) can be found here.

Please note: links to files on GitHub are mainly designed to indicate where these files are located within the repository. To investigate files (e.g. tree files) in the appropriate software context we recommend downloading the entire repository and browsing locally.

Those specifically interested in hepatitis B virus (HBV) may want to investigate HBV-GLUE and NCBI-HBV-GLUE. This family of GLUE projects was developed specifically for HBV and incorporates a graphical user interface (GUI), HBV-GLUE-WEB, which allows users to browse and interrogate the underlying GLUE database via 'point-and-click' methods.

The MRC-University of Glasgow Centre for Virus Research hosts an instance of the GUI version of HBV-GLUE.


Hepadnavirus genome features


Hepadnaviruses have enveloped, spherical virions and a small, circular DNA genome ~3 kilobases (kb) in length. The genome is characterised by a highly streamlined organization incorporating extensive gene overlap - the open reading frame (ORF) encoding the viral polymerase (P) protein occupies most of the genome and typically overlaps at least one of the ORFs encoding the core (C) and surface (S) proteins.
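The gene overlap described above can be illustrated with a small sketch that measures how many positions two ORFs share on a circular genome. The genome length and all coordinates below are hypothetical placeholders chosen for illustration; they are not taken from any reference sequence in this project.

```python
# Hypothetical genome length (roughly 3 kb); not a real reference length.
GENOME_LENGTH = 3200

# Hypothetical ORF coordinates as (start, end), 1-based and inclusive.
# A feature whose start exceeds its end wraps past the origin of the
# circular genome.
FEATURES = {
    "P": (2300, 1600),
    "C": (1900, 2450),
    "S": (150, 830),
}


def positions(start, end, genome_length=GENOME_LENGTH):
    """Return the set of genome positions covered by one feature."""
    if start <= end:
        return set(range(start, end + 1))
    # Wrapping feature: from start to the end of the genome, then 1..end.
    return set(range(start, genome_length + 1)) | set(range(1, end + 1))


def overlap_length(a, b, genome_length=GENOME_LENGTH):
    """Number of positions shared by features a and b."""
    return len(positions(*a, genome_length) & positions(*b, genome_length))


for other in ("C", "S"):
    shared = overlap_length(FEATURES["P"], FEATURES[other])
    print(f"P overlaps {other} by {shared} positions")
```

Representing features as position sets is wasteful for large genomes but keeps the wrap-around logic easy to check at this scale.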

We defined a standard set of genome features for hepadnaviruses and the locations of these genome features on master reference sequences (see here).


Hepadnavirus sequences and sequence-associated data


The sequence data in this project are organised into multiple distinct sources. Each source contains data in either GenBank XML or plain FASTA format. The type of data is indicated by the name of the source (all GenBank XML sources contain 'ncbi' in the name).
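The naming convention above lends itself to a simple check. The following is a minimal sketch, assuming that source names are plain strings and that the match on 'ncbi' can be made case-insensitively; the function name, the returned labels and the second example name are hypothetical and not part of GLUE.

```python
def source_format(source_name: str) -> str:
    """Infer the data format of a source from its name.

    Assumption: any source whose name contains 'ncbi' holds GenBank XML,
    and every other source holds plain FASTA.
    """
    if "ncbi" in source_name.lower():
        return "genbank_xml"
    return "fasta"


# 'ncbi-refseqs' is a source named in this project;
# 'fasta-example' is a made-up name used only for illustration.
for name in ("ncbi-refseqs", "fasta-example"):
    print(name, "->", source_format(name))
```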

GenBank XML files are imported into this project directly from NCBI GenBank using an appropriately configured version of GLUE's GenBank importer module. The core Hepadnavirus-GLUE project contains a single NCBI-derived source - ncbi-refseqs - that contains 'master reference' genome sequences for each hepadnavirus species included in this project.


Hepadnavirus reference sequences


Hepadnavirus-GLUE contains reference sequences for all known hepadnavirus species.

Known hepadnavirus host species

Left to right: Recent research has identified divergent hepadnaviruses in: (i) icefish and (ii) spiny lizards. Viruses closely related to hepatitis B virus, which infects humans, have been identified in a wide range of mammals including (iii) woolly monkeys and (iv) duikers.


For each hepadnaviral genus, we have created a 'master' reference sequence, as follows:


Reference sequences are linked to auxiliary data in tabular format.


Parameter             Type     Definition
full_name             VARCHAR  Full name of the virus this sequence is derived from
name                  VARCHAR  Abbreviated name of the virus this sequence is derived from
genus                 VARCHAR  Taxonomy - virus genus
clade                 VARCHAR  Taxonomy - virus clade
isolate_name          VARCHAR  Name of the virus isolate this sequence is derived from
host_sci_name         VARCHAR  Species (Latin binomial) virus was isolated from
host_name             VARCHAR  Species (common name) virus was isolated from
length                INTEGER  Length of the sequence
pubmed_id             INTEGER  PubMed ID of manuscript associated with sequence
gb_create_date        GenBank  GenBank creation date of the sequence
gb_update_date        VARCHAR  Date of most recent GenBank update
country               VARCHAR  Country where virus was isolated
collection_year       INTEGER  Year virus was isolated
collection_month      VARCHAR  Month virus was isolated
collection_month_day  VARCHAR  Day of month virus was isolated
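As an illustration of how these fields might be handled outside GLUE, the sketch below maps one row of tabular data onto a typed record. The class and function names are hypothetical, the example values are placeholders rather than real data, and both date fields are kept as plain strings, which is an assumption.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Fields listed with type INTEGER in the table above.
INTEGER_FIELDS = {"length", "pubmed_id", "collection_year"}


@dataclass
class ReferenceRecord:
    full_name: Optional[str] = None
    name: Optional[str] = None
    genus: Optional[str] = None
    clade: Optional[str] = None
    isolate_name: Optional[str] = None
    host_sci_name: Optional[str] = None
    host_name: Optional[str] = None
    length: Optional[int] = None
    pubmed_id: Optional[int] = None
    gb_create_date: Optional[str] = None  # assumption: stored as text
    gb_update_date: Optional[str] = None
    country: Optional[str] = None
    collection_year: Optional[int] = None
    collection_month: Optional[str] = None
    collection_month_day: Optional[str] = None


def parse_record(row: dict) -> ReferenceRecord:
    """Convert a dict of raw strings into a typed ReferenceRecord.

    Missing or empty values become None; INTEGER fields are cast to int.
    """
    values = {}
    for f in fields(ReferenceRecord):
        raw = (row.get(f.name) or "").strip()
        if not raw:
            values[f.name] = None
        elif f.name in INTEGER_FIELDS:
            values[f.name] = int(raw)
        else:
            values[f.name] = raw
    return ReferenceRecord(**values)


# Placeholder values for illustration only.
record = parse_record({"name": "ExampleV", "length": "3200", "pubmed_id": ""})
print(record.name, record.length, record.pubmed_id)
```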

Multiple sequence alignments


Multiple sequence alignments constructed in this study are linked together using GLUE's alignment tree data structure. Alignments in the project include:

  1. A ‘root’ alignment constructed to represent proposed homologies between representative members of major hepadnavirus lineages.
  2. ‘Genus-level’ alignments constructed to represent proposed homologies between the genomes of representative members of specific hepadnavirus genera.
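The two-level arrangement above can be pictured as a small tree of alignments. The following is a generic sketch of such a structure, not GLUE's own implementation; the class name, the alignment names and the member names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlignmentNode:
    """One alignment in a tree: its own members plus child alignments."""
    name: str
    members: List[str] = field(default_factory=list)
    children: List["AlignmentNode"] = field(default_factory=list)

    def add_child(self, child: "AlignmentNode") -> "AlignmentNode":
        self.children.append(child)
        return child

    def all_members(self) -> List[str]:
        """Members of this alignment followed by those of all descendants."""
        result = list(self.members)
        for child in self.children:
            result.extend(child.all_members())
        return result


# Hypothetical names: a root alignment of lineage representatives with
# two genus-level alignments beneath it.
root = AlignmentNode("root", ["ref-A", "ref-B"])
root.add_child(AlignmentNode("genus-A", ["a1", "a2"]))
root.add_child(AlignmentNode("genus-B", ["b1"]))
print(root.all_members())
```

Walking the tree from the root in this way is one simple route to collecting every sequence that sits at or below a given taxonomic level.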


Phylogenetic trees


We used GLUE to implement an automated process for deriving midpoint-rooted, annotated trees from the alignments included in our project.

Trees were constructed at distinct taxonomic levels:

  1. Recursively populated root phylogenies
  2. Genus-level phylogenies
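For readers unfamiliar with midpoint rooting, the sketch below shows the general idea on a toy tree: find the two most distant tips and place the root halfway along the path between them. This is a generic illustration, not the procedure GLUE uses internally; the tree, its branch lengths and the function names are hypothetical.

```python
def farthest(tree, start):
    """Return (node, distance, path) for the node farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for neighbour, length in tree[node].items():
            if neighbour != parent:
                stack.append((neighbour, node, dist + length, path + [neighbour]))
    return best


def midpoint(tree):
    """Locate the midpoint of the longest tip-to-tip path.

    Returns (a, b, offset): the root lies on branch a-b, at distance
    offset from node a.
    """
    any_node = next(iter(tree))
    tip_a, _, _ = farthest(tree, any_node)
    _, total, path = farthest(tree, tip_a)
    half = total / 2.0
    walked = 0.0
    for a, b in zip(path, path[1:]):
        length = tree[a][b]
        if walked + length >= half:
            return a, b, half - walked
        walked += length
    raise ValueError("tree has no branches")


# Toy unrooted tree: tips A, B, C joined at internal node X.
# Each branch is stored in both directions with its length.
toy_tree = {
    "A": {"X": 1.0},
    "B": {"X": 2.0},
    "C": {"X": 5.0},
    "X": {"A": 1.0, "B": 2.0, "C": 5.0},
}
print(midpoint(toy_tree))
```

In the toy tree the longest path runs between tips B and C (length 7.0), so the root falls on the branch leading to C, 3.5 units from that tip.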


Project-specific schema extensions


Hepadnavirus-GLUE extends GLUE's core schema by adding a number of fields to the sequence table and a project-specific custom table: 'isolate'. These schema extensions are defined here. The isolate table is linked to the main 'sequence' table via the sequence ID field. It contains information pertaining to viral isolates, e.g. the species sampled and the date and location of sampling.
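The link between the two tables can be sketched with an in-memory SQLite database. SQLite is used here only to keep the example self-contained and says nothing about GLUE's own storage backend; the column names are assumptions loosely based on the fields described on this page, and the inserted row contains placeholder values, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified versions of the 'sequence' and 'isolate' tables.
conn.executescript("""
CREATE TABLE sequence (
    sequence_id VARCHAR PRIMARY KEY,
    source_name VARCHAR,
    length      INTEGER
);
CREATE TABLE isolate (
    sequence_id     VARCHAR REFERENCES sequence(sequence_id),
    host_sci_name   VARCHAR,
    country         VARCHAR,
    collection_year INTEGER
);
""")

# Placeholder rows for illustration only.
conn.execute("INSERT INTO sequence VALUES (?, ?, ?)",
             ("SEQ0001", "ncbi-refseqs", 3200))
conn.execute("INSERT INTO isolate VALUES (?, ?, ?, ?)",
             ("SEQ0001", "Genus species", "Country X", 2001))

# Join isolate data to its sequence via the shared sequence ID.
rows = conn.execute("""
    SELECT s.sequence_id, s.length, i.host_sci_name, i.country, i.collection_year
    FROM sequence AS s
    JOIN isolate AS i ON i.sequence_id = s.sequence_id
    ORDER BY s.sequence_id
""").fetchall()

for row in rows:
    print(row)
```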