Virus data included in Hepadnavirus-GLUE
This page provides background information on the virus-associated data items included in the project. Information about endogenous hepadnaviral elements (eHBVs) can be found here.
Please note: links to files on GitHub are mainly intended to show where these files are located within the repository. To investigate files (e.g. tree files) in the appropriate software context, we recommend downloading the entire repository and browsing it locally.
Those specifically interested in hepatitis B virus (HBV) may want to investigate HBV-GLUE and NCBI-HBV-GLUE. This family of GLUE projects was developed specifically for HBV and incorporates a graphical user interface (GUI), HBV-GLUE-WEB, that allows users to browse and interrogate the underlying GLUE database via 'point-and-click' methods.
The MRC-University of Glasgow Centre for Virus Research hosts an instance of the GUI version of HBV-GLUE.
Hepadnavirus genome features
Hepadnaviruses have enveloped, spherical virions and a small, circular DNA genome ~3 kilobases (kb) in length. The genome is characterised by a highly streamlined organisation incorporating extensive gene overlap: the open reading frame (ORF) encoding the viral polymerase (P) protein occupies most of the genome and typically overlaps at least one of the ORFs encoding the core (C) and surface (S) proteins.
We defined a standard set of genome features for hepadnaviruses and the locations of these genome features on master reference sequences (see here).
Hepadnavirus sequences and sequence-associated data
The sequence data in this project are organised into multiple distinct sources. Each source contains data in either GenBank XML or plain FASTA format. The type of data is indicated by the name of the source (all GenBank XML sources contain 'ncbi' in the name).
GenBank XML files are imported into this project directly from NCBI GenBank using an appropriately configured version of GLUE's GenBank importer module. The core Hepadnavirus-GLUE project contains a single NCBI-derived source - ncbi-refseqs - that contains 'master reference' genome sequences for each hepadnavirus species included in this project.
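The naming convention for sources can be checked mechanically. The following is a minimal Python sketch, not part of GLUE itself: the function name `source_format` and the source name `fasta-extra` are hypothetical, and only `ncbi-refseqs` is taken from this project.

```python
def source_format(source_name: str) -> str:
    """Infer the data format of a source from its name.

    Follows the convention described above: sources holding GenBank XML
    contain 'ncbi' in their name; all other sources hold plain FASTA.
    """
    return "GenBank XML" if "ncbi" in source_name.lower() else "FASTA"


if __name__ == "__main__":
    # 'fasta-extra' is a made-up source name used only for illustration.
    for name in ("ncbi-refseqs", "fasta-extra"):
        print(f"{name}: {source_format(name)}")
```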
Hepadnavirus reference sequences
Hepadnavirus-GLUE contains reference sequences for all known hepadnavirus species.
Left to right: Recent research has identified divergent hepadnaviruses in (i) icefish and (ii) spiny lizards. Viruses closely related to hepatitis B virus, which infects humans, have been identified in a wide range of mammals, including (iii) woolly monkeys and (iv) duikers.
For each hepadnaviral genus, we have created a 'master' reference sequence, as follows:
- Orthohepadnavirus: Hepatitis B virus, strain ayw (NC_003977)
- Avihepadnavirus: Duck hepatitis B virus, isolate DHBVQCA34 (NC_001344)
- Herpetohepadnavirus: Tibetan frog hepatitis B virus, isolate 243398 (NC_030446)
- Metahepadnavirus: Bluegill hepatitis B virus (NC_030445)
- Parahepadnavirus: White sucker hepadnavirus, isolate RR173 (NC_027922)
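For scripted use, the genus-to-master-reference list above can be held in a simple lookup table. This is an illustrative Python sketch only: the names `MASTER_REFERENCES` and `master_accession` are hypothetical, while the genera, virus names and accessions are copied from the list.

```python
# Genus -> (virus name, RefSeq accession), copied from the list above.
MASTER_REFERENCES = {
    "Orthohepadnavirus": ("Hepatitis B virus", "NC_003977"),
    "Avihepadnavirus": ("Duck hepatitis B virus", "NC_001344"),
    "Herpetohepadnavirus": ("Tibetan frog hepatitis B virus", "NC_030446"),
    "Metahepadnavirus": ("Bluegill hepatitis B virus", "NC_030445"),
    "Parahepadnavirus": ("White sucker hepadnavirus", "NC_027922"),
}


def master_accession(genus: str) -> str:
    """Return the master reference accession listed for a genus."""
    try:
        return MASTER_REFERENCES[genus][1]
    except KeyError:
        raise ValueError(f"no master reference listed for genus {genus!r}") from None


if __name__ == "__main__":
    for genus, (virus, accession) in MASTER_REFERENCES.items():
        print(f"{genus}: {virus} ({accession})")
```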
Reference sequences are linked to auxiliary data in tabular format.
Parameter | Type | Definition |
---|---|---|
full_name | VARCHAR | Full name of the virus this sequence is derived from |
name | VARCHAR | Abbreviated name of the virus this sequence is derived from |
genus | VARCHAR | Taxonomy - virus genus |
clade | VARCHAR | Taxonomy - virus clade |
isolate_name | VARCHAR | Name of the virus isolate this sequence is derived from |
host_sci_name | VARCHAR | Species (Latin binomial) virus was isolated from |
host_name | VARCHAR | Species (common name) virus was isolated from |
length | INTEGER | Length of the sequence |
pubmed_id | INTEGER | PubMed ID of manuscript associated with sequence |
gb_create_date | DATE | GenBank creation date of the sequence |
gb_update_date | VARCHAR | Date of most recent GenBank update |
country | VARCHAR | Country where virus was isolated |
collection_year | INTEGER | Year virus was isolated |
collection_month | VARCHAR | Month virus was isolated |
collection_month_day | VARCHAR | Day of month virus was isolated |
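The field list above can be mirrored as a typed record when working with exported tabular data. The sketch below is an assumption-laden illustration in Python: the class `ReferenceRecord` and the helper `record_from_row` are hypothetical, the example values are made up, and date fields are kept as plain strings rather than parsed.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Fields listed as INTEGER in the table above.
INTEGER_FIELDS = {"length", "pubmed_id", "collection_year"}


@dataclass
class ReferenceRecord:
    full_name: Optional[str] = None
    name: Optional[str] = None
    genus: Optional[str] = None
    clade: Optional[str] = None
    isolate_name: Optional[str] = None
    host_sci_name: Optional[str] = None
    host_name: Optional[str] = None
    length: Optional[int] = None
    pubmed_id: Optional[int] = None
    gb_create_date: Optional[str] = None  # kept as text in this sketch
    gb_update_date: Optional[str] = None  # kept as text in this sketch
    country: Optional[str] = None
    collection_year: Optional[int] = None
    collection_month: Optional[str] = None
    collection_month_day: Optional[str] = None


def record_from_row(row: dict) -> ReferenceRecord:
    """Build a record from a mapping of column name -> raw text value.

    Missing or empty values become None; INTEGER fields are converted.
    """
    values = {}
    for f in fields(ReferenceRecord):
        raw = row.get(f.name)
        if raw is None or raw == "":
            values[f.name] = None
        elif f.name in INTEGER_FIELDS:
            values[f.name] = int(raw)
        else:
            values[f.name] = raw
    return ReferenceRecord(**values)


if __name__ == "__main__":
    # Made-up row for illustration; not real project data.
    print(record_from_row({"name": "ExampleV", "length": "3000", "pubmed_id": ""}))
```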
Multiple sequence alignments
Multiple sequence alignments constructed in this study are linked together using GLUE's alignment tree data structure. Alignments in the project include:
- A ‘root’ alignment constructed to represent proposed homologies between representative members of major hepadnavirus lineages
- ‘Genus-level’ alignments constructed to represent proposed homologies between the genomes of representative members of specific hepadnavirus genera
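The relationship between the root alignment and the genus-level alignments can be pictured as a small tree. The Python sketch below is a conceptual illustration only and does not reproduce GLUE's alignment tree implementation: the class `AlignmentNode` and the alignment names (`root`, `genus-...`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass(eq=False)
class AlignmentNode:
    """One alignment in a parent/child hierarchy of alignments."""
    name: str
    children: List["AlignmentNode"] = field(default_factory=list)
    parent: Optional["AlignmentNode"] = field(default=None, repr=False)

    def add_child(self, name: str) -> "AlignmentNode":
        """Attach a new child alignment beneath this one and return it."""
        child = AlignmentNode(name, parent=self)
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, str]]:
        """Yield (depth, name) for this node and all of its descendants."""
        yield depth, self.name
        for child in self.children:
            yield from child.walk(depth + 1)


# A root alignment with one genus-level alignment per genus (names are made up).
root = AlignmentNode("root")
for genus in ("Orthohepadnavirus", "Avihepadnavirus", "Herpetohepadnavirus",
              "Metahepadnavirus", "Parahepadnavirus"):
    root.add_child(f"genus-{genus}")

if __name__ == "__main__":
    for depth, name in root.walk():
        print("  " * depth + name)
```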
Phylogenetic trees
We used GLUE to implement an automated process for deriving midpoint-rooted, annotated trees from the alignments included in our project.
Trees were constructed at distinct taxonomic levels:
- Recursively populated root phylogenies
- Genus-level phylogenies
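Midpoint rooting places the root halfway along the longest tip-to-tip path in a tree. The Python sketch below illustrates that general idea only; it is not the procedure implemented in GLUE. The tree representation, the function names `farthest` and `midpoint`, and the branch lengths in `example_tree` are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Unrooted tree as an adjacency map: node -> {neighbour: branch length}.
Tree = Dict[str, Dict[str, float]]


def farthest(tree: Tree, start: str) -> Tuple[str, float, List[str]]:
    """Return (node, distance, path) for the node farthest from `start`."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, prev, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nxt, length in tree[node].items():
            if nxt != prev:  # do not walk back along the edge we came from
                stack.append((nxt, node, dist + length, path + [nxt]))
    return best


def midpoint(tree: Tree) -> Tuple[str, str, float]:
    """Locate the midpoint of the longest tip-to-tip path.

    Returns (u, v, offset): the root lies on edge u-v, `offset` away from u.
    """
    any_node = next(iter(tree))
    tip_a, _, _ = farthest(tree, any_node)       # one end of the longest path
    _, diameter, path = farthest(tree, tip_a)    # other end, plus the path
    half = diameter / 2.0
    walked = 0.0
    for u, v in zip(path, path[1:]):
        length = tree[u][v]
        if walked + length >= half:
            return u, v, half - walked
        walked += length
    raise ValueError("tree has no branches")


# Three tips (A, B, C) joined at one internal node X; lengths are made up.
example_tree: Tree = {
    "A": {"X": 1.0},
    "B": {"X": 3.0},
    "C": {"X": 2.0},
    "X": {"A": 1.0, "B": 3.0, "C": 2.0},
}

if __name__ == "__main__":
    print(midpoint(example_tree))
```

In this example the longest path runs between tips B and C (total length 5.0), so the midpoint falls on the B-X branch, 2.5 from B.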
Project-specific schema extensions
Hepadnavirus-GLUE extends GLUE's core schema by adding a number of fields to the 'sequence' table and a project-specific custom table, 'isolate'. These schema extensions are defined here. The isolate table is linked to the main 'sequence' table via the sequence ID field. It contains information pertaining to viral isolates, e.g. the species sampled and the date and location of sampling.
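The link between the two tables can be illustrated with a small relational example. The Python sketch below uses an in-memory SQLite database purely for illustration; it does not reproduce GLUE's actual schema or database. The column layout is an assumption based on the fields described on this page, and the inserted row values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified versions of the 'sequence' and 'isolate' tables.
conn.executescript("""
    CREATE TABLE sequence (
        sequence_id TEXT PRIMARY KEY,
        name        TEXT,
        genus       TEXT
    );
    CREATE TABLE isolate (
        sequence_id     TEXT PRIMARY KEY REFERENCES sequence(sequence_id),
        host_sci_name   TEXT,
        country         TEXT,
        collection_year INTEGER
    );
""")

# Made-up example rows; not real project data.
conn.execute("INSERT INTO sequence VALUES (?, ?, ?)",
             ("SEQ_A", "ExampleV", "Orthohepadnavirus"))
conn.execute("INSERT INTO isolate VALUES (?, ?, ?, ?)",
             ("SEQ_A", "Example species", "Exampleland", 2000))

# Join isolate data to its sequence via the shared sequence ID.
rows = conn.execute("""
    SELECT s.sequence_id, s.genus, i.host_sci_name, i.collection_year
    FROM sequence AS s
    JOIN isolate AS i ON i.sequence_id = s.sequence_id
""").fetchall()

if __name__ == "__main__":
    print(rows)
```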