Endogenous hepadnavirus (eHBV) data

Whole genome sequencing has revealed the presence of DNA sequences derived from hepadnaviruses in vertebrate genomes. These ‘endogenous hepatitis B viruses’ (eHBVs) are thought to have originated via ‘germline incorporation’ events in which hepadnavirus DNA sequences were integrated into chromosomal DNA of germline cells and subsequently inherited as novel host alleles.

eHBV sequences are in some ways equivalent to hepadnavirus ‘fossils’ in that they provide a source of retrospective information about the distant ancestors of modern hepadnaviruses.


We have used GLUE to organise the 'genomic fossil record' of hepadnaviruses. This page provides a description of Hepadnavirus-GLUE's paleovirus component, and quick links to specific data items.

Please note: links to files on GitHub are mainly designed to indicate where these files are located within the repository. To investigate files (e.g. tree files) in the appropriate software context we recommend downloading the entire repository and browsing locally.

Timeline of hepadnavirus evolution. The inset panel shows the evolutionary relationships of hepadnavirus genera; the larger tree is a time-calibrated evolutionary tree of vertebrates. Geological eras are indicated by background shading. The scale bar shows time in millions of years before present. For details see Lytras et al (2020). Mya = million years ago.

Reflecting their value as genomic fossils, analysis of eHBVs has proven highly informative with respect to the long-term evolutionary history of hepadnaviruses. For example, in a recent study (Lytras et al, 2020), we uncovered evidence that the genus Avihepadnavirus comprises at least four distinct subgroups, each of which has circulated widely among avian species at some point during the Cenozoic Era. Each subgroup is characterised by a distinct type of Surface protein.

Relevance to molecular biological studies of hepadnaviruses


Endogenous viral sequences can inform our understanding of contemporary viruses in a wide variety of ways. Perhaps most importantly, endogenous viral elements (EVEs) allow calibration of the long-term evolutionary history of virus groups, which greatly influences how we understand their biology.

Importantly, once time calibrations have been established, a far richer range of comparative genomic studies can be performed. By examining variation in the light of a known evolutionary history, these studies can provide invaluable insights into the biological mechanisms through which viruses replicate and spread.

Hepadnavirus EVEs

Some of the species in which we have identified EVEs derived from hepadnaviruses, commonly referred to as endogenous hepatitis B virus (eHBV) elements.
From left to right: emperor penguin (Aptenodytes forsteri), barn owl (Tyto alba), cormorants (family Phalacrocoracidae), Anna's hummingbird (Calypte anna).


Relevance to viral metagenomics


The eHBV sequences in Hepadnavirus-GLUE can provide a useful resource for those interested in identifying and characterising hepadnaviruses in metagenomic datasets.

Firstly, the eHBV sequences collated here can be used to exclude any potential 'false positive' hits (i.e. sequences that seem to represent new hepadnaviruses but in fact derive from genomic DNA).

In addition, when new hepadnavirus species are identified, inclusion of EVEs in phylogenetic analyses can often provide useful information about their broader ecology and evolution, including (uniquely) their long-term evolution.


Relevance to genomics


eHBVs are not only useful genetic markers: several lines of evidence indicate that they may have, or have had, functional roles as host alleles. The prevalence of multicopy eHBV lineages in some species suggests that germline incorporation of hepadnavirus sequences might have influenced the evolution of host genomes in important ways.

Consistent with the idea that germline incorporation of hepadnavirus sequences might in some cases be favoured by selection at the level of the host, we have identified several examples of loci containing multiple fixed eHBV elements (Lytras et al, 2020), each derived from a distinct germline colonisation event. It remains unclear whether this reflects natural selection, because the insertions had a favourable influence on the host, or preferential integration of hepadnaviruses into particular loci (e.g. because they are accessible in embryonic cells).

Hepadnavirus EVEs

Additional species in which we have identified endogenous hepatitis B virus (eHBV) elements.
From left to right: tuatara (Sphenodon punctatus), ducks (family Anatidae), snakes (suborder Serpentes), red-legged seriema (Cariama cristata).


The EVE component of Hepadnaviridae-GLUE


Currently, the distribution and diversity of hepadnavirus-related sequences in animal genomes remain incompletely characterised. Progress in characterising these elements has been hampered by the difficulty of analysing their fragmentary and degenerated sequences. Hepadnavirus-GLUE aims to address these issues.

We have incorporated into this project a set of principles for organising the hepadnavirus 'fossil record', and a protocol through which it can be accessed and collaboratively developed.

This website provides background information about the EVE component of Hepadnavirus-GLUE, and direct links to specific data items.


Nomenclature for eHBVs


We have applied a systematic approach to naming eHBVs, following a convention developed for endogenous retroviruses. Each element was assigned a unique identifier (ID) constructed from a defined set of components.

eHBV Nomenclature

The first component is the classifier ‘eHBV’ (endogenous hepatitis B virus/endogenous hepadnavirus).

The second component is a composite of two distinct subcomponents separated by a period: (i) the name of the eHBV group; (ii) a numeric ID that uniquely identifies the insertion. The numeric ID is an integer that identifies a unique insertion locus that arose as a consequence of an initial germline infection. Thus, orthologous copies in different species are given the same number.

Where an EVE sequence is thought to have been duplicated within the germline following its initial incorporation (e.g. via segmental duplication or transposition), we have appended an additional 'duplicate ID' to the numeric ID, separated by a period. Please note that we have not yet resolved the orthologous relationships among sets of eHBV sequences belonging to multicopy eHBV lineages. We have therefore assigned unique duplicate IDs to each sequence within these lineages.

The third component of the ID defines the set of host species in which the ortholog occurs, or did occur prior to being deleted.
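
As a minimal sketch of how IDs of this form could be handled programmatically, the Python below splits an ID into the components described above. The hyphen used to separate the three components, the example ID, and all function and field names are assumptions made for illustration; the description above specifies only the period-separated subcomponents.

```python
from typing import NamedTuple, Optional


class EhbvId(NamedTuple):
    classifier: str            # expected to be 'eHBV'
    group: str                 # name of the eHBV group
    locus: int                 # numeric ID of the insertion locus
    duplicate: Optional[int]   # duplicate ID, if one was appended
    host: str                  # host species set


def parse_ehbv_id(identifier: str, sep: str = "-") -> EhbvId:
    """Split an eHBV ID into its components (the separator is an assumption)."""
    classifier, middle, host = identifier.split(sep, 2)
    if classifier != "eHBV":
        raise ValueError(f"unexpected classifier: {classifier!r}")
    parts = middle.split(".")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected second component: {middle!r}")
    group = parts[0]
    locus = int(parts[1])
    duplicate = int(parts[2]) if len(parts) == 3 else None
    return EhbvId(classifier, group, locus, duplicate, host)


# Hypothetical ID, not taken from the project data.
example = parse_ehbv_id("eHBV-Avi.3.2-HypotheticalHost")
print(example)
```

Under these assumptions, two sequences that share the same group and locus number would be treated as copies descended from the same germline incorporation event, with the duplicate ID distinguishing copies within a multicopy lineage.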


Raw eHBV sequences and data


These are the raw data generated by database-integrated genome screening (DIGS). The tabular file contains information about the genomic location of each EVE. EVEs were classified by comparison to a reference library of polypeptide sequences designed to represent the known diversity of hepadnaviruses, including extinct lineages represented only by EVEs.

These data were obtained via DIGS performed in vertebrate genome assemblies downloaded from NCBI genomes (2020-07-15).

Raw data about the EVEs in tabular format can be found here.

Nucleotide level data in FASTA format (individual files) can be found here.


eHBV reference sequences and data


We constructed consensus sequences for paleoviruses by aligning eHBV sequences derived from the same initial germline colonisation event - i.e. orthologs in distinct species, and paralogs that have arisen via intragenomic duplication.
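
To illustrate the general idea of deriving a consensus from aligned orthologs and paralogs, here is a minimal Python sketch using a simple per-column majority rule. This is an assumption for illustration and not necessarily the procedure used in the project: gaps are ignored when counting, ties are written as 'N', and the function name and toy sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned, gap="-", ambiguous="N"):
    """Return a majority-rule consensus of equal-length aligned sequences."""
    if not aligned:
        raise ValueError("no sequences supplied")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(base.upper() for base in column if base != gap)
        if not counts:
            consensus.append(gap)          # column contains only gaps
            continue
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            consensus.append(ambiguous)    # no single most frequent residue
        else:
            consensus.append(ranked[0][0])
    return "".join(consensus)


# Toy alignment of three hypothetical copies of one insertion.
copies = ["ATG-CA", "ATGACA", "TTGAC-"]
print(majority_consensus(copies))  # ATGACA
```

A majority rule like this tends to cancel out mutations that arose independently in individual copies after integration, which is why a consensus can approximate the progenitor sequence more closely than any single degraded copy.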

Reference sequence data in tabular format are here.


Nucleotide level data:

Taxonomic group     | Full-length eHBV | Core codons | Surface codons | Pol codons
Avihepadnavirus     | FASTA, MSA       | FASTA, MSA  | FASTA, MSA     | FASTA, MSA
Herpetohepadnavirus | FASTA, MSA       | FASTA, MSA  | FASTA, MSA     | FASTA, MSA
Metahepadnavirus    | FASTA, MSA       | FASTA, MSA  | FASTA, MSA     | FASTA, MSA


Protein level data:

Taxonomic group     | Core AA    | Surface AA | Pol AA
Avihepadnavirus     | FASTA, MSA | FASTA, MSA | FASTA, MSA
Herpetohepadnavirus | FASTA, MSA | FASTA, MSA | FASTA, MSA
Metahepadnavirus    | FASTA, MSA | FASTA, MSA | FASTA, MSA


Multiple sequence alignments


Multiple sequence alignments constructed in this study are linked together using GLUE's alignment tree data structure. Alignments in the project include:

  1. A single ‘root’ alignment constructed to represent proposed homologies between representative members of major hepadnavirus lineages (including extinct lineages represented only by eHBVs).
  2. ‘Genus-level’ alignments constructed to represent proposed homologies between the genomes of representative members of specific hepadnavirus genera and eHBV reference sequences.
  3. ‘Tip’ alignments in which all taxa are derived from a single eHBV lineage.
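
The three-level arrangement listed above can be pictured as a simple tree of alignments. The Python sketch below is a conceptual illustration only: it does not use GLUE's API, and the class, field, and alignment names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class AlignmentNode:
    name: str
    level: str                      # 'root', 'genus', or 'tip'
    children: List["AlignmentNode"] = field(default_factory=list)

    def add_child(self, child: "AlignmentNode") -> "AlignmentNode":
        """Attach a child alignment and return it for chaining."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "AlignmentNode"]]:
        """Yield (depth, node) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Hypothetical alignment names, for illustration only.
root = AlignmentNode("root-alignment", "root")
avi = root.add_child(AlignmentNode("genus-Avihepadnavirus", "genus"))
avi.add_child(AlignmentNode("tip-lineage-1", "tip"))
root.add_child(AlignmentNode("genus-Herpetohepadnavirus", "genus"))

for depth, node in root.walk():
    print("  " * depth + f"{node.name} ({node.level})")
```

Organising alignments hierarchically in this way means that closely related sequences can be aligned in detail at the tips, while only representative sequences need to be aligned across the deeper, more divergent levels.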


Phylogenetic trees


We used GLUE to implement an automated process for deriving midpoint rooted, annotated trees from the alignments included in our project.

Trees were constructed at distinct taxonomic levels:

  1. Recursively populated root phylogeny (Rep)
  2. Genus-level phylogenies
  3. eHBV lineage-level phylogenies


Paleovirus-specific schema extensions in Hepadnaviridae-GLUE


The paleovirus component of Hepadnavirus-GLUE extends GLUE's core schema to allow the capture of EVE-specific data. These schema extensions are defined in this file and comprise two additional tables: 'locus_data' and 'refcon_data'. Both tables are linked to the main 'sequence' table via the 'sequenceID' field.

The 'locus_data' table contains information pertaining to individual EVE sequences: e.g. species in which they occur, genome assembly version, genomic location (i.e. scaffold, location coordinates, and orientation).

The 'refcon_data' table contains information pertaining to our eHBV reference sequences, which we have constructed in an effort to reconstruct, as closely as possible, the sequences of the progenitor viruses that gave rise to EVEs.
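
To make the relationship between these tables concrete, the sketch below models it with Python's built-in sqlite3 module. This is an illustration only and not how GLUE defines schema extensions: apart from the table names and the 'sequenceID' link field, every column name and value is a hypothetical placeholder.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (
    sequenceID TEXT PRIMARY KEY
);
-- One row per individual EVE sequence (column names are assumptions).
CREATE TABLE locus_data (
    sequenceID     TEXT REFERENCES sequence(sequenceID),
    species        TEXT,
    assembly       TEXT,
    scaffold       TEXT,
    start_position INTEGER,
    end_position   INTEGER,
    orientation    TEXT
);
-- One row per eHBV reference sequence (column names are assumptions).
CREATE TABLE refcon_data (
    sequenceID TEXT REFERENCES sequence(sequenceID),
    eve_group  TEXT
);
""")

# Placeholder values, not real project data.
conn.execute("INSERT INTO sequence VALUES (?)", ("example-seq-1",))
conn.execute(
    "INSERT INTO locus_data VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("example-seq-1", "Example species", "example-assembly",
     "scaffold_1", 100, 900, "+"),
)

# Join locus information back to the main sequence table.
rows = conn.execute(
    "SELECT s.sequenceID, l.species, l.scaffold "
    "FROM sequence s JOIN locus_data l ON l.sequenceID = s.sequenceID"
).fetchall()
print(rows)
```

Keeping locus-level and reference-level data in separate tables, each keyed on 'sequenceID', lets individual EVE copies and reconstructed reference sequences carry different kinds of annotation while sharing the same core sequence record.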


Related Publications


Lytras S, Arriagada G, and RJ Gifford (2020)
Ancient evolution of hepadnaviral paleoviruses and their impact on host genomes.
Virus Evolution [view]

Singer JB, Thomson EC, McLauchlan J, Hughes J, and RJ Gifford (2018)
GLUE: A flexible software system for virus sequence data.
BMC Bioinformatics [view]

Zhu H, Dennis T, Hughes J, and RJ Gifford (2018)
Database-integrated genome screening (DIGS): exploring genomes heuristically using sequence similarity search tools and a relational database. [preprint]

Gifford RJ, Blomberg B, Coffin JM, Fan H, Heidmann T, Mayer J, Stoye J, Tristem M, and WE Johnson (2018)
Nomenclature for endogenous retrovirus (ERV) loci.
Retrovirology [view]

Katzourakis A and RJ Gifford (2010)
Endogenous viral elements in animal genomes.
PLoS Genetics [view]