Endogenous parvovirus (EPV) data in Parvovirus-GLUE

Whole genome sequencing has revealed that DNA sequences derived from parvoviruses are present within vertebrate genomes. These ‘endogenous parvoviral elements’ (EPVs) are thought to have originated via ‘germline incorporation’ events in which parvovirus DNA sequences were integrated into chromosomal DNA of germline cells and subsequently inherited as novel host alleles.

Parvovirus EVEs

Some of the species in which we identified novel parvoviruses and endogenous viral elements (EVEs) derived from parvoviruses. Top row, left to right: Masai giraffe (Giraffa camelopardalis tippelskirchii), Tasmanian devil (Sarcophilus harrisii), elephants (Elephantidae), chinchilla (Chinchilla lanigera). Bottom row, left to right: Northern fur seals (Callorhinus ursinus), pit vipers (Crotalinae), Leadbeater's possum (Gymnobelideus leadbeateri), Transcaucasian mole vole (Ellobius lutescens).

Analysis of EPVs has proven immensely informative with respect to the long-term evolutionary history of the Parvoviridae. EPV sequences are in some ways equivalent to parvovirus ‘fossils’ in that they provide a source of retrospective information about the distant ancestors of modern parvoviruses.

The distribution and diversity of parvovirus-related sequences in animal genomes remain incompletely characterised. Progress in characterising these elements has been hampered by the difficulty of analysing their fragmentary and degenerated sequences. Parvovirus-GLUE aims to address these issues. We have incorporated into this project a set of principles for organising the parvovirus 'fossil record', and a protocol through which it can be accessed and collaboratively developed.

Please note: links to files on GitHub are mainly designed to indicate where these files are located within the repository. To investigate files (e.g. tree files) in the appropriate software context we recommend downloading the entire repository and browsing locally.

Where do the EPV data come from?

EVE sequences were recovered from whole genome sequence (WGS) assemblies via database-integrated genome screening (DIGS) using the DIGS tool.

All data pertaining to this screen are included in this repository.

Standardised nomenclature for EPVs

We have applied a systematic approach to naming EPVs, following a convention developed for endogenous retroviruses (ERVs). Each element was assigned a unique identifier (ID) constructed from a defined set of components.

EPV Nomenclature

The first component is the classifier ‘EPV’ (endogenous parvovirus element).

The second component is a composite of two distinct subcomponents separated by a period: (i) the name of the EPV group; (ii) a numeric ID that uniquely identifies the insertion. The numeric ID is an integer that identifies a unique insertion locus that arose as a consequence of an initial germline infection. Thus, orthologous copies in different species are given the same number.

The third component of the ID defines the set of host species in which the ortholog occurs.
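The three components described above can be sketched in code. This is a minimal illustration, not the project's own implementation: the hyphen separator between components is an assumption borrowed from the ERV convention the scheme follows, and the example ID `EPV-Dependo.1-Afrotheria`, the class `EpvId`, and the function `parse_epv_id` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

CLASSIFIER = "EPV"   # first component: the classifier
SEP = "-"            # ASSUMPTION: components are joined by a hyphen


@dataclass(frozen=True)
class EpvId:
    group: str    # name of the EPV group (second component, part i)
    number: int   # numeric ID of the insertion locus (second component, part ii)
    host: str     # host species set in which the ortholog occurs (third component)

    def locus_key(self) -> str:
        """Group plus number: shared by orthologous copies in different species."""
        return f"{self.group}.{self.number}"

    def format(self) -> str:
        return f"{CLASSIFIER}{SEP}{self.locus_key()}{SEP}{self.host}"


def parse_epv_id(text: str) -> EpvId:
    """Split an ID into its three components; raise ValueError if malformed."""
    # maxsplit=2 keeps any further hyphens inside the host component.
    classifier, middle, host = text.split(SEP, 2)
    if classifier != CLASSIFIER:
        raise ValueError(f"expected classifier {CLASSIFIER!r}, got {classifier!r}")
    group, _, number = middle.rpartition(".")
    if not group or not number.isdigit():
        raise ValueError(f"expected '<group>.<integer>', got {middle!r}")
    return EpvId(group=group, number=int(number), host=host)


# Hypothetical example ID, used only to exercise the functions.
example = parse_epv_id("EPV-Dependo.1-Afrotheria")
```

Because orthologous copies share the same number, two IDs that differ only in the host component return the same `locus_key()`. Under the hyphen assumption, a group name that itself contains a hyphen would not parse with this sketch.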

EPV reference sequences and data

We reconstructed reference sequences for EPVs using alignments of EPV sequences derived from the same initial germline colonisation event - i.e. orthologous elements in distinct species, and paralogous elements that have arisen via intragenomic duplication of EPV sequences.

Raw data in tabular format can be found at the following links/directories:

  1. Amdoparvoviruses
  2. Erythroparvoviruses
  3. Dependoparvoviruses
  4. Protoparvoviruses
  5. Ichthamaparvoviruses

Nucleotide-level data in FASTA format (individual files) can be found at the following links/directories:

  1. Amdoparvoviruses
  2. Erythroparvoviruses
  3. Dependoparvoviruses
  4. Protoparvoviruses
  5. Ichthamaparvoviruses

Multiple sequence alignments - maps of homology between EPVs and viruses

Multiple sequence alignments constructed in this study are linked together using GLUE's ‘alignment tree’ data structure. Alignments in the project include:

  1. A single ‘root’ alignment constructed to represent proposed homologies between representative members of major parvovirus lineages (including extinct lineages represented only by EPVs).
  2. ‘Genus-level’ alignments constructed to represent proposed homologies between the genomes of representative members of specific parvovirus genera and EPV reference sequences.
  3. ‘Tip’ alignments in which all taxa are derived from a single EPV lineage.
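The nesting of these three alignment levels can be pictured as a simple tree. The sketch below is only a conceptual illustration in Python and does not use or reproduce GLUE's own commands or API; the class `AlignmentNode` and every alignment name in it (`AL_ROOT`, `AL_Dependo`, and so on) are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class AlignmentNode:
    name: str     # hypothetical alignment name
    level: str    # "root", "genus", or "tip"
    children: List["AlignmentNode"] = field(default_factory=list)

    def add_child(self, child: "AlignmentNode") -> "AlignmentNode":
        """Attach a child alignment and return it, so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "AlignmentNode"]]:
        """Yield (depth, node) pairs in depth-first, parent-before-child order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Build a toy tree: one root, two genus-level alignments, one tip alignment.
root = AlignmentNode("AL_ROOT", "root")
dependo = root.add_child(AlignmentNode("AL_Dependo", "genus"))
root.add_child(AlignmentNode("AL_Amdo", "genus"))
dependo.add_child(AlignmentNode("AL_EPV-Dependo.1", "tip"))

for depth, node in root.walk():
    print("  " * depth + f"{node.name} ({node.level})")
```

Walking the tree from the root visits each genus-level alignment before the tip alignments nested beneath it, which mirrors the root, genus, and tip ordering in the list above.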

Phylogenetic trees - reconstructed evolutionary relationships

We used GLUE to implement an automated process for deriving midpoint rooted, annotated trees from the EPV-containing alignments included in our project, to reconstruct the evolutionary relationships between EPVs and related viruses.

Trees were constructed at distinct taxonomic levels:

  1. Recursively populated root phylogeny (Rep)
  2. Genus-level phylogenies
  3. EPV lineage-level phylogenies

Raw EPV sequences and data

These are the raw data generated by database-integrated genome screening (DIGS). The tabular files contain information about the genomic location of each EVE. EVEs were classified by comparison to a polypeptide sequence reference library designed to represent the known diversity of parvoviruses - this includes extinct lineages represented only by endogenous viral elements (EVEs).

These data were obtained via DIGS performed in vertebrate genome assemblies downloaded from NCBI genomes (2020-07-15).

Raw data in tabular format can be found at the following links:

  1. Amdoparvoviruses
  2. Erythroparvoviruses
  3. Dependoparvoviruses
  4. Protoparvoviruses
  5. Ichthamaparvoviruses

Nucleotide-level data in FASTA format (individual files) can be found at the following links:

  1. Amdoparvoviruses
  2. Erythroparvoviruses
  3. Dependoparvoviruses
  4. Protoparvoviruses
  5. Ichthamaparvoviruses

Paleovirus-specific schema extensions

The paleovirus component of Parvovirus-GLUE extends GLUE's core schema to allow the capture of EVE-specific data. These schema extensions are defined in this file and comprise two additional tables: 'locus_data' and 'refcon_data'. Both tables are linked to the main 'sequence' table via the 'sequenceID' field.

The 'locus_data' table contains EVE locus information: e.g. species, assembly, scaffold, location coordinates.

The 'refcon_data' table contains summary information for individual EVE insertions. It refers to the reference sequences constructed to represent each insertion, which reflect our best efforts to reconstruct progenitor virus sequences as they might have looked when they initially integrated into the germline of ancestral species.
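The relationship between the three tables can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions rather than the project's actual schema definition: the table names and the `sequenceID` link come from the description above, but the column names (`start_pos`, `end_pos`, `summary`), the column types, and all row values are hypothetical placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (
    sequenceID TEXT PRIMARY KEY
);
-- EVE locus information; the coordinate column names are assumptions.
CREATE TABLE locus_data (
    sequenceID TEXT REFERENCES sequence(sequenceID),
    species    TEXT,
    assembly   TEXT,
    scaffold   TEXT,
    start_pos  INTEGER,
    end_pos    INTEGER
);
-- Summary information per EVE insertion; 'summary' is an assumed column.
CREATE TABLE refcon_data (
    sequenceID TEXT REFERENCES sequence(sequenceID),
    summary    TEXT
);
""")

# Placeholder rows: one locus sequence and one reference sequence.
conn.executemany("INSERT INTO sequence VALUES (?)",
                 [("locus-example-1",), ("refcon-example-1",)])
conn.execute("INSERT INTO locus_data VALUES (?, ?, ?, ?, ?, ?)",
             ("locus-example-1", "Example species", "example_assembly",
              "scaffold_1", 100, 950))
conn.execute("INSERT INTO refcon_data VALUES (?, ?)",
             ("refcon-example-1", "placeholder summary"))

# Follow the sequenceID link from the main table into each extension table.
locus_rows = conn.execute("""
    SELECT s.sequenceID, l.species, l.scaffold, l.start_pos, l.end_pos
    FROM sequence AS s
    JOIN locus_data AS l ON l.sequenceID = s.sequenceID
""").fetchall()

refcon_rows = conn.execute("""
    SELECT s.sequenceID, r.summary
    FROM sequence AS s
    JOIN refcon_data AS r ON r.sequenceID = s.sequenceID
""").fetchall()
```

Each extension table holds only the EVE-specific fields, and the shared `sequenceID` value ties a row back to its entry in the main 'sequence' table.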

Related Publications

Hildebrandt E, Pénzes JJ, Gifford RJ, Agbandje-McKenna M, and R Kotin (2020)
Evolution of dependoparvoviruses across geological timescales – implications for design of AAV-based gene therapy vectors. Virus Evolution [view]

Pénzes JJ, de Souza WM, Agbandje-McKenna M, and RJ Gifford (2019)
An ancient lineage of highly divergent parvoviruses infects both vertebrate and invertebrate hosts.
Viruses [view]

Callaway HM, Subramanian S, Urbina C, Barnard K, Dick R, Hafenstein SL, Gifford RJ, and CR Parrish (2019)
Examination and reconstruction of three ancient endogenous parvovirus capsid proteins in rodent genomes.
Journal of Virology [view]

Kobayashi Y, Shimazu T, Murata K, Itou T, Suzuki Y. (2019)
An endogenous adeno-associated virus element in elephants.
Virus Res. Mar;262:10-14 [view]

Valencia-Herrera I, Cena-Ahumada E, Faunes F, Ibarra-Karmy R, Gifford RJ*, and G Arriagada* (2019) *co-corresponding authors
Molecular properties and evolutionary origins of a parvovirus-derived myosin fusion gene in guinea pigs.
Journal of Virology [view]

Pénzes JJ, Marsile-Medun S, Agbandje-McKenna M, and RJ Gifford (2018)
Endogenous amdoparvovirus-related elements reveal insights into the biology and evolution of vertebrate parvoviruses.
Virus Evolution [view]

Singer JB, Thomson EC, McLauchlan J, Hughes J, and RJ Gifford (2018)
GLUE: A flexible software system for virus sequence data.
BMC Bioinformatics [view]

Zhu H, Dennis T, Hughes J, and RJ Gifford (2018)
Database-integrated genome screening (DIGS): exploring genomes heuristically using sequence similarity search tools and a relational database. [preprint]

Gifford RJ, Blomberg B, Coffin JM, Fan H, Heidmann T, Mayer J, Stoye J, Tristem M, and WE Johnson (2018)
Nomenclature for endogenous retrovirus (ERV) loci.
Retrovirology [view]

Arriagada G and RJ Gifford (2014)
Parvovirus-derived endogenous viral elements in two South American rodent genomes.
J. Virol. [view]

Katzourakis A and RJ Gifford (2010)
Endogenous viral elements in animal genomes.
PLoS Genetics [view]