Overview


This is CRESS-GLUE, an open access GLUE project designed to support comparative genomic and evolutionary analysis of circular Rep-encoding single-stranded DNA (CRESS DNA) viruses (phylum Cressdnaviricota).


CRESS DNA virus phylogeny


The figure above, taken from a report by Krupovic et al. (2020), shows an unrooted phylogenetic tree of the Cressdnaviricota, based on alignment of Rep proteins.


CRESS-GLUE is designed to facilitate comparative genomic analysis of CRESS DNA viruses. It contains a richly annotated sequence dataset for these viruses, comprising both viral sequences and endogenous viral elements (EVEs).

This resource can be used in a wide variety of ways.

What is a GLUE project?


GLUE is an open, integrated software toolkit that provides functionality for storage and interpretation of sequence data.

GLUE supports the development of “projects” containing the data items required for comparative genomic analysis (e.g. sequences, multiple sequence alignments, genome feature annotations, and other sequence-associated data).

Projects are loaded into the GLUE "engine", creating a relational database that represents the semantic relationships between data items. This provides a robust foundation for the implementation of systematic comparative analyses and the development of sequence-based resources.
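
To illustrate what storing data items in a relational database provides, here is a minimal Python sketch using the standard-library sqlite3 module. The table names, column names and identifiers (sequence, alignment, alignment_member, REF_1, SEQ_1, AL_EXAMPLE) are hypothetical and are not GLUE's actual schema; the sketch only shows how explicit links between data items let the database reject inconsistent records.

```python
import sqlite3

# Hypothetical three-table schema for illustration only; NOT the GLUE schema.
SCHEMA = """
CREATE TABLE sequence (
    sequence_id TEXT PRIMARY KEY,
    source      TEXT NOT NULL
);
CREATE TABLE alignment (
    alignment_name TEXT PRIMARY KEY,
    reference_id   TEXT NOT NULL REFERENCES sequence(sequence_id)
);
CREATE TABLE alignment_member (
    alignment_name TEXT NOT NULL REFERENCES alignment(alignment_name),
    sequence_id    TEXT NOT NULL REFERENCES sequence(sequence_id),
    PRIMARY KEY (alignment_name, sequence_id)
);
"""


def open_demo_db():
    """Create an in-memory database holding one reference, one sequence and one alignment."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    with conn:
        conn.execute("INSERT INTO sequence VALUES (?, ?)", ("REF_1", "reference"))
        conn.execute("INSERT INTO sequence VALUES (?, ?)", ("SEQ_1", "virus"))
        conn.execute("INSERT INTO alignment VALUES (?, ?)", ("AL_EXAMPLE", "REF_1"))
    return conn


def add_member(conn, alignment_name, sequence_id):
    """Link a sequence to an alignment; return False if the link is inconsistent."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO alignment_member VALUES (?, ?)",
                (alignment_name, sequence_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised when the alignment or the sequence does not exist, or on a duplicate link.
        return False


def members_of(conn, alignment_name):
    """Return the sequence IDs linked to an alignment, in sorted order."""
    rows = conn.execute(
        "SELECT sequence_id FROM alignment_member "
        "WHERE alignment_name = ? ORDER BY sequence_id",
        (alignment_name,),
    )
    return [row[0] for row in rows]


conn = open_demo_db()
print(add_member(conn, "AL_EXAMPLE", "SEQ_1"))        # known sequence: link accepted
print(add_member(conn, "AL_EXAMPLE", "SEQ_MISSING"))  # unknown sequence: link rejected
print(members_of(conn, "AL_EXAMPLE"))
```

In this sketch the second call is rejected because no such sequence exists, which is the kind of cross-check that explicit relationships between data items make possible.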

The core schema of this database can be extended to accommodate the idiosyncrasies of different projects, and GLUE provides a scripting layer (based on JavaScript) for developing custom analysis tools.

Hosting of GLUE projects in an online version control system (e.g. GitHub) provides a mechanism for their stable, collaborative development.

Examples of 'sequence-based resources' built for viruses using GLUE include CoV-GLUE, a web application for tracking SARS-CoV-2 genomic variation (Singer et al. 2020).


What does building the CRESS-GLUE project offer?


CRESS-GLUE offers a number of advantages for performing comparative sequence analysis of CRESS DNA viruses:

  1. Reproducibility. For many reasons, bioinformatics analyses are notoriously difficult to reproduce. The GLUE framework supports the implementation of fully reproducible comparative genomics through the introduction of data standards and the use of a relational database to capture the semantic links between data items.

  2. Reusable data objects and analysis logic. For many - if not most - comparative genomic analyses, data preparation is nine tenths of the battle. The GLUE framework has been designed to ensure that work spent preparing high-value data items such as multiple sequence alignments need only be performed once. Hosting of GLUE projects in an online version control system such as GitHub allows for collaborative management of important data items and community testing of hypotheses.

  3. Validation. Building GLUE projects entails mapping the semantic links between data items (e.g. sequences, tabular data, multiple sequence alignments). This process provides an opportunity for cross-validation, and thereby enforces a high level of data integrity.

  4. Standardisation of the genomic co-ordinate space. GLUE projects allow all sequences to utilise the coordinate space of a chosen reference sequence. Contingencies associated with insertions and deletions (indels) are handled in a systematic way.

  5. Predefined, fully annotated reference sequences. This project includes fully annotated reference sequences for major lineages within the phylum Cressdnaviricota.

  6. Alignment trees. GLUE allows alignments constructed at distinct taxonomic levels to be linked via an "alignment tree" data structure. In the alignment tree, each alignment is constrained to a standard reference sequence, so all multiple sequence alignments are linked to one another via a standardised coordinate system.
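
The ideas in points 4 and 6 can be sketched in Python as follows. This is an illustrative model under stated assumptions, not GLUE's implementation: the class ConstrainedAlignment, its methods, and all sequence and alignment names are hypothetical. Each member sequence is described by ungapped blocks that pair member coordinates with reference coordinates, and each child alignment's reference sequence is assumed to be a member of its parent alignment, so a position can be re-expressed step by step in the coordinate space of the root reference.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# One ungapped block: (ref_start, ref_end, member_start, member_end),
# all 1-based and inclusive. Hypothetical representation for illustration.
Segment = Tuple[int, int, int, int]


@dataclass
class ConstrainedAlignment:
    """An alignment whose coordinates are those of a single reference sequence."""
    name: str
    reference: str
    members: Dict[str, List[Segment]] = field(default_factory=dict)
    parent: Optional["ConstrainedAlignment"] = None

    def to_reference(self, member: str, pos: int) -> Optional[int]:
        """Map a member position to this alignment's reference coordinate.

        Returns None if the position is not aligned to the reference,
        for example when it lies inside an insertion.
        """
        for ref_start, _ref_end, mem_start, mem_end in self.members[member]:
            if mem_start <= pos <= mem_end:
                return ref_start + (pos - mem_start)
        return None

    def to_root(self, member: str, pos: int) -> Optional[int]:
        """Re-express a member position in the root reference's coordinates."""
        node: Optional[ConstrainedAlignment] = self
        seq: str = member
        current: Optional[int] = pos
        while node is not None:
            current = node.to_reference(seq, current)
            if current is None:
                return None
            # Assumption: this node's reference is a member of the parent alignment.
            seq = node.reference
            node = node.parent
        return current


# Root alignment: positions 11..110 of the family-level reference
# align to positions 1..100 of the root reference.
root = ConstrainedAlignment("AL_ROOT", reference="REF_ROOT")
root.members["REF_FAMILY"] = [(1, 100, 11, 110)]

# Family-level alignment: SEQ_A carries a 3-nt insertion at its positions 41..43,
# so those positions have no counterpart in the reference.
family = ConstrainedAlignment("AL_FAMILY", reference="REF_FAMILY", parent=root)
family.members["SEQ_A"] = [(11, 50, 1, 40), (51, 110, 44, 103)]

print(family.to_root("SEQ_A", 5))   # 5
print(family.to_root("SEQ_A", 42))  # None (inside the insertion)
print(family.to_root("SEQ_A", 44))  # 41
```

Deletions relative to the reference need no special case in this model: the reference positions they span are simply not covered by any block of the member sequence.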


Building this GLUE project


On computers with the GLUE software framework installed, the CRESS-GLUE project can be instantiated by navigating to the project folder, initiating GLUE, and issuing the following command in the GLUE shell:

  Mode path: /
  GLUE> run file buildCoreProject.glue

This builds the CRESS-GLUE core project by executing the commands in buildCoreProject.glue.

The core project comprises a dataset designed to represent the phylum in a minimal way (i.e. by including only one or a handful of annotated reference sequences for each major lineage).

We have also created extension projects for certain families in the Cressdnaviricota (e.g. Circoviridae). These extensions expand the minimal set of reference sequences in the core project to include representatives of taxonomic groups below family level.

Once the core project has been built, the family-level extension projects can be added by executing the commands in buildFamilyLevelProjects.glue, as follows:

  Mode path: /
  GLUE> run file buildFamilyLevelProjects.glue

The family-level extension projects can in turn be extended to incorporate EVE sequences derived from the corresponding virus family. To build these paleovirus extensions, execute the commands in buildFamilyLevelPaleoProjects.glue, as follows:

  Mode path: /
  GLUE> run file buildFamilyLevelPaleoProjects.glue

For certain families in the Cressdnaviricota we have also created paleovirus-focused extension projects.

For example, the Circoviridae paleovirus extension incorporates a set of endogenous viral elements (EVEs) derived from ancient circoviruses. These sequences were recovered from the genomes of metazoan species. Building the paleovirus extension allows automated alignment and phylogeny reconstruction for individual ECV lineages in the project, based on the classifications in this file. Individual ECV sequences have been classified into sets considered likely to have arisen from the same germline colonisation event. Loci have been named using a systematic approach.


Contributors


Robert J. Gifford (robert.gifford@glasgow.ac.uk)

Tristan Dennis

Soledad Marsile-Medun


Related Publications


Krupovic M, Varsani A, Kazlauskas D, Breitbart M, Delwart E, Rosario K, Yutin N, Wolf YI, Harrach B, Zerbini FM, Dolja VV, Kuhn JH, and EV Koonin. (2020)
Cressdnaviricota: a Virus Phylum Unifying Seven Families of Rep-Encoding Viruses with Single-Stranded, Circular DNA Genomes.
J. Virol. Jun 1;94(12):e00582-20. doi: 10.1128/JVI.00582-20.

Tisza MJ, Pastrana DV, Welch NL, Stewart B, Peretti A, Starrett GJ, Pang YS, Krishnamurthy SR, Pesavento PA, McDermott DH, Murphy PM, Whited JL, Miller B, Brenchley J, Rosshart SP, Rehermann B, Doorbar J, Ta'ala BA, Pletnikova O, Troncoso JC, Resnick SM, Bolduc B, Sullivan MB, Varsani A, Segall AM, Buck CB. (2020)
Discovery of several thousand highly diverse circular DNA viruses.
eLife. Feb 4;9:e51971. doi: 10.7554/eLife.51971.

Singer JB, Thomson EC, McLauchlan J, Hughes J, and RJ Gifford (2018)
GLUE: A flexible software system for virus sequence data.
BMC Bioinformatics

Singer JB, Gifford RJ, Cotten M, and D Robertson (2020)
CoV-GLUE: A Web Application for Tracking SARS-CoV-2 Genomic Variation.
Preprints


License


This project is licensed under the GNU Affero General Public License v. 3.0.