This is CRESS-GLUE, an open access GLUE project designed to support comparative genomic and evolutionary analysis of circular Rep-encoding single-stranded DNA (CRESS DNA) viruses (phylum Cressdnaviricota).
The figure above, taken from a report by Krupovic et al. (2020), shows an unrooted phylogenetic tree of the Cressdnaviricota, based on alignment of Rep proteins.
CRESS-GLUE has been designed to facilitate any form of comparative genomic analysis involving CRESS DNA viruses. It contains a richly annotated sequence dataset for these viruses, comprising both viral sequences and endogenous viral elements (EVEs).
This resource can be used in a variety of ways:
- if you're interested in any particular virus in the Cressdnaviricota, you can use CRESS-GLUE as an efficient means of investigating this virus in depth, as demonstrated in this tutorial.
- for those interested in paleovirology and endogenous viral elements (EVEs), CRESS-GLUE provides a source of systematically organised information about EVEs derived from CRESS DNA viruses.
- if you're interested in exploring CRESS DNA virus diversity in metagenomic datasets, the data included in this project may assist you in your analysis, as explained here.
What is a GLUE project?
GLUE is an open, integrated software toolkit that provides functionality for storage and interpretation of sequence data.
GLUE supports the development of “projects” containing the data items required for comparative genomic analysis (e.g. sequences, multiple sequence alignments, genome feature annotations, and other sequence-associated data).
Projects are loaded into the GLUE "engine", creating a relational database that represents the semantic relationships between data items. This provides a robust foundation for the implementation of systematic comparative analyses and the development of sequence-based resources.
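As a rough illustration of what it means to capture relationships between data items in a relational database, the sketch below uses Python's built-in sqlite3 module. The table names, column names and identifiers (sequence, alignment, alignment_member, REF_A, AL_EXAMPLE and so on) are assumptions made for this example only; they do not reproduce GLUE's actual schema or database engine.

```python
import sqlite3

# Hypothetical, simplified schema: all table and column names here are
# illustrative assumptions and do not reproduce GLUE's real database design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (
    seq_id TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    family TEXT
);
CREATE TABLE alignment (
    aln_name   TEXT PRIMARY KEY,
    ref_seq_id TEXT NOT NULL REFERENCES sequence(seq_id)
);
CREATE TABLE alignment_member (
    aln_name TEXT NOT NULL REFERENCES alignment(aln_name),
    seq_id   TEXT NOT NULL REFERENCES sequence(seq_id),
    PRIMARY KEY (aln_name, seq_id)
);
""")

# Made-up records: one reference, one viral sequence and one EVE.
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", [
    ("REF_A", "reference", "Circoviridae"),
    ("SEQ_1", "virus", "Circoviridae"),
    ("EVE_1", "eve", "Circoviridae"),
])
conn.execute("INSERT INTO alignment VALUES (?, ?)", ("AL_EXAMPLE", "REF_A"))
conn.executemany("INSERT INTO alignment_member VALUES (?, ?)", [
    ("AL_EXAMPLE", "SEQ_1"),
    ("AL_EXAMPLE", "EVE_1"),
])


def members_of(aln_name):
    """Return (seq_id, source) for every member of the named alignment."""
    return conn.execute(
        "SELECT s.seq_id, s.source FROM alignment_member m "
        "JOIN sequence s ON s.seq_id = m.seq_id "
        "WHERE m.aln_name = ? ORDER BY s.seq_id",
        (aln_name,),
    ).fetchall()


print(members_of("AL_EXAMPLE"))
```

Because the links between sequences and alignments are stored explicitly, a question such as "which sequences belong to this alignment, and where did they come from?" becomes a single query rather than a manual cross-check of separate files.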
Hosting of GLUE projects in an online version control system (e.g. GitHub) provides a mechanism for their stable, collaborative development.
Some examples of 'sequence-based resources' built for viruses using GLUE include:
- CoV-GLUE: A GLUE resource for tracking genetic variation in SARS-CoV-2. CoV-GLUE contains a database of amino acid replacements, insertions and deletions that have been observed in GISAID hCoV-19 sequences sampled from the pandemic.
- RABV-GLUE: Tailored toward epidemiological tracking of rabies virus (RABV). Includes a database of RABV sequences and metadata from NCBI, updated daily and arranged into major and minor clades, and an analysis tool providing genotyping, analysis and visualisation of submitted FASTA sequences.
- HCV-GLUE: Supports analysis of drug resistance and vaccine escape in hepatitis C virus (HCV). Includes a database of HCV sequences and metadata from NCBI, updated daily and arranged into clades (genotypes, subtypes). As well as pre-built multiple sequence alignments of NCBI sequences, it includes an analysis tool providing genotyping, drug resistance analysis and visualisation of submitted FASTA sequences.
What does building the CRESS-GLUE project offer?
CRESS-GLUE offers a number of advantages for performing comparative sequence analysis of CRESS DNA viruses:
- Reproducibility. For many reasons, bioinformatics analyses are notoriously difficult to reproduce. The GLUE framework supports the implementation of fully reproducible comparative genomics through the introduction of data standards and the use of a relational database to capture the semantic links between data items.
- Reusable data objects and analysis logic. For many - if not most - comparative genomic analyses, data preparation is nine tenths of the battle. The GLUE framework has been designed to ensure that work spent preparing high-value data items such as multiple sequence alignments need only be performed once. Hosting of GLUE projects in an online version control system such as GitHub allows for collaborative management of important data items and community testing of hypotheses.
- Validation. Building GLUE projects entails mapping the semantic links between data items (e.g. sequences, tabular data, multiple sequence alignments). This process provides an opportunity for cross-validation, and thereby enforces a high level of data integrity.
- Standardisation of the genomic co-ordinate space. GLUE projects allow all sequences to utilise the coordinate space of a chosen reference sequence. Contingencies associated with insertions and deletions (indels) are handled in a systematic way.
- Predefined, fully annotated reference sequences: This project includes fully annotated reference sequences for major lineages within the phylum Cressdnaviricota.
- Alignment trees: GLUE allows linking of alignments constructed at distinct taxonomic levels via an "alignment tree" data structure. In the alignment tree, each alignment is constrained to a standard reference sequence, so that all multiple sequence alignments are linked to one another via a standardised coordinate system.
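The last two points can be pictured with a minimal Python sketch. It is not GLUE's implementation: the function to_reference_coords, the class AlignmentNode and all alignment and reference names below are hypothetical, introduced only to show how a gapped pairwise alignment yields reference coordinates and how reference-constrained alignments might be nested.

```python
from dataclasses import dataclass, field


def to_reference_coords(ref_row, query_row):
    """Map each query base (1-based) to a reference coordinate (1-based).

    Both rows come from the same pairwise alignment and use '-' for gaps.
    Query bases opposite a reference gap (insertions) map to None.
    """
    if len(ref_row) != len(query_row):
        raise ValueError("aligned rows must have equal length")
    mapping = {}
    ref_pos = 0
    query_pos = 0
    for ref_char, query_char in zip(ref_row, query_row):
        if ref_char != "-":
            ref_pos += 1
        if query_char != "-":
            query_pos += 1
            mapping[query_pos] = ref_pos if ref_char != "-" else None
    return mapping


@dataclass
class AlignmentNode:
    """One alignment in a hypothetical alignment tree."""
    name: str
    reference: str  # the reference sequence this alignment is constrained to
    children: list = field(default_factory=list)

    def add_child(self, child):
        self.children.append(child)
        return child

    def find_path(self, name):
        """Return alignment names from this node down to `name`, or None."""
        if self.name == name:
            return [self.name]
        for child in self.children:
            sub_path = child.find_path(name)
            if sub_path is not None:
                return [self.name] + sub_path
        return None


# A deletion (column 2) and an insertion (column 4) relative to the reference.
coords = to_reference_coords("ATG-CA", "A-GTCA")
print(coords)

# Made-up three-level tree: phylum -> family -> genus.
root = AlignmentNode("AL_PHYLUM", "REF_PHYLUM")
family = root.add_child(AlignmentNode("AL_FAMILY", "REF_FAMILY"))
family.add_child(AlignmentNode("AL_GENUS", "REF_GENUS"))
print(root.find_path("AL_GENUS"))
```

In this sketch the deleted reference position is simply skipped and the inserted base has no reference coordinate, which is one simple way of handling indels systematically. Since every node carries its own reference, coordinates could in principle be translated from one level of the tree to the next by chaining such mappings.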
Building this GLUE project
On computers with the GLUE software framework installed, the CRESS-GLUE project can be instantiated by navigating to the project folder, initiating GLUE, and issuing the following command in the GLUE shell:
Mode path: / GLUE> run file buildCoreProject.glue
This will build the CRESS-GLUE core project by executing the commands in this file.
The core project comprises a dataset designed to represent the phylum in a minimal way (i.e. by including only one or a handful of annotated reference sequences for each major lineage).
We have also created extension projects for certain families in the Cressdnaviricota (e.g. Circoviridae). These extensions extend the minimal set of reference sequences included in the core project to include representatives of taxonomic groups below family-level.
Once the core project has been built, the family-level extension projects can be added by executing the commands in this file, as follows:
Mode path: / GLUE> run file buildFamilyLevelProjects.glue
The family-level extension projects can be extended to incorporate EVE sequences derived from the corresponding virus family. To build these paleovirus extensions, execute the commands in this file, as follows:
Mode path: / GLUE> run file buildFamilyLevelPaleoProjects.glue
For certain families in the Cressdnaviricota we have also created paleovirus-focused extension projects.
For example, the Circoviridae paleovirus extension incorporates a set of endogenous viral elements (EVEs) derived from ancient circoviruses. These sequences were recovered from the genomes of metazoan species. Building the paleovirus extension allows automated alignment and phylogeny reconstruction for individual ECV lineages in the project, based on the classifications in this file. Individual ECV sequences have been classified into sets considered likely to have arisen from the same germline colonisation event. Loci have been named using a systematic approach.
Robert J. Gifford (email@example.com)
Krupovic M, Varsani A, Kazlauskas D, Breitbart M, Delwart E, Rosario K, Yutin N, Wolf YI, Harrach B, Zerbini FM, Dolja VV, Kuhn JH, Koonin EV. (2020)
Cressdnaviricota: a Virus Phylum Unifying Seven Families of Rep-Encoding Viruses with Single-Stranded, Circular DNA Genomes.
J. Virol. Jun 1;94(12):e00582-20. doi: 10.1128/JVI.00582-20.
Tisza MJ, Pastrana DV, Welch NL, Stewart B, Peretti A, Starrett GJ, Pang YS, Krishnamurthy SR, Pesavento PA, McDermott DH, Murphy PM, Whited JL, Miller B, Brenchley J, Rosshart SP, Rehermann B, Doorbar J, Ta'ala BA, Pletnikova O, Troncoso JC, Resnick SM, Bolduc B, Sullivan MB, Varsani A, Segall AM, Buck CB. (2020)
Discovery of several thousand highly diverse circular DNA viruses.
eLife. Feb 4;9:e51971. doi: 10.7554/eLife.51971.
Singer JB, Thomson EC, McLauchlan J, Hughes J, Gifford RJ. (2018)
GLUE: A flexible software system for virus sequence data.
BMC Bioinformatics.
Singer JB, Gifford RJ, Cotten M, Robertson D. (2020)
CoV-GLUE: A Web Application for Tracking SARS-CoV-2 Genomic Variation.
This project is licensed under the GNU Affero General Public License v. 3.0.