GLUE

A flexible software system for virus genomics.

May 10th, 2018

Revolutionary advances in DNA sequencing technology have transformed the way that we research and monitor viruses.

This has been reflected in the development of a diverse range of virus genome data resources (VGDRs) - computational tools that facilitate the use of virus genome data in experimental research and public health. VGDRs are diverse - they may focus on any virus species or aspect of virus biology. Nevertheless, at their core, they typically operate on similar data types, using related computational processes. This means that in spite of the great diversity of VGDRs, there is room for the development of standardised approaches, and these can enable extensive re-use of code and vastly greater efficiency in VGDR development.

In my research group, we've been working for several years on ways to enable this. I'm delighted that we can now present GLUE, a unified bioinformatics software environment for VGDR development.

Our manuscript describing GLUE is available as a preprint (see References).

A quick overview of GLUE

GLUE aims to provide a standardized, general framework for the development and maintenance of VGDRs. We have designed GLUE to provide sequence analysis functionality that would be useful across multiple, distinct virus projects, while at the same time offering sufficient flexibility and extensibility to accommodate the idiosyncrasies of different virus families and distinct sequence analysis contexts.

Why do viruses need their own system?

First off - there is in fact no aspect of GLUE that is specific to viruses - in principle, the GLUE framework can be used to develop genomic resources for any species or species group.

NextStrain is a VGDR supporting digital pathogen surveillance - it provides interactive visualisations of virus outbreaks, driven by genomic data.

However, viruses need a system like GLUE more than other organisms do, due to (i) their immense physical abundance and genetic diversity (greater than in all other organisms combined), and (ii) their extremely high mutation rates.

These attributes are among the main reasons why virus genome data are so informative. For example, it is the rapid rate of genome evolution that makes it possible to track virus spread in close to real time. However, the immense diversity and rapid rates of evolutionary change found in viruses also make virus genome datasets challenging to deal with, particularly when creating multiple sequence alignments (MSA).

Central importance of MSA

Most genomic analyses of viruses require the creation of a multiple sequence alignment (MSA), which is essentially a hypothesis about the homologies between distinct virus genome sequences.
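To make the "hypothesis about homology" idea concrete, here is a minimal Python sketch. It is not GLUE code: the function name `column_to_coordinates` and the toy sequences are invented for illustration. It shows that each alignment column asserts that particular nucleotide positions in the individual (ungapped) sequences are homologous.

```python
def column_to_coordinates(alignment, column):
    """Map one alignment column (0-based) to 1-based ungapped coordinates.

    Returns a dict of sequence name -> coordinate, or None where the
    sequence has a gap in that column. Hypothetical helper, not part of GLUE.
    """
    coords = {}
    for name, row in alignment.items():
        if row[column] == "-":
            coords[name] = None
        else:
            # Count the non-gap characters up to and including this column.
            coords[name] = len(row[: column + 1].replace("-", ""))
    return coords


# Toy alignment with made-up sequences; '-' marks a gap.
alignment = {
    "seqA": "ATG-CCA",
    "seqB": "ATGACCA",
    "seqC": "A-G-CTA",
}

# The alignment claims these positions are homologous to one another.
print(column_to_coordinates(alignment, 4))
```

Changing where the gaps are placed changes which positions are claimed to be homologous, which is why alignment quality matters so much for everything built on top of it.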

The high level of genetic variation that occurs within viral genomes makes the creation of high-quality MSA difficult and time-consuming. In particular, alignment of distantly related sequences often requires a degree of human oversight.

Because alignments are critical to virus sequence data resources, GLUE places these high-cost, high-value resources at the centre of its strategy for organising sequence data. A key aim of the GLUE core schema is to capture as much nucleotide homology as possible amongst the sequences of interest, and to integrate it into a single data structure.
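One way to picture homology held in a single, queryable data structure is sketched below in Python. This is an illustrative assumption rather than a description of the GLUE schema: the `Segment` type and the functions `to_segments` and `query_position` are hypothetical names. The sketch stores an alignment row as ungapped blocks of coordinates relative to a reference sequence, then uses those blocks to look up which nucleotide in a sequence is homologous to a given reference position.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    """An ungapped block of homology; all coordinates are 1-based, inclusive."""
    ref_start: int
    ref_end: int
    query_start: int
    query_end: int


def to_segments(ref_row: str, query_row: str) -> List[Segment]:
    """Convert two gapped alignment rows into ungapped homology segments."""
    if len(ref_row) != len(query_row):
        raise ValueError("alignment rows must have equal length")
    segments: List[Segment] = []
    ref_pos = query_pos = 0
    current = None  # [ref_start, ref_end, query_start, query_end]
    for r, q in zip(ref_row, query_row):
        if r != "-":
            ref_pos += 1
        if q != "-":
            query_pos += 1
        if r != "-" and q != "-":
            if current is None:
                current = [ref_pos, ref_pos, query_pos, query_pos]
            else:
                current[1], current[3] = ref_pos, query_pos
        elif current is not None:
            # A gap in either row closes the current block.
            segments.append(Segment(*current))
            current = None
    if current is not None:
        segments.append(Segment(*current))
    return segments


def query_position(segments: List[Segment], ref_pos: int) -> Optional[int]:
    """Return the query coordinate homologous to ref_pos, or None if unaligned."""
    for seg in segments:
        if seg.ref_start <= ref_pos <= seg.ref_end:
            return seg.query_start + (ref_pos - seg.ref_start)
    return None


# Made-up rows: the query lacks the nucleotide at reference position 4.
segments = to_segments("ATGACCA", "ATG-CCA")
print(segments)
print(query_position(segments, 6))
```

Storing coordinates rather than gapped strings means a sequence can be added or re-aligned without rewriting every other row, which is one reason a representation of this kind suits data that are costly to curate.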

While there are already many software tools for creating and analysing MSA, there are, as far as we are aware, no other tools besides GLUE that have been specifically designed to facilitate the efficient management of MSA.

Mechanisms for efficient development of VGDR

VGDRs are highly ordered collections of virus genome sequences and associated data, packaged together with algorithms for manipulating and interrogating these data. One of the main aims of GLUE is to provide a mechanism for preserving the value inherent in these digital biomedical assets, and enabling their re-use.

Many different kinds of VGDR can potentially be developed for an individual virus. While these may be very different in nature from one another, they are typically constructed using the same or overlapping sets of data items and algorithms. Working within the GLUE framework facilitates re-use of these fundamental components.

For example, a GLUE project can be constructed that provides only the essential components for an individual virus or virus group, and this project can then be used as a ‘seed’ or template for the creation of more elaborate or specialised VGDRs focussing on the same virus(es). The project can then be hosted in a cloud-based public or private repository (e.g. GitHub), allowing controlled collaborative development of these resources.

A related benefit of working within this framework is that it can facilitate the creation of de-centralised networks of co-operative VGDR development and enhancement. GLUE projects can be downloaded, utilised and extended locally - thus individuals or institutions can easily develop and maintain their own private versions of GLUE projects, while retaining the possibility for these projects, or components therein, to be efficiently integrated into public resources at a later date.

What have we used GLUE for so far?

HCV-GLUE

An unprecedented quantity of genomic data is now being generated as a by-product of efforts to treat Hepatitis C virus (HCV) infection. Using GLUE, we have developed a bioinformatics resource, called HCV-GLUE, to effectively curate and exploit these data. The resource can be used either via a web site or via an offline version aimed at more experienced bioinformaticians.

HCV-GLUE currently holds ~90,000 published HCV sequences, and updates are collected on a regular basis using an automated approach. It also incorporates tools for analysis of HCV genome data, including modules for genotyping HCV sequences and for identifying sites that confer resistance to direct-acting antivirals (DAAs), known as resistance-associated substitutions (RAS).
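The general idea behind scanning for resistance-associated substitutions can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the HCV-GLUE implementation: the function `find_substitutions`, the toy protein sequences and the substitution catalogue are all invented, and the query is assumed to be already translated and aligned to the reference so that both strings share the same coordinates.

```python
import re

# Substitutions written as <reference residue><position><variant residue>, e.g. "Y5H".
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def find_substitutions(ref_aa, query_aa, catalogue):
    """Return the catalogue entries present in query_aa.

    ref_aa and query_aa are amino-acid strings of equal length in reference
    coordinates (an assumption of this sketch; gaps in the reference are
    not handled).
    """
    if len(ref_aa) != len(query_aa):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for name in catalogue:
        match = SUBSTITUTION.match(name)
        if match is None:
            raise ValueError(f"unrecognised substitution: {name}")
        wild_type, variant = match.group(1), match.group(3)
        position = int(match.group(2))
        if position < 1 or position > len(ref_aa):
            continue  # position lies outside the region covered here
        if ref_aa[position - 1] != wild_type:
            raise ValueError(f"{name} does not match the reference residue")
        if query_aa[position - 1] == variant:
            hits.append(name)
    return hits


# Toy data only: these are not real HCV sequences or real RAS positions.
reference = "MSTLYQK"
query = "MSTLHQK"
toy_catalogue = ["Y5H", "Q6K", "T3A"]

print(find_substitutions(reference, query, toy_catalogue))
```

A real resource would also need to handle codon-level alignment, mixed bases and genotype-specific references, none of which this sketch attempts.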

DIGS for EVEs

DIGS for EVEs is an investigation into endogenous viral element (EVE) diversity in eukaryotic genomes. We are using GLUE to coordinate this investigation, and to generate VGDRs along the way. We have created individual projects for virus families or orders for which we have detected relatively high numbers of EVEs.

We identified CVe in the mangrove killifish (*Kryptolebias marmoratus*) genome that encoded intact rep genes.

DIGS for EVEs is a work in progress. So far, we have mainly focussed on small DNA virus families. For example, we have created a GLUE project to facilitate studies of endogenous circoviral elements (CVe) - i.e. EVEs derived from circoviruses (family Circoviridae) - together with information about infectious circoviruses. This project can serve in the first instance as a source of reference information for circovirus-related investigations, particularly those concerned with CVe. As discussed above, it may also be used as a template for the creation of other circovirus-related GLUE projects.

In current DIGS for EVEs work, we are developing GLUE projects for parvoviruses and retroviruses.

BTV-GLUE

Aerial view of a flock of sheep in Italy.

Bluetongue virus (BTV) is a vector-borne virus within the Orbivirus genus of the Reoviridae family. It causes bluetongue disease in domestic ruminants and in various wild animal species; an outbreak starting in 2006 threatened livestock industries in northern Europe.

In a collaboration between the MRC-University of Glasgow Centre for Virus Research and The Pirbright Institute, we have used GLUE to collate several thousand BTV sequences from the literature and added complementary context data alongside each sequence. The resulting database is available online.

References

Singer JB, Thomson EC, McLauchlan J, Hughes J, and RJ Gifford (2018)
GLUE: A flexible software system for virus sequence data. Preprint.

Dennis TPW, de Souza WM, Marsile-Medun S, Singer JB, Wilson SJ, and RJ Gifford (2018)
The evolution, distribution and diversity of endogenous circoviral elements in vertebrate genomes.
Virus Research S0168-1702(17)30904-8

Niebel M, Singer JB, Nickbakhsh S, Gifford RJ and EC Thomson (2017)
Hepatitis C and the absence of genomic data in low-income countries: a barrier on the road to elimination?
Lancet Gastroenterol Hepatol. (10):700-701.