Comparative genomic analysis of flavivirids using GLUE


This is Flavivirid-GLUE, a GLUE project for the flavivirids (family Flaviviridae).

The Flaviviridae comprise enveloped, positive-strand RNA viruses, many of which pose serious risks to human health on a global scale. Arthropod-borne flaviviruses such as Zika virus (ZIKV), dengue virus (DENV), and yellow fever virus (YFV) cause large-scale outbreaks that result in millions of human infections every year, while the bloodborne hepatitis C virus (HCV) is a major cause of chronic liver disease.


Flaviviruses

Projected urbanisation in 2027 (from The Economist magazine). Urbanisation is often associated with the emergence and spread of mosquito-borne diseases by creating favourable conditions for the survival of mosquito vector species. Genome data can directly inform efforts to control diseases caused by mosquito-borne flaviviruses.


Since the start of the SARS-CoV-2 pandemic, many people have become familiar with the use of virus genome data to track the spread and evolution of pathogenic viruses, e.g. via tools such as Nextstrain. However, it is less widely appreciated that the same kinds of data sets and comparative genomic approaches can also be used to explore the structural and functional basis of virus adaptations.

The GLUE software framework provides an extensible platform for implementing computational genomic analysis of viruses in an efficient, standardised and reproducible way. GLUE projects can not only incorporate all of the data items typically used in comparative genomic analysis (e.g. sequences, alignments, genome feature annotations) but can also represent the complex semantic links between these data items via a relational database. This 'poises' sequences and associated data for application in computational analysis, minimising the requirement for labour-intensive pre-processing of datasets.

GLUE projects are as well suited to exploratory work (e.g. using virus genome data to investigate structural and functional properties of viruses) as they are to operational procedures (e.g. producing standardised reports in a public or animal health setting).

Hosting of GLUE projects in an online version control system (e.g. GitHub) provides a mechanism for their stable, collaborative development, as shown below.

GitHub illustration


What is a GLUE project?


GLUE is an open, integrated software toolkit that provides functionality for storage and interpretation of sequence data. It supports the development of “projects” containing the data items required for comparative genomic analysis (e.g. sequences, multiple sequence alignments, genome feature annotations, and other sequence-associated data).


GLUE framework figure


Projects are loaded into the GLUE "engine", creating a relational database that represents the semantic relationships between data items. This provides a robust foundation for the implementation of systematic comparative analyses and the development of sequence-based resources. The database schema can be extended to accommodate the idiosyncrasies of different projects. GLUE provides a scripting layer (based on JavaScript) for developing custom analysis tools.
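
To make the idea of "semantic relationships between data items" concrete, the following is a minimal sketch of how sequences and alignments might be linked in a relational database. It uses Python's built-in sqlite3 module purely for illustration. The table names, column names and sequence identifiers (sequence, alignment, alignment_member, seq_A and so on) are hypothetical and are not GLUE's actual schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database linking sequences to an alignment.

    The schema is a hypothetical illustration, not the GLUE schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sequence (
            sequence_id TEXT PRIMARY KEY,
            species     TEXT NOT NULL
        );
        CREATE TABLE alignment (
            alignment_name  TEXT PRIMARY KEY,
            ref_sequence_id TEXT NOT NULL REFERENCES sequence(sequence_id)
        );
        CREATE TABLE alignment_member (
            alignment_name TEXT NOT NULL REFERENCES alignment(alignment_name),
            sequence_id    TEXT NOT NULL REFERENCES sequence(sequence_id),
            PRIMARY KEY (alignment_name, sequence_id)
        );
        """
    )
    # Placeholder records; the identifiers are invented for this example.
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?)",
        [
            ("seq_A", "Zika virus"),
            ("seq_B", "dengue virus"),
            ("seq_C", "hepatitis C virus"),
        ],
    )
    conn.execute("INSERT INTO alignment VALUES (?, ?)", ("AL_demo", "seq_A"))
    conn.executemany(
        "INSERT INTO alignment_member VALUES (?, ?)",
        [("AL_demo", "seq_A"), ("AL_demo", "seq_B")],
    )
    conn.commit()
    return conn


def species_in_alignment(conn, alignment_name):
    """Return the species of every member of an alignment, ordered by sequence ID."""
    rows = conn.execute(
        """
        SELECT s.species
        FROM alignment_member AS m
        JOIN sequence AS s ON s.sequence_id = m.sequence_id
        WHERE m.alignment_name = ?
        ORDER BY s.sequence_id
        """,
        (alignment_name,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = build_demo_db()
    print(species_in_alignment(demo, "AL_demo"))
```

Because the links between records are stored explicitly, a question such as "which species are represented in this alignment?" becomes a single join rather than an ad hoc file-parsing step.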


GLUE resources: server deployment illustration


Some examples of 'sequence-based resources' built for viruses using GLUE include:



What does building the Flavivirid-GLUE project offer?


Flavivirid-GLUE contains aligned, annotated reference genome sequences for all flavivirid species and endogenous viral elements (EVEs) derived from flavivirids. It offers a number of advantages for performing comparative sequence analysis of flavivirids:

  1. Reproducibility. For many reasons, bioinformatics analyses are notoriously difficult to reproduce. The GLUE framework supports the implementation of fully reproducible comparative genomics through the introduction of data standards and the use of a relational database to capture the semantic links between data items.

  2. Reusable data objects and analysis logic. For many, if not most, comparative genomic analyses, data preparation is nine tenths of the battle. The GLUE framework has been designed so that the effort spent preparing high-value data items, such as multiple sequence alignments, need only be made once. Hosting of GLUE projects in an online version control system such as GitHub allows for collaborative management of important data items and community testing of hypotheses.

  3. Validation. Building GLUE projects entails mapping the semantic links between data items (e.g. sequences, tabular data, multiple sequence alignments). This process provides an opportunity for cross-validation, and thereby enforces a high level of data integrity.

  4. Standardisation of the genomic coordinate space. GLUE projects allow all sequences to use the coordinate space of a chosen reference sequence. Contingencies associated with insertions and deletions (indels) are handled in a systematic way.

  5. Predefined, fully annotated reference sequences. This project includes fully annotated reference sequences for major lineages within the family Flaviviridae.

  6. Alignment trees. GLUE allows linking of alignments constructed at distinct taxonomic levels via an "alignment tree" data structure. In the alignment tree, each alignment is constrained to a standard reference sequence, so that all multiple sequence alignments are linked to one another via a standardised coordinate system.
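
The coordinate standardisation and alignment tree ideas in points 4 and 6 can be illustrated with a small sketch. This is a toy model written for this README, not GLUE's implementation: the function names (map_to_reference, compose) and the example sequences are hypothetical, and it assumes each alignment is given as two gapped rows of equal length. The first function expresses a sequence's positions in the coordinate space of its reference. The second chains two such mappings, as would happen when a lower-level alignment's reference is itself aligned to a higher-level reference.

```python
def map_to_reference(ref_aligned, query_aligned, gap="-"):
    """Map query positions to reference positions (both 1-based).

    Only columns where both rows contain a residue are mapped. Insertions
    relative to the reference therefore receive no reference coordinate,
    while deletions simply advance the reference position.
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned rows must have equal length")
    mapping = {}
    ref_pos = 0
    query_pos = 0
    for ref_char, query_char in zip(ref_aligned, query_aligned):
        if ref_char != gap:
            ref_pos += 1
        if query_char != gap:
            query_pos += 1
        if ref_char != gap and query_char != gap:
            mapping[query_pos] = ref_pos
    return mapping


def compose(member_to_child_ref, child_ref_to_parent_ref):
    """Chain two coordinate maps: member -> child reference -> parent reference."""
    return {
        member_pos: child_ref_to_parent_ref[child_pos]
        for member_pos, child_pos in member_to_child_ref.items()
        if child_pos in child_ref_to_parent_ref
    }


if __name__ == "__main__":
    # A member sequence aligned to the reference of a lower-level alignment.
    member_to_child = map_to_reference("ATG-CC", "A-GTCC")
    # That lower-level reference aligned to a higher-level reference.
    child_to_parent = map_to_reference("AATGCC", "-ATGCC")
    print(member_to_child)
    print(compose(member_to_child, child_to_parent))
```

In this toy example, the member's third residue is an insertion relative to its reference, so it has no entry in either mapping. Every other position can be reported in the coordinate space of the higher-level reference.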


GLUE project


On computers with GLUE installed, the Flavivirid-GLUE project can be instantiated by navigating to the project folder, initiating GLUE, and issuing the following command in the GLUE shell:

  Mode path: /
  GLUE> run file buildCompleteProject.glue


Contributors


Robert J. Gifford (robert.gifford@glasgow.ac.uk)

Rhys Parry

Connor Bamford

William Marciel de Souza


Related Publications


Bamford CGG, de Souza WM, Parry R and RJ Gifford (2021)
Comparative analysis of genome-encoded viral sequences reveals the evolutionary history of the Flaviviridae.
[preprint]

Singer JB, Thomson EC, McLauchlan J, Hughes J, and RJ Gifford (2018)
GLUE: A flexible software system for virus sequence data.
BMC Bioinformatics

Zhu H, Dennis T, Hughes J, and RJ Gifford (2018)
Database-integrated genome screening (DIGS): exploring genomes heuristically using sequence similarity search tools and a relational database. [preprint]

License


This project is licensed under the GNU Affero General Public License v. 3.0.