Data Integration

Data integration in glycomics

This section presents the steps needed to integrate biological data within glycomics and with other "omics".

Glycan formats

Glycans are inherently more complex than nucleic acids and proteins, so defining a format that stores the molecular information correctly is not a trivial problem. The complexity of glycans resides in their branched structure and in the large collection of building blocks available. In contrast with nucleic acids and proteins, which are made of 4 and 20 building blocks respectively, glycans can be built from many different monosaccharides. Additionally, information about monosaccharide anomericity, residue modifications and substitutions, glycosidic linkages and possible structural ambiguities must be taken into account. Leaving aside the different nomenclatures available for naming each monosaccharide, a glycan structure must still be encoded in a file.

An O-glycan (Glyconnect ID 2641 and GlyTouCan accession number G01614ZM) encoded in seven different glycan formats. (A) LINUCS sequence format. (B) BCSDB sequence encoding. (C) CarbBank sequence format. (D) GlycoCT sequence format as used in GlycomeDB and UniCarbDB. (E) KCF format used in the KEGG database. (F) LinearCode® as used in the CFG database. (G) WURCS format used in GlyTouCan. Adapted from Toolboxes for a standardised and systematic study of glycans (Campbell et al., 2014).

The figure shows an O-glycan (Glyconnect ID 2641) encoded in several different formats (Campbell et al., 2014). Glycan encoding methods can be grouped into three sets according to the technique used to store the data. The first group reduces tree-like glycan structures to linear sequences using strict rules for sorting branches. This group contains the Web3 Unique Representation of Carbohydrate Structures (WURCS) (Matsubara et al., 2017), the Bacterial Carbohydrate Structure Database sequence format (Toukach, 2011), LinearCode, LINUCS and the CarbBank multilinear representation.
The second group instead uses connection tables. This is exemplified by the GlycoCT code (Herget et al., 2008) and the KCF code. The third group relies on XML encoding and contains CabosML and GLYDE, which are not shown.
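
The difference between the first two groups can be made concrete with a small sketch that encodes the same branched structure once as a bracketed linear string and once as a connection table. This is a toy illustration only: the `Residue` class, the branch-sorting rule (smaller branch first, ties broken by linkage) and both output layouts are assumptions made for this example and do not reproduce LINUCS, GlycoCT or any other real format.

```python
from dataclasses import dataclass, field


@dataclass
class Residue:
    """One monosaccharide; `linkage` describes the bond to its parent."""
    name: str
    linkage: str = ""  # e.g. "b1-3"; empty for the root residue
    children: list = field(default_factory=list)


def size(node):
    """Number of residues in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def sorted_children(node):
    # Hypothetical sorting rule: smaller subtree first, then by linkage.
    return sorted(node.children, key=lambda c: (size(c), c.linkage, c.name))


def to_linear(node):
    """Group 1: flatten the tree into one string, root written last."""
    branches = [to_linear(c) for c in sorted_children(node)]
    # Every branch except the last (main) one is wrapped in brackets.
    prefix = "".join(f"[{b}]" for b in branches[:-1])
    if branches:
        prefix += branches[-1]
    link = f"({node.linkage})" if node.linkage else ""
    return f"{prefix}{node.name}{link}"


def to_connection_table(root):
    """Group 2: list residues and linkages in two separate sections."""
    residues, links = [], []

    def visit(node, parent_id):
        node_id = len(residues) + 1
        residues.append(f"{node_id}:{node.name}")
        if parent_id is not None:
            links.append(f"{parent_id}<-{node_id}:{node.linkage}")
        for child in sorted_children(node):
            visit(child, node_id)

    visit(root, None)
    return residues, links


# A toy branched structure (not the glycan shown in the figure).
toy = Residue("GalNAc", children=[
    Residue("GlcNAc", "b1-6", [Residue("Gal", "b1-4")]),
    Residue("Gal", "b1-3"),
])

print(to_linear(toy))
print(to_connection_table(toy))
```

Because the children are sorted before they are written, the order in which branches were entered does not change either output, which is the property that makes a linear encoding usable as a unique key.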

The use of several formats has made data integration across different sources almost impossible. To tackle this problem, bioinformaticians have developed a collection of tools to parse and translate glycan structures across different encoding systems. Four main projects have focused on data integration and tool development:

RINGS (Akune et al., 2010), GLYCOSCIENCES.de (Lutteke et al., 2006), EuroCarbDB (von der Lieth et al., 2011) and GlycomeDB (Ranzinger et al., 2008), which has now been replaced by GlyTouCan (Tiemeyer et al., 2017). In recent years, databases such as MatrixDB (Launay et al., 2015), Glyco3D (Perez et al., 2015, 2016) and UniLectin (Bonnardel et al., 2018), together with the Glycomics@ExPASy initiative, have adopted GlycoCT as a standard encoding format.

Glycan graphical representations

Graphical representations of the O-glycan. (A) Chemical representation. (B) Oxford cartoon representation. (C) SNFG representation. The legend shows how the encoding of monosaccharides differs between Oxford and SNFG. Adapted from Toolboxes for a standardised and systematic study of glycans (Campbell et al., 2014).

The data formats presented in the previous section are essential for storing information and exchanging data across software applications, but they are not easy for humans to read. Consequently, glycobiologists have proposed different graphical representations in which symbols or chemical structures replace monosaccharides. The figure gives several depictions of the O-glycan presented above. Each of these representations has its own peculiarities. The chemical depiction is used by researchers who are interested in glycan synthesis or who use NMR to elucidate glycan structures. The Oxford nomenclature (Harvey et al., 2011) defines linkages using angles and encodes monosaccharide anomericity with dashed or solid lines.

The nomenclature used by the first edition of Essentials of Glycobiology, which encodes monosaccharides using shapes and colours, is the most user-friendly. With the publication of the third edition of Essentials of Glycobiology (Varki et al., 2017), the authors made an effort to define a standard nomenclature called the Symbol Nomenclature for Glycans (SNFG), which should replace previous graphical representations (Varki et al., 2015).

Glycan composition format

In many cases, experimental techniques can only distinguish glycan compositions, losing the information about the 2D structure. A glycan composition reports only how many monosaccharides of each type are present in the glycan, without saying anything about their position in the structure. An example is H2N1S2, which refers to a glycan with 2 Hex, 1 HexNAc and 2 NeuAc. Sometimes, in addition to the composition, it is possible to extract information about the number of antennae, the presence of a bisecting HexNAc and the number of galactose residues. In this case, researchers use a nomenclature called Oxford, which allows these additional data to be encoded together with the composition. This nomenclature should not be confused with the Oxford graphical representation of glycan structures described in the previous section. In Oxford nomenclature, A2G2S2 refers to a glycan with 2 antennae, 2 galactoses and 2 NeuAc, resulting in a composition of 5 Hex, 4 HexNAc and 2 NeuAc. However, Oxford nomenclature works only with N-glycans, and many dialects are currently in use.
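
The arithmetic behind these two notations can be sketched in a few lines of Python. The function names are hypothetical, and the sketch makes two assumptions: the compact composition only uses the letters H, N and S as in the example above, and the Oxford name has the simple form A<n>G<n>S<n>, which excludes the many dialects mentioned. The conversion relies on the N-glycan core of 3 Hex (mannose) and 2 HexNAc, to which each antenna adds one HexNAc and each galactose adds one Hex.

```python
import re

# Letter codes taken from the H2N1S2 example in the text.
CODES = {"H": "Hex", "N": "HexNAc", "S": "NeuAc"}


def parse_composition(text):
    """Parse a compact composition such as 'H2N1S2' into residue counts."""
    if not re.fullmatch(r"(?:[HNS]\d+)+", text):
        raise ValueError(f"unsupported composition: {text!r}")
    counts = {}
    for letter, number in re.findall(r"([HNS])(\d+)", text):
        name = CODES[letter]
        counts[name] = counts.get(name, 0) + int(number)
    return counts


def oxford_to_composition(name):
    """Convert a simple Oxford name 'A<n>G<n>S<n>' into a composition.

    Assumes a plain N-glycan core (3 Hex + 2 HexNAc) with no fucose,
    no bisecting HexNAc and no other dialect-specific extensions.
    """
    match = re.fullmatch(r"A(\d+)G(\d+)S(\d+)", name)
    if match is None:
        raise ValueError(f"unsupported Oxford name: {name!r}")
    antennae, galactoses, sialic_acids = map(int, match.groups())
    return {
        "Hex": 3 + galactoses,    # 3 core mannoses + galactoses
        "HexNAc": 2 + antennae,   # 2 core HexNAc + one per antenna
        "NeuAc": sialic_acids,
    }


print(parse_composition("H2N1S2"))      # {'Hex': 2, 'HexNAc': 1, 'NeuAc': 2}
print(oxford_to_composition("A2G2S2"))  # {'Hex': 5, 'HexNAc': 4, 'NeuAc': 2}
```

The second call reproduces the 5 Hex, 4 HexNAc and 2 NeuAc quoted above. The reverse conversion is not possible in general, since a composition alone does not say how the residues are distributed over the antennae.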


The community acceptance of unique identifiers for genes and proteins has facilitated data integration and data exchange, boosting research in genomics and proteomics. Several initiatives have tried to fill the corresponding gap in glycomics. In 1989, the Complex Carbohydrate Structure Database, also known as CarbBank (Doubet et al., 1989), was the first attempt to establish unique identifiers for glycan structures. However, in the late 1990s, the CarbBank project ceased when its funding ended. Consequently, different initiatives such as CFG and KEGG tried to continue CarbBank's mission, leading to a multiplicity of identifiers for each glycan structure.
From 2013, researchers in glycomics renewed the call for a glycan structure registry that would ease data sharing and increase data integration among different platforms. In response, in 2015, Aoki-Kinoshita et al. (2016) unveiled GlyTouCan, a central registry for glycan structures.
GlyTouCan allows users to deposit glycan structures in exchange for a unique identifier. Since many glycomics techniques cannot fully elucidate glycan structures, GlyTouCan also accepts submissions of incomplete structures. Recently, the glycomics community has strongly endorsed the work of Aoki-Kinoshita to secure the stability of the registry, which has become a crucial resource in the glycomics panorama. Every scientist in glycomics is encouraged to submit glycan structures to the registry and to use the unique identifiers in reports and manuscripts (Tiemeyer et al., 2017).

Reporting Guidelines

The innate complexity of glycan structures often requires orthogonal experimental techniques to be elucidated. In this scenario, each experiment solves only a part of the puzzle, and only the integration of multiple data sources leads to an accurate annotation. For this reason, experimental guidelines are necessary to publish datasets that can be easily interpreted, evaluated and reproduced by other glycoscientists. Since 2011, experts in the fields of glycobiology, glycoanalytics and glycoinformatics have been working together, under the patronage of the Beilstein Institute, to define the Minimum Information Required for A Glycomics Experiment (MIRAGE) (York et al., 2014). This initiative follows the better-known MIAME and MIAPE initiatives, which we have already discussed. At the time of writing, MIRAGE has already published guidelines for sample preparation, mass spectrometry analysis (Kolarich et al., 2013) and glycan microarray analysis (Liu et al., 2017). Additionally, a guideline for liquid chromatography analysis is in preparation.


The combination and integration of multiple experimental and knowledge-based sources are essential to define the role of glycans. However, data are spread across different databases which act as "disconnected islands". In this situation, ontologies provide a convenient way to interconnect resources within glycomics and with other "omics".
As a first attempt, Sahoo et al. generated GlycO, "a glycoproteomics domain ontology for modelling the structure and functions of glycans, enzymes and pathways" (Sahoo et al., 2006). GlycO has a strong focus on the biosynthesis of complex glycan structures and their relationships with proteins, enzymes and other biochemical entities.

More recently, Ranzinger et al. developed GlycoRDF (Ranzinger et al., 2015). In contrast to GlycO, GlycoRDF has been designed with the precise goal of integrating all the information available in glycomics resources, limiting the development of multiple RDF dialects. A detailed diagram of GlycoRDF is given below.

Diagram of the core classes of the GlycoRDF ontology. Grey and white boxes identify classes and subclasses respectively. Courtesy of GlycoRDF: an ontology to standardize glycomics data in RDF (Ranzinger et al., 2015).
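
How a shared vocabulary connects the "disconnected islands" can be illustrated with a minimal sketch in which two resources publish subject-predicate-object triples about the same glycan identifier. Everything here is an assumption made for illustration: the `ex:` predicates are invented and are not GlycoRDF terms, the protein names are placeholders, and plain Python tuples stand in for a real RDF store. Only the accession G01614ZM and the fact that it denotes an O-glycan come from the text above.

```python
# Triples are plain (subject, predicate, object) tuples.

# Hypothetical export from a structure registry.
structure_db = [
    ("glycan:G01614ZM", "ex:glycan_type", "O-glycan"),
    ("glycan:G01614ZM", "ex:has_sequence_format", "WURCS"),
]

# Hypothetical export from a glycoprotein resource (placeholder names).
glycoprotein_db = [
    ("protein:PROTEIN_X", "ex:carries_glycan", "glycan:G01614ZM"),
    ("protein:PROTEIN_Y", "ex:carries_glycan", "glycan:PLACEHOLDER"),
]


def subjects(graph, predicate, obj):
    """Return all subjects linked to `obj` through `predicate`."""
    return {s for s, p, o in graph if p == predicate and o == obj}


def proteins_with_glycan_type(graph, glycan_type):
    """Join the two resources on the shared glycan identifier."""
    glycans = subjects(graph, "ex:glycan_type", glycan_type)
    proteins = set()
    for glycan in glycans:
        proteins |= subjects(graph, "ex:carries_glycan", glycan)
    return sorted(proteins)


# Merging is a simple concatenation because both sources use the same
# identifiers and the same predicate vocabulary.
merged = structure_db + glycoprotein_db
print(proteins_with_glycan_type(merged, "O-glycan"))  # ['protein:PROTEIN_X']
```

The query succeeds only because both sources refer to the glycan by the same identifier and agree on the predicate names, which is the role an ontology plays when each resource would otherwise define its own dialect.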


Despite its importance, data visualisation is still a challenge in glycomics. In the last decade, some initiatives have pushed the development of visual tools to improve some aspects of glycan identification and quantification.
Glycoviewer (Joshi et al., 2010) is the first example of a data visualisation tool that allows glycoscientists to visualise, summarise and compare different glycomes. GlycomeAtlas (Konishi & Aoki-Kinoshita, 2012) provides an interactive interface for exploring data produced by the Consortium for Functional Glycomics (CFG). Finally, as stated on its website, GlycoDomainViewer (Joshi et al., 2018) is a visual "integrative tool for glycoproteomics that enables global analysis of the interplay between protein sequences, glycosites, types of glycosylation, and local protein fold / domain and other PTM context". GlycoDomainViewer integrates experimental data as well as knowledge-based data sources, presenting the most extensive collection of information for exploring the possible effect of glycosylation on a protein.
Despite the availability of these and other visual tools, the majority of glycoscientists still use general-purpose applications such as Excel to publish experimental results. As a result, results are hardcoded in figures or text, and data are stored in tables that populate the supplemental material section. Additionally, integrative tools like GlycoDomainViewer, although very useful, are usually developed around the needs of a specific research group, which limits their reach across the entire community.