Germline GASOLINE
When sequencing data are aligned to a reference genome, in principle, each SV subtype generates a typical pattern of mapped reads, named SV signature, which is then used to identify the underlying alteration. These signatures can be classified in two distinct categories: gapped alignment or split read alignments. Alignment algorithms use gap penalty20 to account for genomic differences (alterations) occurring from insertions or deletions in the sequences. For example, a deletion, that is a lack of a sequence, generates a gap in the alignment of the read relative to a reference, while an insertion creates a gap in the alignment of the reference relative to the read. When genomic differences are too large and exceed gap penalty, mapping algorithms generate split-alignment, in which consecutive segments of the query sequence are mapped to disjoint regions in the reference and can have discordant orientation.
While gapped alignment generates two signature categories (insertions and deletions), the signatures arising from split reads can recognize, in principle every type of alteration: (i) two consecutive segments mapping far apart with the same or opposite orientations define, respectively, deletion or inversion signatures; (ii) overlapping coordinates define a duplication signature; (iii) a read splitted in three segments, with the first and third segments closely mapped, define an insertion signature; (iv) finally, when two consecutive segments mapped on different chromosomes define a translocation signature (Supplementary Fig. S1).
The first critical step of SVs detection consists in finding and grouping, all the signatures generated by each genomic alteration. From a computational point of view this step consists in clustering the genomic coordinates of SV signatures to find groups of intervals with large reciprocal overlap.
Owing to the high error rate, the alignment of long read data can be very noisy and the genomic coordinates of SV signatures generated by the same event can be imprecise and have variances of tens of bp. In this situation, identifying and clustering SVs signatures generated by small events (50–500 bp) require small reciprocal overlap that takes into consideration noisy and imprecise alignments, while large SVs (tens or hundreds of kb), less affected by error rate, needs large reciprocal overlap to prevent the inclusion of signatures arising from other events.
Thus, the use of standard reciprocal overlap criteria can underestimate or completely miss signatures of small SVs (using large reciprocal overlap), or include wrong signatures in the identification of large SVs (using small reciprocal overlap may include signatures of other SVs).
To overcome the limits of classical reciprocal overlap we introduced a novel normalized reciprocal overlap (NRO) criterion that allows grouping both small and large SV signatures with high accuracy, thus reducing the effect of imprecise alignment.
NRO mitigates the effect of imprecise alignment by taking into account both overlapping and non-overlapping regions of two SV signatures with the following formula:
$$\begin{aligned} NRO_{ij}=2 \cdot \frac{Overlap}{Li+Lj}-\frac{NO_i+NO_j}{NO_{Norm}} \end{aligned}$$
where \(L_i\) and \(L_j\) are the the size of the two signatures, Overlap and NO are the size of the overlapping and non-overlapping regions between the two signatures respectively and \(NO_{Norm}\) is a normalization factor. The first term represents the classical reciprocal overlap between two signatures, while the second term allows mitigation of the effect of imprecise alignments for small variants and prevents the inclusion of erroneous signatures in large SVs.
To identify germline SVs, GASOLINE takes as input the aligned data of a sample (in BAM format) and extract the genomic coordinates of gap- and split-alignment signatures (Fig. 1a1 and Supplementary Fig. S1). SV signatures are then clustered with a sophisticated clustering procedure based on NRO measure, whose first step consists in calculating NRO for each pair of signatures (Fig. 1a2,a3).
Once \(NRO_{ij}\) is calculated for all the signatures pairs [i, j] (Fig. 1b1), we perform interval clustering by using a graph-based approach. We first create an undirected graph by exploiting the NRO matrix as adjacency matrix, in which nodes are SV signatures and edges between two nodes i and j exist if \(NRO_{ij}> Thr\) (where Thr is a predefined reciprocal overlap threshold, Fig. 1b2). An edge between two nodes expresses the confidence of two signatures being generated by the same SV event (Fig. 1b3).
The undirected graph is then used to extract maximal cliques (groups of fully connected nodes) by using the Eppstein–Löffler–Strash algorithm21. Maximal cliques represent groups of signatures that can be assumed to be generated from the same SV event (Fig. 1b3). SV signatures of each maximal clique are then used to estimate the genomic coordinates (start and end) of each SV event by calculating the median of all start and end coordinates (Fig. 1b4). For each cluster we then calculate different statistics including: cohesion score (the ratio between the numbers of links in the extended clique and the maximum numbers of link), mean mode and standard deviation of start and end coordinates. These statistics are then exploited to filter out low quality SV-signature clusters.
For deletions, duplications, inversions and insertions, the clustering procedure is applied separately to signatures extracted from each chromosome. Similarly, for translocations, clustering is applied to signatures extracted from each pair of chromosomes.
Finally, under the assumption of diploidy, each SV event is genotyped as reference, heterozygous, homozygous, by using the maximum-likelihood Bayesian classification algorithm as in22 (Fig. 1b5).
Somatic GASOLINE
The detection of somatic SVs in cancer genomes consists in the identification of SVs present in the cancer sample and absent in the patient-matched normal sample. GASOLINE identifies somatic SVs by first applying the germline detection module to the test data and then searching for overlapping SVs signatures in the matched normal sample (Fig. 1c).
The signature clusters identified in the test sample are then compared with the signatures extracted from the control sample, by calculating their NRO (Fig. 1c2). Signatures with a \(NRO_{SV}\) larger than a predefined threshold are considered to be generated from the SV event of the test sample (see Fig. 1c3).
Statistical significance of somatic SV is calculated by comparing the proportion between SV signatures and reference reads in the tumor and matched-normal samples with the Fisher’s exact test (contingency table of Fig. 1c4). SVs with a p-value smaller than a predefined significance threshold are considered somatic.
GASOLINE and germline SV detection
Many computational methods have been developed to detect SVs from different technologies and evaluation of their performance is a very challenging task, mainly due to the lack of gold-standard datasets including all subtypes of structural variation.
Thus, to assess the performance of GASOLINE in the detection and genotyping of germline SVs we first generated synthetic genomes with SVs of all subtypes and size by using the PBSIM2 software23. PBSIM2 was exploited to simulate datasets mimicking the characteristics of long reads generated by either ONT or PacBio platforms, with average size of 10 kb, a global error rate of 90% and a total sequencing coverage from 5\( \times \) to 30\(\times \) (coverage = 5\(\times \),10\(\times \), 15\(\times \), 20\(\times \), 25\(\times \) and 30\(\times \)).We simulated all SV subtypes (deletions, insertions, duplications, inversions and translocations) in homozygous and heterozygous state and with size ranging from 50 bp to 5 kb (50, 100, 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000 bp). Simulated sequencing datasets were then aligned to the reference genome with minimap224 and NGMLR18 aligners (see “Methods”).
To investigate the performance of our tool as a function of the NRO threshold, we applied it to the synthetic dataset using different parameter settings (NRO=[0.5, 0.6, 0.7, 0.8, 0.9]) and calculated precision and recall as in25 (see “Methods”). The results of these analyses (Supplementary Figs. S2–S4) demonstrated that NRO thresholds have little effect on the global performance of our tool, with NGMLR-generated data giving better results with low NRO thresholds (0.5–0.7), minimap2-generated data requiring instead higher values (0.8–0.9), particularly in the case of deletions.
Recently, the Genome in a Bottle (GIAB) Consortium, using a combination of short-, linked-, and long-read sequencing, as well as optical mapping, has characterized the genome of an individual of Ashkenazim ancestry (NA24385), thus generating gold-standard datasets for SVs26. Although fundamental to test new technologies and algorithms, this dataset only contains high confidence sequence-resolved insertion and deletion calls \(>50\) base pairs (bp), and it does not enable performance assessment for inversions, duplications and translocations.
For this reason, we simulated inversions, duplications and translocation by using a computational strategy based on SURVIVOR27 to modify the human reference genome and PBSIM2 to simulate ONT or PacBio reads (see “Methods”). Simulated reads were then aligned to the human reference genome with minimap224 and NGMLR18 aligners (see “Methods”).
The synthetic dataset was then exploited to compare the performance of GASOLINE (using NRO=0.8) with those of other three state-of-the-art tools: Sniffles218, CuteSV17 and SVIM16. The results reported in Fig. 2a–f and Supplementary Fig. S5 demonstrate that our method (and Sniffles2) obtained the best performance in the identification of simulated inversions, duplications and translocations, especially for low sequencing coverages (5–10\(\times \)), thus demonstrating that our new NRO-based computational strategy is capable to group SV signatures with an high level of accuracy.
Remarkably, sequencing coverage has little effects on the global performance of our tool for both alignment algorithms: as previously reported by28 a sequencing coverage higher than 15\( \times \) is sufficient for detecting all subtypes of SVs and its increase does not lead to significant improvements.
To evaluate the accuracy of GASOLINE in the detection of real germline insertions and deletions we applied it to the publicly available ONT and PacBio NA24385 datasets generated by the GIAB consortium (see “Methods”) and compared its performance with the three aforementioned tools. To assess the performance of the four tools in detecting SVs at different sequencing coverages, we downsampled ONT and PacBio datasets to simulate 5, 10, 15, 20, 25 and 30\(\times \) coverages and performed alignments with both minimap2 and NGMLR.
To compare the capability of the four tools to cluster SV signatures, we calculated precision and recall as a function of numbers of reads supporting each SV event, separately for large and small SVs (see “Methods”). Figure 2g–j and Supplementary Figs. S6–S17 show that for large insertions and deletions (\(>500\) bp) all the tools performed very similarly. For small SVs (\(<500\) bp), instead, GASOLINE showed the superiority of its NRO-based clustering procedure in grouping true SV signatures from long reads noisy alignments, obtaining the highest precision at the same level of recall for all sequencing coverages (5–30\(\times \)), sequencing data (PacBio and ONT) and aligners (NGMLR and minimap2). These analyses also showed that NGMLR alignment gave better results in the detection of large SVs, while minimap2 data better detected small SVs.
Finally, we calculated precision and recall for all SVs genotyped as heterozygous (0/1) or homozygous (1/1) by the four tools and we found that while for large insertions and deletions (\(>500\) bp) all tools obtained very similar results, for small SVs (\(<500\) bp) GASOLINE obtained the highest F1 measure for all sequencing coverages and technologies (Fig. 2k–n).
Somatic SV detection on simulated and real data
As for germline variants, the validation of computational methods for somatic SVs detection is challenged by the lack of high-quality gold standard datasets enabling benchmarking and comparison of bioinformatic analysis pipelines, especially for tools exploiting long-read datasets and capable to identify small and complex variants previously unseen by short-reads WGS experiments.
Recently, Valle-Inclan et al.19, generated a comprehensive set of true somatic SVs (comprising all SV types) of the melanoma COLO829 cell lines by using four different sequencing technologies (Illumina HiSeq, ONT, PacBio and 10\(\times \) Genomics) combined with extensive experimental validation (see “Methods”).
Despite the great utility of such a gold reference dataset, its application for benchmarking purposes on long read data is limited by the small number of insertions and by the size distribution of all the SVs that are mainly larger than 50 kb not allowing to evaluate the performance of algorithms in the detection of small SVs ([50, 500] bp).
For these reasons, to assess the accuracy of GASOLINE in the identification of small somatic insertions and deletions we simulated somatic SVs of different sizes by using the PacBio and ONT WGS NA24385 dataset generated by the GIAB consortium (see “Methods”). We selected 330 heterozygous SVs (169 insertions and 161 deletions from the high-confidence callsets generated by the GIAB consortium) with size distribution in the range [50 bp, 5 kb] (with 230 SVs smaller than 1 kb). Using custom scripts, we first removed all reads containing signatures of the 330 SVs, we randomly splitted the remaining reads in two sets and finally, we added all the reads containing SV signatures to one of the splitted sets. With this workflow, for both PacBio and ONT data we obtained a \(\sim 30\times \) sequencing experiment with reads containing the 330 selected SVs (test), and a \(\sim 30\times \) control experiment without the 330 SVs. To evaluate detection accuracy at different sequencing coverages we downsampled read datasets to obtain 5\( \times \), 10\(\times \), 15\( \times \), 20\(\times \) and 25\( \times \). Raw reads were aligned using either minimap2 or NGMLR. We then applied the somatic module of GASOLINE to the simulated datasets and we compared its performance to the other three tools (see “Methods”). Sniffles2, SVIM and CuteSV were used as in Valle-Inclan et al.19. They were applied separately on test and control samples and somatic SVs were called discarding events with supporting reads in the control sample. Supporting signatures in control samples were searched by using a reciprocal overlap larger than 0.5.
GASOLINE obtained the best F-measure for all sequencing coverages of both PacBio and ONT datasets aligned with minimap2 or NGMLR (Fig. 3a–d and Supplementary Figs. S18–S29), especially in the detection of small SVs (\([50{-}500]\)). In contrast, the other three tools identified a large number of false positive somatic SVs resulting in very low levels of precision. These analyses also showed that low sequencing coverages (\(\le 15x\)) yield very poor performance and that, similarly to germline SVs, the best F1-measure is obtained with at least 20\(\times \) coverage, for both PacBio and ONT data. Notably, ONT data outperformed PacBio in all our analyses, while the minimap2 approach showed higher precision and recall than obtained with NGMLR.
We next tested the capability of our tool in detecting all SVs subtypes, applying it to the ONT and PacBio data of the COLO829 cell lines and comparing its performance with those of the other three tools by using the Valle-Inclan et al.19 true-set as benchmark (see “Methods”). As with the simulated datasets, the three state of the art tools identified a large number of false positive somatic events that generated very low levels of precision. On the other hands, GASOLINE, filtering out SVs on the basis of somatic p-values, was capable to drastically increase precision (removing a large fraction of false positive calls) at the expense of a minimal decrease in recall (removal of true positive calls). For all SV subtypes, with the exception of insertions (unfortunately the gold standard true dataset only contains three somatic insertions), our method was capable to identify between 80% and 100% of the Valle-Inclan et al. true-set with precision in the order of 60–80%, outperforming the other three state of the art methods (Fig. 3e–o).
Notably, in ONT analyses with somatic p-value< 0.001, GASOLINE detected 49 SVs (25 deletions, 4 insertions, 9 Inversions, 5 tandem duplications and 6 translocations). Among these variants, 6 SVs (1 deletion, 1 insertions, 1 duplication and 3 inversions) were not present in the Valle-Inclan et al. gold reference set (Supplementary Table S1) and were not detected by the other three tools. Visual inspection of aligned reads demonstrated that all the six SVs have somatic signatures in ONT data (signatures in cancer and not in normal). Moreover, 5 out 6 of these SVs are also supported by Pacbio and/or Illumina data of the COLO829 cell line (see Supplementary Figs. S30–S35).These results demonstrate that GASOLINE can expand our potential to study somatic alterations in cancer and that the precision and recall reported in Fig. 3 are even underestimated.
GASOLINE tool
GASOLINE is a collection of Perl, R and Fortran codes for the detection of somatic SVs from long read sequencing data. It takes as input two BAM files from a pair of test and matched normal samples and gives as output a VCF file (version 4.2) with statistically significant somatic SVs.
GASOLINE analyzes aligned data in BAM format and extracts the genomic coordinates of discordant alignments of a read with respect to the reference genome (SV signatures). The parsing module of our tool searches for two types of signatures: gapped alignments (in the CIGAR strings) and split alignments (primary and supplementary alignment of a read).
At present the signature extraction module is written in Perl and takes around 3 h for parsing a ONT or PacBio WGS at 30\(\times \) of sequencing coverage. After the parsing step, GASOLINE can be run in ‘germline’ or ‘somatic’ mode. In ‘germline’ mode, SV signatures are grouped with the NRO-based clustering and then genotyped. In ‘somatic’ mode SV signatures from a test cancer sample are clustered and then compared with those of matching control sample to calculate somatic p-value with Fisher’s exact test.
Both ‘germline’ and ‘somatic’ modules can be ran in multicores: the ‘germline’ module takes around 1 h to genotype a 30\(\times \) bam file (with 20 threads), while the ‘somatic’ module takes around 2 h two compare the SV signatures of two 30\(\times \) bam files (with 20 threads).
GASOLINE can run on any UNIX system (desktops and workstations). The GASOLINE tool is freely available at https://sourceforge.net/projects/gasoline/.