JaBbA v1 outperforms previous CN algorithms
We enhanced our previous JaBbA (v0.1; ref. 4) model with several methodological innovations to increase robustness to read depth waviness, improve algorithm convergence and enforce junction balance for allele-specific as well as total CN (Extended Data Fig. 1a–d and Methods). We also rigorously defined ‘CN-unmappable’ regions in the genome as positions surrounded by >90% repetitive bases in their 1-kb vicinity. CN-unmappable regions accounted for 13% of the genome (across read lengths and genome builds), primarily comprised regions in or around telomeres and centromeres, and showed high variance in read depth across a panel of diploid normal samples (Methods and Extended Data Fig. 2). We then limited analysis with the updated model (JaBbA v1) to the 87% of the human genome that was CN-mappable.
To assess the accuracy of JaBbA v1 for SV breakend detection in CN-mappable regions, we simulated 500 SRS whole-genome profiles comprising binned (1-kb) read depth, single nucleotide polymorphism (SNP) read counts and SV junctions (Extended Data Fig. 3a–d and Methods). In these simulations, JaBbA v1 loose ends showed substantially higher precision (median of 43% versus 5%) and recall (median of 70% versus 54%) than JaBbA v0.1 loose ends for missing CN-mappable SVs in high-purity (>0.5) cancer genomes (Extended Data Fig. 3e). JaBbA v1 also showed markedly improved accuracy for overall CN-mappable SV breakend inference relative to both JaBbA v0.1 and four state-of-the-art cancer CN inference algorithms (ASCAT14, TITAN15, Sequenza16 and FACETS17) (Extended Data Fig. 3f), particularly for high-purity samples (median precision of 82% (68–91%) and median recall of 96% (93–100%), with the interquartile range (IQR) in parentheses) (Fig. 1b). JaBbA v1 also accurately estimated both total and allelic CN (Extended Data Fig. 3g), suggesting that JaBbA v1 is a state-of-the-art algorithm for the inference of CN and missing SVs in cancer genomes.
Pan-cancer landscape of loose ends
We next applied JaBbA v1 to 1,330 high-purity tumor and matched normal SRS profiles previously analyzed in Hadi et al.4 (see Methods for details), identifying 154,322 (clonal and somatic) junctions (median of 63 per tumor sample) and 48,835 somatic loose ends (median of 21 per tumor sample). The somatic loose end burden per sample varied across a 200-fold range and was correlated (Spearman R2 = 0.68) with the junction burden (Fig. 1c,d).
Junction breakends may be reciprocal, meaning that they are near (within 10 kb) of another breakend with opposite orientation. Reciprocal breakends are usually copy-neutral (Fig. 1e, top left) which makes them difficult to detect through classic CN analyses. JaBbA’s bookkeeping of mass balance across segments and junctions enables sensitive detection of reciprocal and nonreciprocal SVs at both copy-neutral and copy-altered genomic regions (Extended Data Fig. 4a–e). Across cancer, we found that most (85%) cancer junctions were both nonreciprocal and copy-altered (Fig. 1e, bottom left). Such junctions can arise from inherently nonreciprocal SVs, such as simple deletions, or begin as reciprocal translocations that undergo subsequent loss or gain of one of the derivative alleles (Extended Data Fig. 4f). Like somatic junction breakends, somatic loose ends were predominantly (92%) copy-altered (Fig. 1e, bottom right), although copy-neutral loose ends were also identified (Fig. 1e, top right). Taken together, these results suggest that loose ends arise by breakage and repair mutational processes similar to those generating junction breakends.
Loose ends harbor repetitive and foreign sequences
To study the sequence context around loose ends, we defined a canonical axis originating at the loose end with coordinates increasing along the DNA strand whose 3′ terminus matches the side of a segment on which a loose end is found, which we refer to as the loose end’s ‘forward’ strand (Fig. 2a). We next asked whether loose ends occurred preferentially at reference sequence repeats. Indeed, we found that unmappable bases were enriched near loose ends, most frequently LINE elements (Fig. 2b and Extended Data Fig. 5a). We next reasoned that some loose ends would result from the somatic fusion of mappable bases to unalignable sequences. Confirming this, we found a tumor-specific enrichment of repetitive and foreign sequences, including satellite and viral sequences, mated to reads on the forward (but not reverse) strand of somatic loose ends (Fig. 2c and Extended Data Fig. 5b).
To identify distinct classes of repetitive SVs missing from SRS whole-genome profiles, we systematically classified tumor-specific sequences fused to each somatic loose end through assembly or consensus alignment (Fig. 2d and Methods). Overall, 55% of somatic loose ends showed evidence of tumor-specific fusion to a distal sequence. For over half of these (33% of somatic loose ends), the distal sequence aligned uniquely, indicating that these were fully mapped breakends missed by the initial junction caller (Fig. 2e) (SvAbA18). In 23% of somatic loose ends (3% of detected breakends), the distal sequence was repetitive or foreign and could not be unambiguously placed on any reference (ambiguously mapped breakends; Fig. 2e). Finally, 45% of somatic loose ends (6% of detected breakends) did not map to any distal location (partially mapped breakends; Fig. 2e). Notably, partially mapped breakends were enriched in boundaries of large (>1-Mb) CN-unmappable regions (odds ratio (OR) = 3.8; P < 2 × 10−16) (Extended Data Fig. 5c), indicating that some represented CN changes shifted away from a CN-unmappable SV breakend (for example, centromeric breakends causing arm-level chromosomal changes).
Combining fully mapped breakends across both loose ends and junctions indicated that 91% of JaBbA v1 breakends could be uniquely mapped. Notably, the fraction of partially or ambiguously mapped breakends did not vary substantially across cancer types (Extended Data Fig. 5d; range of 5–33%) or established cancer drivers (Extended Data Fig. 5e; range of 0–38%), although we observed tumor types (for example, acute myeloid leukemia) and cancer genes (SMARCB1, TSC2 and FGFR3) with higher (>25%) fractional burdens. Given the estimated recall of JaBbA v1 (~96%), these results suggest that 87% of cancer SVs in the 87% of the genome that is CN-mappable can be fully resolved by SRS.
Long-molecule validation
To orthogonally assess these SRS-derived estimates of missing somatic SVs, we profiled the whole genomes of 11 melanoma (n = 10) and breast cancer (n = 1) tumor samples and their matched normal tissues with both SRS and Oxford Nanopore Technologies long-read sequencing (LRS; median read N50 of 11 kb; median coverage of 73× and 32× for tumor and normal samples, respectively). After calling large (>10-kb) somatic SVs in CN-mappable regions (Methods), we found a strong overlap (87%, 7,258 breakends) between LRS and SRS breakends, including 77% overlap with fully mapped SRS breakends (Fig. 2f). The majority of junction calls identified by either platform had local read depth changes that were consistent with breakend topology; reciprocal breakends were copy-neutral, whereas nonreciprocal breakends showed a CN drop along their forward strand (Extended Data Fig. 6a). This analysis along with manual inspection of long and short read support at inidivudal junctions (Extended Data Fig. 6b) suggested that both SRS-only and LRS-only junctions comprise largely true positives; combining SRS and LRS breakend counts suggests that SRS missed ~12% of breakends. This result is consistent with our simulation-based estimate of recall (Fig. 1b and Extended Data Fig. 3f). Notably, we found a similar proportion of reciprocal and non-reciprocal breakends among those detected and missed by SRS (Fig. 2f), indicating that reciprocal and copy-neutral breakends do not comprise the bulk of missed structural variation in cancer genomes. These results confirm our SRS findings that most cancer SVs are nonreciprocal and copy-altered (Fig. 1e).
We next asked whether LRS improved SV event detection, which relies on the recognition of high-order patterns across multiple junctions3,4. Although LRS did not help identify many additional simple or complex events relative to SRS (Fig. 2g), LRS junctions also resolved breakends at complex SVs found by SRS, including for chromothripsis, pyrgo, rigma and templated insertion chains3,4. The incorporation of LRS junctions enabled more complete haplotype reconstruction at loci where SRS found loose ends (Fig. 2h).
As additional validation of our results, we analyzed 27 high-purity (purity of >0.5) breast cancer and matched normal samples with both SRS and synthetic LRS (sLRS) whole-genome profiles (10x Genomics linked reads, median N50 molecule length of 23 kb, median coverage of 173× and 98× in tumor and normal samples, respectively; Methods)19. Similar to LRS, most sLRS SV calls (Methods) overlapped with SRS breakends, showed concordant patterns of reciprocality and CN change, and yielded similar complex SV calls in sLRS junction-augmented genome graphs (Extended Data Fig. 6c–e). These breast cancer and melanoma LRS and sLRS results are consistent with our pan-cancer finding that SRS captures most large cancer SVs in CN-mappable regions.
Loose ends reveal neotelomeres
We next sought to investigate specific mutational processes engendering loose ends. We observed that a fraction (4.8%) of ambiguously mapped loose ends (0.01% of all breakends) were fused to telomere repeats, as evidenced by telomere repeat-positive sequences mated to reads on the positive loose end strand (Fig. 3a). We refer to these breakends as telomere repeat-positive loose ends and surmised that they might represent neotelomeres, telomere-stabilized chromosome ends at previously interstitial genomic loci.
Telomere repeat-positive mates were found on the forward strand of telomere repeat-positive loose ends, but not on the reverse strand or in matched normal samples (Fig. 3a), indicating that these were neither telomere insertions20,21 nor constitutional neotelomeres22,23. Deeper analysis of telomere repeats at loose ends revealed strong strand bias, with loose ends harboring either G-rich (GRTR) or C-rich (CRTR) repeats but not both (Fig. 3b). The GRTR pattern is consistent with a neotelomere, whereas the CRTR pattern is consistent with the fusion of an interstitial sequence to a native chromosome end (Fig. 3c, right). The predominance of the GRTR pattern among telomere repeat-positive loose ends, in combination with the tumor specificity and forward strand bias, suggested that somatic neotelomeres are frequent in cancer.
To better assess sequences fused to GRTR+ loose ends, we profiled three cancer cell lines (U2OS, NCI-H526 and NCI-H838) with sLRS (Methods). We found telomere repeat-positive linked reads within 5 kb of 26 of 31 GRTR+ loose ends (83.8%) (Methods). Telomere repeat-positive linked reads were found up to 50 kb upstream of each GRTR+ loose end, indicating power to map distal fusion partners at these loci (Fig. 3d). In contrast to sLRS junctions and telomere repeat-negative loose ends, linked reads at GRTR+ loose ends rarely (<1.5%) mapped to distant chromosomal locations, consistent with new chromosome ends (Fig. 3e). Quantitative analysis of repeat counts at linked reads mapping to these loci (Methods) revealed 2.4 ± 1.3 (s.d.) kb of telomere repeats per GRTR+ locus, in line with previous estimates of native cancer telomere lengths20 (Fig. 3f).
To confirm that GRTR+ loose ends were indeed chromosome ends, we performed Southern blot analysis on restriction-digested U2OS and control (Saos-2) genomic DNA using radiolabeled probes against two U2OS GRTR+ loose ends. At each locus (Fig. 3g and Extended Data Fig. 7a), we found a small (<5-kb) band consistent with an unaltered reference allele and a longer U2OS-specific diffuse band consistent with a neotelomere (Fig. 3h and Extended Data Fig. 7b). To further investigate the nature of these nonreference bands, we subjected intact genomic DNA to exonuclease (Bal-31) digestion24. The U2OS-specific (but not wild-type) bands disappeared with prolonged exonuclease exposure (Fig. 3i and Extended Data Fig. 7c), consistent with their origin at a chromosome end. These results establish these two U2OS GRTR+ loose ends as bona fide neotelomeres.
We next hypothesized that telomerase-mediated healing of double-stranded DNA breaks might give rise to neotelomeres (Fig. 3c, left)25. However, neotelomeres were not found more frequently in tumors that amplified TERT or expressed it at high levels (CN > 2 ploidy, expression z score > 2). Instead, neotelomeres were enriched in samples with low or negligible TERT expression (reads per kilobase per million mapped reads (RPKM) = 0) (Fig. 3j). Tumors that lack telomerase may activate the alternative lengthening of telomeres (ALT) pathway, a break-induced replication (BIR) process (Fig. 3c, middle) suppressed by ATRX26. Indeed, we found that neotelomeres were significantly more common in tumors harboring truncating mutations in ATRX than in ATRX-wild-type cancers (Fig. 3j). Furthermore, we found that several ALT-associated cancers, including sarcomas (18%; OR = 6.47; P = 1.95 × 10−5) and low-grade gliomas (12.3%; OR = 3.92; P = 4.1 × 10−3), had the highest rate of GRTR+ loose ends relative to other tumor types (Fig. 3k). These results indicate that GRTR+ loose ends and neotelomeres may be a new hallmark of the ALT phenotype.
Loose ends link viral integration to amplicon formation
Surveying additional mutational processes engendering loose ends, we found ambiguously mapped somatic breakends fused to viral sequences, indicating junctional viral integration at large SVs (Extended Data Fig. 8a). While the integration of viral sequences into otherwise unrearranged loci (Extended Data Fig. 8a, left) has been widely studied in cancer27,28, the role of viruses in causing chromosomal-scale SVs (Extended Data Fig. 8a, right) has been a topic of only recent interest29,30,31. Somatic loose ends harboring tumor-specific viral sequence (viral loose ends) were rare overall (~1% of cancers), although enriched in cancer types with viral etiology in our dataset4: cervical squamous cell carcinoma (CESC; 32%), liver hepatocellular carcinoma (LIHC; 13%) and head and neck squamous cell carcinoma (HNSC; 7%) (Extended Data Fig. 8b). Consistent with previously characterized viral integration patterns, we found viral loose ends fused to oncogenic HPV sequences in CESC and HNSC and hepatitis B virus (HBV) sequences in LIHC27.
Breakends initiating complex amplifications are themselves likely to be amplified4. Viral loose ends were frequently amplified (CN > 7) relative to nonviral loose ends (P = 1.7 × 10−4; OR = 8.66) (Extended Data Fig. 8c), and HPV-16 loose ends had higher mean CN than either HPV-18 or HBV loose ends (P = 8.2 × 10−3 and P = 2.2 × 10−5, respectively, Extended Data Fig. 8d). Among these was an HNSC tumor (TCGA-4077) locus where two high-copy viral loose ends on chromosome 14 flanking an intronic region of the RAD51B gene were fused to opposite ends of the HPV-16 genome (Extended Data Fig. 8e). This locus is consistent with an ecDNA where HPV-16 is fused between two ends of a long-range duplication junction. This and other similar amplicon structures with high-copy viral loose ends (Extended Data Fig. 8e,f) point to HPV-16 integration as an initiating event in SV evolution, rather than a viral insertion into an existing ecDNA.
Crossover between parental homologs is rare in cancer
We next asked whether loose ends could be used to assess the contribution of aberrant homologous recombination to cancer rearrangements. Homologous recombination-driven crossover between parental homologs (allelic homologous recombination, or AHR) is a hallmark of meiosis32. Although AHR has been observed in somatic cells33, its contribution to cancer structural variation is unclear. AHR crossovers lead to segmental uniparental disomy (UPD) in approximately half of segregants (Fig. 4a, left). In balanced allelic graphs, AHR crossovers manifest as reciprocal pairs of partially mapped and copy-neutral loose ends on distinct parental homologs (Fig. 4b, left, and Methods). Notably, this form of UPD (AHR-UPD) is mechanistically distinct from UPD arising through progressive acquisition of nonhomologous recombination (for example, end joining)-driven rearrangements and/or chromosomal missegregation (progressive UPD, or P-UPD; Fig. 4a,b, right).
In our simulations (Extended Data Fig. 3a and Methods), JaBbA v1 distinguished AHR-UPD from P-UPD with both high precision (84.4%) and high recall (87.4%), substantially outperforming previous allelic CN algorithms (with precision ranging from 11–44%) (Extended Data Fig. 9a,b). Analysis of segment width distributions showed that AHR-UPD was distinct from P-UPD, whose distribution closely mirrored that of other forms of loss of heterozygosity (LOH; Fig. 4c). Likewise, AHR-UPD events were large (median width of 19.8 Mb), unlike P-UPD events (median width of 0.69 Mb) and other forms of LOH (median width of 0.62 Mb), which were focal (Fig. 4c).
Although AHR was found in many cancers (24% of all tumors) and specific tumor types (for example, 55% of cases of malignant lymphoma) (Extended Data Fig. 9c), it contributed to a minority of UPD events, most of which were progressive (31% P-UPD versus 1% AHR-UPD by total width) (Fig. 4d). Overall, a small minority of detected cancer breakends (<1%) arose by AHR (including non-UPD LOH). On the basis of an approximate rate of 0.5 AHR events per tumor and 100 cell divisions in the average ancestral cancer clone, and barring effects of selection, we estimate a rate of 10−12 AHR events per base pair per cell division. This is four orders of magnitude lower than the rate of meiotic recombination in human gametes, suggesting that AHR events are infrequent in somatic evolution34.
Germline but not somatic loose ends are consistent with NAHR
A second mechanism by which aberrant homologous recombination can cause large SVs is through non-AHR (NAHR), or crossover between long (>500-bp) stretches of nearly identical genomic sequences at distant haploid coordinates32,35,36 (Fig. 4e). We reasoned that such SVs would engender pairs of loose ends with substantial (>500-bp) strand-specific sequence homology in their vicinity (Extended Data Fig. 10a and Methods)36. Indeed, the burden of homologous loose end pairs accurately reflected the true NAHR burden across a compendium of simulated SRS tumor whole-genome profiles (Extended Data Fig. 3a) harboring a wide range of NAHR SV fractions (1–10%) (Fig. 4f).
Analyzing breakend pairs within each tumor, we found that approximately 20% of germline loose ends (Methods) were consistent with NAHR in contrast to only about 0.5% of somatic loose ends (and 0.06% of all somatic SV breakends) (Fig. 4g). These findings are consistent with prior observations about the substantial role of NAHR in germline variation8,37. The somatic NAHR burden did not vary by tumor type nor was it lower in tumors harboring biallelic pathogenic mutations in DNA repair genes, including frequently mutated homologous recombination pathway mediators (BRCA1, BRCA2, PALB2 and RAD51C). In summary, given a mean of 0.16 somatic NAHR events per tumor occurring across an estimated eligible territory of 2.8 × 108 homologous position pairs, we estimate a somatic NAHR density of 6 × 10−10 events per cancer genome bp2 (Methods).
To validate these SRS findings in long-molecule whole-genome profiles, we analyzed 38 melanoma and breast cancer cases profiled with SRS and either LRS or sLRS. Both LRS and sLRS data confirmed our SRS findings that somatic NAHR SVs were rare (<1% of LRS junction calls) while germline NAHR SV events were common (Fig. 4h and Extended Data Fig. 10b–e). Notably, we did not identify any reciprocal somatic NAHR rearrangements, a class of SVs that may potentially be missed through analysis of SRS loose ends.
Extrapolating beyond the CN-mappable genome
The analyses described above were limited to the 87% of the genome where CN could be reliably measured with SRS (Fig. 5a). The remaining 13% that is CN-unmappable comprises largely regions in or around telomeres and centromeres (Extended Data Fig. 2b). To assess the burden of large SVs here, we applied two simplifying assumptions: (1) the rate of NAHR between any two regions in the genome is proportional to the number of position pairs with substantial homology (>500 bp with >96% homology) between these regions and (2) the density of non-NAHR-driven rearrangements is uniform across the genome, and hence the burden of non-NAHR breakends in a given region is proportional to its width. Both of these assertions hold true, to a first approximation, across the CN-mappable genome (Extended Data Fig. 10f,g).
We used the latest telomere-to-telomere build (T2T CHM13; ref. 38) to estimate the number of homologous position pairs outside CN-mappable regions (Fig. 5b). We found that CN-unmappable sequences harbored ~100-fold-greater homologous position pairs (2.7 × 1010 bp2) than the CN-mappable portion of the T2T CHM13 genome build (2.8 × 108 bp2) (Fig. 5c). This suggested that CN-unmappable regions harbor ~100 times as many NAHR SVs as CN-mappable regions. Integrating these measurements (Fig. 5a–c and Methods), we estimate that CN-mappable regions harbor 83% of all large SV cancer breakends, most of which are detected by SRS (Fig. 5d). Furthermore, even when CN-unmappable regions are taken into account, we estimate that homologous recombination contributes to a small proportion (~5%) of large cancer SV breakends (Fig. 5d).