Methodological and results overview
Traditional RNA-seq studies relied on short-read sequencing approaches that excel at quantifying gene-level expression, but cannot accurately assemble and quantify a large proportion of RNA isoforms11,21 (Fig. 1a). Thus, we sequenced 12 postmortem, aged, dorsolateral prefrontal cortex (Brodmann area 9/46) brain samples individually from six patients with AD and six cognitively unimpaired controls (50% female; Fig. 1b). All samples had postmortem intervals <5 h and an RNA integrity score (RIN) ≥ 9.0; demographics, summary sequencing statistics and read length distributions are shown in Supplementary Table 1 and Supplementary Figs. 1–4. Poly(A)-enriched complementary DNA from each sample was sequenced using one PromethION flow cell. Sequencing yielded a median of 35.5 million aligned reads per sample after excluding reads lacking the primer on either end and those with a mapping quality <10 (Extended Data Fig. 1a). By excluding all reads missing primers, reads included in the present study should closely represent the RNA as it was at extraction.
We performed RNA isoform quantification and discovery (including new gene bodies) using bambu14 (Fig. 1b)—a tool with emphasis on reducing false-positive RNA isoform discovery compared with other commonly used tools14. Bambu was highlighted as a top performer in a recent benchmark study13. However, as a tradeoff for higher precision, bambu is unable to discover new RNA isoforms that only differ from annotated RNA isoforms at the transcription start and/or end site (for example, shortened 5′-UTR). When it comes to quantification, the increasing complexity of annotations can impact quantification owing to non-unique reads being split between multiple transcripts. For example, if a read maps equally well to two RNA isoforms, each isoform will receive credit for 0.5 reads.
For our 12 samples, bambu reported an average of 42.4% reads uniquely assigned to an RNA isoform and 17.5% reads spanning a full-length RNA isoform (Extended Data Fig. 1c). We considered an isoform to be expressed above noise levels only if its median counts per million (CPM) was >1 (that is, at least half of the samples had a CPM > 1); this threshold is dependent on overall depth, because lower depths will require a higher, more stringent CPM threshold. Using this threshold, we observed 28,989 expressed RNA isoforms from 18,041 gene bodies in our samples (Extended Data Fig. 2a–c). Of the RNA isoforms expressed with median CPM > 1, exactly 20,183 were classified as protein coding, 2,303 as long noncoding RNAs, 3,213 as having a retained intron and the remaining 3,290 were scattered across other biotypes—including new transcripts (Extended Data Fig. 3).
We used publicly available mass spectrometry (MS) data from aged, human dorsolateral prefrontal cortex tissue (Brodmann area 9)22,23 and human cell lines24 to validate new RNA isoforms at the protein level, resulting in a small number of successful validations. We also leveraged existing short-read RNA-seq data from the Religious Orders Study Memory and Aging Project (ROSMAP)25,26 and long-read RNA-seq data from Glinos et al.19 to validate our newly discovered RNA isoforms and gene bodies.
Discovery of new RNA isoforms from known gene bodies
Our first goal was to identify and quantify new RNA isoforms expressed in human frontal cortex. In total, bambu discovered 1,534 new transcripts from known (that is, annotated) nuclear gene bodies. Of these 1,534 new RNA isoforms, exactly 1,106 had a median CPM ≤ 1. Although we expect that many of these new RNA isoforms with a median CPM ≤ 1 are legitimate, we consider them low-confidence discoveries and exclude them throughout the remainder of our analyses, except where explicitly noted.
After excluding all isoforms with a median CPM ≤ 1,428, isoforms remained that we consider high confidence (Fig. 2a,b), where 303 were from protein-coding genes (Fig. 2a). We report substantially fewer new isoforms compared with Glinos et al.19 (~70,000) and Leung et al.20 (~12,000) because of: (1) differences in the reference database; (2) the discovery tool employed13,27 (that is, bambu14 versus FLAIR28 versus Cupcake29); and (3) sequencing depth and stringency in what constitutes a new isoform. Specifically, Glinos et al.19 used gene annotations from 2016 when determining new isoforms. This is likely because they were trying to maintain consistency with previous Genotyope-Tisse Expression (GTEx) releases, but approximately 50,000 new isoforms have already been annotated since then2. We also set a stricter threshold for high-confidence isoforms, using a median CPM > 1. Given the depth of our data, a CPM = 1 corresponds to an average of 24 observed copies (that is, counts) per sample. Exactly 297 (69.4%) of our newly discovered isoforms are unique to our data, when compared with Ensembl v.107, Glinos et al.19 and Leung et al.20 (Supplementary Tables 2 and 3).
We performed a down-sampling analysis to assess the importance of depth on our discoveries. Including all discoveries (even those with median CPM ≤ 1), we discovered only 490 new isoforms from known genes with 20% of our aligned reads compared with 1,534 using 100% of our aligned reads (difference of 1,044; Extended Data Fig. 4a). Looking only at high-confidence discoveries in known genes, we discovered 238 and 428 at 20% and 100% of reads, respectively (Extended Data Fig. 4b), showing the importance of depth in our data. Although both annotations and read depth were important factors impacting new RNA isoform discovery, these do not explain the dramatic difference in reported discoveries between our work and that of Glinos et al.19. Thus, we conclude that the primary driver of these differences is the discovery tool employed. We observed a 33.8% increase in transcript discovery overlap between our dataset and GTEx when using the same tools and annotation, supporting the idea that these are large drivers of differences between our findings (Extended Data Fig. 5). We analyzed data from all tissue types from Glinos et al.19 to ensure consistency between our approaches. The discovery of new isoforms unique to GTEx when using the identical pipeline and annotations from our study probably results from tissue-specific isoforms that do not occur in the brain.
New high-confidence isoforms had a median of 761.5 nucleotides in length, ranging from 179 nt to 3,089 nt (Fig. 2c) and the number of exons ranged between 2 and 14, with most isoforms falling on the lower end of the distribution (Fig. 2d). Our data were enriched for new RNA isoforms containing all new exons and exon–exon boundaries (that is, exon junctions; Fig. 2e). The 428 new high-confidence isoforms contained 737 new exon–intron boundaries, where 94.9% (356/370) and 100% (367/367) of the 5′- and 3′-splice sites matched canonical splice site motifs, respectively, supporting their biological feasibility (Fig. 2f). We successfully validated 9 of 17 attempts for new high-confidence isoforms through PCR and gel electrophoresis (Fig. 2g, Supplementary Figs. 5–26 and Supplementary Table 4). Of the eight RNA isoforms that failed via standard PCR (no visible band on gel), six were validated through real-time quantitative PCR (RT–qPCR) using a conservative cutoff of Ct < 35 (ref. 30) (Supplementary Table 5). Of the 15 transcripts successfully validated through PCR and gel electrophoresis or RT–qPCR, 11 were unique to the present study. For additional validation, we compared relative abundance for known and new RNA isoforms between long-read sequencing and RT–qPCR for MAOB, SLC26A1 and MT-RNR2. The expression patterns were concordant for all three genes tested (Extended Data Fig. 6 and Supplementary Tables 6 and 7).
We further attempted to validate our new high-confidence transcripts from known genes using long-read RNA-seq data from five GTEx19 brain samples (Brodmann area 9) and short-read RNA-seq data from 251 ROSMAP25 brain samples (Brodmann area 9/46). Approximately 98.8% of the new high-confidence transcripts from known gene bodies had at least one uniquely mapped read in either GTEx or ROSMAP data and 69.6% had at least 100 uniquely mapped reads in either dataset (Extended Data Fig. 7 and Supplementary Table 8).
Out of interest, we also validated 6 RNA isoforms from the 99 newly predicted protein-coding genes reported in Nurk et al.31 using the new telomere-to-telomere (T2T) CHM13 reference genome (Extended Data Fig. 8). Our validation threshold for the CHM13 analysis was at least 10 uniquely mapped reads in total across our 12 frontal cortex samples.
Using MS data from the same brain region and human cell lines, we validated 11 of the new high-confidence isoforms from known genes at the protein level (Fig. 2h,i). Three of the eleven that we validated were unique to our study (BambuTx1879, BambuTx1758 and BambuTx2189).
Medically relevant genes
Identification and quantification of all isoforms are especially important for known medically relevant genes because, for example, when clinicians interpret the consequence of a genetic mutation, it is interpreted in the context of a single isoform of the parent gene body. That isoform may not even be expressed in the relevant tissue or cell type, however. Thus, knowledge about which tissues and cell types express each isoform will allow clinicians and researchers to better interpret the consequences of genetic mutations in human health and disease. To assess RNA isoform expression for medically relevant genes in the frontal cortex, we used the list of medically relevant genes defined in ref. 32, also adding genes relevant to brain-related diseases33,34,35,36,37,38,39,40,41,42.
Of the 428 new high-confidence isoforms, 53 originated from 49 medically relevant genes and we quantified the proportion of total expression for the gene that came from the new isoform(s) (Fig. 3a and Supplementary Fig. 27). The genes with the largest percentage of reads from a newly discovered isoform include SLC26A1 (86%; kidney stones43 and musculoskeletal health44), CAMKMT (61%; hypotonia–cystinuria syndrome, neonatal seizures, severe developmental delay and so on45) and WDR4 (61%; microcephaly46 and Galloway–Mowat syndrome-6 (ref. 47)). Other notable genes with new high-confidence isoforms include MTHFS (25%; major depression, schizophrenia and bipolar disorder48), CPLX2 (10%; schizophrenia, epilepsy and synaptic vesicle pathways49) and MAOB (9%; currently targeted for Parkinson’s disease treatment50; Fig. 3c). We also found an unannotated RNA isoform for TREM2 (16%; Fig. 3b), one of the top AD risk genes51, which skips exon 2. This isoform was reported as new in our data because it remains unannotated by Ensembl as of June 2023 (ref. 2), but has previously been reported by two groups52,53. The articles identifying this new TREM2 isoform reported a relative abundance of around 10%, corroborating our long-read sequencing results52,53. The new isoform for POLB—a gene implicated in base-excision repair for nuclear and mitochondrial genomes54,55—accounted for 28% of the gene’s expression (Fig. 3d). We discovered an additional 66 new transcripts from medically relevant genes with median CPM ≤ 1, including new RNA isoforms for SMN1 and SMN2 (spinal muscular atrophy56; Supplementary Figs. 28 and 29). Medically relevant genes with new RNA isoforms that did not meet our high confidence are shown in Supplementary Fig. 30.
Spliced, mitochondrially encoded isoforms
We identified a new set of spliced, mitochondrially encoded isoforms containing two exons (Fig. 3e), a highly unexpected result given that annotated mitochondrial transcripts contain only one exon. New mitochondrial isoforms were filtered using a count threshold based on full-length reads rather than a median CPM threshold owing to technical difficulties in quantification arising from the polycistronic nature of mitochondrial transcription. Bambu identified a total of 34 new spliced mitochondrial isoforms, but, after filtering using a strict median full-length count threshold of 40, only 5 high-confidence isoforms remained. Four of the new high-confidence isoforms span the MT-RNR2 transcript. Not only does MT-RNR2 encode the mitochondrial 16S rRNA, but it is also partially translated into a purported anti-apoptotic, 24-amino acid peptide (humanin) by inhibiting the Bax protein57. The fifth new high-confidence isoform spans the MT-ND1 and MT-ND2 genes, but on the opposite strand. Our results support previous important work by Herai et al. demonstrating splicing events in mitochondrial RNA58.
For context, although expression for the new mitochondrial isoforms was low compared with known mitochondrial genes (Fig. 3f), their expression was relatively high when compared with all nuclear isoforms. All five exons from new high-confidence mitochondrial isoforms contained the main nucleotides from the canonical 3′-splice site motif (AG), whereas three out of five (60%) contained the main nucleotides from the canonical 5′-splice site motif (GT) (Fig. 3g).
We attempted to validate three new high-confidence mitochondrially encoded isoforms through PCR and successfully validated two of them (Supplementary Figs. 25 and 26). It was not possible to design specific primers for the other two new high-confidence mitochondrial isoforms because of low sequence complexity or overlap with other lowly expressed (low-confidence) mtRNA isoforms found in our data. However, we were able to validate all five high-confidence spliced mitochondrial transcripts in the data from Glinos et al.19 because each had at least 100 uniquely aligned counts across each of the 5 GTEx brain samples (Extended Data Fig. 7). Mitochondria are essential to human cell life (and most eukaryotes) and have been implicated in a range of human diseases, including seizure disorders59, ataxias60, neurodegeneration61 and other age-related diseases62. Thus, although function for the new isoforms is not clear, determination of their function is important because they could have important biological roles or serve as biomarkers for mitochondrial function.
Discovery of transcripts from new gene bodies
RNA isoforms from new gene bodies refer to poly(adenylated) RNA species coming from regions of the genome where transcription was unexpected (that is, unannotated). Bambu identified a total of 1,860 isoforms from 1,676 new gene bodies. We observed a total of 1,593 potential new gene body isoforms with a CPM ≤ 1. We considered these potential discoveries as low confidence and excluded them from the remainder of our analyses, leaving 267 high-confidence isoforms from 245 gene bodies (Fig. 4a,b). Glinos et al.19 did not specifically report on new gene bodies, but Leung et al.20 reported 54 new gene bodies in human cortex where 5 overlapped with our high-confidence isoforms from new genes. The new isoforms from new gene bodies had a median length of 1,529 nt, ranging between 109 nt and 5,291 nt (Fig. 4c). The number of exons ranged between 2 and 4, with 96.6% of isoforms having only 2 exons (Fig. 4d). Given the large proportion of transcripts containing only two exons, it is possible that we sequenced only a fragment of larger RNA molecules.
Of the 267 new high-confidence isoforms from new gene bodies, 130 overlapped a known gene body on the opposite strand, 97 came from a completely new locus and 40 came from within a known gene body, but did not overlap a known exon (Fig. 4e). These 170 new transcripts from new gene bodies located in intragenic regions could be a result of leaky transcription and splicing. A recent article63 suggests that spurious intragenic transcription may result from aging in mammalian tissues. In new isoforms from new gene bodies, 82.5% (222 of 269) of exons contained the primary ‘GT’ nucleotides from the canonical 5′-splice site motif, whereas 90.7% (244 of 269) contained the primary ‘AG’ nucleotides from the canonical 3′-splice site motif (Fig. 4f). It is interesting that one new gene body (BambuGene290099) had three high-confidence RNA isoforms (Fig. 4g). We successfully validated 11 of 12 attempts for new high-confidence RNA isoforms from new gene bodies through PCR and gel electrophoresis (Fig. 4h, Supplementary Figs. 5–26 and Supplementary Table 4), where the 12th was successfully validated through RT–qPCR (mean Ct = 23.2; Supplementary Table 5). All 12 new RNA isoforms from new gene bodies validated through PCR were unique to the present study.
Over 94.4% of the new high-confidence transcripts from new gene bodies had at least one uniquely mapped read in either GTEx or ROSMAP data and >44.2% had at least 100 uniquely mapped reads in either dataset (Extended Data Fig. 7 and Supplementary Table 8). The validation rate for new transcripts from known gene bodies was higher than new transcripts from new gene bodies, indicating that some of our newly discovered genes could be aging related. Whether these newly discovered gene bodies are biologically meaningful or ‘biological noise’ is unclear. We validated three RNA isoforms from new gene bodies at the protein level using MS data from the same brain region and human cell lines (Fig. 4i); all three were unique to the present study.
During isoform discovery, we identified a new low-abundance RNA isoform (median CPM < 1) with two exons for the External RNA Controls Consortium (ERCC) RNA spike-ins (Supplementary Figs. 31 and 32). We were skeptical about this discovery because ERCCs contain only one exon, but we validated these results by PCR across two different batches of ERCC (Supplementary Figs. 33 and 34).
Medically relevant genes expressing multiple RNA isoforms
We found 7,042 genes expressing two or more RNA isoforms with a median CPM > 1, where 3,387 genes expressed ≥2 isoforms with distinct protein sequences (Fig. 5a,b). Of the 5,035 medically relevant genes included in the present study32, 1,917 expressed multiple isoforms and 1,018 expressed isoforms with different protein-coding sequences (Fig. 5c), demonstrating the isoform diversity of medically relevant genes in a single tissue and the importance of interpreting genetic variants in the proper context of tissue-specific isoforms. Of the 7,418 transcripts from medically relevant genes expressed with median CPM > 1, 5,695 are longer than 2,000 nt (Supplementary Fig. 35). Given the length of these 5,695 RNA isoforms, it is likely that their quantification is less accurate, despite the advantages that long-read sequencing offers.
It is interesting that 98 genes implicated in brain-related diseases expressed multiple RNA isoforms in human frontal cortex, including AD genes such as APP (Aβ-precursor protein) with 5, MAPT (tau protein) with 4 and BIN1 with 8 (Fig. 5d–h). Notably, we observed only four MAPT isoforms with a median CPM > 1, where two were expressed at levels many times greater than the others, whereas substantial previous research suggests that there are six tau proteins expressed in the central nervous system64,65,66. Similarly, several genes implicated in other neurogenerative diseases and neuropsychiatric disorders expressed multiple isoforms in human frontal cortex, including SOD1 (amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD); Fig. 5i) with two isoforms expressed with a median CPM > 1, SNCA (Parkinson’s disease (PD); Fig. 5i) with four, TARDBP (TDP-43 protein; involved in several neurodegenerative diseases; Fig. 5i,j) with four and SHANK3 (autism spectrum disorder; Fig. 5k,l) with three.
RNA isoform expression reveals patterns hidden at gene level
Perhaps the most compelling value in long-read RNA-seq is the ability to perform differential isoform expression analyses. Through these analyses, we can begin to distinguish which isoforms are expressed in specific cell types and tissue types and ultimately determine their associations with human health and disease. Thus, as proof of principle, we performed differential gene and isoform expression analyses comparing six pathologically confirmed cases of AD and six cognitively unimpaired controls. The dataset is not large enough to draw firm disease-specific conclusions, but it does demonstrate the need for larger studies.
We found 176 differentially expressed genes and 105 differentially expressed RNA isoforms (Fig. 6a,b and Supplementary Tables 9 and 10). Of these 105 isoforms, 99 came from genes that were not differentially expressed when collapsing all isoforms into a single gene measurement (Fig. 6a,b), demonstrating the utility of differential isoform expression analyses. It is interesting that there were two differentially expressed isoforms from the same gene (TNFSF12), with opposite trends. The TNFSF12-219 isoform was upregulated in cases with AD whereas TNFSF12-203 was upregulated in controls (Fig. 6c–e), even though the TNFSF12 gene was not differentially expressed when collapsing all transcripts into a single gene measurement (Fig. 6c).
Out of interest, we measured the expression patterns for the TNFSF12-203 and TNFSF12-219 isoforms in the five GTEx long-read RNA-seq samples from Brodmann area 9 to assess whether the expression pattern matched what we observed in our cognitively unimpaired controls (Extended Data Fig. 9). We found that the expression for both TNFSF12 isoforms shows greater variability than either of our groups, but arguably more closely resembles the pattern in our controls.
Out of interest, we also provided plots from a principal component analysis at both the gene and the isoform level where we observed a potential separation between cases and controls (Supplementary Fig. 36). We encourage caution to avoid overinterpreting this potential separation between cases and controls given the small sample size.