Supervised RNA-seq analysis
Cordeliers Artificial Intelligence and Bioinformatics
Source:vignettes/sup_rnaseq.Rmd
sup_rnaseq.Rmd
For the supervised and unsupervised analyses, we are using an open
dataset which is given by the airway package. It provides a
gene expression dataset derived from human bronchial epithelial
cells, treated or not with dexamethasone (a
corticosteroid).
Dexamethasone treatment is expected to triggers a stress‐response program and reset the peripheral circadian clock.
The treatment information is recorded in the dex column
of the data.
Here is an example of how the supervised part of the
CAIBIrnaseq package.
First, we need to import the required packages :
library(airway)
library(SummarizedExperiment)
library(CAIBIrnaseq)
library(tidyverse)
library(patchwork)
devtools::load_all()Before analysing the dataset, we define the variables with the names of the genes / pathways we want to analyse. This is not mandatory, but it can allow someone with limited coding experience to execute the notebook without changing the downstream code except for these variables.
These variables will change in function of how your dataset is build and which type of data it is.
Parameters
annotation <- "dex"
species <- "human"
pathwayMethod <- "ora"
collections <- c("Hallmark", "GO:BP")
boxplot_pathways <- c("GOBP RESPONSE TO HORMONE", "GOBP SMALL MOLECULE METABOLIC PROCESS", "GOBP POSITIVE REGULATION OF SYNAPSE ASSEMBLY")
heatmap_genes <- list(
`Glucocorticoid response genes` = c("FKBP5", "TSC22D3", "PER1", "ZBTB16"),
`Anti-inflammatory genes` = c("DUSP1", "SOCS1", "MT2A"))
pathway_genes <- c("GOBP SMALL MOLECULE METABOLIC PROCESS",
"GOBP POSITIVE REGULATION OF SYNAPSE ASSEMBLY")
heatmap_pathways <- c("GOBP RESPONSE TO HORMONE", "GOBP SMALL MOLECULE METABOLIC PROCESS", "GOBP POSITIVE REGULATION OF SYNAPSE ASSEMBLY")Remember to come back in this cell to modify the genes, pathways you are interested in studying.
Load data
This section loads the RNA-seq dataset for analysis. It ensures the correct input file is used, as specified in the parameters. rebase_gexp
Ensure your dataset is in a Summarized Experiment
object, because all the used functions below works with this type of
input.
If you want to know more about this type of object, please click here: Bioconductor
If you have a .xsl file (Excel), there is an existing notebook to convert it in a .RDS file.
data(airway, package="airway")
exp_data <- airway
rowData(exp_data)$gene_length_kb <-
(rowData(exp_data)$gene_seq_end - rowData(exp_data)$gene_seq_start) / 1000
# If you are using your own dataset .RDS file), use this command line :
# exp_data <- readRDS(data_file)
exp_data## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(11): gene_id gene_name ... symbol gene_length_kb
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
## [1] "gene_id" "gene_name" "entrezid" "gene_biotype"
## [5] "gene_seq_start" "gene_seq_end" "seq_name" "seq_strand"
## [9] "seq_coord_system" "symbol" "gene_length_kb"
Even if the datasets are globally build the same way, the names of the variables are not exactly the same, so if we want to keep the same code, we need to redefine a bit the variables.
If data was generated by CAIBI from an nf-core/RNAseq pipeline, the
salmon.merged.gene_counts_scaled.rds RDS object included
with the outputs will already be in the appropriate format.
If not, you should look at how your dataset is constructed. You might need to gene annotation metadata. We suggest using \[`biomaRt::getBM`\].
Pre-processing
Most datasets use ensembl gene ID by default after alignment, so this
step rebases the expression data to gene names. This ensures consistency
in naming for downstream analyses. This may be skipped if the exp_data
already uses gene_name or gene_symbol as its
default ID.
exp_data <- rebase_gexp(exp_data, keep_cols = c("gene_name", "gene_id",
"gene_biotype", "gene_length_kb"))
exp_data## class: SummarizedExperiment
## dim: 56638 8
## metadata(0):
## assays(2): counts tpm
## rownames(56638): 5S_rRNA 7SK ... snosnR66 yR211F11.2
## rowData names(4): gene_name gene_id gene_biotype gene_length_kb
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
Filter
Here, we filter out genes expressed in too few samples or with very low counts. This removes noise from the data and focuses on meaningful gene expressions. In this example, only zero-expression genes will be filtered, which is appropriate for a lot of analyses.
exp_data <- filter_gexp(exp_data,
min_nsamp = 1,
min_counts = 1)Visualization of the filtering process to ensure the criteria applied align with the dataset’s characteristics:
colData(exp_data)$sample_id <- colnames(exp_data)
plot_qc_filters(exp_data)Normalize
Here, we apply a normalization to the expression
data, making samples comparable by reducing variability due to technical
differences. For datasets with few samples, rlog is the
preferred normalization and when more samples are present,
vst is applied.
exp_data <- normalize_gexp(exp_data)PCA
Principal component analysis (PCA) identifies the major patterns in the dataset. These patterns help explore similarities or differences among samples based on gene expression.
metadata(exp_data)[["pca_res"]] <- pca_gexp(exp_data)
annotations <- setdiff(annotation, c("exp_cluster", "path_cluster"))
plot_pca(exp_data, color = annotation, ellipses = TRUE)## Warning: The following aesthetics were dropped during statistical transformation: label.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: label.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Differential expression
Differential expression analysis is a statistical method used to identify genes whose expression levels significantly change between different experimental conditions, i.e differentially expressed genes (DEGs). These changes in gene expression can provide insights into the biological processes or pathways affected, and help identify key molecular players involved in a condition or phenotype.
Design
The differential expression design argument allows us to
control which variables will be considered sources of biological (or
technical) variation by the model.
In our example, we say our only source of variation is the
annotation variable (which corresponds to the treatment
column, dex).
The ~ symbol may be added before the variable name to
represent that it’s the independent variable (i.e. the differential
expressed genes will depend on annotation i.e.
DEGs ~ annotation.
However, it is possible to make an additive model by adding variables
together in the design (e.g. design = "~ var1 + var2").
This can be very useful and lead to more accurate and stringent DEGs
prediction than doing separate analyses for different variables.
The model can also consider interactions, but this is a more advanced topic and should be explored from a statistical point of view.
Constrasts
Contrasts are how we tell the model what to compare. For our
function, contrasts should be defined as a string with exactly three
elements: the name of a variable in the design formula, the name of the
numerator level for the fold change, and the name of the denominator
level for the fold change, separated by _vs_.
For example: dex_trt_vs_untrt : for variable dex,
compare treated (trt) vs untreated untrt
samples.
- Variable:
dex - Numerator:
trt - Denominator:
untrt
Genes with positive fold-change will be up-regulated in the numerator
(trt).
Available contrasts will be built from the levels of the variable, where the first level will be used as denominator and all other levels as possible numerators.
Warning: Pay attention to the order of levels in the variable as this may lead to wrong comparisons where numerator/denominator are inverted. It may need to be altered manually, as in our example below.
Note: if contrasts are not given, all possible contrasts will be returned (with levels in automatic order)
Fitting the model
colData(exp_data)$dex <- factor(colData(exp_data)$dex, levels = c("untrt", "trt"))
diffexp <- diffExpAnalysis(exp_data, design = annotation, contrasts = "dex_trt_vs_untrt")Volcano plot
A volcano plot is a graphical tool used to visualize the results of differential expression analysis. It plots the statistical significance on the y-axis against the magnitude of change on the x-axis for each gene.
volcano_plot <- plot_exp_volcano(diffexp, 20)
volcano_plot
Top differentially-expressed genes
The differentially expressed genes (DEGs) can be checked directly from the results.
Here, we show the top 30 most differentially-expressed genes:
diffexp |>
slice(1:30)| baseMean | log2FoldChange | lfcSE | pvalue | padj | |
|---|---|---|---|---|---|
| SPARCL1 | 997.4590 | 4.588342 | 0.2121691 | 0 | 0 |
| STOM | 11193.9381 | 1.437605 | 0.0849568 | 0 | 0 |
| PER1 | 776.6130 | 3.158757 | 0.2023439 | 0 | 0 |
| PHC2 | 2738.0321 | 1.375870 | 0.0917735 | 0 | 0 |
| MT2A | 3656.3436 | 2.184313 | 0.1480886 | 0 | 0 |
| DUSP1 | 3409.0890 | 2.919829 | 0.2026634 | 0 | 0 |
| MAOA | 2341.8179 | 3.274171 | 0.2292011 | 0 | 0 |
| ZBTB16 | 385.0749 | 7.107543 | 0.5009888 | 0 | 0 |
| KCTD12 | 2649.9218 | -2.474067 | 0.1797743 | 0 | 0 |
| SAMHD1 | 12703.6077 | 3.829537 | 0.2807915 | 0 | 0 |
| NEXN | 5393.2150 | 2.008149 | 0.1524976 | 0 | 0 |
| SLC6A9 | 229.6453 | -2.192193 | 0.1671431 | 0 | 0 |
| FKBP5 | 2564.4337 | 3.895135 | 0.2988390 | 0 | 0 |
| ARHGEF2 | 2250.6470 | -1.002539 | 0.0825635 | 0 | 0 |
| DNAJB4 | 1467.6663 | 1.502061 | 0.1240173 | 0 | 0 |
| NNMT | 7433.0763 | 2.129004 | 0.1787083 | 0 | 0 |
| ABAT | 498.7005 | 1.143222 | 0.0973847 | 0 | 0 |
| FAM171B | 1303.3082 | -1.331451 | 0.1167911 | 0 | 0 |
| CCDC69 | 2057.2438 | 2.852120 | 0.2527701 | 0 | 0 |
| FAM198B | 5596.4628 | 1.109075 | 0.0985036 | 0 | 0 |
| COL1A1 | 100992.9955 | -1.285677 | 0.1162955 | 0 | 0 |
| GLUL | 19611.5929 | 2.954573 | 0.2713427 | 0 | 0 |
| KLF15 | 561.1218 | 4.508089 | 0.4164685 | 0 | 0 |
| VCAM1 | 508.1780 | -3.548945 | 0.3290099 | 0 | 0 |
| STEAP1 | 1920.7503 | 1.532424 | 0.1420552 | 0 | 0 |
| JADE1 | 1570.6660 | 1.188486 | 0.1103296 | 0 | 0 |
| FGD4 | 1223.4792 | 2.217150 | 0.2073593 | 0 | 0 |
| ARMC8 | 1010.3123 | 1.335065 | 0.1264607 | 0 | 0 |
| RAP2B | 630.8901 | -1.251762 | 0.1190572 | 0 | 0 |
| MORF4L2 | 5040.5952 | 1.012343 | 0.0964122 | 0 | 0 |
The most important values are log2FoldChange and
padj (you may see the explanation for all outputs by
checking ? DESeq2::results)
-
log2FoldChange: the effect size estimate. This value indicates how much the gene’s expression seems to have changed between the comparison and control groups. This value is reported on a logarithmic scale to base 2. -
padj: Adjusted P-value for multiple testing for the gene or transcript.
We should privilege using the adjusted p-value to account for
false-discoveries. Fold-changes are shrunk with apeglm by
default to avoid noise and preserve large differences. (see:
? DESeq2::lfcShrink and the apeglm
paper).
Pathway analysis
Pathway collections available in the MSIGdb can be specified in the
parameters. These pathways are scored and ranked by their variance in
the data. These are the available collections (use
gs_subcollection as name except for Hallmarks, which should
be ‘H’).
msigdbr::msigdbr_collections()| gs_collection | gs_subcollection | gs_collection_name | num_genesets |
|---|---|---|---|
| C1 | Positional | 302 | |
| C2 | CGP | Chemical and Genetic Perturbations | 3494 |
| C2 | CP | Canonical Pathways | 19 |
| C2 | CP:BIOCARTA | BioCarta Pathways | 292 |
| C2 | CP:KEGG_LEGACY | KEGG Legacy Pathways | 186 |
| C2 | CP:KEGG_MEDICUS | KEGG Medicus Pathways | 658 |
| C2 | CP:PID | PID Pathways | 196 |
| C2 | CP:REACTOME | Reactome Pathways | 1736 |
| C2 | CP:WIKIPATHWAYS | WikiPathways | 830 |
| C3 | MIR:MIRDB | miRDB | 2377 |
| C3 | MIR:MIR_LEGACY | MIR_Legacy | 221 |
| C3 | TFT:GTRD | GTRD | 505 |
| C3 | TFT:TFT_LEGACY | TFT_Legacy | 610 |
| C4 | 3CA | Curated Cancer Cell Atlas gene sets | 148 |
| C4 | CGN | Cancer Gene Neighborhoods | 427 |
| C4 | CM | Cancer Modules | 431 |
| C5 | GO:BP | GO Biological Process | 7608 |
| C5 | GO:CC | GO Cellular Component | 1026 |
| C5 | GO:MF | GO Molecular Function | 1820 |
| C5 | HPO | Human Phenotype Ontology | 5653 |
| C6 | Oncogenic Signature | 189 | |
| C7 | IMMUNESIGDB | ImmuneSigDB | 4872 |
| C7 | VAX | HIPC Vaccine Response | 347 |
| C8 | Cell Type Signature | 840 | |
| H | Hallmark | 50 |
Pathway scores
For scoring pathways from differential expression analysis results, we propose two scoring methods:
Overrepresentation Analysis (ORA) treats your significant hits (e.g. DE genes above a chosen cutoff) as a “bag” and asks whether any known pathway is represented more often in that bag than you’d expect by chance—typically via a Fisher’s exact or hypergeometric test against all measured genes. It’s extremely fast and easy to interpret, but hinges on where you set your significance cutoff.
(Fast) Gene Set Enrichment Analysis (FGSEA) avoids that arbitrary cutoff by taking a ranked list of all genes (for example, sorted by fold‑change) and “walking” down it to see if members of each pathway accumulate at the top or bottom. You end up with an enrichment score for each pathway, and statistical significance comes from permuting phenotype labels. GSEA is more sensitive to coordinated, subtle shifts across a pathway.
pathways <- get_annotation_collection(collections,
species = species)
pathwayResult <- pathwayAnalysis(diffexp,
pathways = pathways,
method = "ora")
pathwayResult2 <- pathwayAnalysis(diffexp,
pathways = pathways,
method = "fgsea")
metadata(exp_data)[["pathwayEnrichment"]] <- pathwayResult
metadata(exp_data)[["pathwayEnrichment_fgsea"]] <- pathwayResult2The resulting dot plot summarizes the most enriched
biological pathways across different conditions or samples
plot_pathway_dotplot(exp_data, score_name = "pathwayEnrichment_fgsea")
Single-sample pathway scores
Pathway scores can also be calculated per-sample (single-sample), and compared using statistics like the Wilcoxon rank-sum test to estimate the difference between groups.
pathway_scores <- score_pathways(exp_data, pathways, verbose = FALSE)
metadata(exp_data)[["pathway_scores"]] <- pathway_scoresBoxplots are used to visualize the distribution of a quantitative variables across experimental conditions. Here, we propose visualizing single-sample pathway scores using a variation called a “beeswarm” plot, that highlights all the points belonging to each group. We also add a line representing the mean per group and the results of a Wilcoxon test.
boxplots <- lapply(heatmap_pathways, function(path) {
lapply(annotation, function(annotation) {
plt <- plot_pathway_boxplot(exp_data,
pathway = path,
annotation = annotation,
color_var = annotation,
pt_size = 2,
fname = str_glue(
"results/sup/box_{path}_{annotation}.pdf"))
})
}) |> flatten()
wrap_plots(boxplots, nrows = round(length(boxplots)/2), guides = "collect")
## Heatmap of genes
Chosen gene signature
We can use heatmaps to visualize multiple genes at the same time, and group their expression profiles in our samples. Annotation tracks highlight whether samples group logically according to the expression profile of selected genes.
hms <- lapply(1:length(heatmap_genes), function(i) {
genes <- heatmap_genes[[i]]
name <- ifelse(is.null(names(heatmap_genes)), i, names(heatmap_genes)[i])
hm <- plot_exp_heatmap(exp_data, genes = genes,
annotations = annotation,
show_rownames = ifelse(length(genes) <= 100, TRUE, FALSE),
hm_color_limits = c(-2,2),
show_dend_row = FALSE,
fname = str_glue("results/sup/hm_genes_{name}.pdf"))
})

wrap_plots(hms, ncol = 2)
### Pathway genes
hms_path <- lapply(pathway_genes, function(path) {
path <- str_replace_all(path, " ", "_")
genes <- pathways |> filter(pathway %in% path) |> pull(gene_symbol) |> unique()
hm <- plot_exp_heatmap(exp_data, genes = genes,
annotations = annotation,
show_rownames = ifelse(length(genes) <= 100, TRUE, FALSE),
hm_color_limits = c(-2,2),
show_dend_row = FALSE,
fname = str_glue("results/sup/hm_genes_{path}.pdf"))
})

wrap_plots(hms_path, ncol = 2)