Unsupervised RNA-seq analysis
Cordeliers Artificial Intelligence and Bioinformatics
Source:vignettes/unsup_rnaseq.Rmd
unsup_rnaseq.Rmd
For the 2 examples, we are using an open dataset which is given by
the airway
package. It provides a gene expression dataset
derived from human bronchial epithelial cells, treated
or not with dexamethasone (a corticosteroid).
Here is an example of how the unsupervised part of
the CAIBIrnaseq
package can be use. If you want to use this
notebook for your projects, it is available here
First, we need to import the required packages :
library(airway) # dataset's package
library(SummarizedExperiment) # Object's type we are using
library(CAIBIrnaseq) # package
Before analysing the dataset, we define the variables with the names of the genes / pathways … that we want to analyse.
These variables will change in function of how your dataset is build and which type of data it is.
If you find the definition of the variables heavy in your code, you
can create a .yml
file, for example named
params.yml
, call it in a cell, and it will automatically
import the variables. Your code will be clearer (here is an example)
species <- "Homo sapiens" # Or "Mus musculus"
# Annotation variable to visualize
plot_annotations <- "dex" # Put the name of the condition (here 'dex' is treated or untreated)
# Quality parameters
qc_min_nsamp <- 2
qc_min_gene_counts <- 10
# Clustering of expression
exp_cluster <- data.frame(k = 2) #Number of cluster
# Clustering of metadata
metadata_clusters <- list(
pathway_scores = data.frame(k = 2),
microenv_scores = data.frame(k = 3)
)
# The following variables are those that will need to be modified depending on the analyses you want to do
# Pathway collections
pathway_collections <- c("CGP", "CP", "CP:KEGG_LEGACY", "Hallmark") #See the msigdb table and modify with the interesting collections
# Interesting genes
heatmap_genes <- list(
gr_response_genes <- c("FKBP5", "TSC22D3", "PER1", "ZBTB16"),
anti_inflam_genes <- c("DUSP1", "SOCS1", "MT2A")
) # same here, replace with the genes you are interested in
heatmap_pathways <- c(
"DUTERTRE_ESTRADIOL_RESPONSE_24HR_DN",
"REN_ALVEOLAR_RHABDOMYOSARCOMA_DN",
"NUYTTEN_EZH2_TARGETS_UP",
"PASINI_SUZ12_TARGETS_DN"
) #Same
# Genes for the boxplots
boxplot_genes <- c("FKBP5", "TSC22D3") #same
# Pathways for the boxplots
boxplot_pathways <- c(
"KUMAMOTO_RESPONSE_TO_NUTLIN_3A_UP",
"CASTELLANO_HRAS_TARGETS_DN"
) #same
# Corrélations entre gènes
correlation_genes <- list(
c("FKBP5", "TSC22D3"),
c("FKBP5", "ZBTB16")
)
# Pathways correlation
correlation_pathways <- list(
c("DUTERTRE_ESTRADIOL_RESPONSE_24HR_DN", "REN_ALVEOLAR_RHABDOMYOSARCOMA_DN"),
c("REN_ALVEOLAR_RHABDOMYOSARCOMA_DN", "NUYTTEN_EZH2_TARGETS_UP")
) #same
Load data
This section loads the RNA-seq dataset for analysis. It ensures the correct input file is used, as specified in the parameters. rebase_gexp
Ensure your dataset is in a Summarized Experiment
object, because all the used functions below works with
SummarizedExperiment input.
If you want to know more about this type of object, please click here: Bioconductor
data(airway, package="airway")
exp_data <- airway
# If you are using your own dataset .RDS file), use this command line :
# exp_data <- readRDS(data_file)
Even if the datasets are globally build the same way, the names of the variables are not exactly the same, so if we want to keep the same code, we need to redefine a bit the variables.
If you want to know what are the used variables in this part, run
this command line :
colnames(SummarizedExperiment::rowData(exp_data))
You should have these variables (with these exact same names):
- gene_name : The commonly used symbol or name for the
gene (e.g., A1BG).
- gene_id : A unique and stable identifier for the
gene, often from databases like Ensembl.
- gene_length_kb : The length of the gene measured in
kilobases
- gene_description : A brief textual summary of the
gene’s function or characteristics. - gene_biotype : A
classification of the gene based on its biological function or
transcript type, such as protein_coding, lncRNA, or pseudogene.
If you are using the notebook that Clarice GROENEVELD created to convert a .xsl file into a .RDS one, you can SKIP the next cell. If not, you should look at how your dataset is defined. You might need to run some command line as the following ones:
library(biomaRt)
# The goal here is to match the existing columns with the expected ones
rowData(exp_data)$gene_length_kb <-
(rowData(exp_data)$gene_seq_end - rowData(exp_data)$gene_seq_start) / 1000
mart <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl")
gene_ids <- rowData(exp_data)$gene_id
annot <- getBM(attributes = c("ensembl_gene_id", "description"),
filters = "ensembl_gene_id",
values = gene_ids,
mart = mart)
matched <- match(rowData(exp_data)$gene_id, annot$ensembl_gene_id)
rowData(exp_data)$gene_description <- annot$description[matched]
Pre-processing
Most datasets use ensembl gene ID by default after alignment, so this step rebases the expression data to gene names. This ensures consistency in naming for downstream analyses.
exp_data <- rebase_gexp(exp_data, annotation = "gene_name")
Filter
Here, we filter out genes expressed in too few samples or with very low counts. This removes noise from the data and focuses on meaningful gene expressions.
exp_data <- filter_gexp(exp_data,
min_nsamp = 1,
min_counts = 1)
Visualization of the filtering process to ensure the criteria applied align with the dataset’s characteristics:
colData(exp_data)$sample_id <- colnames(exp_data)
plot_qc_filters(exp_data)
Normalize
Here, we apply a normalization to the expression data, making samples
comparable by reducing variability due to technical differences. For
datasets with few samples, rlog
is the preferred
normalization and when more samples are present, vst
is
applied.
exp_data <- normalize_gexp(exp_data)
PCA
Principal component analysis (PCA) identifies the major patterns in the dataset. These patterns help explore similarities or differences among samples based on gene expression.
pca_res = pca_gexp(exp_data)
exp_data@metadata[["pca_res"]] <- pca_res
annotations <- setdiff(plot_annotations, c("exp_cluster", "path_cluster"))
plot_pca(exp_data, color = plot_annotations)
If you want something more visual, you can add a circular/oval shape
to circle the different genotypes of samples, use the
fviz_pca_ind
function from the factoextra
package. With this dataset, it is not relevant but it can be with your
persona dataset. Here, it highlights the fact that the 2 groups are not
crossing each other. The trt
group has more PC1, whereas
the untrt
group has less. We could conclude that PC1 is
more represented in the treated samples.
library(factoextra)
groups <- SummarizedExperiment::colData(exp_data)$dex # here we want to split in function of `treated` and `untreated`
fviz_pca_ind(pca_res,
geom = "point",
habillage = groups,
palette = c("#00AFBB", "#E7B800"), # Personalized colors
addEllipses = TRUE,
ellipse.type = "confidence",
repel = TRUE,
label = "none"
)
Unsupervised clustering
Here, we group samples based on expression patterns without prior knowledge using hierarchical clustering on either a selected gene list from the parameters or, by default, the 2000 most highly expressed genes.
This can be useful for discovering sample subgroups or new biological insights.
exp_data <- cluster_exp(exp_data, k = exp_cluster$k, genes = exp_cluster$genes, n_pcs = 3)
Visual representation of expression levels for HVG across clusters, highlighting distinct patterns.
hvg <- highly_variable_genes(exp_data)
exp_cluster <- data.frame(k = 2)
hm <- plot_exp_heatmap(exp_data, genes = hvg,
annotations = c(plot_annotations, "exp_cluster"),
show_rownames = FALSE,
hm_color_limits = c(-2,2),
fname = "results/unsup/clustering/heatmap_2000hvg_exp_cluster.pdf")
hm
Pathway activity
Pathway analysis enables us to understand the functional implications of gene expression changes. Here, we analyze the dataset for pathway activity using two methods.
PROGENy
PROGENy is a collection of only 14 core pathway responsive genes from large signaling perturbation experiments. For more information see the original paper.
The returned plot will give us information about the pathways that
are activated for each sample. There is especially one pathway that is
highly activated : EGFR
, in the sample
SRR1039517
progeny_scores <- score_progeny(exp_data, species = "Homo sapiens")
progeny_scores <- score_progeny(exp_data, species = "Homo sapiens")
metadata(exp_data)[["progeny_scores"]] <- progeny_scores
plot_progeny_heatmap(exp_data, annotations = plot_annotations,
fname = "results/unsup/pathways/hm_progeny_scores.pdf")
write.csv(progeny_scores, file = "results/unsup/pathways/progeny_scores.csv")
Pathways
Ensure your dataset is in a Summarized Experiment
object, because all the used functions below works with
SummarizedExperiment input.
Pathway collections available in the MSIGdb can be specified in the
parameters. These pathways are scored and ranked by their variance in
the data. These are the available collections (use
gs_subcollection
as name except for Hallmarks, which should
be ‘H’).
library(msigdbr)
library(dplyr)
library(kableExtra)
msigdbr::msigdbr_collections() |>
kableExtra::kbl() |>
kableExtra::kable_styling() |>
kableExtra::scroll_box(height = "300px")
gs_collection | gs_subcollection | gs_collection_name | num_genesets |
---|---|---|---|
C1 | Positional | 302 | |
C2 | CGP | Chemical and Genetic Perturbations | 3494 |
C2 | CP | Canonical Pathways | 19 |
C2 | CP:BIOCARTA | BioCarta Pathways | 292 |
C2 | CP:KEGG_LEGACY | KEGG Legacy Pathways | 186 |
C2 | CP:KEGG_MEDICUS | KEGG Medicus Pathways | 658 |
C2 | CP:PID | PID Pathways | 196 |
C2 | CP:REACTOME | Reactome Pathways | 1736 |
C2 | CP:WIKIPATHWAYS | WikiPathways | 830 |
C3 | MIR:MIRDB | miRDB | 2377 |
C3 | MIR:MIR_LEGACY | MIR_Legacy | 221 |
C3 | TFT:GTRD | GTRD | 505 |
C3 | TFT:TFT_LEGACY | TFT_Legacy | 610 |
C4 | 3CA | Curated Cancer Cell Atlas gene sets | 148 |
C4 | CGN | Cancer Gene Neighborhoods | 427 |
C4 | CM | Cancer Modules | 431 |
C5 | GO:BP | GO Biological Process | 7608 |
C5 | GO:CC | GO Cellular Component | 1026 |
C5 | GO:MF | GO Molecular Function | 1820 |
C5 | HPO | Human Phenotype Ontology | 5653 |
C6 | Oncogenic Signature | 189 | |
C7 | IMMUNESIGDB | ImmuneSigDB | 4872 |
C7 | VAX | HIPC Vaccine Response | 347 |
C8 | Cell Type Signature | 840 | |
H | Hallmark | 50 |
pathways <- get_annotation_collection(pathway_collections,
species = species)
pathway_scores <- score_pathways(exp_data, pathways, verbose = FALSE)
metadata(exp_data)[["pathway_scores"]] <- pathway_scores
collections <- pathway_collections |>
paste(collapse = "_") |>
stringr::str_remove("\\:")
plot_pathway_heatmap(exp_data, annotations = plot_annotations,
fwidth = 9,
fname = stringr::str_glue(
"results/unsup/pathways/hm_paths_{collections}_top20.pdf")
)
Microenvironment scores
This step calculates immune and stromal cell type abundances using MCPcounter or mMCPcounter. It helps to infer the composition of the tumor microenvironment or similar contexts.
mcp_scores <- mcp_counter(exp_data, species = species)
S4Vectors::metadata(exp_data)[["microenv_scores"]] <- mcp_scores
plot_microenv_heatmap(exp_data, annotations = c("dex", "exp_cluster"),
fname = "results/unsup/tme/heatSmap_mcpcounter.pdf")
write.csv(mcp_scores, file = "results/unsup/tme/scores_mcpcounter.csv")
By default, the rows are order by scores. But, the
plot_microenv_heatsmap
function has the
ellipsis argument
. That means that this function can have a
wide range of inputs. So, it is possible to plot the rows in a different
order than the default one :
plot_microenv_heatmap(exp_data,
annotations = c("dex", "exp_cluster"),
fname = "results/unsup/tme/heatmap_sorted_bydex.pdf",
cluster_rows = FALSE)
Targeted plots
This section focuses on visualizing specific genes or pathways of interest, as specified in the parameters.
Heatmaps
Generates heatmaps for pre-selected genes of interest to observe their expression across samples or conditions.
hms <- lapply(1:length(heatmap_genes), function(i) {
gene_annot <- SummarizedExperiment::rowData(exp_data)
genes <- heatmap_genes[[i]]
name <- ifelse(is.null(names(heatmap_genes)), i, names(heatmap_genes)[i])
plot_exp_heatmap(exp_data, genes = genes,
annotations = plot_annotations,
fname = stringr::str_glue("results/unsup/targeted/hm_genes_{i}.pdf"))
})
patchwork::wrap_plots(hms, ncol = 2, guides = "collect")
Selected pathways
valid_pathways <- intersect(heatmap_pathways, rownames(pathway_scores))
plot_pathway_heatmap(exp_data,
annotations = plot_annotations,
pathways = valid_pathways,
fname = stringr::str_glue("results/unsup/targeted/hm_pathways_selected.pdf"))
Boxplots
Boxplots provide a clear comparison of expression levels across experimental groups or conditions. All jobs still running at 10:00 on Friday will be killed as part of the maintenance. #### Selected genes
genes <- boxplot_genes
annotations <- plot_annotations
boxplots <- lapply(genes, function(gene) {
lapply(annotations, function(annotation) {
plt <- plot_exp_boxplot(exp_data, gene = gene,
annotation = annotation,
color_var = annotation,
pt_size = 2,
fname = stringr::str_glue("results/unsup/targeted/boxplots/box_{gene}_{annotation}.pdf"))
})
}) |> purrr::flatten()
patchwork::wrap_plots(boxplots, nrows = round(length(boxplots)/2), guides = "collect")
Selected pathways
paths <- boxplot_pathways
annotations <- plot_annotations
boxplots <- lapply(paths, function(path) {
lapply(annotations, function(annotation) {
plt <- plot_path_boxplot(exp_data,
pathway = path,
annotation = annotation,
color_var = annotation,
pt_size = 2,
fname = stringr::str_glue("results/unsup/targeted/boxplots/box_{path}_{annotation}.pdf"))
})
}) |> purrr::flatten()
patchwork::wrap_plots(boxplots, nrows = round(length(boxplots)/2), guides = "collect")
Correlations
This section visualizes relationships between pairs of genes or pathways by plotting their expression/activity correlations. Correlation analysis can reveal important co-regulation or interaction patterns, helping to uncover biologically meaningful relationships.
Selected genes
Here we plot the correlation between selected gene pairs across the dataset. Each pair is plotted separately, and color-coded by sample annotation.
gene_pairs <- correlation_genes
annotations <- plot_annotations
cor_plts <- lapply(gene_pairs, function(gene_pair) {
lapply(annotations, function(annot) {
# Correction de la construction du nom de fichier
fname <- stringr::str_glue("results/unsup/targeted/correlations/cor_{gene_pair[1]}_{gene_pair[2]}_color={annot}.pdf")
# Génération du scatter plot
plot_exp_scatter(exp_data,
gene1 = gene_pair[1],
gene2 = gene_pair[2],
color_var = annot,
fname = fname)
})
}) |> purrr::flatten()
patchwork::wrap_plots(cor_plts, nrows = round(length(cor_plts)/2), guides = "collect")
Selected pathways
Correlation plots for selected pathways can help identify similarities or differences in pathway activity patterns across samples. Each pathway pair is plotted separately and color-coded by sample annotation to illustrate trends within each condition.
path_pairs <-correlation_pathways
annotations <- plot_annotations
cor_plts <- lapply(path_pairs, function(path_pair) {
lapply(annotations, function(annot) {
plot_path_scatter(exp_data,
pathway1 = path_pair[1],
pathway2 = path_pair[2],## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
color_var = annot,
fname = stringr::str_glue(
"results/unsup/targeted/correlations/cor_{path_pair[1]}_{path_pair[2]}_color={annot}.pdf"))
})
}) |> purrr::flatten()
patchwork::wrap_plots(cor_plts, nrows = round(length(cor_plts)/2), guides = "collect")
Cluster using metadata
types = names(metadata_clusters)
for(type in types) {
exp_data <- cluster_metadata(exp_data,
metadata_name = type,
k = metadata_clusters[[type]]$k,
features = metadata_clusters[[type]]$features,
n_pcs = 3 )
}
Save SummarizedExperiment
The final step saves the processed dataset and results. This ensures all outputs can be revisited or shared for further analysis.
Report parameters
For reproducibility, the parameters used in the analysis and the computational environment details are documented.
sessionInfo
The sessionInfo()
prints out all packages loaded at the
time of analysis, as well as their versions.
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Paris
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] kableExtra_1.4.0 dplyr_1.1.4
## [3] msigdbr_10.0.2 factoextra_1.0.7
## [5] ggplot2_3.5.2 biomaRt_2.64.0
## [7] CAIBIrnaseq_1.0.0 R.utils_2.13.0
## [9] R.oo_1.27.1 R.methodsS3_1.8.2
## [11] airway_1.28.0 SummarizedExperiment_1.38.1
## [13] Biobase_2.68.0 GenomicRanges_1.60.0
## [15] GenomeInfoDb_1.44.0 IRanges_2.42.0
## [17] S4Vectors_0.46.0 BiocGenerics_0.54.0
## [19] generics_0.1.3 MatrixGenerics_1.20.0
## [21] matrixStats_1.5.0
##
## loaded via a namespace (and not attached):
## [1] ggplotify_0.1.2 filelock_1.0.3
## [3] tibble_3.2.1 graph_1.86.0
## [5] XML_3.99-0.18 lifecycle_1.0.4
## [7] httr2_1.1.2 rstatix_0.7.2
## [9] lattice_0.22-5 crosstalk_1.2.1
## [11] backports_1.5.0 magrittr_2.0.3
## [13] plotly_4.10.4 sass_0.4.10
## [15] rmarkdown_2.29 jquerylib_0.1.4
## [17] yaml_2.3.10 rlist_0.4.6.2
## [19] cowplot_1.1.3 DBI_1.2.3
## [21] RColorBrewer_1.1-3 eulerr_7.0.2
## [23] lubridate_1.9.4 abind_1.4-8
## [25] purrr_1.0.4 yulab.utils_0.2.0
## [27] rappdirs_0.3.3 GenomeInfoDbData_1.2.14
## [29] ggrepel_0.9.6 irlba_2.3.5.1
## [31] tidytree_0.4.6 GSVA_2.2.0
## [33] MCPcounter_1.2.0 annotate_1.86.0
## [35] svglite_2.1.3 pkgdown_2.1.2
## [37] codetools_0.2-20 DelayedArray_0.34.1
## [39] xml2_1.3.8 tidyselect_1.2.1
## [41] aplot_0.2.5 UCSC.utils_1.4.0
## [43] farver_2.1.2 ScaledMatrix_1.16.0
## [45] BiocFileCache_2.16.0 jsonlite_2.0.0
## [47] Formula_1.2-5 systemfonts_1.2.3
## [49] tools_4.5.0 progress_1.2.3
## [51] treeio_1.32.0 ragg_1.4.0
## [53] Rcpp_1.0.14 glue_1.8.0
## [55] gridExtra_2.3 SparseArray_1.8.0
## [57] xfun_0.52 DESeq2_1.48.0
## [59] HDF5Array_1.36.0 withr_3.0.2
## [61] fastmap_1.2.0 rhdf5filters_1.20.0
## [63] digest_0.6.37 rsvd_1.0.5
## [65] timechange_0.3.0 R6_2.6.1
## [67] gridGraphics_0.5-1 textshaping_1.0.1
## [69] RSQLite_2.3.10 h5mread_1.0.0
## [71] tidyr_1.3.1 data.table_1.17.0
## [73] prettyunits_1.2.0 httr_1.4.7
## [75] htmlwidgets_1.6.4 S4Arrays_1.8.0
## [77] pkgconfig_2.0.3 gtable_0.3.6
## [79] progeny_1.30.0 blob_1.2.4
## [81] SingleCellExperiment_1.30.0 XVector_0.48.0
## [83] htmltools_0.5.8.1 carData_3.0-5
## [85] fgsea_1.34.0 msigdbdf_24.1.0
## [87] GSEABase_1.70.0 scales_1.4.0
## [89] tidyverse_2.0.0 png_0.1-8
## [91] SpatialExperiment_1.18.0 ggfun_0.1.8
## [93] knitr_1.50 rstudioapi_0.17.1
## [95] tzdb_0.5.0 reshape2_1.4.4
## [97] rjson_0.2.23 nlme_3.1-168
## [99] curl_6.2.2 cachem_1.1.0
## [101] rhdf5_2.52.0 stringr_1.5.1
## [103] parallel_4.5.0 vipor_0.4.7
## [105] AnnotationDbi_1.70.0 desc_1.4.3
## [107] pillar_1.10.2 grid_4.5.0
## [109] vctrs_0.6.5 ggpubr_0.6.0
## [111] BiocSingular_1.24.0 car_3.1-3
## [113] dbplyr_2.5.0 beachmat_2.24.0
## [115] xtable_1.8-4 beeswarm_0.4.0
## [117] evaluate_1.0.3 readr_2.1.5
## [119] magick_2.8.6 cli_3.6.5
## [121] locfit_1.5-9.12 compiler_4.5.0
## [123] rlang_1.1.6 crayon_1.5.3
## [125] ggsignif_0.6.4.9000 labeling_0.4.3
## [127] plyr_1.8.9 forcats_1.0.0
## [129] fs_1.6.6 ggbeeswarm_0.7.2
## [131] stringi_1.8.7 viridisLite_0.4.2
## [133] BiocParallel_1.42.0 assertthat_0.2.1
## [135] babelgene_22.9 Biostrings_2.76.0
## [137] lazyeval_0.2.2 ggheatmapper_0.2.1
## [139] Matrix_1.7-3 hms_1.1.3
## [141] patchwork_1.3.0 sparseMatrixStats_1.20.0
## [143] bit64_4.6.0-1 Rhdf5lib_1.30.0
## [145] KEGGREST_1.48.0 broom_1.0.8
## [147] memoise_2.0.1 bslib_0.9.0
## [149] ggtree_3.16.0 fastmatch_1.1-6
## [151] bit_4.6.0 ape_5.8-1