Unsupervised RNA-seq analysis

For the 2 examples, we are using an open dataset which is given by the airway package. It provides a gene expression dataset derived from human bronchial epithelial cells, treated or not with dexamethasone (a corticosteroid).

Here is an example of how the unsupervised part of the CAIBIrnaseq package can be use. If you want to use this notebook for your projects, it is available here

First, we need to import the required packages :

library(airway)               # dataset's package  
library(SummarizedExperiment) # Object's type we are using
library(CAIBIrnaseq)          # package

Before analysing the dataset, we define the variables with the names of the genes / pathways … that we want to analyse.

These variables will change in function of how your dataset is build and which type of data it is.

If you find the definition of the variables heavy in your code, you can create a .yml file, for example named params.yml, call it in a cell, and it will automatically import the variables. Your code will be clearer (here is an example)

species <- "Homo sapiens"   # Or "Mus musculus"

# Annotation variable to visualize
plot_annotations <- "dex"  # Put the name of the condition (here 'dex' is treated or untreated)

# Quality parameters
qc_min_nsamp <- 2
qc_min_gene_counts <- 10  

# Clustering of expression
exp_cluster <- data.frame(k = 2) #Number of cluster

# Clustering of metadata
metadata_clusters <- list(
  pathway_scores = data.frame(k = 2),
  microenv_scores = data.frame(k = 3)
)

# The following variables are those that will need to be modified depending on the analyses you want to do  

# Pathway collections 
pathway_collections <- c("CGP", "CP", "CP:KEGG_LEGACY", "Hallmark") #See the msigdb table and modify with the interesting collections

# Interesting genes
heatmap_genes <- list(
  gr_response_genes <- c("FKBP5", "TSC22D3", "PER1", "ZBTB16"),
  anti_inflam_genes <- c("DUSP1", "SOCS1", "MT2A")
)     # same here, replace with the genes you are interested in

heatmap_pathways <- c(
  "DUTERTRE_ESTRADIOL_RESPONSE_24HR_DN",
  "REN_ALVEOLAR_RHABDOMYOSARCOMA_DN",
  "NUYTTEN_EZH2_TARGETS_UP",
  "PASINI_SUZ12_TARGETS_DN"
) #Same

# Genes for the boxplots
boxplot_genes <- c("FKBP5", "TSC22D3")  #same

# Pathways for the boxplots
boxplot_pathways <- c(
  "KUMAMOTO_RESPONSE_TO_NUTLIN_3A_UP",
  "CASTELLANO_HRAS_TARGETS_DN"
) #same

# Corrélations entre gènes
correlation_genes <- list(
  c("FKBP5", "TSC22D3"),
  c("FKBP5", "ZBTB16")
)

# Pathways correlation
correlation_pathways <- list(
  c("DUTERTRE_ESTRADIOL_RESPONSE_24HR_DN", "REN_ALVEOLAR_RHABDOMYOSARCOMA_DN"),
  c("REN_ALVEOLAR_RHABDOMYOSARCOMA_DN", "NUYTTEN_EZH2_TARGETS_UP")
)   #same

Load data

This section loads the RNA-seq dataset for analysis. It ensures the correct input file is used, as specified in the parameters. rebase_gexp

Ensure your dataset is in a Summarized Experiment object, because all the used functions below works with SummarizedExperiment input.

If you want to know more about this type of object, please click here: Bioconductor

data(airway, package="airway")
exp_data <- airway

# If you are using your own dataset .RDS file), use this command line : 
# exp_data <- readRDS(data_file)

Even if the datasets are globally build the same way, the names of the variables are not exactly the same, so if we want to keep the same code, we need to redefine a bit the variables.

If you want to know what are the used variables in this part, run this command line : colnames(SummarizedExperiment::rowData(exp_data))

You should have these variables (with these exact same names):
- gene_name : The commonly used symbol or name for the gene (e.g., A1BG).
- gene_id : A unique and stable identifier for the gene, often from databases like Ensembl.
- gene_length_kb : The length of the gene measured in kilobases
- gene_description : A brief textual summary of the gene’s function or characteristics. - gene_biotype : A classification of the gene based on its biological function or transcript type, such as protein_coding, lncRNA, or pseudogene.

If you are using the notebook that Clarice GROENEVELD created to convert a .xsl file into a .RDS one, you can SKIP the next cell. If not, you should look at how your dataset is defined. You might need to run some command line as the following ones:

library(biomaRt)

# The goal here is to match the existing columns with the expected ones

rowData(exp_data)$gene_length_kb <- 
  (rowData(exp_data)$gene_seq_end - rowData(exp_data)$gene_seq_start) / 1000

mart <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl")
gene_ids <- rowData(exp_data)$gene_id

annot <- getBM(attributes = c("ensembl_gene_id", "description"),
               filters = "ensembl_gene_id",
               values = gene_ids,
               mart = mart)

matched <- match(rowData(exp_data)$gene_id, annot$ensembl_gene_id)
rowData(exp_data)$gene_description <- annot$description[matched]

Pre-processing

Most datasets use ensembl gene ID by default after alignment, so this step rebases the expression data to gene names. This ensures consistency in naming for downstream analyses.

exp_data <- rebase_gexp(exp_data, annotation = "gene_name")

Filter

Here, we filter out genes expressed in too few samples or with very low counts. This removes noise from the data and focuses on meaningful gene expressions.

exp_data <- filter_gexp(exp_data,
                        min_nsamp = 1, 
                        min_counts = 1)

Visualization of the filtering process to ensure the criteria applied align with the dataset’s characteristics:

colData(exp_data)$sample_id <- colnames(exp_data)
plot_qc_filters(exp_data)

Normalize

Here, we apply a normalization to the expression data, making samples comparable by reducing variability due to technical differences. For datasets with few samples, rlog is the preferred normalization and when more samples are present, vst is applied.

exp_data <- normalize_gexp(exp_data)

PCA

Principal component analysis (PCA) identifies the major patterns in the dataset. These patterns help explore similarities or differences among samples based on gene expression.

pca_res = pca_gexp(exp_data)
exp_data@metadata[["pca_res"]] <- pca_res

annotations <- setdiff(plot_annotations, c("exp_cluster", "path_cluster"))
plot_pca(exp_data, color = plot_annotations)

If you want something more visual, you can add a circular/oval shape to circle the different genotypes of samples, use the fviz_pca_ind function from the factoextra package. With this dataset, it is not relevant but it can be with your persona dataset. Here, it highlights the fact that the 2 groups are not crossing each other. The trt group has more PC1, whereas the untrt group has less. We could conclude that PC1 is more represented in the treated samples.

library(factoextra)
groups <- SummarizedExperiment::colData(exp_data)$dex  # here we want to split in function of `treated` and `untreated`

fviz_pca_ind(pca_res,
             geom = "point",
             habillage = groups,
             palette = c("#00AFBB", "#E7B800"),  # Personalized colors
             addEllipses = TRUE,
             ellipse.type = "confidence",
             repel = TRUE,
             label = "none"
)

Unsupervised clustering

Here, we group samples based on expression patterns without prior knowledge using hierarchical clustering on either a selected gene list from the parameters or, by default, the 2000 most highly expressed genes.

This can be useful for discovering sample subgroups or new biological insights.

exp_data <- cluster_exp(exp_data, k = exp_cluster$k, genes = exp_cluster$genes, n_pcs = 3)

Visual representation of expression levels for HVG across clusters, highlighting distinct patterns.

hvg <- highly_variable_genes(exp_data)
exp_cluster <- data.frame(k = 2)
hm <- plot_exp_heatmap(exp_data,  genes = hvg, 
                 annotations = c(plot_annotations, "exp_cluster"),
                 show_rownames = FALSE,
                 hm_color_limits = c(-2,2),
                 fname = "results/unsup/clustering/heatmap_2000hvg_exp_cluster.pdf")
hm

Pathway activity

Pathway analysis enables us to understand the functional implications of gene expression changes. Here, we analyze the dataset for pathway activity using two methods.

PROGENy

PROGENy is a collection of only 14 core pathway responsive genes from large signaling perturbation experiments. For more information see the original paper.

The returned plot will give us information about the pathways that are activated for each sample. There is especially one pathway that is highly activated : EGFR , in the sample SRR1039517

progeny_scores <- score_progeny(exp_data, species = "Homo sapiens")

progeny_scores <- score_progeny(exp_data, species = "Homo sapiens")

metadata(exp_data)[["progeny_scores"]] <- progeny_scores   

plot_progeny_heatmap(exp_data, annotations = plot_annotations,
                     fname = "results/unsup/pathways/hm_progeny_scores.pdf")

write.csv(progeny_scores, file = "results/unsup/pathways/progeny_scores.csv")

Pathways

Ensure your dataset is in a Summarized Experiment object, because all the used functions below works with SummarizedExperiment input.

Pathway collections available in the MSIGdb can be specified in the parameters. These pathways are scored and ranked by their variance in the data. These are the available collections (use gs_subcollection as name except for Hallmarks, which should be ‘H’).

library(msigdbr)
library(dplyr)
library(kableExtra)


msigdbr::msigdbr_collections() |>
  kableExtra::kbl() |>
  kableExtra::kable_styling() |>
  kableExtra::scroll_box(height = "300px")

gs_collection	gs_subcollection	gs_collection_name	num_genesets
C1		Positional	302
C2	CGP	Chemical and Genetic Perturbations	3494
C2	CP	Canonical Pathways	19
C2	CP:BIOCARTA	BioCarta Pathways	292
C2	CP:KEGG_LEGACY	KEGG Legacy Pathways	186
C2	CP:KEGG_MEDICUS	KEGG Medicus Pathways	658
C2	CP:PID	PID Pathways	196
C2	CP:REACTOME	Reactome Pathways	1736
C2	CP:WIKIPATHWAYS	WikiPathways	830
C3	MIR:MIRDB	miRDB	2377
C3	MIR:MIR_LEGACY	MIR_Legacy	221
C3	TFT:GTRD	GTRD	505
C3	TFT:TFT_LEGACY	TFT_Legacy	610
C4	3CA	Curated Cancer Cell Atlas gene sets	148
C4	CGN	Cancer Gene Neighborhoods	427
C4	CM	Cancer Modules	431
C5	GO:BP	GO Biological Process	7608
C5	GO:CC	GO Cellular Component	1026
C5	GO:MF	GO Molecular Function	1820
C5	HPO	Human Phenotype Ontology	5653
C6		Oncogenic Signature	189
C7	IMMUNESIGDB	ImmuneSigDB	4872
C7	VAX	HIPC Vaccine Response	347
C8		Cell Type Signature	840
H		Hallmark	50

pathways <- get_annotation_collection(pathway_collections, 
                                      species = species)
pathway_scores <- score_pathways(exp_data, pathways, verbose = FALSE)
metadata(exp_data)[["pathway_scores"]] <- pathway_scores

collections <- pathway_collections |> 
  paste(collapse = "_") |>
  stringr::str_remove("\\:")

plot_pathway_heatmap(exp_data, annotations = plot_annotations,
                    fwidth = 9,
                    fname = stringr::str_glue(
                      "results/unsup/pathways/hm_paths_{collections}_top20.pdf")
                    )

write.csv(pathways, file = stringr::str_glue("results/unsup/pathways/paths_{collections}.csv"))

Microenvironment scores

This step calculates immune and stromal cell type abundances using MCPcounter or mMCPcounter. It helps to infer the composition of the tumor microenvironment or similar contexts.

mcp_scores <- mcp_counter(exp_data, species = species)
S4Vectors::metadata(exp_data)[["microenv_scores"]] <- mcp_scores

plot_microenv_heatmap(exp_data, annotations = c("dex", "exp_cluster"),
                      fname = "results/unsup/tme/heatSmap_mcpcounter.pdf")

write.csv(mcp_scores, file = "results/unsup/tme/scores_mcpcounter.csv")

By default, the rows are order by scores. But, the plot_microenv_heatsmap function has the ellipsis argument. That means that this function can have a wide range of inputs. So, it is possible to plot the rows in a different order than the default one :

plot_microenv_heatmap(exp_data,
                      annotations = c("dex", "exp_cluster"),
                      fname = "results/unsup/tme/heatmap_sorted_bydex.pdf",
                      cluster_rows = FALSE)

Targeted plots

This section focuses on visualizing specific genes or pathways of interest, as specified in the parameters.

Heatmaps

Generates heatmaps for pre-selected genes of interest to observe their expression across samples or conditions.

hms <- lapply(1:length(heatmap_genes), function(i) {
  gene_annot <- SummarizedExperiment::rowData(exp_data)
  genes <- heatmap_genes[[i]]
  name <- ifelse(is.null(names(heatmap_genes)), i, names(heatmap_genes)[i])
  plot_exp_heatmap(exp_data, genes = genes, 
                   annotations = plot_annotations,
                   fname = stringr::str_glue("results/unsup/targeted/hm_genes_{i}.pdf"))
})
patchwork::wrap_plots(hms, ncol = 2, guides = "collect")

Selected pathways

valid_pathways <- intersect(heatmap_pathways, rownames(pathway_scores))

plot_pathway_heatmap(exp_data, 
                     annotations = plot_annotations, 
                     pathways = valid_pathways,
                     fname = stringr::str_glue("results/unsup/targeted/hm_pathways_selected.pdf"))

Boxplots

Boxplots provide a clear comparison of expression levels across experimental groups or conditions. All jobs still running at 10:00 on Friday will be killed as part of the maintenance. #### Selected genes

genes <- boxplot_genes
annotations <- plot_annotations

boxplots <- lapply(genes, function(gene) {
  lapply(annotations, function(annotation) {
    plt <- plot_exp_boxplot(exp_data, gene = gene, 
                   annotation = annotation, 
                   color_var = annotation, 
                   pt_size = 2,
                   fname = stringr::str_glue("results/unsup/targeted/boxplots/box_{gene}_{annotation}.pdf"))
  })
}) |> purrr::flatten()

patchwork::wrap_plots(boxplots, nrows = round(length(boxplots)/2), guides = "collect")

Selected pathways

paths <- boxplot_pathways
annotations <- plot_annotations

boxplots <- lapply(paths, function(path) {
  lapply(annotations, function(annotation) {
    plt <- plot_path_boxplot(exp_data, 
                             pathway = path,
                   annotation = annotation, 
                   color_var = annotation, 
                   pt_size = 2,
                   fname = stringr::str_glue("results/unsup/targeted/boxplots/box_{path}_{annotation}.pdf"))
  })
}) |> purrr::flatten()

patchwork::wrap_plots(boxplots, nrows = round(length(boxplots)/2), guides = "collect")

Correlations

This section visualizes relationships between pairs of genes or pathways by plotting their expression/activity correlations. Correlation analysis can reveal important co-regulation or interaction patterns, helping to uncover biologically meaningful relationships.

Selected genes

Here we plot the correlation between selected gene pairs across the dataset. Each pair is plotted separately, and color-coded by sample annotation.

gene_pairs <- correlation_genes
annotations <- plot_annotations

cor_plts <- lapply(gene_pairs, function(gene_pair) {
  lapply(annotations, function(annot) {
    # Correction de la construction du nom de fichier
    fname <- stringr::str_glue("results/unsup/targeted/correlations/cor_{gene_pair[1]}_{gene_pair[2]}_color={annot}.pdf")
    
    # Génération du scatter plot
    plot_exp_scatter(exp_data, 
                     gene1 = gene_pair[1],
                     gene2 = gene_pair[2], 
                     color_var = annot,
                     fname = fname)
  })
}) |> purrr::flatten()

patchwork::wrap_plots(cor_plts, nrows = round(length(cor_plts)/2), guides = "collect")

Selected pathways

Correlation plots for selected pathways can help identify similarities or differences in pathway activity patterns across samples. Each pathway pair is plotted separately and color-coded by sample annotation to illustrate trends within each condition.

path_pairs <-correlation_pathways
annotations <- plot_annotations

cor_plts <- lapply(path_pairs, function(path_pair) {
  lapply(annotations, function(annot) {
      plot_path_scatter(exp_data, 
                   pathway1 = path_pair[1],
                   pathway2 = path_pair[2],## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was 
                   color_var = annot,
                   fname = stringr::str_glue(
                     "results/unsup/targeted/correlations/cor_{path_pair[1]}_{path_pair[2]}_color={annot}.pdf"))
  })
}) |> purrr::flatten()

patchwork::wrap_plots(cor_plts, nrows = round(length(cor_plts)/2), guides = "collect")

Cluster using metadata

types = names(metadata_clusters)

for(type in types) {
  exp_data <- cluster_metadata(exp_data, 
                   metadata_name = type, 
                   k = metadata_clusters[[type]]$k, 
                   features = metadata_clusters[[type]]$features,
                   n_pcs = 3 )
}

Save SummarizedExperiment

The final step saves the processed dataset and results. This ensures all outputs can be revisited or shared for further analysis.

saveRDS(exp_data, file = stringr::str_glue("results/unsup/data_SummarizedExp_{lubridate::today()}.RDS"))

Report parameters

For reproducibility, the parameters used in the analysis and the computational environment details are documented.

sessionInfo

The sessionInfo() prints out all packages loaded at the time of analysis, as well as their versions.

sessionInfo()

## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Paris
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0            dplyr_1.1.4                
##  [3] msigdbr_10.0.2              factoextra_1.0.7           
##  [5] ggplot2_3.5.2               biomaRt_2.64.0             
##  [7] CAIBIrnaseq_1.0.0           R.utils_2.13.0             
##  [9] R.oo_1.27.1                 R.methodsS3_1.8.2          
## [11] airway_1.28.0               SummarizedExperiment_1.38.1
## [13] Biobase_2.68.0              GenomicRanges_1.60.0       
## [15] GenomeInfoDb_1.44.0         IRanges_2.42.0             
## [17] S4Vectors_0.46.0            BiocGenerics_0.54.0        
## [19] generics_0.1.3              MatrixGenerics_1.20.0      
## [21] matrixStats_1.5.0          
## 
## loaded via a namespace (and not attached):
##   [1] ggplotify_0.1.2             filelock_1.0.3             
##   [3] tibble_3.2.1                graph_1.86.0               
##   [5] XML_3.99-0.18               lifecycle_1.0.4            
##   [7] httr2_1.1.2                 rstatix_0.7.2              
##   [9] lattice_0.22-5              crosstalk_1.2.1            
##  [11] backports_1.5.0             magrittr_2.0.3             
##  [13] plotly_4.10.4               sass_0.4.10                
##  [15] rmarkdown_2.29              jquerylib_0.1.4            
##  [17] yaml_2.3.10                 rlist_0.4.6.2              
##  [19] cowplot_1.1.3               DBI_1.2.3                  
##  [21] RColorBrewer_1.1-3          eulerr_7.0.2               
##  [23] lubridate_1.9.4             abind_1.4-8                
##  [25] purrr_1.0.4                 yulab.utils_0.2.0          
##  [27] rappdirs_0.3.3              GenomeInfoDbData_1.2.14    
##  [29] ggrepel_0.9.6               irlba_2.3.5.1              
##  [31] tidytree_0.4.6              GSVA_2.2.0                 
##  [33] MCPcounter_1.2.0            annotate_1.86.0            
##  [35] svglite_2.1.3               pkgdown_2.1.2              
##  [37] codetools_0.2-20            DelayedArray_0.34.1        
##  [39] xml2_1.3.8                  tidyselect_1.2.1           
##  [41] aplot_0.2.5                 UCSC.utils_1.4.0           
##  [43] farver_2.1.2                ScaledMatrix_1.16.0        
##  [45] BiocFileCache_2.16.0        jsonlite_2.0.0             
##  [47] Formula_1.2-5               systemfonts_1.2.3          
##  [49] tools_4.5.0                 progress_1.2.3             
##  [51] treeio_1.32.0               ragg_1.4.0                 
##  [53] Rcpp_1.0.14                 glue_1.8.0                 
##  [55] gridExtra_2.3               SparseArray_1.8.0          
##  [57] xfun_0.52                   DESeq2_1.48.0              
##  [59] HDF5Array_1.36.0            withr_3.0.2                
##  [61] fastmap_1.2.0               rhdf5filters_1.20.0        
##  [63] digest_0.6.37               rsvd_1.0.5                 
##  [65] timechange_0.3.0            R6_2.6.1                   
##  [67] gridGraphics_0.5-1          textshaping_1.0.1          
##  [69] RSQLite_2.3.10              h5mread_1.0.0              
##  [71] tidyr_1.3.1                 data.table_1.17.0          
##  [73] prettyunits_1.2.0           httr_1.4.7                 
##  [75] htmlwidgets_1.6.4           S4Arrays_1.8.0             
##  [77] pkgconfig_2.0.3             gtable_0.3.6               
##  [79] progeny_1.30.0              blob_1.2.4                 
##  [81] SingleCellExperiment_1.30.0 XVector_0.48.0             
##  [83] htmltools_0.5.8.1           carData_3.0-5              
##  [85] fgsea_1.34.0                msigdbdf_24.1.0            
##  [87] GSEABase_1.70.0             scales_1.4.0               
##  [89] tidyverse_2.0.0             png_0.1-8                  
##  [91] SpatialExperiment_1.18.0    ggfun_0.1.8                
##  [93] knitr_1.50                  rstudioapi_0.17.1          
##  [95] tzdb_0.5.0                  reshape2_1.4.4             
##  [97] rjson_0.2.23                nlme_3.1-168               
##  [99] curl_6.2.2                  cachem_1.1.0               
## [101] rhdf5_2.52.0                stringr_1.5.1              
## [103] parallel_4.5.0              vipor_0.4.7                
## [105] AnnotationDbi_1.70.0        desc_1.4.3                 
## [107] pillar_1.10.2               grid_4.5.0                 
## [109] vctrs_0.6.5                 ggpubr_0.6.0               
## [111] BiocSingular_1.24.0         car_3.1-3                  
## [113] dbplyr_2.5.0                beachmat_2.24.0            
## [115] xtable_1.8-4                beeswarm_0.4.0             
## [117] evaluate_1.0.3              readr_2.1.5                
## [119] magick_2.8.6                cli_3.6.5                  
## [121] locfit_1.5-9.12             compiler_4.5.0             
## [123] rlang_1.1.6                 crayon_1.5.3               
## [125] ggsignif_0.6.4.9000         labeling_0.4.3             
## [127] plyr_1.8.9                  forcats_1.0.0              
## [129] fs_1.6.6                    ggbeeswarm_0.7.2           
## [131] stringi_1.8.7               viridisLite_0.4.2          
## [133] BiocParallel_1.42.0         assertthat_0.2.1           
## [135] babelgene_22.9              Biostrings_2.76.0          
## [137] lazyeval_0.2.2              ggheatmapper_0.2.1         
## [139] Matrix_1.7-3                hms_1.1.3                  
## [141] patchwork_1.3.0             sparseMatrixStats_1.20.0   
## [143] bit64_4.6.0-1               Rhdf5lib_1.30.0            
## [145] KEGGREST_1.48.0             broom_1.0.8                
## [147] memoise_2.0.1               bslib_0.9.0                
## [149] ggtree_3.16.0               fastmatch_1.1-6            
## [151] bit_4.6.0                   ape_5.8-1

Cordeliers Artificial Intelligence and Bioinformatics

Load data

Pre-processing

Filter

Normalize

PCA

Unsupervised clustering

Pathway activity

PROGENy

Pathways

Microenvironment scores

Targeted plots

Heatmaps

Selected pathways

Boxplots

Selected pathways

Correlations

Selected genes

Selected pathways

Cluster using metadata

Save SummarizedExperiment

Report parameters

sessionInfo