Unsupervised RNA-seq analysis

For the supervised and unsupervised analyses, we are using an open dataset which is given by the airway package. It provides a gene expression dataset derived from human bronchial epithelial cells, treated or not with dexamethasone (a corticosteroid).

Here is an example of how the supervised part of the CAIBIrnaseq package can be use. If you want to use this notebook for your projects, it is available here

First, we need to import the required packages :

library(airway)               # dataset's package  
library(SummarizedExperiment) # Object's type we are using
library(CAIBIrnaseq)          # package

Before analysing the dataset, we define the variables with the names of the genes / pathways we want to analyse.

These variables will change in function of how your dataset is build and which type of data it is.

annotation <- "dex"
diffexpMethod <- "limma" # Either limma or deseq2
species <- "human"
pathwayMethod <- "ora"  # Either ORA or fgsea

# The following variables are those that will need to be modified depending on the analyses you want to do  

collections <- c("CGP", "CP", "CP:KEGG_LEGACY", "Hallmark")

pathwayOfInterest <- list("HALLMARK_INFLAMMATORY_RESPONSE", "GOBP_RESPONSE_TO_GLUCOCORTICOID","GOBP_REGULATION_OF_CELL_PROLIFERATION","HALLMARK_APOPTOSIS")

boxplot_pathways <- list("HALLMARK_INFLAMMATORY_RESPONSE","GOBP_RESPONSE_TO_STEROID_HORMONE")

heatmap_genes <- list(
  gr_response_genes = c("FKBP5", "TSC22D3", "PER1", "ZBTB16"),
  anti_inflam_genes = c("DUSP1", "SOCS1", "MT2A")
)

heatmap_pathways <- c(
  "DUTERTRE_ESTRADIOL_RESPONSE_24HR_DN",
  "REN_ALVEOLAR_RHABDOMYOSARCOMA_DN",
  "NUYTTEN_EZH2_TARGETS_UP",
  "PASINI_SUZ12_TARGETS_DN"
)

Remember to come back in this cell to modify the genes, pathways you are interested in studying.

Load data

This section loads the RNA-seq dataset for analysis. It ensures the correct input file is used, as specified in the parameters. rebase_gexp

Ensure your dataset is in a Summarized Experiment object, because all the used functions below works with this type of input.

If you want to know more about this type of object, please click here: Bioconductor

If you have a .xsl file (Excel), there is an existing notebook to convert it in a .RDS file.

data(airway, package="airway")
exp_data <- airway

# If you are using your own dataset .RDS file), use this command line : 
# exp_data <- readRDS(data_file)

Even if the datasets are globally build the same way, the names of the variables are not exactly the same, so if we want to keep the same code, we need to redefine a bit the variables.

If you want to know what are the used variables in this part, run this command line : colnames(SummarizedExperiment::rowData(exp_data))

You should have these variables (with these exact same names):
- gene_name : The commonly used symbol or name for the gene (e.g., A1BG).
- gene_id : A unique and stable identifier for the gene, often from databases like Ensembl.
- gene_length_kb : The length of the gene measured in kilobases
- gene_description : A brief textual summary of the gene’s function or characteristics. - gene_biotype : A classification of the gene based on its biological function or transcript type, such as protein_coding, lncRNA, or pseudogene.

If you are using the notebook that Clarice GROENEVELD created to convert a .xsl file into a .RDS one, you can SKIP the next cell. If not, you should look at how your dataset is defined. You might need to run some command line as the following ones:

rowData(exp_data)$gene_length_kb <- 
  (rowData(exp_data)$gene_seq_end - rowData(exp_data)$gene_seq_start) / 1000

library(biomaRt)
mart <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl")
gene_ids <- rowData(exp_data)$gene_id

annot <- getBM(attributes = c("ensembl_gene_id", "description"),
               filters = "ensembl_gene_id",
               values = gene_ids,
               mart = mart)
matched <- match(rowData(exp_data)$gene_id, annot$ensembl_gene_id)
rowData(exp_data)$gene_description <- annot$description[matched]

Pre-processing

Most datasets use ensembl gene ID by default after alignment, so this step rebases the expression data to gene names. This ensures consistency in naming for downstream analyses.

exp_data <- rebase_gexp(exp_data, annotation = "gene_name")

Filter

Here, we filter out genes expressed in too few samples or with very low counts. This removes noise from the data and focuses on meaningful gene expressions.

exp_data <- filter_gexp(exp_data,
                        min_nsamp = 1, 
                        min_counts = 1)

Visualization of the filtering process to ensure the criteria applied align with the dataset’s characteristics:

colData(exp_data)$sample_id <- colnames(exp_data)

plot_qc_filters(exp_data)

Normalize

Here, we apply a normalization to the expression data, making samples comparable by reducing variability due to technical differences. For datasets with few samples, rlog is the preferred normalization and when more samples are present, vst is applied.

exp_data <- normalize_gexp(exp_data)

PCA

Principal component analysis (PCA) identifies the major patterns in the dataset. These patterns help explore similarities or differences among samples based on gene expression.

metadata(exp_data)[["pca_res"]] <- pca_gexp(exp_data)
annotations <- setdiff(annotation, c("exp_cluster", "path_cluster"))
plot_pca(exp_data, color = annotation)

Diffexp

Differential expression analysis (diffexp) is a statistical method used to identify genes whose expression levels significantly change between different experimental conditions. These changes in gene expression can provide insights into the biological processes or pathways affected, and help identify key molecular players involved in a condition or phenotype.

library(edgeR)
colData(exp_data)$dex <- factor(colData(exp_data)$dex)
colData(exp_data)$dex <- factor(colData(exp_data)$dex, levels = c("untrt", "trt"))    # conditions' names 


diffexp <- diffExpAnalysis(countData = assays(exp_data)$counts,
                                          sampleInfo = colData(exp_data),
                                          method = diffexpMethod, cutoff = 10,
                                          annotation = annotation)

Diffexp Filtering

Differential expression filtering include thresholds on adjusted p-values (e.g., padj < 0.05) to ensure statistical significance, and log2 fold change (e.g., |log2FC| > 1) to focus on genes with meaningful expression differences.

if(tolower(diffexpMethod) == "limma") {   # for 'limma', the default variables names are different, so we have to rename it
  diffexp <- diffexp |>
    dplyr::rename(
      log2FoldChange = logFC, 
      pvalue = P.Value,
      padj = adj.P.Val
    )
}

diffexp_filtered <- diffexp |>
    dplyr::filter(padj <= 0.05) |>    # we filter the diffexp, to keep the values under a given threshold 'padj'. this threshold can be change 
    dplyr::arrange(padj)

# Select the 16 most differentially expressed genes
top_de_genes <- rownames(diffexp_filtered)[1:16]

Volcano plot

A volcano plot is a graphical tool used to visualize the results of differential expression analysis. It plots the statistical significance on the y-axis against the magnitude of change on the x-axis for each gene.

volcano_plot <- plot_exp_volcano(diffexp, 20)
volcano_plot

There is also the possibility to flip this graph by using coord_flip of the ggplot2 package

volcano_plot + ggplot2::coord_flip()

Pathway analysis

Pathway collections available in the MSIGdb can be specified in the parameters. These pathways are scored and ranked by their variance in the data. These are the available collections (use gs_subcollection as name except for Hallmarks, which should be ‘H’).

msigdbr::msigdbr_collections() |> kableExtra::kbl() |> kableExtra::kable_styling() |> kableExtra::scroll_box(height = "300px")

gs_collection	gs_subcollection	gs_collection_name	num_genesets
C1		Positional	302
C2	CGP	Chemical and Genetic Perturbations	3494
C2	CP	Canonical Pathways	19
C2	CP:BIOCARTA	BioCarta Pathways	292
C2	CP:KEGG_LEGACY	KEGG Legacy Pathways	186
C2	CP:KEGG_MEDICUS	KEGG Medicus Pathways	658
C2	CP:PID	PID Pathways	196
C2	CP:REACTOME	Reactome Pathways	1736
C2	CP:WIKIPATHWAYS	WikiPathways	830
C3	MIR:MIRDB	miRDB	2377
C3	MIR:MIR_LEGACY	MIR_Legacy	221
C3	TFT:GTRD	GTRD	505
C3	TFT:TFT_LEGACY	TFT_Legacy	610
C4	3CA	Curated Cancer Cell Atlas gene sets	148
C4	CGN	Cancer Gene Neighborhoods	427
C4	CM	Cancer Modules	431
C5	GO:BP	GO Biological Process	7608
C5	GO:CC	GO Cellular Component	1026
C5	GO:MF	GO Molecular Function	1820
C5	HPO	Human Phenotype Ontology	5653
C6		Oncogenic Signature	189
C7	IMMUNESIGDB	ImmuneSigDB	4872
C7	VAX	HIPC Vaccine Response	347
C8		Cell Type Signature	840
H		Hallmark	50

pathways <- get_annotation_collection(collections, 
                                      species = species)

pathwayResult <- pathwayAnalysis(diffexp_filtered,
                              pathways = pathways,
                              method = pathwayMethod, species = species) |> as.data.frame()
rownames(pathwayResult) <- pathwayResult$pathway
S4Vectors::metadata(exp_data)[["pathwayEnrichment"]] <- pathwayResult
pathwayResult_sorted <- pathwayResult[order(pathwayResult$PValue), ]

top10_pathways <- head(pathwayResult_sorted, 10)

Dot Plot

The resulting dot plot summarizes the most enriched biological pathways across different conditions or samples

plot_pathway_dotplot(exp_data, score_name = "pathwayEnrichment")

Boxplots

Boxplots are valuable tools in biology for visualizing the distribution of gene expression or other quantitative variables across experimental conditions. They allow quick comparisons between groups by summarizing key statistics A Wilcoxon test is made, and the p-value is also printed. The closer the value is to 0, the more the genes/pathways are influenced by the experimental conditions. We can consider that the experimental conditions influence the results when p < 0.05.

dir.create(file.path("results", "sup", "targeted", pathwayMethod, "boxplots"), recursive = TRUE, showWarnings = FALSE)
pathway_scores <- score_pathways(exp_data, pathways, verbose = FALSE)
S4Vectors::metadata(exp_data)[["pathway_scores"]] <- pathway_scores

boxplots <- lapply(heatmap_pathways, function(path) {
  lapply(annotation, function(annotation) {
    plt <- plot_path_boxplot(exp_data, 
                             pathway = path,
                   annotation = annotation, 
                  color_var = annotation, 
                   pt_size = 2,
                   fname = stringr::str_glue("results/sup/targeted/{pathwayMethod}/boxplots/box_{path}_{annotation}.pdf"))
  })
}) |> purrr::flatten()

patchwork::wrap_plots(boxplots, nrows = round(length(boxplots)/2), guides = "collect")

Genes Heatmap

On the heatmaps, we can tell that the top_genes expressions are higher for the treated samples. It means that these genes might be involved in the treatment response.

hms <- lapply(1:length(heatmap_genes), function(i) {
  genes <- heatmap_genes[[i]]
  name <- ifelse(is.null(names(heatmap_genes)), i, names(heatmap_genes)[i])
  hm <- plot_exp_heatmap(exp_data, genes = genes, 
                   annotations = annotation,
                   show_rownames = ifelse(length(genes) <= 100, TRUE, FALSE),
                   hm_color_limits = c(-2,2),
                   show_dend_row = FALSE,
                   fname = stringr::str_glue("results/sup/targeted/{pathwayMethod}/hm_genes_{i}.pdf"))
})
patchwork::wrap_plots(hms, ncol = 2, guides = "collect")

Cordeliers Artificial Intelligence and Bioinformatics