Skip to contents

For the supervised and unsupervised analyses, we are using an open dataset which is given by the airway package. It provides a gene expression dataset derived from human bronchial epithelial cells, treated or not with dexamethasone (a corticosteroid).

Here is an example of how the supervised part of the CAIBIrnaseq package can be use. If you want to use this notebook for your projects, it is available here

First, we need to import the required packages :

library(airway)               # dataset's package  
library(SummarizedExperiment) # Object's type we are using
library(CAIBIrnaseq)          # package

Before analysing the dataset, we define the variables with the names of the genes / pathways we want to analyse.

These variables will change in function of how your dataset is build and which type of data it is.

annotation <- "dex"
diffexpMethod <- "limma" # Either limma or deseq2
species <- "human"
pathwayMethod <- "ora"  # Either ORA or fgsea

# The following variables are those that will need to be modified depending on the analyses you want to do  

collections <- c("CGP", "CP", "CP:KEGG_LEGACY", "Hallmark")

pathwayOfInterest <- list("HALLMARK_INFLAMMATORY_RESPONSE", "GOBP_RESPONSE_TO_GLUCOCORTICOID","GOBP_REGULATION_OF_CELL_PROLIFERATION","HALLMARK_APOPTOSIS")

boxplot_pathways <- list("HALLMARK_INFLAMMATORY_RESPONSE","GOBP_RESPONSE_TO_STEROID_HORMONE")

heatmap_genes <- list(
  gr_response_genes = c("FKBP5", "TSC22D3", "PER1", "ZBTB16"),
  anti_inflam_genes = c("DUSP1", "SOCS1", "MT2A")
)

heatmap_pathways <- c(
  "DUTERTRE_ESTRADIOL_RESPONSE_24HR_DN",
  "REN_ALVEOLAR_RHABDOMYOSARCOMA_DN",
  "NUYTTEN_EZH2_TARGETS_UP",
  "PASINI_SUZ12_TARGETS_DN"
)

Remember to come back in this cell to modify the genes, pathways you are interested in studying.

Load data

This section loads the RNA-seq dataset for analysis. It ensures the correct input file is used, as specified in the parameters. rebase_gexp

Ensure your dataset is in a Summarized Experiment object, because all the used functions below works with this type of input.

If you want to know more about this type of object, please click here: Bioconductor

If you have a .xsl file (Excel), there is an existing notebook to convert it in a .RDS file.

data(airway, package="airway")
exp_data <- airway

# If you are using your own dataset .RDS file), use this command line : 
# exp_data <- readRDS(data_file)

Even if the datasets are globally build the same way, the names of the variables are not exactly the same, so if we want to keep the same code, we need to redefine a bit the variables.

If you want to know what are the used variables in this part, run this command line : colnames(SummarizedExperiment::rowData(exp_data))

You should have these variables (with these exact same names):
- gene_name : The commonly used symbol or name for the gene (e.g., A1BG).
- gene_id : A unique and stable identifier for the gene, often from databases like Ensembl.
- gene_length_kb : The length of the gene measured in kilobases
- gene_description : A brief textual summary of the gene’s function or characteristics. - gene_biotype : A classification of the gene based on its biological function or transcript type, such as protein_coding, lncRNA, or pseudogene.

If you are using the notebook that Clarice GROENEVELD created to convert a .xsl file into a .RDS one, you can SKIP the next cell. If not, you should look at how your dataset is defined. You might need to run some command line as the following ones:

rowData(exp_data)$gene_length_kb <- 
  (rowData(exp_data)$gene_seq_end - rowData(exp_data)$gene_seq_start) / 1000

library(biomaRt)
mart <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl")
gene_ids <- rowData(exp_data)$gene_id

annot <- getBM(attributes = c("ensembl_gene_id", "description"),
               filters = "ensembl_gene_id",
               values = gene_ids,
               mart = mart)
matched <- match(rowData(exp_data)$gene_id, annot$ensembl_gene_id)
rowData(exp_data)$gene_description <- annot$description[matched]

Pre-processing

Most datasets use ensembl gene ID by default after alignment, so this step rebases the expression data to gene names. This ensures consistency in naming for downstream analyses.

exp_data <- rebase_gexp(exp_data, annotation = "gene_name")

Filter

Here, we filter out genes expressed in too few samples or with very low counts. This removes noise from the data and focuses on meaningful gene expressions.

exp_data <- filter_gexp(exp_data,
                        min_nsamp = 1, 
                        min_counts = 1)

Visualization of the filtering process to ensure the criteria applied align with the dataset’s characteristics:

colData(exp_data)$sample_id <- colnames(exp_data)

plot_qc_filters(exp_data)

Normalize

Here, we apply a normalization to the expression data, making samples comparable by reducing variability due to technical differences. For datasets with few samples, rlog is the preferred normalization and when more samples are present, vst is applied.

exp_data <- normalize_gexp(exp_data)

PCA

Principal component analysis (PCA) identifies the major patterns in the dataset. These patterns help explore similarities or differences among samples based on gene expression.

metadata(exp_data)[["pca_res"]] <- pca_gexp(exp_data)
annotations <- setdiff(annotation, c("exp_cluster", "path_cluster"))
plot_pca(exp_data, color = annotation)

Diffexp

Differential expression analysis (diffexp) is a statistical method used to identify genes whose expression levels significantly change between different experimental conditions. These changes in gene expression can provide insights into the biological processes or pathways affected, and help identify key molecular players involved in a condition or phenotype.

library(edgeR)
colData(exp_data)$dex <- factor(colData(exp_data)$dex)
colData(exp_data)$dex <- factor(colData(exp_data)$dex, levels = c("untrt", "trt"))    # conditions' names 


diffexp <- diffExpAnalysis(countData = assays(exp_data)$counts,
                                          sampleInfo = colData(exp_data),
                                          method = diffexpMethod, cutoff = 10,
                                          annotation = annotation)

Diffexp Filtering

Differential expression filtering include thresholds on adjusted p-values (e.g., padj < 0.05) to ensure statistical significance, and log2 fold change (e.g., |log2FC| > 1) to focus on genes with meaningful expression differences.

if(tolower(diffexpMethod) == "limma") {   # for 'limma', the default variables names are different, so we have to rename it
  diffexp <- diffexp |>
    dplyr::rename(
      log2FoldChange = logFC, 
      pvalue = P.Value,
      padj = adj.P.Val
    )
}

diffexp_filtered <- diffexp |>
    dplyr::filter(padj <= 0.05) |>    # we filter the diffexp, to keep the values under a given threshold 'padj'. this threshold can be change 
    dplyr::arrange(padj)

# Select the 16 most differentially expressed genes
top_de_genes <- rownames(diffexp_filtered)[1:16]

Volcano plot

A volcano plot is a graphical tool used to visualize the results of differential expression analysis. It plots the statistical significance on the y-axis against the magnitude of change on the x-axis for each gene.

volcano_plot <- plot_exp_volcano(diffexp, 20)
volcano_plot

There is also the possibility to flip this graph by using coord_flip of the ggplot2 package

volcano_plot + ggplot2::coord_flip()

Pathway analysis

Pathway collections available in the MSIGdb can be specified in the parameters. These pathways are scored and ranked by their variance in the data. These are the available collections (use gs_subcollection as name except for Hallmarks, which should be ‘H’).

msigdbr::msigdbr_collections() |> kableExtra::kbl() |> kableExtra::kable_styling() |> kableExtra::scroll_box(height = "300px")
gs_collection gs_subcollection gs_collection_name num_genesets
C1 Positional 302
C2 CGP Chemical and Genetic Perturbations 3494
C2 CP Canonical Pathways 19
C2 CP:BIOCARTA BioCarta Pathways 292
C2 CP:KEGG_LEGACY KEGG Legacy Pathways 186
C2 CP:KEGG_MEDICUS KEGG Medicus Pathways 658
C2 CP:PID PID Pathways 196
C2 CP:REACTOME Reactome Pathways 1736
C2 CP:WIKIPATHWAYS WikiPathways 830
C3 MIR:MIRDB miRDB 2377
C3 MIR:MIR_LEGACY MIR_Legacy 221
C3 TFT:GTRD GTRD 505
C3 TFT:TFT_LEGACY TFT_Legacy 610
C4 3CA Curated Cancer Cell Atlas gene sets 148
C4 CGN Cancer Gene Neighborhoods 427
C4 CM Cancer Modules 431
C5 GO:BP GO Biological Process 7608
C5 GO:CC GO Cellular Component 1026
C5 GO:MF GO Molecular Function 1820
C5 HPO Human Phenotype Ontology 5653
C6 Oncogenic Signature 189
C7 IMMUNESIGDB ImmuneSigDB 4872
C7 VAX HIPC Vaccine Response 347
C8 Cell Type Signature 840
H Hallmark 50
pathways <- get_annotation_collection(collections, 
                                      species = species)

pathwayResult <- pathwayAnalysis(diffexp_filtered,
                              pathways = pathways,
                              method = pathwayMethod, species = species) |> as.data.frame()
rownames(pathwayResult) <- pathwayResult$pathway
S4Vectors::metadata(exp_data)[["pathwayEnrichment"]] <- pathwayResult
pathwayResult_sorted <- pathwayResult[order(pathwayResult$PValue), ]

top10_pathways <- head(pathwayResult_sorted, 10)

Dot Plot

The resulting dot plot summarizes the most enriched biological pathways across different conditions or samples

plot_pathway_dotplot(exp_data, score_name = "pathwayEnrichment")

Boxplots

Boxplots are valuable tools in biology for visualizing the distribution of gene expression or other quantitative variables across experimental conditions. They allow quick comparisons between groups by summarizing key statistics A Wilcoxon test is made, and the p-value is also printed. The closer the value is to 0, the more the genes/pathways are influenced by the experimental conditions. We can consider that the experimental conditions influence the results when p < 0.05.

dir.create(file.path("results", "sup", "targeted", pathwayMethod, "boxplots"), recursive = TRUE, showWarnings = FALSE)
pathway_scores <- score_pathways(exp_data, pathways, verbose = FALSE)
S4Vectors::metadata(exp_data)[["pathway_scores"]] <- pathway_scores

boxplots <- lapply(heatmap_pathways, function(path) {
  lapply(annotation, function(annotation) {
    plt <- plot_path_boxplot(exp_data, 
                             pathway = path,
                   annotation = annotation, 
                  color_var = annotation, 
                   pt_size = 2,
                   fname = stringr::str_glue("results/sup/targeted/{pathwayMethod}/boxplots/box_{path}_{annotation}.pdf"))
  })
}) |> purrr::flatten()
patchwork::wrap_plots(boxplots, nrows = round(length(boxplots)/2), guides = "collect")

Genes Heatmap

On the heatmaps, we can tell that the top_genes expressions are higher for the treated samples. It means that these genes might be involved in the treatment response.

hms <- lapply(1:length(heatmap_genes), function(i) {
  genes <- heatmap_genes[[i]]
  name <- ifelse(is.null(names(heatmap_genes)), i, names(heatmap_genes)[i])
  hm <- plot_exp_heatmap(exp_data, genes = genes, 
                   annotations = annotation,
                   show_rownames = ifelse(length(genes) <= 100, TRUE, FALSE),
                   hm_color_limits = c(-2,2),
                   show_dend_row = FALSE,
                   fname = stringr::str_glue("results/sup/targeted/{pathwayMethod}/hm_genes_{i}.pdf"))
})
patchwork::wrap_plots(hms, ncol = 2, guides = "collect")