Unsupervised RNA-seq analysis
Cordeliers Artificial Intelligence and Bioinformatics
Source:vignettes/sup_rnaseq.Rmd
sup_rnaseq.Rmd
For the supervised and unsupervised analyses, we are using an open
dataset which is given by the airway
package. It provides a
gene expression dataset derived from human bronchial epithelial
cells, treated or not with dexamethasone (a
corticosteroid).
Here is an example of how the supervised part of the
CAIBIrnaseq
package can be use. If you want to use this
notebook for your projects, it is available here
First, we need to import the required packages :
library(airway) # dataset's package
library(SummarizedExperiment) # Object's type we are using
library(CAIBIrnaseq) # package
Before analysing the dataset, we define the variables with the names of the genes / pathways we want to analyse.
These variables will change in function of how your dataset is build and which type of data it is.
annotation <- "dex"
diffexpMethod <- "limma" # Either limma or deseq2
species <- "human"
pathwayMethod <- "ora" # Either ORA or fgsea
# The following variables are those that will need to be modified depending on the analyses you want to do
collections <- c("CGP", "CP", "CP:KEGG_LEGACY", "Hallmark")
pathwayOfInterest <- list("HALLMARK_INFLAMMATORY_RESPONSE", "GOBP_RESPONSE_TO_GLUCOCORTICOID","GOBP_REGULATION_OF_CELL_PROLIFERATION","HALLMARK_APOPTOSIS")
boxplot_pathways <- list("HALLMARK_INFLAMMATORY_RESPONSE","GOBP_RESPONSE_TO_STEROID_HORMONE")
heatmap_genes <- list(
gr_response_genes = c("FKBP5", "TSC22D3", "PER1", "ZBTB16"),
anti_inflam_genes = c("DUSP1", "SOCS1", "MT2A")
)
heatmap_pathways <- c(
"DUTERTRE_ESTRADIOL_RESPONSE_24HR_DN",
"REN_ALVEOLAR_RHABDOMYOSARCOMA_DN",
"NUYTTEN_EZH2_TARGETS_UP",
"PASINI_SUZ12_TARGETS_DN"
)
Remember to come back in this cell to modify the genes, pathways you are interested in studying.
Load data
This section loads the RNA-seq dataset for analysis. It ensures the correct input file is used, as specified in the parameters. rebase_gexp
Ensure your dataset is in a Summarized Experiment
object, because all the used functions below works with this type of
input.
If you want to know more about this type of object, please click here: Bioconductor
If you have a .xsl file (Excel), there is an existing notebook to convert it in a .RDS file.
data(airway, package="airway")
exp_data <- airway
# If you are using your own dataset .RDS file), use this command line :
# exp_data <- readRDS(data_file)
Even if the datasets are globally build the same way, the names of the variables are not exactly the same, so if we want to keep the same code, we need to redefine a bit the variables.
If you want to know what are the used variables in this part, run
this command line :
colnames(SummarizedExperiment::rowData(exp_data))
You should have these variables (with these exact same names):
- gene_name : The commonly used symbol or name for the
gene (e.g., A1BG).
- gene_id : A unique and stable identifier for the
gene, often from databases like Ensembl.
- gene_length_kb : The length of the gene measured in
kilobases
- gene_description : A brief textual summary of the
gene’s function or characteristics. - gene_biotype : A
classification of the gene based on its biological function or
transcript type, such as protein_coding, lncRNA, or pseudogene.
If you are using the notebook that Clarice GROENEVELD created to convert a .xsl file into a .RDS one, you can SKIP the next cell. If not, you should look at how your dataset is defined. You might need to run some command line as the following ones:
rowData(exp_data)$gene_length_kb <-
(rowData(exp_data)$gene_seq_end - rowData(exp_data)$gene_seq_start) / 1000
library(biomaRt)
mart <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl")
gene_ids <- rowData(exp_data)$gene_id
annot <- getBM(attributes = c("ensembl_gene_id", "description"),
filters = "ensembl_gene_id",
values = gene_ids,
mart = mart)
matched <- match(rowData(exp_data)$gene_id, annot$ensembl_gene_id)
rowData(exp_data)$gene_description <- annot$description[matched]
Pre-processing
Most datasets use ensembl gene ID by default after alignment, so this step rebases the expression data to gene names. This ensures consistency in naming for downstream analyses.
exp_data <- rebase_gexp(exp_data, annotation = "gene_name")
Filter
Here, we filter out genes expressed in too few samples or with very low counts. This removes noise from the data and focuses on meaningful gene expressions.
exp_data <- filter_gexp(exp_data,
min_nsamp = 1,
min_counts = 1)
Visualization of the filtering process to ensure the criteria applied align with the dataset’s characteristics:
colData(exp_data)$sample_id <- colnames(exp_data)
plot_qc_filters(exp_data)
Normalize
Here, we apply a normalization to the expression
data, making samples comparable by reducing variability due to technical
differences. For datasets with few samples, rlog
is the
preferred normalization and when more samples are present,
vst
is applied.
exp_data <- normalize_gexp(exp_data)
PCA
Principal component analysis (PCA) identifies the major patterns in the dataset. These patterns help explore similarities or differences among samples based on gene expression.
Diffexp
Differential expression analysis (diffexp) is a statistical method used to identify genes whose expression levels significantly change between different experimental conditions. These changes in gene expression can provide insights into the biological processes or pathways affected, and help identify key molecular players involved in a condition or phenotype.
library(edgeR)
colData(exp_data)$dex <- factor(colData(exp_data)$dex)
colData(exp_data)$dex <- factor(colData(exp_data)$dex, levels = c("untrt", "trt")) # conditions' names
diffexp <- diffExpAnalysis(countData = assays(exp_data)$counts,
sampleInfo = colData(exp_data),
method = diffexpMethod, cutoff = 10,
annotation = annotation)
Diffexp Filtering
Differential expression filtering include thresholds on adjusted p-values (e.g., padj < 0.05) to ensure statistical significance, and log2 fold change (e.g., |log2FC| > 1) to focus on genes with meaningful expression differences.
if(tolower(diffexpMethod) == "limma") { # for 'limma', the default variables names are different, so we have to rename it
diffexp <- diffexp |>
dplyr::rename(
log2FoldChange = logFC,
pvalue = P.Value,
padj = adj.P.Val
)
}
diffexp_filtered <- diffexp |>
dplyr::filter(padj <= 0.05) |> # we filter the diffexp, to keep the values under a given threshold 'padj'. this threshold can be change
dplyr::arrange(padj)
# Select the 16 most differentially expressed genes
top_de_genes <- rownames(diffexp_filtered)[1:16]
Volcano plot
A volcano plot is a graphical tool used to visualize the results of differential expression analysis. It plots the statistical significance on the y-axis against the magnitude of change on the x-axis for each gene.
volcano_plot <- plot_exp_volcano(diffexp, 20)
volcano_plot
There is also the possibility to flip this graph by using
coord_flip
of the ggplot2
package
volcano_plot + ggplot2::coord_flip()
Pathway analysis
Pathway collections available in the MSIGdb can be specified in the
parameters. These pathways are scored and ranked by their variance in
the data. These are the available collections (use
gs_subcollection
as name except for Hallmarks, which should
be ‘H’).
msigdbr::msigdbr_collections() |> kableExtra::kbl() |> kableExtra::kable_styling() |> kableExtra::scroll_box(height = "300px")
gs_collection | gs_subcollection | gs_collection_name | num_genesets |
---|---|---|---|
C1 | Positional | 302 | |
C2 | CGP | Chemical and Genetic Perturbations | 3494 |
C2 | CP | Canonical Pathways | 19 |
C2 | CP:BIOCARTA | BioCarta Pathways | 292 |
C2 | CP:KEGG_LEGACY | KEGG Legacy Pathways | 186 |
C2 | CP:KEGG_MEDICUS | KEGG Medicus Pathways | 658 |
C2 | CP:PID | PID Pathways | 196 |
C2 | CP:REACTOME | Reactome Pathways | 1736 |
C2 | CP:WIKIPATHWAYS | WikiPathways | 830 |
C3 | MIR:MIRDB | miRDB | 2377 |
C3 | MIR:MIR_LEGACY | MIR_Legacy | 221 |
C3 | TFT:GTRD | GTRD | 505 |
C3 | TFT:TFT_LEGACY | TFT_Legacy | 610 |
C4 | 3CA | Curated Cancer Cell Atlas gene sets | 148 |
C4 | CGN | Cancer Gene Neighborhoods | 427 |
C4 | CM | Cancer Modules | 431 |
C5 | GO:BP | GO Biological Process | 7608 |
C5 | GO:CC | GO Cellular Component | 1026 |
C5 | GO:MF | GO Molecular Function | 1820 |
C5 | HPO | Human Phenotype Ontology | 5653 |
C6 | Oncogenic Signature | 189 | |
C7 | IMMUNESIGDB | ImmuneSigDB | 4872 |
C7 | VAX | HIPC Vaccine Response | 347 |
C8 | Cell Type Signature | 840 | |
H | Hallmark | 50 |
pathways <- get_annotation_collection(collections,
species = species)
pathwayResult <- pathwayAnalysis(diffexp_filtered,
pathways = pathways,
method = pathwayMethod, species = species) |> as.data.frame()
rownames(pathwayResult) <- pathwayResult$pathway
S4Vectors::metadata(exp_data)[["pathwayEnrichment"]] <- pathwayResult
pathwayResult_sorted <- pathwayResult[order(pathwayResult$PValue), ]
top10_pathways <- head(pathwayResult_sorted, 10)
Dot Plot
The resulting dot plot
summarizes the most enriched
biological pathways across different conditions or samples
plot_pathway_dotplot(exp_data, score_name = "pathwayEnrichment")
Boxplots
Boxplots are valuable tools in biology for visualizing the distribution of gene expression or other quantitative variables across experimental conditions. They allow quick comparisons between groups by summarizing key statistics A Wilcoxon test is made, and the p-value is also printed. The closer the value is to 0, the more the genes/pathways are influenced by the experimental conditions. We can consider that the experimental conditions influence the results when p < 0.05.
dir.create(file.path("results", "sup", "targeted", pathwayMethod, "boxplots"), recursive = TRUE, showWarnings = FALSE)
pathway_scores <- score_pathways(exp_data, pathways, verbose = FALSE)
S4Vectors::metadata(exp_data)[["pathway_scores"]] <- pathway_scores
boxplots <- lapply(heatmap_pathways, function(path) {
lapply(annotation, function(annotation) {
plt <- plot_path_boxplot(exp_data,
pathway = path,
annotation = annotation,
color_var = annotation,
pt_size = 2,
fname = stringr::str_glue("results/sup/targeted/{pathwayMethod}/boxplots/box_{path}_{annotation}.pdf"))
})
}) |> purrr::flatten()
patchwork::wrap_plots(boxplots, nrows = round(length(boxplots)/2), guides = "collect")
Genes Heatmap
On the heatmaps, we can tell that the top_genes expressions are higher for the treated samples. It means that these genes might be involved in the treatment response.
hms <- lapply(1:length(heatmap_genes), function(i) {
genes <- heatmap_genes[[i]]
name <- ifelse(is.null(names(heatmap_genes)), i, names(heatmap_genes)[i])
hm <- plot_exp_heatmap(exp_data, genes = genes,
annotations = annotation,
show_rownames = ifelse(length(genes) <= 100, TRUE, FALSE),
hm_color_limits = c(-2,2),
show_dend_row = FALSE,
fname = stringr::str_glue("results/sup/targeted/{pathwayMethod}/hm_genes_{i}.pdf"))
})
patchwork::wrap_plots(hms, ncol = 2, guides = "collect")