Cystic Fibrosis (CF) is a congenital condition resulting from mutations of the cystic fibrosis transmembrane conductance regulator (CFTR) gene [1,2]. CFTR protein is a chloride transporter needed for mucus hydration of the airway surface [3]; dehydration leads to high neutrophil, interleukin, and macrophage counts [4,5].
Although CF affects various organs, most of its histological complications occur in the lungs. Frequent CF-induced infections cause inflammation, then lung fibrosis (increased fibroblast production and excessive extracellular matrix (ECM) materials) [6], leading to lung tissue remodelling. Fibroblasts are associated with CF because they cause the proliferation of fibrocytes, fibrotic lesions and lung mesenchymal cells [7] - ultimately causing up-regulation of pro-fibrotic genes.
Using transcriptomics, differentially expressed genes (DEGs) in CF tissues can be quantified. Also, it is useful for elucidating the biological and molecular mechanisms causing CF onset, and, possibly, targeted therapeutic approaches. This study examines the transcriptomic profiles of CF and normal human lung fibroblasts (NHLF).
CF- and NHLF RNA-Seq raw counts matrix and series matrix files were downloaded from NCBI GEO (GSE141535). The NHLF were obtained from 47-year-old males, while the CFHLF were from 45-year-old females. The experimental design and GEO numbers are shown in Table 1.
GEO_accession | Sample_title | Sample_name | Cell_type | Gender |
---|---|---|---|---|
GSM4205771 | Normal Human Lung Fibroblasts from 2D culture, rep1 | Normal.1 | Normal | Male |
GSM4205772 | Normal Human Lung Fibroblasts from 2D culture, rep3 | Normal.2 | Normal | Male |
GSM4205773 | Normal Human Lung Fibroblasts from 2D culture, rep3 | Normal.3 | Normal | Male |
GSM4205774 | Normal Human Lung Fibroblasts from 2D culture, rep4 | Normal.4 | Normal | Male |
GSM4205775 | Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep1 | Disease.1 | CF | Female |
GSM4205776 | Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep2 | Disease.2 | CF | Female |
GSM4205777 | Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep3 | Disease.3 | CF | Female |
GSM4205778 | Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep4 | Disease.4 | CF | Female |
RNA extraction was performed using Roche’s High Pure RNA Tissue Kit. Library preparation was done using QuantSeq 3’mRNA-Seq Kit. Library quality was examined with High sensitivity DNA D1000. Illumina NovaSeq 6000 was used for sequencing. Sequence base calls were converted into FASTQ using bcl2fastq. Gene expressions were determined using htseq-count. The count matrix is shown in Table 2.
GeneID | GSM4205771 | GSM4205772 | GSM4205773 | GSM4205774 | GSM4205775 | GSM4205776 | GSM4205777 | GSM4205778 |
---|---|---|---|---|---|---|---|---|
100287102 | 1 | 2 | 0 | 0 | 3 | 2 | 2 | 1 |
653635 | 1 | 7 | 4 | 0 | 7 | 6 | 6 | 3 |
102466751 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
107985730 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
100302278 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
645520 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
79501 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
100996442 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
729737 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 |
102725121 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
The following R packages were installed from BiocManager:
edgeR
: To create a DGEList object;
RColorBrewer
: To get nice colours;
org.Hs.eg.db
: For gene annotation; and
limma
: For differential expression and gene enrichment
analyses.
# Installing packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager", "tidyverse")
BiocManager::install(c("edgeR", "RColorBrewer", "org.Hs.eg.db", "limma"))
The libraries were also loaded:
# Loading libraries
library(knitr)
library(RColorBrewer)
library(edgeR)
library(limma)
library(org.Hs.eg.db)
The raw sequence count and series matrix were loaded in R and converted into a data frame (Table 2) as shown below.
# Load the raw count matrix data into R
seq_data <- read.table (gzfile ("C:/Users/USER/Desktop/E-portfolio-Main/GSE141535_raw_counts_GRCh38.p13_NCBI.tsv.gz"), header = TRUE)
# Load the series matrix data (sample info) downloaded from NCBI into R
sampleinfo <- read.delim ("C:/Users/USER/Desktop/E-portfolio-Main/GSE141535_series_matrix_2.txt", header = T, stringsAsFactors = T)
sampleinfo = subset (sampleinfo, select = -c(X)) ## To remove mislabelled column header
sampleinfo <- sampleinfo [-c(9:23), ] ## To remove empty cells in the data frame
The data was adjusted to remove the GeneID column, creating a new matrix.
# To create a new matrix containing only the count data from the 8 samples, removing the first column (GeneID)
count_data <- seq_data [, -1]
The GeneIDs were converted to row names and the columns were renamed.
# To turn the GeneID names (in the first column) into row names
rownames(count_data) <- seq_data [, 1] ## The GeneIDs are now the row names.
# The column names are actually the sample names,
# So the columns were renamed to reveal information from the Sample Info, such as the condition and replicate number, about each sample.
colnames(count_data) <- c("Normal.1", "Normal.2", "Normal.3", "Normal.4", "Disease.1", "Disease.2", "Disease.3", "Disease.4")
# The data now reflects information from the cell types.
kable(head(count_data))
Normal.1 | Normal.2 | Normal.3 | Normal.4 | Disease.1 | Disease.2 | Disease.3 | Disease.4 | |
---|---|---|---|---|---|---|---|---|
100287102 | 1 | 2 | 0 | 0 | 3 | 2 | 2 | 1 |
653635 | 1 | 7 | 4 | 0 | 7 | 6 | 6 | 3 |
102466751 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
107985730 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
100302278 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
645520 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Note: The column names should bear the same name as the sample name in the Sample Info data
# The code below was ran to check if the Sample Name in sampleinfo and count_data are in the same order
table(colnames(count_data)==sampleinfo$Sample_name)
##
## TRUE
## 8
A DGEList
object, containing the sample information, and
read counts, was created. The DGEList
function in edgeR
allows for easy manipulation and sub-setting of the DGEList
object:
# Convert read counts to a DGEList object
dge_list <- DGEList(count_data)
# To extract grouping information from "sampleinfo"
group_info <- paste(sampleinfo$Cell_type,sampleinfo$Gender,sep=".")
# Convert "group_info" into a factor
group_info <- factor(group_info)
# View the values
table(group_info)
## group_info
## CF.Female Normal.Male
## 4 4
The disease condition and gender were not treated as separate factors in the original experiment, so, both factors were regarded as one in the analysis (Table 4).
# Include the factors into the DGEList object
dge_list$samples$group <- group_info
# View the Samples slot of the DGEList object
kable(dge_list$samples)
group | lib.size | norm.factors | |
---|---|---|---|
Normal.1 | Normal.Male | 955342 | 1 |
Normal.2 | Normal.Male | 2620178 | 1 |
Normal.3 | Normal.Male | 1498839 | 1 |
Normal.4 | Normal.Male | 1070215 | 1 |
Disease.1 | CF.Female | 1181443 | 1 |
Disease.2 | CF.Female | 1304686 | 1 |
Disease.3 | CF.Female | 1233521 | 1 |
Disease.4 | CF.Female | 1531346 | 1 |
Gene names and gene symbols associated with the respective Entrez IDs
were used to annotate the DGEList
object via the
org.Hs.eg.db
package.
# Adding annotation, such as Gene Symbol, Gene Name, and ENTREZID, to DGEList object
anno_data <- select(org.Hs.eg.db, keys=rownames(dge_list$counts), columns=c("ENTREZID", "SYMBOL","GENENAME"))
## 'select()' returned 1:1 mapping between keys and columns
# To check that the ENTREZIDs match the dge_list$count row names
table(anno_data$ENTREZID==rownames(dge_list$counts))
##
## TRUE
## 39376
# Add the annotation information into a new slot (called genes) in the DGEList object
dge_list$genes <- anno_data
Genes with low expression were removed to avoid false
discovery.Hence, they should be removed. To achieve this, the
filterByExpr
function in edgeR
was used
because it retains only genes that possess a minimum count in multiple
samples.
# Filter out gene with low expression using filterByExpr
exprs_kept <- filterByExpr(dge_list)
dge_list_2 <- dge_list[exprs_kept,]
# Compare the value of the new DGEList object to the previous one.
list(dim(dge_list), dim(dge_list_2))
## [[1]]
## [1] 39376 8
##
## [[2]]
## [1] 9460 8
# 9,460 genes (24%) were kept out of 39,376
This aspect is a quality control step. The library sizes of the samples were examined to identify differences in their distributions (Figures 1-2).
## Quality Control
## Number of reads in the filtered DGE_list_2 data
dge_list_2$samples$lib.size
## [1] 955342 2620178 1498839 1070215 1181443 1304686 1233521 1531346
An MDS plot was generated to examine the source of variation in the samples, making it easy to detect outliers (Figure 3).
# multidimensional scaling plots
plotMDS(dge_list_2)
Normalisation was done using calcNormFactors
function.
The DEG profiles of each sample were closely examined using the plotMD
function, comparing samples with the highest and lowest
norm.factors
before and after normalisation
correction..
# Normalisation of DGE_list_2
dge_list_2 <- calcNormFactors(dge_list_2)
A design matrix (Table 5) to compare the Normal and CF groups was created for differential expression (DE) analysis using limma.
# create a design matrix with no intercept using the group_info variable
matrix <- model.matrix(~ 0 + group_info)
# Rename the matrix columns to bear the grouping information, CF/Normal
colnames(matrix) <- levels(group_info)
# View matrix
kable (matrix) ## The rows represents the samples, while the columns represent the coefficient of comparison - 1 for present, 0 for absent- between CF and Normal groups.
CF.Female | Normal.Male |
---|---|
0 | 1 |
0 | 1 |
0 | 1 |
0 | 1 |
1 | 0 |
1 | 0 |
1 | 0 |
1 | 0 |
Using the design matrix and DGEList
object, the
voom
function created an EList
object with
adjusted library sizes and normalised log2 counts.
# Use voom function to transform the data, with the normalised dge_list_2 and matrix as the objects
voom_trans <- voom(dge_list_2, matrix, plot=F)
The voom-transformed object was tested for DE by fitting it to a
linear model using the lmFit
function.
# Now to test for differential expression using the lmfit function to fit a linear model
fit_model <- lmFit(voom_trans)
The makeContrasts
function was used to determine DEGs in
the CF and Normal groups, with the null hypothesis defined as “CF.Female
- Normal.Male = 0”.
# Test for difference between groups
contrast.matrix <- makeContrasts(CysticVsNormal=CF.Female - Normal.Male,levels=group_info)
kable(contrast.matrix)
CysticVsNormal | |
---|---|
CF.Female | 1 |
Normal.Male | -1 |
Using the contrasts.fit
function, the contrast matrix
was applied to the voom-transformed object to obtain relevant parameters
for the CF-Normal comparison. Afterwards, the eBayes
function was called to apply empirical Bayes shrinkage to the
contrast.
# Using the constrasts.fit function in Limma, the constrast.matrix included the fit_model object
fit.contrast <- contrasts.fit(fit_model, contrast.matrix)
# Next, the eBayes function was applied to the fit.contrast object to calculate the p-values and t-statistics, and shrink the variances using empirical Bayes.
fit.contrast <- eBayes(fit.contrast)
Using decideTests
function, the DEGs for the contrast
were produced and visualised. The top DEGs were stored in an object,
with the p-value capped at 0.05 and adjusted.
# Created summary data of DEG for contrasts using 'decideTest' function of Limma
summary.fit <- decideTests(fit.contrast)
# Created an object to store the top 100 DEGs
top.gene <- topTable(fit.contrast,coef=1,number=100,genelist=fit.contrast$genes, p.value = 0.05, adjust.method = "BH", sort.by = "p") ## p-value was adjusted using the Benjamini-Hochberg method
TREAT analysis was performed to rank DEGs by their Log fold-change
(LFC). A LFC of 1, meaning double expression, was applied to the
fit.contrast
object to determine which DEGs are more
biologically relevant. The TREAT results were sorted by their p-values,
then stored and visualised.
# Testing relative to a threshold (TREAT) using a logFC of 1
## Performed TREAT analysis to rank the genes per their logFCs, re-estimating the p-values and t-statistic
fit.treat <- treat(fit.contrast, lfc =1)
## Computed summary of the TREAT results
results.treat <- decideTests(fit.treat)
## Created an object to store the top 10 genes post-TREAT analysis
treat.top.genes <- topTable(fit.treat,coef=1,sort.by="p") ## By default, only the top 10 DEGs will be stored in the treat.top.genes data
GO analysis (GOANA) was performed to find over-represented or enriched GO terms in the TREAT output of the CF-Normal comparison using the goana function (FDR < 0.05).
# Gene Ontology of TREAT results
GO_treat <- goana(fit.treat, coef = "CysticVsNormal", species = "Hs") ## FDR < 0.05 by default
Additionally, CAMERA was used to check for enriched gene sets using
the Hallmark gene sets from MSigDB. The ids2indices
function was used to map the Entrez ID of the voom-transformed object to
the gene sets. An inter-gene correlation of 0.05 was used, as it allows
for biological interpretation and low FDR.
# Loaded the Hallmark Human gene data set from the Broad Institute’s Molecular Signatures Database (MSigDB) into R
Hs.H <- readRDS("C:/Users/USER/Desktop/E-portfolio-Main/Rmd/Hs.h.all.v7.1.entrez.rds")
# The Hallmark Gene Set is a vector object containing gene sets and each vector has ENTREZ IDs for the genes in the gene set
## The Entrez IDs contained in the vectors of the H.indices object is exactly what is contained in the row names of the voom_trans object. The ids2indices function was be used to map the IDs between the gene list in the vector and voom objects.
H.index <- ids2indices(Hs.H, rownames(voom_trans))
# Used Camera function to get correlation for each gene set
## The inter.gene.cor argument was kept low at 0.05
camera.out <- camera(voom_trans,index = H.index, design = matrix, contrast = contrast.matrix[,1],inter.gene.cor = 0.05)
The samples’ counts were not normally distributed (Figure 1), so the
counts were converted to the log2 scale, correcting the library size
differences, using the cpm
function (Figure 2).
# To create a barplot of the library sizes for each sample
barplot(dge_list_2$samples$lib.size/1e06, names=colnames(dge_list_2),las=2, ann=FALSE, cex.names=0.75)
mtext(side = 1, text = "Samples", line = 4)
mtext(side = 2, text = "Library size in Millions", line = 3)
Figure 1: Barplot of Library Sizes.
The LogCPM distribution of the samples had little or no difference. No sample was far above or below the red line.
# Get log counts of DGEList object
dge_logcounts <- cpm (dge_list_2, log = T)
# Check distributions of samples using boxplots
boxplot(dge_logcounts, xlab="", ylab="Log2 Counts Per Million",las=2)
# Include a horizontal line that represents to the median logCPM
abline(h=median(dge_logcounts),col="red")
Figure 2: Boxplots of LogCPMs (Unnormalised).
The MDS plot revealed that the samples cluster by cell type, Normal or Disease (Figure 3); that is, the cell types are the greatest variation source. The distance between the Disease and the Normal replicates on the x-axis are 2 units, corresponding to a 4-fold-change between the samples. No inter-group clustering was observed.
# To reset factor to existing levels in the data frame
sampleinfo$Cell_type <- factor (sampleinfo$Cell_type)
# Create a new value containing color-coded characters for the levels
color.cell.type <- c("blue", "red")[sampleinfo$Cell_type]
# Now infuse the color code into the MDS plot
plotMDS(dge_list_2,col=color.cell.type, xlab = "Leading LogFC Dim1", ylab = "Leading LogFC Dim2")
# Let's add a legend to the plot so we know which colours correspond to which cell type
legend("top",fill=c("blue","red"),legend=levels(sampleinfo$Cell_type))
Figure 3: MDS Plot of Cystic Fibrosis (CF) and Normal Cell Type Samples.
The CF group had a relatively low norm.factor
(0.96-1.01) (Table 7). Normal.4 had the highest
norm.factors
(1.03), while Disease.1 had the lowest
(0.96).
# View the normalisation factors of the samples
kable(dge_list_2$samples)
group | lib.size | norm.factors | |
---|---|---|---|
Normal.1 | Normal.Male | 955342 | 1.0177303 |
Normal.2 | Normal.Male | 2620178 | 0.9743332 |
Normal.3 | Normal.Male | 1498839 | 1.0216918 |
Normal.4 | Normal.Male | 1070215 | 1.0309631 |
Disease.1 | CF.Female | 1181443 | 0.9649097 |
Disease.2 | CF.Female | 1304686 | 1.0127833 |
Disease.3 | CF.Female | 1233521 | 1.0010146 |
Disease.4 | CF.Female | 1531346 | 0.9787079 |
Although the genes in the logcount data clustered around the zero Log-ratio (Figure 4), the normalised data showed a more uniform clustering (Figure 5). Additionally, the CF group had more gene expressions with positive log-ratios.
# To compare the logcounts and compositional bias normalisation of Normal.4 and Disease.1
# The logcount-normalised data was used here
par(mfrow=c(1,2))
plotMD(dge_logcounts,column = 4)
abline(h=0,col="red", lty=2, lwd=2)
plotMD(dge_logcounts,column = 5)
abline(h=0,col="red", lty=2, lwd=2)
Figure 4: Comparing the Log Count Data Between Normal.4 and Disease.1.
# The compositional bias-normalised data was used here
par(mfrow=c(1,2))
plotMD(dge_list_2,column = 4)
abline(h=0,col="red",lty=2, lwd=2)
plotMD(dge_list_2,column = 5)
abline(h=0,col="red",lty=2, lwd=2)
Figure 5: Comparing the Normalised Data Between Normal.4 and Disease.1.
The mean-variance trend showed the biological variation of the genes in all the samples (Figure 6). The mean-variance trend decreased progressively and asymptotically with abundance.
# Use voom function to transform the data, with the normalised dge_list_2 and matrix as the objects
voom_trans <- voom(dge_list_2, matrix, plot=T)
Figure 6: Scatterplot of Square-root Standard Deviation Against Normalised Log2 Counts. The red line represents the mean-variance trend.
The library sizes of the voom-transformed data have been adjusted, with the boxes closer to the median LogCPM, unlike the unnormalised data (Figure 7). Furthermore, the voom-transformed data had more outliers less than zero LogCPM and lower extremes.
# Compare the unnormalised logcounts (logCPM) with the voom logCPM
par(mfrow=c(1,2))
boxplot(dge_logcounts, xlab="", ylab="Log2 Counts Per Million",las=2,main="Unnormalised LogCPM")
# Insert a red line to represent the median logCPM of the unnormalised DGE List
abline(h=median(dge_logcounts),col="red")
boxplot(voom_trans$E, xlab="", ylab="Log2 Counts Per Million",las=2,main="Voom-Transformed logCPM")
# Insert a red line to represent the median logCPM of the voom-transformed DGE List
abline(h=median(voom_trans$E),col="red")
Figure 7: Comparing the Unnormalised LogCPM with The Voom-transformed LogCPM.
There were 918 DEGs (Table 8): 419 were down-regulated, 499 were up-regulated, and 8,542 were not DE.
# Summary of the DEGs for the comparison
kable(summary(summary.fit))
CysticVsNormal | |
---|---|
Down | 419 |
NotSig | 8542 |
Up | 499 |
The top gene in Table 9 (XIST) had expressions ranging from 7.20 to 7.46 in the CF samples (Figure 8).
# View the top 10 genes
kable(head (top.gene, n = 10), digits = 32)
ENTREZID | SYMBOL | GENENAME | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
---|---|---|---|---|---|---|---|---|---|
7503 | 7503 | XIST | X inactive specific transcript | 8.848339 | 2.905534 | 16.68974 | 2.900737e-11 | 7.977232e-08 | 13.07372 |
387066 | 387066 | SNHG5 | small nucleolar RNA host gene 5 | 3.308967 | 7.618322 | 16.41473 | 3.697172e-11 | 7.977232e-08 | 15.98093 |
9383 | 9383 | TSIX | TSIX transcript, XIST antisense RNA | 8.505410 | 2.735071 | 16.25616 | 4.259378e-11 | 7.977232e-08 | 12.84729 |
6192 | 6192 | RPS4Y1 | ribosomal protein S4 Y-linked 1 | -9.395521 | 3.317442 | -16.25523 | 4.262923e-11 | 7.977232e-08 | 12.87789 |
1191 | 1191 | CLU | clusterin | 2.906360 | 9.577039 | 16.21474 | 4.420702e-11 | 7.977232e-08 | 15.79160 |
715 | 715 | C1R | complement C1r | 2.439766 | 7.503698 | 16.06515 | 5.059555e-11 | 7.977232e-08 | 15.67444 |
6319 | 6319 | SCD | stearoyl-CoA desaturase | 3.386719 | 5.834727 | 13.52253 | 6.058330e-10 | 8.187401e-07 | 13.17015 |
1277 | 1277 | COL1A1 | collagen type I alpha 1 chain | 1.775692 | 11.029431 | 12.50043 | 1.843980e-09 | 2.180506e-06 | 11.92301 |
125 | 125 | ADH1B | alcohol dehydrogenase 1B (class I), beta polypeptide | 3.440099 | 5.081107 | 12.26592 | 2.406117e-09 | 2.529096e-06 | 11.75897 |
4500 | 4500 | MT1L | metallothionein 1L (pseudogene) | -1.887294 | 7.693046 | -12.02138 | 3.189964e-09 | 2.798492e-06 | 11.50982 |
# The expression levels of the topmost up-regulated gene in the table above.
nice.colors <- brewer.pal(7,name="Dark2") # to add nice colors
stripchart(voom_trans$E["7503",]~group_info,vertical=TRUE,las=2,cex.axis=0.8,pch=16,cex=1.3,col=nice.colors,method="jitter",ylab="Normalised log2 expression", main="XIST")
Figure 8: Scatterplot Showing the Normalised Log2 Expressions of XIST gene in CF and Normal Samples.
The top DEGs were visualised using plotMD, while volcanoplot was used to highlight the symbols of the top 10 up- and down-regulated DEGs (Figure 9).
# Constructing MD Plot and Volcano Plot after testing for Differential expression
# Use out from decideTests to show significant genes
# Visualise using plotMD function
plotMD(fit.contrast, coef = 1, status = summary.fit[,"CysticVsNormal"], values = c(1, -1), hl.col = c("red","blue"), xlab="Average Log-expression", ylab="Log Fold-Change", cex=1.0)
legend("topright", title=paste("LogFC"), legend=c("Up", "Down", "No-DE"), pch=19, col=c("red","blue", "black"))
abline(h=0)
# Using volcano plot, the top DE genes and gene symbols will be highlighted
# Here the top 10 genes are highlighted
volcanoplot(fit.contrast,coef=1,highlight=10,names=fit.contrast$genes$SYMBOL, main="CysticVsNormal", xlab = "Log2 Fold-Change", ylab = "-Log10 (P-value)", cex= 0.4)
Figure 9: MD Plot (left) showing the LogFC and average expression of each gene. Volcano Plot (right) showing the highly expressed up- and down-regulated genes.
After TREAT, only 103 DEGs were detected at LogFC of 1 - more than 88% decrease (Table 10). 85 and 18 DEGs were up- and down-regulated, respectively. Consequently, only DEGs with large LogFCs were identified (Figure 10).
# Summary of the TREAT results
kable(summary(results.treat))
CysticVsNormal | |
---|---|
Down | 18 |
NotSig | 9357 |
Up | 85 |
The DEG ranking is now different. Notably, RPS4Y1 replaced SNHG5 as the second most DEG, due to their p-values (Table 11).
kable(treat.top.genes, digits = 32)
ENTREZID | SYMBOL | GENENAME | logFC | AveExpr | t | P.Value | adj.P.Val | |
---|---|---|---|---|---|---|---|---|
7503 | 7503 | XIST | X inactive specific transcript | 8.848339 | 2.905534 | 14.803540 | 8.576286e-11 | 4.236456e-07 |
6192 | 6192 | RPS4Y1 | ribosomal protein S4 Y-linked 1 | -9.395521 | 3.317442 | -14.525124 | 1.136275e-10 | 4.236456e-07 |
9383 | 9383 | TSIX | TSIX transcript, XIST antisense RNA | 8.505410 | 2.735071 | 14.344883 | 1.343485e-10 | 4.236456e-07 |
387066 | 387066 | SNHG5 | small nucleolar RNA host gene 5 | 3.308967 | 7.618322 | 11.454048 | 3.125797e-09 | 7.392510e-06 |
1191 | 1191 | CLU | clusterin | 2.906360 | 9.577039 | 10.635684 | 8.658790e-09 | 1.638243e-05 |
25878 | 25878 | MXRA5 | matrix remodeling associated 5 | 7.132167 | 2.049141 | 10.184420 | 1.590158e-08 | 2.507149e-05 |
6319 | 6319 | SCD | stearoyl-CoA desaturase | 3.386719 | 5.834727 | 9.529721 | 3.796179e-08 | 4.807616e-05 |
715 | 715 | C1R | complement C1r | 2.439766 | 7.503698 | 9.480443 | 4.065638e-08 | 4.807616e-05 |
9537 | 9537 | TP53I11 | tumor protein p53 inducible protein 11 | 4.598240 | 4.737967 | 9.187043 | 6.169558e-08 | 6.311704e-05 |
8287 | 8287 | USP9Y | ubiquitin specific peptidase 9 Y-linked | -6.011449 | 1.618448 | -9.112061 | 6.924605e-08 | 6.311704e-05 |
# View the TREAT results using a MD Plot
plotMD(fit.treat,coef=1,status=results.treat[,"CysticVsNormal"], values=c(1,-1), hl.col=c("red","blue"), xlab="Average Log-expression", ylab="Log Fold Change", cex=1.0)
legend("topright", title=paste("LogFC"), legend=c("Up", "Down", "No-DE"), pch=19, col=c("red","blue", "black"))
abline(h=0,col="grey")
# Here the top 10 genes are highlighted
volcanoplot(fit.treat,coef=1, highlight=10, names=fit.treat$genes$SYMBOL, main="CysticVsNormal", xlab = "Log2 Fold Change", ylab = "-Log10 (P-value)", cex= 0.4)
Figure 10: Post-TREAT MD Plot (left) showing DEGs with fold-change greater than 1. Post-TREAT Volcano Plot (right) showing the highly expressed up- and down-regulated genes.
There were 18,835 enriched GO terms associated with molecular function (MF), cellular component (CC), and biological process (BP). In CC (Table 12), cell periphery had the most genes (2,187) and 40 were significantly up-regulated; in BP (Table 13), developmental process had the most genes (3,058) and 48 were significantly up-regulated; and in MF (Table 14), signaling receptor binding had the most genes (566) and 18 were significantly up-regulated. The table is arranged by the least P.Up and P.Down values.
# Used topGO to view the GO terms with the most significance and enrichment
kable(topGO(GO_treat, ontology = "CC")) ## Cellular Component
Term | Ont | N | Up | Down | P.Up | P.Down | |
---|---|---|---|---|---|---|---|
GO:0030312 | external encapsulating structure | CC | 204 | 20 | 1 | 0.0000000 | 0.3259219 |
GO:0031012 | extracellular matrix | CC | 204 | 20 | 1 | 0.0000000 | 0.3259219 |
GO:0062023 | collagen-containing extracellular matrix | CC | 175 | 17 | 1 | 0.0000000 | 0.2866704 |
GO:0005576 | extracellular region | CC | 1712 | 40 | 4 | 0.0000000 | 0.4189050 |
GO:0005615 | extracellular space | CC | 1484 | 35 | 3 | 0.0000000 | 0.5560731 |
GO:0071944 | cell periphery | CC | 2187 | 40 | 6 | 0.0000011 | 0.2233059 |
GO:0098643 | banded collagen fibril | CC | 8 | 3 | 0 | 0.0000384 | 1.0000000 |
GO:0005583 | fibrillar collagen trimer | CC | 8 | 3 | 0 | 0.0000384 | 1.0000000 |
GO:0031091 | platelet alpha granule | CC | 45 | 5 | 0 | 0.0000489 | 1.0000000 |
GO:0005581 | collagen trimer | CC | 30 | 4 | 0 | 0.0001414 | 1.0000000 |
GO:0098644 | complex of collagen trimers | CC | 12 | 3 | 0 | 0.0001471 | 1.0000000 |
GO:0009986 | cell surface | CC | 280 | 9 | 2 | 0.0009069 | 0.0985845 |
GO:0005604 | basement membrane | CC | 49 | 4 | 0 | 0.0009599 | 1.0000000 |
GO:0097440 | apical dendrite | CC | 6 | 2 | 0 | 0.0011787 | 1.0000000 |
GO:0070821 | tertiary granule membrane | CC | 30 | 0 | 2 | 1.0000000 | 0.0014531 |
GO:0099512 | supramolecular fiber | CC | 504 | 12 | 0 | 0.0017842 | 1.0000000 |
GO:0099081 | supramolecular polymer | CC | 507 | 12 | 0 | 0.0018757 | 1.0000000 |
GO:0042583 | chromaffin granule | CC | 9 | 2 | 0 | 0.0027796 | 1.0000000 |
GO:0035579 | specific granule membrane | CC | 47 | 0 | 2 | 1.0000000 | 0.0035423 |
GO:0042383 | sarcolemma | CC | 71 | 4 | 1 | 0.0037886 | 0.1274119 |
kable(topGO(GO_treat, ontology = "BP")) ## Biological Process
Term | Ont | N | Up | Down | P.Up | P.Down | |
---|---|---|---|---|---|---|---|
GO:0009653 | anatomical structure morphogenesis | BP | 1296 | 31 | 3 | 1.00e-07 | 0.4591457 |
GO:0030154 | cell differentiation | BP | 1977 | 39 | 3 | 2.00e-07 | 0.7613987 |
GO:0048869 | cellular developmental process | BP | 1978 | 39 | 3 | 2.00e-07 | 0.7617329 |
GO:0048513 | animal organ development | BP | 1411 | 32 | 4 | 2.00e-07 | 0.2786936 |
GO:0009887 | animal organ morphogenesis | BP | 451 | 17 | 3 | 4.00e-07 | 0.0520817 |
GO:0048731 | system development | BP | 1909 | 37 | 3 | 9.00e-07 | 0.7378814 |
GO:0007275 | multicellular organism development | BP | 2260 | 41 | 3 | 9.00e-07 | 0.8429937 |
GO:0048732 | gland development | BP | 233 | 12 | 1 | 1.10e-06 | 0.3631273 |
GO:0045229 | external encapsulating structure organization | BP | 156 | 10 | 0 | 1.30e-06 | 1.0000000 |
GO:0030198 | extracellular matrix organization | BP | 156 | 10 | 0 | 1.30e-06 | 1.0000000 |
GO:0043062 | extracellular structure organization | BP | 156 | 10 | 0 | 1.30e-06 | 1.0000000 |
GO:0009888 | tissue development | BP | 931 | 24 | 3 | 1.40e-06 | 0.2601920 |
GO:0048856 | anatomical structure development | BP | 2798 | 46 | 4 | 2.10e-06 | 0.8288663 |
GO:0032502 | developmental process | BP | 3058 | 48 | 6 | 4.00e-06 | 0.5566032 |
GO:0060384 | innervation | BP | 13 | 4 | 0 | 4.10e-06 | 1.0000000 |
GO:0032963 | collagen metabolic process | BP | 53 | 6 | 0 | 7.40e-06 | 1.0000000 |
GO:0007399 | nervous system development | BP | 1225 | 26 | 0 | 1.66e-05 | 1.0000000 |
GO:0016477 | cell migration | BP | 745 | 19 | 2 | 2.66e-05 | 0.4222843 |
GO:0050768 | negative regulation of neurogenesis | BP | 68 | 6 | 0 | 3.16e-05 | 1.0000000 |
GO:0048608 | reproductive structure development | BP | 138 | 8 | 1 | 3.22e-05 | 0.2334522 |
kable(topGO(GO_treat, ontology = "MF")) ## Molecular Function
Term | Ont | N | Up | Down | P.Up | P.Down | |
---|---|---|---|---|---|---|---|
GO:0005539 | glycosaminoglycan binding | MF | 86 | 11 | 0 | 0.0000000 | 1.0000000 |
GO:0008201 | heparin binding | MF | 63 | 9 | 0 | 0.0000000 | 1.0000000 |
GO:0005201 | extracellular matrix structural constituent | MF | 73 | 9 | 0 | 0.0000000 | 1.0000000 |
GO:1901681 | sulfur compound binding | MF | 122 | 10 | 1 | 0.0000001 | 0.2092964 |
GO:0008131 | primary amine oxidase activity | MF | 3 | 3 | 0 | 0.0000007 | 1.0000000 |
GO:0005102 | signaling receptor binding | MF | 566 | 18 | 1 | 0.0000022 | 0.6725110 |
GO:0048018 | receptor ligand activity | MF | 97 | 8 | 1 | 0.0000024 | 0.1701087 |
GO:0030546 | signaling receptor activator activity | MF | 102 | 8 | 1 | 0.0000035 | 0.1780900 |
GO:0030545 | signaling receptor regulator activity | MF | 111 | 8 | 1 | 0.0000065 | 0.1922741 |
GO:0005198 | structural molecule activity | MF | 395 | 14 | 1 | 0.0000103 | 0.5377672 |
GO:0005178 | integrin binding | MF | 85 | 7 | 0 | 0.0000106 | 1.0000000 |
GO:0048407 | platelet-derived growth factor binding | MF | 8 | 3 | 0 | 0.0000384 | 1.0000000 |
GO:0016641 | oxidoreductase activity, acting on the CH-NH2 group of donors, oxygen as acceptor | MF | 9 | 3 | 0 | 0.0000573 | 1.0000000 |
GO:0052595 | aliphatic amine oxidase activity | MF | 2 | 2 | 0 | 0.0000805 | 1.0000000 |
GO:0097621 | monoamine oxidase activity | MF | 2 | 2 | 0 | 0.0000805 | 1.0000000 |
GO:0016638 | oxidoreductase activity, acting on the CH-NH2 group of donors | MF | 10 | 3 | 0 | 0.0000813 | 1.0000000 |
GO:0050840 | extracellular matrix binding | MF | 38 | 4 | 0 | 0.0003606 | 1.0000000 |
GO:0019838 | growth factor binding | MF | 69 | 5 | 0 | 0.0003793 | 1.0000000 |
GO:0004252 | serine-type endopeptidase activity | MF | 40 | 4 | 1 | 0.0004404 | 0.0737926 |
GO:0030020 | extracellular matrix structural constituent conferring tensile strength | MF | 19 | 3 | 0 | 0.0006188 | 1.0000000 |
The 2 most significant gene sets (FDR < 0.05) in the CAMERA output are the Hallmark Epithelial-Mesenchymal Transition (EMT), followed by Hallmark E2F Targets (E2FT).
# View the top 10 gene sets
# The list is ranked by p-value, so, the most significant gene set is at the top
kable(camera.out[1:10, ], digits = 32)
NGenes | Direction | PValue | FDR | |
---|---|---|---|---|
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION | 160 | Up | 8.108529e-05 | 0.004054265 |
HALLMARK_E2F_TARGETS | 179 | Down | 1.171711e-03 | 0.029292780 |
HALLMARK_HEDGEHOG_SIGNALING | 14 | Up | 5.246642e-03 | 0.079750264 |
HALLMARK_CHOLESTEROL_HOMEOSTASIS | 58 | Up | 6.380021e-03 | 0.079750264 |
HALLMARK_G2M_CHECKPOINT | 183 | Down | 1.233071e-02 | 0.123307116 |
HALLMARK_MYOGENESIS | 101 | Up | 1.738763e-02 | 0.144896897 |
HALLMARK_MYC_TARGETS_V1 | 198 | Down | 2.712901e-02 | 0.193778614 |
HALLMARK_ANGIOGENESIS | 18 | Up | 8.518489e-02 | 0.532405547 |
HALLMARK_INTERFERON_ALPHA_RESPONSE | 65 | Up | 1.254412e-01 | 0.696895612 |
HALLMARK_UV_RESPONSE_DN | 126 | Up | 1.810185e-01 | 0.730667720 |
# Genes significant at FDR < 0.05
table(camera.out$FDR < 0.05) # Only 2 genes were significant at 5% and 4 at 10%
##
## FALSE TRUE
## 48 2
For the CF-Normal contrast, the gene sets (GS) are ranked by increasing LogFC from left to right. The bars, forming the barcode, represent genes within a gene set. Neutral enrichment is represented by the dotted line. In the EMT GS, the worm’s direction was upward toward a +LogFC and became thicker toward the right (Figure 11). However, for the E2FT GS, the direction was downward, with the barcode thickened leftward of the plot (-LogFC) (Figure 12).
# Barcode plot using LFCs.
# HALLMARK EPITHELIAL MESENCHYMAL TRANSITION
barcodeplot(fit.contrast$coefficients[, 1],
index=H.index[["HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"]],
main="LogFC: HALLMARK EPITHELIAL MESENCHYMAL TRANSITION (UP)",
labels=c("Normal.Male", "CF.Female"))
Figure 11: Bardcode plot showing the enrichment of the EMT gene set in the CF and Normal samples The x-axis is the LogFC for CF and Normal samples.
# Hallmark E2F TARGETS
barcodeplot(fit.contrast$coefficients[, 1],
index=H.index[["HALLMARK_E2F_TARGETS"]],
main="LogFC: HALLMARK E2F TARGETS (DOWN)",
labels=c("Normal.Male", "CF.Female"))
Figure 12: Bardcode plot showing the enrichment of the E2FT gene set in the CF and Normal samples. The x-axis is the LogFC for CF and Normal samples.
A bi-directional plot for both gene sets was produced (Figure 13).
# Barcode plot using 2 gene sets
barcodeplot(fit.contrast$coefficients[, 1],
index = H.index[["HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"]],
index2 = H.index[["HALLMARK_E2F_TARGETS"]],
main="H. EPITHELIAL MESENCHYMAL TRANSITION & H. E2F TARGETS",
labels=c("Normal.Male", "CF.Female")
)
Figure 13: Bardcode plot showing the enrichment of the EMT and E2FT gene sets in the CF and Normal samples.
This transcriptomic analysis illustrates the difference between NHLF and CFHLF. Specifically, the MDS plot (Figure 3) indicated that the differences between CFHLF and NHLF are greater than those within CFHLF/NHLF [1]; the CFHLF genes are 4 times more expressed than NHLF. Additionally, it suggests statistical differences in DE between both groups.
The CF group has a relatively low normalisation factor, suggesting a sizeable presence of highly expressed and up-regulated genes. Disease.1’s log-ratio, especially, was somewhat positively skewed (Figures 4-5) with lots of highly up-regulated DEGS, which explains the low norm.factor (Table 7). Additionally, per the makeConstrasts arguments, a +LogFC means the gene is up-regulated in CF samples, however, a -LogFC denotes high expressions in Normal samples (Tables 9-11).
Top CF up-regulated DEGs, such as XIST, CLU, MXRA5, SCD, and TP53I11 (Table 11), have been associated with lung fibrosis severity [8-12]. Subsequent analysis (Tables 12-14) indicated that these up-regulated genes are strongly correlated with extracellular matrix organization (GO:0030198), extracellular space (GO:0005615), and signaling receptor binding (GO:0005102), suggesting that CFHLF is heavily involved in ECM buildup [1,13].
Furthermore, expectedly, CFHLF up-regulated genes (Figure 11) were significantly (FDR < 0.05) involved in EMT activation, a major driver of lung fibrosis [14,15,16,17]. Conversely, CFHLF significantly down-regulated E2FT pathway (Figure 12). E2FT down-regulation is associated with lung fibrosis inhibition [18,19,20,21], through, for example, USP9Y-assisted deubiquitination and RPS4Y1-mediated immune infiltration [20,22,23] (Table 11). Consequently, USP9Y/RPS4Y1 could regulate CF pathogenesis and have therapeutic effects. Lab validation should be done to confirm the regulatory effects of USP9Y/RPS4Y1.
This study did not control the age difference between CFHLF and NHLF; older patients may have disease stages and fibroblast conditions different from younger patients, so the results may not be comparable. Also, the gender bias and the disease stages were not considered.
[1] C. Mazio et al., “Intrinsic Abnormalities of Cystic Fibrosis Airway Connective Tissue Revealed by an In Vitro 3D Stromal Model,” Cells, vol. 9, no. 6, p. 1371, Jun. 2020, doi: 10.3390/cells9061371.
[2] B. Kerem, J. M. Rommens, and J. A. Buchanan, “Identification of the Cystic Fibrosis Gene: Genetic Analysis,” Science, vol. 245, no. 4925, pp. 1437–1437, Sep. 1989, doi: 10.1126/science.245.4925.1437.e.
[3] M. C. Dechecchi, A. Tamanini, and G. Cabrini, “Molecular basis of cystic fibrosis: from bench to bedside,” Annals of Translational Medicine, vol. 6, no. 17, pp. 334–334, Sep. 2018, doi: 10.21037/atm.2018.06.48.
[4] D. A. Stoltz , D. K. Meyerholz , and M. J. Welsh , “Origins of Cystic Fibrosis Lung Disease,” New England Journal of Medicine, vol. 372, no. 16, pp. 1574–1575, Apr. 2015, doi: 10.1056/nejmc1502191.
[5] J. L. Gillan, D. J. Davidson, and R. D. Gray, “Targeting cystic fibrosis inflammation in the age of CFTR modulators: focus on macrophages,” European Respiratory Journal, vol. 57, no. 6, p. 2003502, Dec. 2020, doi: 10.1183/13993003.03502-2020.
[6] A. M. Cantin, D. Hartl, M. W. Konstan, and J. F. Chmiel, “Inflammation in cystic fibrosis lung disease: Pathogenesis and therapy,” Journal of Cystic Fibrosis, vol. 14, no. 4, pp. 419–430, Jul. 2015, doi: 10.1016/j.jcf.2015.03.003.
[7] R. K. Kasam et al., “Fibrocyte accumulation in the lungs of cystic fibrosis patients,” Journal of Cystic Fibrosis, vol. 19, no. 5, pp. 815–822, Sep. 2020, doi: 10.1016/j.jcf.2020.06.011.
[8] A. M. A. Glasgow, C. De Santi, and C. M. Greene, “Non-coding RNA in cystic fibrosis,” Biochemical Society Transactions, vol. 46, no. 3, pp. 619–630, May 2018, doi: 10.1042/bst20170469.
[9] Q. Zhang, Y. Yue, and R. Zheng, “Clusterin as a serum biomarker candidate contributes to the lung fibroblasts activation in chronic obstructive pulmonary disease,” Chinese Medical Journal, vol. 135, no. 9, pp. 1076–1086, Feb. 2022, doi: 10.1097/cm9.0000000000002065.
[10] S. Kamei et al., “Integrative expression analysis identifies a novel interplay between CFTR and linc-SUMF1-2 that involves CF-associated gene dysregulation,” Biochemical and Biophysical Research Communications, vol. 509, no. 2, pp. 521–528, Feb. 2019, doi: 10.1016/j.bbrc.2018.12.152.
[11] Y. Xu, C. Tertilt, A. Krause, L. E. Quadri, R. G. Crystal, and S. Worgall, “Influence of the cystic fibrosis transmembrane conductance regulator on expression of lipid metabolism-related genes in dendritic cells,” Respiratory Research, vol. 10, no. 1, Apr. 2009, doi: 10.1186/1465-9921-10-26.
[12] Q. Wu et al., “p53: A Key Protein That Regulates Pulmonary Fibrosis,” Oxidative Medicine and Cellular Longevity, vol. 2020, pp. 1–13, Nov. 2020, doi: 10.1155/2020/6635794.
[13] M. R. Pinezich et al., “Pathological remodeling of distal lung matrix in end-stage cystic fibrosis patients,” Journal of Cystic Fibrosis, vol. 21, no. 6, pp. 1027–1035, Nov. 2022, doi: 10.1016/j.jcf.2022.04.016.
[14] N. Rout-Pitt, N. Farrow, D. Parsons, and M. Donnelley, “Epithelial mesenchymal transition (EMT): a universal process in lung diseases with implications for cystic fibrosis pathophysiology,” Respiratory Research, vol. 19, no. 1, Jul. 2018, doi: 10.1186/s12931-018-0834-8.
[15] N. Rout-Pitt, B. Boog, A. McCarron, N. Reyne, D. Parsons, and M. Donnelley, “Insights into epithelial-mesenchymal transition from cystic fibrosis rat models,” Journal of Cystic Fibrosis, Sep. 2024, doi: 10.1016/j.jcf.2024.09.003.
[16] M. C. Quaresma et al., “Mutant CFTR Drives TWIST1 mediated epithelial–mesenchymal transition,” Cell Death & Disease, vol. 11, no. 10, Oct. 2020, doi: 10.1038/s41419-020-03119-z.
[17] Z. Li et al., “LncRNA SNHG5 Suppresses Cell Migration and Invasion of Human Lung Adenocarcinoma via Regulation of Epithelial-Mesenchymal Transition,” Journal of Oncology, vol. 2023, pp. 1–13, Jan. 2023, doi: 10.1155/2023/3335959.
[18] Q. Ye et al., “Molecular Regulation of Heme Oxygenase-1 Expression by E2F Transcription Factor 2 in Lung Fibroblast Cells: Relevance to Idiopathic Pulmonary Fibrosis,” Biomolecules, vol. 12, no. 10, p. 1531, Oct. 2022, doi: 10.3390/biom12101531.
[19] T. Yoshihara et al., “Periostin plays a critical role in the cell cycle in lung fibroblasts,” Respiratory Research, vol. 21, no. 1, Jan. 2020, doi: 10.1186/s12931-020-1299-0.
[20] Q. Qian et al., “MicroRNA‐205‐5p targets E2F1 to promote autophagy and inhibit pulmonary fibrosis in silicosis through impairing SKP2‐mediated Beclin1 ubiquitination,” Journal of Cellular and Molecular Medicine, vol. 25, no. 19, pp. 9214–9227, Aug. 2021, doi: 10.1111/jcmm.16825.
[21] F. Wang, P. Li, and F. Li, “Integrated Analysis of a Gene Correlation Network Identifies Critical Regulation of Fibrosis by lncRNAs and TFs in Idiopathic Pulmonary Fibrosis,” BioMed Research International, vol. 2020, pp. 1–14, Jun. 2020, doi: 10.1155/2020/6537462.
[22] H. Zhao et al., “Single-cell RNA Sequencing Analysis Reveals New Immune Disorder Complexities in Hypersplenism,” Frontiers in Immunology, vol. 13, Jul. 2022, doi: 10.3389/fimmu.2022.921900.
[23] L. Xiu, B. Ma, and L. Ding, “Antioncogenic roles of USP9Y and DDX3Y in lung cancer: USP9Y stabilizes DDX3Y by preventing its degradation through deubiquitination,” Acta Histochemica, vol. 126, no. 1, p. 152132, Jan. 2024, doi: 10.1016/j.acthis.2023.152132.