Word count: 2000

Background

Cystic Fibrosis (CF) is a congenital condition resulting from mutations of the cystic fibrosis transmembrane conductance regulator (CFTR) gene [1,2]. CFTR protein is a chloride transporter needed for mucus hydration of the airway surface [3]; dehydration leads to high neutrophil, interleukin, and macrophage counts [4,5].

Although CF affects various organs, most of its histological complications occur in the lungs. Frequent CF-induced infections cause inflammation, then lung fibrosis (increased fibroblast production and excessive extracellular matrix (ECM) materials) [6], leading to lung tissue remodelling. Fibroblasts are associated with CF because they cause the proliferation of fibrocytes, fibrotic lesions and lung mesenchymal cells [7] - ultimately causing up-regulation of pro-fibrotic genes.

Using transcriptomics, differentially expressed genes (DEGs) in CF tissues can be quantified. Also, it is useful for elucidating the biological and molecular mechanisms causing CF onset, and, possibly, targeted therapeutic approaches. This study examines the transcriptomic profiles of CF and normal human lung fibroblasts (NHLF).

Materials and Methods

Data Information

CF- and NHLF RNA-Seq raw counts matrix and series matrix files were downloaded from NCBI GEO (GSE141535). The NHLF were obtained from 47-year-old males, while the CFHLF were from 45-year-old females. The experimental design and GEO numbers are shown in Table 1.

Table 1: GEO Numbers and Experimental Design.
GEO_accession Sample_title Sample_name Cell_type Gender
GSM4205771 Normal Human Lung Fibroblasts from 2D culture, rep1 Normal.1 Normal Male
GSM4205772 Normal Human Lung Fibroblasts from 2D culture, rep3 Normal.2 Normal Male
GSM4205773 Normal Human Lung Fibroblasts from 2D culture, rep3 Normal.3 Normal Male
GSM4205774 Normal Human Lung Fibroblasts from 2D culture, rep4 Normal.4 Normal Male
GSM4205775 Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep1 Disease.1 CF Female
GSM4205776 Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep2 Disease.2 CF Female
GSM4205777 Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep3 Disease.3 CF Female
GSM4205778 Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep4 Disease.4 CF Female

RNA extraction was performed using Roche’s High Pure RNA Tissue Kit. Library preparation was done using QuantSeq 3’mRNA-Seq Kit. Library quality was examined with High sensitivity DNA D1000. Illumina NovaSeq 6000 was used for sequencing. Sequence base calls were converted into FASTQ using bcl2fastq. Gene expressions were determined using htseq-count. The count matrix is shown in Table 2.

Table 2: Count Matrix (10 Rows)
GeneID GSM4205771 GSM4205772 GSM4205773 GSM4205774 GSM4205775 GSM4205776 GSM4205777 GSM4205778
100287102 1 2 0 0 3 2 2 1
653635 1 7 4 0 7 6 6 3
102466751 0 0 0 0 0 0 0 0
107985730 0 1 0 0 0 0 0 0
100302278 0 0 0 0 0 0 0 0
645520 0 0 0 0 0 0 0 0
79501 0 0 0 0 0 0 0 0
100996442 0 1 1 0 0 0 0 0
729737 0 0 0 0 0 2 1 0
102725121 0 0 0 0 0 0 1 0

Analytical Workflow

Installation of Packages

The following R packages were installed from BiocManager:

  1. edgeR: To create a DGEList object;
  2. RColorBrewer: To get nice colours;
  3. org.Hs.eg.db: For gene annotation; and
  4. limma: For differential expression and gene enrichment analyses.
# Installing packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager", "tidyverse")
BiocManager::install(c("edgeR", "RColorBrewer", "org.Hs.eg.db", "limma"))

The libraries were also loaded:

# Loading libraries
library(knitr)
library(RColorBrewer)
library(edgeR)
library(limma)
library(org.Hs.eg.db)

Loading the Data

The raw sequence count and series matrix were loaded in R and converted into a data frame (Table 2) as shown below.

# Load the raw count matrix data into R
seq_data <- read.table (gzfile ("C:/Users/USER/Desktop/E-portfolio-Main/GSE141535_raw_counts_GRCh38.p13_NCBI.tsv.gz"), header = TRUE)
# Load the series matrix data (sample info) downloaded from NCBI into R
sampleinfo <- read.delim ("C:/Users/USER/Desktop/E-portfolio-Main/GSE141535_series_matrix_2.txt", header = T, stringsAsFactors = T)

sampleinfo = subset (sampleinfo, select = -c(X)) ## To remove mislabelled column header
sampleinfo <- sampleinfo [-c(9:23), ] ## To remove empty cells in the data frame

Data Processing

The data was adjusted to remove the GeneID column, creating a new matrix.

# To create a new matrix containing only the count data from the 8 samples, removing the first column (GeneID)
count_data <- seq_data [, -1]

The GeneIDs were converted to row names and the columns were renamed.

# To turn the GeneID names (in the first column) into row names
rownames(count_data) <- seq_data [, 1] ## The GeneIDs are now the row names.

# The column names are actually the sample names,
# So the columns were renamed to reveal information from the Sample Info, such as the condition and replicate number, about each sample.
colnames(count_data) <- c("Normal.1", "Normal.2", "Normal.3", "Normal.4", "Disease.1", "Disease.2", "Disease.3", "Disease.4")
# The data now reflects information from the cell types.
kable(head(count_data))
Table 3: New Read Counts Data Frame.
Normal.1 Normal.2 Normal.3 Normal.4 Disease.1 Disease.2 Disease.3 Disease.4
100287102 1 2 0 0 3 2 2 1
653635 1 7 4 0 7 6 6 3
102466751 0 0 0 0 0 0 0 0
107985730 0 1 0 0 0 0 0 0
100302278 0 0 0 0 0 0 0 0
645520 0 0 0 0 0 0 0 0
# Note: The column names should bear the same name as the sample name in the Sample Info data
# The code below was ran to check if the Sample Name in sampleinfo and count_data are in the same order
table(colnames(count_data)==sampleinfo$Sample_name)
## 
## TRUE 
##    8

Create a DGEList from the Count Matrix

A DGEList object, containing the sample information, and read counts, was created. The DGEList function in edgeR allows for easy manipulation and sub-setting of the DGEList object:

# Convert read counts to a DGEList object
dge_list <- DGEList(count_data)

# To extract grouping information from "sampleinfo" 
group_info <- paste(sampleinfo$Cell_type,sampleinfo$Gender,sep=".")

# Convert "group_info" into a factor
group_info <- factor(group_info)

# View the values
table(group_info)
## group_info
##   CF.Female Normal.Male 
##           4           4

The disease condition and gender were not treated as separate factors in the original experiment, so, both factors were regarded as one in the analysis (Table 4).

# Include the factors into the DGEList object
dge_list$samples$group <- group_info

# View the Samples slot of the DGEList object
kable(dge_list$samples)
Table 4: Groups, Library Size, and Normalisation Factors of each Sample.
group lib.size norm.factors
Normal.1 Normal.Male 955342 1
Normal.2 Normal.Male 2620178 1
Normal.3 Normal.Male 1498839 1
Normal.4 Normal.Male 1070215 1
Disease.1 CF.Female 1181443 1
Disease.2 CF.Female 1304686 1
Disease.3 CF.Female 1233521 1
Disease.4 CF.Female 1531346 1

Gene Annotation

Gene names and gene symbols associated with the respective Entrez IDs were used to annotate the DGEList object via the org.Hs.eg.db package.

# Adding annotation, such as Gene Symbol, Gene Name, and ENTREZID, to DGEList object
anno_data <- select(org.Hs.eg.db, keys=rownames(dge_list$counts), columns=c("ENTREZID", "SYMBOL","GENENAME"))
## 'select()' returned 1:1 mapping between keys and columns
# To check that the ENTREZIDs match the dge_list$count row names
table(anno_data$ENTREZID==rownames(dge_list$counts))
## 
##  TRUE 
## 39376
# Add the annotation information into a new slot (called genes) in the DGEList object
dge_list$genes <- anno_data

Filtering Genes with Low Expression

Genes with low expression were removed to avoid false discovery.Hence, they should be removed. To achieve this, the filterByExpr function in edgeR was used because it retains only genes that possess a minimum count in multiple samples.

# Filter out gene with low expression using filterByExpr
exprs_kept <- filterByExpr(dge_list)
dge_list_2 <- dge_list[exprs_kept,]

# Compare the value of the new DGEList object to the previous one.
list(dim(dge_list), dim(dge_list_2))
## [[1]]
## [1] 39376     8
## 
## [[2]]
## [1] 9460    8
# 9,460 genes (24%) were kept out of 39,376 

Library Size

This aspect is a quality control step. The library sizes of the samples were examined to identify differences in their distributions (Figures 1-2).

## Quality Control
## Number of reads in the filtered DGE_list_2 data
dge_list_2$samples$lib.size
## [1]  955342 2620178 1498839 1070215 1181443 1304686 1233521 1531346

Multi-dimensional scaling (MDS) Plots

An MDS plot was generated to examine the source of variation in the samples, making it easy to detect outliers (Figure 3).

# multidimensional scaling plots
plotMDS(dge_list_2)

Normalisation to Remove Composition Bias

Normalisation was done using calcNormFactors function. The DEG profiles of each sample were closely examined using the plotMD function, comparing samples with the highest and lowest norm.factors before and after normalisation correction..

# Normalisation of DGE_list_2
dge_list_2 <- calcNormFactors(dge_list_2)

Design Matrix

A design matrix (Table 5) to compare the Normal and CF groups was created for differential expression (DE) analysis using limma.

# create a design matrix with no intercept using the group_info variable
matrix <- model.matrix(~ 0 + group_info)
# Rename the matrix columns to bear the grouping information, CF/Normal
colnames(matrix) <- levels(group_info)
# View matrix
kable (matrix) ## The rows represents the samples, while the columns represent the coefficient of comparison  - 1 for present, 0 for absent- between CF and Normal groups.
Table 5: Design Matrix Comparing CF to Normal.
CF.Female Normal.Male
0 1
0 1
0 1
0 1
1 0
1 0
1 0
1 0

Voom Transformation

Using the design matrix and DGEList object, the voom function created an EList object with adjusted library sizes and normalised log2 counts.

# Use voom function to transform the data, with the normalised dge_list_2 and matrix as the objects
voom_trans <- voom(dge_list_2, matrix, plot=F)

DE Analysis

The voom-transformed object was tested for DE by fitting it to a linear model using the lmFit function.

# Now to test for differential expression using the lmfit function to fit a linear model
fit_model <- lmFit(voom_trans)

The makeContrasts function was used to determine DEGs in the CF and Normal groups, with the null hypothesis defined as “CF.Female - Normal.Male = 0”.

# Test for difference between groups
contrast.matrix <- makeContrasts(CysticVsNormal=CF.Female - Normal.Male,levels=group_info)
kable(contrast.matrix)
Table 6: Contrast Matrix for DE Analysis.
CysticVsNormal
CF.Female 1
Normal.Male -1

Using the contrasts.fit function, the contrast matrix was applied to the voom-transformed object to obtain relevant parameters for the CF-Normal comparison. Afterwards, the eBayes function was called to apply empirical Bayes shrinkage to the contrast.

# Using the constrasts.fit function in Limma, the constrast.matrix included the fit_model object
fit.contrast <- contrasts.fit(fit_model, contrast.matrix)
# Next, the eBayes function was applied to the fit.contrast object to calculate the p-values and t-statistics, and shrink the variances using empirical Bayes.
fit.contrast <-  eBayes(fit.contrast)

Using decideTests function, the DEGs for the contrast were produced and visualised. The top DEGs were stored in an object, with the p-value capped at 0.05 and adjusted.

# Created summary data of DEG for contrasts using 'decideTest' function of Limma
summary.fit <- decideTests(fit.contrast)
# Created an object to store the top 100 DEGs 
top.gene <- topTable(fit.contrast,coef=1,number=100,genelist=fit.contrast$genes, p.value = 0.05, adjust.method = "BH", sort.by = "p") ## p-value was adjusted using the Benjamini-Hochberg method

Testing Relative to a Threshold (TREAT)

TREAT analysis was performed to rank DEGs by their Log fold-change (LFC). A LFC of 1, meaning double expression, was applied to the fit.contrast object to determine which DEGs are more biologically relevant. The TREAT results were sorted by their p-values, then stored and visualised.

# Testing relative to a threshold (TREAT) using a logFC of 1
## Performed TREAT analysis to rank the genes per their logFCs, re-estimating the p-values and t-statistic
fit.treat <- treat(fit.contrast, lfc =1) 
## Computed summary of the TREAT results
results.treat <- decideTests(fit.treat)
## Created an object to store the top 10 genes post-TREAT analysis
treat.top.genes <- topTable(fit.treat,coef=1,sort.by="p") ## By default, only the top 10 DEGs will be stored in the treat.top.genes data

Gene Ontology (GO) Enrichment Analysis

GO analysis (GOANA) was performed to find over-represented or enriched GO terms in the TREAT output of the CF-Normal comparison using the goana function (FDR < 0.05).

# Gene Ontology of TREAT results
GO_treat <- goana(fit.treat, coef = "CysticVsNormal", species = "Hs") ## FDR < 0.05 by default

Additionally, CAMERA was used to check for enriched gene sets using the Hallmark gene sets from MSigDB. The ids2indices function was used to map the Entrez ID of the voom-transformed object to the gene sets. An inter-gene correlation of 0.05 was used, as it allows for biological interpretation and low FDR.

# Loaded the Hallmark Human gene data set from the Broad Institute’s Molecular Signatures Database (MSigDB) into R
Hs.H <- readRDS("C:/Users/USER/Desktop/E-portfolio-Main/Rmd/Hs.h.all.v7.1.entrez.rds")
# The Hallmark Gene Set is a vector object containing gene sets and each vector has ENTREZ IDs for the genes in the gene set
## The Entrez IDs contained in the vectors of the H.indices object is exactly what is contained in the row names of the voom_trans object. The ids2indices function was be used to map the IDs between the gene list in the vector and voom objects.
H.index <- ids2indices(Hs.H, rownames(voom_trans))
# Used Camera function to get correlation for each gene set
## The inter.gene.cor argument was kept low at 0.05
camera.out <- camera(voom_trans,index = H.index, design = matrix, contrast = contrast.matrix[,1],inter.gene.cor = 0.05)

Results

Quality Control

Library Size Distribution

The samples’ counts were not normally distributed (Figure 1), so the counts were converted to the log2 scale, correcting the library size differences, using the cpm function (Figure 2).

# To create a barplot of the library sizes for each sample
barplot(dge_list_2$samples$lib.size/1e06, names=colnames(dge_list_2),las=2, ann=FALSE, cex.names=0.75)
mtext(side = 1, text = "Samples", line = 4)
mtext(side = 2, text = "Library size in Millions", line = 3)
Figure 1: Barplot of Library Sizes.

Figure 1: Barplot of Library Sizes.

The LogCPM distribution of the samples had little or no difference. No sample was far above or below the red line.

# Get log counts of DGEList object
dge_logcounts <- cpm (dge_list_2, log = T)
# Check distributions of samples using boxplots
boxplot(dge_logcounts, xlab="", ylab="Log2 Counts Per Million",las=2)
# Include a horizontal line that represents to the median logCPM
abline(h=median(dge_logcounts),col="red")
Figure 2: Boxplots of LogCPMs (Unnormalised).

Figure 2: Boxplots of LogCPMs (Unnormalised).

MDS Plot

The MDS plot revealed that the samples cluster by cell type, Normal or Disease (Figure 3); that is, the cell types are the greatest variation source. The distance between the Disease and the Normal replicates on the x-axis are 2 units, corresponding to a 4-fold-change between the samples. No inter-group clustering was observed.

# To reset factor to existing levels in the data frame
sampleinfo$Cell_type <- factor (sampleinfo$Cell_type)
# Create a new value containing color-coded characters for the levels
color.cell.type <- c("blue", "red")[sampleinfo$Cell_type]
# Now infuse the color code into the MDS plot
plotMDS(dge_list_2,col=color.cell.type, xlab = "Leading LogFC Dim1", ylab = "Leading LogFC Dim2")
# Let's add a legend to the plot so we know which colours correspond to which cell type
legend("top",fill=c("blue","red"),legend=levels(sampleinfo$Cell_type))
Figure 3: MDS Plot of Cystic Fibrosis (CF) and Normal Cell Type Samples.

Figure 3: MDS Plot of Cystic Fibrosis (CF) and Normal Cell Type Samples.

Normalisation for Composition Bias

The CF group had a relatively low norm.factor (0.96-1.01) (Table 7). Normal.4 had the highest norm.factors (1.03), while Disease.1 had the lowest (0.96).

# View the normalisation factors of the samples
kable(dge_list_2$samples)
Table 7: Unbiased Normalisation Factors.
group lib.size norm.factors
Normal.1 Normal.Male 955342 1.0177303
Normal.2 Normal.Male 2620178 0.9743332
Normal.3 Normal.Male 1498839 1.0216918
Normal.4 Normal.Male 1070215 1.0309631
Disease.1 CF.Female 1181443 0.9649097
Disease.2 CF.Female 1304686 1.0127833
Disease.3 CF.Female 1233521 1.0010146
Disease.4 CF.Female 1531346 0.9787079

Although the genes in the logcount data clustered around the zero Log-ratio (Figure 4), the normalised data showed a more uniform clustering (Figure 5). Additionally, the CF group had more gene expressions with positive log-ratios.

# To compare the logcounts and compositional bias normalisation of Normal.4 and Disease.1
# The logcount-normalised data was used here
par(mfrow=c(1,2))
plotMD(dge_logcounts,column = 4)
abline(h=0,col="red", lty=2, lwd=2)
plotMD(dge_logcounts,column = 5)
abline(h=0,col="red", lty=2, lwd=2)
Figure 4: Comparing the Log Count Data Between Normal.4 and Disease.1.

Figure 4: Comparing the Log Count Data Between Normal.4 and Disease.1.

# The compositional bias-normalised data was used here
par(mfrow=c(1,2))
plotMD(dge_list_2,column = 4)
abline(h=0,col="red",lty=2, lwd=2)
plotMD(dge_list_2,column = 5)
abline(h=0,col="red",lty=2, lwd=2)
Figure 5: Comparing the Normalised Data Between Normal.4 and Disease.1.

Figure 5: Comparing the Normalised Data Between Normal.4 and Disease.1.

Voom Transformation

The mean-variance trend showed the biological variation of the genes in all the samples (Figure 6). The mean-variance trend decreased progressively and asymptotically with abundance.

# Use voom function to transform the data, with the normalised dge_list_2 and matrix as the objects
voom_trans <- voom(dge_list_2, matrix, plot=T)
Figure 6: Scatterplot of Square-root Standard Deviation Against Normalised Log2 Counts. The red line represents the mean-variance trend.

Figure 6: Scatterplot of Square-root Standard Deviation Against Normalised Log2 Counts. The red line represents the mean-variance trend.

The library sizes of the voom-transformed data have been adjusted, with the boxes closer to the median LogCPM, unlike the unnormalised data (Figure 7). Furthermore, the voom-transformed data had more outliers less than zero LogCPM and lower extremes.

# Compare the unnormalised logcounts (logCPM) with the voom logCPM
par(mfrow=c(1,2))
boxplot(dge_logcounts, xlab="", ylab="Log2 Counts Per Million",las=2,main="Unnormalised LogCPM")
# Insert a red line to represent the median logCPM of the unnormalised DGE List
abline(h=median(dge_logcounts),col="red")
boxplot(voom_trans$E, xlab="", ylab="Log2 Counts Per Million",las=2,main="Voom-Transformed logCPM")
# Insert a red line to represent the median logCPM of the voom-transformed DGE List
abline(h=median(voom_trans$E),col="red")
Figure 7: Comparing the Unnormalised LogCPM with The Voom-transformed LogCPM.

Figure 7: Comparing the Unnormalised LogCPM with The Voom-transformed LogCPM.

DE Results

There were 918 DEGs (Table 8): 419 were down-regulated, 499 were up-regulated, and 8,542 were not DE.

# Summary of the DEGs for the comparison
kable(summary(summary.fit)) 
Table 8: DEGs.
CysticVsNormal
Down 419
NotSig 8542
Up 499

The top gene in Table 9 (XIST) had expressions ranging from 7.20 to 7.46 in the CF samples (Figure 8).

# View the top 10 genes
kable(head (top.gene, n = 10), digits = 32)
Table 9: Top DEGs.
ENTREZID SYMBOL GENENAME logFC AveExpr t P.Value adj.P.Val B
7503 7503 XIST X inactive specific transcript 8.848339 2.905534 16.68974 2.900737e-11 7.977232e-08 13.07372
387066 387066 SNHG5 small nucleolar RNA host gene 5 3.308967 7.618322 16.41473 3.697172e-11 7.977232e-08 15.98093
9383 9383 TSIX TSIX transcript, XIST antisense RNA 8.505410 2.735071 16.25616 4.259378e-11 7.977232e-08 12.84729
6192 6192 RPS4Y1 ribosomal protein S4 Y-linked 1 -9.395521 3.317442 -16.25523 4.262923e-11 7.977232e-08 12.87789
1191 1191 CLU clusterin 2.906360 9.577039 16.21474 4.420702e-11 7.977232e-08 15.79160
715 715 C1R complement C1r 2.439766 7.503698 16.06515 5.059555e-11 7.977232e-08 15.67444
6319 6319 SCD stearoyl-CoA desaturase 3.386719 5.834727 13.52253 6.058330e-10 8.187401e-07 13.17015
1277 1277 COL1A1 collagen type I alpha 1 chain 1.775692 11.029431 12.50043 1.843980e-09 2.180506e-06 11.92301
125 125 ADH1B alcohol dehydrogenase 1B (class I), beta polypeptide 3.440099 5.081107 12.26592 2.406117e-09 2.529096e-06 11.75897
4500 4500 MT1L metallothionein 1L (pseudogene) -1.887294 7.693046 -12.02138 3.189964e-09 2.798492e-06 11.50982
# The expression levels of the topmost up-regulated gene in the table above.
nice.colors <- brewer.pal(7,name="Dark2") # to add nice colors
stripchart(voom_trans$E["7503",]~group_info,vertical=TRUE,las=2,cex.axis=0.8,pch=16,cex=1.3,col=nice.colors,method="jitter",ylab="Normalised log2 expression", main="XIST")
Figure 8: Scatterplot Showing the Normalised Log2 Expressions of XIST gene in CF and Normal Samples.

Figure 8: Scatterplot Showing the Normalised Log2 Expressions of XIST gene in CF and Normal Samples.

The top DEGs were visualised using plotMD, while volcanoplot was used to highlight the symbols of the top 10 up- and down-regulated DEGs (Figure 9).

# Constructing MD Plot and Volcano Plot after testing for Differential expression 
# Use out from decideTests to show significant genes
# Visualise using plotMD function
plotMD(fit.contrast, coef = 1, status = summary.fit[,"CysticVsNormal"], values = c(1, -1), hl.col = c("red","blue"), xlab="Average Log-expression", ylab="Log Fold-Change", cex=1.0)
legend("topright", title=paste("LogFC"), legend=c("Up", "Down", "No-DE"), pch=19, col=c("red","blue", "black"))
abline(h=0)
# Using volcano plot, the top DE genes and gene symbols will be highlighted
# Here the top 10 genes are highlighted
volcanoplot(fit.contrast,coef=1,highlight=10,names=fit.contrast$genes$SYMBOL, main="CysticVsNormal", xlab = "Log2 Fold-Change", ylab = "-Log10 (P-value)", cex= 0.4)
Figure 9: MD Plot (left) showing the LogFC and average expression of each gene. Volcano Plot (right) showing the highly expressed up- and down-regulated genes.Figure 9: MD Plot (left) showing the LogFC and average expression of each gene. Volcano Plot (right) showing the highly expressed up- and down-regulated genes.

Figure 9: MD Plot (left) showing the LogFC and average expression of each gene. Volcano Plot (right) showing the highly expressed up- and down-regulated genes.

TREAT

After TREAT, only 103 DEGs were detected at LogFC of 1 - more than 88% decrease (Table 10). 85 and 18 DEGs were up- and down-regulated, respectively. Consequently, only DEGs with large LogFCs were identified (Figure 10).

# Summary of the TREAT results
kable(summary(results.treat))
Table 10: DEGs (TREAT).
CysticVsNormal
Down 18
NotSig 9357
Up 85

The DEG ranking is now different. Notably, RPS4Y1 replaced SNHG5 as the second most DEG, due to their p-values (Table 11).

kable(treat.top.genes, digits = 32)
Table 11: Top DEGs (TREAT).
ENTREZID SYMBOL GENENAME logFC AveExpr t P.Value adj.P.Val
7503 7503 XIST X inactive specific transcript 8.848339 2.905534 14.803540 8.576286e-11 4.236456e-07
6192 6192 RPS4Y1 ribosomal protein S4 Y-linked 1 -9.395521 3.317442 -14.525124 1.136275e-10 4.236456e-07
9383 9383 TSIX TSIX transcript, XIST antisense RNA 8.505410 2.735071 14.344883 1.343485e-10 4.236456e-07
387066 387066 SNHG5 small nucleolar RNA host gene 5 3.308967 7.618322 11.454048 3.125797e-09 7.392510e-06
1191 1191 CLU clusterin 2.906360 9.577039 10.635684 8.658790e-09 1.638243e-05
25878 25878 MXRA5 matrix remodeling associated 5 7.132167 2.049141 10.184420 1.590158e-08 2.507149e-05
6319 6319 SCD stearoyl-CoA desaturase 3.386719 5.834727 9.529721 3.796179e-08 4.807616e-05
715 715 C1R complement C1r 2.439766 7.503698 9.480443 4.065638e-08 4.807616e-05
9537 9537 TP53I11 tumor protein p53 inducible protein 11 4.598240 4.737967 9.187043 6.169558e-08 6.311704e-05
8287 8287 USP9Y ubiquitin specific peptidase 9 Y-linked -6.011449 1.618448 -9.112061 6.924605e-08 6.311704e-05
# View the TREAT results using a MD Plot
plotMD(fit.treat,coef=1,status=results.treat[,"CysticVsNormal"], values=c(1,-1), hl.col=c("red","blue"), xlab="Average Log-expression", ylab="Log Fold Change", cex=1.0)
legend("topright", title=paste("LogFC"), legend=c("Up", "Down", "No-DE"), pch=19, col=c("red","blue", "black"))
abline(h=0,col="grey")
# Here the top 10 genes are highlighted
volcanoplot(fit.treat,coef=1, highlight=10, names=fit.treat$genes$SYMBOL, main="CysticVsNormal", xlab = "Log2 Fold Change", ylab = "-Log10 (P-value)", cex= 0.4)
Figure 10: Post-TREAT MD Plot (left) showing DEGs with fold-change greater than 1. Post-TREAT Volcano Plot (right) showing the highly expressed up- and down-regulated genes.Figure 10: Post-TREAT MD Plot (left) showing DEGs with fold-change greater than 1. Post-TREAT Volcano Plot (right) showing the highly expressed up- and down-regulated genes.

Figure 10: Post-TREAT MD Plot (left) showing DEGs with fold-change greater than 1. Post-TREAT Volcano Plot (right) showing the highly expressed up- and down-regulated genes.

Enrichment Analysis

GOANA

There were 18,835 enriched GO terms associated with molecular function (MF), cellular component (CC), and biological process (BP). In CC (Table 12), cell periphery had the most genes (2,187) and 40 were significantly up-regulated; in BP (Table 13), developmental process had the most genes (3,058) and 48 were significantly up-regulated; and in MF (Table 14), signaling receptor binding had the most genes (566) and 18 were significantly up-regulated. The table is arranged by the least P.Up and P.Down values.

# Used topGO to view the GO terms with the most significance and enrichment
kable(topGO(GO_treat, ontology = "CC")) ## Cellular Component
Table 12: Top GO Terms: Celluar Component.
Term Ont N Up Down P.Up P.Down
GO:0030312 external encapsulating structure CC 204 20 1 0.0000000 0.3259219
GO:0031012 extracellular matrix CC 204 20 1 0.0000000 0.3259219
GO:0062023 collagen-containing extracellular matrix CC 175 17 1 0.0000000 0.2866704
GO:0005576 extracellular region CC 1712 40 4 0.0000000 0.4189050
GO:0005615 extracellular space CC 1484 35 3 0.0000000 0.5560731
GO:0071944 cell periphery CC 2187 40 6 0.0000011 0.2233059
GO:0098643 banded collagen fibril CC 8 3 0 0.0000384 1.0000000
GO:0005583 fibrillar collagen trimer CC 8 3 0 0.0000384 1.0000000
GO:0031091 platelet alpha granule CC 45 5 0 0.0000489 1.0000000
GO:0005581 collagen trimer CC 30 4 0 0.0001414 1.0000000
GO:0098644 complex of collagen trimers CC 12 3 0 0.0001471 1.0000000
GO:0009986 cell surface CC 280 9 2 0.0009069 0.0985845
GO:0005604 basement membrane CC 49 4 0 0.0009599 1.0000000
GO:0097440 apical dendrite CC 6 2 0 0.0011787 1.0000000
GO:0070821 tertiary granule membrane CC 30 0 2 1.0000000 0.0014531
GO:0099512 supramolecular fiber CC 504 12 0 0.0017842 1.0000000
GO:0099081 supramolecular polymer CC 507 12 0 0.0018757 1.0000000
GO:0042583 chromaffin granule CC 9 2 0 0.0027796 1.0000000
GO:0035579 specific granule membrane CC 47 0 2 1.0000000 0.0035423
GO:0042383 sarcolemma CC 71 4 1 0.0037886 0.1274119
kable(topGO(GO_treat, ontology = "BP")) ## Biological Process
Table 13: Top GO Terms: Biological Process.
Term Ont N Up Down P.Up P.Down
GO:0009653 anatomical structure morphogenesis BP 1296 31 3 1.00e-07 0.4591457
GO:0030154 cell differentiation BP 1977 39 3 2.00e-07 0.7613987
GO:0048869 cellular developmental process BP 1978 39 3 2.00e-07 0.7617329
GO:0048513 animal organ development BP 1411 32 4 2.00e-07 0.2786936
GO:0009887 animal organ morphogenesis BP 451 17 3 4.00e-07 0.0520817
GO:0048731 system development BP 1909 37 3 9.00e-07 0.7378814
GO:0007275 multicellular organism development BP 2260 41 3 9.00e-07 0.8429937
GO:0048732 gland development BP 233 12 1 1.10e-06 0.3631273
GO:0045229 external encapsulating structure organization BP 156 10 0 1.30e-06 1.0000000
GO:0030198 extracellular matrix organization BP 156 10 0 1.30e-06 1.0000000
GO:0043062 extracellular structure organization BP 156 10 0 1.30e-06 1.0000000
GO:0009888 tissue development BP 931 24 3 1.40e-06 0.2601920
GO:0048856 anatomical structure development BP 2798 46 4 2.10e-06 0.8288663
GO:0032502 developmental process BP 3058 48 6 4.00e-06 0.5566032
GO:0060384 innervation BP 13 4 0 4.10e-06 1.0000000
GO:0032963 collagen metabolic process BP 53 6 0 7.40e-06 1.0000000
GO:0007399 nervous system development BP 1225 26 0 1.66e-05 1.0000000
GO:0016477 cell migration BP 745 19 2 2.66e-05 0.4222843
GO:0050768 negative regulation of neurogenesis BP 68 6 0 3.16e-05 1.0000000
GO:0048608 reproductive structure development BP 138 8 1 3.22e-05 0.2334522
kable(topGO(GO_treat, ontology = "MF")) ## Molecular Function
Table 14: Top GO Terms: Molecular Function.
Term Ont N Up Down P.Up P.Down
GO:0005539 glycosaminoglycan binding MF 86 11 0 0.0000000 1.0000000
GO:0008201 heparin binding MF 63 9 0 0.0000000 1.0000000
GO:0005201 extracellular matrix structural constituent MF 73 9 0 0.0000000 1.0000000
GO:1901681 sulfur compound binding MF 122 10 1 0.0000001 0.2092964
GO:0008131 primary amine oxidase activity MF 3 3 0 0.0000007 1.0000000
GO:0005102 signaling receptor binding MF 566 18 1 0.0000022 0.6725110
GO:0048018 receptor ligand activity MF 97 8 1 0.0000024 0.1701087
GO:0030546 signaling receptor activator activity MF 102 8 1 0.0000035 0.1780900
GO:0030545 signaling receptor regulator activity MF 111 8 1 0.0000065 0.1922741
GO:0005198 structural molecule activity MF 395 14 1 0.0000103 0.5377672
GO:0005178 integrin binding MF 85 7 0 0.0000106 1.0000000
GO:0048407 platelet-derived growth factor binding MF 8 3 0 0.0000384 1.0000000
GO:0016641 oxidoreductase activity, acting on the CH-NH2 group of donors, oxygen as acceptor MF 9 3 0 0.0000573 1.0000000
GO:0052595 aliphatic amine oxidase activity MF 2 2 0 0.0000805 1.0000000
GO:0097621 monoamine oxidase activity MF 2 2 0 0.0000805 1.0000000
GO:0016638 oxidoreductase activity, acting on the CH-NH2 group of donors MF 10 3 0 0.0000813 1.0000000
GO:0050840 extracellular matrix binding MF 38 4 0 0.0003606 1.0000000
GO:0019838 growth factor binding MF 69 5 0 0.0003793 1.0000000
GO:0004252 serine-type endopeptidase activity MF 40 4 1 0.0004404 0.0737926
GO:0030020 extracellular matrix structural constituent conferring tensile strength MF 19 3 0 0.0006188 1.0000000

CAMERA

The 2 most significant gene sets (FDR < 0.05) in the CAMERA output are the Hallmark Epithelial-Mesenchymal Transition (EMT), followed by Hallmark E2F Targets (E2FT).

# View the top 10 gene sets
# The list is ranked by p-value, so, the most significant gene set is at the top
kable(camera.out[1:10, ], digits = 32)
Table 15: Top Gene Sets
NGenes Direction PValue FDR
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 160 Up 8.108529e-05 0.004054265
HALLMARK_E2F_TARGETS 179 Down 1.171711e-03 0.029292780
HALLMARK_HEDGEHOG_SIGNALING 14 Up 5.246642e-03 0.079750264
HALLMARK_CHOLESTEROL_HOMEOSTASIS 58 Up 6.380021e-03 0.079750264
HALLMARK_G2M_CHECKPOINT 183 Down 1.233071e-02 0.123307116
HALLMARK_MYOGENESIS 101 Up 1.738763e-02 0.144896897
HALLMARK_MYC_TARGETS_V1 198 Down 2.712901e-02 0.193778614
HALLMARK_ANGIOGENESIS 18 Up 8.518489e-02 0.532405547
HALLMARK_INTERFERON_ALPHA_RESPONSE 65 Up 1.254412e-01 0.696895612
HALLMARK_UV_RESPONSE_DN 126 Up 1.810185e-01 0.730667720
# Genes significant at FDR < 0.05
table(camera.out$FDR < 0.05) # Only 2 genes were significant at 5% and 4 at 10%
## 
## FALSE  TRUE 
##    48     2

For the CF-Normal contrast, the gene sets (GS) are ranked by increasing LogFC from left to right. The bars, forming the barcode, represent genes within a gene set. Neutral enrichment is represented by the dotted line. In the EMT GS, the worm’s direction was upward toward a +LogFC and became thicker toward the right (Figure 11). However, for the E2FT GS, the direction was downward, with the barcode thickened leftward of the plot (-LogFC) (Figure 12).

# Barcode plot using LFCs.
# HALLMARK EPITHELIAL MESENCHYMAL TRANSITION
barcodeplot(fit.contrast$coefficients[, 1], 
            index=H.index[["HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"]], 
            main="LogFC: HALLMARK EPITHELIAL MESENCHYMAL TRANSITION (UP)",
            labels=c("Normal.Male", "CF.Female"))
Figure 11: Bardcode plot showing the enrichment of the EMT gene set in the CF and Normal samples The x-axis is the LogFC for CF and Normal samples.

Figure 11: Bardcode plot showing the enrichment of the EMT gene set in the CF and Normal samples The x-axis is the LogFC for CF and Normal samples.

# Hallmark E2F TARGETS
barcodeplot(fit.contrast$coefficients[, 1], 
            index=H.index[["HALLMARK_E2F_TARGETS"]], 
            main="LogFC: HALLMARK E2F TARGETS (DOWN)",
            labels=c("Normal.Male", "CF.Female"))
Figure 12: Bardcode plot showing the enrichment of the E2FT gene set in the CF and Normal samples. The x-axis is the LogFC for CF and Normal samples.

Figure 12: Bardcode plot showing the enrichment of the E2FT gene set in the CF and Normal samples. The x-axis is the LogFC for CF and Normal samples.

A bi-directional plot for both gene sets was produced (Figure 13).

# Barcode plot using 2 gene sets
barcodeplot(fit.contrast$coefficients[, 1], 
            index = H.index[["HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"]], 
            index2 = H.index[["HALLMARK_E2F_TARGETS"]],
            main="H. EPITHELIAL MESENCHYMAL TRANSITION & H. E2F TARGETS",
            labels=c("Normal.Male", "CF.Female")
            )
Figure 13: Bardcode plot showing the enrichment of the EMT and E2FT gene sets in the CF and Normal samples.

Figure 13: Bardcode plot showing the enrichment of the EMT and E2FT gene sets in the CF and Normal samples.

Discussion

This transcriptomic analysis illustrates the difference between NHLF and CFHLF. Specifically, the MDS plot (Figure 3) indicated that the differences between CFHLF and NHLF are greater than those within CFHLF/NHLF [1]; the CFHLF genes are 4 times more expressed than NHLF. Additionally, it suggests statistical differences in DE between both groups.

The CF group has a relatively low normalisation factor, suggesting a sizeable presence of highly expressed and up-regulated genes. Disease.1’s log-ratio, especially, was somewhat positively skewed (Figures 4-5) with lots of highly up-regulated DEGS, which explains the low norm.factor (Table 7). Additionally, per the makeConstrasts arguments, a +LogFC means the gene is up-regulated in CF samples, however, a -LogFC denotes high expressions in Normal samples (Tables 9-11).

Top CF up-regulated DEGs, such as XIST, CLU, MXRA5, SCD, and TP53I11 (Table 11), have been associated with lung fibrosis severity [8-12]. Subsequent analysis (Tables 12-14) indicated that these up-regulated genes are strongly correlated with extracellular matrix organization (GO:0030198), extracellular space (GO:0005615), and signaling receptor binding (GO:0005102), suggesting that CFHLF is heavily involved in ECM buildup [1,13].

Furthermore, expectedly, CFHLF up-regulated genes (Figure 11) were significantly (FDR < 0.05) involved in EMT activation, a major driver of lung fibrosis [14,15,16,17]. Conversely, CFHLF significantly down-regulated E2FT pathway (Figure 12). E2FT down-regulation is associated with lung fibrosis inhibition [18,19,20,21], through, for example, USP9Y-assisted deubiquitination and RPS4Y1-mediated immune infiltration [20,22,23] (Table 11). Consequently, USP9Y/RPS4Y1 could regulate CF pathogenesis and have therapeutic effects. Lab validation should be done to confirm the regulatory effects of USP9Y/RPS4Y1.

This study did not control the age difference between CFHLF and NHLF; older patients may have disease stages and fibroblast conditions different from younger patients, so the results may not be comparable. Also, the gender bias and the disease stages were not considered.

References

[1] C. Mazio et al., “Intrinsic Abnormalities of Cystic Fibrosis Airway Connective Tissue Revealed by an In Vitro 3D Stromal Model,” Cells, vol. 9, no. 6, p. 1371, Jun. 2020, doi: 10.3390/cells9061371.

[2] B. Kerem, J. M. Rommens, and J. A. Buchanan, “Identification of the Cystic Fibrosis Gene: Genetic Analysis,” Science, vol. 245, no. 4925, pp. 1437–1437, Sep. 1989, doi: 10.1126/science.245.4925.1437.e.

[3] M. C. Dechecchi, A. Tamanini, and G. Cabrini, “Molecular basis of cystic fibrosis: from bench to bedside,” Annals of Translational Medicine, vol. 6, no. 17, pp. 334–334, Sep. 2018, doi: 10.21037/atm.2018.06.48.

[4] D. A. Stoltz , D. K. Meyerholz , and M. J. Welsh , “Origins of Cystic Fibrosis Lung Disease,” New England Journal of Medicine, vol. 372, no. 16, pp. 1574–1575, Apr. 2015, doi: 10.1056/nejmc1502191.

[5] J. L. Gillan, D. J. Davidson, and R. D. Gray, “Targeting cystic fibrosis inflammation in the age of CFTR modulators: focus on macrophages,” European Respiratory Journal, vol. 57, no. 6, p. 2003502, Dec. 2020, doi: 10.1183/13993003.03502-2020.

[6] A. M. Cantin, D. Hartl, M. W. Konstan, and J. F. Chmiel, “Inflammation in cystic fibrosis lung disease: Pathogenesis and therapy,” Journal of Cystic Fibrosis, vol. 14, no. 4, pp. 419–430, Jul. 2015, doi: 10.1016/j.jcf.2015.03.003.

[7] R. K. Kasam et al., “Fibrocyte accumulation in the lungs of cystic fibrosis patients,” Journal of Cystic Fibrosis, vol. 19, no. 5, pp. 815–822, Sep. 2020, doi: 10.1016/j.jcf.2020.06.011.

[8] A. M. A. Glasgow, C. De Santi, and C. M. Greene, “Non-coding RNA in cystic fibrosis,” Biochemical Society Transactions, vol. 46, no. 3, pp. 619–630, May 2018, doi: 10.1042/bst20170469.

[9] Q. Zhang, Y. Yue, and R. Zheng, “Clusterin as a serum biomarker candidate contributes to the lung fibroblasts activation in chronic obstructive pulmonary disease,” Chinese Medical Journal, vol. 135, no. 9, pp. 1076–1086, Feb. 2022, doi: 10.1097/cm9.0000000000002065.

[10] S. Kamei et al., “Integrative expression analysis identifies a novel interplay between CFTR and linc-SUMF1-2 that involves CF-associated gene dysregulation,” Biochemical and Biophysical Research Communications, vol. 509, no. 2, pp. 521–528, Feb. 2019, doi: 10.1016/j.bbrc.2018.12.152.

[11] Y. Xu, C. Tertilt, A. Krause, L. E. Quadri, R. G. Crystal, and S. Worgall, “Influence of the cystic fibrosis transmembrane conductance regulator on expression of lipid metabolism-related genes in dendritic cells,” Respiratory Research, vol. 10, no. 1, Apr. 2009, doi: 10.1186/1465-9921-10-26.

[12] Q. Wu et al., “p53: A Key Protein That Regulates Pulmonary Fibrosis,” Oxidative Medicine and Cellular Longevity, vol. 2020, pp. 1–13, Nov. 2020, doi: 10.1155/2020/6635794.

[13] M. R. Pinezich et al., “Pathological remodeling of distal lung matrix in end-stage cystic fibrosis patients,” Journal of Cystic Fibrosis, vol. 21, no. 6, pp. 1027–1035, Nov. 2022, doi: 10.1016/j.jcf.2022.04.016.

[14] N. Rout-Pitt, N. Farrow, D. Parsons, and M. Donnelley, “Epithelial mesenchymal transition (EMT): a universal process in lung diseases with implications for cystic fibrosis pathophysiology,” Respiratory Research, vol. 19, no. 1, Jul. 2018, doi: 10.1186/s12931-018-0834-8.

[15] N. Rout-Pitt, B. Boog, A. McCarron, N. Reyne, D. Parsons, and M. Donnelley, “Insights into epithelial-mesenchymal transition from cystic fibrosis rat models,” Journal of Cystic Fibrosis, Sep. 2024, doi: 10.1016/j.jcf.2024.09.003.

[16] M. C. Quaresma et al., “Mutant CFTR Drives TWIST1 mediated epithelial–mesenchymal transition,” Cell Death & Disease, vol. 11, no. 10, Oct. 2020, doi: 10.1038/s41419-020-03119-z.

[17] Z. Li et al., “LncRNA SNHG5 Suppresses Cell Migration and Invasion of Human Lung Adenocarcinoma via Regulation of Epithelial-Mesenchymal Transition,” Journal of Oncology, vol. 2023, pp. 1–13, Jan. 2023, doi: 10.1155/2023/3335959.

[18] Q. Ye et al., “Molecular Regulation of Heme Oxygenase-1 Expression by E2F Transcription Factor 2 in Lung Fibroblast Cells: Relevance to Idiopathic Pulmonary Fibrosis,” Biomolecules, vol. 12, no. 10, p. 1531, Oct. 2022, doi: 10.3390/biom12101531.

[19] T. Yoshihara et al., “Periostin plays a critical role in the cell cycle in lung fibroblasts,” Respiratory Research, vol. 21, no. 1, Jan. 2020, doi: 10.1186/s12931-020-1299-0.

[20] Q. Qian et al., “MicroRNA‐205‐5p targets E2F1 to promote autophagy and inhibit pulmonary fibrosis in silicosis through impairing SKP2‐mediated Beclin1 ubiquitination,” Journal of Cellular and Molecular Medicine, vol. 25, no. 19, pp. 9214–9227, Aug. 2021, doi: 10.1111/jcmm.16825.

[21] F. Wang, P. Li, and F. Li, “Integrated Analysis of a Gene Correlation Network Identifies Critical Regulation of Fibrosis by lncRNAs and TFs in Idiopathic Pulmonary Fibrosis,” BioMed Research International, vol. 2020, pp. 1–14, Jun. 2020, doi: 10.1155/2020/6537462.

[22] H. Zhao et al., “Single-cell RNA Sequencing Analysis Reveals New Immune Disorder Complexities in Hypersplenism,” Frontiers in Immunology, vol. 13, Jul. 2022, doi: 10.3389/fimmu.2022.921900.

[23] L. Xiu, B. Ma, and L. Ding, “Antioncogenic roles of USP9Y and DDX3Y in lung cancer: USP9Y stabilizes DDX3Y by preventing its degradation through deubiquitination,” Acta Histochemica, vol. 126, no. 1, p. 152132, Jan. 2024, doi: 10.1016/j.acthis.2023.152132.