Word count: 2000

Background

Cystic Fibrosis (CF) is a congenital condition resulting from mutations of the cystic fibrosis transmembrane conductance regulator (CFTR) gene [1,2]. CFTR protein is a chloride transporter needed for mucus hydration of the airway surface [3]; dehydration leads to high neutrophil, interleukin, and macrophage counts [4,5].

Although CF affects various organs, most of its histological complications occur in the lungs. Frequent CF-induced infections cause inflammation, then lung fibrosis (increased fibroblast production and excessive extracellular matrix (ECM) materials) [6], leading to lung tissue remodelling. Fibroblasts are associated with CF because they cause the proliferation of fibrocytes, fibrotic lesions and lung mesenchymal cells [7] - ultimately causing up-regulation of pro-fibrotic genes.

Using transcriptomics, differentially expressed genes (DEGs) in CF tissues can be quantified. Also, it is useful for elucidating the biological and molecular mechanisms causing CF onset, and, possibly, targeted therapeutic approaches. This study examines the transcriptomic profiles of CF and normal human lung fibroblasts (NHLF).

Materials and Methods

Data Information

CF- and NHLF RNA-Seq raw counts matrix and series matrix files were downloaded from NCBI GEO (GSE141535). The NHLF were obtained from 47-year-old males, while the CFHLF were from 45-year-old females. The experimental design and GEO numbers are shown in Table 1.

Table 1: GEO Numbers and Experimental Design.
GEO_accession	Sample_title	Sample_name	Cell_type	Gender
GSM4205771	Normal Human Lung Fibroblasts from 2D culture, rep1	Normal.1	Normal	Male
GSM4205772	Normal Human Lung Fibroblasts from 2D culture, rep3	Normal.2	Normal	Male
GSM4205773	Normal Human Lung Fibroblasts from 2D culture, rep3	Normal.3	Normal	Male
GSM4205774	Normal Human Lung Fibroblasts from 2D culture, rep4	Normal.4	Normal	Male
GSM4205775	Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep1	Disease.1	CF	Female
GSM4205776	Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep2	Disease.2	CF	Female
GSM4205777	Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep3	Disease.3	CF	Female
GSM4205778	Disease Human Lung Fibroblasts (Cystic Fibrosis) from 2D culture, rep4	Disease.4	CF	Female

RNA extraction was performed using Roche’s High Pure RNA Tissue Kit. Library preparation was done using QuantSeq 3’mRNA-Seq Kit. Library quality was examined with High sensitivity DNA D1000. Illumina NovaSeq 6000 was used for sequencing. Sequence base calls were converted into FASTQ using bcl2fastq. Gene expressions were determined using htseq-count. The count matrix is shown in Table 2.

Table 2: Count Matrix (10 Rows)
GeneID	GSM4205771	GSM4205772	GSM4205773	GSM4205775	GSM4205776	GSM4205777	GSM4205778
100287102	1	2	0	3	2	2	1
653635	1	7	4	7	6	6	3
102466751	0	0	0	0	0	0	0
107985730	0	1	0	0	0	0	0
100302278	0	0	0	0	0	0	0
645520	0	0	0	0	0	0	0
79501	0	0	0	0	0	0	0
100996442	0	1	1	0	0	0	0
729737	0	0	0	0	2	1	0
102725121	0	0	0	0	0	1	0

Analytical Workflow

Installation of Packages

The following R packages were installed from BiocManager:

edgeR: To create a DGEList object;
RColorBrewer: To get nice colours;
org.Hs.eg.db: For gene annotation; and
limma: For differential expression and gene enrichment analyses.

# Installing packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager", "tidyverse")
BiocManager::install(c("edgeR", "RColorBrewer", "org.Hs.eg.db", "limma"))

The libraries were also loaded:

# Loading libraries
library(knitr)
library(RColorBrewer)
library(edgeR)
library(limma)
library(org.Hs.eg.db)

Loading the Data

The raw sequence count and series matrix were loaded in R and converted into a data frame (Table 2) as shown below.

# Load the raw count matrix data into R
seq_data <- read.table (gzfile ("C:/Users/USER/Desktop/E-portfolio-Main/GSE141535_raw_counts_GRCh38.p13_NCBI.tsv.gz"), header = TRUE)
# Load the series matrix data (sample info) downloaded from NCBI into R
sampleinfo <- read.delim ("C:/Users/USER/Desktop/E-portfolio-Main/GSE141535_series_matrix_2.txt", header = T, stringsAsFactors = T)

sampleinfo = subset (sampleinfo, select = -c(X)) ## To remove mislabelled column header
sampleinfo <- sampleinfo [-c(9:23), ] ## To remove empty cells in the data frame

Data Processing

The data was adjusted to remove the GeneID column, creating a new matrix.

# To create a new matrix containing only the count data from the 8 samples, removing the first column (GeneID)
count_data <- seq_data [, -1]

The GeneIDs were converted to row names and the columns were renamed.

# To turn the GeneID names (in the first column) into row names
rownames(count_data) <- seq_data [, 1] ## The GeneIDs are now the row names.

# The column names are actually the sample names,
# So the columns were renamed to reveal information from the Sample Info, such as the condition and replicate number, about each sample.
colnames(count_data) <- c("Normal.1", "Normal.2", "Normal.3", "Normal.4", "Disease.1", "Disease.2", "Disease.3", "Disease.4")
# The data now reflects information from the cell types.
kable(head(count_data))

Table 3: New Read Counts Data Frame.
	Normal.1	Normal.2	Normal.3	Disease.1	Disease.2	Disease.3	Disease.4
100287102	1	2	0	3	2	2	1
653635	1	7	4	7	6	6	3
102466751	0	0	0	0	0	0	0
107985730	0	1	0	0	0	0	0
100302278	0	0	0	0	0	0	0
645520	0	0	0	0	0	0	0

# Note: The column names should bear the same name as the sample name in the Sample Info data
# The code below was ran to check if the Sample Name in sampleinfo and count_data are in the same order
table(colnames(count_data)==sampleinfo$Sample_name)

## 
## TRUE 
##    8

Create a DGEList from the Count Matrix

A DGEList object, containing the sample information, and read counts, was created. The DGEList function in edgeR allows for easy manipulation and sub-setting of the DGEList object:

# Convert read counts to a DGEList object
dge_list <- DGEList(count_data)

# To extract grouping information from "sampleinfo" 
group_info <- paste(sampleinfo$Cell_type,sampleinfo$Gender,sep=".")

# Convert "group_info" into a factor
group_info <- factor(group_info)

# View the values
table(group_info)

## group_info
##   CF.Female Normal.Male 
##           4           4

The disease condition and gender were not treated as separate factors in the original experiment, so, both factors were regarded as one in the analysis (Table 4).

# Include the factors into the DGEList object
dge_list$samples$group <- group_info

# View the Samples slot of the DGEList object
kable(dge_list$samples)

Table 4: Groups, Library Size, and Normalisation Factors of each Sample.
	group	lib.size	norm.factors
Normal.1	Normal.Male	955342	1
Normal.2	Normal.Male	2620178	1
Normal.3	Normal.Male	1498839	1
Normal.4	Normal.Male	1070215	1
Disease.1	CF.Female	1181443	1
Disease.2	CF.Female	1304686	1
Disease.3	CF.Female	1233521	1
Disease.4	CF.Female	1531346	1

Gene Annotation

Gene names and gene symbols associated with the respective Entrez IDs were used to annotate the DGEList object via the org.Hs.eg.db package.

# Adding annotation, such as Gene Symbol, Gene Name, and ENTREZID, to DGEList object
anno_data <- select(org.Hs.eg.db, keys=rownames(dge_list$counts), columns=c("ENTREZID", "SYMBOL","GENENAME"))

## 'select()' returned 1:1 mapping between keys and columns

# To check that the ENTREZIDs match the dge_list$count row names
table(anno_data$ENTREZID==rownames(dge_list$counts))

## 
##  TRUE 
## 39376

# Add the annotation information into a new slot (called genes) in the DGEList object
dge_list$genes <- anno_data

Filtering Genes with Low Expression

Genes with low expression were removed to avoid false discovery.Hence, they should be removed. To achieve this, the filterByExpr function in edgeR was used because it retains only genes that possess a minimum count in multiple samples.

# Filter out gene with low expression using filterByExpr
exprs_kept <- filterByExpr(dge_list)
dge_list_2 <- dge_list[exprs_kept,]

# Compare the value of the new DGEList object to the previous one.
list(dim(dge_list), dim(dge_list_2))

## [[1]]
## [1] 39376     8
## 
## [[2]]
## [1] 9460    8

# 9,460 genes (24%) were kept out of 39,376

Library Size

This aspect is a quality control step. The library sizes of the samples were examined to identify differences in their distributions (Figures 1-2).

## Quality Control
## Number of reads in the filtered DGE_list_2 data
dge_list_2$samples$lib.size

## [1]  955342 2620178 1498839 1070215 1181443 1304686 1233521 1531346

Multi-dimensional scaling (MDS) Plots

An MDS plot was generated to examine the source of variation in the samples, making it easy to detect outliers (Figure 3).

# multidimensional scaling plots
plotMDS(dge_list_2)

Normalisation to Remove Composition Bias

Normalisation was done using calcNormFactors function. The DEG profiles of each sample were closely examined using the plotMD function, comparing samples with the highest and lowest norm.factors before and after normalisation correction..

# Normalisation of DGE_list_2
dge_list_2 <- calcNormFactors(dge_list_2)

Design Matrix

A design matrix (Table 5) to compare the Normal and CF groups was created for differential expression (DE) analysis using limma.

# create a design matrix with no intercept using the group_info variable
matrix <- model.matrix(~ 0 + group_info)
# Rename the matrix columns to bear the grouping information, CF/Normal
colnames(matrix) <- levels(group_info)
# View matrix
kable (matrix) ## The rows represents the samples, while the columns represent the coefficient of comparison  - 1 for present, 0 for absent- between CF and Normal groups.

Table 5: Design Matrix Comparing CF to Normal.
CF.Female	Normal.Male
0	1
0	1
0	1
0	1
1	0
1	0
1	0
1	0

Voom Transformation

Using the design matrix and DGEList object, the voom function created an EList object with adjusted library sizes and normalised log2 counts.

# Use voom function to transform the data, with the normalised dge_list_2 and matrix as the objects
voom_trans <- voom(dge_list_2, matrix, plot=F)

DE Analysis

The voom-transformed object was tested for DE by fitting it to a linear model using the lmFit function.

# Now to test for differential expression using the lmfit function to fit a linear model
fit_model <- lmFit(voom_trans)

The makeContrasts function was used to determine DEGs in the CF and Normal groups, with the null hypothesis defined as “CF.Female - Normal.Male = 0”.

# Test for difference between groups
contrast.matrix <- makeContrasts(CysticVsNormal=CF.Female - Normal.Male,levels=group_info)
kable(contrast.matrix)

Table 6: Contrast Matrix for DE Analysis.
	CysticVsNormal
CF.Female	1
Normal.Male	-1

Using the contrasts.fit function, the contrast matrix was applied to the voom-transformed object to obtain relevant parameters for the CF-Normal comparison. Afterwards, the eBayes function was called to apply empirical Bayes shrinkage to the contrast.

# Using the constrasts.fit function in Limma, the constrast.matrix included the fit_model object
fit.contrast <- contrasts.fit(fit_model, contrast.matrix)
# Next, the eBayes function was applied to the fit.contrast object to calculate the p-values and t-statistics, and shrink the variances using empirical Bayes.
fit.contrast <-  eBayes(fit.contrast)

Using decideTests function, the DEGs for the contrast were produced and visualised. The top DEGs were stored in an object, with the p-value capped at 0.05 and adjusted.

# Created summary data of DEG for contrasts using 'decideTest' function of Limma
summary.fit <- decideTests(fit.contrast)
# Created an object to store the top 100 DEGs 
top.gene <- topTable(fit.contrast,coef=1,number=100,genelist=fit.contrast$genes, p.value = 0.05, adjust.method = "BH", sort.by = "p") ## p-value was adjusted using the Benjamini-Hochberg method

Testing Relative to a Threshold (TREAT)

TREAT analysis was performed to rank DEGs by their Log fold-change (LFC). A LFC of 1, meaning double expression, was applied to the fit.contrast object to determine which DEGs are more biologically relevant. The TREAT results were sorted by their p-values, then stored and visualised.

# Testing relative to a threshold (TREAT) using a logFC of 1
## Performed TREAT analysis to rank the genes per their logFCs, re-estimating the p-values and t-statistic
fit.treat <- treat(fit.contrast, lfc =1) 
## Computed summary of the TREAT results
results.treat <- decideTests(fit.treat)
## Created an object to store the top 10 genes post-TREAT analysis
treat.top.genes <- topTable(fit.treat,coef=1,sort.by="p") ## By default, only the top 10 DEGs will be stored in the treat.top.genes data

Gene Ontology (GO) Enrichment Analysis

GO analysis (GOANA) was performed to find over-represented or enriched GO terms in the TREAT output of the CF-Normal comparison using the goana function (FDR < 0.05).

# Gene Ontology of TREAT results
GO_treat <- goana(fit.treat, coef = "CysticVsNormal", species = "Hs") ## FDR < 0.05 by default

Additionally, CAMERA was used to check for enriched gene sets using the Hallmark gene sets from MSigDB. The ids2indices function was used to map the Entrez ID of the voom-transformed object to the gene sets. An inter-gene correlation of 0.05 was used, as it allows for biological interpretation and low FDR.

# Loaded the Hallmark Human gene data set from the Broad Institute’s Molecular Signatures Database (MSigDB) into R
Hs.H <- readRDS("C:/Users/USER/Desktop/E-portfolio-Main/Rmd/Hs.h.all.v7.1.entrez.rds")
# The Hallmark Gene Set is a vector object containing gene sets and each vector has ENTREZ IDs for the genes in the gene set
## The Entrez IDs contained in the vectors of the H.indices object is exactly what is contained in the row names of the voom_trans object. The ids2indices function was be used to map the IDs between the gene list in the vector and voom objects.
H.index <- ids2indices(Hs.H, rownames(voom_trans))
# Used Camera function to get correlation for each gene set
## The inter.gene.cor argument was kept low at 0.05
camera.out <- camera(voom_trans,index = H.index, design = matrix, contrast = contrast.matrix[,1],inter.gene.cor = 0.05)

Results

Quality Control

Library Size Distribution

The samples’ counts were not normally distributed (Figure 1), so the counts were converted to the log2 scale, correcting the library size differences, using the cpm function (Figure 2).

# To create a barplot of the library sizes for each sample
barplot(dge_list_2$samples$lib.size/1e06, names=colnames(dge_list_2),las=2, ann=FALSE, cex.names=0.75)
mtext(side = 1, text = "Samples", line = 4)
mtext(side = 2, text = "Library size in Millions", line = 3)

Figure 1: Barplot of Library Sizes.

The LogCPM distribution of the samples had little or no difference. No sample was far above or below the red line.

# Get log counts of DGEList object
dge_logcounts <- cpm (dge_list_2, log = T)
# Check distributions of samples using boxplots
boxplot(dge_logcounts, xlab="", ylab="Log2 Counts Per Million",las=2)
# Include a horizontal line that represents to the median logCPM
abline(h=median(dge_logcounts),col="red")

Figure 2: Boxplots of LogCPMs (Unnormalised).

MDS Plot

The MDS plot revealed that the samples cluster by cell type, Normal or Disease (Figure 3); that is, the cell types are the greatest variation source. The distance between the Disease and the Normal replicates on the x-axis are 2 units, corresponding to a 4-fold-change between the samples. No inter-group clustering was observed.

# To reset factor to existing levels in the data frame
sampleinfo$Cell_type <- factor (sampleinfo$Cell_type)
# Create a new value containing color-coded characters for the levels
color.cell.type <- c("blue", "red")[sampleinfo$Cell_type]
# Now infuse the color code into the MDS plot
plotMDS(dge_list_2,col=color.cell.type, xlab = "Leading LogFC Dim1", ylab = "Leading LogFC Dim2")
# Let's add a legend to the plot so we know which colours correspond to which cell type
legend("top",fill=c("blue","red"),legend=levels(sampleinfo$Cell_type))

Figure 3: MDS Plot of Cystic Fibrosis (CF) and Normal Cell Type Samples.

Normalisation for Composition Bias

The CF group had a relatively low norm.factor (0.96-1.01) (Table 7). Normal.4 had the highest norm.factors (1.03), while Disease.1 had the lowest (0.96).

# View the normalisation factors of the samples
kable(dge_list_2$samples)

Table 7: Unbiased Normalisation Factors.
	group	lib.size	norm.factors
Normal.1	Normal.Male	955342	1.0177303
Normal.2	Normal.Male	2620178	0.9743332
Normal.3	Normal.Male	1498839	1.0216918
Normal.4	Normal.Male	1070215	1.0309631
Disease.1	CF.Female	1181443	0.9649097
Disease.2	CF.Female	1304686	1.0127833
Disease.3	CF.Female	1233521	1.0010146
Disease.4	CF.Female	1531346	0.9787079

Although the genes in the logcount data clustered around the zero Log-ratio (Figure 4), the normalised data showed a more uniform clustering (Figure 5). Additionally, the CF group had more gene expressions with positive log-ratios.

# To compare the logcounts and compositional bias normalisation of Normal.4 and Disease.1
# The logcount-normalised data was used here
par(mfrow=c(1,2))
plotMD(dge_logcounts,column = 4)
abline(h=0,col="red", lty=2, lwd=2)
plotMD(dge_logcounts,column = 5)
abline(h=0,col="red", lty=2, lwd=2)

Figure 4: Comparing the Log Count Data Between Normal.4 and Disease.1.

# The compositional bias-normalised data was used here
par(mfrow=c(1,2))
plotMD(dge_list_2,column = 4)
abline(h=0,col="red",lty=2, lwd=2)
plotMD(dge_list_2,column = 5)
abline(h=0,col="red",lty=2, lwd=2)

Figure 5: Comparing the Normalised Data Between Normal.4 and Disease.1.

Voom Transformation

The mean-variance trend showed the biological variation of the genes in all the samples (Figure 6). The mean-variance trend decreased progressively and asymptotically with abundance.

# Use voom function to transform the data, with the normalised dge_list_2 and matrix as the objects
voom_trans <- voom(dge_list_2, matrix, plot=T)

Figure 6: Scatterplot of Square-root Standard Deviation Against Normalised Log2 Counts. The red line represents the mean-variance trend.

The library sizes of the voom-transformed data have been adjusted, with the boxes closer to the median LogCPM, unlike the unnormalised data (Figure 7). Furthermore, the voom-transformed data had more outliers less than zero LogCPM and lower extremes.

# Compare the unnormalised logcounts (logCPM) with the voom logCPM
par(mfrow=c(1,2))
boxplot(dge_logcounts, xlab="", ylab="Log2 Counts Per Million",las=2,main="Unnormalised LogCPM")
# Insert a red line to represent the median logCPM of the unnormalised DGE List
abline(h=median(dge_logcounts),col="red")
boxplot(voom_trans$E, xlab="", ylab="Log2 Counts Per Million",las=2,main="Voom-Transformed logCPM")
# Insert a red line to represent the median logCPM of the voom-transformed DGE List
abline(h=median(voom_trans$E),col="red")

Figure 7: Comparing the Unnormalised LogCPM with The Voom-transformed LogCPM.

DE Results

There were 918 DEGs (Table 8): 419 were down-regulated, 499 were up-regulated, and 8,542 were not DE.

# Summary of the DEGs for the comparison
kable(summary(summary.fit))

Table 8: DEGs.
	CysticVsNormal
Down	419
NotSig	8542
Up	499

The top gene in Table 9 (XIST) had expressions ranging from 7.20 to 7.46 in the CF samples (Figure 8).

# View the top 10 genes
kable(head (top.gene, n = 10), digits = 32)

Table 9: Top DEGs.
	ENTREZID	SYMBOL	GENENAME	logFC	AveExpr	t	P.Value	adj.P.Val	B
7503	7503	XIST	X inactive specific transcript	8.848339	2.905534	16.68974	2.900737e-11	7.977232e-08	13.07372
387066	387066	SNHG5	small nucleolar RNA host gene 5	3.308967	7.618322	16.41473	3.697172e-11	7.977232e-08	15.98093
9383	9383	TSIX	TSIX transcript, XIST antisense RNA	8.505410	2.735071	16.25616	4.259378e-11	7.977232e-08	12.84729
6192	6192	RPS4Y1	ribosomal protein S4 Y-linked 1	-9.395521	3.317442	-16.25523	4.262923e-11	7.977232e-08	12.87789
1191	1191	CLU	clusterin	2.906360	9.577039	16.21474	4.420702e-11	7.977232e-08	15.79160
715	715	C1R	complement C1r	2.439766	7.503698	16.06515	5.059555e-11	7.977232e-08	15.67444
6319	6319	SCD	stearoyl-CoA desaturase	3.386719	5.834727	13.52253	6.058330e-10	8.187401e-07	13.17015
1277	1277	COL1A1	collagen type I alpha 1 chain	1.775692	11.029431	12.50043	1.843980e-09	2.180506e-06	11.92301
125	125	ADH1B	alcohol dehydrogenase 1B (class I), beta polypeptide	3.440099	5.081107	12.26592	2.406117e-09	2.529096e-06	11.75897
4500	4500	MT1L	metallothionein 1L (pseudogene)	-1.887294	7.693046	-12.02138	3.189964e-09	2.798492e-06	11.50982

# The expression levels of the topmost up-regulated gene in the table above.
nice.colors <- brewer.pal(7,name="Dark2") # to add nice colors
stripchart(voom_trans$E["7503",]~group_info,vertical=TRUE,las=2,cex.axis=0.8,pch=16,cex=1.3,col=nice.colors,method="jitter",ylab="Normalised log2 expression", main="XIST")

Figure 8: Scatterplot Showing the Normalised Log2 Expressions of XIST gene in CF and Normal Samples.

The top DEGs were visualised using plotMD, while volcanoplot was used to highlight the symbols of the top 10 up- and down-regulated DEGs (Figure 9).

# Constructing MD Plot and Volcano Plot after testing for Differential expression 
# Use out from decideTests to show significant genes
# Visualise using plotMD function
plotMD(fit.contrast, coef = 1, status = summary.fit[,"CysticVsNormal"], values = c(1, -1), hl.col = c("red","blue"), xlab="Average Log-expression", ylab="Log Fold-Change", cex=1.0)
legend("topright", title=paste("LogFC"), legend=c("Up", "Down", "No-DE"), pch=19, col=c("red","blue", "black"))
abline(h=0)
# Using volcano plot, the top DE genes and gene symbols will be highlighted
# Here the top 10 genes are highlighted
volcanoplot(fit.contrast,coef=1,highlight=10,names=fit.contrast$genes$SYMBOL, main="CysticVsNormal", xlab = "Log2 Fold-Change", ylab = "-Log10 (P-value)", cex= 0.4)

Figure 9: MD Plot (left) showing the LogFC and average expression of each gene. Volcano Plot (right) showing the highly expressed up- and down-regulated genes.

TREAT

After TREAT, only 103 DEGs were detected at LogFC of 1 - more than 88% decrease (Table 10). 85 and 18 DEGs were up- and down-regulated, respectively. Consequently, only DEGs with large LogFCs were identified (Figure 10).

# Summary of the TREAT results
kable(summary(results.treat))

Table 10: DEGs (TREAT).
	CysticVsNormal
Down	18
NotSig	9357
Up	85

The DEG ranking is now different. Notably, RPS4Y1 replaced SNHG5 as the second most DEG, due to their p-values (Table 11).

kable(treat.top.genes, digits = 32)

Table 11: Top DEGs (TREAT).
	ENTREZID	SYMBOL	GENENAME	logFC	AveExpr	t	P.Value	adj.P.Val
7503	7503	XIST	X inactive specific transcript	8.848339	2.905534	14.803540	8.576286e-11	4.236456e-07
6192	6192	RPS4Y1	ribosomal protein S4 Y-linked 1	-9.395521	3.317442	-14.525124	1.136275e-10	4.236456e-07
9383	9383	TSIX	TSIX transcript, XIST antisense RNA	8.505410	2.735071	14.344883	1.343485e-10	4.236456e-07
387066	387066	SNHG5	small nucleolar RNA host gene 5	3.308967	7.618322	11.454048	3.125797e-09	7.392510e-06
1191	1191	CLU	clusterin	2.906360	9.577039	10.635684	8.658790e-09	1.638243e-05
25878	25878	MXRA5	matrix remodeling associated 5	7.132167	2.049141	10.184420	1.590158e-08	2.507149e-05
6319	6319	SCD	stearoyl-CoA desaturase	3.386719	5.834727	9.529721	3.796179e-08	4.807616e-05
715	715	C1R	complement C1r	2.439766	7.503698	9.480443	4.065638e-08	4.807616e-05
9537	9537	TP53I11	tumor protein p53 inducible protein 11	4.598240	4.737967	9.187043	6.169558e-08	6.311704e-05
8287	8287	USP9Y	ubiquitin specific peptidase 9 Y-linked	-6.011449	1.618448	-9.112061	6.924605e-08	6.311704e-05

# View the TREAT results using a MD Plot
plotMD(fit.treat,coef=1,status=results.treat[,"CysticVsNormal"], values=c(1,-1), hl.col=c("red","blue"), xlab="Average Log-expression", ylab="Log Fold Change", cex=1.0)
legend("topright", title=paste("LogFC"), legend=c("Up", "Down", "No-DE"), pch=19, col=c("red","blue", "black"))
abline(h=0,col="grey")
# Here the top 10 genes are highlighted
volcanoplot(fit.treat,coef=1, highlight=10, names=fit.treat$genes$SYMBOL, main="CysticVsNormal", xlab = "Log2 Fold Change", ylab = "-Log10 (P-value)", cex= 0.4)

Figure 10: Post-TREAT MD Plot (left) showing DEGs with fold-change greater than 1. Post-TREAT Volcano Plot (right) showing the highly expressed up- and down-regulated genes.

Enrichment Analysis

GOANA

There were 18,835 enriched GO terms associated with molecular function (MF), cellular component (CC), and biological process (BP). In CC (Table 12), cell periphery had the most genes (2,187) and 40 were significantly up-regulated; in BP (Table 13), developmental process had the most genes (3,058) and 48 were significantly up-regulated; and in MF (Table 14), signaling receptor binding had the most genes (566) and 18 were significantly up-regulated. The table is arranged by the least P.Up and P.Down values.

# Used topGO to view the GO terms with the most significance and enrichment
kable(topGO(GO_treat, ontology = "CC")) ## Cellular Component

Table 12: Top GO Terms: Celluar Component.
	Term	Ont	N	Up	Down	P.Up	P.Down
GO:0030312	external encapsulating structure	CC	204	20	1	0.0000000	0.3259219
GO:0031012	extracellular matrix	CC	204	20	1	0.0000000	0.3259219
GO:0062023	collagen-containing extracellular matrix	CC	175	17	1	0.0000000	0.2866704
GO:0005576	extracellular region	CC	1712	40	4	0.0000000	0.4189050
GO:0005615	extracellular space	CC	1484	35	3	0.0000000	0.5560731
GO:0071944	cell periphery	CC	2187	40	6	0.0000011	0.2233059
GO:0098643	banded collagen fibril	CC	8	3	0	0.0000384	1.0000000
GO:0005583	fibrillar collagen trimer	CC	8	3	0	0.0000384	1.0000000
GO:0031091	platelet alpha granule	CC	45	5	0	0.0000489	1.0000000
GO:0005581	collagen trimer	CC	30	4	0	0.0001414	1.0000000
GO:0098644	complex of collagen trimers	CC	12	3	0	0.0001471	1.0000000
GO:0009986	cell surface	CC	280	9	2	0.0009069	0.0985845
GO:0005604	basement membrane	CC	49	4	0	0.0009599	1.0000000
GO:0097440	apical dendrite	CC	6	2	0	0.0011787	1.0000000
GO:0070821	tertiary granule membrane	CC	30	0	2	1.0000000	0.0014531
GO:0099512	supramolecular fiber	CC	504	12	0	0.0017842	1.0000000
GO:0099081	supramolecular polymer	CC	507	12	0	0.0018757	1.0000000
GO:0042583	chromaffin granule	CC	9	2	0	0.0027796	1.0000000
GO:0035579	specific granule membrane	CC	47	0	2	1.0000000	0.0035423
GO:0042383	sarcolemma	CC	71	4	1	0.0037886	0.1274119

kable(topGO(GO_treat, ontology = "BP")) ## Biological Process

Table 13: Top GO Terms: Biological Process.
	Term	Ont	N	Up	Down	P.Up	P.Down
GO:0009653	anatomical structure morphogenesis	BP	1296	31	3	1.00e-07	0.4591457
GO:0030154	cell differentiation	BP	1977	39	3	2.00e-07	0.7613987
GO:0048869	cellular developmental process	BP	1978	39	3	2.00e-07	0.7617329
GO:0048513	animal organ development	BP	1411	32	4	2.00e-07	0.2786936
GO:0009887	animal organ morphogenesis	BP	451	17	3	4.00e-07	0.0520817
GO:0048731	system development	BP	1909	37	3	9.00e-07	0.7378814
GO:0007275	multicellular organism development	BP	2260	41	3	9.00e-07	0.8429937
GO:0048732	gland development	BP	233	12	1	1.10e-06	0.3631273
GO:0045229	external encapsulating structure organization	BP	156	10	0	1.30e-06	1.0000000
GO:0030198	extracellular matrix organization	BP	156	10	0	1.30e-06	1.0000000
GO:0043062	extracellular structure organization	BP	156	10	0	1.30e-06	1.0000000
GO:0009888	tissue development	BP	931	24	3	1.40e-06	0.2601920
GO:0048856	anatomical structure development	BP	2798	46	4	2.10e-06	0.8288663
GO:0032502	developmental process	BP	3058	48	6	4.00e-06	0.5566032
GO:0060384	innervation	BP	13	4	0	4.10e-06	1.0000000
GO:0032963	collagen metabolic process	BP	53	6	0	7.40e-06	1.0000000
GO:0007399	nervous system development	BP	1225	26	0	1.66e-05	1.0000000
GO:0016477	cell migration	BP	745	19	2	2.66e-05	0.4222843
GO:0050768	negative regulation of neurogenesis	BP	68	6	0	3.16e-05	1.0000000
GO:0048608	reproductive structure development	BP	138	8	1	3.22e-05	0.2334522

kable(topGO(GO_treat, ontology = "MF")) ## Molecular Function

Table 14: Top GO Terms: Molecular Function.
	Term	Ont	N	Up	Down	P.Up	P.Down
GO:0005539	glycosaminoglycan binding	MF	86	11	0	0.0000000	1.0000000
GO:0008201	heparin binding	MF	63	9	0	0.0000000	1.0000000
GO:0005201	extracellular matrix structural constituent	MF	73	9	0	0.0000000	1.0000000
GO:1901681	sulfur compound binding	MF	122	10	1	0.0000001	0.2092964
GO:0008131	primary amine oxidase activity	MF	3	3	0	0.0000007	1.0000000
GO:0005102	signaling receptor binding	MF	566	18	1	0.0000022	0.6725110
GO:0048018	receptor ligand activity	MF	97	8	1	0.0000024	0.1701087
GO:0030546	signaling receptor activator activity	MF	102	8	1	0.0000035	0.1780900
GO:0030545	signaling receptor regulator activity	MF	111	8	1	0.0000065	0.1922741
GO:0005198	structural molecule activity	MF	395	14	1	0.0000103	0.5377672
GO:0005178	integrin binding	MF	85	7	0	0.0000106	1.0000000
GO:0048407	platelet-derived growth factor binding	MF	8	3	0	0.0000384	1.0000000
GO:0016641	oxidoreductase activity, acting on the CH-NH2 group of donors, oxygen as acceptor	MF	9	3	0	0.0000573	1.0000000
GO:0052595	aliphatic amine oxidase activity	MF	2	2	0	0.0000805	1.0000000
GO:0097621	monoamine oxidase activity	MF	2	2	0	0.0000805	1.0000000
GO:0016638	oxidoreductase activity, acting on the CH-NH2 group of donors	MF	10	3	0	0.0000813	1.0000000
GO:0050840	extracellular matrix binding	MF	38	4	0	0.0003606	1.0000000
GO:0019838	growth factor binding	MF	69	5	0	0.0003793	1.0000000
GO:0004252	serine-type endopeptidase activity	MF	40	4	1	0.0004404	0.0737926
GO:0030020	extracellular matrix structural constituent conferring tensile strength	MF	19	3	0	0.0006188	1.0000000

CAMERA

The 2 most significant gene sets (FDR < 0.05) in the CAMERA output are the Hallmark Epithelial-Mesenchymal Transition (EMT), followed by Hallmark E2F Targets (E2FT).

# View the top 10 gene sets
# The list is ranked by p-value, so, the most significant gene set is at the top
kable(camera.out[1:10, ], digits = 32)

Table 15: Top Gene Sets
	NGenes	Direction	PValue	FDR
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION	160	Up	8.108529e-05	0.004054265
HALLMARK_E2F_TARGETS	179	Down	1.171711e-03	0.029292780
HALLMARK_HEDGEHOG_SIGNALING	14	Up	5.246642e-03	0.079750264
HALLMARK_CHOLESTEROL_HOMEOSTASIS	58	Up	6.380021e-03	0.079750264
HALLMARK_G2M_CHECKPOINT	183	Down	1.233071e-02	0.123307116
HALLMARK_MYOGENESIS	101	Up	1.738763e-02	0.144896897
HALLMARK_MYC_TARGETS_V1	198	Down	2.712901e-02	0.193778614
HALLMARK_ANGIOGENESIS	18	Up	8.518489e-02	0.532405547
HALLMARK_INTERFERON_ALPHA_RESPONSE	65	Up	1.254412e-01	0.696895612
HALLMARK_UV_RESPONSE_DN	126	Up	1.810185e-01	0.730667720

# Genes significant at FDR < 0.05
table(camera.out$FDR < 0.05) # Only 2 genes were significant at 5% and 4 at 10%

## 
## FALSE  TRUE 
##    48     2

For the CF-Normal contrast, the gene sets (GS) are ranked by increasing LogFC from left to right. The bars, forming the barcode, represent genes within a gene set. Neutral enrichment is represented by the dotted line. In the EMT GS, the worm’s direction was upward toward a +LogFC and became thicker toward the right (Figure 11). However, for the E2FT GS, the direction was downward, with the barcode thickened leftward of the plot (-LogFC) (Figure 12).

# Barcode plot using LFCs.
# HALLMARK EPITHELIAL MESENCHYMAL TRANSITION
barcodeplot(fit.contrast$coefficients[, 1], 
            index=H.index[["HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"]], 
            main="LogFC: HALLMARK EPITHELIAL MESENCHYMAL TRANSITION (UP)",
            labels=c("Normal.Male", "CF.Female"))

Figure 11: Bardcode plot showing the enrichment of the EMT gene set in the CF and Normal samples The x-axis is the LogFC for CF and Normal samples.

# Hallmark E2F TARGETS
barcodeplot(fit.contrast$coefficients[, 1], 
            index=H.index[["HALLMARK_E2F_TARGETS"]], 
            main="LogFC: HALLMARK E2F TARGETS (DOWN)",
            labels=c("Normal.Male", "CF.Female"))

Figure 12: Bardcode plot showing the enrichment of the E2FT gene set in the CF and Normal samples. The x-axis is the LogFC for CF and Normal samples.

A bi-directional plot for both gene sets was produced (Figure 13).

# Barcode plot using 2 gene sets
barcodeplot(fit.contrast$coefficients[, 1], 
            index = H.index[["HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"]], 
            index2 = H.index[["HALLMARK_E2F_TARGETS"]],
            main="H. EPITHELIAL MESENCHYMAL TRANSITION & H. E2F TARGETS",
            labels=c("Normal.Male", "CF.Female")
            )

Figure 13: Bardcode plot showing the enrichment of the EMT and E2FT gene sets in the CF and Normal samples.

Discussion

This transcriptomic analysis illustrates the difference between NHLF and CFHLF. Specifically, the MDS plot (Figure 3) indicated that the differences between CFHLF and NHLF are greater than those within CFHLF/NHLF [1]; the CFHLF genes are 4 times more expressed than NHLF. Additionally, it suggests statistical differences in DE between both groups.

The CF group has a relatively low normalisation factor, suggesting a sizeable presence of highly expressed and up-regulated genes. Disease.1’s log-ratio, especially, was somewhat positively skewed (Figures 4-5) with lots of highly up-regulated DEGS, which explains the low norm.factor (Table 7). Additionally, per the makeConstrasts arguments, a +LogFC means the gene is up-regulated in CF samples, however, a -LogFC denotes high expressions in Normal samples (Tables 9-11).

Top CF up-regulated DEGs, such as XIST, CLU, MXRA5, SCD, and TP53I11 (Table 11), have been associated with lung fibrosis severity [8-12]. Subsequent analysis (Tables 12-14) indicated that these up-regulated genes are strongly correlated with extracellular matrix organization (GO:0030198), extracellular space (GO:0005615), and signaling receptor binding (GO:0005102), suggesting that CFHLF is heavily involved in ECM buildup [1,13].

Furthermore, expectedly, CFHLF up-regulated genes (Figure 11) were significantly (FDR < 0.05) involved in EMT activation, a major driver of lung fibrosis [14,15,16,17]. Conversely, CFHLF significantly down-regulated E2FT pathway (Figure 12). E2FT down-regulation is associated with lung fibrosis inhibition [18,19,20,21], through, for example, USP9Y-assisted deubiquitination and RPS4Y1-mediated immune infiltration [20,22,23] (Table 11). Consequently, USP9Y/RPS4Y1 could regulate CF pathogenesis and have therapeutic effects. Lab validation should be done to confirm the regulatory effects of USP9Y/RPS4Y1.

This study did not control the age difference between CFHLF and NHLF; older patients may have disease stages and fibroblast conditions different from younger patients, so the results may not be comparable. Also, the gender bias and the disease stages were not considered.

References

[1] C. Mazio et al., “Intrinsic Abnormalities of Cystic Fibrosis Airway Connective Tissue Revealed by an In Vitro 3D Stromal Model,” Cells, vol. 9, no. 6, p. 1371, Jun. 2020, doi: 10.3390/cells9061371.

[2] B. Kerem, J. M. Rommens, and J. A. Buchanan, “Identification of the Cystic Fibrosis Gene: Genetic Analysis,” Science, vol. 245, no. 4925, pp. 1437–1437, Sep. 1989, doi: 10.1126/science.245.4925.1437.e.

[3] M. C. Dechecchi, A. Tamanini, and G. Cabrini, “Molecular basis of cystic fibrosis: from bench to bedside,” Annals of Translational Medicine, vol. 6, no. 17, pp. 334–334, Sep. 2018, doi: 10.21037/atm.2018.06.48.

[4] D. A. Stoltz , D. K. Meyerholz , and M. J. Welsh , “Origins of Cystic Fibrosis Lung Disease,” New England Journal of Medicine, vol. 372, no. 16, pp. 1574–1575, Apr. 2015, doi: 10.1056/nejmc1502191.

[5] J. L. Gillan, D. J. Davidson, and R. D. Gray, “Targeting cystic fibrosis inflammation in the age of CFTR modulators: focus on macrophages,” European Respiratory Journal, vol. 57, no. 6, p. 2003502, Dec. 2020, doi: 10.1183/13993003.03502-2020.

[6] A. M. Cantin, D. Hartl, M. W. Konstan, and J. F. Chmiel, “Inflammation in cystic fibrosis lung disease: Pathogenesis and therapy,” Journal of Cystic Fibrosis, vol. 14, no. 4, pp. 419–430, Jul. 2015, doi: 10.1016/j.jcf.2015.03.003.

[7] R. K. Kasam et al., “Fibrocyte accumulation in the lungs of cystic fibrosis patients,” Journal of Cystic Fibrosis, vol. 19, no. 5, pp. 815–822, Sep. 2020, doi: 10.1016/j.jcf.2020.06.011.

[8] A. M. A. Glasgow, C. De Santi, and C. M. Greene, “Non-coding RNA in cystic fibrosis,” Biochemical Society Transactions, vol. 46, no. 3, pp. 619–630, May 2018, doi: 10.1042/bst20170469.

[9] Q. Zhang, Y. Yue, and R. Zheng, “Clusterin as a serum biomarker candidate contributes to the lung fibroblasts activation in chronic obstructive pulmonary disease,” Chinese Medical Journal, vol. 135, no. 9, pp. 1076–1086, Feb. 2022, doi: 10.1097/cm9.0000000000002065.

[10] S. Kamei et al., “Integrative expression analysis identifies a novel interplay between CFTR and linc-SUMF1-2 that involves CF-associated gene dysregulation,” Biochemical and Biophysical Research Communications, vol. 509, no. 2, pp. 521–528, Feb. 2019, doi: 10.1016/j.bbrc.2018.12.152.

[11] Y. Xu, C. Tertilt, A. Krause, L. E. Quadri, R. G. Crystal, and S. Worgall, “Influence of the cystic fibrosis transmembrane conductance regulator on expression of lipid metabolism-related genes in dendritic cells,” Respiratory Research, vol. 10, no. 1, Apr. 2009, doi: 10.1186/1465-9921-10-26.

[12] Q. Wu et al., “p53: A Key Protein That Regulates Pulmonary Fibrosis,” Oxidative Medicine and Cellular Longevity, vol. 2020, pp. 1–13, Nov. 2020, doi: 10.1155/2020/6635794.

[13] M. R. Pinezich et al., “Pathological remodeling of distal lung matrix in end-stage cystic fibrosis patients,” Journal of Cystic Fibrosis, vol. 21, no. 6, pp. 1027–1035, Nov. 2022, doi: 10.1016/j.jcf.2022.04.016.

[14] N. Rout-Pitt, N. Farrow, D. Parsons, and M. Donnelley, “Epithelial mesenchymal transition (EMT): a universal process in lung diseases with implications for cystic fibrosis pathophysiology,” Respiratory Research, vol. 19, no. 1, Jul. 2018, doi: 10.1186/s12931-018-0834-8.

[15] N. Rout-Pitt, B. Boog, A. McCarron, N. Reyne, D. Parsons, and M. Donnelley, “Insights into epithelial-mesenchymal transition from cystic fibrosis rat models,” Journal of Cystic Fibrosis, Sep. 2024, doi: 10.1016/j.jcf.2024.09.003.

[16] M. C. Quaresma et al., “Mutant CFTR Drives TWIST1 mediated epithelial–mesenchymal transition,” Cell Death & Disease, vol. 11, no. 10, Oct. 2020, doi: 10.1038/s41419-020-03119-z.

[17] Z. Li et al., “LncRNA SNHG5 Suppresses Cell Migration and Invasion of Human Lung Adenocarcinoma via Regulation of Epithelial-Mesenchymal Transition,” Journal of Oncology, vol. 2023, pp. 1–13, Jan. 2023, doi: 10.1155/2023/3335959.

[18] Q. Ye et al., “Molecular Regulation of Heme Oxygenase-1 Expression by E2F Transcription Factor 2 in Lung Fibroblast Cells: Relevance to Idiopathic Pulmonary Fibrosis,” Biomolecules, vol. 12, no. 10, p. 1531, Oct. 2022, doi: 10.3390/biom12101531.

[19] T. Yoshihara et al., “Periostin plays a critical role in the cell cycle in lung fibroblasts,” Respiratory Research, vol. 21, no. 1, Jan. 2020, doi: 10.1186/s12931-020-1299-0.

[20] Q. Qian et al., “MicroRNA‐205‐5p targets E2F1 to promote autophagy and inhibit pulmonary fibrosis in silicosis through impairing SKP2‐mediated Beclin1 ubiquitination,” Journal of Cellular and Molecular Medicine, vol. 25, no. 19, pp. 9214–9227, Aug. 2021, doi: 10.1111/jcmm.16825.

[21] F. Wang, P. Li, and F. Li, “Integrated Analysis of a Gene Correlation Network Identifies Critical Regulation of Fibrosis by lncRNAs and TFs in Idiopathic Pulmonary Fibrosis,” BioMed Research International, vol. 2020, pp. 1–14, Jun. 2020, doi: 10.1155/2020/6537462.

[22] H. Zhao et al., “Single-cell RNA Sequencing Analysis Reveals New Immune Disorder Complexities in Hypersplenism,” Frontiers in Immunology, vol. 13, Jul. 2022, doi: 10.3389/fimmu.2022.921900.

[23] L. Xiu, B. Ma, and L. Ding, “Antioncogenic roles of USP9Y and DDX3Y in lung cancer: USP9Y stabilizes DDX3Y by preventing its degradation through deubiquitination,” Acta Histochemica, vol. 126, no. 1, p. 152132, Jan. 2024, doi: 10.1016/j.acthis.2023.152132.

Effect of Cystic Fibrosis on The Lung Fibroblast Transcriptome in Humans

Marvellous O. Oyebanjo (24013601)

December 09, 2024

Word count: 2000

Background

Materials and Methods

Data Information

Analytical Workflow

Installation of Packages

Loading the Data

Data Processing

Create a DGEList from the Count Matrix

Gene Annotation

Filtering Genes with Low Expression

Library Size

Multi-dimensional scaling (MDS) Plots

Normalisation to Remove Composition Bias

Design Matrix

Voom Transformation

DE Analysis

Testing Relative to a Threshold (TREAT)

Gene Ontology (GO) Enrichment Analysis

Results

Quality Control

Library Size Distribution

MDS Plot

Normalisation for Composition Bias

Voom Transformation

DE Results

TREAT

Enrichment Analysis

GOANA

CAMERA

Discussion

References