数据下载及处理

页面介绍

GEO（Gene Expression Omnibus）是 NCBI 中公共高通量基因数据共享平台

Link：Home - GEO - NCBI

如下是进入搜索的一个关于甲状腺癌的文件

GEO Accession viewer

Platforms：（GPLxxxx）平台编号对应的检测技术体系，包含了探针序列与基因的对应关系，是后续进行探针 ID 转换为基因 Symbol/Ensembl ID 的必需文件，直接影响下游功能富集分析的准确性。
Samples：（GSMxxxx）样本列表，可以查看每一个样本信息
Relations：关联信息，可找到该项目的其他数据
Download family：
- SOFT formatted family file(s) 、 MINiML formatted family file(s)：包含原始信号和所有元数据，信息非常全，但格式比较复杂，适合进阶分析。
- Series Matrix File(s)：GEO 整理好的表达矩阵，TXT 格式，直接就能用。里面的行是基因或探针，列是样本，每个单元格是对应的表达量。
Supplementary file：作者上传的补充材料，未经过GEO统一处理的文件，是作者对原数据的个性化解读加工。

Platform里面的GPL注释表和文件是注释信息，探针ID 到基因。而Download family里面的文件是表达量数据，只能确定探针中的表达值。

需要Download full table…

GEO数据处理

下载Download family处的Series Matrix File(s)文件和Platform里的full table文件。

安装BiocManager包，再用BiocManager安装其他包

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82


BiocManager::install("GEOquery") # 下载GEOquery包
library(GEOquery)
library(tidyverse)
setwd('C:/Users/30521/Desktop/R自学/GEO-analysis') # 设置工作路径；注意只能是“/”
gset<-getGEO("GSE299988",destdir=".",AnnotGPL=F,getGPL=F)
# getGEO(GEO,GSEMatrix = TRUE,AnnotGPL = FALSE,destdir = ".",getGPL = TRUE)，这是getGEO()函数的使用格式。
# GSEMatrix指是否下载整理好的 Series Matrix 表达矩阵
# AnnotGPL指是否同时下载平台注释表（GPL）
# destdir指数据保存到本地的路径（默认为"."）
# getGPL指是否自动下载对应的 GPL 平台文件。

class(gset) # 读取的为list类型
gset[[1]] # 查看信息，GSE只有一个平台就是1，可能有列表中其他平台，要查看就是更改成gset[[2]]等等
exp<-exprs(gset[[1]])# 提取基因表达矩阵数据，由于没有载入注释文件，矩阵的行名还是探针名

# 1、标准化
boxplot(exp,outline=F,las=2) # 绘制表达量分布箱线图，快速检查数据的整体质量和标准化效果。在同一条直线上则是已经标准化了
# boxplot(x, outline, las) x为表达矩阵，outline为是否显示异常值点，las为控制x轴标签方向：1 = 水平，2 = 垂直
library(limma)
exp<-normalizeBetweenArrays(exp) # 手动标准化，再去查看箱图

# 2、归一化
range(exp) # 查看range范围，数值过大需要归一化，log2处理
exp<-log2(exp+1)

# 3、基因注释，GPL导入
gpl<-read.table("GPL21185-21174.txt",header=T,fill=T,sep="\t",comment.char="#",stringsAsFactors=F,quote='')
# header=T列名保留
# fill=T文件中某些行的列数不一致时，自动用空值填充缺失的列
# sep="\t"指定列之间的分隔符是制表符（Tab）
# comment.char="#"忽略文件中以 # 开头的注释行
# stringsAsFactors=F把字符型数据（如探针 ID、基因名）保留为字符串，而不是自动转换成因子
# quote=''告诉 R 文件中没有引号包裹的字符串，避免把基因名或 ID 中的特殊字符误判为引号
colnames(gpl) # 查看列名
ids<-gpl[,c("ID","GENE_SYMBOL")] # 选取查看的两列
colnames(ids)<-c("probe_id","symbol")
library(stringr)
ids$symbol<-trimws(str_split(ids$symbol,"//",simplify=T)[,3])# 如果是第3个是基因，就选择第3列
# stringr包是tidyverse生态里专门处理字符串的工具包
# str_split()是按指定分隔符拆分字符串
# simplify=T默认返回列表，设为TRUE后会直接返回矩阵
# trimws()去除提取出的第 3 个基因名两端的空白字符，确保基因名格式干净
ids<-ids[ids$symbol!='',] # 删除了空白探针
ids<-ids[ids$symbol!='---',]
ids<-ids[ids$symbol!='--- ',] # 具体情况具体格式删除
ids<-ids[ids$probe_id %in% rownames(exp),] # 保留ids中与exp匹配的行
exp2<-exp[ids$probe_id,] # 选取exp中与ids相互匹配的行
table(rownames(exp2)==ids$probe_id)
exp3<-cbind(ids,exp2) # 合并两者为数据框

# 4、基因去重（一个基因会有多个探针）
exp3<-aggregate( . ~symbol,exp3,max) # 合并同一基因的多探针表达量
# . ~ symbol去重的核心逻辑，“.”代表除了symbol列之外的所有列，~symbol代表按照symbol列进行分组
# max指取每个基因对应探针的最大表达量作为代表值，也可以选择mean平均
rownames(exp3)<-exp3$symbol
exp3<-exp3[,-c(1,2)] # 再删去合并的前两列

# 5、获取临床数据
library(Biobase) # pData是Biobase里面的函数
rt1<-pData(gset[[1]]) # 先前exprs函数是获取的gset中基因表达矩阵
# pData函数用来从 ExpressionSet 对象中提取 phenotype data（表型数据），也就是每个样本的临床信息，比如分组、性别、年龄、疾病分期等。
write.csv(rt1,file='GSE299988_cli.csv',row.names=T)
#（1）疾病分组
colnames(rt1)
group_list<-ifelse(rt1$`tissue types:ch1`=='Normal','Normal','Tumor')
# 将Normal与Tumor细胞分开
# 更多组别的分组：group_list<-ifelse(str_detect(rt1$`title`,'normal_adjace'),'Normal',ifelse(str_detect(rt1$`title`,'positiveLN_RAIrefractive'),'P_tumor','N_tumor'))
# str_detect可以模糊字符选择
rt1$group<-group_list
rt2<-rt1 %>% dplyr::select("group")
# %>%管道操作符，把左边的rt1作为右边函数的输入
# dplyr::select()用于从数据框中选择特定的列
# （2）分组排列
rt3<-rt2 %>% arrange(rt2) # 升序排列，Normal在前，Tumor在后
# rt3<-rt2 %>% arrange(desc(group)) Tumor在前，Normal在后

# 6、数据保存
# exp3与rt3的列行一一对应
exp3<-exp3[,rownames(rt3)]
identical(rownames(rt3),colnames(exp3)) # 返回为T确认一一对应
save(rt3,exp3,file="GSE299988.rda") # 保存数据
# 之后可以直接load("GSE299988.rda")进行使用

GEO非symbol基因名的处理

常见标识类型： Gene Symbol直接对应基因名的标识；Entrez ID是NCBI Entrez 数据库的基因编号；RefSeq Accession是NCBI RefSeq 数据库的转录本编号，如 NM_000546；ENSEMBL ID对应Ensembl 数据库的基因 / 转录本编号，如 ENSG00000141510；GenBank Accession（GB ACC）对应GenBank 数据库的核酸序列编号，如 U94788。

如GSE200539的GPL文件中不包含symbol文件

GEO Accession viewer

Platforms进入的网页下可能有的没有Download full table…，则需要下载table下方的Download family中的SOFT formatted family file(s)，下载为soft.gz文件

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34


library(GEOquery)
library(tidyverse)
setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/GSE200539')
# 中间部分与上述GEO数据处理部分一致
# 基因注释，soft导入（没有Download full table…的情况下）
GPL<-getGEO(filename="GPL23159_family.soft.gz",destdir=".")
gpl<-GPL@dataTable@table

# 由于这个注释文件没有Gene symbol，有NM_序列，因此提取这一列
ids<-gpl[,c("ID","SPOT_ID.1")] # 选取查看的两列
colnames(ids)<-c("probe_id","symbol") # 重命名
library(stringr)
ids$symbol<-trimws(str_split(ids$symbol,"//",simplify=T)[,1]) # 选取//分割的第一列NM_序列
# 下载两个包：BiocManager::install(c("org.Hs.eg.db", "AnnotationDbi"))
# org.Hs.eg.db包是人类基因注释数据库包，Hs 代表 Homo sapiens（人），eg 代表 Entrez Gene。包含了人类基因的各种标识符（如 Entrez Gene ID、Gene Symbol、Ensembl ID 等）之间的映射关系
# AnnotationDbi包是所有注释数据库包的核心接口包
library(org.Hs.eg.db)
library(AnnotationDbi)
symbol_ids<-mapIds(x=org.Hs.eg.db,keys=ids$symbol,keytype="REFSEQ",column="SYMBOL",multivals="first")
# org.Hs.eg.db指定要使用的注释数据库，这里是人类基因注释库
# keytype告诉输入的类型
# column指定想要转换后的目标 ID 类型
# multivals="first" 处理 “一个输入 ID 对应多个输出 ID” 的情况。当一个 REFSEQ ID 对应多个基因符号时，只取第一个结果，避免返回列表，保证输出是一个向量
symbol_ids
ids$symbol<-symbol_ids # NM_序列转换为基因名
ids<-na.omit(ids) # 删除NA

# 注意：exp行名默认是字符，如果ids$symbol是纯数字，应当转变为字符格式ids$probe_id<-as.character(ids$probe_id)

exp2<-exp[ids$probe_id,] # 选取exp中与ids相互匹配的行
table(rownames(exp2)==ids$probe_id)
exp3<-cbind(ids,exp2) # 合并两者为数据框

# 之后与之前GEO数据处理部分一致

Affymetrix原始数据整理

上述的GEO处理基于Download family中的Series Matrix File(s)下载后处理，这是已经预处理过的表达矩阵数据。通常是经过背景校正、归一化和探针注释后的表格。无法验证或调整预处理过程（如归一化方法）。

因此有时选取Supplementary file(s)里面的RAW原始文件，是原始的 CEL 文件压缩包，包含芯片扫描得到的原始荧光信号数据。

如上述的GSE200539的Supplementary file里面的GSE200539_RAW.tar

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45


setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/GSE200539/Affymetrix原始数据处理')
library(tidyverse)
# 下载：BiocManager::install("oligo")
# Bioconductor 生态里专门用来处理 Affymetrix 芯片原始数据的 R 包
library(oligo)
# 1、数据导入
FileName<-list.files(path='./GSE200539_RAW') # 注意提前解压
FileName
sample_id<-str_sub(FileName,1,10) # 重命名，此处选取前10个字符，即命名为如GSM6037488的样式
rt<-data.frame(FileName=FileName,row.names=sample_id,group=rep(c('CTL','IDMF','DRF','MMF'),each=3))
# 分组数据参考网站上的Sample的分组，共4组，每组3个样本
cel_files<-list.celfiles('./GSE200539_RAW',full.names=T,listGzipped=T)
# list.celfiles()是oligo包的核心函数，用于搜索指定目录下的所有 .CEL文件
# './GSE200539_RAW'当前目录下的GSE200539_RAW文件夹
# full.names=T返回完整的文件路径（包含文件夹名）
# listGzipped=T支持识别并列出被gzip压缩的 .CEL.gz文件
cel_files
cel_data<-read.celfiles(cel_files) # 通过cel_files获取的所有文件路径导入并读取

# 2、质控处理
# 下载：BiocManager::install('arrayQualityMetrics')
# arrayQualityMetrics包是Bioconductor中专门用于芯片数据质量控制的工具包，能自动生成详细的质控报告，包含多种图表来评估数据质量。
library(arrayQualityMetrics)
# 版本问题手动下载吧……然后根据指示下一堆依赖包
library(GEOquery)
expr<-exprs(cel_data) #读取数据为原始表达量的矩阵
colnames(expr)<-sample_id # 列名重命名
eset<-ExpressionSet(assayData=expr,phenoData=AnnotatedDataFrame(rt))
# ExpressionSet()是Biobase包的核心函数，用来整合表达数据、样本信息和注释信息
# AnnotatedDataFrame()把rt数据框转换成 ExpressionSet 要求的注释数据框格式
# 此处是转换为ExpressionSet用于质控分析
arrayQualityMetrics(eset,outdir='GSE200539_QulityMetrics',force=T,do.logtransform=T,intgroup='group')
# force=T如果输出文件夹已经存在，强制覆盖它，避免报错
# do.logtransform=T对表达数据进行对数转换，这是芯片数据分析的常规预处理步骤
# intgroup='group'指定用于分组的表型信息列（比如 rt 数据框里的 group 列）

#rma标准化,对原始 CEL 文件数据进行背景校正、归一化和汇总
eset_rma<-oligo::rma(cel_data)
class(eset_rma) # "ExpressionSet"
exp<-exprs(eset_rma)
colnames(exp)<-str_sub(colnames(exp),1,10) # 保留列名前面10个字符
boxplot(exp,outline=F,notch=T,las=2) # 箱图看结果

# 这里的exp与之前章节得到的直接读取的exp相同
# 之后部分是基因注释，与先前相同

illumina原始数据处理

有的GEO标准化后的Series Matrix File包含负值，因此需要进行原始数据处理，此处选取GSE68004进行分析。选择原始的non-normalized数据进行处理。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88


#清空环境
rm(list=ls())
# BiocManager::install("GEOquery")
library(GEOquery)
library(tidyverse)
library(limma)
setwd('E:/R_do/阿伦生信教学代码/GEO/ilumina原始数据处理/') #设置工作路径
gset <- getGEO("GSE68004", destdir = ".", AnnotGPL = F, getGPL = F)#手动下载，自动读取
class(gset) 
gset[[1]] #查看信息
#提取基因表达数据  
exp <- exprs(gset[[1]])
#1.标准化####
boxplot(exp, outline=F,las=2)
library(limma)
exp <- normalizeBetweenArrays(exp)
boxplot(exp, outline=F,las=2)
#2.归一化####
range(exp)#查看range范围，如果数值过大，需要归一化，log2处理
range(exp,na.rm = T)
# [1]   -22.82553 17174.43372


#1.观察数据结构####
dat1 <- data.table::fread('GSE68004_non-normalized_ProbeRowSignalDetectionDataset_5383_143523_1.txt')
dat2 <- data.table::fread('GSE68004_non-normalized_ProbeRowSignalDetectionDataset_5383_143529_1.txt')
#2.读取数据####
x1 <- read.ilmn(files = 'GSE68004_non-normalized_ProbeRowSignalDetectionDataset_5383_143523_1.txt',
                expr = '750',#数据列名 相同的内容
                other.columns='Detection', #p值
                probeid='ID_REF') #芯片列名


x2 <- read.ilmn(files = 'GSE68004_non-normalized_ProbeRowSignalDetectionDataset_5383_143529_1.txt',
                expr = '586',#数据列名 相同的开头
                other.columns='Detection', #p值
                probeid='ID_REF') #芯片列名
#合并
pro_id <- intersect(rownames(x1),rownames(x2))
x1 <- x1[pro_id,]
x2 <- x2[pro_id,]
x <- cbind(x1, x2)

#3.校准化-log处理####
y <- neqc(x) 
#4.提取表达数据####
exp <- y$E
boxplot(exp,las=3,outline=F)
range(exp)
#[1]  4.161602 14.037369
#去批次
library(sva)
batch <- c(rep(1, ncol(x1)), rep(2, ncol(x2))) #输入批次信息
exp_combat <- ComBat(dat = exp, batch = batch, mod = NULL, par.prior = TRUE)
boxplot(exp_combat,las=3,outline=F)
range(exp_combat)
#[1]  3.807521 14.370650
#5.整理临床数据
rt <- pData(gset[[1]])
colnames(rt)
rt <- dplyr::select(rt,'title','source_name_ch1')
sample_id <- substring(rt$title,nchar(rt$title)-8)#提取字符-8就是最后8个
sample_id
rownames(rt) <- sample_id

rt$group <- trimws(str_split(rt$source_name_ch1,',',simplify = T)[,2])
rt <- dplyr::select(rt,group)

#保持行名列名一致
exp_combat <- exp_combat[,rownames(rt)]
identical(rownames(rt),colnames(exp_combat))
#[1] TRUE
save(exp_combat,rt,file = 'exp_combat.rad')

#质控检查
rt1 <- rt[colnames(x$E), , drop = FALSE]
library(arrayQualityMetrics)
colnames(x$E) <- rownames(rt1)
eset <- ExpressionSet(assayData = x$E,phenoData = AnnotatedDataFrame(data = rt1))
arrayQualityMetrics(eset,
                    outdir='GSE68004_quality',
                    force=T,
                    intgroup='group',
                    do.logtransform=T)


rm(list=ls())
load('exp_combat.rad')

Agilent原始数据处理

选取GSE211806进行处理

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96


# control type: 0是对照，1是基因
# glsaboveBG:信号强度是否高于背景，结果是1则说明值是可靠的

# 1、原始数据读取
setwd("C:/Users/30521/Desktop/R自学/GEO-analysis/GSE211806")
library(limma)
library(tidyverse)
FileName<-list.files(path="./GSE211806_RAW")
FileName
sample_id<-str_sub(FileName,1,10)
tissue_type<-rep(c("PLC/PRF/5","SNU-449"),each=12)
group<-rep(c("control","TGFb","Galunisertib","TGFb + Galunisertib"),each=3)

Ids<-data.frame(FileName=FileName,
                sample_id=sample_id,
                tissue_type=tissue_type,
                group=group,
                row.names=sample_id,
                stringsAsFactors=F)
table(Ids$group)
write.table(Ids,file="GSE211806_Ids.txt",quote=F,sep="\t",row.names=F)
save(Ids,file="GSE211806_Ids.rda")

GSE211806_raw<-read.maimages(files=Ids$FileName,
                             source="agilent", # 代表经过哪种程序得到的，有些Agilent芯片是通过genepix处理的
                             path="GSE211806_RAW",
                             names=Ids$sample_id,
                             other.columns="gIsWellAboveBG", # 读取进去是为了判断是否高于背景值
                             green.only=T) # 代表是个单色芯片(Cy3)，双色芯片F(Cy3,Cy5)，见官方样本信息页的Label
# GSE211806_raw的E代表expression，Eb代表background
# 添加标签
GSE211806_raw$Ids<-Ids

# probe type summary
table(GSE211806_raw$genes$ControlType)
# 0说明不是control，需要检测这60901个基因，1是阳性对照，-1是阴性对照
head(GSE211806_raw$E)[,1:5]
dim(GSE211806_raw)

# 保存数据
save(GSE211806_raw,file="GSE211806_raw_Agilent.rda")
boxplot(log2(GSE211806_raw$E),ylab=expression(log[2](intensity)),las=2,
        col=factor(Ids$group),outline=F)
source('Function/PCA.R')
PCA(log2(GSE211806_raw$E),
    ntop=nrow(GSE211806_raw$E),
    group=Ids$group,
    show_name=F)# 或者T

library(arrayQualityMetrics)

# 函数不接受ElistRaw对象，只接受ExpressionSet
library(GEOquery)
rownames(Ids)<-Ids$sample_id
eset<-ExpressionSet(assayData=GSE211806_raw$E,
                    phenoData=AnnotatedDataFrame(data=Ids))

arrayQualityMetrics(eset,
                    outdir="GSE211806_QulitMetrics",
                    do.logtransform=T,
                    force=T,
                    intgroup="group")

# 个体差异较大
# 背景校正及标准化
bgc<-backgroundCorrect(RG=GSE211806_raw,
                       method="normexp",
                       offset=50,
                       normexp.method="mle")

normdata<-normalizeBetweenArrays(bgc,method="quantile")
# normalizeBetweenArrays适用于单色芯片，normalizeinArrays适用于双色芯片
sum(is.na(normdata$E))
boxplot(normdata$E,ylab=expression(log[2](intensity)),las=2,
        col=factor(Ids$group),outline=F)
PCA(normdata$E,
    ntop=nrow(normdata$E),
    group=Ids$group,
    show_name=F)# 或者T

# 可再做一步质控（选）
eset2<-ExpressionSet(assayData=normdata$E,
                    phenoData=AnnotatedDataFrame(data=Ids))

arrayQualityMetrics(eset2,
                    outdir="GSE211806_QulitMetrics_norm",
                    do.logtransform=T,
                    force=T,
                    intgroup="group")
exp<-exprs(eset2)
# 行名问题
rownames(normdata$E)<-normdata$genes$ProbeName
# 选取探针id的行，舍去前三行
exp<-normdata$E[4:nrow(normdata$E),]
rt<-dplyr::select(Ids,"tissue_type","group")
save(exp,normdata,Ids,rt,file='GSE211806_Agilent_processed.rda')

GEO高通量数据

先前选取的都是GEO中的Expression profiling by array（芯片表达谱），通过荧光强度判断表达量。

而GEO中Expression profiling by high throughput sequencing（高通量测序表达谱，通常指 RNA-seq）将样本 RNA 反转录建库后，直接进行大规模测序，通过比对基因组计数 reads 数。

芯片表达谱是对RNA与探针结合的荧光进行测定，反映相对值。而高通量测序是对RNA基因片段数量进行测定，反映绝对值。

高通量数据除了可以富集分析外，还可做可变剪接、融合基因、新转录本发现。

选取GSE309139高通量数据为例，下载Series Matrix File(s)与GSE309139_raw_counts.txt.gz，原始的counts数据需解压

GEO Accession viewer

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62


setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/GSE309139')
library(tidyverse)
exp<-data.table::fread("GSE309139_raw_counts.txt",header=T)
# 用data.table::fread()高效读取RNA-seq原始计数文件
# header=T 确保文件第一行（样本ID）作为列名
table(is.na(exp)) # 查看是否有缺失值
exp[is.na(exp)]<-0
table(is.na(exp)) # 查看是否有缺失值
exp<-column_to_rownames(exp,var='Gene')
# 把基因名那一列挪到最前面成为行名

# 1、设置分组信息
sample_id<-colnames(exp)
sample_id
group<-rep(c('L','H','LL','HL'),each=6) # 具体分组看具体数据
group
rt<-data.frame(row.names=sample_id,group=group)
identical(rownames(rt),colnames(exp))

# 2、DESeq2分析
# BiocManager::install("DESeq2")
library(DESeq2)
rt$group<-factor(rt$group) # 变成因子的格式
rt$group
# 构建DESeq2对象
dds<-DESeqDataSetFromMatrix(countData=exp,colData=rt,design=~group)
# design=~group根据group分组来比较差异
# dds把原始计数数据、样本分组信息、实验设计逻辑打包在一起，后续所有分析（标准化、差异检验）都基于这个对象

# 低表达基因过滤（自定义，可选）
filter<-rowSums(counts(dds))>1 # 筛选每行加起来大于1的基因
dds<-dds[filter,]
# 差异表达分析
dds<-DESeq(dds)
# 格式为c("因子名"，“实验组”，“对照组”)
res<-results(dds,contrast=c("group","H","L"))
# 提取差异表达结果
res1<-as.data.frame(res)
# log2FoldChange差异倍数的对数值：H 组表达量 / L 组表达量 的 log₂ 结果
# lfcSE是log2FC 的标准误，代表差异倍数的统计波动程度（值越小，log2FC 越可靠）
# pvalue越小越拒绝原假设H0，越显著性差异
# padj是校准后的P值，控制多检验的假阳性
# 结果筛选padj<0.05 & abs(log2FoldChange)>=1，选出差异表达
degs<-filter(res1,padj<0.05 & abs(log2FoldChange)>=1)
view(degs)

# 绘制离散图
plotDispEsts(dds)
# 离散图横轴是基因经过标准化后的平均表达量，纵轴离散度，衡量基因表达量的变异程度
# P值直方图，看P值显著基因多少
hist(res$padj)

# 3、count数据vst转化，这里得到exp1的数据就不是整数了，可以理解为之前章节的exp，之后就是一样的标准化操作等等
vst<-vst(dds,blind=F)
# vst()将RNA-seq的原始计数数据转换为适合可视化和下游分析的数值
# blind=F变换时考虑实验设计信息，不会忽略分组差异，这样能更真实地保留生物学差异
exp1<-assay(vst)
range(exp1)
exp1<-as.data.frame(exp1)
save(exp,exp1,rt,file='GSE309139.rda')

# 注意，这里得到exp1之后记得看看需要标准化，归一化那一系列操作，之后再进行热图绘制。

差异分析

差异分析：芯片数据——array limma ；高通量——count DESeq2

limma芯片数据差异分析

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/limma差异分析')
library(limma)
library(pheatmap)
library(tidyverse)
library(ggpubr)

# 数据读取，直接使用之前第一部分处理好的数据
load('C:/Users/30521/Desktop/R自学/GEO-analysis/GSE299988/GSE299988.rda')
rt<-rt3
exp<-exp3
str(exp) # 数据全是字符型：$ GSM9051410: chr  "8.561352698" "5.812654031" "2.853925376" "13.89898258" ...
exp<-exp %>% mutate(across(everything(),as.numeric))
# 把数据框 exp 里所有列的数据类型都转换成数值型
str(exp) # 把数据转换成了数字型：$ GSM9051410: num  8.56 5.81 2.85 13.9 8.65 ...
identical(rownames(rt),colnames(exp))
# 确认行名列名一致

火山图

接上述limma差异分析

如下是下述代码的DEG结果数据框图：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69


group_list<-factor(rt$group,levels=c("Normal","Tumor"))
# 把临床数据的分组弄成因子形式
design<-model.matrix(~group_list)
# model.matrix()：把因子型的分组信息转化为统计模型可识别的数值矩阵
# 基于上面的 group_list，生成的设计矩阵会以 Normal 作为截距（基线），用一列数值来表示 Tumor 组与基线的差异
design
# 线性模型拟合
fit<-lmFit(exp,design)
# 贝叶斯检验
fit2<-eBayes(fit)
deg<-topTable(fit2,coef=2,number=Inf)
# 是差异表达基因结果数据框，包含了所有基因的差异分析结果
DEG=na.omit(deg) # 去除可能的NA值
write.table(DEG,file='DEG1.txt',sep="\t",row.names=T,col.names=NA,quote=F)
# quote=F：不在字符串两端添加引号，避免后续读取时出现多余的引号干扰

# 通常logFC设置为1，adj.P.Val<0.05，下述代码进行显著水平与正负相关性的区分
logFC_cutoff<-1
type1=(DEG$adj.P.Val<0.05)&(DEG$logFC < -logFC_cutoff)
type2=(DEG$adj.P.Val<0.05)&(DEG$logFC > logFC_cutoff)
DEG$change=ifelse(type1,"DOWN",ifelse(type2,"UP","NOT"))
write.table(DEG,file='DEG2.txt',sep="\t",row.names=T,col.names=NA,quote=F)
table(DEG$change) # 统计上述3中情况："DOWN","UP","NOT"的个数

library(ggplot2)
library(cowplot)
library(ggrepel)
DEG<-mutate(DEG,Gene_symbol=row.names(DEG)) # 把Gene_symbol弄到最后一列
UPDEG<-subset(DEG,change=="UP")
UPDEG_5<-top_n(x=UPDEG,n=-5,wt=P.Value)
# n = -5：- 表示按升序取最小的 5 个，+ 表示按降序取最大的 5 个
# wt = P.Value：按 P.Value 列来排序，P 值越小代表差异越显著
# 由此选取最小的5组就是最小的5个P，最显著的5个
DOWNDEG<-subset(DEG,change=="DOWN")
DOWNDEG_5<-top_n(x=DOWNDEG,n=-5,wt=P.Value)
p <- ggplot(data = DEG,
            aes(x = logFC,
                y = -log10(adj.P.Val))) +
  geom_point(alpha = 0.5, size = 4.5,
             aes(color = change)) +
  ylab("-log10(adj.P.Val)") +
  xlab("log2(Fold Change)") +
  scale_color_manual(values = c("#1F77B4", "grey", "#FF7F0E")) +
  geom_vline(xintercept = c(-1, 1), lty = 5, col = "black", linewidth = 0.8) +
  geom_hline(yintercept = -log10(0.05), lty = 4, col = "black", linewidth = 0.8) +
  theme_half_open() +
  # 标记最显著的5个上调基因
  geom_text_repel(data = UPDEG_5, 
                  aes(label = Gene_symbol), 
                  vjust = 1.5,
                  box.padding = unit(0.35, "lines"),
                  point.padding = unit(0.35, "lines"),
                  segment.color = 'grey50',
                  nudge_y = 0.1,
                  direction = "y") +
  # 标记最显著的5个下调基因
  geom_text_repel(data = DOWNDEG_5, 
                  aes(label = Gene_symbol), 
                  vjust = 1.5,
                  box.padding = unit(0.35, "lines"),
                  point.padding = unit(0.35, "lines"),
                  segment.color = 'grey50',
                  nudge_y = 0.1,
                  direction = "y")
p # 查看火山图
# 保存为pdf
pdf(file="volcano.pdf",width=10,height=6)
print(p)
dev.off()

热图

接上述火山图

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55


library(pheatmap)
identical(rownames(rt),colnames(exp))
DEG<-read.table("DEG2.txt",sep="\t",check.names=F,stringsAsFactor=F,header=T,row.names=1)
table(DEG$change) 
diff<-rownames(DEG)[DEG$change != "NOT"] # 选出仅有显著上调或下调的基因
exp_diff<-exp[diff,]
range(exp_diff)

# 补：从DEG中选出目标基因来做热图
DEG_target <- DEG[grepl(pattern = "CASP1|ZDHHC|NLRP3|IL1B|IL18|GSD", x = DEG$name, ignore.case = TRUE), ]
target<-exp[rownames(DEG_target),]
# 作图
# 绘制差异基因表达热图
map<-pheatmap(
  exp_diff,
  annotation_col = rt, # 列注释（样本分组信息）
  color = colorRampPalette(c("blue", "white", "red"))(100), # 颜色渐变（100个梯度）
  cluster_cols = F, # 不对样本（列）进行聚类
  cluster_rows = T,  # 对基因（行）进行聚类
  show_colnames = F, # 不显示样本名（避免拥挤）
  show_rownames = T, # 显示基因名
  border_color = NA, # 去掉单元格边框
  scale = 'row', # 按行标准化（让每个基因的表达在不同样本间可比较）
  treeheight_row = 20, # 行聚类树的高度（可选）
  fontsize_row = 8, # 行名字体大小（可选）
  fontsize_col = 8 # 列名字体大小（可选）
)
pdf(file='heatmap.pdf',width=6,height=20)
print(map)
dev.off()

# 特定展示（可选）
library(tidyverse)
diff=DEG[DEG$change != "NOT",]
up<-diff %>% top_n(25,logFC)
dw<-diff %>% top_n(-25,logFC)
all<-c(rownames(up),rownames(dw))
exp_diff2<-exp[all,]
map2<-pheatmap(
  exp_diff2,
  annotation_col = rt, # 列注释（样本分组信息）
  color = colorRampPalette(c("blue", "white", "red"))(100), # 颜色渐变（100个梯度）
  cluster_cols = F, # 不对样本（列）进行聚类
  cluster_rows = T,  # 对基因（行）进行聚类
  show_colnames = F, # 不显示样本名（避免拥挤）
  show_rownames = T, # 显示基因名
  border_color = NA, # 去掉单元格边框
  scale = 'row', # 按行标准化（让每个基因的表达在不同样本间可比较）
  treeheight_row = 20, # 行聚类树的高度（可选）
  fontsize_row = 8, # 行名字体大小（可选）
  fontsize_col = 8 # 列名字体大小（可选）
)
pdf(file='heatmap2.pdf',width=6,height=8)
print(map2)
dev.off()

高通量数据分析

先前GEO高通量数据处理时运行得到degs那一步，仅在得到res1差异结果后增加一步，将其进行txt文件导出，作为DEG结果，如下是得到degs的代码：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46


setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/GSE309139')
library(tidyverse)
exp<-data.table::fread("GSE309139_raw_counts.txt",header=T)
# 用data.table::fread()高效读取RNA-seq原始计数文件
# header=T 确保文件第一行（样本ID）作为列名
table(is.na(exp)) # 查看是否有缺失值
exp[is.na(exp)]<-0
table(is.na(exp)) # 查看是否有缺失值
exp<-column_to_rownames(exp,var='Gene')
# 把基因名那一列挪到最前面成为行名

# 1、设置分组信息
sample_id<-colnames(exp)
sample_id
group<-rep(c('L','H','LL','HL'),each=6) # 具体分组看具体数据
group
rt<-data.frame(row.names=sample_id,group=group)
identical(rownames(rt),colnames(exp))

# 2、DESeq2分析
# BiocManager::install("DESeq2")
library(DESeq2)
rt$group<-factor(rt$group) # 变成因子的格式
rt$group
# 构建DESeq2对象
dds<-DESeqDataSetFromMatrix(countData=exp,colData=rt,design=~group)
# design=~group根据group分组来比较差异
# dds把原始计数数据、样本分组信息、实验设计逻辑打包在一起，后续所有分析（标准化、差异检验）都基于这个对象

# 低表达基因过滤（自定义，可选）
filter<-rowSums(counts(dds))>1 # 筛选每行加起来大于1的基因
dds<-dds[filter,]
# 差异表达分析
dds<-DESeq(dds)
# 格式为c("因子名"，“实验组”，“对照组”)
res<-results(dds,contrast=c("group","H","L"))
# 提取差异表达结果
res1<-as.data.frame(res)
# log2FoldChange差异倍数的对数值：H 组表达量 / L 组表达量 的 log₂ 结果
# lfcSE是log2FC 的标准误，代表差异倍数的统计波动程度（值越小，log2FC 越可靠）
# pvalue越小越拒绝原假设H0，越显著性差异
# padj是校准后的P值，控制多检验的假阳性
# 结果筛选padj<0.05 & abs(log2FoldChange)>=1，选出差异表达
write.table(res1,file='DEG1.txt',sep='\t',row.names=T,col.names=NA,quote=F)
degs<-filter(res1,padj<0.05 & abs(log2FoldChange)>=1)
view(degs)

这是最后选出的degs表格样式：

根据筛选出来的表格样式，使用导出的DEG1.txt进行火山图和热图绘制，与之前limma部分一致，注意纵轴名的相应更改即可。

如下是limma后进行的火山图与热图分析结果：

富集分析

在先前差异分析得到的差异表格基础上进行，先前差异分析已经注释好的DOWN、UP或NOT的数据，即DEG2.txt

GO 富集：关注基因的功能属性（这个基因干什么、在哪里、怎么参与过程）

KEGG 富集：关注基因所在的通路 / 代谢网络（这些基因一起参与哪条通路、哪条代谢链）

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33


library(tidyverse)
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)
library(ggplot2)
setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/富集分析')
DEG<-read.table("DEG2.txt",sep="\t",check.names=F,stringsAsFactors=F,header=T,row.names=1)
table(DEG$change)
diff<-rownames(DEG)[DEG$change!='NOT']

# 将基因ID从Symbol转换为Entrez ID（富集分析主要用Entrez ID 或 Ensembl ID来索引基因，因此要进行转换）
gene_entrez <- bitr(diff, fromType ="SYMBOL", toType ="ENTREZID", OrgDb = org.Hs.eg.db)
# OrgDb = org.Hs.eg.db：使用人类（Homo sapiens）的注释数据库
cat("所有基因ID转换", nrow(gene_entrez),"/", length(diff)) # 看转换成功多少个
untrans <- setdiff(diff, gene_entrez$SYMBOL) # 两部分的差，得出哪些转换成功
untrans 
isValid <- "C9orf92" %in% keys(org.Hs.eg.db, keytype = "SYMBOL")
isValid # 查看T还是F，判断没有被转换的基因是否被收录在org.Hs.eg.db里面

# 使用mapIds函数转换
library(AnnotationDbi)
entrez_ids <- mapIds(
  x = org.Hs.eg.db,
  keys = untrans,       # 输入的基因别名列表
  keytype = "ALIAS",    # 明确指定输入类型为别名（ALIAS）
  column = "ENTREZID",  # 转换为ENTREZID作为中间步骤
  multiVals = "first"   # 多个匹配时取第一个
)
entrez_df <- data.frame(ENTREZID = entrez_ids, check.names = FALSE)
entrez_df <- na.omit(entrez_df)
entrez_df <- entrez_df %>% rownames_to_column(var = 'SYMBOL')
#合并
gene_entrez2 <- rbind(gene_entrez,entrez_df)

GO富集

上述完成ID转换后得到gene_entrez2，再进行GO富集

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


GO <- enrichGO(gene = gene_entrez2$ENTREZID,
               OrgDb = org.Hs.eg.db,
               keyType ="ENTREZID",
               ont ="ALL", # "BP", "MF", "CC"或"ALL"
               pAdjustMethod ="BH",
               pvalueCutoff =0.05,
               qvalueCutoff =0.05,readable = T)
# GO 分为三个分支：BP（Biological Process，生物学过程）；MF（Molecular Function，分子功能）；CC（Cellular Component，细胞组分）；ALL 表示同时分析三个分支
# pAdjustMethod = "BH"：多重检验校正方法，BH（Benjamini-Hochberg）是最常用的方法，用于控制假阳性率
# readable = T：将结果中的Entrez ID自动转换为更易读的基因Symbol，方便解读
GO_res <- GO@result
write.table(GO_res,file="GO_res.txt",sep="\t",quote=F,row.names = F)
#简单作图
barplot(GO, showCategory = 20)
dotplot(GO, showCategory = 20)
#分类展示，按 GO 的三个分支（BP/MF/CC）分别展示最显著的富集条目
barplot(GO, showCategory = 5,,split="ONTOLOGY") +
  facet_grid(ONTOLOGY~., scale='free')
# showCategory = 5：每个分类下只展示最显著的5个GO条目
# split="ONTOLOGY"：按ONTOLOGY（即BP/MF/CC三个分支）对结果进行分组
# facet_grid(ONTOLOGY~.)：将分组后的结果分面展示，生成三个独立的子图

KEGG分析

也是紧接上述完成ID转换后得到gene_entrez2进行

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


KEGG <- enrichKEGG(gene = gene_entrez2$ENTREZID,
                   organism     = 'hsa', # 物种：hsa = Homo sapiens（人类）
                   pvalueCutoff = 0.05, 
                   qvalueCutoff =0.05) 
KEGG_res <- KEGG@result
write.table(KEGG_res,file="KEGG_res.txt",sep="\t",quote=F,row.names = F)
#简单作图
barplot(KEGG, showCategory = 20)
dotplot(KEGG, showCategory = 20)
#通路图可视化
#BiocManager::install('pathview')
library(pathview)
DEG <- rownames_to_column(DEG,var = 'SYMBOL')
pgenes <- inner_join(DEG,gene_entrez2,by = 'SYMBOL')
pgenes <- dplyr::select(pgenes,ENTREZID,P.Value)

pvalue <- -log10(pgenes$P.Value)
gene_names <- as.character(pgenes$ENTREZID)

gene_data <- pvalue
names(gene_data) <- gene_names
#选择兴趣的通路
hsa04978 <- pathview(gene.data  = gene_data,
                     pathway.id = "hsa04978", #通路名字
                     species    = "hsa", #代表人类
                     out.suffix = "pathview",  
                     limit      = list(gene= 10, cpd= 1)) #对基因数量限制，cpd化合物数量

GSEA分析

GSEA分析（Gene Set Enrichment Analysis，基因集富集分析），与差异富集分析不同。

GSEA 不是看 “差异显著的基因” 富集到哪些通路 / 功能，而是看整个基因列表（不分差异与否） 中，某一 “基因集”（比如某条 KEGG 通路的所有基因）是否整体偏向于列表的顶端或底端，以此判断该基因集是否被显著激活 / 抑制。

GSEA官网：GSEA

如下是进入页面后的Database页面，右侧的 Human Collections 就是给人类研究用的 10 类基因集

H：特征基因集，把多个相似基因集合并，代表最核心、最清晰的生物学状态 / 过程，快速抓最关键的通路，结果干净、易解释。

选取H的右侧Gene Symbols进行下载，下载得到gmt文件

常规GSEA

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58


library(tidyverse)
library(limma)
setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/富集分析/GSEA分析')#设置工作路径
load('GSE299988.rda')
rt <- rt3
exp <- exp3
str(exp) 
exp <- exp %>% mutate(across(everything(), as.numeric))
str(exp)
#确认行名列名一致#确认行名列名一致
identical(rownames(rt),colnames(exp))
#组间差异分析
group_list <- factor(rt$group,levels = c("Normal","Tumor"))
group_list
design <- model.matrix(~group_list)
#比较矩阵命名
design
#线性模型拟合
fit <- lmFit(exp,design)
#贝叶斯检验
fit2 <- eBayes(fit)
deg <- topTable(fit2, coef = 2, number = Inf)
DEG = na.omit(deg) 
DEG <- rownames_to_column(DEG,var = 'Gene')

# 至此其实和差异分析的处理相类似

#整理输入数据
library(enrichplot)
library(clusterProfiler)
library(org.Hs.eg.db)
#转换ENTREZID（如果在GSEA中下载的是ENTREZID则需要转换，下载的Symbol则不需要）
# genelist <- bitr(DEG$Gene, fromType="SYMBOL",
#                  toType="ENTREZID", OrgDb='org.Hs.eg.db')
# DEG <- inner_join(DEG,genelist,by=c("Gene"="SYMBOL"))
geneList <- DEG[,2]
names(geneList) <- as.character(DEG[,'Gene'])
#排序
geneList <- sort(geneList, decreasing = TRUE) # sort默认是从小到大排列
geneList

#读取gmt文件
gmt <- read.gmt('h.all.v2026.1.Hs.symbols.gmt')
GSEA <- GSEA(geneList,
             TERM2GENE = gmt, 
             pvalueCutoff = 0.05,
             pAdjustMethod = "BH",
             minGSSize = 20,
             maxGSSize = 500) 

res <- GSEA@result
write.csv(res,file = 'H_GSEA.csv')
# 选择一条通路进行展示
p <- gseaplot2(GSEA,geneSetID = 'HALLMARK_ALLOGRAFT_REJECTION',title = 'HALLMARK_ALLOGRAFT_REJECTION',color = 'red')
p
# 选择多条通路进行展示
p1 <- gseaplot2(GSEA,geneSetID = 1:3,pvalue_table = T, subplots = 1:2)
p1  

如下是富集分析得到的结果res：

enrichmentScore (ES)：原始富集分数

NES (Normalized ES)：标准化富集分数，大于0表示上调

rank：该基因集在所有结果中的排名

leading_edge：核心富集基因的标签。tags：核心基因占该基因集的比例；list：占全基因列表的比例；signal：富集信号强度

core_enrichment：核心富集基因列表

以下是分析结果图：

单基因GSEA

单基因GSEA分析可以明确一个基因表达的变化时通路的变化情况

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75


library(tidyverse)
library(limma)
setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/富集分析/GSEA分析')#设置工作路径
load('GSE299988.rda')
rt <- rt3
exp <- exp3
str(exp) 
exp <- exp %>% mutate(across(everything(), as.numeric))
str(exp)
#确认行名列名一致#确认行名列名一致
identical(rownames(rt),colnames(exp))
#提取 自定义基因,整理高低表达组
rt1 <- as.data.frame(t(exp['PTPRC',]))
# 选取的基因表达如果是大于这一组中的中位数，则认为是上调，完成单个基因的高表达或低表达区分
rt1$group <- ifelse(rt1$PTPRC > median(rt1$PTPRC),'up','down')
# 依照group进行排列，中间的'PTPRC'没用
rt1 <- arrange(rt1,'PTPRC',group)
identical(rownames(rt1),colnames(exp))
# 让exp的列名与rt1的行名一致
exp <- exp[,rownames(rt1)]
identical(rownames(rt1),colnames(exp))

# 组间差异分析
group_list <- factor(rt1$group,levels = c("down","up"))
group_list
design <- model.matrix(~group_list)
# 比较矩阵命名
design
# 线性模型拟合
fit <- lmFit(exp,design)
# 贝叶斯检验
fit2 <- eBayes(fit)
deg <- topTable(fit2, coef = 2, number = Inf)
DEG = na.omit(deg) 
DEG <- rownames_to_column(DEG,var = 'Gene')
library(enrichplot)
library(clusterProfiler)
library(org.Hs.eg.db)
# 转换ENTREZID
# genelist <- bitr(DEG$Gene, fromType="SYMBOL",
#                  toType="ENTREZID", OrgDb='org.Hs.eg.db')
# DEG <- inner_join(DEG,genelist,by=c("Gene"="SYMBOL"))
geneList <- DEG[,2]
names(geneList) <- as.character(DEG[,'Gene'])
# 排序
geneList <- sort(geneList, decreasing = TRUE)
geneList

# 读取gmt文件，去官网下载的C2基因集
gmt <- read.gmt('c2.cp.kegg_medicus.v2026.1.Hs.symbols.gmt')
GSEA <- GSEA(geneList,
             TERM2GENE = gmt,
             pvalueCutoff = 0.05, #改成1获得全部结果
             pAdjustMethod = "BH",
             minGSSize = 20,
             maxGSSize = 500) 
res <- GSEA@result
write.csv(res,file = 'KEGG_GSEA.csv')
p <- gseaplot2(GSEA,geneSetID = 'KEGG_MEDICUS_REFERENCE_COPI_VESICLE_FORMATION',
               title = 'KEGG_MEDICUS_REFERENCE_COPI_VESICLE_FORMATION',color = 'red')
p
p1 <- gseaplot2(GSEA,geneSetID = 1:3)
p1  

p2 <- gseaplot2(GSEA,
                geneSetID = c('KEGG_MEDICUS_REFERENCE_WNT_SIGNALING_PATHWAY',
                              'KEGG_MEDICUS_REFERENCE_ITGA_B_FAK_CDC42_SIGNALING_PATHWAY'))
p2

#3.是不是可以自创基因集？更改文件内容即可
library(enrichplot)
library(clusterProfiler)
library(org.Hs.eg.db)
gmt <- read.gmt('GOBP_MITOPHAGY.v2025.1.Hs.gmt')
# GO 生物学过程（GOBP）下的「线粒体自噬」功能模块

如下是单基因GSEA分析的作图结果：结果曲线为正即为该通路上调

韦恩图

选取GeneCards

Link:GeneCards - Human Genes | Gene Database | Gene Search

选取mitophage线粒体自噬相关的蛋白基因，保存为excel

同样地再选取ferroptosis铁死亡相关的蛋白基因，保存为excel

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92


rm(list = ls())
library(tidyverse)
setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/VN图')

# 1.数据读取
DEG <- read.table("DEG2.txt",sep = "\t",check.names = F,stringsAsFactors = F,header = T,row.names = 1)
table(DEG$change)
diff <- rownames(DEG)[DEG$change!='NOT']

dat1 <- read.csv('ferroptosis.csv')
dat1 <- dat1$Gene.Symbol

dat2 <- read.csv('mitophage.csv')
dat2 <- dat2$Gene.Symbol

a <- list(DEGs=diff,
          ferroptosis=dat1,
          mitophage=dat2)

#演示数据，sample随机抽样（此处知道怎么选取颜色即可,pass）
library(RColorBrewer)
display.brewer.pal(9,'Set1')
brewer.pal(9, "Set1")[3]
"#4DAF4A"
display.brewer.pal(8,'Accent')

'#E41A1C'（红）  '#377EB8'（蓝）  '#4DAF4A'（绿）   '#984EA3'（紫）
'#FF7F00'（橙）  '#FFFF33'（黄）  '#A65628'（棕）   '#F781BF'（粉）
'#1B9E77'（深绿）'#D95F02'（深橙）'#7570B3'（深紫） '#E7298A'（深粉）
'#66A61E'（亮绿）'#E6AB02'（亮黄）'#A6761D'（棕黄） '#666666'（深灰）
'#A6CEE3'（浅蓝）'#1F78B4'（中蓝）'#B2DF8A'（浅绿） '#33A02C'（中绿）
'#FB9A99'（浅红）'#E31A1C'（中红）'#FDBF6F'（浅橙） '#FF7F00'（中橙）

set.seed(123) #随机种子（pass）
a <- list(a=sample(1:100, 35),
          b=sample(1:100, 45),
          c=sample(1:100, 55),
          d=sample(1:100, 80),
          e=sample(1:100, 90),
          f=sample(1:100, 100))

# 2.VN作图####
# 2.1 ggvenn包 最多4组
library(ggvenn)
ggvenn(a)

# 不太理解这一段美化为啥有问题，可能得看看具体ggvenn的用法
ggvenn(
  a,
  columns = NULL,
  show_elements = FALSE,
  show_percentage = TRUE,
  digits = 1,
  fill_color = c("blue", "yellow", "green", "red"),
  fill_alpha = 0.3,
  stroke_color = "black",
  stroke_alpha = 1,
  stroke_size = F,
  stroke_linetype = "solid",
  set_name_color = "black",
  set_name_size = 6,
  text_color = "black",
  text_size = 3,
  label_sep = ",",
  count_column = NULL,
  show_outside = c("auto", "none", "always"),
  auto_scale = FALSE
)

#2.2 venn包 最多7组  
# BiocManager::install('venn')
library(venn) 
venn(a)

{
venn(
  a,
  ilabels = "counts",  # 显示数字
  zcolor = c("#FF9999", "#66B2FF", "#99FF99"),  # 自定义柔和花瓣色（粉、蓝、绿）
  opacity = 0.6,       # 透明度
  box = FALSE,         # 是否除外边框
  ilcs = 1,            # 圈内数字大小
  sncs = 1.2,          # 组名大小
  border = "white",    # 边缘颜色
  lwd = 1.1            # 边框线条粗细
)
  }

# 3.输出重叠结果####
res <- Reduce(intersect, a)
res
write.csv(res,file = 'venn.csv',row.names = F)

最终VN图结果如下：

箱线图

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90


rm(list=ls())
setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/箱线图') #设置工作路径
library(tidyverse)
library(ggplot2)
library(ggpubr)
load('GSE299988.rda')

# 数据准备
exp <- exp3
str(exp)
exp <- exp %>% mutate(across(everything(), as.numeric))
str(exp)
# 分组信息
rt <- rt3

# 差异表达可视化验证（箱线图）
# 将exp转置为矩阵，再弄成数据框，选取列名为上述韦恩图得到的交集基因
exps <- exp %>% t() %>% as.data.frame() %>% dplyr::select("PPARGC1A",'SNCA')
identical(rownames(exps),rownames(rt))
# [1] TRUE 

# 合并
rts <- cbind(exps,rt)

# 转换数据为长格式，方便绘图
rts_long <- rts %>% pivot_longer(cols = c("PPARGC1A",'SNCA'),
                                 names_to = "genes",values_to = "Expression")

#箱线图 gglot2
library(ggplot2)
p <- ggplot(data = rts_long,aes(x = genes,y = Expression,
                                fill = group))+
  geom_boxplot() +stat_compare_means(method ='wilcox.test') +theme_classic()+
  scale_fill_manual(values = c("#e67e22", "#3498db"))
p


# 小提琴图 ggpubr
library(ggplot2)
library(ggpubr)
p <- ggviolin(rts_long,x = "genes",y = "Expression",
              fill = "group",
              palette = c("#e67e22", "#3498db"),
              add.params = list(fill = "white")) +
  stat_compare_means(aes(group = group),
                     method = "wilcox.test", #method='wilcox.test' ;method = "t.test"
                     label = "p.signif",symnum.args = list(cutpoints = c(0, 0.001, 0.01, 0.05, 1),
                                                           symbols = c("***","**","*","ns")))
p


'#E41A1C'（红）  '#377EB8'（蓝）  '#4DAF4A'（绿）   '#984EA3'（紫）
'#FF7F00'（橙）  '#FFFF33'（黄）  '#A65628'（棕）   '#F781BF'（粉）
'#1B9E77'（深绿）'#D95F02'（深橙）'#7570B3'（深紫） '#E7298A'（深粉）
'#66A61E'（亮绿）'#E6AB02'（亮黄）'#A6761D'（棕黄） '#666666'（深灰）
'#A6CEE3'（浅蓝）'#1F78B4'（中蓝）'#B2DF8A'（浅绿） '#33A02C'（中绿）
'#FB9A99'（浅红）'#E31A1C'（中红）'#FDBF6F'（浅橙） '#FF7F00'（中橙）

mycol <- c('#984EA3','#B2DF8A')

mycol <- c("#e67e22", "#3498db")

{
p1 <- ggviolin(rts_long, x = "genes", y = "Expression", fill = "group", 
               palette = mycol, legend = "top", alpha = 0.5,
               add = "boxplot", 
               add.params = list(width = 0.3, #箱图的宽度
                                 alpha = 0.8, outlier.shape = NA),  # 调整箱线图宽度和透明度
               outlier.shape = NA, color = alpha("black", 0.5)) +             
  stat_compare_means(aes(group = group),
                     method = "wilcox.test",label = "p.signif",
                     symnum.args = list(cutpoints = c(0, 0.001, 0.01, 0.05, 1),
                                        symbols = c("***","**","*","ns"))) + 
  geom_jitter(aes(color = group),
              position = position_jitterdodge(
                jitter.width = 0.25,
                dodge.width = 0.8),   #散点的位置
              size = 1.2,
              alpha = 0.3) +
  scale_color_manual(values = mycol) +
  theme(panel.grid = element_line(color = NA),
        panel.grid.minor = element_line(color = NA),
        panel.border = element_rect(fill = NA, color = NA),
        axis.text.x = element_text(size = 10, face = "plain", hjust = 0.5),
        axis.title.x = element_blank(),
        axis.text.y = element_text(size = 10, face = "plain", hjust = 0.5),
        axis.title.y = element_text(vjust = 0.5, size = 10, face = "bold"))
}

p1

最终合并的结果：

WGCNA

WGCNA（Weighted Gene Co-expression Network Analysis，加权基因共表达网络分析） 是一种无监督系统生物学方法，核心是从基因表达数据中挖掘协同表达的基因模块，并将模块与表型 / 临床性状关联，最终找到驱动表型的关键基因（Hub Gene）

解决的问题：从海量基因表达数据中，把 “杂乱无章” 的基因按表达模式相似性分组（模块），并找到与你关心的表型（疾病、性状、处理等）最相关的模块与核心基因。

加权网络：不用硬阈值（如 0.8）一刀切，用软阈值 β给相关性加权，更符合真实生物网络的无尺度分布。

软阈值：软阈值就是用一个幂次 power（比如 6、8、10）把相关系数做一个非线性变换：

$$ a_{ij}=∣cor(i,j)∣^β $$

由此把强相关保留放大，弱相关压缩到接近 0

无标度网络：一种节点连接度分布极不均匀的网络，极少数节点拥有大量连接，绝大多数节点只有很少的连接。选软阈值的核心目的，就是让构建的基因共表达网络尽可能接近无标度特性，能精准识别出真正的核心基因和共表达模块

模块 - 表型关联：直接把基因模块和临床 / 表型数据做关联，快速定位功能单元。

拓扑重叠矩阵（TOM）：邻接矩阵只看两个基因之间的直接连接强度，但真实的生物网络中，基因间的关系是 “多维度” 的。而 TOM 能把直接连接 + 间接连接 结合起来，更贴近真实的生物调控关系。

拓扑重叠分数（0~1）：

$$ TOM_{ij}=\frac{l_{ij}+a_{ij}}{min(k_i,k_j)+1-a_{ij}} $$$$ a_{ij}:基因 i 和 j 的直接邻接值 $$$$ k_{i}:基因 i 的总连接度（邻接矩阵中第 i 行所有元素的和，代表 i 和其他所有基因的直接连接总和） $$$$ l_{ij}:基因 i 和 j 的共同连接数（所有同时和 i、j 强连接的基因数量，比如 i 和 k 强连、j 和 k 强连 → k 就是 i 和 j 的共同连接） $$$$ min(k_i,k_j):取 k_i 和 k_j 的较小值，避免 “高连接度基因” 主导结果 $$

Hub 基因挖掘：在模块内找连接度最高、最核心的基因，作为候选靶点。

核心思想：基因表达模式相似 → 功能相关 → 聚成模块（Module）；模块整体与表型相关 → 该模块参与表型调控 → 模块内Hub 基因是关键调控因子。

加权基因共表达网络分析 (WGCNA）| 共表达网络模块构建 |生物信息学数据分析必备技能

WGCNA官方教程网站：Tutorials for WGCNA R package

示例分析

选取官网的第一个示例来分析，下载每一部分的R脚本

FemaleLiver-01-dataInput

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169


#=====================================================================================
#
#  Code chunk 1
#
#=====================================================================================


# Display the current working directory
getwd();
# If necessary, change the path below to the directory where the data files are stored. 
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = ".";
setwd(workingDir); 
# Load the WGCNA package
library(WGCNA);
# The following setting is important, do not omit.
# 以下是全局选项，告诉 R 不要把数据框里的字符串自动转换成因子
options(stringsAsFactors = FALSE);
#Read in the female liver data set
femData = read.csv("LiverFemale3600.csv");
# Take a quick look at what is in the data set:
dim(femData);
names(femData);


#=====================================================================================
#
#  Code chunk 2
#
#=====================================================================================

# 除去了前8列的注释，然后转置，行名变成样本名
datExpr0 = as.data.frame(t(femData[, -c(1:8)]));
# 转置后的矩阵缺乏列名，把之前的每个基因编号作为列名
names(datExpr0) = femData$substanceBXH;
rownames(datExpr0) = names(femData)[-c(1:8)];


#=====================================================================================
#
#  Code chunk 3
#
#=====================================================================================

# 检查表达矩阵的质量
gsg = goodSamplesGenes(datExpr0, verbose = 3);
# 看看结果是否OK
gsg$allOK


#=====================================================================================
#
#  Code chunk 4
#
#=====================================================================================

# 样本不合格的情况进行处理
if (!gsg$allOK)
{
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0) 
     printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples)>0) 
     printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}


#=====================================================================================
#
#  Code chunk 5
#
#=====================================================================================

# dist(datExpr0)：计算样本间的表达距离，生成距离矩阵。
#  平均链接法对样本进行层次聚类，生成聚类树。
sampleTree = hclust(dist(datExpr0), method = "average");

# Plot the sample tree: 打开一个 12×9 英寸的图形窗口，用于显示聚类树。
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);

# 设置图中文字和符号的缩放比例为 0.6，避免标签重叠。
par(cex = 0.6);
# 设置绘图边距（下、左、上、右），优化布局，让聚类树显示更清晰。
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, 
     cex.axis = 1.5, cex.main = 2)
# 这一步进行的是用所有基因的表达来聚样本，用于质量控制。反映的是整个表达谱的整体相似性，目的是检测离群样本。

#=====================================================================================
#
#  Code chunk 6
#
#=====================================================================================


# Plot a line to show the cut，选取15高度作为离群样本的判定阈值
abline(h = 15, col = "red");

# Determine cluster under the line，在高度 15 处 “砍树”，将聚类树切割成不同的簇。只保留包含至少 10 个样本的簇，过滤掉太小的异常分支。
clust = cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
table(clust)
# clust 1 contains the samples we want to keep.代表要保留的主簇。选出主簇生成逻辑向量
keepSamples = (clust==1)
datExpr = datExpr0[keepSamples, ]
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)


#=====================================================================================
#
#  Code chunk 7
#
#=====================================================================================

# 读取临床数据、表型数据等
traitData = read.csv("ClinicalTraits.csv");
dim(traitData)
names(traitData)

# remove columns that hold information we do not need.除去不要的临床信息列
allTraits = traitData[, -c(31, 16)];
allTraits = allTraits[, c(2, 11:36) ];
dim(allTraits)
names(allTraits)

# Form a data frame analogous to expression data that will hold the clinical traits.匹配行名
femaleSamples = rownames(datExpr);
# 返回allTraits中与femaleSamples行名一一匹配的位置向量
traitRows = match(femaleSamples, allTraits$Mice);
# 取出匹配的行并去除第一列的Mice样品名
datTraits = allTraits[traitRows, -1];
# 把Mice样品名命成行名
rownames(datTraits) = allTraits[traitRows, 1];

# 释放内存，清理不再需要的临时对象
collectGarbage();


#=====================================================================================
#
#  Code chunk 8
#
#=====================================================================================


# Re-cluster samples，再进行聚类
sampleTree2 = hclust(dist(datExpr), method = "average")
# Convert traits to a color representation: white means low, red means high, grey means missing entry
# 将表型数据 datTraits 中的数值转换为颜色矩阵。
traitColors = numbers2colors(datTraits, signed = FALSE);

# Plot the sample dendrogram and the colors underneath.这样可以在聚类树下用颜色条直观展示每个样本的表型水平。
plotDendroAndColors(sampleTree2, traitColors,
                    groupLabels = names(datTraits), 
                    main = "Sample dendrogram and trait heatmap")


#=====================================================================================
#
#  Code chunk 9
#
#=====================================================================================


save(datExpr, datTraits, file = "FemaleLiver-01-dataInput.RData")

如下是完成的第一次样品聚类和剔除异常值后的样品聚类+临床热图作图：

FemaleLiver-02-networkConstr-auto

自动、一步网络构建和模块检测

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134


#=====================================================================================
#
#  Code chunk 1
#
#=====================================================================================


# Display the current working directory
getwd();
# If necessary, change the path below to the directory where the data files are stored. 
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = ".";
setwd(workingDir); 
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Allow multi-threading within WGCNA. This helps speed up certain calculations.
# At present this call is necessary for the code to work.
# Any error here may be ignored but you may want to update WGCNA if you see one.
# Caution: skip this line if you run RStudio or other third-party R environments. 
# See note above.
# 启用 WGCNA 的多线程计算，以加速分析。
enableWGCNAThreads()
# Load the data saved in the first part
lnames = load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames


#=====================================================================================
#
#  Code chunk 2
#
#=====================================================================================


# Choose a set of soft-thresholding powers，生成一系列数作为软阈值的备选
powers = c(c(1:10), seq(from = 12, to=20, by=2))
# Call the network topology analysis function，
# 'pickSoftThreshold'用于评估不同软阈值下网络的拓扑特性，'verbose = 5'输出详细的计算过程，方便你跟踪进度。
# 最后返回为一个列表，包含了每个阈值对应的无标度拟合指数（R²）、平均连通性等关键指标
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;

# Scale-free topology fit index as a function of the soft-thresholding power
# 绘图横轴为第一列软阈值的值，纵轴为斜率符号取反后乘以对应的无标度网络拟合度R²，保证所有R²为正，方便看图
# type="n"：先不画点，只画坐标轴
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");

# this line corresponds to using an R^2 cut-off of h，分割无标度网络拟合度R²在0.9以上的
abline(h=0.90,col="red")

# Mean connectivity as a function of the soft-thresholding power，再绘制平均连通性
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

# 结合两者进行软阈值的选择，保证无标度网络拟合度R²和平均连通性

#=====================================================================================
#
#  Code chunk 3
#
#=====================================================================================

# 此处选取软阈值为6
# blockwiseModules根据你选定的软阈值 power 计算邻接矩阵，转换为拓扑重叠矩阵（TOM），进行层次聚类，识别共表达模块。
# TOMType："unsigned"：不区分正负相关，"signed"：保留符号，适合有方向调控的研究
# minModuleSize：每个模块最少包含的基因数量
# reassignThreshold：基因重新分配到其他模块的阈值，0 表示不重新分配
# mergeCutHeight：合并相似模块的树状图高度阈值，越小合并越严格
# numericLabels：是否用数字（1,2,3...）代替颜色作为模块标签
# pamRespectsDendro：是否严格遵守层次聚类的树状图结构进行模块划分
# saveTOMs：是否保存拓扑重叠矩阵（TOM）到文件
# saveTOMFileBase：保存 TOM 文件的文件名前缀
# verbose：输出详细程度，数字越大信息越详细
net = blockwiseModules(datExpr, power = 6,
                       TOMType = "unsigned", minModuleSize = 30,
                       reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       saveTOMs = TRUE,
                       saveTOMFileBase = "femaleMouseTOM", 
                       verbose = 3)


#=====================================================================================
#
#  Code chunk 4
#
#=====================================================================================


# open a graphics window
sizeGrWindow(12, 9)
# Convert labels to colors for plotting
# net$colors是输出的模块标签，labels2colors 把这些标签转换成 WGCNA 专用的模块颜色（如 turquoise、blue、brown 等），方便在图上区分不同模块。
mergedColors = labels2colors(net$colors)
# Plot the dendrogram and the module colors underneath
# net$dendrograms[[1]]：第一个数据块的层次聚类树状图，代表基因间的共表达相似性
# mergedColors[net$blockGenes[[1]]]：对应树状图上每个基因的模块颜色条，颜色相同的基因属于同一个共表达模块。
# dendroLabels = FALSE：不显示每个基因的名称，避免树状图过于拥挤。
# hang = 0.03：控制树状图末端的悬挂长度，让图更美观。
# addGuide = TRUE：添加一条引导线，连接树状图和颜色条，方便对应。
# guideHang = 0.05：引导线的悬挂长度。
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
                    "Module colors",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)

# 基因按共表达相似性聚类，分支越近的基因共表达关系越强。每个颜色代表一个共表达模块，同一颜色的基因在同一分支上聚集。

#=====================================================================================
#
#  Code chunk 5
#
#=====================================================================================


moduleLabels = net$colors
moduleColors = labels2colors(net$colors)
# MEs是聚类模块
MEs = net$MEs;
geneTree = net$dendrograms[[1]];
save(MEs, moduleLabels, moduleColors, geneTree, 
     file = "FemaleLiver-02-networkConstruction-auto.RData")

FemaleLiver-02-networkConstr-man

分步网络构建和模块检测

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213


#=====================================================================================
#
#  Code chunk 1
#
#=====================================================================================


# Display the current working directory
getwd();
# If necessary, change the path below to the directory where the data files are stored. 
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = ".";
setwd(workingDir); 
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Allow multi-threading within WGCNA. At present this call is necessary.
# Any error here may be ignored but you may want to update WGCNA if you see one.
# Caution: skip this line if you run RStudio or other third-party R environments.
# See note above.
enableWGCNAThreads()
# Load the data saved in the first part
lnames = load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames


#=====================================================================================
#
#  Code chunk 2
#
#=====================================================================================


# Choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=20, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")


#=====================================================================================
#
#  Code chunk 3
#
#=====================================================================================
# 以上软阈值的选取和自动法是一样的步骤

softPower = 6;
# adjacency()：用于计算基因之间的邻接关系，得到的是一个基因数×基因数的矩阵
# 这个函数就完成了一个基因全部表达量与另一基因全部表达量之间的相关系数计算，再把所有相关系数根据软阈值处理得到邻接矩阵
adjacency = adjacency(datExpr, power = softPower);


#=====================================================================================
#
#  Code chunk 4
#
#=====================================================================================


# Turn adjacency into topological overlap
# TOMsimilarity()：WGCNA 的核心函数，用于将邻接矩阵转换为拓扑重叠矩阵（TOM），更精准地衡量基因间的 “真实共表达关系”。
TOM = TOMsimilarity(adjacency);
# TOM值越高代表基因之间的 “相似度” 或 “连接紧密度”越高。转换为1-TOM把 “相似度” 转成 “不相似度”“距离”，符合了距离矩阵的直觉：距离越小，基因越相似，方便后续做层次聚类、构建共表达网络等操作
dissTOM = 1-TOM


#=====================================================================================
#
#  Code chunk 5
#
#=====================================================================================


# Call the hierarchical clustering function，as.dist(dissTOM)转换为函数可识别的距离对象格式
geneTree = hclust(as.dist(dissTOM), method = "average");
# Plot the resulting clustering tree (dendrogram)
sizeGrWindow(12,9)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
     labels = FALSE, hang = 0.04);


#=====================================================================================
#
#  Code chunk 6
#
#=====================================================================================

# 筛选模块中至少要保留的基因数
# We like large modules, so we set the minimum module size relatively high:
minModuleSize = 30;
# Module identification using dynamic tree cut:
# pamRespectsDendro = FALSE：设置为 FALSE 时，在重新分配基因时可以不严格遵守原始聚类树的结构，这有助于获得更紧凑的模块。
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
                deepSplit = 2, pamRespectsDendro = FALSE,
                minClusterSize = minModuleSize);
# table(dynamicMods)统计并输出每个模块中包含的基因数量。这是一个快速检查结果的方式，可以直观地看到识别出了多少个模块，以及每个模块的大小。
table(dynamicMods)


#=====================================================================================
#
#  Code chunk 7
#
#=====================================================================================


# Convert numeric lables into colors
# 将模块转换为颜色，再作图
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
# Plot the dendrogram and colors underneath
sizeGrWindow(8,6)
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05,
                    main = "Gene dendrogram and module colors")


#=====================================================================================
#
#  Code chunk 8
#
#=====================================================================================


# Calculate eigengenes
# moduleEigengenes：WGCNA 的核心函数，用于计算每个共表达模块的模块特征基因。模块特征基因是一个代表整个模块表达模式的综合指标，通常是该模块内所有基因表达谱的第一主成分（PC1）
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
# 从返回的列表中提取出模块特征基因矩阵 MEs，每一列代表一个模块的特征基因表达谱。
MEs = MEList$eigengenes
# Calculate dissimilarity of module eigengenes
# 计算模块特征基因之间的不相似度矩阵。
MEDiss = 1-cor(MEs);
# Cluster module eigengenes
# 这一步的目的是查看哪些模块的表达模式彼此相似，可能可以合并。
METree = hclust(as.dist(MEDiss), method = "average");
# Plot the result
sizeGrWindow(7, 6)
plot(METree, main = "Clustering of module eigengenes",
     xlab = "", sub = "")


#=====================================================================================
#
#  Code chunk 9
#
#=====================================================================================

# 合并0.25以下的，即相关性大于等于0.75的模块会被认为是高度相似的，将被自动合并
MEDissThres = 0.25
# Plot the cut line into the dendrogram
abline(h=MEDissThres, col = "red")
# Call an automatic merging function
# mergeCloseModules：WGCNA 中用于自动合并相似模块的函数。
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
# The merged module colors，提取合并后新模块颜色标签
mergedColors = merge$colors;
# Eigengenes of the new merged modules，提取合并后新模块特征基因表达谱
mergedMEs = merge$newMEs;


#=====================================================================================
#
#  Code chunk 10
#
#=====================================================================================


sizeGrWindow(12, 9)
#pdf(file = "Plots/geneDendro-3.pdf", wi = 9, he = 6)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
                    c("Dynamic Tree Cut", "Merged dynamic"),
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
#dev.off()


#=====================================================================================
#
#  Code chunk 11
#
#=====================================================================================


# Rename to moduleColors
moduleColors = mergedColors
# Construct numerical labels corresponding to the colors，定义一个颜色顺序向量，"grey" 代表未分配到任何模块的 “灰色” 基因。
colorOrder = c("grey", standardColors(50));
# 每个基因的模块颜色（moduleColors）映射到 colorOrder 向量中的位置，得到对应的数字标签。减去 1 是为了让标签从 0 开始计数，其中 0 对应灰色基因，1、2、3... 依次对应其他模块。
moduleLabels = match(moduleColors, colorOrder)-1;
# 将合并后的模块特征基因 mergedMEs 重命名为 MEs，同样是为了符合后续分析的标准命名。
MEs = mergedMEs;
# Save module colors and labels for use in subsequent parts
save(MEs, moduleLabels, moduleColors, geneTree, file = "FemaleLiver-02-networkConstruction-stepByStep.RData")

FemaleLiver-02-networkConstr-blockwise

基因较多时使用

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156


#=====================================================================================
#
#  Code chunk 1
#
#=====================================================================================


# Display the current working directory
getwd();
# If necessary, change the path below to the directory where the data files are stored. 
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = ".";
setwd(workingDir); 
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Allow multi-threading within WGCNA. This helps speed up certain calculations.
# At present this call is necessary.
# Any error here may be ignored but you may want to update WGCNA if you see one.
# Caution: skip this line if you run RStudio or other third-party R environments.
# See note above.
enableWGCNAThreads()
# Load the data saved in the first part
lnames = load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames


#=====================================================================================
#
#  Code chunk 2
#
#=====================================================================================


# Choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=20, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")


#=====================================================================================
#
#  Code chunk 3
#
#=====================================================================================

# blockwiseModules：WGCNA 中用于大规模数据的一站式函数，它会自动将基因分成多个 “块（blocks）”，在每个块内独立进行网络构建和模块检测，最后整合结果。
# maxBlockSize = 2000设置每个块中包含的最大基因数为 2000。当基因总数超过这个值时，函数会自动将基因分成多个块进行处理，这可以有效降低内存消耗和计算时间。
bwnet = blockwiseModules(datExpr, maxBlockSize = 2000,
                       power = 6, TOMType = "unsigned", minModuleSize = 30,
                       reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE,
                       saveTOMs = TRUE,
                       saveTOMFileBase = "femaleMouseTOM-blockwise",
                       verbose = 3)


#=====================================================================================
#
#  Code chunk 4
#
#=====================================================================================


# Load the results of single-block analysis，导入对比自动构建和检测的结果
load(file = "FemaleLiver-02-networkConstruction-auto.RData");
# Relabel blockwise modules
# 将分块分析通过比较模块内的基因组成，找到最匹配的对应关系，让可以和第一种自动分析进行直接比较
bwLabels = matchLabels(bwnet$colors, moduleLabels);
# Convert labels to colors for plotting
bwModuleColors = labels2colors(bwLabels)


#=====================================================================================
#
#  Code chunk 5
#
#=====================================================================================


# open a graphics window
sizeGrWindow(6,6)
# Plot the dendrogram and the module colors underneath for block 1
# 由于最大聚类是2000，这里结果一是2000的聚类
plotDendroAndColors(bwnet$dendrograms[[1]], bwModuleColors[bwnet$blockGenes[[1]]],
                    "Module colors", main = "Gene dendrogram and module colors in block 1", 
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
# Plot the dendrogram and the module colors underneath for block 2
# 这里是剩下1600的聚类
plotDendroAndColors(bwnet$dendrograms[[2]], bwModuleColors[bwnet$blockGenes[[2]]],
                    "Module colors", main = "Gene dendrogram and module colors in block 2", 
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)


#=====================================================================================
#
#  Code chunk 6
#
#=====================================================================================


sizeGrWindow(12,9)
plotDendroAndColors(geneTree,
                    cbind(moduleColors, bwModuleColors),
                    c("Single block", "2 blocks"),
                    main = "Single block gene dendrogram and module colors",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)


#=====================================================================================
#
#  Code chunk 7
#
#=====================================================================================

# 计算单块分析得到的模块（moduleColors）的特征基因 singleBlockMEs
singleBlockMEs = moduleEigengenes(datExpr, moduleColors)$eigengenes;
# 计算分块分析得到的模块（bwModuleColors）的特征基因 blockwiseMEs
blockwiseMEs = moduleEigengenes(datExpr, bwModuleColors)$eigengenes;


#=====================================================================================
#
#  Code chunk 8
#
#=====================================================================================

# 根据模块名称（如 MEblue, MEturquoise），建立从单块分析模块到分块分析模块的索引映射
single2blockwise = match(names(singleBlockMEs), names(blockwiseMEs))
# cor(...)：计算 blockwiseMEs（按映射顺序重排后）与 singleBlockMEs 之间的相关系数矩阵
# diag(...)：提取相关系数矩阵的对角线元素，即对应模块之间的相关系数
# signif(..., 3)：将相关系数保留 3 位有效数字，方便阅读
signif(diag(cor(blockwiseMEs[, single2blockwise], singleBlockMEs)), 3)
# 这些对角线值越接近 1，说明两种分析方法得到的对应模块的表达模式越一致，分块分析的结果就越可靠。

FemaleLiver-03-relateModsToExt

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214


#=====================================================================================
#
#  Code chunk 1
#
#=====================================================================================


# Display the current working directory
getwd();
# If necessary, change the path below to the directory where the data files are stored. 
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = ".";
setwd(workingDir); 
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the expression and trait data saved in the first part
lnames = load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "FemaleLiver-02-networkConstruction-auto.RData");
lnames


#=====================================================================================
#
#  Code chunk 2
#
#=====================================================================================


# Define numbers of genes and samples
nGenes = ncol(datExpr);
nSamples = nrow(datExpr);
# Recalculate MEs with color labels
# 重新计算所有共表达模块的模块特征基因（MEs），模块特征基因代表了一个模块内所有基因的整体表达模式，是模块与性状关联的核心变量。
# 明确基于最终确定的 moduleColors（所有调整后的最终模块标签），从源头重新生成 MEs，彻底避免 “用旧模块的 MEs 匹配新性状” 的错误。
MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
# 对模块特征基因进行重新排序，通常是根据它们与性状的关联强度或模块间的相似性进行排列
MEs = orderMEs(MEs0)
# 计算模块特征基因（MEs）与表型性状（datTraits）之间的皮尔逊相关系数
# use = "p"：表示使用成对删除（pairwise deletion）来处理缺失值。
moduleTraitCor = cor(MEs, datTraits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);


#=====================================================================================
#
#  Code chunk 3
#
#=====================================================================================

# 作图
sizeGrWindow(10,6)
# Will display correlations and their p-values
# 创建一个文本矩阵 textMatrix，用于在热图的每个单元格中同时显示相关系数和P 值
textMatrix =  paste(signif(moduleTraitCor, 2), "\n(",                           signif(moduleTraitPvalue, 1), ")", sep = "");
# 将文本矩阵的维度设置为与相关系数矩阵 moduleTraitCor 完全一致，确保每个文本对应正确的热图单元格
dim(textMatrix) = dim(moduleTraitCor)
# 调整绘图边距，特别是增大了左侧边距（8.5），为较长的模块名称（y 轴标签）留出足够空间，避免文字重叠
par(mar = c(6, 8.5, 3, 3));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(datTraits),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = greenWhiteRed(50),
               textMatrix = textMatrix,
               setStdMargins = FALSE,
               cex.text = 0.5,
               zlim = c(-1,1),
               main = paste("Module-trait relationships"))


#=====================================================================================
#
#  Code chunk 4
#
#=====================================================================================


# Define variable weight containing the weight column of datTrait
# 选取表型进行分析
weight = as.data.frame(datTraits$weight_g);
names(weight) = "weight"
# names (colors) of the modules
# 截去模块特征基因的前两位ME，得到模块颜色名称
modNames = substring(names(MEs), 3)

# 计算每个基因的表达量与所有模块特征基因之间的皮尔逊相关系数
# 这个相关系数就是基因模块成员度（MM），代表该基因在模块内的核心程度。MM 值越高，说明基因与模块的表达模式越一致，是模块内的核心基因
geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
# p值计算
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));

# 为颜色列名添加前缀 MM和 p.MM
names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");

# 计算每个基因的表达量与关注性状（体重）之间的皮尔逊相关系数。这个相关系数就是基因性状显著性（GS），代表基因与性状的关联强度。GS 值越高，说明基因与性状的关系越密切。
geneTraitSignificance = as.data.frame(cor(datExpr, weight, use = "p"));
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));

# 为列名添加前缀 GS.，例如 GS.weight
names(geneTraitSignificance) = paste("GS.", names(weight), sep="");
names(GSPvalue) = paste("p.GS.", names(weight), sep="");


#=====================================================================================
#
#  Code chunk 5
#
#=====================================================================================

# 选取想看的weight临床数据中先前作图最相关的模块颜色
module = "brown"
column = match(module, modNames);
# 所有基因中进行匹配，基因在模块里面对应颜色则是匹配的，返回T
moduleGenes = moduleColors==module;

sizeGrWindow(7, 7);
par(mfrow = c(1,1));

# verboseScatterplot(...)：WGCNA 提供的增强版散点图函数，能自动添加拟合线和相关系数
# x轴是 “brown” 模块内基因的模块成员度（MM）的绝对值，代表基因在模块内的核心程度;y轴是“brown” 模块内基因与体重性状的显著性（GS）的绝对值，代表基因与性状的关联强度
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
                   abs(geneTraitSignificance[moduleGenes, 1]),
                   xlab = paste("Module Membership in", module, "module"),
                   ylab = "Gene significance for body weight",
                   main = paste("Module membership vs. gene significance\n"),
                   cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)


#=====================================================================================
#
#  Code chunk 6
#
#=====================================================================================


names(datExpr)


#=====================================================================================
#
#  Code chunk 7
#
#=====================================================================================

# 提取brown模块里面的所有基因
names(datExpr)[moduleColors=="brown"]


#=====================================================================================
#
#  Code chunk 8
#
#=====================================================================================

# 读取基因注释文件
annot = read.csv(file = "GeneAnnotation.csv");
dim(annot)
names(annot)
probes = names(datExpr)
probes2annot = match(probes, annot$substanceBXH)
# The following is the number or probes without annotation:
sum(is.na(probes2annot))
# Should return 0.


#=====================================================================================
#
#  Code chunk 9
#
#=====================================================================================


# Create the starting data frame
geneInfo0 = data.frame(substanceBXH = probes,
                      geneSymbol = annot$gene_symbol[probes2annot],
                      LocusLinkID = annot$LocusLinkID[probes2annot],
                      moduleColor = moduleColors,
                      geneTraitSignificance,
                      GSPvalue)

# Order modules by their significance for weight，计算每个模块特征基因（ME）与体重的相关性，然后按相关性绝对值从大到小排序
modOrder = order(-abs(cor(MEs, weight, use = "p")));

# Add module membership information in the chosen order，添加每个基因在所有模块中的模块成员度（MM）及其 P 值（p.MM）
for (mod in 1:ncol(geneModuleMembership))
{
  oldNames = names(geneInfo0)
  geneInfo0 = data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]], 
                         MMPvalue[, modOrder[mod]]);
  names(geneInfo0) = c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
                       paste("p.MM.", modNames[modOrder[mod]], sep=""))
}

# Order the genes in the geneInfo variable first by module color, then by geneTraitSignificance
geneOrder = order(geneInfo0$moduleColor, -abs(geneInfo0$GS.weight));
geneInfo = geneInfo0[geneOrder, ]


#=====================================================================================
#
#  Code chunk 10
#
#=====================================================================================


write.csv(geneInfo, file = "geneInfo.csv")

FemaleLiver-04-Interfacing

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120


#=====================================================================================
#
#  Code chunk 1
#
#=====================================================================================


# Display the current working directory
getwd();
# If necessary, change the path below to the directory where the data files are stored. 
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = ".";
setwd(workingDir); 
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the expression and trait data saved in the first part
lnames = load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "FemaleLiver-02-networkConstruction-auto.RData");
lnames


#=====================================================================================
#
#  Code chunk 2
#
#=====================================================================================


# Read in the probe annotation
annot = read.csv(file = "GeneAnnotation.csv");
# Match probes in the data set to the probe IDs in the annotation file 
probes = names(datExpr)
probes2annot = match(probes, annot$substanceBXH)
# Get the corresponding Locuis Link IDs
allLLIDs = annot$LocusLinkID[probes2annot];
# $ Choose interesting modules
intModules = c("brown", "red", "salmon")
for (module in intModules)
{
  # Select module probes
  modGenes = (moduleColors==module)
  # Get their entrez ID codes
  modLLIDs = allLLIDs[modGenes];
  # Write them into a file
  fileName = paste("LocusLinkIDs-", module, ".txt", sep="");
  write.table(as.data.frame(modLLIDs), file = fileName,
              row.names = FALSE, col.names = FALSE)
}
# As background in the enrichment analysis, we will use all probes in the analysis.
fileName = paste("LocusLinkIDs-all.txt", sep="");
write.table(as.data.frame(allLLIDs), file = fileName,
            row.names = FALSE, col.names = FALSE)


#=====================================================================================
#
#  Code chunk 3
#
#=====================================================================================


GOenr = GOenrichmentAnalysis(moduleColors, allLLIDs, organism = "mouse", nBestP = 10);


#=====================================================================================
#
#  Code chunk 4
#
#=====================================================================================


tab = GOenr$bestPTerms[[4]]$enrichment


#=====================================================================================
#
#  Code chunk 5
#
#=====================================================================================


names(tab)


#=====================================================================================
#
#  Code chunk 6
#
#=====================================================================================


write.table(tab, file = "GOEnrichmentTable.csv", sep = ",", quote = TRUE, row.names = FALSE)


#=====================================================================================
#
#  Code chunk 7
#
#=====================================================================================


keepCols = c(1, 2, 5, 6, 7, 12, 13);
screenTab = tab[, keepCols];
# Round the numeric columns to 2 decimal places:
numCols = c(3, 4);
screenTab[, numCols] = signif(apply(screenTab[, numCols], 2, as.numeric), 2)
# Truncate the the term name to at most 40 characters
screenTab[, 7] = substring(screenTab[, 7], 1, 40)
# Shorten the column names:
colnames(screenTab) = c("module", "size", "p-val", "Bonf", "nInTerm", "ont", "term name");
rownames(screenTab) = NULL;
# Set the width of R's output. The reader should play with this number to obtain satisfactory output.
options(width=95)
# Finally, display the enrichment table:
screenTab

FemaleLiver-05-Visualization

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108


#=====================================================================================
#
#  Code chunk 1
#
#=====================================================================================


# Display the current working directory
getwd();
# If necessary, change the path below to the directory where the data files are stored. 
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = ".";
setwd(workingDir); 
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the expression and trait data saved in the first part
lnames = load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "FemaleLiver-02-networkConstruction-auto.RData");
lnames
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)


#=====================================================================================
#
#  Code chunk 2
#
#=====================================================================================


# Calculate topological overlap anew: this could be done more efficiently by saving the TOM
# calculated during module detection, but let us do it again here.
dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6);
# Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
plotTOM = dissTOM^7;
# Set diagonal to NA for a nicer plot
diag(plotTOM) = NA;
# Call the plot function
sizeGrWindow(9,9)
TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")


#=====================================================================================
#
#  Code chunk 3
#
#=====================================================================================


nSelect = 400
# For reproducibility, we set the random seed
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
# There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select];
# Open a graphical window
sizeGrWindow(9,9)
# Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing 
# the color palette; setting the diagonal to NA also improves the clarity of the plot
plotDiss = selectTOM^7;
diag(plotDiss) = NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")


#=====================================================================================
#
#  Code chunk 4
#
#=====================================================================================


# Recalculate module eigengenes
MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
# Isolate weight from the clinical traits
weight = as.data.frame(datTraits$weight_g);
names(weight) = "weight"
# Add the weight to existing module eigengenes
MET = orderMEs(cbind(MEs, weight))
# Plot the relationships among the eigengenes and the trait
sizeGrWindow(5,7.5);
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8, xLabelsAngle
= 90)


#=====================================================================================
#
#  Code chunk 5
#
#=====================================================================================


# Plot the dendrogram
sizeGrWindow(6,6);
par(cex = 1.0)
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),
                      plotHeatmaps = FALSE)
# Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)
par(cex = 1.0)
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2),
                      plotDendrograms = FALSE, xLabelsAngle = 90)

FemaleLiver-06-ExportNetwork

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101


#=====================================================================================
#
#  Code chunk 1
#
#=====================================================================================


# Display the current working directory
getwd();
# If necessary, change the path below to the directory where the data files are stored. 
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = ".";
setwd(workingDir); 
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the expression and trait data saved in the first part
lnames = load(file = "FemaleLiver-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "FemaleLiver-02-networkConstruction-auto.RData");
lnames


#=====================================================================================
#
#  Code chunk 2
#
#=====================================================================================


# Recalculate topological overlap
TOM = TOMsimilarityFromExpr(datExpr, power = 6);
# Read in the annotation file
annot = read.csv(file = "GeneAnnotation.csv");
# Select module
module = "brown";
# Select module probes
probes = names(datExpr)
inModule = (moduleColors==module);
modProbes = probes[inModule];
# Select the corresponding Topological Overlap
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
# Export the network into an edge list file VisANT can read
vis = exportNetworkToVisANT(modTOM,
  file = paste("VisANTInput-", module, ".txt", sep=""),
  weighted = TRUE,
  threshold = 0,
  probeToGene = data.frame(annot$substanceBXH, annot$gene_symbol) )


#=====================================================================================
#
#  Code chunk 3
#
#=====================================================================================


nTop = 30;
IMConn = softConnectivity(datExpr[, modProbes]);
top = (rank(-IMConn) <= nTop)
vis = exportNetworkToVisANT(modTOM[top, top],
  file = paste("VisANTInput-", module, "-top30.txt", sep=""),
  weighted = TRUE,
  threshold = 0,
  probeToGene = data.frame(annot$substanceBXH, annot$gene_symbol) )


#=====================================================================================
#
#  Code chunk 4
#
#=====================================================================================


# Recalculate topological overlap if needed
TOM = TOMsimilarityFromExpr(datExpr, power = 6);
# Read in the annotation file
annot = read.csv(file = "GeneAnnotation.csv");
# Select modules
modules = c("brown", "red");
# Select module probes
probes = names(datExpr)
inModule = is.finite(match(moduleColors, modules));
modProbes = probes[inModule];
modGenes = annot$gene_symbol[match(modProbes, annot$substanceBXH)];
# Select the corresponding Topological Overlap
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
# Export the network into edge and node list files Cytoscape can read
cyt = exportNetworkToCytoscape(modTOM,
  edgeFile = paste("CytoscapeInput-edges-", paste(modules, collapse="-"), ".txt", sep=""),
  nodeFile = paste("CytoscapeInput-nodes-", paste(modules, collapse="-"), ".txt", sep=""),
  weighted = TRUE,
  threshold = 0.02,
  nodeNames = modProbes,
  altNodeNames = modGenes,
  nodeAttr = moduleColors[inModule]);

实际分析

此处选取处理GSE84844，提前下载并完成到前期的数据处理，得到GSE84844.rda文件

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305


rm(list = ls())
options(stringsAsFactors = FALSE)
library(tidyverse)
#install.packages("WGCNA")
library(WGCNA)
setwd('C:/Users/30521/Desktop/R自学/GEO-analysis/WGCNA/GSE84844')#设置工作路径
load('GSE84844.rda')

# 1.数据准备####
# 转置，注意！！！：exp类型一定要提前转为数字
# exp <- exp %>% mutate(across(everything(), as.numeric))
exp <- as.data.frame(t(exp))
#输入的基因:可以是全部基因、或者高变基因、也可以是差异基因，建议至少至少2000以上
#样本：15对15以上

# 1.1 选取高变基因，var方差前百分之
# apply用于处理每列基因的方差，'2'指按列处理，'1'指按行处理
vars_res <- apply(exp,2,var)
# 计算百分位数截止值
per_res <- quantile(vars_res, probs = seq(0, 1, 0.25)) # quantile生成分位数
# 选取方差位于前25%的基因
per_gene <- exp[, which(vars_res > per_res[4])]
datExpr0 <- data.matrix(per_gene)

# 1.2 选取高变基因，绝对偏差中位数排名前5000的基因
# mad_res <- apply(exp, 2, mad, na.rm = TRUE)
# mad_gene <- exp[, order(mad_res, decreasing = TRUE)[1:5000], drop = FALSE]
# datExpr0 <- data.matrix(mad_gene)

# 2.数据检查####
#检查样本和基因是否ok
gsg = goodSamplesGenes(datExpr0, verbose = 3)
gsg$allOK

#如果没有达标就需要筛选，[1] TRUE 也能运行，不影响结果
if (!gsg$allOK){
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0)
    printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples)>0)
    printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}

#进行样本聚类
sampleTree = hclust(dist(datExpr0), method = "average")
par(cex = 0.6)
par(mar = c(0,4,2,0))
plot(sampleTree)
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)
#剪切线
abline(h = 70, col = "red")

#切除未聚类的样本，看的是样本的合并高度
clust = cutreeStatic(sampleTree, cutHeight = 70, minSize = 10)#注意：cutHeight = 这里才是真正切割
table(clust)
keepSamples = (clust==1)
#注意查看切割数量0有多少 1有多少
datExpr0 = datExpr0[keepSamples, ]


#表型数据分组
table(rt)
# 根据具体临床数据，control列的control对应为1，pSS对应为0；同样地运用于建立的pSS列
traitData <- data.frame(control=c(rep(1,30),rep(0,30)),
                        pSS=c(rep(0,30),rep(1,30)))
row.names(traitData) <- rownames(exp)

sameSample <- intersect(rownames(datExpr0), rownames(traitData))
datExpr <- datExpr0[sameSample,]
datTraits <- traitData[sameSample,]

#去除后再聚类
sampleTree2 = hclust(dist(datExpr), method = "average")
plot(sampleTree2)
#标记颜色
traitColors = numbers2colors(datTraits, signed = FALSE)
# 3.第一部分样本可视化作图####
# 完成样本与表型的聚类
plotDendroAndColors(sampleTree2, traitColors,
                    groupLabels = names(datTraits),
                    main = "Sample dendrogram and trait heatmap")
save(datExpr, datTraits, file = "WGCNA_data1.rda")

# 4.网络构建####
rm(list = ls())
load(file = "WGCNA_data1.rda")
#power值散点图
allowWGCNAThreads()   #多线程工作
powers = c(c(1:10), seq(from = 12, to=20, by=2)) #幂指数范围1到20
powers
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
sft

#作图
sizeGrWindow(9, 5)
par(mfrow = c(1,2))
cex1 = 0.9
#拟合指数与power值散点图，确保散点在0.8以上，最好0.9以上
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
abline(h=0.90,col="red") #根据自己选择修改，如0.9则不变
###平均连通性与power值散点图
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

#自动计算softPower，自动评估软阈值
softPower <- sft$powerEstimate
softPower
#查看最佳softPower值
#选择power
softPower = 6
adjacency = adjacency(datExpr, power = softPower)

#TOM矩阵
TOM = TOMsimilarity(adjacency);
dissTOM = 1-TOM
#基因聚类
geneTree = hclust(as.dist(dissTOM), method = "average")
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
     labels = FALSE, hang = 0.04)
##动态剪切模块识别
minModuleSize = 30  #最小模块基因数
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
                            deepSplit = 2, pamRespectsDendro = FALSE,
                            minClusterSize = minModuleSize);
table(dynamicMods)
#标记颜色
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
#5.第二部分基因聚类作图####
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05,
                    main = "Gene dendrogram and module colors")

#6.相似模块聚类####
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
MEs = MEList$eigengenes
MEDiss = 1-cor(MEs);
METree = hclust(as.dist(MEDiss), method = "average")
sizeGrWindow(7, 6)
plot(METree, main = "Clustering of module eigengenes",
     xlab = "", sub = "")

#剪切高度，0.25标识相异性（代表75%以上相似性模块合并）及MEDissThres越小代表要求相似性越高，最终模块越多
MEDissThres = 0.25
abline(h=MEDissThres, col = "red")
#相似模块合并（也可不做，根据结果调整）
merge = mergeCloseModules(datExpr, dynamicColors,cutHeight = MEDissThres, verbose = 3)
mergedColors = merge$colors
mergedMEs = merge$newMEs
plotDendroAndColors(geneTree, mergedColors,"Dynamic Tree Cut",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05,
                    main = "Gene dendrogram and module colors")
moduleColors = mergedColors
table(moduleColors)
colorOrder = c("grey", standardColors(50))
moduleLabels = match(moduleColors, colorOrder)-1
MEs = mergedMEs
save(MEs, moduleLabels, moduleColors, geneTree, file = "WGCNA_data2.rda")

#7.第三部分模块与特征热图####
#模块与表型数据热图
rm(list=ls())
load('WGCNA_data1.rda')
load('WGCNA_data2.rda')
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
moduleTraitCor = cor(MEs, datTraits, use = "p")
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)
#热图展示
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
                   signif(moduleTraitPvalue, 1), ")", sep = "")
dim(textMatrix) = dim(moduleTraitCor)
sizeGrWindow(10,6)
par(mar = c(6, 8.5, 3, 3))
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(datTraits),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = blueWhiteRed(50), #greenWhiteRed
               textMatrix = textMatrix,
               cex.lab.y = 0.5,
               cex.lab.x = 0.5,
               setStdMargins = FALSE,
               cex.text = 0.5,
               zlim = c(-1,1),
               main = paste("Module-trait relationships"))

#模块选择相关性0.4以上，P<0.05
#结果不好，调节高变基因，如25%至50%

# 8.模块提取####
modNames = substring(names(MEs), 3)

geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"))
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples))

names(geneModuleMembership) = paste("MM", modNames, sep="")
names(MMPvalue) = paste("p.MM", modNames, sep="")
#完成基因模热图

#进行模块基因输出
# 基因与表型的相关性
traitNames=names(datTraits)
geneTraitSignificance = as.data.frame(cor(datExpr, datTraits, use = "p"))
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples))
names(geneTraitSignificance) = paste("GS.", traitNames, sep="")
names(GSPvalue) = paste("p.GS.", traitNames, sep="")

#9.批量输出散点图####
for (trait in traitNames){
  traitColumn=match(trait,traitNames)  
  for (module in modNames){
    column = match(module, modNames)
    moduleGenes = moduleColors==module
    if (nrow(geneModuleMembership[moduleGenes,]) > 1){
      outPdf=paste(trait, "_", module,".pdf",sep="")
      pdf(file=outPdf,width=7,height=7)
      par(mfrow = c(1,1))
      verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
                         abs(geneTraitSignificance[moduleGenes, traitColumn]),
                         xlab = paste("Module Membership in", module, "module"),
                         ylab = paste("Gene significance for ",trait),
                         main = paste("Module membership vs. gene significance\n"),
                         cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
      abline(v=0.8,h=0.5,col="red")
      dev.off()
    }
  }
}

#10.批量输出模块基因####
for (mod in 1:nrow(table(moduleColors)))
{  
  modules = names(table(moduleColors))[mod]
  probes = colnames(datExpr)
  inModule = (moduleColors == modules)
  modGenes = probes[inModule]
  write.table(modGenes, file =paste0(modules,".txt"),sep="\t",row.names=F,col.names=F,quote=F)
}

#11.其他补充图####
rm(list=ls())
load('WGCNA_data1.rda')
load('WGCNA_data2.rda')
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
#11.1 TOM聚类图####
dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6);#power与之前保持一致
#增加强度
plotTOM = dissTOM^7;
#对角线设置为NA
diag(plotTOM) = NA;
#绘图（十分缓慢，建议进行nSelect挑选）
sizeGrWindow(9,9)
TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")

#挑选基因，快速作图
nSelect = 400
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select];
sizeGrWindow(9,9)
plotDiss = selectTOM^7;
diag(plotDiss) = NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")

#11.2 特征模块热图####
rm(list=ls())
load('WGCNA_data1.rda')
load('WGCNA_data2.rda')
#重新计算模块特征基因
MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
#提取特征
Traits = as.data.frame(datTraits$pSS); #选择特征为pSS
names(Traits) = "Traits"
#将特征添加到现有模块特征基因中
MET = orderMEs(cbind(MEs, Traits))
#绘制特征基因与性状之间的关系图
par(cex = 1)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8, xLabelsAngle
                      = 90)

#绘制树状图
sizeGrWindow(6,6);
par(cex = 1.0)
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),
                      plotHeatmaps = FALSE)
#绘制热图矩阵
par(cex = 1.0)
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2),
                      plotDendrograms = FALSE, xLabelsAngle = 90)
#WGCNA结束