背景
Enrichment for Molecular Concept Map, to identify significant association within lists of TF genes。
通过转录因子靶基因的集合,用Fisher's exact test分析转录因子间的coupling(耦合、共调控),在形式上模仿paper里的这种图来展示结果:

出自PMID: 27288520文章
应用场景
分析任意集合之间的关系,例如转录因子之间(示例数据)、差异表达基因跟通路之间、染色质开放跟基因转录之间。
场景一:找转录因子之间的耦合关系(共同调控靶基因),例如示例数据(Interactions.RData)是201个转录因子,以及每个转录因子对应的多个有相互作用的基因。
场景二:差异表达基因在通路里的富集,就把差异表达基因跟通路里的基因都放到Interactions.RData里。例如:
$up_regulated_genes
[1] "ZBTB38" "SRA1" "IPO13" "ALX4" "EP300" "CREBBP" "IPO13"
$MAPK_pathway
[1] "ESR2" "ZNF688" "BANP" "ISYNA1" "TBC1D7"
场景三:分析染色质开放程度跟基因转录调控之间的关系,就把ATAC-seq peak附近的基因跟差异表达基因都放到Interactions.RData里。例如:
$up_regulated_genes
[1] "ZBTB38" "SRA1" "IPO13" "ALX4" "EP300" "CREBBP" "IPO13"
$ATAC-seq_peak
[1] "ESR2" "ZNF688" "BANP" "ISYNA1" "TBC1D7"
环境设置
使用国内镜像安装包
options("repos"= c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
install.packages("rentrez")
加载包
library(rentrez)
library(RColorBrewer)
library(corrplot)
library(ggplot2)
library(ggthemes)
source("my_function.R")
Sys.setenv(LANGUAGE = "en") #显示英文报错信息
options(stringsAsFactors = FALSE) #禁止chr转成factor
输入文件
如果你有自己的相互作用的基因list,保存在Interactions.RData里,就可以跳过这步,直接进入“Fisher's Exact Test”。
easy_input_TF.csv,此处用201个转录因子的gene symbol,可以替换成你感兴趣的基因的gene symbol,例如差异表达基因、差异表达lncRNA等。
easy_input_bg.csv,背景gene sybol,可以替换成与你感兴趣的基因相对应的背景基因,例如基因组上的全部基因、全部lncRNA等。
TFlist <- read.csv("easy_input_TF.csv", header = T)
head(TFlist)
dim(TFlist)
bglist <- read.table("easy_input_bg.csv", header = T)
head(bglist)
dim(bglist)
从NCBI获取基因interaction
用rentrez包获取NCBI gene数据库里的Interactions,即基因与它有相互作用的基因的list,保存到Interactions.RData。The general interactions in this section are provided, without review by Gene staff, by the external sources listed in https://ftp.ncbi.nlm.nih.gov/gene/GeneRIF/interaction_sources。详情看这里:https://www.ncbi.nlm.nih.gov/books/NBK3841/#EntrezGene.Interactions
#Download TF related interactions genes from NCBI Gene
#来源:https://github.com/ropensci/rentrez/wiki/Find-genes-known-to-interact-with-a-given-gene
res <- c()
for (i in 1:nrow(TFlist)) {
output <- print(as.character(TFlist$GeneSym[i]),quote=FALSE)
gene_search <- entrez_search(db="gene",term=paste0("(",output,"[GENE]) AND (Homo sapiens[ORGN])"))
# if you just want the interacting genes you can use this function, it's huge Xpath query
interactions_from_gene <- function(gene_id){
xmlrec <- entrez_fetch(db="gene", id=gene_id, rettype="xml", parsed=TRUE)
XML::xpathSApply(xmlrec,
"//Gene-commentary[Gene-commentary_heading[./text()='Interactions']]//Other-source[Other-source_src/Dbtag/Dbtag_db[./text()='GeneID']]//Other-source_anchor",
XML::xmlValue)
}
res1=interactions_from_gene(gene_search$ids)
res1=toupper(res1)
res1=list(res1)
names(res1)=output
res<- c(res,res1)
}
save(res, file = "Interactions.RData")
Fisher's Exact Test
Produce a geneset (lists) for hypergeometric test p-value
(load("Interactions.RData"))
head(res)
# Match and filter geneset by background genelist list size range (number of genes) from 150 - 2000
# 也可以跳过这行,不做筛选
res1 <- GeneSetsFilterByBackground(res,bglist$GeneSym,100,2000)
names(res1)
length(names(res1))
##########################################################
################# hypergeometric test ####################
##########################################################
# hyper and hyperTest function to caculate enrichment score and its pvalue by using hypergeometric test for each given genelist
# usage ## input 1. the list of GeneSet 2. length of background genes
# usage HyperGeoTest <- hyperTest(GeneSet, length(background.genes))
hyper <- function(X,Y,N,alpha=1){
K<-length(X)
M<-length(Y)
if(K==0 || M ==0){
return(1)
}
both<-length(intersect(X,Y));
XOnly<-length(setdiff(X,Y));
YOnly<-length(setdiff(Y,X));
Fisher.table<-N-both-XOnly-YOnly;
tab<-matrix(c(Fisher.table, YOnly, XOnly, both),2,2);
dimnames(tab)<-list(0:1,0:1);
f<-fisher.test(tab,alternative="greater");
pVal<-f$p.value;
estimate<-f$estimate;
return(c(pVal,estimate));
}
## s A list of gene vectors.
## N as a number to provide by using length (total background genes).
hyperTest<-function(s, N, alpha=1){
n<-length(s);
pVal<-matrix(0,n,n);
estimate<-matrix(0,n,n);
for(i in 1:n){
print(paste("Processing", i));
for(j in 1:n){
h<-hyper(s[[i]],s[[j]],N,alpha);
pVal[i,j]<-h[1];
estimate[i,j]<-h[2];
}
}
dimnames(pVal)<-list(names(s),names(s));
dimnames(estimate)<-list(names(s),names(s));
return(list(pVal=pVal,estimate=estimate));
}
# hypergeometric test p-value
HyperGeoTest <- hyperTest(res1, length(bglist$GeneSym))
rm(res1)
# Extract estimate value
est <- HyperGeoTest$estimate
# Remove self comparing results
est[est==Inf] = NA
head(est)
# Extract pvalue
p <- HyperGeoTest$pVal
借用corrplot做clustring,为后面画图做准备
# set color
cols=rev(colorRampPalette(brewer.pal(9,"RdBu"))(199))
#cols=brewer.pal(n=9, name="RdBu")
p1 <- corrplot(as.matrix(log2(est+1)),
type = "upper", #只显示上三角
method="color",
order="hclust",
hclust.method="ward.D2",
col=cols,
tl.col ="black", #文字颜色
tl.cex = 0.5, #文字大小
tl.srt =45, #Text label color and rotation
is.corr = FALSE, #非相关系数矩阵。对于一般的矩阵,必须使用is.corr = FALSE
diag = F, #不展示相关系数
p.mat = as.matrix(p), #p值的矩阵
sig.level = c(1e-25,1e-35,1e-50), #显著水平
insig = c("label_sig"), #显著水平以***、**、*表示
pch.cex = 0.5, #标注的星号*的大小
font = 3) #斜体
