bioinfoer

找回密码
立即注册
搜索
热搜: 活动 交友 discuz
发新帖

0

收听

12

听众

403

主题
发表于 昨天 20:03 | 查看: 8| 回复: 2

背景

Enrichment for Molecular Concept Map, to identify significant association within lists of TF genes。

通过转录因子靶基因的集合,用Fisher's exact test分析转录因子间的coupling(耦合、共调控),在形式上模仿paper里的这种图来展示结果:

出自PMID: 27288520文章

应用场景

分析任意集合之间的关系,例如转录因子之间(示例数据)、差异表达基因跟通路之间、染色质开放跟基因转录之间。

场景一:找转录因子之间的耦合关系(共同调控靶基因),例如示例数据(Interactions.RData)是201个转录因子,以及每个转录因子对应的多个有相互作用的基因。

场景二:差异表达基因在通路里的富集,就把差异表达基因跟通路里的基因都放到Interactions.RData里。例如:

$up_regulated_genes
[1] "ZBTB38" "SRA1" "IPO13" "ALX4" "EP300" "CREBBP" "IPO13"

$MAPK_pathway
[1] "ESR2" "ZNF688" "BANP" "ISYNA1" "TBC1D7"

场景三:分析染色质开放程度跟基因转录调控之间的关系,就把ATAC-seq peak附近的基因跟差异表达基因都放到Interactions.RData里。例如:

$up_regulated_genes
[1] "ZBTB38" "SRA1" "IPO13" "ALX4" "EP300" "CREBBP" "IPO13"

$ATAC-seq_peak
[1] "ESR2" "ZNF688" "BANP" "ISYNA1" "TBC1D7"

环境设置

使用国内镜像安装包

options("repos"= c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
install.packages("rentrez")

加载包

library(rentrez)
library(RColorBrewer)
library(corrplot)
library(ggplot2)
library(ggthemes)
source("my_function.R")

Sys.setenv(LANGUAGE = "en") #显示英文报错信息
options(stringsAsFactors = FALSE) #禁止chr转成factor

输入文件

如果你有自己的相互作用的基因list,保存在Interactions.RData里,就可以跳过这步,直接进入“Fisher's Exact Test”。

easy_input_TF.csv,此处用201个转录因子的gene symbol,可以替换成你感兴趣的基因的gene symbol,例如差异表达基因、差异表达lncRNA等。

easy_input_bg.csv,背景gene sybol,可以替换成与你感兴趣的基因相对应的背景基因,例如基因组上的全部基因、全部lncRNA等。

TFlist <- read.csv("easy_input_TF.csv", header = T)
head(TFlist)
dim(TFlist)

bglist <- read.table("easy_input_bg.csv", header = T)
head(bglist)
dim(bglist)

从NCBI获取基因interaction

用rentrez包获取NCBI gene数据库里的Interactions,即基因与它有相互作用的基因的list,保存到Interactions.RData。The general interactions in this section are provided, without review by Gene staff, by the external sources listed in https://ftp.ncbi.nlm.nih.gov/gene/GeneRIF/interaction_sources。详情看这里:https://www.ncbi.nlm.nih.gov/books/NBK3841/#EntrezGene.Interactions

#Download TF related interactions genes from NCBI Gene
#来源:https://github.com/ropensci/rentrez/wiki/Find-genes-known-to-interact-with-a-given-gene
res <- c()
for (i in 1:nrow(TFlist)) {
 output <- print(as.character(TFlist$GeneSym[i]),quote=FALSE)
 gene_search <- entrez_search(db="gene",term=paste0("(",output,"[GENE]) AND (Homo sapiens[ORGN])")) 
 # if you just want the interacting genes you can use this function, it's huge Xpath query
 interactions_from_gene <- function(gene_id){
   xmlrec <- entrez_fetch(db="gene", id=gene_id, rettype="xml", parsed=TRUE)
   XML::xpathSApply(xmlrec,
                    "//Gene-commentary[Gene-commentary_heading[./text()='Interactions']]//Other-source[Other-source_src/Dbtag/Dbtag_db[./text()='GeneID']]//Other-source_anchor",
                    XML::xmlValue)
   }
 res1=interactions_from_gene(gene_search$ids)
 res1=toupper(res1)
 res1=list(res1)
 names(res1)=output
 res<- c(res,res1)
}

save(res, file = "Interactions.RData")

Fisher's Exact Test

Produce a geneset (lists) for hypergeometric test p-value

(load("Interactions.RData"))
head(res)

# Match and filter geneset by background genelist list size range (number of genes) from 150 - 2000 
# 也可以跳过这行,不做筛选
res1 <- GeneSetsFilterByBackground(res,bglist$GeneSym,100,2000)
names(res1)
length(names(res1))

##########################################################
################# hypergeometric test ####################
##########################################################
# hyper and hyperTest function to caculate enrichment score and its pvalue by using hypergeometric test for each given genelist
# usage ## input 1. the list of GeneSet 2. length of background genes
# usage HyperGeoTest <- hyperTest(GeneSet, length(background.genes))

hyper <- function(X,Y,N,alpha=1){
  K<-length(X)
  M<-length(Y)
  
  if(K==0 || M ==0){
    return(1)
  }
  
  both<-length(intersect(X,Y));
  
  XOnly<-length(setdiff(X,Y));
  YOnly<-length(setdiff(Y,X));
  Fisher.table<-N-both-XOnly-YOnly;
  tab<-matrix(c(Fisher.table,  YOnly, XOnly, both),2,2);
  dimnames(tab)<-list(0:1,0:1);
  f<-fisher.test(tab,alternative="greater");
  pVal<-f$p.value;
  estimate<-f$estimate;
  return(c(pVal,estimate));
}

## s A list of gene vectors.
## N as a number to provide by using length (total background genes).
hyperTest<-function(s, N, alpha=1){
  n<-length(s);
  pVal<-matrix(0,n,n);
  estimate<-matrix(0,n,n);
  for(i in 1:n){
    print(paste("Processing", i));
    for(j in 1:n){
      h<-hyper(s[[i]],s[[j]],N,alpha);
      pVal[i,j]<-h[1];
      estimate[i,j]<-h[2];
    }
  }
  dimnames(pVal)<-list(names(s),names(s));
  dimnames(estimate)<-list(names(s),names(s));
  return(list(pVal=pVal,estimate=estimate));
}

# hypergeometric test p-value
HyperGeoTest <- hyperTest(res1, length(bglist$GeneSym))
rm(res1)

# Extract estimate value
est <- HyperGeoTest$estimate
# Remove self comparing results
est[est==Inf] = NA
head(est)

# Extract pvalue
p <- HyperGeoTest$pVal

借用corrplot做clustring,为后面画图做准备

# set color 
cols=rev(colorRampPalette(brewer.pal(9,"RdBu"))(199))
#cols=brewer.pal(n=9, name="RdBu")

p1 <- corrplot(as.matrix(log2(est+1)), 
         type = "upper", #只显示上三角
         method="color",
         order="hclust",
         hclust.method="ward.D2",
         col=cols,
         tl.col ="black", #文字颜色
         tl.cex = 0.5, #文字大小
         tl.srt =45, #Text label color and rotation
         is.corr = FALSE, #非相关系数矩阵。对于一般的矩阵,必须使用is.corr = FALSE
         diag = F,  #不展示相关系数
         p.mat = as.matrix(p), #p值的矩阵
         sig.level = c(1e-25,1e-35,1e-50), #显著水平
         insig = c("label_sig"), #显著水平以***、**、*表示
         pch.cex = 0.5, #标注的星号*的大小
         font = 3) #斜体

收藏回复 显示全部楼层 道具 举报

发表于 昨天 20:03

开始画图

作者提供了三种画法展示以上结果,这里展示画法一,画法二三见压缩包里的2_3_ggplot2.R文件

准备工作

前面用corrplot做了聚类,提取顺序;只需要画一半,也就是三角形,需要把另一半变成NA。

# Remove self comparing p vaule
# p[p == 0 ] =NA
p[p < 1.0e-150]=NA #or p[p == 0 ] = NA
p[lower.tri(p)] <- NA

# cluster ordering info by corrplot function
o <- rownames(p1$corr)

# p value reordered by ward.D2 hclust
p <- reorder(p, o)

# get -log10 fdr value
q <- -log10(p.adjust(p, method="fdr"))
# get -log10 p value
p <- -log10(p)
head(p)

# estimate value reordered by ward.D2 hclust
o <- rownames(p1$corr)
r <- reorder(est,o)
r[lower.tri(r)] <- NA

画图

pdf("Enrichment_baseplot.pdf", 11, 10)
par(bty="n", 
    mar=c(4,4,4,8)+.1, #四周留空
    las=2,# the style of axis labels
    tcl=-.33) #The length of tick marks as a fraction of the height of a line of text

m <- nrow(est)
n <- ncol(est)

# we should check range(est[!est==Inf]) for col and breaks setting
# we should check data distribution hist(est[!est==Inf]) for col and breaks setting

max_value<- range(r[!is.na(r)])[2]
brks<- c(0,seq(1,20,l=10),seq(21,max_value,l=4))
cols2<- rev(colorRampPalette(brewer.pal(9,"RdBu"))(length(brks)-1))

# baseplot for heatmap
image(x=1:n, y=1:m, r, col=cols2, breaks=brks, xaxt="n", yaxt="n", xlab="",ylab="", xlim=c(0, n+4), ylim=c(0, n+1))

# add gene name
mtext(side=2, at=1:n, o, font=3, col="black") #left
mtext(side=3, at=1:n, o, font=3, col="black") #top

# add white border
abline(h=0:n+.5, col="white", lwd=.5)
abline(v=0:n+.5, col="white", lwd=.5)

# add title
#text(x=n/2, y=m+1, "Enrichment based on Fisher's Exact Test (greater)", pos=3)

q_range<- range(q[!is.na(q)])[2]
# significant labels q>50
w = arrayInd(which(q > q_range/2), rep(m,2))
points(w, pch="*", col="black", cex=1)
# significant labels q>35
w = arrayInd(setdiff(which(q > q_range/3),which(q > q_range/2)),rep(m,2))
points(w, pch=3, col="black", cex=1) # "+""
# significant labels q>25
w = arrayInd(setdiff(which(q > q_range/4),which(q > q_range/3)), rep(m,2))
points(w, pch="-", col="black", cex=1)

# set up legend color bar
image(y = 1:16 +6, x=rep(n,2)+c(2.5,3)+1, z=matrix(c(1:16), nrow=1), col=cols2, add=TRUE)

# add legend color bar scale value
brks2 <- round(c(0,seq(1,20,l=10), seq(21,max_value,l=4)))

axis(side = 4, at = seq(1,15) + 6.5,  tcl=-.15, label=brks2, las=1, lwd=.5)

points(x=rep(n,3)+3.5, y=1:3, pch=c("-","+","*"))
text(x=n+2, y=15, "enrichment estimate value",pos=1,srt=90)
mtext(side=4, at=c(1,2,3,4), c("-log10(FDR) > 25","-log10(FDR) > 35","-log10(FDR) > 50","ns"), line=0.2)
dev.off()

回复 显示全部楼层 道具 举报

发表于 昨天 20:04

文中所用数据可以关注公众号“生信喵实验柴”

发送关键词“20250510”获取

回复 显示全部楼层 道具 举报

您需要登录后才可以回帖 登录 | 立即注册

QQ|Archiver|手机版|小黑屋|bioinfoer ( 萌ICP备20244422号 )

GMT+8, 2025-12-1 06:33 , Processed in 0.067472 second(s), 35 queries .

Powered by Discuz! X3.5

© 2001-2025 Discuz! Team.

快速回复 返回顶部 返回列表