bioinfoer

找回密码
立即注册
搜索
热搜: 活动 交友 discuz
发新帖

0

收听

12

听众

390

主题
发表于 2025-9-8 07:43:52 | 查看: 83| 回复: 2

背景

TCGA里的正常组织太少,把GTEx的正常组织加进去,用分裂小提琴图展示某个基因在TCGA肿瘤和正常组织中的表达差异。


出自文章PMID: 30344046, Supplemental material, Figure 3b

应用场景

适用于:

  1. 对比某一基因在TCGA肿瘤组织和正常组织(TCGA的normal和GTEx)基因表达量TPM值
  2. 用分裂小提琴图展示表达量数据

环境设置

需要system调用shell命令。

安装需要用到的包

#使用国内镜像安装包
options("repos"= c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks", version = "3.8")

install.packages("stringr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("RColorBrewer")

加载需要用到的包

library(stringr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)

Sys.setenv(LANGUAGE = "en") #显示英文报错信息
options(stringsAsFactors = FALSE) #禁止chr转成factor

输入文件下载

如果你的数据已经整理成easy_input.csv的格式,就可以跳过这步,直接进入“输入数据预处理”。

文件地址:

从UCSC xenahttps://xenabrowser.net/datapages/下载经过TOIL流程统一处理的TCGA和GTEx的TPM,用这套数据画出来的图跟xena的violin一致:https://xenabrowser.net/transcripts/:

  • TCGA Pan-Cancer (PANCAN) (41 datasets),点击gene expression RNAseq里的TOIL RSEM tpm (n=10,535) UCSC Toil RNAseq Recompute,下载TPM:https://toil.xenahubs.net/download/tcga_RSEM_gene_tpm.gz。点击phenotype里的sample type and primary disease (n=12,804) Pan-Cancer Atlas Hub,下载phenotype:https://pancanatlas.xenahubs.net/download/TCGA_phenotype_denseDataOnlyDownload.tsv.gz
  • GTEX (11 datasets),点击gene expression RNAseq里的TOIL RSEM tpm (n=7,862) UCSC Toil RNAseq Recompute,下载TPM:https://toil.xenahubs.net/download/gtex_RSEM_gene_tpm.gz。点击phenotype里的GTEX phenotype (n=9,783) UCSC Toil RNAseq Recompute,下载phenotype:https://toil.xenahubs.net/download/GTEX_phenotype.gz
  • ID和Gene symbol对应列表下载:https://toil.xenahubs.net/download/probeMap/gencode.v23.annotation.gene.probemap
  • TCGA跟GTEx的组织对应关系,参照GEPIA help的Differential analysis:http://gepia.cancer-pku.cn/help.html,整理成samplepair.txt文件

参考资料:The goal of the Toil recompute was to process ~20,000 RNA-seq samples to create a consistent meta-analysis of four datasets free of computational batch effects. https://xenabrowser.net/datapages/?hub=https://toil.xenahubs.net:443
文章是PMID: 28398314

题外话:

  • 在https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443)这个页面,phenotype还有基于TCGA数据加工过的iCluster、Molecular subtype等等你挖掘。
  • 在https://xenabrowser.net/datapages/目录下面,有TCGA的各种测序数据,下载链接有规律可循,可批量下载。

解压缩

*.gz文件是压缩包,解压缩到当前文件夹

检查一下:你的文件夹里应该有这些输入文件:

tcga_RSEM_gene_tpm

TCGA_phenotype_denseDataOnlyDownload.tsv

gtex_RSEM_gene_tpm

GTEX_phenotype.tsv

gencode.v23.annotation.gene.probemap

samplepair.txt

提取目的基因在各种肿瘤和正常组织里的TPM

为GTEx的sample标注组织

samplepair <- read.delim("samplepair.txt",as.is = T)
gtexpair <- samplepair[,c(1,2,5)]
gtexpair

#以TCGA的缩写为GTEx的组织命名
gtexpair$type <- paste0(gtexpair$TCGA,"_normal_GTEx")
gtexpair$type2 <-"normal"
gtextcga <- gtexpair[,c(1,3:5)] #筛掉Detail列
colnames(gtextcga)[1:2] <- c("tissue","X_primary_site")
head(gtextcga)

#为GTEx的sample标出组织
gtexcase <- read.delim(file="GTEX_phenotype.tsv",header=T,as.is = T)
colnames(gtexcase)[1] <- "sample"
gtexcase2tcga <- merge(gtextcga,gtexcase,by="X_primary_site")
gtextable <- gtexcase2tcga[,c(5,2:4)]
head(gtextable)

为TCGA的sample标注组织

tissue <- gtexpair$TCGA
names(tissue) <- gtexpair$Detail

tcgacase <- read.delim(file="TCGA_phenotype_denseDataOnlyDownload.tsv",header=T,as.is = T)

tcgacase$tissue <- tissue[tcgacase$X_primary_disease]
tcgacase$type <- ifelse(tcgacase$sample_type=='Solid Tissue Normal',paste(tcgacase$tissue,"normal_TCGA",sep="_"),paste(tcgacase$tissue,"tumor_TCGA",sep="_"))
tcgacase$type2 <- ifelse(tcgacase$sample_type=='Solid Tissue Normal',"normal","tumor")
tcgatable<-tcgacase[,c(1,5:7)]
head(tcgatable)

删除没有TPM的sample

system('head -1 gtex_RSEM_gene_tpm >headgtex') #前面下载的gtex_RSEM_gene_tpm文件,位于当前文件夹
system('head -1 tcga_RSEM_gene_tpm >headtcga') #前面下载的tcga_RSEM_gene_tpm文件,位于当前文件夹

headgtex <- read.delim("headgtex",as.is=T)
gtexsample <- str_replace_all(colnames(headgtex),"[.]","-")
gtextable2 <- gtextable[gtextable$sample %in% gtexsample,]

headtcga <- read.delim("headtcga",as.is=T)
tcgasample <- str_replace_all(colnames(headtcga),"[.]","-")
tcgatable2 <- tcgatable[tcgatable$sample %in% tcgasample,]

tcga_gtex<-rbind(tcgatable2,gtextable2)

收藏回复 显示全部楼层 道具 举报

发表于 2025-9-8 07:44:24

比较现有数据跟GEPIA收录的sample数量

statable<-data.frame()
for (tissuer in sort(unique(tcga_gtex$tissue))){
  
  tcga_gtex_tissuer<-tcga_gtex[tcga_gtex$tissue==tissuer,]

  tissuer_tumor_TCGA <- paste0(tissuer,"_tumor_TCGA")
  tissuer_normal_TCGA <- paste0(tissuer,"_normal_TCGA")
  tissuer_normal_GTEx <- paste0(tissuer,"_normal_GTEx")

  tissuerdf<-data.frame(TCGA=tissuer,
                     TCGA_new_tumor=sum(tcga_gtex_tissuer$type==tissuer_tumor_TCGA),
                     TCGA_new_normal=sum(tcga_gtex_tissuer$type==tissuer_normal_TCGA),
                     GTEx_new_num=sum(tcga_gtex_tissuer$type==tissuer_normal_GTEx),
                     stringsAsFactors=F)
  #str(tissuerdf)
  statable<-rbind(statable,tissuerdf)
}
cbind(statable, samplepair[,c(1,3:6)])

前4列是现有数据,后面是GEPIA数据

找出每个基因在TPM文件中的行号

idmap <- read.delim("gencode.v23.annotation.gene.probemap",as.is=T)
head(idmap)

system('cut -f 1 gtex_RSEM_gene_tpm >leftgtex') #需要大概1min
tpmid2row <- read.delim("leftgtex",as.is = T)
colnames(tpmid2row) <- "id"
tpmid2row$rownum <- seq(1,nrow(tpmid2row))
head(tpmid2row)

gene2id2row <- merge(idmap,tpmid2row,by="id")
gene2id2row <- gene2id2row[order(gene2id2row$rownum),]
head(gene2id2row)

提取目的基因在肿瘤和正常组织的TPM

此处以TP53基因为例,提取它的TPM

generownum <- gene2id2row[gene2id2row$gene=="TP53",]$rownum + 1

shellcmd<-paste0("cut -f 2-",ncol(headgtex)," gtex_RSEM_gene_tpm >gtex_tpm")
system(shellcmd) #wait for 3min
system('paste tcga_RSEM_gene_tpm gtex_tpm >tcga_gtex_tpm') #wait for 15min
system('sed \'1q;d\' tcga_gtex_tpm >tpm_header')

gettpmcmd <- paste0("sed \'",generownum,"q;d\' tcga_gtex_tpm >tpm_gene")
system(gettpmcmd)
system('cat tpm_header tpm_gene >genetpm')
tpm <- read.delim("genetpm",as.is = T)
colnames(tpm) <- str_replace_all(colnames(tpm),"[.]","-")
tpm[,1:6]
tpmvec <- as.vector(t(tpm[,2:ncol(tpm)]))
names(tpmvec) <- colnames(tpm)[2:ncol(tpm)]
tcga_gtex$tpm <- tpmvec[tcga_gtex$sample]

#按组织排序
tcga_gtex <- arrange(tcga_gtex,tissue)
write.csv(tcga_gtex[,c(2,4,5)],"easy_input.csv", quote = F)

输入数据预处理

easy_input.csv:第1列tissue是组织(分组),第2列type2是肿瘤/正常组织(左右分组),第3列tpm是TPM值。

有些组织没有normal,需要识别出来单独画。

tcga_gtex <- read.csv("easy_input.csv", row.names = 1, header = T, as.is = F)
head(tcga_gtex)

tumorlist <- unique(tcga_gtex[tcga_gtex$type2=="tumor",]$tissue)
normallist <- unique(tcga_gtex[tcga_gtex$type2=="normal",]$tissue)
withoutNormal <- setdiff(tumorlist, normallist)

MESO和UVM没有normal

tcga_gtex$type2 <- factor(tcga_gtex$type2,levels=c("tumor","normal"))
tcga_gtex_withNormal <- tcga_gtex[!(tcga_gtex$tissue %in% withoutNormal),]
tcga_gtex_MESO <- tcga_gtex[tcga_gtex$tissue=="MESO",]
tcga_gtex_UVM <- tcga_gtex[tcga_gtex$tissue=="UVM",]

开始画图

要用到SplitViolin的图层和函数,保存在GeomSplitViolin.R文件中,位于当前文件夹。

source("GeomSplitViolin.R") #位于当前文件夹

p <- ggplot(tcga_gtex_withNormal, aes(x = tissue, y = tpm, fill = type2)) + #x对应肿瘤的类型,y对应表达量,fill填充对应组织的类型
  geom_split_violin(draw_quantiles = c(0.25, 0.5, 0.75), #画4分位线
                    trim = F, #是否修剪小提琴图的密度曲线
                    linetype = "solid", #周围线的轮廓
                    color = "black", #周围线颜色
                    size = 0.2,
                    na.rm = T,
                    position ="identity")+ #周围线粗细
  ylab("TP53 expression (TPM)") + xlab("") +
  ylim(-4,9) +
  scale_fill_manual(values = c("#DF2020", "#DDDF21"))+
  theme_set(theme_set(theme_classic(base_size=20)))+
  theme(axis.text.x = element_text(angle = 45, hjust = .5, vjust = .5)) + #x轴label倾斜45度
  
  guides(fill = guide_legend(title = NULL)) + 
  theme(legend.background = element_blank(), #移除整体边框
        #图例的左下角置于绘图区域的左下角
        legend.position=c(0,0),legend.justification = c(0,0))

p + geom_split_violin(data = tcga_gtex_MESO,
                       mapping = aes(x = tissue, y = tpm, fill = type2),
                       draw_quantiles = c(0.25, 0.5, 0.75), #画4分位线
                    trim = F, #是否修剪小提琴图的密度曲线
                    linetype = "solid", #周围线的轮廓
                    color = "black", #周围线颜色
                    size = 0.2,
                    na.rm = T,
                    position ="identity") +
  geom_split_violin(data = tcga_gtex_UVM,
                       mapping = aes(x = tissue, y = tpm, fill = type2),
                       draw_quantiles = c(0.25, 0.5, 0.75), #画4分位线
                    trim = F, #是否修剪小提琴图的密度曲线
                    linetype = "solid", #周围线的轮廓
                    color = "black", #周围线颜色
                    size = 0.2,
                    na.rm = T,
                    position ="identity") +
  scale_x_discrete(limits = levels(tcga_gtex$tissue))

ggsave("TP53expression.pdf",width = 20, height = 6)

回复 显示全部楼层 道具 举报

发表于 2025-9-8 07:46:59

文中所用数据可以关注公众号“生信喵实验柴”

发送关键词“20250425”获取

回复 显示全部楼层 道具 举报

您需要登录后才可以回帖 登录 | 立即注册

QQ|Archiver|手机版|小黑屋|bioinfoer ( 萌ICP备20244422号 )

GMT+8, 2025-9-18 07:50 , Processed in 0.126912 second(s), 35 queries .

Powered by Discuz! X3.5

© 2001-2025 Discuz! Team.

快速回复 返回顶部 返回列表