0 收听	12 听众	390 主题

画出最佳分组的生存曲线

发表于 2024-4-16 19:23:49 | 查看: 1576| 回复: 0

背景

做生存分析，Best separation和Median separation，后者很简单，很想学前者，这次带来的是最佳分组的曲线代码。

一、使用场景

在展示基因表达水平（连续变量）对生存期的影响时找到最佳分组

二、准备文件

包含基因表达水平、生存时间、追踪情况等三列的文件，测试用文件为20230904.txt

rm(list = ls())  ####  魔幻操作，一键清空~
getwd()
setwd('D:/rtmp/bestsubgroup/')
#install.packages(c("survival","survminer","ggplot2"))
library(survival)
library("survminer")
library(ggplot2)

#读入测试文件dsg1.txt
svdata<-read.table("20230904.txt",header=T,as.is=T)
head(svdata)

  expression futime fustat
1     6.6980   1100      0
2     6.5767   1835      0
3     8.2342   2393      1
4     6.6820   1212      0
5     6.6826    987      1
6     9.2435   1939      1

#对行按照基因表达水平排序，默认从低到高
sortsv<-svdata[order(svdata$expression),]

三、输出

中位值分组的生存曲线、最佳分组生存曲线、遍历所有分组情况下的P值和Hazard Ratio的分布情况

3.1 Median separation

#先根据表达水平的中位值分组，画生存曲线，保存
ssdf<-cbind(sortsv,data.frame(gp=ifelse(sortsv$expression>median(sortsv$expression),"high","low")))
fit<-survfit(Surv(futime,fustat)~gp,data=ssdf)
pdf(file="mSeparation.pdf")
sc_median<-ggsurvplot(fit, linetype = "strata", conf.int = F, pval = TRUE,palette = c("#D95F02","#1B9E77"),legend.title="",legend=c(0.7,0.9),legend.labs=c("High-expression","low-expression"))
sc_median
dev.off()

遍历所有分组情况，计算P值和Hazard Ratio，p值用于判断分组之间差异是否显著，而Hazard Ratio用于衡量分组之间的差异程度

pvals<-c()
hrs<-c()
for(n in 1:(nrow(sortsv)-1)){
  ssdf<-cbind(sortsv,data.frame(gp=rep(c("low","high"),c(n,nrow(sortsv)-n))))
  diff<-survdiff(Surv(futime,fustat)~gp,data=ssdf,rho = 0)
  pv<-pchisq(diff$chisq,length(diff$n)-1,lower.tail=FALSE)
  pvals<-c(pvals,pv)
  hr<-diff$obs[1]*diff$exp[2]/(diff$obs[2]*diff$exp[1])
  hrs<-c(hrs,hr)
}

展示所有分组情况下的P值和Hazard Ratio的分布情况，水平虚线标记位置的P值为0.05，两条竖直虚线标记的HR为0.5和2

fd<-data.frame(Tag=1:(nrow(sortsv)-1),HR=hrs,Pvalue=pvals)
head(fd)

  Tag        HR     Pvalue
1   1 0.2912112 0.19105049
2   2 0.6424631 0.65680193
3   3 0.3579073 0.13259234
4   4 0.3352756 0.04992592
5   5 0.4226479 0.12917044
6   6 0.3314383 0.02266261

ggplot(fd,aes(x=log2(HR),y=-log10(Pvalue)))+geom_point(shape=21,aes(fill=Tag))+scale_fill_gradientn(colours=c("#4477AA","#77AADD","#FFFFFF","#EE9988","#BB4444"),guide="legend")+geom_hline(yintercept=(-log10(0.05)),linetype="dashed")+geom_vline(xintercept=c(-1,1),linetype="dashed")+annotate("text",y=-log10(pvals[which.min(pvals)]),x=log2(hrs[which.min(pvals)]),label="min-Pvalue")

ggsave(file="Pvalue_Hazard-Ratio.pdf")

3.2 Best separation

虚线左上角区域的点p值最小，是最佳的分组方式，分组情况如下

ggplot(fd,aes(x=log2(HR),y=-log10(Pvalue)))+geom_point(shape=21,aes(fill=Tag))+scale_fill_gradientn(colours=c("#4477AA","#77AADD","#FFFFFF","#EE9988","#BB4444"),guide="legend")+geom_hline(yintercept=(-log10(0.05)),linetype="dashed")+geom_vline(xintercept=c(-1,1),linetype="dashed")+geom_text(aes(label=paste(Tag,nrow(sortsv)-Tag,sep=":")),vjust=-1)

画出最佳分组的生存曲线

ssdf<-cbind(sortsv,data.frame(gp=rep(c("low","high"),c(which.min(pvals),nrow(sortsv)-which.min(pvals)))))
fit<-survfit(Surv(futime,fustat)~gp,data=ssdf)
pdf(file="bestSeparation.pdf")
sc_minp<-ggsurvplot(fit, linetype = "strata", conf.int = F, pval = TRUE,palette = c("#D95F02","#1B9E77"),legend.title="",legend=c(0.7,0.9),legend.labs=c("High-expression","low-expression"))
sc_minp
dev.off()

四、思考

尽管最佳分组在绘制生存曲线时优化了P值，但是我们还是需要综合多个方面考虑使用的必要性。比如这次的案例270人，本来中位值分布后两组均是135人，而采用最佳分组后则是high 264人和low 6人。这样的分组结果很难说服别人。

收藏0 回复显示全部楼层道具举报

返回列表

		自动登录	找回密码
密码			立即注册