画出最佳分组的生存曲线

生信喵 发表于 2024-4-16 19:23:49

![](https://files.mdnice.com/user/47194/8d13c60d-023e-44fb-8d96-26da2fcf53c3.gif)

# 背景

做生存分析，Best separation和Median separation，后者很简单，很想学前者，这次带来的是最佳分组的曲线代码。

# 一、使用场景

在展示基因表达水平（连续变量）对生存期的影响时找到最佳分组

# 二、准备文件

包含基因表达水平、生存时间、追踪情况等三列的文件，测试用文件为20230904.txt

```R
rm(list = ls())####魔幻操作，一键清空~
getwd()
setwd('D:/rtmp/bestsubgroup/')
#install.packages(c("survival","survminer","ggplot2"))
library(survival)
library("survminer")
library(ggplot2)

#读入测试文件dsg1.txt
svdata<-read.table("20230904.txt",header=T,as.is=T)
head(svdata)
```

```R
expression futime fustat
1 6.6980 1100    0
2 6.5767 1835    0
3 8.2342 2393    1
4 6.6820 1212    0
5 6.6826 987    1
6 9.2435 1939    1
```

```R
#对行按照基因表达水平排序，默认从低到高
sortsv<-svdata
```

# 三、输出

中位值分组的生存曲线、最佳分组生存曲线、遍历所有分组情况下的P值和Hazard Ratio的分布情况

## 3.1 Median separation

```R
#先根据表达水平的中位值分组，画生存曲线，保存
ssdf<-cbind(sortsv,data.frame(gp=ifelse(sortsv$expression>median(sortsv$expression),"high","low")))
fit<-survfit(Surv(futime,fustat)~gp,data=ssdf)
pdf(file="mSeparation.pdf")
sc_median<-ggsurvplot(fit, linetype = "strata", conf.int = F, pval = TRUE,palette = c("#D95F02","#1B9E77"),legend.title="",legend=c(0.7,0.9),legend.labs=c("High-expression","low-expression"))
sc_median
dev.off()
```

![](https://files.mdnice.com/user/47194/bceecc05-44c2-42e1-ac18-c8a5cf22d5fb.png)
遍历所有分组情况，计算P值和Hazard Ratio，p值用于判断分组之间差异是否显著，而Hazard Ratio用于衡量分组之间的差异程度

```R
pvals<-c()
hrs<-c()
for(n in 1:(nrow(sortsv)-1)){
ssdf<-cbind(sortsv,data.frame(gp=rep(c("low","high"),c(n,nrow(sortsv)-n))))
diff<-survdiff(Surv(futime,fustat)~gp,data=ssdf,rho = 0)
pv<-pchisq(diff$chisq,length(diff$n)-1,lower.tail=FALSE)
pvals<-c(pvals,pv)
hr<-diff$obs*diff$exp/(diff$obs*diff$exp)
hrs<-c(hrs,hr)
}
```

展示所有分组情况下的P值和Hazard Ratio的分布情况，水平虚线标记位置的P值为0.05，两条竖直虚线标记的HR为0.5和2

```R
fd<-data.frame(Tag=1:(nrow(sortsv)-1),HR=hrs,Pvalue=pvals)
head(fd)
```

```
Tag    HR Pvalue
1 1 0.2912112 0.19105049
2 2 0.6424631 0.65680193
3 3 0.3579073 0.13259234
4 4 0.3352756 0.04992592
5 5 0.4226479 0.12917044
6 6 0.3314383 0.02266261
```

```R
ggplot(fd,aes(x=log2(HR),y=-log10(Pvalue)))+geom_point(shape=21,aes(fill=Tag))+scale_fill_gradientn(colours=c("#4477AA","#77AADD","#FFFFFF","#EE9988","#BB4444"),guide="legend")+geom_hline(yintercept=(-log10(0.05)),linetype="dashed")+geom_vline(xintercept=c(-1,1),linetype="dashed")+annotate("text",y=-log10(pvals),x=log2(hrs),label="min-Pvalue")
```

![](https://files.mdnice.com/user/47194/a9bf40f2-2864-4af5-8a1d-baacc33b182d.png)

```
ggsave(file="Pvalue_Hazard-Ratio.pdf")
```

## 3.2 Best separation

虚线左上角区域的点p值最小，是最佳的分组方式，分组情况如下

```R
ggplot(fd,aes(x=log2(HR),y=-log10(Pvalue)))+geom_point(shape=21,aes(fill=Tag))+scale_fill_gradientn(colours=c("#4477AA","#77AADD","#FFFFFF","#EE9988","#BB4444"),guide="legend")+geom_hline(yintercept=(-log10(0.05)),linetype="dashed")+geom_vline(xintercept=c(-1,1),linetype="dashed")+geom_text(aes(label=paste(Tag,nrow(sortsv)-Tag,sep=":")),vjust=-1)
```

![](https://files.mdnice.com/user/47194/9060851b-360a-4ee7-bb0d-ea1e337dc2a2.png)

画出最佳分组的生存曲线

```R
ssdf<-cbind(sortsv,data.frame(gp=rep(c("low","high"),c(which.min(pvals),nrow(sortsv)-which.min(pvals)))))
fit<-survfit(Surv(futime,fustat)~gp,data=ssdf)
pdf(file="bestSeparation.pdf")
sc_minp<-ggsurvplot(fit, linetype = "strata", conf.int = F, pval = TRUE,palette = c("#D95F02","#1B9E77"),legend.title="",legend=c(0.7,0.9),legend.labs=c("High-expression","low-expression"))
sc_minp
dev.off()
```

![](https://files.mdnice.com/user/47194/62011f40-92d3-4ba8-bd6f-d94c981c24b4.png)

# 四、思考

尽管最佳分组在绘制生存曲线时优化了P值，但是我们还是需要综合多个方面考虑使用的必要性。
比如这次的案例270人，本来中位值分布后两组均是135人，而采用最佳分组后则是high 264人和low 6人。这样的分组结果很难说服别人。

页: [1]

生信人's Archiver

画出最佳分组的生存曲线