背景
用R画出文章PMID: 22363530里的双因子生存分析图,计算p value。

应用场景
对比两个因素对生存的影响。
例如,基因A和B都高表达,跟基因A和B都低表达比起来,生存期是否更长?
输入文件
至少包含5列:依次是sample名字、Time、Death、第一个基因的表达量(基因名作为列名)、第二个基因的表达量(基因名作为列名)
顺序不要变
此处提取R语言多基因最佳分组生存曲线的基因AKAP12和MAP2K6作为输入文件,以基因AKAP12和MAP2K6作为双因素。
df<-read.csv("easy_input.csv",row.names = 1)
head(df)

基因表达量高低的划分
这里提供两种方法划分基因表达水平的low和high:
- 方法1. 用R语言多基因最佳分组生存曲线找出最佳分组,获得划分low和high的分界表达量;基因AKAP12和MAP2K6的low跟high的分界表达量分别是766.37和880.23。
df$bestA <- cut(df[,3],breaks=c(-Inf,766.37, Inf),#其中766.37要改成你的基因A的高低分界值
labels=c("0","1"))
df$bestB <- cut(df[,4],breaks=c(-Inf,880.23, Inf),#其中880.23要改成你的基因B的高低分界值
labels=c("0","1"))
df$bestGroup <- paste0(df$bestA,df$bestB)
#按照两个基因表达量高低的排列组合分成4组
df$bestGroup <- ifelse(df$bestGroup=='10','1',ifelse(df$bestGroup=='00','2',ifelse(df$bestGroup=="01","3","4")))
head(df)

df$medA <- cut(df[,3],breaks=c(-Inf,median(df[,3]), Inf),
labels=c("0","1"))
df$medB <- cut(df[,4],breaks=c(-Inf,median(df[,4]), Inf),
labels=c("0","1"))
df$medGroup <- paste0(df$medA,df$medB)
#按照两个基因表达量高低的排列组合分成4组
df$medGroup <- ifelse(df$medGroup=='10','1',ifelse(df$medGroup=='00','2',ifelse(df$medGroup=="01","3","4")))
head(df)
#如果用方法1,就运行下面这行
df$group<-df$bestGroup
#如果用方法2,就运行下面这行
#df$group<-df$medGroup
统计运算
library(survival)
library(survminer)
#定义结局变量,生存时间和结局
y <- Surv(df$futime, df$fustat==1)
#logrank检验两两比较的结果
comp <- pairwise_survdiff(Surv(futime,fustat) ~ group, data = df)
pvalue<-as.vector(unlist(comp$p.value))
#将p value和两两比较的组别生成两列的数据集
name <- array(dim=c(3,3))
for(i in 1:3){
for(j in 2:4){
name[i,j-1] = print(paste(i,j,sep = " vs "))
}
}
pvalue_name <- as.vector(t(name))
logrank <- data.frame(pvalue_name,pvalue)
logrank
#挑选p value小于0.05的记录
logrank_sig <- subset(logrank, pvalue<0.05)
#如果p值太小,就写“<0.0001”,否则保留小数点后4位
logrank_sig$pvalue <- lapply(logrank_sig$pvalue,function(i)
ifelse (i<0.0001,"<0.0001",round(i, 4)))
logrank_sig
#进行COX回归,导出每组对应的HR值,并将1赋值给对照组
coxph.fit <- coxph(y ~ as.factor(group), data = df)
hr <- round(coef(summary(coxph.fit))[,2],3)
HR <- c(1,as.vector(unlist(hr)))
开始画图
右上角的图例有时会跟曲线重叠,因此,我们把图例画到图右侧。
#KM曲线的绘制
kmfit <- survfit(y~df$group,)
#写legend
A = c("Low; ", "High;", "Low; ", "High;")
B = c("Low; ", "Low; ", "High;", "High;")
Group = c(1,2,3,4)
text.legend1 <- paste0(colnames(df)[3], " = ", A, colnames(df)[4], " = ", B, " Group = ", Group, ", HR = ", HR)
text.legend2 <- paste0(logrank_sig$pvalue_name," : ",logrank_sig$pvalue)
#自定义足够你用的颜色
mycol <- c("#223D6C","#D20A13","#FFD121","#088247","#11AA4D","#58CDD9","#7A142C","#5D90BA","#431A3D","#91612D","#6E568C","#E0367A","#D8D155","#64495D","#7CC767")
#画图,并保存到pdf文件
pdf("nSurv.pdf",width = 7,height = 6)
par(xpd = T,
mar = par()$mar + c(0,0,0,5)#,#在右侧给图例留出空间
#cex.axis=0.8, cex.lab=1, cex.main=1, cex.sub=1 #修改字体大小
)
plot(kmfit,
col= mycol,
lwd = 1.4,#线的粗细
#pch = "o", #如果你也想让观测点的形状是o,就运行这行
xlab="Months from diagnosis",
ylab="Survival Distribution Function",
mark.time=TRUE)
legend("bottomleft", lty=c("solid","solid","solid","solid"),
col=mycol,
legend=text.legend1,
bty="n",
lwd = 2,
cex=0.8)
legend("topright",
inset=c(-0.3,0), #图例画到图外面
legend=c("Pairwise comparison",text.legend2),
bty="n",
cex=0.8)
dev.off()

对比两种方法的效果
用最佳分组的分界表达量划分low与high

用中位数划分low与high
