背景
画出文章PMID: 24239208中的lasso回归图,给出图中线所对应的基因顺序。
将纳入signature的变量拟合成一个变量,作为nomogram的输入。

应用场景
筛选差异基因之后,进一步筛选有预后意义的基因,组成基因表达signature。
LASSO对于高维度、强相关、小样本的生存资料数据较为适用。
输入数据
输入文件有两个,一个是生存信息 easy_input_suv.csv
,一个是基因表达矩阵 easy_input_exp.csv
。
easy_input_suv.csv
,生存信息,至少包含三列,分别为patient IDs、follow up time、life status
easy_input_exp.csv
,基因表达矩阵,每行一个基因,每列一个patient,跟 suv.csv
里的 patient ID
一致。
myexpr <- read.csv("easy_input_exp.csv",header = T,row.names = 1)
myexpr[1:3,1:4]
mysurv <- read.csv("easy_input_suv.csv",header = T,row.names = 1)
head(mysurv)
#检查样品信息跟表达量矩阵的样品是否一致
if (all(colnames(myexpr) %in% rownames(mysurv))){
warning("两个文件的patient ID是一致的")
} else{
warning("两个文件的patient ID不一致")
}
myexpr前几行如下:

mysurv前几行如下:

开始画图
采用R包 glmnet
,它是目前最好用的拟合广义线性模型的R包,由 LASSO
回归的发明人,斯坦福统计学家 Trevor Hastie 领衔开发。
#install.packages("glmnet")
#install.packages("survival")
library("glmnet")
library("survival")
算出lambda值
cvfit = cv.glmnet(t(myexpr), Surv(mysurv$months,mysurv$status),
#10倍交叉验证,非必须限定条件,这篇文献有,其他文献大多没提
#nfold=10,
family = "cox"
)
plot(cvfit)
#两个lambda值均可采用,具体lambda选值要根据自己实验设计而定。
#此处使用`lambda min`
cvfit$lambda.min #最佳lambda值
cvfit$lambda.1se #一倍SE内的更简洁的模型
fit <- glmnet(t(myexpr), Surv(mysurv$months,mysurv$status),
family = "cox")
#用包自带的函数画图
plot(fit, label = TRUE)
cvfit:

fit:

修改画图函数
原函数打印序号,需要修改画图函数,改为打印基因名
如果基因太多,打印基因名也看不清。在右侧列出图例,这样看起来更清晰。
#自定义颜色
mycol <- rep(c("#223D6C","#D20A13","#FFD121","#088247","#11AA4D","#58CDD9","#7A142C","#5D90BA","#431A3D","#91612D","#6E568C","#E0367A","#D8D155","#64495D","#7CC767"),2)
#设置x轴最大值
xmax <- 3.6
# library(glmnet)
plotCoef_plus <- function (beta, norm, lambda, df, dev, label = FALSE, legend = FALSE, xvar = c("norm",
"lambda", "dev"), xlab = iname, ylab = "Coefficients", ...)
{
which = glmnet:::nonzeroCoef(beta) # 内部函数,三个:调用
nwhich = length(which)
switch(nwhich + 1, `0` = {
warning("No plot produced since all coefficients zero")
return()
}, `1` = warning("1 or less nonzero coefficients; glmnet plot is not meaningful"))
beta = as.matrix(beta[which, , drop = FALSE])
xvar = match.arg(xvar)
switch(xvar, norm = {
index = if (missing(norm)) apply(abs(beta), 2, sum) else norm
iname = "L1 Norm"
approx.f = 1
}, lambda = {
index = log(lambda)
iname = "Log Lambda"
approx.f = 0
}, dev = {
index = dev
iname = "Fraction Deviance Explained"
approx.f = 1
})
dotlist = list(...)
type = dotlist$type
if (legend){
#在右侧留出画图例的地方
par(xpd = T, mar = par()$mar + c(0,0,0,6))
}
#修改bty,换个更好看的边框,还可以改成,o / n / 7 / l / c / u / ]
if (is.null(type))
matplot(index, t(beta), lty = 1, lwd = 2,
xlab = xlab, ylab = ylab,
xlim = c(0, xmax), #设置x轴最大值
col = mycol,#线的颜色
type = "l", cex.lab=1.2, cex.axis=1,
bty="n", ...)#不画右边框
else matplot(index, t(beta), lty = 1, lwd = 2,
xlab = xlab, ylab = ylab,
xlim = c(0, xmax),
col = mycol,
type = "l", cex.lab=1.2, cex.axis=1,
bty="n", ...)
atdf = pretty(index)
prettydf = approx(x = index, y = df, xout = atdf, rule = 2,
method = "constant", f = approx.f)$y
axis(3, at = atdf, labels = prettydf, tcl = NA)
if (label) {
nnz = length(which)
xpos = max(index)
pos = 4
if (xvar == "lambda") {
xpos = min(index)
pos = 2
}
xpos = rep(xpos, nnz)
ypos = beta[, ncol(beta)]
#原函数打印序号,修改为打印基因名
text(xpos, ypos, paste(rownames(myexpr)[which]),
cex = 0.8, #基因名字体大小
#基因名的颜色跟线一样
col = mycol,
#如果你不想要彩色的字,就用下面这行
#col = "black",
pos = pos)
}
if (legend) {
#画图例
legend("topright",
inset=c(-0.12,0),#图例画到图外面
legend = rownames(myexpr), #图例文字
col = mycol, #图例线的颜色,与文字对应
lwd = 3, #图例中线的粗细
cex = 1, #图例字体大小
bty = "n") #不显示图例边框
}
par(xpd=FALSE)
}
plot.glmnet_plus <- function (x, xvar = c("norm", "lambda", "dev"), label = FALSE, legend = FALSE,
...)
{
xvar = match.arg(xvar)
plotCoef_plus(x$beta, lambda = x$lambda, df = x$df, dev = x$dev.ratio,
label = label, legend = legend, xvar = xvar, ...)
}
用修改后的函数画图
plot.glmnet_plus(fit, label = TRUE, #打印基因名
legend = FALSE) #不显示图例
#在图上画虚线
#你想用哪个cutoff,就在“v = ”写上相应的数字
#此处以lambda.min作为cutoff
abline(v = cvfit$lambda.min, lty = 3, #线的类型,可以改成0, 1, 2, 3, 4, 5, 6
lwd = 2, #线的粗细
col = "black") #线的颜色
保存到pdf文件
pdf("lasso.pdf",width = 10,height = 8)
plot.glmnet_plus(fit, label = FALSE, #不打印基因名
legend = TRUE) #显示图例
abline(v = cvfit$lambda.min, lty=3,
lwd = 2,
col = "black")
dev.off()

输出选中的基因
输出选中的基因及其coefficient,这个coefficient跟lasso图中的纵坐标是一致的
#在参数“s =”后面写cutoff
#此处选用lambda.min
coef.min = coef(cvfit, s = "lambda.min")
coef.min
#提取选中的基因名
active.min = which(coef.min != 0)
geneids <- rownames(myexpr)[active.min]
geneids
#提取选中的基因对应的coefficient
index.min = coef.min[active.min]
index.min
#输出到文件
combine <- cbind(geneids, index.min)
write.csv(combine,"gene_index.csv")
combine前几行如下:

输出用于nomogram作图的文件
将纳入signature的变量拟合成一个变量,作为nomogram的输入
signature <- as.matrix(t(myexpr[geneids,])) %*% as.matrix(index.min)
summary(signature)
colnames(signature)[1] <- "lasso"
row.names = row.names(myexpr)
write.table(signature,"lasso_output.txt",row.names = T, quote = F)
signature前几行如下:
