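Downloading reference sequences
Background: RNA-seq analysis mainly requires a reference sequence and a GTF annotation file. The reference sequence can be downloaded from NCBI, ENSEMBL, UCSC and similar sites; GTF files are available from ENSEMBL and UCSC.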
I. Downloading from ENSEMBL
ENSEMBL: https://asia.ensembl.org/index.html
wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/release-107/gtf/homo_sapiens/Homo_sapiens.GRCh38.107.gtf.gz
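The gffread, seqkit and grep commands later in this post refer to the files without the .gz suffix, so the downloads have to be decompressed first. Below is a minimal sketch of one way to do that while keeping the original archive. The helper name decompress_keep is made up for illustration, and the demonstration runs on a small throwaway file rather than on the real genome.

```shell
set -eu

# decompress_keep is a hypothetical helper name, not part of any tool.
# It writes the decompressed copy next to the archive and keeps the .gz file.
decompress_keep() {
    gzip -dc "$1" > "${1%.gz}"
}

# Toy demonstration on a throwaway file. For the real data the calls would be:
#   decompress_keep Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
#   decompress_keep Homo_sapiens.GRCh38.107.gtf.gz
demo_dir="$(mktemp -d)"
printf '>21 toy\nACGT\n' > "$demo_dir/toy.fa"
gzip "$demo_dir/toy.fa"                 # leaves only toy.fa.gz
decompress_keep "$demo_dir/toy.fa.gz"   # recreates toy.fa, archive is kept
ls "$demo_dir"
```

Plain gunzip would also work if the compressed copy is not needed afterwards.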
II. Downloading from UCSC
#UCSC hg19
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
#UCSC hg38
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
#GTF download: export the annotation through the UCSC Table Browser
#http://www.genome.ucsc.edu/cgi-bin/hgTables
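III. Converting GFF3 to GTF
If the reference sequence has no ready-made GTF file, one can be generated from a GFF file with the gffread tool.
#Convert GFF to GTF: download the GFF3 file first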
wget http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.gff3.gz
#Process GTF and GFF files with gffread (decompress the .gz downloads first)
#gff2gtf
gffread Homo_sapiens.GRCh38.107.gff3 -T -o genome.gtf
#gtf2gff
#gffread Homo_sapiens.GRCh38.107.gtf -o genome.gff
#Extract CDS sequences
gffread Homo_sapiens.GRCh38.107.gff3 -g Homo_sapiens.GRCh38.dna.primary_assembly.fa -x cds.fa
#Extract protein sequences
gffread Homo_sapiens.GRCh38.107.gff3 -g Homo_sapiens.GRCh38.dna.primary_assembly.fa -y protein.fa
#Extract transcript sequences
gffread Homo_sapiens.GRCh38.107.gff3 -g Homo_sapiens.GRCh38.dna.primary_assembly.fa -w transcripts.fa
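After a format conversion it can be worth checking that the result still looks like GTF, which has nine tab-separated columns per record. The sketch below counts records that break that rule. The helper name check_gtf_columns is hypothetical, and the input is a toy file with one deliberately broken line; on real data the argument would be genome.gtf from the gff2gtf step above.

```shell
set -eu

# check_gtf_columns is a hypothetical helper: it prints how many
# non-comment, non-empty lines do not have exactly 9 tab-separated fields.
check_gtf_columns() {
    awk -F '\t' '!/^#/ && NF > 0 && NF != 9 { bad++ } END { print bad + 0 }' "$1"
}

# Toy input: one valid record and one record with a missing column.
toy_converted="$(mktemp)"
{
    printf '# toy file for illustration\n'
    printf '21\thavana\texon\t100\t200\t.\t+\t.\ttranscript_id "T1";\n'
    printf '21\thavana\texon\t300\t400\t.\t+\ttranscript_id "T1";\n'
} > "$toy_converted"

check_gtf_columns "$toy_converted"
```

A result of 0 on the converted file means every record has the expected number of columns; it does not validate the attribute contents.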
Extract the chromosome 21 records from the human genome.
#Extract chromosome 21
seqkit grep -p "21" Homo_sapiens.GRCh38.dna.primary_assembly.fa >chr21.fa
grep "^21" Homo_sapiens.GRCh38.107.gtf >chr21.gtf
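The grep pattern "^21" matches any line that starts with 21, so it would also pick up a sequence named, say, 210 in another assembly, and it drops the "#" header lines at the top of the GTF. A stricter variant, sketched here with awk, compares the whole first column and keeps the headers. The helper name extract_chrom and the toy records are invented for illustration.

```shell
set -eu

# extract_chrom is a hypothetical helper: it keeps '#' header lines and
# records whose first tab-separated column equals the requested name.
# Appending "" to chrom forces a string (not numeric) comparison.
extract_chrom() {
    awk -F '\t' -v chrom="$1" '/^#/ || $1 == chrom ""' "$2"
}

# Toy GTF with a header, chromosome 21, chromosome 2 and a made-up "210".
toy_gtf="$(mktemp)"
{
    printf '#!genome-build toy\n'
    printf '21\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n'
    printf '2\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "G2";\n'
    printf '210\ttoy\tgene\t1\t50\t.\t-\t.\tgene_id "G3";\n'
} > "$toy_gtf"

extract_chrom 21 "$toy_gtf"
```

On the real annotation the call would be extract_chrom 21 Homo_sapiens.GRCh38.107.gtf >chr21.gtf. Note that UCSC files name the same chromosome chr21, so the argument has to follow the naming used by the source of the file.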