RNAseq pepline

必读文献：Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis

数据下载与处理

将数据从sra格式转化为fastq.gz格式

对于双端测序，用fastq-dump命令的--split-3选项

fastq-dump --split-3 -A xxx.sra \
--gzip \
-O outpath

检测数据质量QC

fastqc -o outpath --noextarct xxx.fastq.gz

自动化处理数据

可同时处理双端数据

fastp -i xxx.1.fastq.gz -o xxx.1.out.fastq.gz -I xxx.2.fastq.gz -O xxx.2.out.fastq.gz

序列比对

收集并了解不同的比对软件

比较不同的比对软件的优势以及劣势。常用的比对软件包括：STAR, TopHat和HISAT2

STAR具有最高比例的在基因组上有唯一比对位置的reads，尤其是对读长为300 nt的MCF7样品也有最高的比对率。与TopHat和HISAT2不同，STAR只保留双端reads都比对到基因组的序列，但对低质量的比对 (允许更多的错配碱基和soft-clip事件) 容忍度高。这一点在长reads >(MCF7-300)样品中的体现更为明显。TopHat则不允许soft-clip事件。
在比对速度方面，HISAT2比STAR快2.5倍，比TopHat快大约100倍。
soft-clip事件: 即reads末端存在低质量碱基或接头导致比对不上的, STAR会自动尝试截去未比对部分，只保留比对上的部分。

接下来，我们对这三种软件都安装并且进行尝试：STAR安装，TopHat2安装，HISAT2安装

在了解这三种软件的创始人以及使用目的时，我发现TopHat2和HISAT2都是约翰霍普金斯大学计算生物学中心发表的软件，而且再TopHat2的首页上也已经提出了：

Please note that TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2 which provides the same core functionality (i.e. spliced alignment of RNA-Seq reads), in a more accurate and much more efficient way.

因此，我们下面只尝试STAR和HISAT2这两种方法

STAR

使用手册

生信菜鸟团的使用说明

参考阅读１
参考阅读２
参考阅读３

1 生成genome Index

STAR \
 --runThreadN 30    \
 --runMode genomeGenerate    \
 --genomeDir /home/xx/reference/genomeDir \
 --genomeFastaFiles /home/xx/reference/hg19.fa    \
 --sjdbGTFfile /home/xx/reference/hg19.gtf    \
 --sjdbOverhang 99

2 进行序列比对

STAR \
--runThreadN 30 \
--genomeDir genomeDir \
--readFilesCommand zcat \
--readFilesIn fastq1 fastq2 \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix ${mapping_dir}/${sample_id}.

输出的文件：

Aligned.sortedByCoord.out.bam: 比对结果
SJ.out.tab：包含剪接信息

column 1: chromosome
column 2: first base of the intron (1-based)
column 3: last base of the intron (1-based)
column 4: strand (0: undefined, 1: +, 2: -)
column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5:AT/AC, 6: GT/AT
column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
column 7: number of uniquely mapping reads crossing the junction
column 8: number of multi-mapping reads crossing the junction
column 9: maximum spliced alignment overhang

3 STAR软件可以进行二次比对

为了发现更加灵敏的new junction，STAR建议使用2-pass mode，其能增加检测到的new junction数目，使得更多的splices reads能mapping到new junction。因此STAR先用一般参数做一遍mapping，收集检测到的junction信息，然后利用这已经annotated junction来做第二次mapping．

3.1 重新建立索引

STAR \
--runThreadN 10 \
--runMode genomeDir \
--genomeDir new.path.to.genomeDir \
--genomeFastaFiles /home/xx/reference/hg19.fa \
--sjdbGTFfile /home/xx/reference/hg19.gtf \
--sjdbOverhang 99 \
--sjdbFileChrStartEnd all.SJ.out.tab.list \ #相比与第一次建立索引，只增加了一个命令选项，就是把SJ.out.tab文件加入到建立索引中

3.2 重新比对

STAR \
 --runThreadN 10 \
 --genomeDir ~/project/technique/RNAseq.data/genomeDir/genomeDir \
 --readFilesCommand zcat \
 --readFilesIn ~/project/technique/RNAseq.data/SRR771548.sra_1_clean.fastq.gz ~/project/technique/RNAseq.data/SRR771548.sra_2_clean.fastq.gz \
 --outSAMtype BAM SortedByCoordinate \
 --outFileNamePrefix ~/project/technique/RNAseq.data/two.pass.mapping/

4 计算counts

4.1 featureCounts

featureCounts -T 10 -p -t exon -g gene_id \
  -a /home/xx/reference/homo_annotation_hg19/gencode.v19.annotation.gtf \
  -o ~/project/technique/featurecounts/featureCounts.txt ~/project/technique/RNAseq.data/two.pass.mapping/Aligned.sortedByCoord.out.bam

输出文件：gene_assigned　计数结果

cut -f 1,7 gene_assigned | grep -v '^#' > feature.counts.txt

HISAT2

这个软件我们的服务器竟然没装，好在装起来简单，官网下载source code，解压后make，然后把路径添加到.bashrc里面去：
export PATH=$PATH:/home/liuke/software/hisat2-2.1.0，在服务器上，则是添加到了.bash_profile中：PATH=$PATH:$HOME/software/hisat2-2.1.0

官网

参考阅读

1 下载index

下载完成后解压并且运行文件夹中的shell脚本，这步运行了好久,生成了下列的文件：
genome.1.ht2　genome.2.ht2　genome.3.ht2　genome.4.ht2　genome.5.ht2　genome.6.ht2　genome.7.ht2　genome.8.ht2　make_hg19.sh

2 比对

最基础的命令

hisat2 -x ~/reference/hisat2-hg19/genome -1 ~/project/technique/RNAseq.data/SRR771548.sra_1_clean.fastq.gz \
    -2 ~/project/technique/RNAseq.data/SRR771548.sra_2_clean.fastq.gz \
    -S ~/project/technique/RNAseq.data/hisat2.mapping/SRR771548.sam

最需要注意的一点是在用-x指定index时，不能只指定目录，比如上述index的目录是~/reference/hisat2-hg19/,但是如果我只输入这个目录，就会显示Could not locate a HISAT2 index corresponding to basename "/home/liuke/reference/hisat2-hg19/"，我们还要输入这个目录中，每个文件相同的开头，比如这个index中，就需要输入~/reference/hisat2-hg19/genome

输出的文件就是上述的SRR771548.sam文件

输出的信息如下所示:

62789772 reads; of these:
  62789772 (100.00%) were paired; of these:
    2946595 (4.69%) aligned concordantly 0 times
    56129524 (89.39%) aligned concordantly exactly 1 time
    3713653 (5.91%) aligned concordantly >1 times
    ----
    2946595 pairs aligned concordantly 0 times; of these:
      124043 (4.21%) aligned discordantly 1 time
    ----
    2822552 pairs aligned 0 times concordantly or discordantly; of these:
      5645104 mates make up the pairs; of these:
        3552647 (62.93%) aligned 0 times
        1911761 (33.87%) aligned exactly 1 time
        180696 (3.20%) aligned >1 times

97.17% overall alignment rate

3 排序

按照名字进行排序

samtools sort -n file.sam -o file.sortbyname.sam

4 计算counts

用htseq-count计算counts，刚开始直接用下面的代码写：

htseq-count sam.file gtf > counts.txt

但是，我们用的sam文件要排序，我们上面按照名字排序，所以下面的选项r中，我们就写name．但是这个代码默认是比对exon，而且默认的feature是gene_id，而且由于用的gtf里面的染色体的写法是站＇１＇，而不是只＇chr1＇，导致一个都没有比对上，在西仔细的研究了htseq-count的命令选项后，重新写了代码：

htseq-count -s no -t gene -i gene_name -r name sam.file gtf > gene.counts.txt

而且，在发现了这个问题之后，又重新回去仔细看了featureCounts的命令选项，发现t表示同样的作用，而g则与htseq-count里的i作用相同．

搜索此博客

吃瓜少女

RNAseq学习与总结

RNAseq pepline

数据下载与处理

将数据从sra格式转化为fastq.gz格式

检测数据质量QC

自动化处理数据

序列比对

收集并了解不同的比对软件

评论

发表评论

此博客中的热门博文

WGCNA全流程

10X Genomics单细胞转录组测序数据的处理