# Table of Contents
1. Module 1 - Introduction to RNA sequencing
2. Module 2 - RNA-seq Alignment and Visualization
3. Module 3 - Expression and Differential Expression
4. Module 4 - Isoform Discovery and Alternative Expression
   - Reference Guided Transcript Assembly
   - de novo Transcript Assembly
   - Transcript Assembly Merge
   - Differential Splicing
   - Splicing Visualization
5. Module 5 - De novo transcript reconstruction
6. Module 6 - Functional Annotation of Transcripts
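# 2.1 Adapter Trim (optional step)
Use Flexbar to trim adapter sequences from the read FASTQ files. The output of this step is a set of trimmed FASTQ files for each data set.
Refer to the Flexbar documentation for a more detailed explanation: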
- https://github.com/seqan/flexbar
- https://github.com/seqan/flexbar/wiki
Flexbar basic usage:
flexbar -r reads [-t target] [-b barcodes] [-a adapters] [options]
The extra options used here are:
- '--adapter-min-overlap 7' requires a minimum of 7 bases to match the adapter
- '--adapter-trim-end RIGHT' uses a trimming strategy to remove the adapter from the 3 prime or RIGHT end of the read
- '--max-uncalled 300' allows as many as 300 uncalled or N bases (MiSeq read lengths can be 300bp)
- '--min-read-length 25' sets the minimum read length allowed after trimming to 25 bp
- '--threads 8' use 8 threads
- '--zip-output GZ' the input FASTQ files are gzipped so we will output gzipped FASTQ to save space
- '--adapters' define the path to the adapter FASTA file to trim
- '--reads' define the path to the read 1 FASTQ file of reads
- '--reads2' define the path to the read 2 FASTQ file of reads
- '--target' a base path for the output files. Flexbar appends _1.fastq.gz and _2.fastq.gz to this value for read 1 and read 2 respectively
- '--pre-trim-left' trim a fixed number of bases at left read end. For example, to trim 5 bases at the left side of reads: --pre-trim-left 5
- '--pre-trim-right' trim a fixed number of bases at right read end. For example, to trim 5 bases at the right side of reads: --pre-trim-right 5
- '--pre-trim-phred' trim based on phred quality value to deal with higher error rates towards the end of reads. For example, to trim the 3 prime end until quality offset value 30 or higher is reached, specify: --pre-trim-phred 30
# Flexbar trim
First, set up a directory for the output.
mkdir trim
Download the required Illumina adapter sequence file.
wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa
Use flexbar to remove Illumina adapter sequences (if any) and to trim the first 13 bases of each read.
../flexbar-3.4.0-linux/flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz --target trim/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22
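A quick sanity check after trimming is to compare the number of reads before and after. Reads that end up shorter than the '--min-read-length' value are discarded, so the trimmed count may be lower. The sketch below assumes the standard four-line FASTQ record layout and that `gzip` and `awk` are available; `count_reads` is a hypothetical helper name and is not part of Flexbar.

```shell
#!/usr/bin/env bash
# count_reads: print the number of records in a gzipped FASTQ file.
# Assumption: every record occupies exactly four lines (no wrapped sequences).
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Example calls (file names follow the flexbar command above):
# count_reads UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
# count_reads trim/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22_1.fastq.gz
```

Running the helper on the untrimmed and trimmed read 1 files gives two numbers that can be compared directly.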
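Optional exercise: compare the FastQC reports for the FASTQ files before and after trimming. All FastQC reports can be generated on the command line.
fastqc *.fastq.gz trim/*.fastq.gz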
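# Exercise 5
Assignment: using the approach above, trim the reads for the normal and tumor samples that you downloaded in the previous practical exercise. Note: try dropping the hard left trim option used above ('--pre-trim-left'). Once you have trimmed the reads, use FastQC to compare the FASTQ files before and after trimming.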
mkdir trimmed
wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_normal_rep1_r1.fastq.gz --reads2 hcc1395_normal_rep1_r2.fastq.gz --target trimmed/hcc1395_normal_rep1
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_normal_rep2_r1.fastq.gz --reads2 hcc1395_normal_rep2_r2.fastq.gz --target trimmed/hcc1395_normal_rep2
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_normal_rep3_r1.fastq.gz --reads2 hcc1395_normal_rep3_r2.fastq.gz --target trimmed/hcc1395_normal_rep3
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_tumor_rep1_r1.fastq.gz --reads2 hcc1395_tumor_rep1_r2.fastq.gz --target trimmed/hcc1395_tumor_rep1
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_tumor_rep2_r1.fastq.gz --reads2 hcc1395_tumor_rep2_r2.fastq.gz --target trimmed/hcc1395_tumor_rep2
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_tumor_rep3_r1.fastq.gz --reads2 hcc1395_tumor_rep3_r2.fastq.gz --target trimmed/hcc1395_tumor_rep3
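The six flexbar calls above differ only in the sample name, so they can also be written as a loop. The following is one possible sketch, assuming `flexbar` is on the `PATH` and that the FASTQ files and `illumina_multiplex.fa` are in the current directory; the function names `trim_sample` and `trim_all` are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# trim_sample: run flexbar on one paired-end sample with the options used above.
# $1 = sample prefix, e.g. hcc1395_normal_rep1
#      (expects <prefix>_r1.fastq.gz and <prefix>_r2.fastq.gz)
# $2 = output directory (defaults to "trimmed")
trim_sample() {
    local sample=$1 outdir=${2:-trimmed}
    mkdir -p "$outdir"
    flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT \
        --adapters illumina_multiplex.fa \
        --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ \
        --reads "${sample}_r1.fastq.gz" --reads2 "${sample}_r2.fastq.gz" \
        --target "${outdir}/${sample}"
}

# trim_all: loop over both conditions and all three replicates.
trim_all() {
    local condition rep
    for condition in normal tumor; do
        for rep in 1 2 3; do
            trim_sample "hcc1395_${condition}_rep${rep}"
        done
    done
}
```

Calling `trim_all` issues the same six commands as the explicit list, and adding a replicate or condition only requires changing the loop values.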
After trimming, what is the range of read lengths for hcc1395 normal replicate 1, read 1? 25-151
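The read length range can also be checked directly from the trimmed FASTQ file instead of reading it off the FastQC report. This is a minimal sketch, assuming four-line FASTQ records and the `_1.fastq.gz` output naming described for '--target'; `read_length_range` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# read_length_range: print "min-max" read length for a gzipped FASTQ file.
read_length_range() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 {                 # the sequence is line 2 of each 4-line record
            len = length($0)
            if (!seen || len < min) min = len
            if (!seen || len > max) max = len
            seen = 1
        }
        END { if (seen) print min "-" max }'
}

# Example call:
# read_length_range trimmed/hcc1395_normal_rep1_1.fastq.gz
```

The lower bound should never fall below the '--min-read-length' value, which makes this a convenient check that the option took effect.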
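Which sections of the FastQC report are most useful for observing the effect of trimming? 'Basic Statistics', 'Sequence Length Distribution', and 'Adapter Content'
In the 'Per base sequence content' section, what pattern do you see? What could explain this pattern?
The first 9 base positions show a spiky pattern, indicating that each base is represented in a biased way at the start of our reads/fragments. One possible explanation is that the random hexamer primers used for cDNA synthesis during library preparation do not prime in a truly random way. As a result, the fragments that are generated (and ultimately the reads) have a non-random pattern at their start.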
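The bias at the start of the reads can be inspected numerically as well as in the FastQC plot. The sketch below tabulates the percentage of each base at the first few read positions; it assumes four-line FASTQ records, and `base_composition` and its default of 15 positions are illustrative choices rather than part of any tool used above.

```shell
#!/usr/bin/env bash
# base_composition: print A/C/G/T/N percentages for the leading read positions.
# $1 = gzipped FASTQ file
# $2 = number of leading positions to report (default 15)
base_composition() {
    gzip -dc "$1" | awk -v n="${2:-15}" '
        NR % 4 == 2 {                 # sequence line of each record
            seq = toupper($0)
            for (i = 1; i <= n && i <= length(seq); i++) {
                count[i, substr(seq, i, 1)]++
                total[i]++
            }
        }
        END {
            print "pos\tA\tC\tG\tT\tN"
            for (i = 1; i <= n; i++) {
                if (total[i] == 0) continue   # no read reaches this position
                printf "%d", i
                for (j = 1; j <= 5; j++) {
                    b = substr("ACGTN", j, 1)
                    printf "\t%.1f", 100 * count[i, b] / total[i]
                }
                printf "\n"
            }
        }'
}

# Example call:
# base_composition trimmed/hcc1395_normal_rep1_1.fastq.gz 15
```

Positions where the four percentages swing sharply from one position to the next correspond to the spiky region in the 'Per base sequence content' plot.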