RNAseq 教程 (3.1) - 生信学习 | Zhou Xiaozhao = 小钊の笔记 = 前天是小兔子，昨天是小鹿，今天是你

# 目录

1.Module 1 - Introduction to RNA sequencing

2.Module 2 - RNA-seq Alignment and Visualization

3.Module 3 - Expression and Differential Expression

4.Module 4 - Isoform Discovery and Alternative Expression

5.Module 5 - De novo transcript reconstruction

De novo RNA-Seq Assembly and Analysis Using Trinity

6.Module 6 - Functional Annotation of Transcripts

Functional Annotation of Assembled Transcripts Using Trinotate

# 3.1 Expression

# Stringtie

使用 Stringtie 从上一个模块中 HISAT2 生成的 SAM/BAM 文件中生成表达量统计

注意使用 Stringtie 从头发现转录本和差异表达

在本模块中，我们将以有参模式模式运行 Stringtie。为了简化和减少运行时间，只使用已知的转录本。但是，Stringtie 可以预测每个文库中存在的可能的转录本 (通过删除 Stringtie 命令中的 '-G' 选项)。然后，Stringtie 将为每个由数据组装的转录本分配任意的转录本 id，并估计这些转录本的表达。

Stringtie 提供了一个 merge 命令合并不同文库的 gtf 文件
旦您有了一个合并的 GTF 文件，您就可以使用这个文件再次运行 Stringtie，而不是我们上面使用的已知的 transcripts GTF 文件
Stringtie 还提供了 “gffcompare” 来比较预测的转录本和已知的转录本
参考 Stringtie 手册获得更详细的解释:
https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

String 的基础使用

stringtie <aligned_reads.bam> [options]*

'-p 8' tells Stringtie to use eight CPUs
'-G <known transcripts file>' reference annotation to use for guiding the assembly process (GTF/GFF3)
'-e' only estimate the abundance of given reference transcripts (requires -G)
'-B' enable output of Ballgown table files which will be created in the same directory as the output GTF (requires -G, -o recommended)
'-o' output path/file name for the assembled transcripts GTF (default: stdout)
'-A' output path/file name for gene abundance estimates

	stringtie -p 8 -G ../chr22_with_ERCC92.gtf -e -B -o HBR_Rep1/transcripts.gtf -A HBR_Rep1/gene_abundances.tsv HBR_Rep1.bam
	stringtie -p 8 -G ../chr22_with_ERCC92.gtf -e -B -o HBR_Rep2/transcripts.gtf -A HBR_Rep2/gene_abundances.tsv HBR_Rep2.bam
	stringtie -p 8 -G ../chr22_with_ERCC92.gtf -e -B -o HBR_Rep3/transcripts.gtf -A HBR_Rep3/gene_abundances.tsv HBR_Rep3.bam

	stringtie -p 8 -G ../chr22_with_ERCC92.gtf -e -B -o UHR_Rep1/transcripts.gtf -A UHR_Rep1/gene_abundances.tsv UHR_Rep1.bam
	stringtie -p 8 -G ../chr22_with_ERCC92.gtf -e -B -o UHR_Rep2/transcripts.gtf -A UHR_Rep2/gene_abundances.tsv UHR_Rep2.bam
	stringtie -p 8 -G ../chr22_with_ERCC92.gtf -e -B -o UHR_Rep3/transcripts.gtf -A UHR_Rep3/gene_abundances.tsv UHR_Rep3.bam

Stringtie 的原始输出是什么样子的？有关 Stringtie 输出文件的详细信息，请参阅 Stringtie 手册 (输出部分)

less -S UHR_Rep1/transcripts.gtf

查看表达量

	awk '{if ($3=="transcript") print}' UHR_Rep1/transcripts.gtf \| cut -f 1,4,9 \| less

	22 15690026 gene_id "ENSG00000198062"; transcript_id "ENST00000343518"; ref_gene_name "POTEH"; cov "0.106554"; FPKM "2.778050"; TPM "3.623655";
	22 15690078 gene_id "ENSG00000198062"; transcript_id "ENST00000621704"; ref_gene_name "POTEH"; cov "0.519912"; FPKM "13.555018"; TPM "17.681002";

基因和转录本水平表达值也可以在这两个文件中查看:

	less -S UHR_Rep1/t_data.ctab

	less -S UHR_Rep1/gene_abundances.tsv

创建一个整洁的表达式矩阵文件。这将在基因和转录水平上进行，还将产生各种不同的表达量表示方法：覆盖率、FPKM 和 TPM。

	wget https://github.com/griffithlab/rnaseq_tutorial/blob/master/scripts/stringtie_expression_matrix.pl
	chmod +x stringtie_expression_matrix.pl
	./stringtie_expression_matrix.pl --expression_metric=TPM --result_dirs='HBR_Rep1,HBR_Rep2,HBR_Rep3,UHR_Rep1,UHR_Rep2,UHR_Rep3' --transcript_matrix_file=transcript_tpm_all_samples.tsv --gene_matrix_file=gene_tpm_all_samples.tsv

	./stringtie_expression_matrix.pl --expression_metric=FPKM --result_dirs='HBR_Rep1,HBR_Rep2,HBR_Rep3,UHR_Rep1,UHR_Rep2,UHR_Rep3' --transcript_matrix_file=transcript_fpkm_all_samples.tsv --gene_matrix_file=gene_fpkm_all_samples.tsv

	./stringtie_expression_matrix.pl --expression_metric=Coverage --result_dirs='HBR_Rep1,HBR_Rep2,HBR_Rep3,UHR_Rep1,UHR_Rep2,UHR_Rep3' --transcript_matrix_file=transcript_coverage_all_samples.tsv --gene_matrix_file=gene_coverage_all_samples.tsv

	head transcript_tpm_all_samples.tsv gene_tpm_all_samples.tsv

# HTSEQ-COUNT (粗略介绍)

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

htseq-count [options] <sam_file> <gff_file>

'--format' specify the input file format one of BAM or SAM. Since we have BAM format files, select 'bam' for this option.
'--order' provide the expected sort order of the input file. Previously we generated position sorted BAM files so use 'pos'.
'--mode' determines how to deal with reads that overlap more than one feature. We believe the 'intersection-strict' mode is best.
'--stranded' specifies whether data is stranded or not. The TruSeq strand-specific RNA libraries suggest the 'reverse' option for this parameter.
'--minaqual' will skip all reads with alignment quality lower than the given minimum value
'--type' specifies the feature type (3rd column in GFF file) to be used. (default, suitable for RNA-Seq and Ensembl GTF files: exon)
'--idattr' The feature ID used to identity the counts in the output table. The default, suitable for RNA-SEq and Ensembl GTF files, is gene_id.

	mkdir -p htseq_counts
	cd htseq_counts

	htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id ../UHR_Rep1.bam ../../chr22_with_ERCC92.gtf > UHR_Rep1_gene.tsv
	htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id ../UHR_Rep2.bam ../../chr22_with_ERCC92.gtf > UHR_Rep2_gene.tsv
	htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id ../UHR_Rep3.bam ../../chr22_with_ERCC92.gtf > UHR_Rep3_gene.tsv

	htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id ../HBR_Rep1.bam ../../chr22_with_ERCC92.gtf > HBR_Rep1_gene.tsv
	htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id ../HBR_Rep2.bam ../../chr22_with_ERCC92.gtf > HBR_Rep2_gene.tsv
	htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id ../HBR_Rep3.bam ../../chr22_with_ERCC92.gtf > HBR_Rep3_gene.tsv

	cd htseq_counts/
	join UHR_Rep1_gene.tsv UHR_Rep2_gene.tsv \| join - UHR_Rep3_gene.tsv \| join - HBR_Rep1_gene.tsv \| join - HBR_Rep2_gene.tsv \| join - HBR_Rep3_gene.tsv > gene_read_counts_table_all.tsv
	echo "GeneID UHR_Rep1 UHR_Rep2 UHR_Rep3 HBR_Rep1 HBR_Rep2 HBR_Rep3" > header.txt
	cat header.txt gene_read_counts_table_all.tsv \| grep -v "__" \| perl -ne 'chomp $_; $_ =~ s/\s+/\t/g; print "$_\n"' > gene_read_counts_table_all_final.tsv
	rm -f gene_read_counts_table_all.tsv header.txt
	head gene_read_counts_table_all_final.tsv

# 目录

# 3.1 Expression

# Stringtie

# HTSEQ-COUNT (粗略介绍)

RNAseq教程(2.5)

RNAseq教程(3.2)