# Table of Contents
1. Module 1 - Introduction to RNA sequencing
2. Module 2 - RNA-seq Alignment and Visualization
3. Module 3 - Expression and Differential Expression
4. Module 4 - Isoform Discovery and Alternative Expression
   - Reference Guided Transcript Assembly
   - de novo Transcript Assembly
   - Transcript Assembly Merge
   - Differential Splicing
   - Splicing Visualization
5. Module 5 - De novo transcript reconstruction
6. Module 6 - Functional Annotation of Transcripts
# 1.2 Reference Genomes
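Obtain a reference genome from Ensembl, iGenomes, NCBI, or UCSC. In this example analysis we will use the human GRCh38 release of the Ensembl genome. Furthermore, we will actually use only a single chromosome (chr22) plus the ERCC spike-in sequences, so that the analysis runs faster...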
Create the necessary working directory:

    mkdir RNA_ref
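The complete data can be found at ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/. You could use wget to download the Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz file and then decompress it. Here we instead download a prepared FASTA file that contains only chr22 and the ERCC92 sequences:

    cd RNA_ref
    wget http://genomedata.org/rnaseq-tutorial/fasta/GRCh38/chr22_with_ERCC92.fa
    ls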
View the first 10 lines of this file. Why does it look like this?

    head chr22_with_ERCC92.fa
How many lines and characters are in this file? How long is this chromosome (in bases and Mbp)?

    wc chr22_with_ERCC92.fa

    848761 848764 51751056 chr22_with_ERCC92.fa
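The character count reported by `wc` includes header lines and newline characters, so it is only an upper bound on the sequence length. A minimal sketch of how the length of each sequence could be computed directly is shown here. The helper name `fasta_lengths` is hypothetical and not part of the original tutorial.

```shell
# Print "name<TAB>length_in_bases<TAB>length_in_Mbp" for each FASTA record.
# fasta_lengths is a hypothetical helper name used for illustration only.
fasta_lengths() {
  awk '
    /^>/ {
      # A new header: report the previous record, if there was one.
      if (name != "") printf "%s\t%d\t%.2f\n", name, len, len / 1e6
      name = substr($1, 2)   # first word of the header, without ">"
      len = 0
      next
    }
    { len += length($0) }    # sequence line: add its length (newline excluded)
    END {
      if (name != "") printf "%s\t%d\t%.2f\n", name, len, len / 1e6
    }
  ' "$1"
}

# Example (assumes the downloaded file is in the current directory):
# fasta_lengths chr22_with_ERCC92.fa | head -n 1
```

The first record reported for this file should be chromosome 22; the remaining records are the ERCC spike-in sequences.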
View 10 lines from approximately the middle of this file. What is the significance of the upper and lower case characters?

    head -n 425000 chr22_with_ERCC92.fa | tail

    ggaggctgaggcaggagaatcgcttgaacatgggaggtggaagttgcagtgagccgaaac
    tgcgccattgcactatagcctgggcaacaagagtgaaagtctgtcttgaaaaaaaaaaaT
    CAGATGTTCTATGTAAAAATGCTATCTAtgattgaagtataaaactttacctccctttat
    gttcctttgccctccccactatttattattgtcttgattatatcttctatatgcattgag
    aggtgttataacttttgtatcaatcaccaaatttaatttagaaaatataagaggagaaga
    aaagtctattacatttactcatatttttgcttactgtgttctttcttccttcttgatgtt
    ccagaatttcttttattgcttcttttctgcttagaaaactttatctttttctttcatctt
    tcttttttcctcctcctcctcctcctcctttttttttttttttttttttttttttttaat
    aaagagacagggtctcactctatcacccagactggagttcagtgatgcaatcatagctca
    ttgcaaccttgaactcctgggctcaagtgatcctcccacctcagcctcctgagtagctgg
What is the count of each base in the entire reference genome file (skipping the header lines for each sequence)?

    cat chr22_with_ERCC92.fa | grep -v ">" | perl -ne 'chomp $_; $bases{$_}++ for split //; if (eof){print "$_ $bases{$_}\n" for sort keys %bases}'

    A 4455938
    C 4406493
    G 4411768
    N 10710000
    T 4445994
    Y 1
    a 5950524
    c 4772185
    g 4853055
    n 948691
    t 5946575
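For readers who prefer awk to Perl, the same per-character tally could be sketched as follows. This is an illustrative alternative, not the tutorial's method, and the helper name `base_counts` is hypothetical. It loops over every character, so it may take a while on a whole chromosome.

```shell
# Count how often each character occurs on the non-header lines of a FASTA file.
# base_counts is a hypothetical helper name used for illustration only.
base_counts() {
  awk '
    !/^>/ {
      n = length($0)
      for (i = 1; i <= n; i++) count[substr($0, i, 1)]++
    }
    END { for (b in count) print b, count[b] }
  ' "$1" | LC_ALL=C sort   # byte order: uppercase letters sort before lowercase
}

# Example (assumes the downloaded file is in the current directory):
# base_counts chr22_with_ERCC92.fa
```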
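Keep in mind that the names of the reference sequences (chromosomes) must match those used in the annotation GTF file (described in the next section).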
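One way to check this requirement is to compare the sequence names in the FASTA headers with the names in the first column of the GTF. The sketch below is a minimal example under assumptions: the helper names are hypothetical, and the GTF file name in the usage comment is a placeholder for whichever annotation file you use.

```shell
# List sequence names (first word of each header line) in a FASTA file.
fasta_names() {
  awk '/^>/ { print substr($1, 2) }' "$1" | sort -u
}

# List sequence names (column 1, skipping "#" comment lines) in a GTF file.
gtf_names() {
  awk -F '\t' '!/^#/ { print $1 }' "$1" | sort -u
}

# Print names that appear in the GTF but not in the FASTA.
# An empty result means every annotated sequence has a matching reference.
names_missing_from_fasta() {
  fa_tmp=$(mktemp)
  gtf_tmp=$(mktemp)
  fasta_names "$1" > "$fa_tmp"
  gtf_names "$2" > "$gtf_tmp"
  comm -13 "$fa_tmp" "$gtf_tmp"   # keep only lines unique to the GTF list
  rm -f "$fa_tmp" "$gtf_tmp"
}

# Example (annotation.gtf is a placeholder file name):
# names_missing_from_fasta chr22_with_ERCC92.fa annotation.gtf
```

A typical mismatch this would reveal is one file using `22` while the other uses `chr22`.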
# Exercise 2
How many bases on chromosome 22 correspond to repetitive elements? What percentage of the total length is this? The commands below first extract chromosome 22 into its own file and then count lowercase a, c, g, and t as repeat bases.

    cat chr22_with_ERCC92.fa | perl -ne 'if ($_ =~ /\>22/){$chr22=1}; if ($_ =~ /\>ERCC/){$chr22=0}; if ($chr22){print "$_";}' > chr22_only.fa
    cat chr22_only.fa | grep -v ">" | perl -ne 'chomp $_; $r+= $_ =~ tr/a/A/; $r += $_ =~ tr/c/C/; $r += $_ =~ tr/g/G/; $r += $_ =~ tr/t/T/; $l += length($_); if (eof){$p = sprintf("%.2f", ($r/$l)*100); print "\nrepeat bases = $r\ntotal bases = $l\npercent repeat bases = $p%\n\n"}'

    repeat bases = 21522339
    total bases = 50818468
    percent repeat bases = 42.35%
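The same percentage could also be obtained with standard text utilities instead of Perl. The sketch below is one possible way to do it; the helper name `repeat_percent` is hypothetical. Like the Perl command, it counts only lowercase a, c, g, and t as repeat bases and divides by the number of all sequence characters.

```shell
# Report lowercase a/c/g/t bases as a percentage of all sequence characters.
# repeat_percent is a hypothetical helper name used for illustration only.
repeat_percent() {
  # Lowercase a, c, g, t on non-header lines (lowercase n is not counted).
  repeats=$(awk '!/^>/' "$1" | tr -cd 'acgt' | wc -c)
  # Every character on non-header lines, with newlines removed.
  total=$(awk '!/^>/' "$1" | tr -d '\n' | wc -c)
  awk -v r="$repeats" -v t="$total" 'BEGIN { printf "%.2f\n", (r / t) * 100 }'
}

# Example (assumes chr22_only.fa was created as shown above):
# repeat_percent chr22_only.fa
```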
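How many EcoRI restriction sites occur in the chromosome 22 sequence? The EcoRI restriction enzyme recognition sequence is 5'-GAATTC-3'. Note that the first Perl command stores the uppercased line in `$s` but prints the unmodified `$_`, so the reported count covers only sites written in uppercase sequence.

    cat chr22_only.fa | grep -v ">" | perl -ne 'chomp $_; $s = uc($_); print $_;' | perl -ne '$c += $_ =~ s/GAATTC/XXXXXX/g; if (eof){print "\nEcoRI site (GAATTC) count = $c\n\n";}'

    EcoRI site (GAATTC) count = 3935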
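If sites in lowercase (soft-masked) sequence should be counted as well, the sequence can be converted to uppercase before matching. The sketch below is one way this might be done; the helper name `count_motif` is hypothetical. It joins all sequence lines first, so sites that span a line break are found, and it counts non-overlapping matches. Its result on chromosome 22 would be expected to differ from the case-sensitive count shown above.

```shell
# Count occurrences of an uppercase motif in a FASTA file, ignoring case
# in the sequence and ignoring line breaks.
# count_motif is a hypothetical helper name used for illustration only.
count_motif() {
  # $1 = FASTA file, $2 = motif written in uppercase (for example GAATTC)
  awk '!/^>/' "$1" |          # drop header lines
    tr -d '\n' |              # join the sequence into one line
    tr 'acgtn' 'ACGTN' |      # convert soft-masked bases to uppercase
    grep -o "$2" |            # print each match on its own line
    wc -l |                   # count the matches
    tr -d ' '                 # strip padding that some wc versions add
}

# Example (assumes chr22_only.fa was created as shown above):
# count_motif chr22_only.fa GAATTC
```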