最近做的项目是关于地衣的转录组和基因组,拿到的结果是真菌部位的基因组以及不同时期和部位的转录组,地下部位的藻类部由于太复杂而且杂质过多不好分离就没有测基因组,目前想知道地下部位的藻类到底是什么?根据之前的观察,推测是 Coccomyxa subellipsoidea,想要具体确定的话还需要进一步证明。
可以看到该地衣的地上部位是一种担子菌,地下部位不仅有藻类还有其他的物种。
# 转录组比对到真菌参考基因组
目前有真菌的测序基因组,第一步就是想把转录组数据比对到真菌的参考基因组上
比对工具我选择 STAR,使用地上部位(真菌部_Lh3-1)和地下部位(藻类部 _Lh1-1)的转录组比对到基因组上
建立索引
S$ATR --runMode genomeGenerate --genomeDir /datadisk02/mpoly/hun_db/ --genomeFastaFiles /datadisk02/mpoly/hudsoiana.fa |
比对
$STAR --runThreadN 20 --genomeDir /datadisk02/mpoly/hun_db/ --readFilesCommand zcat --readFilesIn _Lh2-1_S110_L007_R1_001.fastq.gz _Lh2-1_S110_L007_R2_001.fastq.gz --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 5 --outFileNamePrefix S2-1 | |
$STAR --runThreadN 20 --genomeDir /datadisk02/mpoly/hun_db/ --readFilesCommand zcat --readFilesIn _Lh3-1_S108_L007_R1_001.fastq.gz _Lh3-1_S108_L007_R2_001.fastq.gz --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 5 --outFileNamePrefix S1-1 |
比对结束后我最关心的是比对率,按照预期,地下部位的比对率比地上部位低。
$less S1-1Log.final.out
Started job on | Aug 20 05:48:00
Started mapping on | Aug 20 05:48:01
Finished on | Aug 20 06:42:35
Mapping speed, Million of reads per hour | 26.61
Number of input reads | 24198466
Average input read length | 300
UNIQUE READS:
Uniquely mapped reads number | 15397710
Uniquely mapped reads % | 60.89%
$less S3-1Log.final.out
Started job on | Aug 20 08:38:29
Started mapping on | Aug 20 08:38:30
Finished on | Aug 20 08:49:15
Mapping speed, Million of reads per hour | 134.06
Number of input reads | 24018884
Average input read length | 300
UNIQUE READS:
Uniquely mapped reads number | 19151198
Uniquely mapped reads % | 79.73%
可以看到结果符合预期,地下部位的 reads 比对率为 64%,地上部位为 80%。
# 地下部位转录组分析
接下来用地下部位的转录组分析确定共生藻,首先的想法是利用转录组数据比对到初步确定的共生藻 Coccomyxa subellipsoidea 上,使用 Coccomyxa subellipsoidea 的基因组建库进行比对
S$ATR --runMode genomeGenerate --genomeDir /datadisk02/mpoly/cocco/ --genomeFastaFiles /datadisk02/mpoly/Coccomyxa.fa | |
$STAR --runThreadN 20 --genomeDir /datadisk02/mpoly/cocco/ --readFilesCommand zcat --readFilesIn _Lh1-1_S110_L007_R1_001.fastq.gz _Lh1-1_S110_L007_R2_001.fastq.gz --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 5 --outFileNamePrefix S1-1C |
结果有点出乎意料,只有零点几的比对率,
$less S1-1CLog.final.out | |
Started job on | Aug 19 14:25:11 | |
Started mapping on | Aug 19 14:25:16 | |
Finished on | Aug 19 20:51:41 | |
Mapping speed, Million of reads per hour | 3.96 | |
Number of input reads | 25517519 | |
Average input read length | 300 | |
UNIQUE READS: | |
Uniquely mapped reads number | 92457 | |
Uniquely mapped reads % | 0.36% |
由于地下部位比对到真菌部也有 64% 的比对率,所以我重新跑了一遍 STAR,利用 unmapped reads 进行比对。
$STAR --runThreadN 32 --genomeDir hun_db/ --readFilesIn _Lh1-1_S110_L007_R1_001.fastq.gz _Lh1-1_S110_L007_R2_001.fastq.gz --readFilesCommand zcat --outSAMtype SAM --outFileNamePrefix S1-1 --outReadsUnmapped Fastx | |
$awk -F " " '{print $1}' S1-1Unmapped.out.mate1| awk '{if (NR%4==1||NR%4=3) print}' > S1-1.left.fa | |
$sed -i "s/@/>/g" S1-1.left.fa | |
$awk -F " " '{print $1}' S1-1Unmapped.out.mate2| awk '{if (NR%4==1||NR%4=3) print}' > S1-1.right.fa | |
$sed -i "s/@/>/g" S1-1.right.fa | |
$STAR --runThreadN 20 --genomeDir cocco/ --readFilesIn S1-1.left.fa S1-1.right.fa --outSAMtype SAM --outFileNamePrefix S1-1UC --outReadsUnmapped Fastx | |
$less S1-1UCLog.final.out | |
Started job on | Aug 26 10:59:26 | |
Started mapping on | Aug 26 10:59:39 | |
Finished on | Aug 26 15:32:14 | |
Mapping speed, Million of reads per hour | 1.58 | |
Number of input reads | 7180320 | |
Average input read length | 300 | |
UNIQUE READS: | |
Uniquely mapped reads number | 96710 | |
Uniquely mapped reads % | 1.35% |
比对率也只有可怜的 1.35%,一开始我怀疑是不是藻类物种鉴定错了,后来我用了 ncbi 中所有藻类的基因组,比对率都不到 1%。
后来去网上搜索比对率低的原因,在 github 上发现相应解答,加入 --outFilterScoreMinOverLread 0.3 和 --outFilterMatchNminOverLread 0.3 参数
$STAR --runThreadN 20 --genomeDir hun_db/ --readFilesCommand zcat --readFilesIn _Lh1-1_S110_L007_R1_001.fastq.gz _Lh1-1_S110_L007_R2_001.fastq.gz --outFileNamePrefix S1-1fli --outReadsUnmapped Fastx --outSAMtype SAM --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 | |
$less S1-1fliLog.finall.out | |
Started job on | Aug 27 09:52:31 | |
Started mapping on | Aug 27 09:52:37 | |
Finished on | Aug 27 13:02:11 | |
Mapping speed, Million of reads per hour | 8.08 | |
Number of input reads | 25517519 | |
Average input read length | 300 | |
UNIQUE READS: | |
Uniquely mapped reads number | 16371292 | |
Uniquely mapped reads % | 64.16% | |
$STAR --runThreadN 20 --genomeDir cocco/ --readFilesCommand zcat --readFilesIn _Lh1-1_S110_L007_R1_001.fastq.gz _Lh1-1_S110_L007_R2_001.fastq.gz --outFileNamePrefix S1-1Cfli --outReadsUnmapped Fastx --outSAMtype SAM --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 | |
$less S1-1CfliLog.final.out | |
Started job on | Aug 27 09:52:51 | |
Started mapping on | Aug 27 09:52:53 | |
Finished on | Aug 28 04:02:41 | |
Mapping speed, Million of reads per hour | 1.40 | |
Number of input reads | 25517519 | |
Average input read length | 300 | |
UNIQUE READS: | |
Uniquely mapped reads number | 1475579 | |
Uniquely mapped reads % | 5.78% |
对率有 5.78%,提高了一点。
接下来,我使用真菌的 unmapped reads 进行比对,看看没有比到真菌上的 reads 有多少是藻类的。
$awk -F " " '{print $1}' S1-1fliUnmapped.out.mate1| awk '{if (NR%4==1||NR%4=3) print}' > S1-1fli.left.fa | |
$sed -i "s/@/>/g" S1-1.left.fa | |
$awk -F " " '{print $1}' S1-1fliUnmapped.out.mate2| awk '{if (NR%4==1||NR%4=3) print}' > S1-1fli.right.fa | |
$sed -i "s/@/>/g" S1-1.right.fa | |
$STAR --runThreadN 20 --genomeDir cocco/ --readFilesIn S1-1fli.left.fa S1-1fli.right.fa --outSAMtype SAM --outFileNamePrefix S1-1Ufli --outReadsUnmapped Fastx --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 | |
$less S1-1UfliLog.final.out | |
Started job on | Aug 27 14:28:39 | |
Started mapping on | Aug 27 14:28:44 | |
Finished on | Aug 27 21:38:00 | |
Mapping speed, Million of reads per hour | 0.86 | |
Number of input reads | 6123885 | |
Average input read length | 300 | |
UNIQUE READS: | |
Uniquely mapped reads number | 1441904 | |
Uniquely mapped reads % | 23.55% |
也就是说,在剩余的 36% 未比对的 reads 中有 23% 属于 Coccomyxa,总比对率有 70% 左右,地上部位按照相同的步骤进行比对有 83% 的比对率,这个现象也正常,因为地下部位的组织特别复杂,在取样时肯定包含了很多杂质,比对率肯定比地上部位低。
# 总结
目前已经可以确定地下部位的藻类就是 Coccomyxa 属的,是不是 Coccomyxa subellipsoidea 目前还不确定,接下来的工作是确定地下部位未比对上 reads 的所属。