identification of tRFs from sRNA deep-sequencing data

methods

参考 http://genome.bioch.virginia.edu/trfdb/method.php,详细方法描述如下:

tRNA 基因信息从 Genomic tRNA database 或者 NCBI 下载,根据注释基因组版本,tRNA 基因链关系,在成熟 rRNA 基因上下游分别延申 100 bp 提取序列

Information about the tRNA genes for species R. sphaeroides (Bacteria; ATCC_17025), S. pombe (schiPomb1), Droshophila (dm3), C. elegans (ce6), Xenopus (xenTro3), zebra fish (Zv9), mouse (mm9) and human (hg19) was either downloaded from the “Genomic tRNA database“ (http://gtrnadb.ucsc.edu/) or “ NCBI“ (http://www.ncbi.nlm.nih.gov/). We extracted sequences that spanned 100bp upstream and 100bp downstream to the mature tRNA genes from the same genome assembly on which the tRNA gene coordinates were built. The genomic sequences were extracted based on the strand information of tRNA gene transcription.

物种特异的 blastdb 被构建,每一个 sRNA 通过 blastn 比对到对应的物种数据库。通常只有 100% 比对到上 db 数据的 sRNA read 进入后续分析步骤。

A species-specific tRNAdb blast database was built to query the small RNA sequences. To find the tRNA-related RNA sequences in each library, the small RNAs were mapped to the species-specific tRNAdb, using BLASTn (Altschul, Madden et al. 1997). In general we considered only those alignments where the query sequence (small RNA) was mapped to the database sequence (tRNA) along 100% of its length.

blast 输出结果通过 in-house 脚本得到 sRNA 比对的 tRNA 基因上的位置。Blastdb 数据库考虑了基因链信息,因此只有 blast 比对到相同链的 read 被考虑

Blast output file was parsed using an in-house developed script to get the information of mapped positions of small RNA on tRNA genes. Since custom tRNA database was built considering the strand information of tRNA gene therefore only those blast alignments were considered where queries mapped on to the positive strand of the subject sequence.

根据比对从 tRNA 第一个剪辑到最后一个碱基提取 sRNA 信息,最多允许1个错配。在 tRNA 成熟过程中 tRNA 核苷酸转移酶在 tRNA 3’末端添加了“CCA”,因此 tRNA 3‘ 末端的 sRNA 特异的允许不超过3个碱基的错配

We extracted the information for small RNAs aligned from its first to the last base with tRNA sequences, allowing either one or no mismatch. Since “CCA” is added at the 3’ end of tRNA by tRNA nucleotidyltransferase during maturation of tRNA (Xiong and Steitz 2006) therefore a special exception for the small RNA mapping to the 3’ ends of tRNAs in the tRNAdb was devised allowing a terminal mismatch of <=3 bases.

为了消除假阳性,比对到 tRNA blastdb 的 sRNA read,再次使用 blast 进行全基因组搜索,排除不在 tRNA 位点的 read。最后,只有那些只定位于 tRNA 位点的 read 被认定为可能的 tRFs

To eliminate any false positives, the small RNAs that mapped on to the “tRNAdb” were again searched against the whole genome database using blast search excluding the tRNA loci. Finally only those small RNAs were qualified as probable tRFs that mapped exclusively on tRNA loci.

映射在tRNA基因上的sRNA的末端用于获得tRNA上任何映射的sRNA的显着富集

The ends of the small RNA mapped on tRNA genes was used to access the significant enrichment of any mapped small RNA on tRNA.

如果 tRF 是 tRNA 转录物随机降解的产物,则tRF的末端将沿着tRNA基因的长度以相当的频率均等分布。然而,大多数 sRNA(在单个 tRNA 的总比对读数的90%以上)映射在三个特定区域:成熟 tRNA 的5’末端(tRF-5),3’末端(tRF-3)和 3’ tRNA基因前体的区域(tRF-1)。因此,仅考虑映射到这些特定位置的tRF用于进一步分析。

If the tRFs were a result of the random degradation of tRNA transcript, the ends of the tRFs would be equally distributed along the lengths of the tRNA genes with comparable frequency. However, the small RNA mostly (>90% of total mapped reads on individual tRNA) mapped on three specific regions: extreme 5’ end (tRF-5), extreme 3’ end (tRF-3) of mature tRNA and 3’ trailer region (tRF-1) of primary tRNA genes. Therefore tRFs mapped only to these specific locations were considered for further analysis.

如下图所示,对于每个 tRF 都存在一种丰富特别高的 sRNA read(例如tRF-5“GCATTGGTGGTTCAGTGGTAGA”以8258读数/百万份进行测序),占到映射到该位点的读数的80%以上。这将主 tRF 与 其他低丰度的核酸酶降解产物区分开,或者可能来自 tRNA 和 tRF 的随机降解。最后,数据库中仅包含高度丰富的tRF(至少一个库中每百万> 20个读数)。

As shown in Figure for each tRF there is one most abundant RNA sequenced (example the tRF-5 “GCATTGGTGGTTCAGTGGTAGA” was sequenced at 8258 reads per million) accounting for more than 80% of the reads mapping to that site. This distinguishes the main tRF from other low abundance products created by nucleases digesting the main tRF, or possibly from random degradation of tRNAs and tRFs. Finally, only highly abundant tRFs (>20 reads per million in at least one library) were included in the database.

---------本文结束,感谢您的阅读---------