KNN (k-nearest neighbor) 算法学习

Posted on 2020-01-05 | In learning

KNN 是一种分类和回归算法，利用训练集数据计算特征向量空间距离，根据距离最近的 k 个训练实例，根据分类决策规则对输入实例进行分类，算法不具备显示的学习过程。knn 算法三个基本要素为距离算法、k值选择、分类决策规则

距离算法

空间中两个实例点之间的距离，可反应两个实例点之间的相似程度。通常计算为闵氏距离(Minkowski distance)，如下：

$$L_{p}(x_i, x_y) = {(\sum_{l=1}^n |x_i^{(l)} - x_J^{(l)}|^p)}^\frac{1}{p}$$

p==1时，即为曼哈顿距离(Manhattan Distance)：

$$L_1(x_i, x_j) = \sum_{l=1}^n |x_i^{(l)} - x_J^{(l)}|$$
p==2 时，即为欧式距离(Euclidean Distance)：

$$L_2(x_i, x_j) = {(\sum_{l=1}^n |x_i^{(l)} - x_J^{(l)}|^2)}^\frac{1}{2}$$
p = $\infty$ 时，即为切比雪夫距离(Chebyshev distance)

$$L_{\infty}(x_i, x_y) = \lim_{p \to \infty} {(\sum_{l=1}^n |x_i^{(l)} - x_J^{(l)}|^p)}^\frac{1}{p} = max_l(|x_i^{(l)} - x_J^{(l)}|)$$

K值选择

K 值小，近似误差减少，估计误差增大，模型复杂。预测结果对邻近的实例敏感，容易受极端值影响
K 值大，估计误差减小，近似误差增大，模型简单。假设 K 等于训练集 N，预测结果始终属于训练集中最多的类。
一般选取较小的 K 值，交叉验证选取最优 k 值

决策规则

knn 的分类规则一般时多数表决，即由实例的 k 个相邻训练实例中的多数决定

Next-generation diagnostics and disease-gene discovery with the Exomiser

Posted on 2019-12-31 | In read

Exomiser summary

Limitation

仅能分析基因 coding region 和 exon-intron 剪接点附近变异

prioritization

PHIVE, PhenoDigm algorithm

支持数据

小鼠 phenotype data from MGD, IMPC
HPO

####算法

Ontologies
Pairwise alignment of ontology concepts with OWLSim
- $$ sim_j(p,q) = |\alpha^p \cap \alpha^q| / |\alpha^p \cup \alpha^q| $$
  
  Jaccard Index
- $$ IC_{(concept)} = -log2(item_{concept}/item) $$
  
  IC is calculated for the Least Common Subsuming (LCS) phenotype of the pair of concepts
Determining phenotype similarity score estimation
- $$ maxScore(Q, D) = max(score(i, j)) $$
  
  i = 1 … m for Query, j = 1 … n for Disease
- $$ avgScore(Q, D) = \frac {\displaystyle \sum_{i=1}^m max(score(i, j), j=1…n) + \displaystyle \sum_{j=1}^n max(score(i, j), i=1…m)}{m + n} $$
Dvaluated the performance of recalling known disease–gene or model–disease associations use the combinedPercentageScore
- $$ maxPercentageScore(a, b) = \frac {maxScore(a, b)} {maxScore(a, optimal match of a)} $$
- $$ avgPercentageScore(a, b) = \frac {avgScore(a, b)} {avgScore(a, optimal match of a)} $$
- $$ combinedPercentageScore = avg(maxPercentageScore(a, b), avgPercentageScore(a, b)) $$

####限制

突变基因须存在对应的小鼠模型
需要整合模式动物 MP 和 HPO，生成包含 MP 和 HPO 对应信息的 Ontologies，该步骤用到词义匹配
http://purl.obolibrary.org/obo/uberon/basic.obo

PhenIX, Phenomizer algorithm

支持数据

HPO 数据库中与 OMIM 数据库关联数据
预计算每个 HPO 项对应的 IC(Information Content)

计算每个 HPO term 在数据库中关联的 4813 中遗传病中出现的频率，然后去负对数得到每个 term 的 IC

算法

Information content: IC

the negative natural logarithm of the frequency
similarity: MICA

The similarity between two terms can be calculated as the IC of their most informative common ancestor (MICA)
HPO input as Query

$$ sim(Q \rightarrow D) = avg[\displaystyle \sum_{q1 \in Q}\ max_{d1 \in D}(IC(MICA(q1, d1)))] $$
symmetric similarity

$$ sim_{symmetric}(Q,D) = (sim(Q\rightarrow D) + sim(D\rightarrow Q)/2) $$
p Value Calculation

The p values are estimated by Monte Carlo random sampling and corrected for multiple testing by the method of Benjamini and Hochberg

Query term 超过 10 个时候，为了减少计算量，全部 downsample 到 10 个 query 计算 pvalue；

基于以上逻辑，该部分工作为完全重复性工作，预先计算完成。

限制

仅能分析已知的孟德尔疾病

ExomeWalker

支持数据

需要输入临床疾病对应的已知致病基因
protein-protein interaction data from STRING(score >= 0.7)

算法

PPI， protein-protein interaction network

使用 STRING 数据库中 score >= 0.7 的互作关系
Disease-gene families，导致相似临床表型的一系列基因，分析中由用户输入
Variant score

根据基因频率添加一个 0-1 得分，分别对应基因频率于 2%-0%，MAF>2% 对应 0。1% 以上的 variant 被丢弃
预测的致病性结果，计算一个 致病性得分 被计算
- 位于非编码区域和非剪接点的脱靶变异得分为0，同时被丢弃；
- 非同义突变得分为 SIFT 或 MutationTaster 预测得分；
- gene 和疾病的关系从 OMIM 数据库中提取；
- 上述三种方式都未提及的 variant，致病性得分被指定为 0.6.
最终变异优先级评分由 频率得分 * 致病性得分 得到

Gene score, Random walk analysis

弃疗 ……
ExomeWalker score

combination of the random walk score and the best scoring variant in that gene

AR inheritance, variant score of gene = avg(top 2 of variant score)
Logistic regression on a training set of 20000 disease variants and 20 000 benign variants
10-fold cross validation was used to train and test the model
This final ExomeWalker score gives a measure from 0 to 1

限制

需输入疾病已知致病基因
通常针对异质性较高的疾病，用于发现新的致病基因位点

hiPHIVE

使用 PhenoDigm 算法分析斑马鱼突变体表型数据，与上述三种方法结果进行整合，最总结果用于 vairant rank

Information content

IC( information content ) 原始定义文献参考 10.1613/jair.514，首先构建如图 Ontologies ，根据出现频率负对数的定义计算每一项对应的 IC，其中 DOCTOR1 和 NURSE1 分别对应的 IC 是 9.093, 12.94，两者的相似性定义为最近共同祖先的 IC，即为 8.844。

IC 算法应用到人类疾病中，对临床表型和已知疾病表型进行相似性计算，详细逻辑原理体现如图：

文献

Smedley, D., Jacobsen, J. O. B., Jäger, M., Köhler, S., Holtgrewe, M., Schubach, M., … Robinson, P. N. (2015). Next-generation diagnostics and disease-gene discovery with the Exomiser. Nature Protocols, 10(12), 2004–2015. https://doi.org/10.1038/nprot.2015.124
Smedley, D., Oellrich, A., Köhler, S., Ruef, B., Westerfield, M., Robinson, P., … Mungall, C. (2013). PhenoDigm: Analyzing curated annotations to associate animal models with human diseases. Database. https://doi.org/10.1093/database/bat025
Köhler, S., Schulz, M. H., Krawitz, P., Bauer, S., Dölken, S., Ott, C. E., … Robinson, P. N. (2009). Clinical Diagnostics in Human Genetics with Semantic Similarity Searches in Ontologies. American Journal of Human Genetics, 85(4), 457–464. https://doi.org/10.1016/j.ajhg.2009.09.003
Resnik, P. (1999). Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11, 95–130. https://doi.org/10.1613/jair.514
Smedley, D., Köhler, S., Czeschik, J. C., Amberger, J., Bocchini, C., Hamosh, A., … Robinson, P. N. (2014). Walking the interactome for candidate prioritization in exome sequencing studies of Mendelian diseases. Bioinformatics. https://doi.org/10.1093/bioinformatics/btu508

ggplot2 控制 x, y 轴排序

Posted on 2018-10-20 | Edited on 2019-04-21 | In R

GO Scatter plot

# 数据读入
df <- read.table('b63c49b49d7ba1e7.xls',header=TRUE,sep="\t",fill=TRUE)
# 提取富集结果前 30 项目进行绘图展示
df <- head(df, 30)
head(df)

default order

# 根据 label 顺序进行排序
gplot(df, aes(rich_factor, Description)) +
  geom_point(aes(colour=-log10(qvalue), size=gene_number)) +
  scale_colour_gradientn(colours=c('#436EEE','#FF0000'), guide = "colourbar") +
  ggtitle("Statistics of Pathway Enrichment") +
  xlab("Rich factor") +ylab("")

order labels

# 自定义顺序，根据 rich_factor 从大到小顺序排列
df$Description <- factor(df$Description, levels=df$Description[order(df$rich_factor, decreasing=FALSE)])
gplot(df, aes(rich_factor, Description)) +
  geom_point(aes(colour=-log10(qvalue), size=gene_number)) +
  scale_colour_gradientn(colours=c('#436EEE','#FF0000'), guide = "colourbar") +
  ggtitle("Statistics of Pathway Enrichment") +
  xlab("Rich factor") +ylab("")

Literature MSI

Posted on 2018-10-07 | Edited on 2019-04-21 | In cancer

MSI

A change that occurs in the DNA of certain cells (such as tumor cells) in which the number of repeats of microsatellites (short, repeated sequences of DNA) is different than the number of repeats that was in the DNA when it was inherited. The cause of microsatellite instability may be a defect in the ability to repair mistakes made when DNA is copied in the cell. Also called MSI.[^1]
MSI is defined as a difference in length of microsatellites (DNA nucleotide repeats) when comparing tumor DNA with normal DNA from the same individual.
Microsatellite sequences (MS) are repeating stretches of DNA located throughout the entire genome. MS are one to six base pairs long and are repeated many thousand times. [^2]

MSI detection

MSI-PCR[^2]

patient tumor and normal DNA for five mononucleotide microsatellite loci with PCR

MSI high (MSI-H) is present if they differ in at least two of the markers and MSI low (MSI-L) is present if they differ in one of the markers.

MS are identical in each cell of an individual, normal and malignant, condition referred to as microsatellite stability (MSS).

MMR-IHC[^3]

These sites are prone to DNA replication errors as a result of DNA polymerase slippage, which is effectively corrected through the mismatch repair (MMR) system.
detecting MMR protein expression status has been IHC for MLH1, MSH2, PMS2, and MSH6 expression.

Tumors were MMR proficient if all four proteins were expressed (retained) by IHC and MMR deficient (MMR-D) if any of the four proteins was not expressed (lost) by IHC.

NGS

mSINGs[^4]
MSIsensor[^5]
MANTIS[^6]
MSIseq[^7]
Count altered MS loci per Mb[^8]

Landscape of MSI

Fig 1. Prevalence of microsatellite instability (MSI) across 39 human cancer types.[^9]

MSI and TMB

Both studies[^3][^9] looked at mutational burden and found that MSI tumors have a significantly
higher mutational burden compared with microsatellite stable (MSS) tumors.[^10]
The vast majority (85%) of MSI-High cases also had high TMB, but the converse was not true. TMB and MSI status in cancer genomes are correlated and that the majority of MSI-High cases also had high overall TMB.[^11]
Thirty percent of MSI-H cases were TMB-low, and only 26% of MSI-H cases were PD-L1 positive. The overlap between TMB, MSI, and PD-L1 differed among cancer types.[^8]

ref

[^1]: url: https://www.cancer.gov/publications/dictionaries/cancer-terms/def/microsatellite-instability

python 高效处理 bam 等生物格式 —— pysam

Posted on 2018-09-15 | In python

Pysam 打包 htslib C 代码，处理 bam 等生物格式文件的 python 工具包，文档参考：https://pysam.readthedocs.io/en/latest/api.html

install

pip search pysam
# conda search pysam
pip install pysam
# conda install pysam

# pysam 打包了 htslib 提供了一个 python 接口用来处理 bam, vcf 等常用生物格式

解析sam

import pysam
fp = pysam.AlignmentFile('bam', 'rb')

fout = pysam.AlignmentFile("allpaired.bam", "wb", template=fp)
for read in fp.fetch():
    if read.is_paired:
        fout.write(read)
    # dir(read)
    # 查看 read 对象的全部方法或属性
                         
# 特定位置 reads 读取，bam文件需要index
for pileupcolumn in fp.fetch('chr1', 100, 120):
    for read in pileupcolumn.pileups:
        print(read)
        # ...

fout.close()
fp.close()

解析fasta

fp = pysam.FastaFile('fasta')
# 提取序列, 0-based，fasta文件需要index
seq = fp.fetch(reference=chro, start=start, end=end)
fp.close()

# 也可以使用with ... as ... 上下文管理
with pysam.FastaFile('fasta') as fp:
    fp.fetch(reference=chro, start=start, end=end)

解析VCF

fp = pysam.VariantFile('vcf')

# 获得vcf包含的所有sample lst
samples = list((fp.header.samples))

for rec in bcf_in.fetch():
# fp.fetch('chr1', 100000, 200000) 获取特定位置区间内的vairant，index后的vcf文件可用
    print(rec)
	# dir(rec) 查看 rec 的所有方法属性
fp.colse()

identification of tRFs from sRNA deep-sequencing data

Posted on 2018-09-02 | Edited on 2020-01-07

methods

参考 http://genome.bioch.virginia.edu/trfdb/method.php，详细方法描述如下：

tRNA 基因信息从 Genomic tRNA database 或者 NCBI 下载，根据注释基因组版本，tRNA 基因链关系，在成熟 rRNA 基因上下游分别延申 100 bp 提取序列

Information about the tRNA genes for species R. sphaeroides (Bacteria; ATCC_17025), S. pombe (schiPomb1), Droshophila (dm3), C. elegans (ce6), Xenopus (xenTro3), zebra fish (Zv9), mouse (mm9) and human (hg19) was either downloaded from the “Genomic tRNA database“ (http://gtrnadb.ucsc.edu/) or “ NCBI“ (http://www.ncbi.nlm.nih.gov/). We extracted sequences that spanned 100bp upstream and 100bp downstream to the mature tRNA genes from the same genome assembly on which the tRNA gene coordinates were built. The genomic sequences were extracted based on the strand information of tRNA gene transcription.

物种特异的 blastdb 被构建，每一个 sRNA 通过 blastn 比对到对应的物种数据库。通常只有 100% 比对到上 db 数据的 sRNA read 进入后续分析步骤。

A species-specific tRNAdb blast database was built to query the small RNA sequences. To find the tRNA-related RNA sequences in each library, the small RNAs were mapped to the species-specific tRNAdb, using BLASTn (Altschul, Madden et al. 1997). In general we considered only those alignments where the query sequence (small RNA) was mapped to the database sequence (tRNA) along 100% of its length.

blast 输出结果通过 in-house 脚本得到 sRNA 比对的 tRNA 基因上的位置。Blastdb 数据库考虑了基因链信息，因此只有 blast 比对到相同链的 read 被考虑

Blast output file was parsed using an in-house developed script to get the information of mapped positions of small RNA on tRNA genes. Since custom tRNA database was built considering the strand information of tRNA gene therefore only those blast alignments were considered where queries mapped on to the positive strand of the subject sequence.

根据比对从 tRNA 第一个剪辑到最后一个碱基提取 sRNA 信息，最多允许1个错配。在 tRNA 成熟过程中 tRNA 核苷酸转移酶在 tRNA 3’末端添加了“CCA”，因此 tRNA 3‘ 末端的 sRNA 特异的允许不超过3个碱基的错配

We extracted the information for small RNAs aligned from its first to the last base with tRNA sequences, allowing either one or no mismatch. Since “CCA” is added at the 3’ end of tRNA by tRNA nucleotidyltransferase during maturation of tRNA (Xiong and Steitz 2006) therefore a special exception for the small RNA mapping to the 3’ ends of tRNAs in the tRNAdb was devised allowing a terminal mismatch of <=3 bases.

为了消除假阳性，比对到 tRNA blastdb 的 sRNA read，再次使用 blast 进行全基因组搜索，排除不在 tRNA 位点的 read。最后，只有那些只定位于 tRNA 位点的 read 被认定为可能的 tRFs

To eliminate any false positives, the small RNAs that mapped on to the “tRNAdb” were again searched against the whole genome database using blast search excluding the tRNA loci. Finally only those small RNAs were qualified as probable tRFs that mapped exclusively on tRNA loci.

映射在tRNA基因上的sRNA的末端用于获得tRNA上任何映射的sRNA的显着富集

The ends of the small RNA mapped on tRNA genes was used to access the significant enrichment of any mapped small RNA on tRNA.

如果 tRF 是 tRNA 转录物随机降解的产物，则tRF的末端将沿着tRNA基因的长度以相当的频率均等分布。然而，大多数 sRNA（在单个 tRNA 的总比对读数的90％以上）映射在三个特定区域：成熟 tRNA 的5’末端（tRF-5），3’末端（tRF-3）和 3’ tRNA基因前体的区域（tRF-1）。因此，仅考虑映射到这些特定位置的tRF用于进一步分析。

If the tRFs were a result of the random degradation of tRNA transcript, the ends of the tRFs would be equally distributed along the lengths of the tRNA genes with comparable frequency. However, the small RNA mostly (>90% of total mapped reads on individual tRNA) mapped on three specific regions: extreme 5’ end (tRF-5), extreme 3’ end (tRF-3) of mature tRNA and 3’ trailer region (tRF-1) of primary tRNA genes. Therefore tRFs mapped only to these specific locations were considered for further analysis.

如下图所示，对于每个 tRF 都存在一种丰富特别高的 sRNA read（例如tRF-5“GCATTGGTGGTTCAGTGGTAGA”以8258读数/百万份进行测序），占到映射到该位点的读数的80％以上。这将主 tRF 与其他低丰度的核酸酶降解产物区分开，或者可能来自 tRNA 和 tRF 的随机降解。最后，数据库中仅包含高度丰富的tRF（至少一个库中每百万> 20个读数）。

As shown in Figure for each tRF there is one most abundant RNA sequenced (example the tRF-5 “GCATTGGTGGTTCAGTGGTAGA” was sequenced at 8258 reads per million) accounting for more than 80% of the reads mapping to that site. This distinguishes the main tRF from other low abundance products created by nucleases digesting the main tRF, or possibly from random degradation of tRNAs and tRFs. Finally, only highly abundant tRFs (>20 reads per million in at least one library) were included in the database.

parse blast XML use python

Posted on 2018-09-02 | Edited on 2020-01-07 | In bioinfor

2017.05

blast 输出 xml 格式结果，包括最完整的比对信息，经常需要自己从 blast.xml 结果中提取想要的信息，最开始都自己循环 xml 一行行匹配，一点都不 pythonic，陆续发现了一些新的 xml 解析实现，整理如下：

re.search

def parse_BLAST_XML(infile):
    get = lambda x: re.search('>.*?<', x).group(0)[1:-1]

    r_query_id = re.compile("<Iteration_query-ID>")
    r_query_def = re.compile("<Iteration_query-def>")
    r_query_len = re.compile("<Iteration_query-len>")
    r_hit_num = re.compile("<Hit_num>")
    r_hit_id = re.compile("<Hit_id>")
    r_hit_def = re.compile("<Hit_def>")
    r_hit_accession = re.compile("<Hit_accession>")
    r_hit_len = re.compile("<Hit_len>")
    r_hsp_num = re.compile("<Hsp_num>")
    r_hsp_bit_score = re.compile("<Hsp_bit-score>")
    r_hsp_score = re.compile("<Hsp_score>")
    r_hsp_evalue = re.compile("<Hsp_evalue>")
    r_hsp_query_from = re.compile("<Hsp_query-from>")
    r_hsp_query_to = re.compile("<Hsp_query-to>")
    r_hsp_hit_from = re.compile("<Hsp_hit-from>")
    r_hsp_hit_to = re.compile("<Hsp_hit-to>")
    r_hsp_query_frame = re.compile("<Hsp_query-frame>")
    r_hsp_hit_frame = re.compile("<Hsp_hit-frame>")
    r_hsp_identity = re.compile("<Hsp_identity>")
    r_hsp_positive = re.compile("<Hsp_positive>")
    r_hsp_gaps = re.compile("<Hsp_gaps>")
    r_hsp_align_len = re.compile("<Hsp_align-len>")
    r_hsp_qseq = re.compile("<Hsp_qseq>")
    r_hsp_hseq = re.compile("<Hsp_hseq>")
    r_hsp_midline = re.compile("<Hsp_midline>")

    for line in open(infile,'r'):
        if r_query_id.search(line) is not None:
            query_id =  get(line)
        elif r_query_def.search(line) is not None:
            query_def = get(line)
        elif r_query_len.search(line) is not None:
            query_len = get(line)
        elif r_hit_num.search(line) is not None:
            hit_num = get(line)
        elif r_hit_id.search(line) is not None:
            hit_id = get(line)
        elif r_hit_def.search(line) is not None:
            hit_def = get(line)

        elif r_hit_accession.search(line) is not None:
            hit_accession = get(line)
        elif r_hit_len.search(line) is not None:
            hit_len = get(line)
        elif r_hsp_num.search(line) is not None:
            hsp_num = get(line)
        elif r_hsp_bit_score.search(line) is not None:
            hsp_bit_score = get(line)
        elif r_hsp_score.search(line) is not None:
            hsp_score = get(line)
        elif r_hsp_evalue.search(line) is not None:
            hsp_evalue = get(line)
        elif r_hsp_query_from.search(line) is not None:
            hsp_query_from = get(line)
        elif r_hsp_query_to.search(line) is not None:
            hsp_query_to = get(line)
        elif r_hsp_hit_from.search(line) is not None:
            hsp_hit_from = get(line)
        elif r_hsp_hit_to.search(line) is not None:
            hsp_hit_to = get(line)
        elif r_hsp_query_frame.search(line) is not None:
            hsp_query_frame = get(line)
        elif r_hsp_hit_frame.search(line) is not None:
            hsp_hit_frame = get(line)
        elif r_hsp_identity.search(line) is not None:
            hsp_identity = get(line)
        elif r_hsp_positive.search(line) is not None:
            hsp_positive = get(line)
        elif r_hsp_gaps.search(line) is not None:
            hsp_gaps = get(line)
        elif r_hsp_align_len.search(line) is not None:
            hsp_align_len = get(line)
        elif r_hsp_qseq.search(line) is not None:
            hsp_qseq = get(line)
        elif r_hsp_hseq.search(line) is not None:
            hsp_hseq = get(line)
        elif r_hsp_midline.search(line) is not None:
            hsp_midline = get(line)

            percent_ident = "%5.2f" %  (((float(hsp_identity)/float(hsp_align_len))*100))
            mismatch = str(int(hsp_align_len) - int(hsp_identity))
            # ['query id', 'query name', 'subject id', 'identity(%)', 'alignment length', 'mismatches', 'gap opens', 'q. start', 'q. end', 's. start', 's. end', 'evalue', 'bit score', 'subject description']
            yield [query_id, query_def, hit_id, percent_ident, hsp_align_len, mismatch, hsp_gaps, hsp_query_from, hsp_query_to, hsp_hit_from, hsp_hit_to, hsp_evalue, hsp_bit_score, hit_def]

Bio.Blast.NCBIXML.parse

from __future__ import division
from Bio.Blast import NCBIXML
# biopython

def parse_blast_xml(result, threshold=1e-5):
	'''
		result: opened file handle of xml format
	'''
    for blastrecords in NCBIXML.parse(result):
        query_id, _, query_descri = blastrecords.query.partition(' ')
        query_len = blastrecords.query_length
        for align in blastrecords.alignments:
            sbjct_title = align.title
            sbjct_accession = align.accession
            sbjct_descri = align.hit_def
            sbjct_len = align.length
            for hsp in align.hsps:
                hsp_evalue = hsp.expect
                if hsp_evalue > threshold:
                    continue
                hsp_score = hsp.score
                hsp_gap = hsp.gaps
                hsp_len = hsp.align_length
                hsp_match = hsp.identities
                hsp_frame = hsp.frame    # nucl blast to protein database
                hsp_strand = hsp.strand    # nucl2prot
                query_seq = hsp.query
                query_start = hsp.query_start
                query_end = hsp.query_end
                sbjct_seq = hsp.sbjct
                sbjct_start = hsp.sbjct_start
                sbjct_end = hsp.sbjct_end
                tmp = [
                    query_id,
                    query_descri,
                    query_seq,
                    query_len,
                    # hsp_frame,
                    # hsp_strand,
                    sbjct_accession,
                    # sbjct_title,
                    sbjct_descri,
                    sbjct_seq,
                    sbjct_len,
                    hsp_match / hsp_len,
                    hsp_len,
                    hsp_len - hsp_match,
                    hsp_gap,
                    query_start,
                    query_end,
                    sbjct_start,
                    sbjct_end,
                    hsp_evalue,
                    hsp_score,
                ]
                # qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore
                yield tmp

other python modules

1
2
3

# http://pythonguidecn.readthedocs.io/zh/latest/scenarios/xml.html
# http://www.runoob.com/python/python-xml.html
# 懒得再去看各种文档，挖坑...

Whole Genome Amplification Technologies:MALBAC

Posted on 2018-08-04 | Edited on 2019-04-21

Technical principles

A single cell is picked and lysed. First, genomic DNA of the single cell is melted into single-stranded DNA molecules at 94°C.

Picograms of DNA fragments (~10 to 100 kb)
MALBAC primers then anneal randomly to single-stranded DNA molecules at 0°C and are extended by a polymerase with displacement activity at elevated temperatures, creating semiamplicons.

random primers, each having a common 27-nucleotide sequence and 8 variable nucleotides
In the following five temperature cycles, after the step of looping the full amplicons, singlestranded amplicons and the genomic DNA are used as a template to produce full amplicons and additional semiamplicons, respectively.
For full amplicons, the 3′ end is complementary to the sequence on the 5′ end. The two ends hybridize to form looped DNA, which can efficiently prevent the full amplicon from being used as a template, therefore warranting a close-to-linear amplification.
After the five cycles of linear preamplification, only the full amplicons can be exponentially amplified in the following PCR using the common 27- nucleotide sequence as the primer. PCR reaction will generate microgram level of DNA material for sequencing experiments.

Advantages

quasi-linear amplification
accuracy in CNV detection, especially after signal normalization with a reference cell

###Disadvantages

MALBAC is not free from sequence-dependent bias
MALBAC is that it has a high false positive rate for SNV detection
underamplified regions of the genome are sometimes lost during amplification

ref

单细胞全基因组扩增技术: MALBAC文章全文翻译
Zong, Chenghang, et al. “Genome-Wide Detection of Single-Nucleotide and Copy-Number Variations of a Single Human Cell.” Science, vol. 338, no. 6114, 2012, pp. 1622–1626.
L, Huang, et al. “Single-Cell Whole-Genome Amplification and Sequencing: Methodology and Applications.” Annual Review of Genomics and Human Genetics, vol. 16, no. 1, 2015, pp. 79–102.

Single-cell isoform RNA sequencing (ScISOr-Seq) across thousands of cells reveals isoforms of cerebellar cell types

Posted on 2018-07-31 | Edited on 2019-04-21 | In read

三代平台喜获单细胞测序技能

Methods

we employ microfluidics to produce amplified full-length cDNAs barcoded for their cell of origin
one pool for 3’ sequencing to measure gene expression(Pacific, Nanopore)
another pool for long-read sequencing and isoform expression(10xGenomics)

Results

a.expression

We sequence a mean of 17,885 reads per cell
After filtering cells and considering only reads confidently mapped to genes, we have 3,875 unique molecular identifiers (UMIs) and 1,448 genes per cell during 3’end sequencing
6,627 cells were clustered into 17 groups
To assess stability of clusters, we tripled Illumina sequencing depth for rep2

b.Full-length cDNAs

generate ~5.2 million PacBio circular consensus reads
cellular barcodes are located close to the polyA-tail, we first searched for polyA-tails
chimeras, which were removed from further analysis
for 58.0% (compared to 74.0% for 10x-3’seq) of the polyA-tail-containing CCS, we identified a perfect-match 16mer cellular barcode

c.ScISOr-Seq using Nanopore sequencing

Using 1µg of barcoded cDNA
~31.4% (1D) and ~35.2% (passed 1D2) of Nanopore reads have a 30bp window with >=25 Ts
nanopore 需要比 pacbio 更高的样品起始量，nanopore 数据错误率更高

Question

三代数据仍然只是用于转录本组装、注释，需要二代数据定量。分析逻辑约等于三代注释，cell特异转录本，外加二代scRNAseq
single cell full-length cDNAs 带 barcode 扩增原理？？？

从 p.HGVS 到基因组变异位置

Posted on 2018-07-19 | Edited on 2019-04-21 | In bioinfor

transcvar web 版

transcvar 本地化

# https://github.com/zwdzwd/transvar
pip search transvar
pip install transvar --user

transvar config --download_ref --refversion hg19
transvar config --download_anno --refversion hg19

head HGVS_cds
# MLH1:c.100G>A
# MLH1:c.790+1G>A
# VHL:c.1_17del17
# VHL:c.136G>T

transvar canno --refversion hg19 --refseq -l HGVS_cds --oneline >HGVS_cds.transvar.txt

head HGVS_cds.transvar.txt
# input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
# MLH1:c.100G>A	NM_001167618 (protein_coding)	MLH1	+	chr3:g.37059029G>A/c.100G>A/p.A34T	inside_[cds_in_exon_10]	CSQN=Missense;reference_codon=GCC;alternative_codon=ACC;dbxref=GeneID:4292,HGNC:7127,HPRD:00390,MIM:120436;aliases=NP_001161090;source=RefSeq
# MLH1:c.790+1G>A	NM_001258271 (protein_coding)	MLH1	+	chr3:g.37056036G>A/c.790+1G>A/.	inside_[intron_between_exon_9_and_10]	CSQN=IntronicSNV;dbxref=GeneID:4292,HGNC:7127,HPRD:00390,MIM:120436;aliases=NP_001245200;source=RefSeq
# VHL:c.1_17del17	NM_198156 (protein_coding)	VHL	+	chr3:g.10183532_10183548del17/c.1_17del17/.	inside_[cds_in_exon_1]	CSQN=CdsStartDeletion;left_align_gDNA=g.10183529_10183545del17;unaligned_gDNA=g.10183532_10183548del17;left_align_cDNA=c.1-3_14del17;unalign_cDNA=c.1_17del17;cds_start_at_chr3:10183532_lost;dbxref=GeneID:7428,HGNC:12687,HPRD:01905,MIM:608537;aliases=NP_937799;source=RefSeq

类似的也可以从 c.HGVS、g.HGVS 开始获得完整注释

本地化数据库配置

软件提供数据库路径为：http://transvar.info/transvar_user/annotations/，本地数据库配置详细如下：