Next-generation diagnostics and disease-gene discovery with the Exomiser

Exomiser summary

Limitation

  • 仅能分析基因 coding region 和 exon-intron 剪接点附近变异

prioritization

PHIVE, PhenoDigm algorithm

支持数据

  • 小鼠 phenotype data from MGD, IMPC
  • HPO

####算法

  1. Ontologies

  2. Pairwise alignment of ontology concepts with OWLSim

    • $$ sim_j(p,q) = |\alpha^p \cap \alpha^q| / |\alpha^p \cup \alpha^q| $$

      Jaccard Index

    • $$ IC_{(concept)} = -log2(item_{concept}/item) $$

      IC is calculated for the Least Common Subsuming (LCS) phenotype of the pair of concepts

  3. Determining phenotype similarity score estimation

    • $$ maxScore(Q, D) = max(score(i, j)) $$

      i = 1 … m for Query, j = 1 … n for Disease

    • $$ avgScore(Q, D) = \frac {\displaystyle \sum_{i=1}^m max(score(i, j), j=1…n) + \displaystyle \sum_{j=1}^n max(score(i, j), i=1…m)}{m + n} $$

  4. Dvaluated the performance of recalling known disease–gene or model–disease associations use the combinedPercentageScore

    • $$ maxPercentageScore(a, b) = \frac {maxScore(a, b)} {maxScore(a, optimal match of a)} $$
    • $$ avgPercentageScore(a, b) = \frac {avgScore(a, b)} {avgScore(a, optimal match of a)} $$
    • $$ combinedPercentageScore = avg(maxPercentageScore(a, b), avgPercentageScore(a, b)) $$

####限制

  • 突变基因须存在对应的小鼠模型
  • 需要整合模式动物 MP 和 HPO,生成包含 MP 和 HPO 对应信息的 Ontologies,该步骤用到词义匹配
  • http://purl.obolibrary.org/obo/uberon/basic.obo

PhenIX, Phenomizer algorithm

支持数据

  • HPO 数据库中与 OMIM 数据库关联数据

  • 预计算每个 HPO 项对应的 IC(Information Content)

    计算每个 HPO term 在数据库中关联的 4813 中遗传病中出现的频率,然后去负对数得到每个 term 的 IC

算法

  • Information content: IC

    the negative natural logarithm of the frequency

  • similarity: MICA

    The similarity between two terms can be calculated as the IC of their most informative common ancestor (MICA)

  • HPO input as Query

    $$ sim(Q \rightarrow D) = avg[\displaystyle \sum_{q1 \in Q}\ max_{d1 \in D}(IC(MICA(q1, d1)))] $$

  • symmetric similarity

    $$ sim_{symmetric}(Q,D) = (sim(Q\rightarrow D) + sim(D\rightarrow Q)/2) $$

  • p Value Calculation

    The p values are estimated by Monte Carlo random sampling and corrected for multiple testing by the method of Benjamini and Hochberg

    Query term 超过 10 个时候,为了减少计算量,全部 downsample 到 10 个 query 计算 pvalue;

    基于以上逻辑,该部分工作为完全重复性工作,预先计算完成。

限制

  • 仅能分析已知的孟德尔疾病

ExomeWalker

支持数据

  • 需要输入临床疾病对应的已知致病基因
  • protein-protein interaction data from STRING(score >= 0.7)

算法

  • PPI, protein-protein interaction network

    使用 STRING 数据库中 score >= 0.7 的互作关系

  • Disease-gene families,导致相似临床表型的一系列基因,分析中由用户输入

  • Variant score

  1. 根据基因频率添加一个 0-1 得分,分别对应基因频率于 2%-0%,MAF>2% 对应 0。1% 以上的 variant 被丢弃
  2. 预测的致病性结果,计算一个 致病性得分 被计算
    • 位于非编码区域和非剪接点的 脱靶变异 得分为0,同时被丢弃;
    • 非同义突变得分为 SIFT 或 MutationTaster 预测得分;
    • gene 和疾病的关系从 OMIM 数据库中提取;
    • 上述三种方式都未提及的 variant,致病性得分被指定为 0.6.
  3. 最终变异优先级评分由 频率得分 * 致病性得分 得到
  • Gene score, Random walk analysis

    弃疗 ……

  • ExomeWalker score

  1. combination of the random walk score and the best scoring variant in that gene

    AR inheritance, variant score of gene = avg(top 2 of variant score)

  2. Logistic regression on a training set of 20000 disease variants and 20 000 benign variants

  3. 10-fold cross validation was used to train and test the model

  4. This final ExomeWalker score gives a measure from 0 to 1

限制

  • 需输入疾病已知致病基因
  • 通常针对异质性较高的疾病,用于发现新的致病基因位点

hiPHIVE

使用 PhenoDigm 算法分析斑马鱼突变体表型数据,与上述三种方法结果进行整合,最总结果用于 vairant rank

Information content

IC( information content ) 原始定义文献参考 10.1613/jair.514,首先构建如图 Ontologies ,根据出现频率负对数的定义计算每一项对应的 IC,其中 DOCTOR1 和 NURSE1 分别对应的 IC 是 9.093, 12.94,两者的相似性定义为最近共同祖先的 IC,即为 8.844。

IC 算法应用到人类疾病中,对临床表型和已知疾病表型进行相似性计算,详细逻辑原理体现如图:

文献

  • Smedley, D., Jacobsen, J. O. B., Jäger, M., Köhler, S., Holtgrewe, M., Schubach, M., … Robinson, P. N. (2015). Next-generation diagnostics and disease-gene discovery with the Exomiser. Nature Protocols, 10(12), 2004–2015. https://doi.org/10.1038/nprot.2015.124
  • Smedley, D., Oellrich, A., Köhler, S., Ruef, B., Westerfield, M., Robinson, P., … Mungall, C. (2013). PhenoDigm: Analyzing curated annotations to associate animal models with human diseases. Database. https://doi.org/10.1093/database/bat025
  • Köhler, S., Schulz, M. H., Krawitz, P., Bauer, S., Dölken, S., Ott, C. E., … Robinson, P. N. (2009). Clinical Diagnostics in Human Genetics with Semantic Similarity Searches in Ontologies. American Journal of Human Genetics, 85(4), 457–464. https://doi.org/10.1016/j.ajhg.2009.09.003
  • Resnik, P. (1999). Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11, 95–130. https://doi.org/10.1613/jair.514
  • Smedley, D., Köhler, S., Czeschik, J. C., Amberger, J., Bocchini, C., Hamosh, A., … Robinson, P. N. (2014). Walking the interactome for candidate prioritization in exome sequencing studies of Mendelian diseases. Bioinformatics. https://doi.org/10.1093/bioinformatics/btu508
---------本文结束,感谢您的阅读---------