Journal: International Journal of Modern Physics C, Physics and Computers

HS3D, a dataset of Homo Sapiens Splice regions, and its extraction procedure from a major public database


Abstract

This work describes a procedure for cleaning GenBank data to produce material for training computational approaches to gene characterization and for assessing their prediction accuracy. The procedure, GenBank2HS3D, produces a dataset (HS3D, Homo Sapiens Splice Sites Dataset) of Homo sapiens splice regions extracted from GenBank (Rel. 123 at this time). From the complete GenBank Primate Division it selects entries of human nuclear DNA according to several assessed criteria, then extracts exons and introns from these entries (currently 4523 + 3802). Donor and acceptor sites are then extracted as windows of 140 nucleotides around each splice site (3799 + 3799). Windows are discarded if they do not include canonical GT-AG junctions (65 + 74), contain insufficient data (not enough material for a 140-nucleotide window) (686 + 589), include non-ACGT bases (29 + 30), or are redundant (218 + 226); the remaining windows (2796 + 2880) are reported in the dataset. Finally, windows of false splice sites are selected by searching for canonical GT-AG pairs in non-splicing positions (271937 + 332296). False sites within +/- 60 positions of a true splice site are marked as proximal. HS3D, release 1.2 at this time, is available from the Web server of the University of Sannio: http://www.sci.unisannio.it/docenti/rampone/. [References: 29]
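
The donor-side filtering and false-site search described in the abstract can be illustrated with a minimal Python sketch. This is not the GenBank2HS3D code: the function names, the 0-based `junction` index (first intron base), and the placement of the junction at the centre of the 140-nucleotide window are all assumptions made for illustration, and the false-site search here reports positions only rather than full windows.

```python
WINDOW = 140          # window length stated in the abstract
HALF = WINDOW // 2    # assumption: the junction sits at the window centre
PROXIMAL = 60         # +/- range stated in the abstract for "proximal" false sites


def classify_donor(seq, junction, seen):
    """Return (window, reason) for one annotated donor site.

    `junction` is the hypothetical 0-based index of the first intron base.
    `reason` is "kept" or the name of the filter that discarded the window.
    Filters are applied in the order the abstract lists them.
    """
    seq = seq.upper()
    if seq[junction:junction + 2] != "GT":
        return None, "non-canonical"
    start, end = junction - HALF, junction + HALF
    if start < 0 or end > len(seq):
        return None, "insufficient data"
    window = seq[start:end]
    if set(window) - set("ACGT"):
        return window, "non-ACGT base"
    if window in seen:
        return window, "redundant"
    seen.add(window)
    return window, "kept"


def false_donor_sites(seq, true_junctions):
    """Yield (position, is_proximal) for every GT that is not a true donor site."""
    seq = seq.upper()
    true_set = set(true_junctions)
    pos = seq.find("GT")
    while pos != -1:
        if pos not in true_set:
            # A false site is proximal if it lies within +/- 60 of a true site.
            proximal = any(abs(pos - t) <= PROXIMAL for t in true_set)
            yield pos, proximal
        pos = seq.find("GT", pos + 1)
```

Acceptor sites would be handled symmetrically by testing for AG immediately before the junction; that variant is omitted here for brevity.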