Efficient case-based reasoning through feature weighting, and its application in protein crystallography.

Abstract

Data preprocessing is critical for machine learning, data mining, and pattern recognition. In particular, selecting relevant and non-redundant features in high-dimensional data is important to efficiently construct models that accurately describe the data. In this work, I present SLIDER, an algorithm that weights features to reflect relevance in determining similarity between instances. Accurate weighting of features improves the similarity measure, which is useful in learning algorithms like nearest neighbor and case-based reasoning. SLIDER performs a greedy search for optimum weights in an exponentially large space of weight vectors. Exhaustive search being intractable, the algorithm reduces the search space by focusing on pivotal weights at which representative instances are equidistant to truly similar and different instances in Euclidean space. SLIDER then evaluates those weights heuristically, based on effectiveness in properly ranking pre-determined matches of a set of cases, relative to mismatches.

I analytically show that by choosing feature weights that minimize the mean rank of matches relative to mismatches, the separation between the distributions of Euclidean distances for matches and mismatches is increased. This leads to a better distance metric, and consequently increases the probability of retrieving true matches from a database. I also discuss how SLIDER is used to improve the efficiency and effectiveness of case retrieval in a case-based reasoning system that automatically interprets electron density maps to determine the three-dimensional structures of proteins. Electron density patterns for regions in a protein are represented by numerical features, which are used in a distance metric to efficiently retrieve matching patterns by searching a large database. These pre-selected cases are then evaluated by more expensive methods to identify truly good matches. This strategy speeds up the retrieval of matching density regions, thereby enabling fast and accurate protein model-building. This two-phase case retrieval approach is potentially useful in many case-based reasoning systems, especially those with computationally expensive case matching and large case libraries.
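
The ideas in the abstract can be illustrated with a small Python sketch. It is not the thesis's implementation: the abstract does not give the exact weighting formula, so the distance here assumes each squared feature difference is multiplied by its weight, and a fixed grid of candidate weights stands in for SLIDER's pivotal weights. The names `weighted_distance`, `mean_match_rank`, `greedy_weight_search`, `two_phase_retrieve`, and `expensive_score` are all hypothetical, as is the toy data, in which feature 0 is informative and feature 1 is noise.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance (assumed form: sum of w_i * diff_i^2)."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def mean_match_rank(cases, weights):
    """Mean rank of each case's known match relative to its mismatches.

    Each case is (query, match, mismatches). Rank 1 means the match is
    strictly closer than every mismatch. Ties count against the match,
    so an all-zero weight vector is not rewarded.
    """
    total = 0
    for query, match, mismatches in cases:
        d_match = weighted_distance(query, match, weights)
        ahead = sum(1 for m in mismatches
                    if weighted_distance(query, m, weights) <= d_match)
        total += ahead + 1
    return total / len(cases)

def greedy_weight_search(cases, n_features,
                         candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Greedy coordinate-wise search for weights that lower the mean rank.

    Assumption: the fixed candidate grid replaces SLIDER's pivotal weights.
    """
    weights = [1.0] * n_features
    best = mean_match_rank(cases, weights)
    improved = True
    while improved:
        improved = False
        for i in range(n_features):
            for c in candidates:
                trial = weights[:i] + [c] + weights[i + 1:]
                score = mean_match_rank(cases, trial)
                if score < best:  # accept only strict improvements
                    weights, best, improved = trial, score, True
    return weights, best

def two_phase_retrieve(query, database, weights, expensive_score, k=2):
    """Phase 1: shortlist the k nearest cases by cheap weighted distance.
    Phase 2: pick the shortlisted case with the highest expensive_score."""
    shortlist = sorted(
        database, key=lambda c: weighted_distance(query, c, weights))[:k]
    return max(shortlist, key=lambda c: expensive_score(query, c))

# Toy data: (query, known match, known mismatches).
cases = [
    ((0.0, 0.0), (0.1, 0.9), [(0.8, 0.1), (0.9, 0.0)]),
    ((1.0, 1.0), (0.9, 0.0), [(0.2, 1.0), (0.1, 0.9)]),
]

if __name__ == "__main__":
    weights, rank = greedy_weight_search(cases, n_features=2)
    print("weights:", weights, "mean rank:", rank)
```

In this sketch the cheap distance only narrows the database to `k` candidates; the caller supplies whatever costlier comparison is appropriate as `expensive_score`, which mirrors the two-phase retrieval the abstract describes.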

Bibliographic record

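  • Author: Gopal, Kreshna.
  • Author affiliation: Texas A&M University.
  • Degree-granting institution: Texas A&M University.
  • Subject: Biology, Bioinformatics; Artificial Intelligence; Computer Science.
  • Degree: Ph.D.
  • Year: 2007
  • Pages: 137 p.
  • Total pages: 137
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification: Artificial intelligence theory; Automation technology, computer technology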
