Journal: ACM Transactions on Asian Language Information Processing

A Link Prediction Approach for Accurately Mapping a Large-scale Arabic Lexical Resource to English WordNet


Abstract

The success of Natural Language Processing (NLP) models, like that of all advanced machine learning models, relies heavily on large-scale lexical resources. For English, English WordNet (EWN) is a leading example of a large-scale resource that has enabled advances in Natural Language Understanding (NLU) tasks such as word sense disambiguation, question answering, sentiment analysis, and emotion recognition. EWN includes sets of cognitive synonyms called synsets, which are interlinked by means of conceptual-semantic and lexical relations, and each synset expresses a distinct concept. However, other languages still lag behind in having large-scale, rich lexical resources similar to EWN. In this article, we focus on enabling the development of such resources for Arabic. While there have been efforts to develop an Arabic WordNet (AWN), the current version of AWN is limited in size and lacks transliteration standards, which are important for compatibility with Arabic NLP tools. Previous efforts to extend AWN resulted in a lexicon, called ArSenL, that overcame the size and transliteration-standard limitations but was limited in accuracy: its heuristic approach considered only surface matching between the English definitions from the Standard Arabic Morphological Analyzer (SAMA) and EWN synset terms, which resulted in inaccurate mapping of Arabic lemmas to EWN's synsets. Furthermore, there has been limited exploration of other expansion methods because of the expensive manual validation they require. To achieve large scale, high accuracy, and standard representations simultaneously, the mapping problem is formulated as a link prediction problem between a large-scale Arabic lexicon and EWN, where a word in one lexicon is linked to a word in the other lexicon if the two words are semantically related.
We use a semi-supervised approach to create a training dataset by finding common terms in the large-scale Arabic resource and AWN. This dataset is implicitly linked to EWN and can be used for training and evaluating prediction models. We propose a two-step Boosting method, where the first step aims at linking English translations of SAMA's terms to EWN's synsets, and the second step uses surface similarity between SAMA's glosses and EWN's synsets. The method results in a new large-scale Arabic lexicon that we call ArSenL 2.0, a sequel to the previously developed sentiment lexicon ArSenL. A comprehensive study covering both intrinsic and extrinsic evaluations shows the superiority of the method over several baseline and state-of-the-art link prediction methods. Compared to the previously developed ArSenL, ArSenL 2.0 includes a larger set of sentimentally charged adjectives and verbs. It also shows higher linking accuracy on the ground truth data. For the extrinsic evaluation, ArSenL 2.0 was used for sentiment analysis and, here too, showed higher accuracy than the previous ArSenL.
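
As a rough illustration of the training-set construction and the surface-similarity signal described in the abstract, the sketch below intersects the lemmas of a large lexicon with those of an AWN-like resource, so that the shared lemmas inherit their EWN synset links as labels. Everything in it is an assumption made for illustration: the toy entries, the identifiers (`large_lexicon`, `awn_links`, `ewn_synsets`), and the use of Jaccard token overlap as the similarity measure. The abstract does not specify the features or the boosting setup, so this is not the authors' implementation.

```python
import re

# Toy, hypothetical data: none of these entries come from SAMA, AWN, or EWN.
large_lexicon = {            # lemma -> English gloss (SAMA-like resource)
    "lemma_a": "book; volume",
    "lemma_b": "write; compose",
    "lemma_c": "river",
}
awn_links = {                # lemma -> linked synset id (AWN-like resource)
    "lemma_a": "syn_book",
    "lemma_b": "syn_write",
}
ewn_synsets = {              # synset id -> synset terms (EWN-like resource)
    "syn_book": ["book", "volume"],
    "syn_write": ["write", "compose", "pen"],
    "syn_river": ["river"],
}


def tokens(text):
    """Lowercase a string and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def surface_similarity(gloss, synset_terms):
    """Jaccard overlap between gloss tokens and synset-term tokens (assumed measure)."""
    g = tokens(gloss)
    s = tokens(" ".join(synset_terms))
    union = g | s
    return len(g & s) / len(union) if union else 0.0


def build_training_pairs(lexicon, links, synsets):
    """Create labeled (lemma, synset) candidates for lemmas found in both resources.

    A pair is labeled 1 when the AWN-like resource links the lemma to that
    synset, and 0 otherwise. Lemmas absent from `links` stay unlabeled.
    """
    pairs = []
    for lemma in sorted(lexicon.keys() & links.keys()):
        for syn_id, terms in sorted(synsets.items()):
            feature = surface_similarity(lexicon[lemma], terms)
            label = int(links[lemma] == syn_id)
            pairs.append((lemma, syn_id, feature, label))
    return pairs


training_pairs = build_training_pairs(large_lexicon, awn_links, ewn_synsets)
```

In a fuller pipeline, feature rows like these could be passed to a boosted classifier that scores candidate links for the lemmas that remain unlabeled (here, `lemma_c`).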
