A Link Prediction Approach for Accurately Mapping a Large-scale Arabic Lexical Resource to English WordNet

Badaro Gilbert; Hajj Hazem; Habash Nizar


Abstract

The success of Natural Language Processing (NLP) models, like that of all advanced machine learning models, relies heavily on large-scale lexical resources. For English, English WordNet (EWN) is a leading example of a large-scale resource that has enabled advances in Natural Language Understanding (NLU) tasks such as word sense disambiguation, question answering, sentiment analysis, and emotion recognition. EWN consists of sets of cognitive synonyms called synsets, each expressing a distinct concept, which are interlinked by conceptual-semantic and lexical relations. Other languages, however, still lack large-scale, rich lexical resources comparable to EWN. In this article, we focus on enabling the development of such resources for Arabic. Although there have been efforts to develop an Arabic WordNet (AWN), the current version of AWN is limited in size and lacks transliteration standards, which are important for compatibility with Arabic NLP tools. Previous efforts to extend AWN resulted in a lexicon called ArSenL, which overcame the size and transliteration-standard limitations but was limited in accuracy: its heuristic approach considered only surface matching between the English definitions from the Standard Arabic Morphological Analyzer (SAMA) and EWN synset terms, which resulted in inaccurate mappings of Arabic lemmas to EWN synsets. Furthermore, other expansion methods have seen limited exploration because of the expensive manual validation they require. To achieve large scale, high accuracy, and standard representations simultaneously, we formulate the mapping problem as a link prediction problem between a large-scale Arabic lexicon and EWN, where a word in one lexicon is linked to a word in the other lexicon if the two words are semantically related.
We use a semi-supervised approach to create a training dataset by finding the terms common to the large-scale Arabic resource and AWN. This set of data is implicitly linked to EWN and can be used for training and evaluating prediction models. We propose a two-step Boosting method, in which the first step links English translations of SAMA's terms to EWN's synsets and the second step uses surface similarity between SAMA's glosses and EWN's synsets. The method yields a new large-scale Arabic lexicon that we call ArSenL 2.0, a sequel to the previously developed sentiment lexicon ArSenL. A comprehensive study covering both intrinsic and extrinsic evaluations shows the superiority of the method over several baseline and state-of-the-art link prediction methods. Compared with the original ArSenL, ArSenL 2.0 includes a larger set of sentimentally charged adjectives and verbs and shows higher linking accuracy on the ground-truth data. In the extrinsic evaluation, ArSenL 2.0 was used for sentiment analysis and again showed higher accuracy than the original ArSenL.
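
As a rough illustration of two ideas in the abstract (seeding training links from lemmas shared with AWN, and scoring candidate links by surface similarity between a gloss and a synset's terms), the following Python sketch may help. It is not the authors' implementation: the function names, the toy lemma and synset identifiers, and the choice of Jaccard token overlap as the surface-similarity measure are all assumptions made for illustration, and the Boosting classifier itself is omitted.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return {tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok}


def jaccard(a, b):
    """Jaccard overlap of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_training_links(lexicon_lemmas, awn_links):
    """Keep only lemmas that also occur in AWN.

    `awn_links` is a hypothetical dict mapping an AWN lemma to the EWN
    synset it is already linked to, so shared lemmas inherit that link.
    """
    return {lemma: awn_links[lemma] for lemma in lexicon_lemmas if lemma in awn_links}


def rank_synsets(gloss, synsets):
    """Rank candidate synsets by token overlap with an English gloss.

    `synsets` is a hypothetical dict mapping a synset id to its list of terms.
    Jaccard overlap is an assumed stand-in for "surface similarity".
    """
    gloss_tokens = tokenize(gloss)
    scored = [
        (synset_id, jaccard(gloss_tokens, tokenize(" ".join(terms))))
        for synset_id, terms in synsets.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy data only; identifiers are placeholders, not real SAMA or EWN entries.
    toy_synsets = {"syn_happy": ["happy", "glad"], "syn_book": ["book", "volume"]}
    toy_awn = {"lemma_a": "syn_happy", "lemma_c": "syn_book"}

    print(build_training_links(["lemma_a", "lemma_b"], toy_awn))
    print(rank_synsets("be happy; be glad", toy_synsets))
```

In a fuller pipeline, scores like these would be one feature among several fed to a link prediction classifier, with the AWN-derived links serving as labeled examples.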

Bibliographic details

Source
ACM Transactions on Asian Language Information Processing | 2020, Issue 6 | 80.1-80.38 | 38 pages
Authors
Badaro Gilbert; Hajj Hazem; Habash Nizar
Affiliations

Amer Univ Beirut Elect & Comp Engn Dept POB 11-0236 Riad E Solh Beirut 11072020 Lebanon;

Amer Univ Beirut Elect & Comp Engn Dept POB 11-0236 Riad E Solh Beirut 11072020 Lebanon;

New York Univ Abu Dhabi Comp Sci Dept POB 129188 Abu Dhabi U Arab Emirates;

Original format: PDF
Language: English
Keywords
Link prediction; Arabic WordNet expansion; Arabic natural language processing; lexical resources; Arabic sentiment lexicon; WordNet

