Unsupervised Taxonomy of Large Document Corpora Utilizing Idiomatic Character of Natural Languages

Abstract

This paper presents a novel approach to unsupervised taxonomy. It is based on the idiomatic character of natural languages, rather than on statistical calculations. The term "idiomatic" is used here in two complementary senses: as an intersubjective agreement among the members of a speech community regarding the meaning of a phrase, or as an objective agreement that a particular message is customarily expressed by some particular phrase. The idiomatic character of natural languages makes it extremely likely that, across entire document corpora, similar ideas will be consistently expressed by the same particular phrases. This allows the main ideas of a corpus to be faithfully represented by a handful of idiomatic phrases, which can serve as a directory that significantly improves navigation through the underlying corpus.
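
The abstract does not describe the authors' algorithm, so the following is only a minimal sketch of the "directory" idea, not the paper's method. It assumes that a phrase is simply a run of `n` consecutive words and that a phrase is worth listing when it recurs in at least `min_docs` documents. The function name `phrase_directory`, both parameters, and the sample corpus are hypothetical, and the recurrence count is a stand-in for whatever non-statistical criterion the paper actually uses.

```python
from collections import defaultdict


def phrase_directory(docs, n=2, min_docs=2):
    """Map each n-word phrase to the sorted ids of the documents containing it.

    Illustrative assumption: a phrase is any run of n consecutive
    lowercase words, and it is kept only if it appears in at least
    min_docs different documents.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        # Slide a window of n words across the document.
        for i in range(len(words) - n + 1):
            index[" ".join(words[i:i + n])].add(doc_id)
    return {p: sorted(ids) for p, ids in index.items() if len(ids) >= min_docs}


# Hypothetical three-document corpus.
docs = {
    "a": "interest rates rise again",
    "b": "banks expect interest rates to fall",
    "c": "local team wins final",
}
directory = phrase_directory(docs)
print(directory)  # {'interest rates': ['a', 'b']}
```

Here the single recurring phrase "interest rates" acts as a directory entry pointing to documents `a` and `b`, while document `c` shares no phrase with the others and gets no entry.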

Bibliographic details

Source
International Conference on Artificial Intelligence IC-AI'2001, Vol. 2, Jun 25-28, 2001, Las Vegas, Nevada, USA | 2001 | pp. 961-965 | 5 pages
Conference location: Las Vegas, NV (US)
Authors
Nahum Korda; Naphtali Abudarham; Shmulik Regev
Author affiliation

Virtual Self Ltd., 28 Bezalel St., Ramat-Gan 52521, Israel

Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology
