Twitter corpus creation: The case of a Malay Chat-style-text Corpus (MCC)

Saloot Mohammad Arshi; Idris Norisma; Aw AiTi; Thorleuchter Dirk



Abstract

In recent years, social networks, microblogs, and short message services have deeply penetrated people's lives, and chat-style text has therefore become a common phenomenon. Chat-style text has many features still unknown to linguists, which can be discovered by analyzing a chat-style corpus. Corpus construction must conform to specific corpus criteria, such as representativeness, sampling, variety, and chronology. To date, the literature has not provided specific criteria for creating a chat-style-text corpus; in contrast to related work, this article provides such criteria. An exhaustive and reliable Malay chat-style text corpus is also still lacking. Thus, the proposed criteria are used to demonstrate the process of constructing a Twitter corpus known as the Malay Chat-style Corpus (MCC). The MCC, which comprises 1 million Twitter messages, consists of 14,484,384 word instances and 646,807 terms, together with metadata such as posting time, the Twitter client application used, and the type of Twitter message (simple Tweet, Retweet, or Reply). Furthermore, analysis of the MCC reveals characteristics of the corpus, including the most frequent terms and collocations, a Zipf's law diagram, Twitter peak hours, and the percentages of message types. Finally, the representativeness of the corpus is evaluated by employing cartography and automatic language identification methods. The corpus and the corpus creation process are valuable for researchers working in linguistics, natural language processing, and data mining.
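The distinction the abstract draws between word instances (tokens) and terms (distinct types), and the rank-frequency view behind a Zipf's law diagram, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the tokenizer, the function names `tokenize`, `corpus_stats`, and `zipf_table`, and the three sample messages are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the paper's method):
# lowercase word characters, optionally prefixed by @ or #.
TOKEN_RE = re.compile(r"[@#]?\w+(?:'\w+)?")


def tokenize(text):
    """Split one message into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())


def corpus_stats(messages):
    """Count word instances (tokens) and distinct terms (types)."""
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    tokens = sum(counts.values())
    types = len(counts)
    ratio = types / tokens if tokens else 0.0
    return {"tokens": tokens, "types": types, "ratio": ratio, "counts": counts}


def zipf_table(counts, top=10):
    """Return (rank, term, frequency, rank * frequency) rows.

    Under an idealized Zipf distribution, rank * frequency stays
    roughly constant across ranks.
    """
    return [
        (rank, term, freq, rank * freq)
        for rank, (term, freq) in enumerate(counts.most_common(top), start=1)
    ]


# Invented sample messages, for illustration only.
sample = ["aku nak makan", "aku x nak tido", "RT @kawan: aku lapar"]
stats = corpus_stats(sample)
table = zipf_table(stats["counts"], top=3)

# Type/token ratio implied by the figures reported in the abstract.
mcc_ratio = 646_807 / 14_484_384
```

Applied to the reported figures, the same ratio comes to roughly 0.045 distinct terms per word instance. How the authors tokenized the messages and counted terms is not stated in the abstract, so this sketch should not be expected to reproduce their numbers.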


Bibliographic details

Source
Literary & linguistic computing | 2016, Issue 2 | pp. 227-243 | 17 pages
Authors
Saloot Mohammad Arshi; Idris Norisma; Aw AiTi; Thorleuchter Dirk
Affiliations

Univ Malaya, Kuala Lumpur, Malaysia | ASTAR, Inst Infocomm Res I2R, Singapore, Singapore;

Univ Malaya, Kuala Lumpur, Malaysia;

ASTAR, Inst Infocomm Res I2R, Singapore, Singapore;

Fraunhofer INT, Appelsgarten 2, Euskirchen, Germany

Original format: PDF
Language of text: English

