Journal: Literary & linguistic computing

Twitter corpus creation: The case of a Malay Chat-style-text Corpus (MCC)

Abstract

In recent years, social networks, microblogs, and short message services have deeply penetrated people's lives, and chat-style text has thus become a common phenomenon. Chat-style text has many features still unknown to linguists, which can be discovered by analyzing a chat-style corpus. Corpus construction must conform to specific criteria, such as representativeness, sampling, variety, and chronology. To date, the literature has not provided specific criteria for creating a chat-style-text corpus; in contrast to related work, this article provides such criteria. An exhaustive and reliable Malay chat-style text corpus is also still lacking. The provided criteria are therefore used to demonstrate the construction of a Twitter corpus known as the Malay Chat-style-text Corpus (MCC). The MCC, which has 1 million Twitter messages, consists of 14,484,384 word instances, 646,807 terms, and metadata such as posting time, the Twitter client application used, and the type of Twitter message (simple Tweet, Retweet, or Reply). Furthermore, analysis of the MCC reveals characteristics of the corpus, including the most frequent terms and collocations, a Zipf's law diagram, Twitter peak hours, and the percentages of message types. Finally, the representativeness of the corpus is evaluated using cartography and automatic language identification methods. The corpus and the corpus creation process are valuable for researchers working in linguistics, natural language processing, and data mining.
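
The corpus statistics named in the abstract (word instances, terms, the rank-frequency data behind a Zipf's law diagram, and message-type counts) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample messages are invented and not taken from the MCC, the tokenizer is a simple regular expression, and the message type is guessed from the "RT " and "@" prefix conventions rather than read from metadata. It is not the authors' pipeline.

```python
import re
from collections import Counter

# Hypothetical sample messages for illustration; not taken from the MCC.
SAMPLE_TWEETS = [
    "RT @kawan: jom makan skrg",
    "@kawan x nak la, dah makan",
    "jom makan nasi lemak esok",
]

# Assumption: a token is any run of ASCII letters or digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def message_type(text):
    """Guess the message type from prefix conventions (an assumption)."""
    if text.startswith("RT "):
        return "Retweet"
    if text.startswith("@"):
        return "Reply"
    return "Tweet"


def corpus_stats(tweets):
    """Count word instances, distinct terms, ranked frequencies, and types."""
    freq = Counter(tok for tweet in tweets for tok in tokenize(tweet))
    ranked = freq.most_common()  # most frequent term first
    return {
        "word_instances": sum(freq.values()),  # running words (tokens)
        "terms": len(freq),                    # distinct words (types)
        "rank_frequency": [
            (rank, term, count)
            for rank, (term, count) in enumerate(ranked, start=1)
        ],
        "message_types": dict(Counter(message_type(t) for t in tweets)),
    }


if __name__ == "__main__":
    stats = corpus_stats(SAMPLE_TWEETS)
    print(stats["word_instances"], stats["terms"])
    print(stats["rank_frequency"][:3])
    print(stats["message_types"])
```

Plotting rank against frequency from `rank_frequency` on log-log axes gives the kind of diagram usually used to check Zipf's law. Note that this simple tokenizer counts markers such as "rt" and user names as words; a real corpus pipeline would likely treat them separately.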