Conference on Computational Linguistics and Speech Processing

Observing Features of PTT Neologisms: A Corpus-driven Study with N-gram Model

Abstract

PTT (批踢踢) is one of the largest web forums in Taiwan. Its importance has grown rapidly in the last few years because it is widely cited by most of the mainstream media. Its influence is reflected not only in society but also in novel language use in Taiwan. In this research, a processing pipeline was developed in Python to collect data from PTT, and an n-gram model with a proposed linguistic filter was adopted in an attempt to capture two-character neologisms that have emerged on PTT. An evaluation task with 25 subjects was conducted to assess the system's performance, with agreement measured by Fleiss' kappa. A linguistic discussion and a comparison with a time-series analysis of frequency data are provided. It is hoped that observing these features can improve the detection of neologisms on PTT and may even facilitate the prediction of future neologisms.
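The abstract does not describe the n-gram model or the linguistic filter in detail, so the following is only a minimal sketch of the general idea: count character bigrams across posts, then discard candidates that are already known words or that contain a function character. The names char_bigrams and candidate_neologisms, the parameters known_words, stop_chars, and min_count, and the sample posts are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Matches a single CJK ideograph in the basic block only (a simplifying assumption).
CJK = re.compile(r"[\u4e00-\u9fff]")


def char_bigrams(text):
    """Yield every pair of adjacent characters where both are CJK ideographs."""
    for a, b in zip(text, text[1:]):
        if CJK.fullmatch(a) and CJK.fullmatch(b):
            yield a + b


def candidate_neologisms(posts, known_words, stop_chars, min_count=2):
    """Return {bigram: count} for bigrams that pass a simple stand-in filter.

    The filter here is an assumption, not the paper's linguistic filter:
    keep a bigram only if it occurs at least min_count times, is not in
    known_words, and contains no character from stop_chars.
    """
    counts = Counter()
    for post in posts:
        counts.update(char_bigrams(post))
    return {
        bigram: n
        for bigram, n in counts.items()
        if n >= min_count
        and bigram not in known_words
        and not (set(bigram) & stop_chars)
    }


# Illustrative data only; these posts are invented for the example.
posts = ["他很魯蛇", "魯蛇的日常", "我的日常"]
result = candidate_neologisms(
    posts, known_words={"日常"}, stop_chars=set("的很我他")
)
print(result)
```

In this toy input, 日常 is dropped because it is listed as a known word and 的日 is dropped because it contains a stop character, which leaves 魯蛇 as the only candidate. A real filter would presumably rely on a reference lexicon and richer linguistic criteria.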
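Fleiss' kappa, the agreement measure named in the abstract, can be computed from a table of per-item category counts. The sketch below follows the standard definition; the rating categories and the layout of the input table are assumptions, since the abstract does not say how the 25 subjects' judgments were coded.

```python
def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    ratings[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    # Undefined (division by zero) when every rating falls in one category.
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy table: 2 items, 3 raters, 2 categories.
print(fleiss_kappa([[3, 0], [0, 3]]))
```

For a study with 25 raters, each row of the table would sum to 25. A value of 1 indicates complete agreement, and values at or below 0 indicate agreement no better than chance.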
