Hashtag Segmentation: A Comparative Study Involving the Viterbi, Triangular Matrix and Word Breaker Algorithms

Samia F. Abd-hood; Nazlia Omar

Abstract

Microblogging and social networking sites such as Twitter, Facebook, and Instagram are becoming increasingly popular, registering more than 500 million posts each day. Twitter uses hashtags, which are dynamic, user-generated text preceded by a pound (#) symbol, to retrieve similar posts or topics, to mark events, or to tag channels. Once segmented, hashtags can be used in many Natural Language Processing (NLP) applications, including sentiment analysis, text classification, named entity recognition, and sarcasm detection. This study compares three algorithms, namely the Viterbi, Triangular Matrix, and Word Breaker algorithms, to determine which is best for hashtag segmentation. The algorithms use different resources to calculate the probability of the segmented parts and so rank the candidate segmentations: the Viterbi and Triangular Matrix algorithms use two statistical corpora, one of unigrams and one of bigrams, whereas the Word Breaker algorithm uses an n-gram language model. According to the experiment conducted, the Viterbi algorithm is better for hashtag segmentation than the Triangular Matrix algorithm, which can be attributed to the manner in which the Viterbi algorithm conducts backtracking. The Word Breaker algorithm, which can ascertain meaningful tokens in the form of words before segmenting the remaining characters, is considered superior to both the Viterbi and Triangular Matrix algorithms, particularly in detecting unknown words. Used together with the Good-Turing smoothing algorithm, the Word Breaker algorithm achieved an F1-score of 86.64 on a large language model.
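To make the idea of ranking candidate segmentations by probability concrete, the following is a minimal sketch of a Viterbi-style dynamic program with backtracking. It is not the authors' implementation: it scores words with unigram probabilities only (the abstract states that the paper's Viterbi variant also uses bigrams), the counts in `UNIGRAM_COUNTS` are invented toy values, and the length-based penalty for unknown words in `word_logprob` is an assumption made for illustration.

```python
import math

# Toy unigram counts: purely illustrative, NOT taken from the paper's corpora.
UNIGRAM_COUNTS = {
    "hash": 50, "tag": 80, "hashtag": 40, "segmentation": 20,
    "we": 500, "love": 300, "nlp": 30, "i": 900, "a": 1000, "the": 1200,
}
TOTAL = sum(UNIGRAM_COUNTS.values())
MAX_WORD_LEN = 20  # assumed upper bound on the length of a single word


def word_logprob(word):
    """Log-probability of a word under the toy unigram model."""
    count = UNIGRAM_COUNTS.get(word)
    if count is not None:
        return math.log(count / TOTAL)
    # Assumed heuristic: unknown words get a penalty that grows with length,
    # so long unknown strings are disfavoured relative to known words.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment(hashtag):
    """Return the most probable word sequence for a hashtag."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + word_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Backtrack from the end of the string to recover the word boundaries.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("#welovenlp"))
print(segment("#hashtag"))
```

With these toy counts, `#welovenlp` splits into three known words, while `#hashtag` stays whole because the single unigram `hashtag` is more probable than the product of `hash` and `tag`. A bigram-aware version would condition each word's score on the previous word, which this sketch omits.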

Bibliographic details

Source: Journal of Advances in Information Technology, 2021, Issue 4, pp. 311-318 (8 pages)
Authors: Samia F. Abd-hood; Nazlia Omar
Affiliation: Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
Original format: PDF
Language: English
Chinese Library Classification: Metrology