Conference: International Conference on Intelligent Communication Technologies and Virtual Mobile Networks

Machine Learning based Dataset for Finding Suicidal Ideation on Twitter


Abstract

Suicidal ideation is a major health issue today and can lead to death. Suicide is also one of the leading causes of death in many countries [9] [14]. Automatically identifying people with suicidal ideation on social media is an important problem, and many researchers are working on it [10] [11]. Many risk factors are associated with suicidal ideation, such as anxiety, depression, and mental disorders [13] [15]. A number of methods have been developed to prevent deaths by suicide. With the advent of social networking sites, people have started expressing their feelings on social media rather than to someone in person [6] [12]. Text classification has proven to be a successful method for suicide prevention [8]. This article describes a dataset of people expressing suicidal ideation on Twitter. The data was extracted through an Application Programming Interface (API) provided by Twitter. Various features/keywords related to suicidal ideation, shown in Table 2, were used to identify persons with such ideation. These keywords were gathered from various web forums and papers from previous years [7]. Initially, the dataset was collected from Twitter's public API using its access key and access token. The raw data comprises around 14202 tweets with fields such as user__id, user__name, created_at, text, user__screen_name, user__friends_count, user__listed_count, user__favourites_count, user__followers_count, user__statuses_count, user__created_at, and user_location; a part of it is shown in Table 3. A sample of 1897 tweets was then extracted based on the selected keywords, and only the text and class fields were retained, since these are what is needed as input to the algorithms, as shown in Table 4. The class is binary: 0 (non-suicidal) or 1 (suicidal), depending on whether the tweet is related to suicidal ideation. This labeling was done manually by a human annotator and a psychiatric expert, as shown in Table 4.
In the final step, the tweets are preprocessed based on the semantics of the recognized keywords. New columns covering all the keywords are then added to the table based on the text field, and the table is converted into binary values: a keyword column is assigned 1 if that keyword occurs in the particular tweet and 0 if it does not, so the resulting dataset consists only of binary (0 or 1) values, as given in Table 5 [1]. The resulting dataset consists of 1897 tweets and 34 features. A number of machine learning algorithms (Multinomial Naïve Bayes, Bernoulli Naïve Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Random Forest, Voting Ensemble, and AdaBoost) are then applied to this dataset to test it and to measure accuracy, recall, and precision.
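The keyword-to-binary-feature step and the reported metrics can be illustrated with a minimal Python sketch. This is not the authors' code: the keyword list, the whole-word matching rule, and the function names `encode_tweet` and `precision_recall_accuracy` are assumptions made here for illustration, and the paper's actual keywords (its Table 2) are not reproduced.

```python
import re

# Hypothetical placeholder keywords; the paper's real list (Table 2) is not shown here.
KEYWORDS = ["hopeless", "goodbye", "tired of living"]


def encode_tweet(text, keywords=KEYWORDS):
    """Return one 0/1 flag per keyword: 1 if the keyword occurs in the text, else 0.

    Matching is case-insensitive and on whole words (an assumption of this sketch).
    """
    lowered = text.lower()
    flags = []
    for kw in keywords:
        pattern = r"\b" + re.escape(kw.lower()) + r"\b"
        flags.append(1 if re.search(pattern, lowered) else 0)
    return flags


def precision_recall_accuracy(y_true, y_pred):
    """Compute precision, recall and accuracy for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


if __name__ == "__main__":
    # Each row of the feature matrix corresponds to one tweet, each column to one keyword.
    tweets = ["I feel hopeless. Goodbye.", "Great match today!"]
    matrix = [encode_tweet(t) for t in tweets]
    print(matrix)
    print(precision_recall_accuracy([1, 0, 1, 0], [1, 1, 0, 0]))
```

Rows of 0/1 flags like these, paired with the manually assigned class label, are the kind of input a classifier such as Bernoulli Naïve Bayes expects. Any classifier's predictions could then be scored with the metric function.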
