Attribute Inference from User Comment Behavior on Social Media

刘云; 孙宇清; 李明珠

Abstract

Inferring user attributes from online behavior is valuable in practical applications such as personalized recommendation, marketing, and improving the quality of web services. Existing work mainly targets identity-related online behaviors, such as query history and user relationships, and does not apply to review-oriented social media sites, where users are mostly anonymous. Moreover, user reviews are fragmented and noisy, the amount of review data is imbalanced across users, and the distribution of user attributes is severely skewed. This paper proposes a series of methods to address these challenges.

We use information about the items a user comments on, together with context information, to supplement user features. This reveals a user's preferences and behavior trajectory and alleviates the imbalance and sparsity of behavior data. We also introduce an ontology database to enrich the semantic features of user comments. The ontology summarizes and generalizes knowledge about words and organizes it into a hierarchical structure. User comments are segmented into words, and each word is mapped to the ontology node that represents the concept shared by words of the same meaning. The resulting hierarchical features capture semantic relationships among words and reduce the negative influence of fragmented data and imbalanced data quantity.

After modeling, the feature space is high-dimensional and fragmented information has low value, so we adopt information gain to measure feature importance. In information theory, entropy measures the uncertainty of a random variable. For user attribute inference, the change in the uncertainty of a user attribute after a feature is added is called the information gain, which indicates the amount of information that the feature contributes. The larger the gain, the better the feature distinguishes users with different attributes. Based on information gain, we improve two representative probabilistic feature selection methods: the Probability Wrapped Features Selection algorithm and the Heuristic Probability Feature Selection algorithm. Both use feature importance as the selection probability, in either the pre-classification or the iterative learning process. This reduces the search space, speeds up the convergence of feature selection, and lowers the impact of data noise.

Considering the correlations between features and classifiers on small-proportion classes, we propose the Unbalanced Data Enhancement Learning algorithm, which integrates multiple feature-related classifiers. It retains the important features while still selecting low-value features with a small probability, which makes it better suited to inference over imbalanced attributes.

We validate the methods on real Chinese and English datasets from several aspects, including behavior models, feature selection methods, parameter settings, and the degree of imbalance in the user attribute distribution, and compare them with other methods. The experimental results show that the proposed approach relieves the negative influence of fragmented and noisy data, effectively addresses attribute classification under imbalanced attribute distributions, and outperforms the related algorithms.
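
To make the information-gain idea concrete, the following is a minimal sketch rather than the authors' implementation. It computes the information gain of discrete features as the drop in entropy of the attribute labels, then turns the gains into selection probabilities. The function names, the toy data, and the `floor` smoothing term (used here so that low-value features keep a small chance of being selected) are all assumptions made for illustration; the abstract does not specify how the probabilities are formed.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def selection_probabilities(features, labels, floor=0.05):
    """Map each feature to a selection probability proportional to
    (information gain + floor). The floor is a hypothetical smoothing
    term, not taken from the paper."""
    scores = {
        name: information_gain(values, labels) + floor
        for name, values in features.items()
    }
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Toy data (hypothetical): four users with attribute label "A" or "B"
# and two binary word-presence features.
labels = ["A", "A", "B", "B"]
features = {
    "word_x": [1, 1, 0, 0],  # separates the two classes perfectly
    "word_y": [1, 0, 1, 0],  # carries no information about the class
}
probs = selection_probabilities(features, labels)
```

In this toy case `word_x` removes all uncertainty about the label and `word_y` removes none, so `word_x` receives most of the probability mass while `word_y` still has a small nonzero chance of being sampled.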
