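Li Yan, Wei Zhihua, Xu Kai. Hybrid Feature Selection Method of Chinese Emotional Characteristics Based on Lasso Algorithm[J]. Computer Science, 2018, 45(1): 39-46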
Hybrid Feature Selection Method of Chinese Emotional Characteristics Based on Lasso Algorithm
Submitted: 2017-03-03  Revised: 2017-07-11
DOI:10.11896/j.issn.1002-137X.2018.01.006
Keywords: Chinese sentiment analysis, feature selection, Lasso, sentiment classification, machine learning
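Funding: This work was supported by the National Natural Science Foundation of China (61573259), the Shanghai Three-Year Action Plan for Further Accelerating the Development of Traditional Chinese Medicine (2014-2016) (ZY3-CCCX-3-6002), and the Fundamental Research Funds for the Central Universities (0800219302, 0800219315).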
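Authors, affiliations and e-mail:
Li Yan, Department of Computer Science and Technology, College of Electronics and Information Engineering, Tongji University, Shanghai 201804
Wei Zhihua, Department of Computer Science and Technology, College of Electronics and Information Engineering, Tongji University, Shanghai 201804, E-mail: zhihua_wei@tongji.edu.cn
Xu Kai, Port and Shipping Big Data Laboratory, Shanghai International Shipping Institute, Shanghai Maritime University, Shanghai 200082, E-mail: kaixu@shmtu.edu.cn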
Abstract:
      Sentiment orientation classification is a central problem in Chinese sentiment analysis. Sentiment feature selection is the prerequisite and foundation of machine-learning-based sentiment orientation classification: it reduces the dimensionality of the feature set by eliminating irrelevant or redundant features. This paper proposes a hybrid sentiment feature selection method that combines the Lasso algorithm with filter-based feature selection. First, Lasso penalized regression screens the original feature set to produce a sentiment classification feature subset with low redundancy. Second, filter methods such as CHI, MI and IG are applied to this subset to score the dependency between each candidate feature word and the text category, and candidate words with low relevance are removed accordingly. Finally, the proposed method is compared with DF, MI, IG and CHI at different numbers of feature words, using an SVM classifier with a Gaussian kernel. Experiments on a microblog short-text corpus show that the proposed algorithm is effective and efficient. Moreover, when the dimension of the feature subset is smaller than the number of samples, the hybrid method improves on the feature selection results of DF, MI, IG and CHI to some extent. A comparison of recognition rate and recall indicates that Lasso-MI is more effective than MI and the other filter methods.
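The two-stage selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vocabulary and documents, the penalty `alpha = 0.1`, the helper names (`lasso_coordinate_descent`, `chi_square`, `hybrid_select`), and the choice of CHI as the stage-2 filter are all assumptions made for illustration. Lasso is fitted by plain coordinate descent on binary term-presence features without an intercept, and the final SVM classification step is omitted.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used in Lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes w[j].
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j])
                      for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def chi_square(X, y, j):
    """CHI score of binary feature j against a binary label (+1 / -1)."""
    n = len(X)
    a = sum(1 for i in range(n) if X[i][j] and y[i] > 0)       # present, positive
    b = sum(1 for i in range(n) if X[i][j] and y[i] <= 0)      # present, negative
    c = sum(1 for i in range(n) if not X[i][j] and y[i] > 0)   # absent, positive
    d = n - a - b - c                                          # absent, negative
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def hybrid_select(X, y, alpha, k):
    """Stage 1: keep features with non-zero Lasso weight.
    Stage 2: rank the survivors by CHI and keep the top k."""
    weights = lasso_coordinate_descent(X, y, alpha)
    survivors = [j for j, wj in enumerate(weights) if wj != 0.0]
    ranked = sorted(survivors, key=lambda j: chi_square(X, y, j), reverse=True)
    return ranked[:k], weights


# Hypothetical toy corpus: rows are documents, columns are term presence.
vocab = ["good", "bad", "today", "the"]
X = [
    [1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 0, 1],  # positive docs
    [0, 1, 1, 1], [0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 0, 1],  # negative docs
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

selected, weights = hybrid_select(X, y, alpha=0.1, k=2)
print([vocab[j] for j in selected])
```

On this toy data the Lasso stage zeroes out the noise term "today", and the CHI stage then discards the constant term "the", which Lasso retained but which carries no class dependency, leaving "good" and "bad". Swapping `chi_square` for an MI or IG score would give the other hybrid variants named in the abstract.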