An Adaptive Chinese Word Segmentation Algorithm for Text Knowledge Management

doi:10.11835/j.issn.1000-582X.2010.10.019

2010, Vol. 33, No. 10, pp. 110-117.

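Fund projects: Natural Science Foundation of Chongqing (2008BB2183); Fundamental Research Funds for the Central Universities (DJIR10180006); "211 Project" Phase III Construction Program (S-10218); China Postdoctoral Science Foundation (20080440699); National Key Technology R&D Program (2008BAH37B04); National Social Science Foundation "Eleventh Five-Year Plan" Key Project in Education (ACA07004-08).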

Text knowledge management oriented adaptive Chinese word segmentation algorithms


Abstract:

To address the weaknesses of traditional dictionary-matching segmentation in recognizing new words and handling special words, this paper presents SACWSA, an adaptive Chinese word segmentation algorithm for text knowledge management built on a 2-gram statistical model. In the preprocessing stage, SACWSA combines finite state machine theory, a conjunction-based partition method, and a divide-and-conquer strategy to split the input text into sub-sentences, which effectively reduces the complexity of the segmentation algorithm. In the segmentation stage, it applies the 2-gram statistical model, combining local (partial) probability with global (overall) probability to segment each sub-sentence, which improves the recognition rate of new words and eliminates ambiguity. In the post-processing stage, part-of-speech collocation rules are established to further eliminate ambiguity in the 2-gram segmentation results. The main features of SACWSA are its use of the divide-and-conquer idea to handle long sentences and long words, and its combination of local and global probability to identify unknown words and resolve ambiguity. Experiments on corpora from different domains show that SACWSA adapts automatically, accurately, and efficiently to the text knowledge management requirements of different industry domains.
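The pipeline described in the abstract (split the text into sub-sentences, then segment each one with a 2-gram model) can be illustrated with a minimal sketch. This is not the authors' SACWSA: the finite state machine, the local/global probability combination, and the part-of-speech rules are not reproduced. The counts in `UNIGRAM` and `BIGRAM`, the `CONJUNCTIONS` list, add-one smoothing, and all function names are assumptions chosen only for illustration.

```python
import math
import re

BOS = "<s>"  # sentence-start marker (assumed convention)

# Hypothetical toy counts; a real system would estimate them from a corpus.
UNIGRAM = {BOS: 8, "研究": 6, "研究生": 2, "生命": 5, "命": 1, "的": 10, "起源": 4}
BIGRAM = {(BOS, "研究"): 4, ("研究", "生命"): 3, ("生命", "的"): 4,
          ("的", "起源"): 3, (BOS, "研究生"): 1}
LEXICON = set(UNIGRAM) - {BOS}

# Hypothetical conjunction list used as extra split points.
CONJUNCTIONS = ("但是", "而且", "并且")
CLAUSE_BREAK = re.compile(
    "[，。；！？]|(" + "|".join(map(re.escape, CONJUNCTIONS)) + ")")


def split_clauses(text):
    """Split at punctuation (dropped) and at conjunctions (kept as pieces)."""
    return [piece for piece in CLAUSE_BREAK.split(text) if piece]


def bigram_logprob(prev, word):
    """Add-one smoothed log P(word | prev)."""
    vocab_size = len(UNIGRAM)
    return math.log((BIGRAM.get((prev, word), 0) + 1)
                    / (UNIGRAM.get(prev, 0) + vocab_size))


def segment_clause(clause, max_len=4):
    """Find the highest-scoring segmentation of one clause by dynamic programming."""
    n = len(clause)
    # best[i] maps the last word of a segmentation of clause[:i]
    # to (log-probability, previous word).
    best = [dict() for _ in range(n + 1)]
    best[0][BOS] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = clause[start:end]
            # Single characters are always allowed so unknown text is still covered.
            if len(word) > 1 and word not in LEXICON:
                continue
            for prev, (score, _) in best[start].items():
                cand = score + bigram_logprob(prev, word)
                if word not in best[end] or cand > best[end][word][0]:
                    best[end][word] = (cand, prev)
    # Backtrack from the best final word to the start of the clause.
    word = max(best[n], key=lambda w: best[n][w][0])
    words, pos = [], n
    while pos > 0:
        words.append(word)
        prev = best[pos][word][1]
        pos -= len(word)
        word = prev
    return words[::-1]


def segment(text):
    """Split text into clauses, then segment each clause independently."""
    words = []
    for clause in split_clauses(text):
        if clause in CONJUNCTIONS:
            words.append(clause)
        else:
            words.extend(segment_clause(clause))
    return words


print(segment("研究生命的起源"))
```

With these toy counts, the ambiguous string "研究生命的起源" is segmented as 研究 / 生命 / 的 / 起源 rather than 研究生 / 命 / 的 / 起源, because the 2-gram scores favor the first reading. Splitting into clauses first keeps each dynamic-programming run short, which is the kind of complexity reduction the abstract attributes to the preprocessing stage.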


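Cite this article

Feng Yong (冯永), He Xun (贺迅), Tang Li (唐黎), Chen Xianyong (陈显勇), Chen Zhen (陈贞). An adaptive Chinese word segmentation algorithm for text knowledge management [J]. Journal of Chongqing University, 2010, 33(10): 110-117.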


History

Received: 2009-05-10
