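Classification No.:
Security classification:
Institution code: 10065
Student ID: 05209029
Graduate Degree Thesis
Thesis title: Research on Data Mining Using Bayesian Network
Student name: Xu Ji (徐计)    Degree applied for: Master of Engineering
Major: Computer Application Technology
Research direction: Data Mining
Supervisor: Zhang Guiyun (张桂芸)    Professional title: Associate Professor
Submission date: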
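Declaration of Originality
I declare that the thesis submitted is the result of research work that I carried out under the
guidance of my supervisor. To the best of my knowledge, except where specifically cited and
acknowledged in the text, this thesis contains no research results that have been published or
written by others, nor any material that has been used to obtain a degree or certificate from
Tianjin Normal University or any other educational institution. Any contribution made to this
research by colleagues who worked with me has been clearly stated and acknowledged in the thesis.
Signature:
Date: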
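Authorization for Use of Thesis Copyright
I fully understand the regulations of Tianjin Normal University on the retention and use of degree
theses, namely: the university has the right to include all or part of the thesis in relevant
databases for retrieval, and to preserve and compile it by photocopying, reduced-format printing,
scanning, or other means of reproduction for consultation and lending. I agree that the university
may submit copies and disks of the thesis to the relevant national departments or institutions.
(Confidential theses shall comply with these provisions after declassification.)
Signature:
Supervisor's signature:
Date: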
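Chinese Abstract
A Bayesian network is an organic combination of Bayesian methods and graph theory. Owing to its
theoretical rigor and consistency, its effective local computation mechanism, and its intuitive
graphical representation of knowledge, the Bayesian network has become a research focus in
artificial intelligence. This thesis applies Bayesian networks to agricultural science: it learns
from and predicts the milk yield of a farm, completes the whole data mining workflow with the
Bayesian network method, and compares the results with those obtained by multiple linear
regression. The main work and contributions of this thesis are as follows: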
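(1) The concepts and related techniques of data mining are briefly introduced, and the basic
principles and methods of Bayesian networks are explained.
(2) In the data preprocessing stage, the variant of the Chi2 algorithm that the author proposed in
March 2007 is adopted. While preserving the fidelity of the data, the algorithm discretizes the
values of each predictor variable so that the Bayesian network method can be applied.
(3) In the network structure search stage, a greedy algorithm with heuristic rules and a random
restart mechanism is adopted. The method makes full use of domain knowledge and, together with the
meaning of the variables themselves, formulates five heuristic rules that greatly reduce the search
space. The greedy algorithm with random restarts retains the simplicity of greedy search while
overcoming its tendency to become trapped in local optima, and it obtains results comparable to
those of many intelligent search algorithms.
(4) After Bayesian network inference yields discrete-valued results, the question of how to restore
the discrete values to continuous ones is further considered in order to improve prediction
accuracy, rather than simply taking the median of the corresponding interval.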
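In addition, when the results of the two methods are compared, the original data are first sorted
to obtain a new display order, which avoids a cluttered scatter plot and makes the advantage of the
Bayesian network results more evident.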
Keywords: data mining, Bayesian network, Chi2 variant, greedy algorithm, linear regression
Research on Data Mining Using Bayesian Network
Abstract
A Bayesian network combines Bayesian theory with graph theory. Because of its theoretical rigor
and consistency, its effective local computation mechanism, and its visual knowledge
representation, the Bayesian network has attracted wide attention from researchers in the AI
field. In this thesis, a Bayesian network is applied to data from the agricultural domain,
specifically to predict the milk output of a certain farm after learning from its historical data.
The whole procedure of data mining with a Bayesian network is carried out, and the results are
then compared with those generated by multiple linear regression. The main work and innovations of
this thesis are as follows:
1. The concepts and related techniques of data mining are briefly introduced, and the basic
principles and methods of Bayesian networks are described in detail.
2. In the data preprocessing stage, a variation of the Chi2 algorithm put forward by the author in
March 2007 is employed. The algorithm discretizes the predictor variables without sacrificing the
fidelity of the training data, which makes the Bayesian network method convenient to apply.
3. In the structure search stage, a greedy algorithm with heuristic rules and a random restart
mechanism is applied. The method takes full advantage of domain knowledge and the semantics of the
variables to work out five heuristic rules, which dramatically reduce the search space. The greedy
algorithm with random restarts retains the merit of simplicity and overcomes the shortcoming of
possibly being trapped in a local optimum, and it therefore achieves results comparable to those
of many intelligent search algorithms.
4. After the discrete-valued result is obtained through Bayesian network inference, the question
of how to transform the discrete values into continuous ones is also considered in order to
improve prediction accuracy, instead of simply using the median of the corresponding interval.
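The search strategy named in item 3 can be illustrated with a minimal Python sketch, assuming a
generic state space. Here `hill_climb` performs plain greedy ascent, and `greedy_with_restarts`
repeats it from random starting states and keeps the best local optimum found. The function names,
the toy landscape, and the neighbor definition are illustrative assumptions, not the implementation
used in this thesis; in structure learning a state would be a candidate network and the score
would be the network scoring function.

```python
import random


def hill_climb(start, score, neighbors):
    """Greedy ascent: move to the best-scoring neighbor until none improves."""
    current = start
    while True:
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            return current  # no neighbor is strictly better: local optimum
        current = best


def greedy_with_restarts(score, neighbors, random_state, restarts, rng):
    """Run hill_climb from 1 + restarts random starts and keep the best result."""
    best = hill_climb(random_state(rng), score, neighbors)
    for _ in range(restarts):
        candidate = hill_climb(random_state(rng), score, neighbors)
        if score(candidate) > score(best):
            best = candidate
    return best


# Toy landscape (hypothetical): a state is an index, its score is the stored value.
LANDSCAPE = [1, 3, 2, 5, 4, 9, 7, 8, 6]


def toy_score(i):
    return LANDSCAPE[i]


def toy_neighbors(i):
    """Adjacent indices that stay inside the landscape."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


def toy_random_state(rng):
    return rng.randrange(len(LANDSCAPE))


if __name__ == "__main__":
    # A single greedy run from index 0 stops at index 1 (score 3), a local optimum.
    print(hill_climb(0, toy_score, toy_neighbors))
    # Restarts give further chances to start inside the basin of the global optimum.
    rng = random.Random(0)
    print(greedy_with_restarts(toy_score, toy_neighbors, toy_random_state, 20, rng))
```

A single run depends entirely on its starting point, whereas each restart adds only the cost of
one more greedy run, which is why the combination stays simple.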
In addition, when the results of the two methods are compared visually, a new display order
produced by sorting the original data is adopted to avoid clutter in the scatter plot, so that the
better performance of the Bayesian network on this problem is much clearer.
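The display ordering described here can be sketched in a few lines of Python. The sketch assumes
that the sort key is the observed value of each record, and all numbers are hypothetical
placeholders rather than data from the thesis. The same index order is applied to the observed
series and to both prediction series, so the points stay aligned record by record.

```python
def sorted_display_order(actual):
    """Return indices that list the observations by ascending actual value."""
    return sorted(range(len(actual)), key=lambda i: actual[i])


def reorder(values, order):
    """Apply the same display order to any series aligned with the data."""
    return [values[i] for i in order]


# Hypothetical numbers, for illustration only (not data from the thesis).
actual = [30.0, 12.5, 21.0, 18.0]
bn_pred = [28.0, 14.0, 20.0, 19.5]   # stand-in for Bayesian network predictions
lr_pred = [25.0, 17.0, 24.0, 15.0]   # stand-in for linear regression predictions

order = sorted_display_order(actual)
actual_sorted = reorder(actual, order)
bn_sorted = reorder(bn_pred, order)
lr_sorted = reorder(lr_pred, order)

print(order)          # [1, 3, 2, 0]
print(actual_sorted)  # [12.5, 18.0, 21.0, 30.0]
```

Plotted against the new index, the observed values form a rising curve, and each method's
predictions appear as deviations from that curve.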
Key words: Data Mining, Bayesian Network, Chi2 Variant, Greedy Algorithm, Linear Regression
Contents
Chinese Abstract
Abstract
Chapter 1 Introduction
1.1 Concepts Related to Knowledge Discovery
1.1.1 Data, Information, and Knowledge
1.1.2 Knowledge Discovery
1.2 Steps of Knowledge Discovery
1.3 Functions of Knowledge Discovery
1.3.1 Data Summarization
1.3.2 Classification
1.3.3 Clustering
1.3.4 Correlation Analysis
1.4 Methods and Techniques of Knowledge Discovery
1.4.1 Statistical Methods
1.4.2 Machine Learning Methods
1.4.3 Neural Computing
1.4.4 Hybrid Algorithms
1.5 Organization of the Thesis
Chapter 2 Bayesian Networks
2.1 Bayesian Learning Theory
2.2 Origin, Development, and Current Research Status of Bayesian Networks
2.3 Definition of a Bayesian Network
2.4 Independence and Causal Relationships in Bayesian Networks
2.4.1 Independence Relationships
2.4.2 Causal Relationships
2.5 Structural Model Learning of Bayesian Networks
2.5.1 Scoring Functions
2.5.2 Search Strategies
2.6 Local Probability Learning of Bayesian Networks
2.6.1 Choice of the Prior Distribution
2.6.2 Steps of Local Probability Learning
2.7 Bayesian Network Inference
2.8 Summary
Chapter 3 Algorithms and Experimental Procedure
3.1 Experimental Background and Domain Knowledge
3.1.1 Experimental Background
3.1.2 Domain Knowledge about Milk Yield
3.2 Key Algorithms in the Experiment
3.2.1 Algorithm for the Data Preprocessing Stage
3.2.2 Algorithm for the Network Structure Learning Stage
3.2.3 Algorithm for the Local Probability Learning Stage
3.2.4 Algorithm for Inference with the Bayesian Network
3.3 Modules of the Experimental Program and Their Functions
3.3.1 Program Modules and Their Relationships
3.3.2 Experimental Environment
3.3.3 Implementation of the Program
3.4 Summary
Chapter 4 Comparison and Analysis of Results
4.1 Results of Bayesian Network Learning
4.2 Results of Multiple Linear Regression Analysis
4.2.1 Introduction to Regression Analysis
4.2.2 Multiple Linear Regression and Its Standard Output
4.2.3 Results of Multiple Linear Regression Analysis for This Case
Tianjin Normal University Master's Thesis: Research on Data Mining Using Bayesian Network
4.3 Comparison and Analysis of the Two Methods
4.4 Summary
Chapter 5 Conclusion
5.1 Summary
5.2 Outlook
References
Acknowledgements
Academic Papers Published During the Master's Program