Master's Thesis
RESEARCH ON CHINESE SPOKEN TERM
DETECTION BASED ON DEEP LEARNING
Wang Zhaosong (王朝松)
Harbin Institute of Technology
June 2015
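Chinese Library Classification: TP391.42
Universal Decimal Classification: 681.3
University Code: 10213
Confidentiality Level: Public
Master of Engineering Thesis
Research on Chinese Spoken Term Detection Based on Deep Learning
Candidate: Wang Zhaosong (王朝松)
Supervisor: Prof. Han Jiqing (韩纪庆)
Degree Applied for: Master of Engineering
Discipline: Computer Science and Technology
Affiliation: School of Computer Science and Technology
Date of Defence: June 2015
Degree-Conferring Institution: Harbin Institute of Technology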
Classified Index: TP391.42
U.D.C: 681.3
Dissertation for the Master Degree in Engineering
RESEARCH ON CHINESE SPOKEN TERM
DETECTION BASED ON DEEP LEARNING
Candidate: Wang Zhaosong
Supervisor: Prof. Han Jiqing
Academic Degree Applied for: Master of Engineering
Speciality: Computer Science and Technology
Affiliation: School of Computer Science and Technology
Date of Defence: June, 2015
Degree-Conferring-Institution: Harbin Institute of Technology
Harbin Institute of Technology Master of Engineering Thesis
Abstract (Chinese)
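Spoken term detection is a technique for detecting a predefined set of keywords
in a continuous speech stream. One mainstream approach is based on a large
vocabulary continuous speech recognition (LVCSR) system. A recognizer-based
spoken term detection system has two main stages, a decoding stage and a
detection stage, and the performance of the speech recognizer strongly affects
detection performance.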
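Traditional spoken term detection uses a GMM-HMM model, which combines a GMM
(Gaussian Mixture Model) with an HMM (Hidden Markov Model), as the acoustic
model of the LVCSR system, and its recognition rate is not high. In recent
years deep learning has had a major impact on speech recognition, and DNN-HMM
acoustic models, in which a DNN (Deep Neural Network) replaces the GMM, have
been studied in depth. This thesis studies replacing the GMM-HMM acoustic model
with a DNN-HMM acoustic model in spoken term detection and builds a spoken term
detection system on top of the DNN-HMM acoustic model. Experimental results
show that the recognizer based on the DNN-HMM model achieves a higher
recognition rate than the one based on the GMM-HMM model, and that the spoken
term detection system also performs better.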
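To address the lack of a close connection between the two stages of LVCSR-based
spoken term detection, this thesis builds on the DNN-HMM acoustic model and
studies assigning larger weights to keywords during acoustic model training, in
order to improve the model's ability to model keywords. To this end, it
considers training with keyword-focused non-uniform criteria in discriminative
training. It first studies a non-uniform MCE (Minimum Classification Error)
criterion that emphasizes keywords, and then uses this criterion to optimize
the acoustic model parameters. The keyword weight coefficient in the
non-uniform MCE criterion has some effect on the recognition results, and the
drawback of a fixed weight coefficient is that a large value may lead to
over-training. The thesis therefore studies using the AdaBoost (Adaptive
Boosting) algorithm to adjust the weight coefficients dynamically during
non-uniform MCE training; AdaBoost can avoid the over-training problem of the
non-uniform MCE criterion and improve the generalization ability of the model.
Experimental results show that the non-uniform MCE criterion based on AdaBoost
gives better spoken term detection performance. In addition, the thesis studies
a non-uniform sMBR (state-level Minimum Bayes Risk) criterion; experimental
results show that the system based on the non-uniform sMBR method outperforms
the baseline system. Finally, the two non-uniform criteria are summarized and
compared.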
Keywords: speech recognition; spoken term detection; deep learning; discriminative training; minimum classification error
Abstract
Spoken term detection (STD) is the task of automatically detecting a
predefined set of keywords in continuous speech. One mainstream approach to
STD is based on LVCSR (Large Vocabulary Continuous Speech Recognition).
LVCSR-based STD usually has two stages, a recognition stage and a detection
stage, and the performance of the speech recognizer has a great influence on
STD performance.
Traditional STD systems use a GMM-HMM model, which combines a GMM (Gaussian
Mixture Model) with an HMM (Hidden Markov Model), as the acoustic model of the
LVCSR system. With the great impact of deep learning on speech recognition,
researchers have begun to replace the GMM with a DNN (Deep Neural Network) in
the acoustic model, and results show that the DNN-HMM model greatly improves
speech recognition accuracy compared with the GMM-HMM model. We therefore use
a DNN-HMM model as the acoustic model of the LVCSR system and build our STD
system on it. Our experimental results show that, compared with the GMM-HMM
acoustic model, the DNN-HMM acoustic model not only yields better recognition
accuracy but also greatly improves STD performance.
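As an illustration of how a DNN can take the place of a GMM inside an HMM recognizer, the sketch below shows the conversion commonly used in hybrid systems: the DNN outputs state posteriors p(s|x), and dividing by the state priors p(s) gives likelihoods up to a factor that is constant for the frame. The abstract does not say which recipe the thesis follows, so this is an assumption, and the function name and the toy numbers are hypothetical.

```python
import math

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Turn DNN state log-posteriors log p(s|x) into scaled log-likelihoods.

    Bayes' rule gives p(x|s) = p(s|x) * p(x) / p(s). The term p(x) is the same
    for every state in a given frame, so it can be dropped when comparing states.
    """
    return [lp - lq for lp, lq in zip(log_posteriors, log_priors)]

# Toy frame with three HMM states (illustrative values, not thesis data).
posteriors = [0.7, 0.2, 0.1]   # assumed DNN softmax output for one frame
priors = [0.5, 0.3, 0.2]       # assumed state priors, e.g. from alignment counts

scores = scaled_log_likelihoods(
    [math.log(p) for p in posteriors],
    [math.log(q) for q in priors],
)
best_state = max(range(len(scores)), key=scores.__getitem__)
```

In a full recognizer these per-frame scores would be passed to the HMM decoder in place of the GMM log-likelihoods.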
Because the two stages of LVCSR-based STD lack a close connection, we study
giving keywords greater weight during acoustic model training to improve the
model's ability to model keywords. Discriminative training considers model
training and speech recognition results together: it defines an objective
function from the recognition results and optimizes the model parameters
according to that objective function. We therefore use discriminative training
to establish a connection between the acoustic model and spoken term detection.
The basic idea is to put more weight on keywords during discriminative
training, which is called discriminative training based on non-uniform
criteria. We first consider the non-uniform MCE (Minimum Classification Error)
criterion, which puts more weight on keywords than traditional MCE training
does. After defining the non-uniform MCE objective function, we use it to
optimize the parameters of the acoustic model. Experiments show that the
non-uniform MCE method improves the performance of our STD system. During
non-uniform MCE training, imposing the same weight in every optimization
iteration can lead to severe over-training when the weight is fairly large. We
therefore use the adaptive boosting (AdaBoost) technique to adjust the weights
dynamically during training, which addresses the over-training problem of
non-uniform MCE training. Experimental results show that non-uniform MCE
training based on AdaBoost gives better performance. In addition, we study the
non-uniform sMBR (state-level Minimum Bayes Risk) criterion; non-uniform sMBR
training also improves performance. Finally, we summarize and compare the two
non-uniform discriminative training methods.
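The idea of weighting keywords more heavily in an MCE objective can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's formulation: it uses the textbook smoothed MCE loss (a sigmoid of a misclassification measure) and a single fixed keyword weight. The names `mce_loss`, `non_uniform_mce`, `gamma`, and `keyword_weight` are hypothetical.

```python
import math

def mce_loss(scores, correct, gamma=1.0):
    """Smoothed MCE loss for one training token.

    scores  : discriminant scores, one per competing class (assumed given)
    correct : index of the reference class
    """
    competitors = [s for j, s in enumerate(scores) if j != correct]
    m = max(competitors)
    # Soft maximum over competitors: log of the mean of exp(score).
    anti = m + math.log(sum(math.exp(s - m) for s in competitors) / len(competitors))
    d = anti - scores[correct]  # misclassification measure, positive means an error
    return 1.0 / (1.0 + math.exp(-gamma * d))

def non_uniform_mce(tokens, keyword_weight=2.0):
    """Sum of MCE losses in which keyword tokens count keyword_weight times.

    tokens : iterable of (scores, correct, is_keyword) triples
    """
    total = 0.0
    for scores, correct, is_keyword in tokens:
        weight = keyword_weight if is_keyword else 1.0
        total += weight * mce_loss(scores, correct)
    return total

# Toy data: one keyword token and one non-keyword token (illustrative only).
tokens = [([2.0, 0.0], 0, True), ([0.0, 1.0], 0, False)]
uniform = non_uniform_mce(tokens, keyword_weight=1.0)
weighted = non_uniform_mce(tokens, keyword_weight=3.0)
```

With `keyword_weight=1.0` the objective reduces to ordinary MCE. A fixed large weight is the setting the abstract associates with over-training, which is the motivation it gives for adjusting the weights dynamically instead.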
Keywords: speech recognition, spoken term detection, deep learning,
discriminative training, minimum classification error
Contents
Abstract (Chinese) ...... I
Abstract ...... II
Contents ...... IV
Chapter 1 Introduction ...... 1
1.1 Background, Purpose, and Significance of the Research ...... 1
1.2 Related Work in China and Abroad and Analysis ...... 3
1.2.1 Spoken Term Detection Based on Template Matching ...... 4
1.2.2 Spoken Term Detection Based on the Keyword/Filler Model ...... 5
1.2.3 Spoken Term Detection Based on LVCSR ...... 5
1.2.4 Spoken Term Detection in Low-Resource Settings ...... 8
1.2.5 Summary and Analysis ...... 8
1.3 Main Research Content ...... 9
Chapter 2 Basic Components of a Spoken Term Detection System ...... 11
2.1 Introduction ...... 11
2.2 The Speech Recognizer in Spoken Term Detection ...... 12
2.2.1 Front-End Processing ...... 12
2.2.2 Acoustic Model ...... 13
2.2.3 Language Model ...... 16
2.2.4 WFST-Based Speech Recognition ...... 18
2.3 Indexing and Search ...... 19
2.4 Evaluation Metrics for Spoken Term Detection Systems ...... 21
2.5 Experimental Results of the Baseline System ...... 22
2.6 Chapter Summary ...... 22
Chapter 3 Spoken Term Detection Built on the DNN-HMM Acoustic Model ...... 23
3.1 Introduction ...... 23
3.2 DNN-HMM Acoustic Model ...... 24
3.2.1 Structure of the DNN-HMM Acoustic Model ...... 24
3.2.2 Decoding with the DNN-HMM Acoustic Model ...... 26
3.3 Main Training Procedure of the DNN-HMM Model ...... 27
3.3.1 Pre-training of the DNN-HMM Model ...... 28
3.3.2 Parameter Fine-tuning of the DNN-HMM Model ...... 31
3.4 Nonlinear Units in the DNN-HMM Acoustic Model ...... 32
3.4.1 sigmoid Activation Unit ...... 32
3.4.2 ReLU Unit ...... 33
3.4.3 p-norm Unit ...... 34
3.5 Experimental Results and Analysis ...... 34
3.6 Chapter Summary ...... 35
Chapter 4 Discriminative Training of the Keyword Acoustic Model ...... 37
4.1 Introduction ...... 37
4.2 Keyword Acoustic Model Based on the Non-uniform MCE Criterion ...... 38
4.2.1 Discriminative Training Based on the MCE Criterion ...... 38
4.2.2 Non-uniform MCE Criterion for Spoken Term Detection ...... 39
4.2.3 Non-uniform MCE Criterion Based on the AdaBoost Algorithm ...... 41
4.3 Keyword Acoustic Model Based on the Non-uniform sMBR Criterion ...... 43
4.4 Practical Considerations in Model Training ...... 45
4.4.1 Lattice Generation ...... 45
4.4.2 Learning Rate Adjustment ...... 47
4.5 Experimental Results and Analysis ...... 47
4.5.1 Experimental Results and Analysis for the Non-uniform MCE Criterion ...... 47
4.5.2 Experimental Results and Analysis for the Non-uniform sMBR Criterion ...... 49
4.6 Chapter Summary ...... 50
Conclusions ...... 52
References ...... 54
Papers Published During the Master's Program ...... 59
Harbin Institute of Technology Thesis Declaration of Originality and Usage Authorization ...... 60
Acknowledgements ...... 61