Research on Chinese Spoken Term Detection Based on Deep Learning

Master's Thesis

RESEARCH ON CHINESE SPOKEN TERM DETECTION BASED ON DEEP LEARNING

Wang Zhaosong

Harbin Institute of Technology
June 2015

Classified Index: TP391.42
U.D.C: 681.3
University Code: 10213
Security Classification: Public

Dissertation for the Master Degree in Engineering

RESEARCH ON CHINESE SPOKEN TERM DETECTION BASED ON DEEP LEARNING

Candidate: Wang Zhaosong
Supervisor: Prof. Han Jiqing
Academic Degree Applied for: Master of Engineering
Speciality: Computer Science and Technology
Affiliation: School of Computer Science and Technology
Date of Defence: June, 2015
Degree-Conferring-Institution: Harbin Institute of Technology

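Harbin Institute of Technology, Thesis for the Master's Degree in Engineering

Abstract (in Chinese)

Spoken keyword detection is a technique for detecting a predefined set of keywords in a continuous speech stream. One mainstream approach is based on a Large Vocabulary Continuous Speech Recognition (LVCSR) recognizer. A recognizer-based keyword detection system has two main stages, a decoding stage and a detection stage, and the performance of the recognizer strongly affects the performance of keyword detection. Traditional keyword detection uses a GMM-HMM model, which combines a GMM (Gaussian Mixture Model) with an HMM (Hidden Markov Model), as the acoustic model of the LVCSR system, and its recognition accuracy is not high. In recent years deep learning has had a major impact on speech recognition, and DNN-HMM acoustic models, in which a DNN (Deep Neural Network) replaces the GMM, have been studied in depth. This thesis studies replacing the GMM-HMM acoustic model with a DNN-HMM acoustic model in keyword detection and builds a keyword detection system on top of the DNN-HMM acoustic model. Experimental results show that the recognizer based on the DNN-HMM model achieves higher recognition accuracy than the recognizer based on the GMM-HMM model, and that the keyword detection system also performs better.

To address the lack of close coupling between the two stages of LVCSR-based keyword detection, this thesis builds on the DNN-HMM acoustic model and studies assigning larger weights to keywords during acoustic model training, in order to improve the model's ability to model keywords. Accordingly, discriminative training is carried out with non-uniform criteria that emphasize keywords. The thesis first studies a non-uniform MCE (Minimum Classification Error) criterion that emphasizes keywords and then uses this criterion to optimize the acoustic model parameters. The keyword weighting coefficient in the non-uniform MCE criterion has some influence on the recognition results, and the drawback of a fixed coefficient is that a large value may lead to over-training. The thesis therefore studies using the AdaBoost (Adaptive Boosting) algorithm to adjust the weighting coefficients dynamically during non-uniform MCE training. AdaBoost can avoid the over-training problem of the non-uniform MCE criterion and improve the generalization ability of the model. Experimental results show that keyword detection with the AdaBoost-based non-uniform MCE criterion performs better. In addition, the thesis studies a non-uniform sMBR (state-level Minimum Bayes Risk) criterion; experimental results show that the system based on the non-uniform sMBR method outperforms the baseline system. Finally, the two non-uniform criteria are summarized and compared.

Keywords: speech recognition; keyword detection; deep learning; discriminative training; minimum classification error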

Abstract

Spoken term detection (STD) is the task of automatically detecting a set of keywords in continuous speech. One mainstream method of STD is based on LVCSR (Large Vocabulary Continuous Speech Recognition). LVCSR-based STD usually uses a two-stage model, i.e., a recognition stage and a detection stage, and the performance of speech recognition has a great influence on STD. Traditional spoken term detection systems use a GMM-HMM model, which consists of a GMM (Gaussian Mixture Model) and an HMM (Hidden Markov Model), as the acoustic model of LVCSR. However, with the great impact of deep learning on speech recognition, researchers have begun to replace the GMM with a DNN (Deep Neural Network) in the acoustic model, and results show that DNN-HMM greatly improves speech recognition accuracy compared to the GMM-HMM model. Therefore, we use a DNN-HMM model as the acoustic model of LVCSR in STD and use it to build our STD system. Our experimental results show that, compared to the GMM-HMM acoustic model, our DNN-HMM acoustic model not only gives better recognition accuracy but also greatly improves STD performance.

Because the two stages of LVCSR-based STD lack a close connection, we study giving keywords greater weight during acoustic model training to improve the model's ability to model keywords. Discriminative training for speech recognition considers model training and speech recognition results together: it establishes an objective function and optimizes the parameters according to that objective function. Therefore, we consider using discriminative training methods to establish a connection between the acoustic model and spoken term detection. The basic idea is to put more weight on keywords during discriminative training, which is called discriminative training based on non-uniform criteria.

We consider a non-uniform MCE (Minimum Classification Error) criterion, which puts more weight on keywords during traditional MCE training. After defining the non-uniform MCE objective function, we use it to optimize the parameters of the acoustic model. Experiments show that the non-uniform MCE method can improve the performance of our STD system. During non-uniform MCE training, if we impose the same weight across the different optimization iterations, it can lead to severe over-training when fairly large weights are used. Therefore, we use the adaptive boosting (AdaBoost) technique to adjust the weights dynamically during training, which can solve the over-training problem of non-uniform MCE training. Experimental results show that non-uniform MCE training based on AdaBoost gives better performance.

In addition, we also study a non-uniform sMBR (state-level Minimum Bayes Risk) criterion. Non-uniform sMBR training can also improve the performance. Finally, we summarize the two non-uniform discriminative training methods and compare them.

Keywords: speech recognition, spoken term detection, deep learning, discriminative training, minimum classification error
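The idea of putting more weight on keywords during MCE training can be illustrated with a small sketch. It uses the textbook smoothed MCE loss (a sigmoid of a misclassification measure) and is not the exact objective defined in the thesis, which this front matter does not give. The function names `mce_loss` and `non_uniform_mce_loss`, the parameters `eta`, `gamma` and `keyword_weight`, and the token-level granularity are all assumptions made for illustration.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mce_loss(scores, correct, eta=1.0, gamma=1.0):
    """Smoothed MCE loss for one token (textbook form, assumed here).

    scores  : discriminant scores g_j, one per class (e.g. log-likelihoods)
    correct : index of the reference class
    """
    competitors = [s for j, s in enumerate(scores) if j != correct]
    # Soft maximum over competitors: (1/eta) * log(mean(exp(eta * g_j))),
    # computed with the max subtracted for numerical stability.
    m = max(competitors)
    mean_exp = sum(math.exp(eta * (s - m)) for s in competitors) / len(competitors)
    anti_score = m + math.log(mean_exp) / eta
    # Misclassification measure: positive when competitors beat the reference.
    d = -scores[correct] + anti_score
    return _sigmoid(gamma * d)


def non_uniform_mce_loss(tokens, keyword_weight=2.0):
    """Sum of MCE losses where keyword tokens count keyword_weight times.

    tokens : iterable of (scores, correct, is_keyword) tuples (hypothetical layout)
    """
    total = 0.0
    for scores, correct, is_keyword in tokens:
        weight = keyword_weight if is_keyword else 1.0
        total += weight * mce_loss(scores, correct)
    return total
```

With `keyword_weight=1.0` this reduces to ordinary (uniform) MCE. The abstract's AdaBoost variant would change the weights from one training iteration to the next instead of keeping `keyword_weight` fixed; the update rule is not stated here, so it is left out of the sketch.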

Contents

Abstract (in Chinese)
Abstract (in English)
Contents
Chapter 1 Introduction
  1.1 Background, purpose, and significance of the research
  1.2 State of research in China and abroad, with analysis
    1.2.1 Keyword detection based on template matching
    1.2.2 Keyword detection based on the Keyword/Filler model
    1.2.3 Keyword detection based on LVCSR
    1.2.4 Keyword detection in low-resource settings
    1.2.5 Summary and analysis
  1.3 Main research content
Chapter 2 Basic Components of a Keyword Detection System
  2.1 Introduction
  2.2 The speech recognizer in keyword detection
    2.2.1 Front-end processing
    2.2.2 Acoustic model
    2.2.3 Language model
    2.2.4 WFST-based speech recognition
  2.3 Index building and search
  2.4 Evaluation metrics for keyword detection systems
  2.5 Experimental results of the baseline system
  2.6 Chapter summary
Chapter 3 Keyword Detection Built on the DNN-HMM Acoustic Model
  3.1 Introduction
  3.2 DNN-HMM acoustic model
    3.2.1 Structure of the DNN-HMM acoustic model
    3.2.2 Decoding with the DNN-HMM acoustic model
  3.3 Main training procedure of the DNN-HMM model
    3.3.1 Pre-training of the DNN-HMM model
    3.3.2 Fine-tuning of the DNN-HMM model parameters
  3.4 Nonlinear units in the DNN-HMM acoustic model
    3.4.1 Sigmoid activation unit
    3.4.2 ReLU unit
    3.4.3 p-norm unit
  3.5 Experimental results and analysis
  3.6 Chapter summary
Chapter 4 Discriminative Training of Keyword Acoustic Models
  4.1 Introduction
  4.2 Keyword acoustic model based on the non-uniform MCE criterion
    4.2.1 Discriminative training based on the MCE criterion
    4.2.2 Non-uniform MCE criterion for keyword detection
    4.2.3 Non-uniform MCE criterion based on the AdaBoost algorithm
  4.3 Keyword acoustic model based on the non-uniform sMBR criterion
  4.4 Practical considerations in model training
    4.4.1 Lattice generation
    4.4.2 Learning rate adjustment
  4.5 Experimental results and analysis
    4.5.1 Experimental results and analysis for the non-uniform MCE criterion
    4.5.2 Experimental results and analysis for the non-uniform sMBR criterion
  4.6 Chapter summary
Conclusion
References
Papers published during the master's degree program
Harbin Institute of Technology statement of thesis originality and usage authorization
Acknowledgements
