Introduction to
AUDIO ANALYSIS:
A MATLAB Approach
THEODOROS GIANNAKOPOULOS
AGGELOS PIKRAKIS
Amsterdam • Boston • Heidelberg • London
New York • Oxford • Paris • San Diego
San Francisco • Singapore • Sydney • Tokyo
Academic Press is an imprint of Elsevier
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
225 Wyman Street, Waltham, MA 02451, USA
525 B Street, Suite 1800, San Diego, CA 92101-4495, USA
First edition 2014
Copyright © 2014 Elsevier Ltd. All rights reserved.
MATLAB® is a registered trademark of The MathWorks, Inc.
For MATLAB and Simulink product information, please contact:
The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA, 01760-2098 USA
Tel: 508-647-7000
Fax: 508-647-7001
E-mail: info@mathworks.com
Web: mathworks.com
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior
written permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in
Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email permissions@elsevier.com.
Alternatively you can submit your request online by visiting the Elsevier website at
http://elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material.
Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons or property as
a matter of products liability, negligence or otherwise, or from any use or operation of any methods,
products, instructions or ideas contained in the material herein. Because of rapid advances in the
medical sciences, in particular, independent verification of diagnoses and drug dosages should be
made.
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
ISBN: 978-0-08-099388-1
For information on all Academic Press publications
visit our web site at books.elsevier.com
Printed and bound in the United States of America
14 15 16 17 18 10 9 8 7 6 5 4 3 2 1
PREFACE
This book attempts to provide a gentle introduction to the field of audio
analysis using the MATLAB programming environment as the vehicle
of presentation. Audio analysis is a multidisciplinary field, which requires
the reader to be familiar with concepts from diverse research disciplines,
including digital signal processing and machine learning. As a result, it is a
great challenge to write a book that can provide sufficient coverage of the
important concepts in the field of audio analysis and, at the same time, be
accessible to readers who do not necessarily possess the required scientific
background.
Our main goal has been to provide a standalone introduction, involving
a balanced presentation of theoretical descriptions and reproducible
MATLAB examples. Our philosophy is that readers with diverse scientific
backgrounds can gain an understanding of the field of audio analysis, if
they are provided with basic theory, in conjunction with reproducible
experiments that can help them deal with the theory from a more practical
perspective. In addition, this type of approach allows the reader to acquire
certain technical skills that are useful in the context of developing real-world
audio analysis applications. To this end, we also provide an accompanying
software library, which can be downloaded from the companion site
and includes the MATLAB functions and related data files that have been
used throughout the text.
We believe that this book is suitable for students, researchers, and
professionals alike, who need to develop practical skills, along with a basic
understanding of the field. The book does not assume previous knowledge
of digital signal processing or machine learning concepts, as it provides
introductory material on the necessary topics from both disciplines. We
expect that, after reading this book, the reader will feel comfortable with
various key processing stages of the audio analysis chain, including audio
content creation, representation, feature extraction, classification, segmentation,
sequence alignment, and temporal modeling. Furthermore, we believe
that working through the presented case studies will provide further insight into
the development of real-world applications.
This book is the product of several years of teaching and research and
reflects our teaching philosophy, which has been shaped by our interaction
with our students and colleagues, to whom we are both grateful. We
hope that the book will prove useful to all readers who are making their first steps
in the field of audio analysis. Although we have made an effort to eliminate
errors during the writing stage, we encourage the reader to contact us with
any comments and suggestions for improvement, in either the text or the
accompanying software library.
Theodoros Giannakopoulos and Aggelos Pikrakis
Athens, 2013
For access to the software library and other supporting materials, please visit
the companion website at: http://booksite.elsevier.com/9780080993881
ACKNOWLEDGMENTS
This book has improved thanks to the support of a number of colleagues,
students, and friends, who have provided generous feedback and constructive
comments during the writing process. Above all, T. Giannakopoulos
would like to thank his wife, Maria, and his daughter, Eleni, for always
being cheerful and supportive. A. Pikrakis would like to thank his family
for their patience and generous support and dedicates this book to all the
teachers who have shaped his life.
LIST OF TABLES
Table 1.1  Difficulty Levels of the Exercises
Table 2.1  Execution Times for Different Loading Techniques
Table 2.2  Sound Recording Using the Data Acquisition Toolbox
Table 4.1  Class Descriptions for the Multi-Class Task of Movie Segments
Table 5.1  Classification Tasks and Files
Table 5.2  Row-Wise Normalized Confusion Matrix for the 8-Class Audio Segment Classification Task
Table 5.3  Row-Wise Normalized Confusion Matrix for the Speech vs Music Binary Classification Task
Table 5.4  Row-Wise Normalized Confusion Matrix for the 3-Class Musical Genre Classification Task
Table 5.5  Row-Wise Normalized Confusion Matrix for the Speech vs Non-Speech Classification Task
Table A.1  List of All Functions Included in the MATLAB Audio Analysis Library Provided with the Book
Table A.2  List of Data Files that are Available in the Library that Accompanies the Book
Table B.1  MATLAB Libraries: Audio and Speech
Table B.2  MATLAB Libraries: Pattern Recognition and Machine Learning
Table B.3  A List of Python Packages and Libraries that can be Used for Audio Analysis and Pattern Recognition Applications
Table B.4  Representative Audio Analysis and Pattern Recognition Libraries and Packages Written in C++
Table C.1  A Short List of Available Datasets for Selected Audio Analysis Tasks
LIST OF FIGURES
Figure 2.1  A synthetic audio signal.
Figure 2.2  A STEREO audio signal.
Figure 2.3  Short-term processing of an audio signal.
Figure 3.1  Plots of the magnitude of the spectrum of a signal consisting of three frequencies at 200, 500, and 1200 Hz.
Figure 3.2  A synthetic signal consisting of three frequencies is corrupted by additive noise.
Figure 3.3  The spectrogram of a speech signal.
Figure 3.4  Spectrograms of a synthetic, frequency-modulated signal for three short-term frame lengths.
Figure 3.5  Spectrum representations of (a) an analog signal, (b) a sampled version when the sampling frequency exceeds the Nyquist rate, and (c) a sampled version with insufficient sampling frequency. In the last case, the shifted versions of the analog spectrum are overlapping, hence the aliasing effect.
Figure 3.6  Spectral representations of the same three-tone (200, 500 and 3000 Hz) signal for two different sampling frequencies (8 kHz and 4 kHz).
Figure 3.7  Frequency response of a pre-emphasis filter for a = −0.95.
Figure 3.8  An example of the application of a lowpass filter on a synthetic signal consisting of three tones.
Figure 3.9  Example of a simple speech denoising technique applied on a segment of the diarizationExample.wav file, found in the data folder of the library of the book.
Figure 4.1  Mid-term feature extraction: each mid-term segment is short-term processed and statistics are computed based on the extracted feature sequence.
Figure 4.2  Plotting the results of featureExtractionFile(), using plotFeaturesFile(), for the six feature statistics drawn from the 6th adopted audio feature.
Figure 4.3  Histograms of the standard deviation by mean ratio ($\sigma^2/\mu$) of the short-term energy for two classes: music and speech.
Figure 4.4  Example of a speech segment and the respective sequence of ZCR values.
Figure 4.5  Histograms of the standard deviation of the ZCR for music and speech classes.