Multicore DSP
From Algorithms to Real-time Implementation
on the TMS320C66x SoC
Naim Dahnoun
University of Bristol
UK
This edition first published 2018
© 2018 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice
on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Naim Dahnoun to be identified as the author of this work has been asserted in accordance with law.
Registered Office(s)
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at
www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in
standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or
warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all
warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No
warranty may be created or extended by sales representatives, written sales materials or promotional statements for this
work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of
further information does not mean that the publisher and authors endorse the information or services the organization,
website, or product may provide or recommendations it may make. This work is sold with the understanding that the
publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be
suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that
websites listed in this work may have changed or disappeared between when this work was written and when it is read.
Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not
limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication data applied for
ISBN: 9781119003823
Cover design by Wiley
Cover image: © matejmo/Gettyimages
Set in 10/12pt Warnock by SPi Global, Pondicherry, India
I dedicate this book to my children
Zahra, Yasmin and Riyad
and in memory of my parents
Contents
Preface
Acknowledgements
Foreword
About the Companion Website
1 Introduction to DSP
1.1 Introduction
1.2 Multicore processors
1.2.1 Can any algorithm benefit from a multicore processor?
1.2.2 How many cores do I need for my application?
1.3 Key applications of high-performance multicore devices
1.4 FPGAs, Multicore DSPs, GPUs and Multicore CPUs
1.5 Challenges faced for programming a multicore processor
1.6 Texas Instruments DSP roadmap
1.7 Conclusion
References
2 The TMS320C66x architecture overview
2.1 Overview
2.2 The CPU
2.2.1 Cross paths
2.2.1.1 Data cross paths
2.2.1.2 Address cross paths
2.2.2 Register file A and file B
2.2.2.1 Operands
2.2.3 Functional units
2.2.3.1 Condition registers
2.2.3.2 .L units
2.2.3.3 .M units
2.2.3.4 .S units
2.2.3.5 .D units
2.3 Single instruction, multiple data (SIMD) instructions
2.3.1 Control registers
2.4 The KeyStone memory
2.4.1 Using the internal memory
2.4.2 Memory protection and extension
2.4.3 Memory throughput
2.5 Peripherals
2.5.1 Navigator
2.5.2 Enhanced Direct Memory Access (EDMA) Controller
2.5.3 Universal Asynchronous Receiver/Transmitter (UART)
2.5.4 General purpose input–output (GPIO)
2.5.5 Internal timers
2.6 Conclusion
References
3 Software development tools and the TMS320C6678 EVM
3.1 Introduction
3.2 Software development tools
3.2.1 Compiler
3.2.2 Assembler
3.2.3 Linker
3.2.3.1 Linker command file
3.2.4 Compile, assemble and link
3.2.5 Using the Real-Time Software Components (RTSC) tools
3.2.5.1 Platform update using the XDCtools
3.2.6 KeyStone Multicore Software Development Kit
3.3 Hardware development tools
3.3.1 EVM features
3.4 Laboratory experiments based on the C6678 EVM: introduction to Code Composer Studio (CCS)
3.4.1 Software and hardware requirements
3.4.1.1 Key features
3.4.1.2 Download sites
3.4.2 Laboratory experiments with the CCS6
3.4.2.1 Introduction to CCS
3.4.2.2 Implementation of a DOTP algorithm
3.4.3 Profiling using the clock
3.4.4 Considerations when measuring time
3.5 Loading different applications to different cores
3.6 Conclusion
References
4 Numerical issues
4.1 Introduction
4.2 Fixed- and floating-point representations
4.2.1 Fixed-point arithmetic
4.2.1.1 Unsigned integer
4.2.1.2 Signed integer
4.2.1.3 Fractional numbers
4.2.2 Floating-point arithmetic
4.2.2.1 Special numbers for the 32-bit and 64-bit floating-point formats
4.3 Dynamic range and accuracy
4.4 Laboratory exercise
4.5 Conclusion
References
5 Software optimisation
5.1 Introduction
5.2 Hindrance to software scalability for a multicore processor
5.3 Single-core code optimisation procedure
5.3.1 The C compiler options
5.4 Interfacing C with intrinsics, linear assembly and assembly
5.4.1 Intrinsics
5.4.2 Interfacing C and assembly
5.5 Assembly optimisation
5.5.1 Parallel instructions
5.5.2 Removing the NOPs
5.5.3 Loop unrolling
5.5.4 Double-Word Access
5.5.5 Optimisation summary
5.6 Software pipelining
5.6.1 Software-pipelining procedure
5.6.1.1 Writing linear assembly code
5.6.1.2 Creating a dependency graph
5.6.1.3 Resource allocation
5.6.1.4 Scheduling table
5.6.1.5 Generating assembly code
5.7 Linear assembly
5.7.1 Hand optimisation of the dotp function using linear assembly
5.8 Avoiding memory banks
5.9 Optimisation using the tools
5.10 Laboratory experiments
5.11 Conclusion
References
6 The TMS320C66x interrupts
6.1 Introduction
6.1.1 Chip-level interrupt controller
6.2 The interrupt controller
6.3 Laboratory experiment
6.3.1 Experiment 1: Using the GPIOs to trigger some functions
6.3.2 Experiment 2: Using the console to trigger an interrupt
6.4 Conclusion
References
7 Real-time operating system: TI-RTOS
7.1 Introduction
7.2 TI-RTOS
7.3 Real-time scheduling
7.3.1 Hardware interrupts (Hwis)
7.3.1.1 Setting an Hwi
7.3.1.2 Hwi hook functions
7.3.2 Software interrupts (Swis), including clock, periodic or single-shot functions
7.3.3 Tasks
7.3.3.1 Task hook functions
7.3.4 Idle functions
7.3.5 Clock functions
7.3.6 Timer functions
7.3.7 Synchronisation
7.3.7.1 Semaphores
7.3.7.2 Semaphore_pend
7.3.7.3 Semaphore_post
7.3.7.4 How to configure the semaphores
7.3.8 Events
7.3.9 Summary
7.4 Dynamic memory management
7.4.1 Stack allocation
7.4.2 Heap allocation
7.4.3 Heap implementation
7.4.3.1 HeapMin implementation
7.4.3.2 HeapMem implementation
7.4.3.3 HeapBuf implementation
7.4.3.4 HeapMultiBuf implementation
7.5 Laboratory experiments
7.5.1 Lab 1: Manual setup of the clock (part 1)
7.5.2 Lab 2: Manual setup of the clock (part 2)
7.5.3 Lab 3: Using Hwis, Swis, tasks and clocks
7.5.4 Lab 4: Using events
7.5.5 Lab 5: Using the heaps
7.6 Conclusion
References
References (further reading)
8 Enhanced Direct Memory Access (EDMA3) controller
8.1 Introduction
8.2 Type of DMAs available
8.3 EDMA controllers architecture
8.3.1 The EDMA3 Channel Controller (EDMA3CC)
8.3.2 The EDMA3 transfer controller (EDMA3TC)
8.3.3 EDMA prioritisation
8.3.3.1 Trigger source priority
8.3.3.2 Channel priority
8.3.3.3 Dequeue priority
8.3.3.4 System (transfer controller) priority
8.4 Parameter RAM (PaRAM)
8.4.1 Channel options parameter (OPT)
8.5 Transfer synchronisation dimensions
8.5.1 A – Synchronisation
8.5.2 AB – Synchronisation
8.6 Simple EDMA transfer
8.7 Chaining EDMA transfers
8.8 Linked EDMAs
8.9 Laboratory experiments