FD.io – How to Push
Extreme Limits of
Performance and Scale
with Vector Packet
Processing Technology
Keith Burns
DEVNET-1221
© 2017 Cisco and/or its affiliates. All rights reserved. Cisco Public
Agenda
• Introduction to VPP
• Packet Processing on Commodity Hardware
• Scalar and Vector Packet Processing
• Graph Scheduler
• Exploiting Multiple Cores
• Binary APIs
• Performance Data
What is Vector Packet Processing?
• High performance packet-processing stack for commodity CPUs
• x86_64, i686, ppc-64-BE, aarch64-LE
• Endian clean, 32 / 64-bit clean
• Linux user-mode process
• Leverage DPDK, widely-available kernel modules (uio, igb_uio, uio_pci_generic)
• Linux user-space
• Same image works in a VM, over a host kernel, in an LXC
• Physical NICs via PCI direct-map
• Active development since 2002
• Ships as part of Cisco embedded and server products, in volume
Packet Processing (PP) on Commodity Hardware (CH)
• Packet-processing: load/store-intensive, big tables, N-tuple problems
• PP on CH: significantly different than PP on NPUs
• NPU: e.g. 2048 outstanding prefetches, SRAM
• Commodity HW: 8 → 16 outstanding prefetches, DDRn
• NPU: thousands of PPEs processing single packets
• Commodity HW: tens of general-purpose cores
• NPU: work distributor, TCAM, specialized counter support, QoS / queueing support
• VPP solves these problems—or a useful subset—on commodity hardware
• Structure the computation for CH’s convenience
Scalar Packet Processing
• A fancy name for processing one packet at a time
• Traditional, straightforward implementation scheme
• Interrupt, a calls b calls c … return return return RFI
• Considerable stack depth
• Issue #1: thrashing the I-cache
• When code path length exceeds the primary I-cache size, each packet incurs an identical set of I-cache misses
• Only workaround: bigger caches
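A minimal sketch of the scalar scheme described above, for illustration only. The packet type, the three stage functions, and the `stage_calls` counter are hypothetical names, not VPP code. Each packet runs the full a → b → c call chain before the next packet starts, so a code path larger than the I-cache is re-fetched for every packet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packet: just enough fields to give each stage some work. */
typedef struct {
    uint8_t  ttl;
    uint32_t dst;
    int      out_port;
} packet_t;

/* Counts stage invocations so the call pattern can be observed. */
static unsigned long stage_calls;

/* Stage c: choose an output port (toy rule: low two bits of dst). */
static void stage_c_output(packet_t *p)
{
    stage_calls++;
    p->out_port = (int)(p->dst & 0x3u);
}

/* Stage b: decrement TTL, then call the next stage. */
static void stage_b_forward(packet_t *p)
{
    stage_calls++;
    p->ttl--;
    stage_c_output(p);
}

/* Stage a: entry point, called once per received packet. */
static void stage_a_input(packet_t *p)
{
    stage_calls++;
    stage_b_forward(p);
}

/* Scalar processing: one packet at a time through the whole call chain.
 * The stack unwinds completely before the next packet is touched. */
static void scalar_process(packet_t *pkts, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        stage_a_input(&pkts[i]);
}
```

Calling `scalar_process(pkts, n)` makes 3 × n stage calls, interleaved as a, b, c, a, b, c, and so on, which is the access pattern that thrashes the I-cache once the combined stage code no longer fits.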
Scalar Packet Processing, cont’d
• Dependent read latency on big forwarding tables
• Example: 4 x 8 mtrie walk. Assume tables do not fit in cache.
• Lookup 5.6.7.8: read root_ply[5], then ply_2[6], then ply_3[7], then ply_4[8]
• Big tables: reads stall for ~170 clocks
• Few opportunities to mitigate (“prefetch around”) read latency stalls
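The 4 x 8 walk above can be sketched as follows. This is a simplified illustration, not VPP's mtrie implementation: `ply_t`, `mtrie_add_host`, `mtrie_lookup`, and `NO_ROUTE` are hypothetical names, and only /32 host routes are supported. The point is the dependency chain: the address of each read comes from the result of the previous one, so the four reads cannot be issued in parallel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PLY_SLOTS 256   /* 8 address bits per level */
#define NO_ROUTE  0u    /* hypothetical "no match" result */

/* One ply (level) of the trie. Interior plies use next[],
 * the fourth ply uses leaf[] to hold the lookup result. */
typedef struct ply {
    struct ply *next[PLY_SLOTS];
    uint32_t    leaf[PLY_SLOTS];
} ply_t;

/* Allocate a zeroed ply (freeing is omitted in this sketch). */
static ply_t *ply_new(void)
{
    ply_t *p = calloc(1, sizeof *p);
    assert(p != NULL);
    return p;
}

/* Install a /32 host route: addr is in host byte order, e.g.
 * 5.6.7.8 is 0x05060708. Prefixes shorter than /32 are not handled. */
static void mtrie_add_host(ply_t *root, uint32_t addr, uint32_t result)
{
    ply_t *p = root;
    for (int level = 0; level < 3; level++) {
        uint8_t idx = (uint8_t)(addr >> (24 - 8 * level));
        if (p->next[idx] == NULL)
            p->next[idx] = ply_new();
        p = p->next[idx];
    }
    p->leaf[addr & 0xffu] = result;
}

/* Lookup: four dependent reads. For 5.6.7.8 these are
 * root->next[5], ply_2->next[6], ply_3->next[7], ply_4->leaf[8]. */
static uint32_t mtrie_lookup(const ply_t *root, uint32_t addr)
{
    const ply_t *p = root;
    for (int level = 0; level < 3; level++) {
        /* The next address is unknown until this load completes. */
        p = p->next[(uint8_t)(addr >> (24 - 8 * level))];
        if (p == NULL)
            return NO_ROUTE;
    }
    return p->leaf[addr & 0xffu];
}
```

If none of the plies are in cache and each read stalls for roughly 170 clocks, a single lookup costs on the order of 4 × 170 = 680 clocks. With only one packet in hand, there is little other work to overlap with those stalls.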