RoCEv2 and RDMA standard document.pdf

Supplement to InfiniBand™ Architecture Specification Volume 1 Release 1.2.1
Annex A17: RoCEv2
Annex A17: RoCEv2 (IP Routable RoCE)
A17.1 Introduction
A17.2 Overview
A17.2.1 The InfiniBand Architecture
A17.2.2 RDMA over Converged Ethernet (RoCE)
A17.2.3 The Need for (IP) Routable RDMA
A17.2.4 RoCEv2 (IP Routable RoCE)
A17.3 RoCEv2 Packet Format
A17.3.1 Ethertypes and IP Header Fields
A17.3.1.1 RoCEv2 with IPv4
A17.3.1.1.1 Internet Header Length (IHL)
A17.3.1.1.2 Differentiated Services Codepoint (DSCP)
A17.3.1.1.3 Explicit Congestion Notification (ECN)
A17.3.1.1.4 Total Length
A17.3.1.1.5 Flags
A17.3.1.1.6 Fragment Offset
A17.3.1.1.7 Time to Live
A17.3.1.1.8 Protocol
A17.3.1.1.9 Source and Destination IP Addresses
A17.3.1.2 RoCEv2 with IPv6
A17.3.1.2.1 Differentiated Services Codepoint (DSCP)
A17.3.1.2.2 Explicit Congestion Notification (ECN)
A17.3.1.2.3 Payload Length
A17.3.1.2.4 Next Header
A17.3.1.2.5 Hop Limit
A17.3.1.2.6 Source and Destination IP Addresses
A17.3.2 UDP Header Fields
A17.3.2.1 Source Port
A17.3.2.2 Destination Port
A17.3.2.3 Length
A17.3.2.4 Checksum
A17.3.3 ICRC for RoCEv2 Packets
A17.3.4 RoCEv2 Inbound Packet Validation
A17.4 InfiniBand Transport Protocol Spec Considerations
A17.4.1 RoCEv2 Addressing
A17.4.1.1 L3 Addresses
A17.4.1.2 L2 Addresses
A17.4.2 Address Vector
A17.4.3 Port GID Table
A17.4.4 GRH Checks
A17.4.4.1 IP Version
A17.4.4.2 Address Validation Rules
A17.4.5 Unreliable Datagram (UD)
A17.4.5.1 UD Completion Queue Entries (CQEs)
A17.4.5.2 Scattering of the L3 Header in UD
A17.4.6 IB Raw Datagrams
A17.4.7 InfiniBand Partitioning
A17.4.8 InfiniBand Congestion Control
A17.4.9 InfiniBand QOS
A17.5 InfiniBand Verbs Considerations
A17.5.1 QUERY HCA
A17.5.2 MODIFY HCA
A17.5.3 CREATE/MODIFY/QUERY ADDRESS HANDLE
A17.5.4 MODIFY/QUERY QUEUE PAIR / MODIFY/QUERY XRC TARGET QP
A17.5.5 MODIFY/QUERY EE CONTEXT
A17.5.6 ATTACH/DETACH QP TO/FROM MULTICAST GROUP
A17.5.7 POLL FOR COMPLETION
A17.5.8 GET SPECIAL QP
A17.5.9 POST SEND REQUEST
A17.5.10 UNAFFILIATED ASYNCHRONOUS EVENTS
A17.6 InfiniBand Management Considerations
A17.6.1 Communication Management
A17.6.1.1 REQ Message
A17.6.1.2 REJ Message
A17.6.1.3 LAP Message
A17.6.1.4 APR Message
A17.6.1.5 SAP Message
A17.7 Channel Adapters
A17.7.1 Loading The P_KEY Table
A17.7.2 Locally Routed Packets
A17.7.3 Backpressure and Deadlock Prevention
A17.7.4 Inbound Packet Checking
A17.7.5 Support for QP0
A17.8 Interoperability with RoCE Endnodes
A17.9 RoCEv2 Network Considerations
A17.9.1 Lossless Network
A17.9.2 RoCEv2 QoS
A17.9.3 RoCEv2 Congestion Management
A17.9.4 ECMP for RoCEv2
Supplement to InfiniBand™ Architecture Specification Volume 1 Release 1.2.1
Annex A17: RoCEv2
September 2, 2014

Copyright © 2010 by InfiniBand™ Trade Association. All rights reserved. All trademarks and brands are the property of their respective owners. This document contains information proprietary to the InfiniBand™ Trade Association. Use or disclosure without written permission by an officer of the InfiniBand™ Trade Association is prohibited.
InfiniBand™ Architecture Release 1.2.1
Volume 1 - General Specifications
RoCEv2 (IP Routable RoCE)
September 2, 2014

Table 0 Revision History
Revision    Date
1.0         Sept. 2, 2014    General Release

LEGAL DISCLAIMER
This specification provided “AS IS” and without any warranty of any kind, including, without limitation, any express or implied warranty of non-infringement, merchantability or fitness for a particular purpose. In no event shall IBTA or any member of IBTA be liable for any direct, indirect, special, exemplary, punitive, or consequential damages, including, without limitation, lost profits, even if advised of the possibility of such damages.

InfiniBand℠ Trade Association
ANNEX A17: ROCEV2 (IP ROUTABLE ROCE)

A17.1 INTRODUCTION

This document is an annex to Volume 1 release 1.2.1 of the InfiniBand Architecture, herein referred to as the base specification. This annex is Optional Normative, meaning that implementation of the feature described by this annex is Optional, but if present, the implementation must comply with the compliance statements contained within this annex.

This specification follows the spirit of the RoCE Annex (Annex A16 to the base specification) in defining a new InfiniBand protocol variant that uses an IP network layer (with an IP header instead of InfiniBand's GRH) thus allowing IP routing of its packets.

A17.2 OVERVIEW

A17.2.1 THE INFINIBAND ARCHITECTURE

The InfiniBand Architecture offers a rich set of I/O services based on an RDMA access method and message passing semantics. Included are a variety of transport services, reliable and unreliable, connected and unconnected, support for atomic operations, multicast and others.

InfiniBand defines a layered architecture that specifies the first four layers of the OSI reference stack including the physical, link, network and transport layers as well as an accompanying management framework. In addition, the IB specification defines a software interface and its accompanying verbs which are designed to allow smooth access to the services provided by the InfiniBand Architecture.

A17.2.2 RDMA OVER CONVERGED ETHERNET (ROCE)

RDMA over Converged Ethernet (RoCE) is an InfiniBand Trade Association Standard designed to provide InfiniBand Transport Services on Ethernet Networks⁴. RoCE preserves the InfiniBand Verbs Semantics together with its Transport and Network Protocols and replaces the InfiniBand Link and Physical Layers with those of Ethernet. The network management infrastructure for RoCE is also that of Ethernet.

4. http://www.infinibandta.org/content/pages.php?pg=about_us_RoCE
A17.2.3 THE NEED FOR (IP) ROUTABLE RDMA

Figure 1 InfiniBand and RoCE Protocol Stacks

RoCE packets are regular Ethernet frames⁵ that carry an Ethertype value⁶ allocated by IEEE which indicates that the next header is a RoCE GRH.

Figure 2 RoCE Packet Format

Since RoCE traffic doesn't carry an IP header, it can't be routed across the boundaries of Ethernet L2 Subnets using regular IP routers. Under this scheme, RoCE provides RDMA services for communication within an Ethernet L2 domain.

5. Including VLANs and all other Ethernet header variations as defined by IEEE 802
6. 0x8915
A17.2.4 ROCEV2 (IP ROUTABLE ROCE)

RoCEv2 is a straightforward extension of the RoCE protocol that involves a simple modification of the RoCE packet format. Instead of the GRH, RoCEv2 packets carry an IP header which allows traversal of IP L3 Routers and a UDP header that serves as a stateless encapsulation layer for the RDMA Transport Protocol Packets over IP.

Figure 3 RoCEv2 and RoCE Frame Format Differences

RoCEv2 packets use a well-known UDP Destination Port (dport) value that unambiguously distinguishes them in a stateless manner.

As an additional benefit, following common practices in UDP encapsulated protocols, the UDP Source Port (sport) field of RoCEv2 packets serves as an opaque flow identifier that can be used by the networking infrastructure for packet forwarding optimizations - see Section 17.9.4, “ECMP for RoCEv2”.

Since this approach exclusively affects the packet format on the wire, and due to the fact that with RDMA semantics packets are generated and consumed below the API, applications can operate over any form of RDMA service (including RoCEv2) in a completely transparent way⁷ (see Figure 4).

7. Widespread RDMA APIs are IP based for all existing RDMA technologies
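The role of the two UDP ports described above can be illustrated with a minimal Python sketch. It rests on several assumptions that this section does not state: the destination port value 4791 is the IANA-assigned RoCEv2 port (the value itself is defined in Section A17.3.2.2, not here), and the function `flow_sport`, its CRC-32 hash, its choice of queue pair numbers as the flow key, and the 0xC000-0xFFFF port range are all hypothetical choices made for illustration only.

```python
import struct
import zlib

# IANA-assigned UDP destination port for RoCEv2 (assumption: the value is
# not stated in this section of the annex).
ROCEV2_UDP_DPORT = 4791


def flow_sport(src_qpn: int, dst_qpn: int) -> int:
    """Derive an opaque UDP source port from a flow key (hypothetical scheme).

    Packets of one flow always map to the same port, so ECMP hashing in the
    network keeps them on one path; different flows tend to get different
    ports and can be spread across paths.
    """
    # Queue pair numbers are 24 bits wide in the base specification.
    key = struct.pack("!II", src_qpn & 0xFFFFFF, dst_qpn & 0xFFFFFF)
    digest = zlib.crc32(key)
    # Fold the hash into the dynamic port range 0xC000-0xFFFF (assumed range).
    return 0xC000 | (digest & 0x3FFF)


if __name__ == "__main__":
    print(ROCEV2_UDP_DPORT, hex(flow_sport(0x12, 0x34)))
```

The only property the network relies on is that the source port is stable per flow; any deterministic function of per-flow state would serve equally well in this sketch.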
Figure 4 RoCEv2 Protocol Stack

A17.3 ROCEV2 PACKET FORMAT

The RoCEv2 Packet format is shown in Figure 5.

Figure 5 RoCEv2 Packet Format

A17.3.1 ETHERTYPES AND IP HEADER FIELDS

RoCEv2 supports both IPv4 and IPv6. The corresponding Ethertype values as well as IPv4 and IPv6 header fields for RoCEv2 packets are described in Section 17.3.1.1, “RoCEv2 with IPv4,” and Section 17.3.1.2, “RoCEv2 with IPv6,” respectively.
CA17-1: RoCEv2 Ports shall support both RoCEv2 with IPv4 and RoCEv2 with IPv6 packet formats.

CA17-2: RoCEv2 Packets shall conform to the format depicted in Figure 5 with individual fields set as mandated by either Section 17.3.1.1, “RoCEv2 with IPv4,” or Section 17.3.1.2, “RoCEv2 with IPv6”.

A17.3.1.1 ROCEV2 WITH IPV4

The Ethertype value for IPv4 as assigned by IEEE is 0x0800.

The format of the IPv4 header and its fields are specified by the IETF in RFC791, RFC2474 and RFC3168. The sub-sections below define the values for relevant fields in the IPv4 header of RoCEv2 packets.

A17.3.1.1.1 INTERNET HEADER LENGTH (IHL)

CA17-3: For RoCEv2 packets with IPv4, the IHL field shall be set to 5.

A17.3.1.1.2 DIFFERENTIATED SERVICES CODEPOINT (DSCP)

CA17-4: For RoCEv2 packets with IPv4, the DSCP field shall be set to the value in the Traffic Class component of the RDMA Address Vector associated with the packet.

A17.3.1.1.3 EXPLICIT CONGESTION NOTIFICATION (ECN)

RoCEv2 makes use of the ECN field in the IPv4 header for signaling of congestion as defined by the IETF in RFC3168. See Section 17.9.3, “RoCEv2 Congestion Management”.

For HCAs that support RoCEv2 Congestion Management, the ECN field in the IPv4 header of a RoCEv2 packet may be set to ‘01’ or ‘10’ to indicate that the packet is subject to marking in the network to indicate congestion.

CA17-5: For HCAs that don’t support RoCEv2 Congestion Management, the ECN field in the IPv4 header of a RoCEv2 packet shall be set to ‘00’.

A17.3.1.1.4 TOTAL LENGTH

CA17-6: For RoCEv2 packets with IPv4, the Total Length field shall be set to the length of the IPv4 packet in bytes including the IPv4 header and up to and including the ICRC.

A17.3.1.1.5 FLAGS

CA17-7: For RoCEv2 packets with IPv4 the Flags field shall be set to ‘010’ (don’t fragment bit is set).
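The IPv4 field rules up to this point (IHL, DSCP, ECN, Total Length, Flags) can be sketched in Python as the first eight bytes of the IPv4 header. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names are hypothetical, the 12-byte BTH and 4-byte ICRC sizes come from the base specification rather than this section, the Identification field is set to 0 because this section does not define it, and the DSCP is passed in as an already-extracted 6-bit value.

```python
import struct

IPV4_HEADER_LEN = 20  # IHL = 5 words of 4 bytes each (CA17-3)
UDP_HEADER_LEN = 8    # fixed UDP header size
BTH_LEN = 12          # Base Transport Header (size from the base specification)
ICRC_LEN = 4          # Invariant CRC (size from the base specification)


def ipv4_total_length(payload_len: int, ext_headers_len: int = 0) -> int:
    """Total Length per CA17-6: IPv4 header up to and including the ICRC."""
    return (IPV4_HEADER_LEN + UDP_HEADER_LEN + BTH_LEN
            + ext_headers_len + payload_len + ICRC_LEN)


def ipv4_header_prefix(dscp: int, ecn: int, payload_len: int,
                       identification: int = 0) -> bytes:
    """Pack the first 8 bytes of the IPv4 header of a RoCEv2 packet."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP is a 6-bit field")
    if not 0 <= ecn < 4:
        raise ValueError("ECN is a 2-bit field")
    version_ihl = (4 << 4) | 5   # version 4, IHL = 5 (CA17-3)
    tos = (dscp << 2) | ecn      # DSCP (CA17-4) followed by ECN (CA17-5)
    flags_frag = 0b010 << 13     # Flags = '010' (CA17-7), fragment offset 0
    return struct.pack("!BBHHH", version_ihl, tos,
                       ipv4_total_length(payload_len),
                       identification, flags_frag)


if __name__ == "__main__":
    # 1024-byte payload, DSCP 26, ECN '10'
    print(ipv4_header_prefix(26, 0b10, 1024).hex())  # prints 456a042c00004000
```

With a 1024-byte payload and no extended transport headers, the Total Length comes to 20 + 8 + 12 + 1024 + 4 = 1068 bytes (0x042C).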
A17.3.1.1.6 FRAGMENT OFFSET

CA17-8: For RoCEv2 packets with IPv4 the Fragment Offset field shall be set to 0.

A17.3.1.1.7 TIME TO LIVE

CA17-9: For RoCEv2 packets with IPv4 the Time to Live field shall be set to the value in the Hop Limit component of the RDMA Address Vector associated with the packet.

A17.3.1.1.8 PROTOCOL

CA17-10: For RoCEv2 packets with IPv4 the Protocol field shall be set to 0x11 (UDP).

A17.3.1.1.9 SOURCE AND DESTINATION IP ADDRESSES

CA17-11: The Source IP Address of RoCEv2 packets with IPv4 shall be set to the IPv4 address encoded in the Port GID entry referenced by the “port” and “SGID index” components of the Address Vector associated with the packet.

CA17-12: The Destination IP Address of RoCEv2 packets with IPv4 shall be set to the IPv4 address encoded in the DGID component of the Address Vector associated with the packet.

A17.3.1.2 ROCEV2 WITH IPV6

The Ethertype value for IPv6 as assigned by IEEE is 0x86DD.

The format of the IPv6 header and its fields are specified by the IETF in RFC2460, RFC2474 and RFC3168. The sub-sections below define the values for relevant fields in the IPv6 header of RoCEv2 packets.

A17.3.1.2.1 DIFFERENTIATED SERVICES CODEPOINT (DSCP)

CA17-13: For RoCEv2 packets with IPv6, the DSCP field shall be set to the value in the Traffic Class component of the Address Vector associated with the packet.

A17.3.1.2.2 EXPLICIT CONGESTION NOTIFICATION (ECN)

RoCEv2 makes use of the ECN field in the IPv6 header for signaling of congestion as defined by the IETF in RFC3168. See Section 17.9.3, “RoCEv2 Congestion Management”.

For HCAs that support RoCEv2 Congestion Management, the ECN field in the IPv6 header of a RoCEv2 packet may be set to ‘01’ or ‘10’ to indicate that the packet is subject to marking in the network to indicate congestion.
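The DSCP and ECN placement in the IPv6 header can be sketched in the same way. In IPv6 both fields sit in the 8-bit Traffic Class, which together with the Version and Flow Label forms the first 32-bit word of the header. The sketch below makes several assumptions: the codepoint names follow RFC 3168, the helper names are hypothetical, the sender picks ECT(0) (‘10’) although ‘01’ is equally permitted by the text above, a sender without congestion management uses ‘00’ by analogy with the IPv4 rule CA17-5, and the Flow Label defaults to 0 because this section does not define it.

```python
import struct
from enum import IntEnum


class Ecn(IntEnum):
    """ECN codepoints as defined in RFC 3168."""
    NOT_ECT = 0b00  # not ECN-capable transport
    ECT_1 = 0b01    # ECN-capable transport, codepoint ECT(1)
    ECT_0 = 0b10    # ECN-capable transport, codepoint ECT(0)
    CE = 0b11       # congestion experienced (set by the network)


def initial_ecn(supports_congestion_mgmt: bool) -> Ecn:
    """Pick the ECN value a sender places in outgoing packets (assumed policy)."""
    return Ecn.ECT_0 if supports_congestion_mgmt else Ecn.NOT_ECT


def ipv6_first_word(dscp: int, ecn: Ecn, flow_label: int = 0) -> bytes:
    """Pack Version, Traffic Class (DSCP + ECN) and Flow Label."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP is a 6-bit field")
    if not 0 <= flow_label < (1 << 20):
        raise ValueError("Flow Label is a 20-bit field")
    traffic_class = (dscp << 2) | int(ecn)  # DSCP per CA17-13, then ECN
    word = (6 << 28) | (traffic_class << 20) | flow_label
    return struct.pack("!I", word)


if __name__ == "__main__":
    print(ipv6_first_word(26, initial_ecn(True)).hex())  # prints 66a00000
```

A switch that marks congestion would rewrite the two ECN bits to `Ecn.CE` without touching the DSCP bits.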