Revisiting Network Support for RDMA

重新审视RDMA的网络支持

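This is a SIGCOMM 2018 conference paper.

I translated this paper into Chinese. Because time was short and my English is limited, mistakes are inevitable; corrections from readers are welcome.

This post and the translation are for study purposes only. If anything here is inappropriate, please contact me and I will remove it.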

Abstract （摘要）

The advent of RoCE (RDMA over Converged Ethernet) has led to a significant increase in the use of RDMA in datacenter networks. To achieve good performance, RoCE requires a lossless network which is in turn achieved by enabling Priority Flow Control (PFC) within the network. However, PFC brings with it a host of problems such as head-of-the-line blocking, congestion spreading, and occasional deadlocks. Rather than seek to fix these issues, we instead ask: is PFC fundamentally required to support RDMA over Ethernet?

RoCE(基于融合以太网的RDMA)的出现导致RDMA在数据中心网络中的使用量显着增加。为了获得良好的性能，RoCE需要一个不丢包网络，这反过来通过在网络中启用优先级流量控制(PFC)来实现。然而，PFC带来了许多问题，例如线头阻塞、拥塞扩散和偶尔的死锁。我们不是寻求解决这些问题，而是要询问：为了需要基于以太网上的RDMA，RFC是否是必须的？

We show that the need for PFC is an artifact of current RoCE NIC designs rather than a fundamental requirement. We propose an improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses. We show that IRN (without PFC) outperforms RoCE (with PFC) by 6-83% for typical network scenarios. Thus not only does IRN eliminate the need for PFC, it improves performance in the process! We further show that the changes that IRN introduces can be implemented with modest overheads of about 3-10% to NIC resources. Based on our results, we argue that research and industry should rethink the current trajectory of network support for RDMA.

我们表明，对PFC的需求是当前RoCE NIC设计的一种人为因素，而不是基本要求。我们提出了一种改进的RoCE NIC (IRN)设计，通过对RoCE NIC进行一些简单的更改，以便更好地处理数据包丢失。我们表明，对于典型的网络场景，IRN(没有PFC)优于RoCE(使用PFC) 6-83％。因此，IRN不仅消除了对PFC的需求，而且还提高了处理过程中的性能!我们进一步表明，IRN引入的更改可以通过大约3-10％的适度NIC资源开销来实现。根据我们的结果，我们认为研究界和工业界应重新考虑当前RDMA的网络支持。

1 Introduction （引言）

Datacenter networks offer higher bandwidth and lower latency than traditional wide-area networks. However, traditional endhost networking stacks, with their high latencies and substantial CPU overhead, have limited the extent to which applications can make use of these characteristics. As a result, several large datacenters have recently adopted RDMA, which bypasses the traditional networking stacks in favor of direct memory accesses.

与传统的广域网相比，数据中心网络具有更高的带宽和更低的延迟。但是，传统的终端主机网络栈具有高延迟和大量CPU开销，限制了应用程序利用这些特性的程度。因此，几个大型数据中心最近采用了RDMA，它绕过了传统的网络栈，转而采用直接内存访问。

RDMA over Converged Ethernet (RoCE) has emerged as the canonical method for deploying RDMA in Ethernet-based datacenters [23, 38]. The centerpiece of RoCE is a NIC that (i) provides mechanisms for accessing host memory without CPU involvement and (ii) supports very basic network transport functionality. Early experience revealed that RoCE NICs only achieve good end-to-end performance when run over a lossless network, so operators turned to Ethernet’s Priority Flow Control (PFC) mechanism to achieve minimal packet loss. The combination of RoCE and PFC has enabled a wave of datacenter RDMA deployments.

融合以太网上的RDMA（RoCE）已成为在基于以太网的数据中心中部署RDMA的规范方法[23,38]。RoCE的核心是一个NIC，它（i）提供了在没有CPU参与的情况下访问主机内存的机制，以及（ii）支持非常基本的网络传输功能。早期的经验表明，RoCE NIC在不丢包网络上运行时只能取得良好的端到端性能，因此运营商转向以太网的优先级流量控制（PFC）机制，以实现最小的数据包丢失。 RoCE和PFC的组合已经实现了数据中心RDMA部署的浪潮。

However, the current solution is not without problems. In particular, PFC adds management complexity and can lead to significant performance problems such as head-of-the-line blocking, congestion spreading, and occasional deadlocks [23, 24, 35, 37, 38]. Rather than continue down the current path and address the various problems with PFC, in this paper we take a step back and ask whether it was needed in the first place. To be clear, current RoCE NICs require a lossless fabric for good performance. However, the question we raise is: can the RoCE NIC design be altered so that we no longer need a lossless network fabric?

但是，目前的解决方案并非没有问题。特别是，PFC增加了管理的复杂性，并可能导致严重的性能问题，如线头阻塞、拥塞传播和偶尔的死锁[23,24,35,37,38]。不是继续沿着当前路径继续解决PFC的各种问题，在本文中我们退后一步，询问是否首先需要它。需要明确的是，目前的RoCE网卡需要不丢包结构才能获得良好的性能。但是，我们提出的问题是：RoCE网卡设计是否可以改变，以便我们不再需要不丢包网络结构？

We answer this question in the affirmative, proposing a new design called IRN (for Improved RoCE NIC) that makes two incremental changes to current RoCE NICs: (i) more efficient loss recovery, and (ii) basic end-to-end flow control to bound the number of in-flight packets (§3). We show, via extensive simulations on a RoCE simulator obtained from a commercial NIC vendor, that IRN performs better than current RoCE NICs, and that IRN does not require PFC to achieve high performance; in fact, IRN often performs better without PFC (§4). We detail the extensions to the RDMA protocol that IRN requires (§5) and use comparative analysis and FPGA synthesis to evaluate the overhead that IRN introduces in terms of NIC hardware resources (§6). Our results suggest that adding IRN functionality to current RoCE NICs would add as little as 3-10% overhead in resource consumption, with no deterioration in message rates.

我们肯定地回答了这个问题，提出了一个名为IRN（改进的RoCE NIC）的新设计，它对当前的RoCE NIC进行了两个增量更改（i）更有效的丢包恢复，以及（ii）基本的端到端流控以限制飞行中（in-flight）数据包的数量（§3）。我们在商用NIC供应商处获得的RoCE仿真器上进行的大量仿真，结果表明IRN的性能优于当前的RoCE NIC，并且IRN不需要PFC来取得高性能。实际上，IRN在没有PFC的情况下通常表现更好（§4）。我们详细介绍了IRN要求的RDMA协议的扩展（§5），并使用比较分析和FPGA综合来评估IRN在NIC硬件资源方面引入的开销（§6）。我们的结果表明，在当前的RoCE网卡中添加IRN功能会增加3-10％的资源开销，而不会降低消息速率。
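The second change, bounding the number of in-flight packets, can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not IRN's actual mechanism (which the paper details in §3): the class name `InFlightLimiter`, the fixed `cap`, and the use of cumulative acknowledgements are all hypothetical choices made for this example.

```python
class InFlightLimiter:
    """Hypothetical sender-side cap on unacknowledged (in-flight) packets."""

    def __init__(self, cap):
        self.cap = cap        # assumed fixed bound on in-flight packets
        self.next_seq = 0     # sequence number of the next new packet
        self.acked = 0        # number of packets cumulatively acknowledged

    def in_flight(self):
        return self.next_seq - self.acked

    def can_send(self):
        return self.in_flight() < self.cap

    def send(self):
        """Return the sequence number sent, or None if the cap is reached."""
        if not self.can_send():
            return None
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def on_ack(self, cumulative_ack):
        """cumulative_ack = count of packets received in order so far."""
        self.acked = max(self.acked, min(cumulative_ack, self.next_seq))


limiter = InFlightLimiter(cap=3)
sent = [limiter.send() for _ in range(4)]   # the fourth attempt is refused
limiter.on_ack(2)                           # two packets acknowledged
resumed = [limiter.send() for _ in range(3)]
```

In this sketch the sender stalls as soon as `cap` packets are outstanding and resumes only as acknowledgements arrive, so the amount of data the sender can have queued in the network is bounded regardless of how fast the application posts work.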

A natural question that arises is how IRN compares to iWARP? iWARP [33] long ago proposed a similar philosophy as IRN: handling packet losses efficiently in the NIC rather than making the network lossless. What we show is that iWARP’s failing was in its design choices. The differences between iWARP and IRN designs stem from their starting points: iWARP aimed for full generality which led them to put the full TCP/IP stack on the NIC, requiring multiple layers of translation between RDMA abstractions and traditional TCP bytestream abstractions. As a result, iWARP NICs are typically far more complex than RoCE ones, with higher cost and lower performance (§2). In contrast, IRN starts with the much simpler design of RoCE and asks what minimal features can be added to eliminate the need for PFC.

一个自然的问题是与iWARP进行比较，IRN如何？ iWARP [33]很久以前提出了与IRN类似的哲学：在NIC中有效地处理数据包丢失而不是使网络不丢包。我们表明iWARP的失败之处在于它的设计选择。iWARP和IRN设计之间的差异源于他们的出发点：iWARP旨在实现全面的通用性，这使得他们将完整的TCP/IP栈放在NIC上，需要在RDMA抽象和传统TCP字节流抽象之间进行多层转换。因此，iWARP NIC通常比RoCE更复杂、成本更高，且性能更低（§2）。相比之下，IRN从更简单的RoCE设计开始，并询问可以添加哪些最小功能以消除对PFC的需求。

More generally: while the merits of iWARP vs. RoCE have been a long-running debate in industry, there is no conclusive or rigorous evaluation that compares the two architectures. Instead, RoCE has emerged as the de-facto winner in the marketplace, and brought with it the implicit (and still lingering) assumption that a lossless fabric is necessary to achieve RoCE’s high performance. Our results are the first to rigorously show that, counter to what market adoption might suggest, iWARP in fact had the right architectural philosophy, although a needlessly complex design approach.

更一般地说：虽然iWARP与RoCE的优点一直是业界长期争论的问题，但没有比较两种架构的结论性或严格的评估。相反，RoCE已成为市场上事实上的赢家，并带来了隐含（并且仍然挥之不去）的假设，即不丢包结构是实现RoCE高性能所必需的。我们的结果是第一个严格表明，与市场采用建议相反，iWARP实际上具有正确的架构理念，尽管是一种不必要的复杂设计方法。

Hence, one might view IRN and our results in one of two ways: (i) a new design for RoCE NICs which, at the cost of a few incremental modifications, eliminates the need for PFC and leads to better performance, or, (ii) a new incarnation of the iWARP philosophy which is simpler in implementation and faster in performance.

因此，可以通过以下两种方式之一来审视IRN和我们的结果：（i）RoCE NIC的新设计，以少量增量修改为代价，消除了对PFC的需求并导致更好的性能，或者，（ii）iWARP理念的新体现，其实现更简单，性能更快。

2 Background （背景）

We begin with reviewing some relevant background.

我们以回顾一些相关背景开始。

2.1 Infiniband RDMA and RoCE （Infiniband RDMA和RoCE）

RDMA has long been used by the HPC community in special-purpose Infiniband clusters that use credit-based flow control to make the network lossless [4]. Because packet drops are rare in such clusters, the RDMA Infiniband transport (as implemented on the NIC) was not designed to efficiently recover from packet losses. When the receiver receives an out-of-order packet, it simply discards it and sends a negative acknowledgement (NACK) to the sender. When the sender sees a NACK, it retransmits all packets that were sent after the last acknowledged packet (i.e., it performs a go-back-N retransmission).

长期以来，HPC社区一直在特殊用途的Infniband集群中使用RDMA，这些集群使用基于信用的流控制来使网络不丢包[4]。由于数据包丢失在此类群集中很少见，因此RDMA Infniband传输层（在NIC上实现）并非旨在从数据包丢失中有效恢复。当接收方收到无序数据包时，它只是丢弃该数据包并向发送方发送否定确认（NACK）。当发送方看到NACK时，它重新发送在最后一个确认的数据包之后发送的所有数据包（即，它执行返回N重传）。
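To make the cost of go-back-N concrete, the following Python sketch simulates the behavior described above. It is a simplified illustration, not the Infiniband transport itself: the function name `go_back_n_transfer`, the `drop_on_first_try` parameter, and the assumption that the sender transmits every outstanding packet before reacting to a NACK are all hypothetical choices made for this example.

```python
def go_back_n_transfer(num_packets, drop_on_first_try):
    """Simulate go-back-N and return the total number of packets sent.

    drop_on_first_try: set of sequence numbers that are lost in the network
    the first time they are transmitted (assumed loss model).
    """
    already_dropped = set()
    expected = 0          # receiver side: next in-order sequence number
    transmissions = 0

    while expected < num_packets:
        # Sender (re)sends everything after the last acknowledged packet.
        base = expected
        for seq in range(base, num_packets):
            transmissions += 1
            if seq in drop_on_first_try and seq not in already_dropped:
                already_dropped.add(seq)
                continue              # packet lost in the network
            if seq == expected:
                expected += 1         # in-order packet: accepted
            # Otherwise the packet is out of order: the receiver discards
            # it and NACKs, so it must be sent again in the next round.

    return transmissions


total = go_back_n_transfer(num_packets=10, drop_on_first_try={3})
wasted = total - 10   # transmissions beyond the minimum of one per packet
```

With ten packets and a single loss at sequence number 3, the sketch sends 17 packets in total: the six packets that followed the lost one are discarded by the receiver and sent again together with the lost packet. A scheme that retransmitted only the lost packet would need 11.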

To take advantage of the widespread use of Ethernet in datacenters, RoCE [5, 9] was introduced to enable the use of RDMA over Ethernet. RoCE adopted the same Infiniband transport design (including go-back-N loss recovery), and the network was made lossless using PFC.

为了利用以太网在数据中心中的广泛使用的优势，引入了RoCE [5,9]，以便在以太网上使用RDMA（我们对RoCE [5]及其后继者RoCEv2 [9]使用术语RoCE，RoCEv2使得RDMA不仅可以通过以太网运行，还可以运行在IP路由网络。）。RoCE采用了相同的Infiniband传输层设计（包括返回N数据包丢包恢复），并且使用PFC使网络不丢包。

2.2 Priority Flow Control （优先级流控）

Priority Flow Control (PFC) [6] is Ethernet’s flow control mechanism, in which a switch sends a pause (or X-OFF) frame to the upstream entity (a switch or a NIC), when the queue exceeds a certain configured threshold. When the queue drains below this threshold, an X-ON frame is sent to resume transmission. When configured correctly, PFC makes the network lossless (as long as all network elements remain functioning). However, this coarse reaction to congestion is agnostic to which flows are causing it and this results in various performance issues that have been documented in numerous papers in recent years [23, 24, 35, 37, 38]. These issues range from mild (e.g., unfairness and head-of-line blocking) to severe, such as “pause spreading” as highlighted in [23] and even network deadlocks [24, 35, 37]. In an attempt to mitigate these issues, congestion control mechanisms have been proposed for RoCE (e.g., DCQCN [38] and Timely [29]) which reduce the sending rate on detecting congestion, but are not enough to eradicate the need for PFC. Hence, there is now a broad agreement that PFC makes networks harder to understand and manage, and can lead to myriad performance problems that need to be dealt with.

优先级流控（PFC）[6]是以太网的流量控制机制，当队列超过某个特定的配置阈值时，交换机会向上游实体（交换机或NIC）发送暂停（或X-OFF）帧。当队列低于此阈值时，将发送X-ON帧以恢复传输。正确配置后，PFC使网络不丢包（只要所有网络元素保持正常运行）。然而，这种对拥堵的粗略反应对哪些流导致拥塞是不可知的，这导致了近年来在许多论文中记载的各种性能问题[23,24,35,37,38]。这些问题的范围从轻微（例如，不公平和线头阻塞）到严重(例如[23]中突出显示的“暂停传播”)，甚至是网络死锁[24,35,37]。为了缓解这些问题，已经为RoCE提出了拥塞控制机制（例如，DCQCN [38]和Timely [29]），其降低了检测拥塞的发送速率，但是不足以消除对PFC的需求。因此，现在普遍认为PFC使网络更难理解和管理，并且可能导致需要处理的无数性能问题。
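The X-OFF/X-ON behavior described above can be sketched as a small discrete-time simulation in Python. This is a toy model under stated assumptions, not a description of any real switch: the function name `simulate_pfc_queue`, the per-tick arrival and drain rates, and the use of a single threshold for both pausing and resuming are hypothetical simplifications.

```python
def simulate_pfc_queue(arrivals_per_tick, drain_per_tick, threshold, ticks):
    """Return a list of (queue_depth, upstream_paused) tuples, one per tick."""
    queue = 0
    paused = False
    history = []

    for _ in range(ticks):
        if not paused:
            queue += arrivals_per_tick      # upstream sends only when not paused
        queue = max(0, queue - drain_per_tick)

        if not paused and queue > threshold:
            paused = True                   # switch sends X-OFF upstream
        elif paused and queue < threshold:
            paused = False                  # switch sends X-ON upstream

        history.append((queue, paused))

    return history


history = simulate_pfc_queue(arrivals_per_tick=5, drain_per_tick=2,
                             threshold=8, ticks=8)
paused_ticks = sum(1 for _, paused in history if paused)
```

In this run the upstream sender is repeatedly paused and resumed while the queue hovers around the threshold, and nothing is dropped. Note that the pause in this model is a single flag for the whole upstream link: while it is set, every flow arriving from that upstream entity is held back, including flows that are not contributing to the congestion, which is the source of the head-of-line blocking mentioned in the text.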