Optimizing Memory Usage in Multithreaded DPDK Applications

Author

Conor Walsh is a software engineering intern with the Architecture Team of Intel’s Network Platform Group (NPG), based in Intel Shannon (Ireland).

Introduction

High-speed packet processing applications are extremely resource-sensitive. One solution is to split the packet processing pipeline across multiple threads to improve application performance. However, doing so can increase the pressure on cache and memory access. The key to building a high-performance application is therefore to keep the memory footprint of the data plane traffic as small as possible. This paper presents a memory optimization technique for multithreaded packet processing applications that improves the performance of memory-bound applications. Even when an application is not memory-bound, it is still worth reducing its memory requirements.

Reference Application

This paper is based on a study of Intel's vCMTS system. vCMTS is a MAC-layer data plane pipeline application built on the DOCSIS 3.1 standard and the DPDK packet processing framework. It was developed as a tool for characterizing the packet processing performance and power consumption of the vCMTS data plane on Intel Xeon platforms. vCMTS can be downloaded from the Access Network Dataplanes project on 01.org.
This version of vCMTS is based on DPDK 18.08, but the theory behind the methods described in this paper can be applied to applications that use earlier versions of DPDK or other packet processing libraries such as the Cisco* Vector Packet Processing (VPP) framework. The features used in this paper were initially added to DPDK 16.07.2.

The downstream portion of this application uses a multithreaded pipeline design. The pipeline is split into two parts: the upper and lower MAC, which run on separate threads. These must be run on sibling hyper-threads – two threads running on one physical core – or the L2 caching efficiency will be lost. Refer to the vCMTS downstream packet processing pipeline in Figure 1. The upstream portion of vCMTS at the time of this paper does not use a multithreaded model; this paper focuses on the downstream portion as its reference application.
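
To make the threading model concrete, below is a minimal sketch of how a two-stage pipeline could be launched on two lcores with DPDK's EAL. The stage functions are hypothetical stand-ins (vCMTS's real pipeline code is not shown in this paper), and the sibling core IDs passed on the command line depend on the machine's topology:

    #include <rte_eal.h>
    #include <rte_launch.h>
    #include <rte_lcore.h>

    /* Hypothetical stage entry points standing in for the real
     * upper/lower MAC pipeline stages. */
    static int upper_mac_stage(void *arg) { (void)arg; /* ... */ return 0; }
    static int lower_mac_stage(void *arg) { (void)arg; /* ... */ return 0; }

    int main(int argc, char **argv)
    {
        /* Pin the two stages to sibling hyper-threads via the EAL core
         * list, e.g. "-l 2,30" on a machine where logical CPUs 2 and 30
         * share one physical core (check /proc/cpuinfo for the pairs). */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        /* Run one stage on the first worker lcore, the other on the
         * main lcore, then wait for both to finish. */
        unsigned int worker = rte_get_next_lcore(-1, 1, 0);
        rte_eal_remote_launch(lower_mac_stage, NULL, worker);
        upper_mac_stage(NULL);
        rte_eal_mp_wait_lcore();
        return 0;
    }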

Ring Versus Stack

DPDK uses message buffers known as mbufs to store packet data. These mbufs are stored in memory pools known as mempools. By default, mempools are set up as a ring, which creates a pool with a configuration similar to a first-in, first-out (FIFO) system. This model can work well for multithreaded applications whose threads span multiple cores. However, for applications whose threads are on the same core, it can cause unnecessary memory bandwidth usage, and some of the hyper-threading efficiency may be lost. In such applications, the ring mempool ends up cycling through all the mbufs, resulting in CPU cache misses on almost every mbuf allocated when the number of mbufs is large, as in the case of vCMTS.
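
As a concrete reference, here is a minimal sketch of creating such a default, ring-backed mbuf pool; the pool name and sizing constants are illustrative placeholders, not values taken from vCMTS:

    #include <rte_lcore.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    #define NUM_MBUFS       8192  /* placeholder pool size */
    #define MBUF_CACHE_SIZE 256   /* per-thread mempool cache, discussed below */

    /* rte_pktmbuf_pool_create() backs the pool with the default
     * ring-based "ring_mp_mc" mempool handler, giving FIFO behavior. */
    static struct rte_mempool *
    create_ring_pool(void)
    {
        return rte_pktmbuf_pool_create("MBUF_POOL_RING", NUM_MBUFS,
            MBUF_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
            rte_socket_id());
    }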

DPDK also allows users to set up their mempools in a stack configuration, which creates a mempool that operates as a last-in, first-out (LIFO) structure.
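
Switching to the stack handler is a one-line change at pool creation time. Below is a sketch reusing the constants above; rte_pktmbuf_pool_create_by_ops() has been available since DPDK 17.11, and "stack" is the name of DPDK's LIFO mempool handler:

    /* Identical to the ring pool above, except that the mempool ops
     * are switched to the LIFO "stack" handler. */
    static struct rte_mempool *
    create_stack_pool(void)
    {
        return rte_pktmbuf_pool_create_by_ops("MBUF_POOL_STACK",
            NUM_MBUFS, MBUF_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
            rte_socket_id(), "stack");
    }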

Mempools also have a mempool cache, which allows “warm” buffers to be recycled, providing better cache efficiency where buffers are allocated and freed on the same thread. Mempool caches are always set up using a LIFO configuration to improve the performance of the mempool cache. Each thread used by a DPDK application has its own mempool cache for each mempool. As the packets are received by one thread and transmitted by another thread, mbufs will never be freed and allocated on the same thread, which renders the mempool cache system redundant in this case.
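
Because the per-thread cache is redundant when every mbuf is allocated on one thread and freed on another, it can be disabled by passing a cache size of zero at creation time. Again, this is a sketch rather than vCMTS's actual code:

    /* A cache_size of 0 disables the per-lcore mempool cache, so
     * mbufs move directly between the NIC queues and the mempool. */
    static struct rte_mempool *
    create_stack_pool_nocache(void)
    {
        return rte_pktmbuf_pool_create_by_ops("MBUF_POOL_STACK_NC",
            NUM_MBUFS, /* cache_size = */ 0, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id(), "stack");
    }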

Movement of buffers in the ring mempool model (shown in Figure 2):

  • The application tries to populate the network interface card's (NIC's) receive (RX) free list from thread 0's mempool cache.
  • Mbufs are never freed on thread 0, so the mempool cache will be replenished directly from the mempool.
  • The application then allocates mbufs from the mempool cache.
  • When the mbufs are freed from the transmitting (TX) NIC, they are held in a mempool cache on thread 1.
  • When thread 1’s mempool cache is full, the application will start to return mbufs to the mempool, as mbufs are never allocated on thread 1.

This is a poor model for this application: the thread 0 mempool cache always contains the “coldest” mbufs, and the thread 1 mempool cache is always full. If the mempool is large, the CPU cannot retain the whole mempool in its cache, and part of it is pushed out to memory. In this model, the application quickly cycles through the entire mempool, generating high memory bandwidth usage as mbufs fall in and out of the CPU's cache.

Movement of buffers in the stack mempool model (shown in Figure 3):

  • The application will allocate mbufs to the NIC's RX free list straight from the mempool.
  • When the mbufs are freed from the transmitting NIC, they are freed back to the mempool.

This model is better, as it is more streamlined [1]. The mempool has been changed to perform as a stack, and the redundant mempool caches have been disabled. The mbufs are allocated to the NIC straight from the mempool and are then freed straight back to the mempool, not held on thread 1. Note that freeing and allocating from the stack mempool may be slightly more expensive, due to the locking required on the mempool because of multithreaded access. The locking penalties are minimized because the locks are confined to one core. (If the threads were on separate cores, the locking cost would be more significant.) The overall benefit of reusing “warm” mbufs outweighs the additional locking cost.

Another method attempted to reduce the memory footprint of the mempools, while still using a ring configuration, was to reduce the number of mbufs in the mempool (the mempool size). Analysis showed that during normal operation about 750 mbufs were in flight in vCMTS – but at times the application could require up to 20,000 mbufs. This meant that reducing the size of the mempool and disabling the mempool cache was not an option, because at times the application may need more than 25 times as many mbufs as during normal operation. If those mbufs are not available, the application may behave unpredictably.

When the application uses the stack model, the CPU should be able to keep most of the mbufs required by the application “warm” in the CPU cache, and fewer of them should be evicted from the CPU cache to memory. This is because the same small set of mbufs is reused over and over from the mempool during normal operation, which drastically reduces memory bandwidth usage.

Conclusion

Across all four tests, the drop in memory bandwidth due to the change from ring to stack was significant; the average drop across the four tests was 76%. A key benefit is that, at high traffic rates, the data-plane cores are less likely to approach memory bandwidth saturation, which would degrade performance. In this case, a direct performance benefit can be achieved by using the stack mempool configuration for dual-threaded packet processing applications, as it reduces memory bandwidth utilization and raises the traffic rate at which memory bandwidth saturates. Another benefit of this change is the availability of more memory bandwidth for other applications running on the same socket. The modification should require minimal code changes to the application, so the effort is worth the reward [2].

Resources

Maximizing the Performance of DOCSIS 3.0/3.1 Processing on Intel® Xeon® Processors

Endnotes

[1] This model is more streamlined for the dual sibling hyper-threaded case; it was not tested in a scenario where the threads spanned multiple cores.

[2] When vCMTS is changed to the stack configuration, the system should not need to be populated with as many DIMMs of memory, due to the reduced memory bandwidth; this would result in a power saving of roughly seven watts per DIMM. This claim was not verified as part of this paper, but it is a possible way to gain power savings from switching to a stack configuration.
