Understanding the Benefits of Hardware-Accelerated Communication in Model-Serving Applications

Computer Science

W. A. Hanafy, L. Wang, et al.

This research by Walid A. Hanafy, Limin Wang, Hyunseok Chang, Sarit Mukherjee, T. V. Lakshman, and Prashant Shenoy shows how hardware-accelerated communication can significantly reduce latency in machine learning serving pipelines. By leveraging RDMA and GPUDirect RDMA, the study demonstrates latency savings of 15-50% compared to traditional TCP transport, offering practical insights into performance optimization.

Introduction
Edge computing is crucial for applications exceeding on-device capabilities (e.g., cloud gaming, wearable cognitive assistance). Advances in 5G/6G networking and new chip technologies (GPU, TPU) have driven edge offloading. While research focuses on compute resource optimization, the internal network fabric within edge computing facilities is often overlooked. A common assumption is that end-to-end performance is solely determined by external network connectivity. However, with complex offloaded computations and dynamic load balancing, tasks often traverse multiple nodes and proxies within the edge facility, interconnected by a dedicated network fabric. Hardware-accelerated transport technologies like RDMA and GPUDirect RDMA (GDR) are increasingly used to build these fabrics, offering the potential for significant performance improvements by bypassing the server CPU and operating system for direct memory access. This paper aims to understand the full potential of these technologies in the context of computation offload, considering factors like GPU scheduling and computation characteristics.
Literature Review
Existing research on edge-offloading optimization primarily focuses on adaptive computation and intelligent workload scheduling, and largely ignores the impact of the internal network fabric within the edge computing infrastructure. While prior work exists on hardware-accelerated transports such as RDMA and GPUDirect, a comprehensive understanding of their potential benefits in model-serving applications, and of their interplay with factors such as GPU scheduling and computation characteristics, is lacking. This paper addresses this gap using a purpose-built model-serving framework.
Methodology
To systematically evaluate the role of hardware-accelerated network fabrics in low-latency edge computation offload, the authors built a custom model-serving application framework. This framework supports multiple communication mechanisms (TCP, RDMA, GDR) and provides fine-grained visibility into pipeline stages, a capability absent in off-the-shelf model-serving systems. The model-serving pipeline comprises request handling, preprocessing, inference, and response handling. For RDMA and GDR, connection setup involves queue creation, memory buffer allocation, and metadata exchange; RDMA_WRITE is used to transfer request and response data. GDR omits the host-to-device (H2D) and device-to-host (D2H) copies. ZeroMQ is used for the TCP-based transport to ensure a fair comparison with RDMA. The framework provides detailed time profiling for individual pipeline stages, breaking latency down into transport and GPU components. GPU latency includes copy time (for TCP and RDMA) and is measured using CUDA events, while transport delay (request and response times) is measured separately for each mechanism. Metrics such as CPU and memory usage are also collected. The experiments varied the transport mechanism, connection mode (direct or proxied), and GPU configuration (concurrency, priority, and sharing modes: multi-stream, multi-context, MPS). The system was implemented using NVIDIA OFED, ZeroMQ, the CUDA toolkit, OpenCV, and TensorRT, and deployed on three servers, one of which hosted an NVIDIA A2 GPU. Experiments started with single-client scenarios to isolate transport delay and then scaled to multiple clients to study concurrency and resource-sharing effects.
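To make the CUDA-event profiling concrete, here is a minimal sketch of how per-stage GPU latency (H2D copy plus inference) can be timed on a stream. This is not the authors' actual framework code; `run_inference`, `timed_infer`, and the buffer names are illustrative assumptions.

```cpp
// Minimal sketch: timing the H2D copy and inference stages with CUDA events.
// All identifiers here are illustrative, not the paper's framework API.
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-in for the real inference call (e.g., a TensorRT enqueue).
static void run_inference(cudaStream_t /*stream*/) {}

void timed_infer(const void* host_input, void* dev_input, size_t input_size,
                 cudaStream_t stream) {
    cudaEvent_t start, after_copy, after_infer;
    cudaEventCreate(&start);
    cudaEventCreate(&after_copy);
    cudaEventCreate(&after_infer);

    cudaEventRecord(start, stream);
    // H2D copy: present for TCP and RDMA; skipped entirely under GPUDirect RDMA,
    // where the NIC writes request data straight into GPU memory.
    cudaMemcpyAsync(dev_input, host_input, input_size,
                    cudaMemcpyHostToDevice, stream);
    cudaEventRecord(after_copy, stream);

    run_inference(stream);
    cudaEventRecord(after_infer, stream);
    cudaEventSynchronize(after_infer);

    float copy_ms = 0.f, infer_ms = 0.f;
    cudaEventElapsedTime(&copy_ms, start, after_copy);
    cudaEventElapsedTime(&infer_ms, after_copy, after_infer);
    printf("H2D copy: %.3f ms, inference: %.3f ms\n", copy_ms, infer_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(after_copy);
    cudaEventDestroy(after_infer);
}
```

Recording events on the same stream as the work keeps the measurement asynchronous with respect to the host, which is why event-based timing is a common choice for breaking a pipeline into per-stage GPU latencies.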
Key Findings
The study revealed several key findings:

1. **Hardware-accelerated transport benefits increase with communication overhead:** GDR and RDMA significantly outperformed TCP, especially when communication dominated the pipeline (e.g., smaller models, larger I/O). The relative improvement was more pronounced when preprocessing was not required.
2. **Protocol translation is beneficial:** Even using hardware-accelerated transport only on the last hop of a proxied connection (gateway to server) yielded substantial latency improvements over end-to-end TCP pipelines, because it reduces data-movement time between the gateway and the GPU server.
3. **Data copies are major bottlenecks:** H2D and D2H copies quickly become bottlenecks as concurrency increases. GDR's elimination of these copies gave it superior scalability compared to RDMA and TCP.
4. **GPU copy-engine limitations:** The coarse-grained interleaving of the GPU copy engine limits the effectiveness of prioritizing high-priority clients, especially with RDMA. GDR showed better prioritization because it avoids copy operations.
5. **Scalability differences across mechanisms:** With multiple clients, GDR maintained significant advantages over RDMA and TCP, while RDMA's performance converged toward TCP's. This is attributed to copy-time overhead growing with concurrency and becoming a bottleneck for RDMA and TCP; GDR's avoidance of copies minimized this effect.
6. **GPU management strategies:** Limiting concurrency (reducing the number of streams) reduces variability in processing time and improves performance. MPS outperformed multi-context and multi-stream execution, but for GDR and RDMA the differences between MPS, multi-context, and multi-stream were marginal with multiple clients, and GDR provided little additional advantage over RDMA in this setting. The study also challenges the assumed independence of the GPU execution and copy engines: unexpected variability in processing times across mechanisms suggests the two engines are managed by a single central scheduler. Finally, stream priorities are more effective for the execution engine (fine granularity) than for the copy engine (coarse granularity); a minimal sketch of priority streams follows below.

Specific quantitative results include GDR saving 20.3% to 23.2% of latency compared to TCP with ResNet50; GDR significantly outperforming the other mechanisms for models such as MobileNetV3 and DeepLabV3, especially at high concurrency; and substantial latency reductions from adopting hardware-accelerated transport on proxied connections.
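As a complement to findings 4 and 6, the sketch below shows one way to give different clients CUDA streams with different priorities. It is a hedged illustration, not the paper's implementation: the client names are hypothetical, and the comments restate the paper's observation that priorities influence kernel (execution-engine) scheduling at fine granularity while copy-engine transfers interleave coarsely.

```cpp
// Minimal sketch: per-client CUDA streams with different priorities.
// Illustrative only; not the authors' framework code.
#include <cuda_runtime.h>

int main() {
    int least_priority = 0, greatest_priority = 0;
    // Query the valid priority range; the "greatest" priority is numerically
    // the smallest value on most GPUs.
    cudaDeviceGetStreamPriorityRange(&least_priority, &greatest_priority);

    cudaStream_t high_prio_client, low_prio_client;
    cudaStreamCreateWithPriority(&high_prio_client, cudaStreamNonBlocking,
                                 greatest_priority);
    cudaStreamCreateWithPriority(&low_prio_client, cudaStreamNonBlocking,
                                 least_priority);

    // Kernels launched on high_prio_client are favored by the execution engine
    // at fine granularity. H2D/D2H copies queued under TCP or RDMA, however,
    // interleave coarsely on the copy engine, which is why prioritization was
    // observed to work better with GDR, where those copies disappear.

    cudaStreamDestroy(high_prio_client);
    cudaStreamDestroy(low_prio_client);
    return 0;
}
```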
Discussion
The findings directly address the research question by quantifying the performance benefits of hardware-accelerated communication in model-serving pipelines. The results demonstrate that the choice of communication mechanism significantly impacts latency, especially under concurrent workloads. The importance of minimizing data copies and optimizing GPU resource sharing is highlighted. The study's significance lies in providing detailed insights into the interplay between hardware-accelerated transport, GPU scheduling, and model-serving performance, which can guide the design of future low-latency edge computing infrastructures. The findings are relevant to the broader field of distributed systems, particularly in the context of deploying and scaling ML applications.
Conclusion
This paper contributes a comprehensive evaluation of hardware-accelerated communication (RDMA and GDR) in model-serving applications. The custom framework and detailed analysis revealed critical performance bottlenecks and trade-offs. The findings highlight the importance of communication fraction, efficient data copy handling, and optimal GPU resource management. Future work could explore more sophisticated GPU scheduling algorithms, investigate the impact of different accelerator types, and extend the framework to support more complex model-serving scenarios.
Limitations
The study acknowledges limitations including memory overhead from per-client buffer allocation, potential interoperability issues due to RDMA's raw-byte transfer, limitations with GPU pinning in GDR, and potential sub-optimality of GPUs for certain preprocessing tasks. The study focused primarily on NVIDIA GPUs and may not generalize directly to other hardware architectures.
Listen, Learn & Level Up
Over 10,000 hours of research content in 25+ fields, available in 12+ languages.
No more digging through PDFs, just hit play and absorb the world's latest research in your language, on your time.
listen to research audio papers with researchbunny