Understanding the Benefits of Hardware-Accelerated Communication in Model-Serving Applications

Computer Science


W. A. Hanafy, L. Wang, et al.

This research by Walid A. Hanafy, Limin Wang, Hyunseok Chang, Sarit Mukherjee, T. V. Lakshman, and Prashant Shenoy reveals how hardware-accelerated communication can significantly reduce latency in model-serving pipelines. By leveraging RDMA and GPUDirect RDMA, the study demonstrates latency savings of 15–50% compared to traditional TCP transport, offering crucial insights into performance optimization.

Introduction
The paper examines how hardware-accelerated transport within edge computing facilities affects end-to-end latency for model-serving applications. While prior work often assumes that external access networks (e.g., 5G/6G) dominate performance, the authors argue that internal edge network fabrics, and their interaction with multi-stage pipelines spanning gateways, proxies, and GPU servers, can be critical. RDMA and GPUDirect RDMA (GDR) bypass the OS and CPU to place data directly into host or GPU memory, promising lower latency via zero-copy transfers. The study builds a model-serving framework to profile pipeline stages and quantify the net benefits of RDMA/GDR under varied workloads, connection modes, and GPU scheduling/sharing strategies. Key hypotheses and initial takeaways include:
(1) The benefit of hardware-accelerated transport grows when communication constitutes a significant fraction of end-to-end latency, especially with faster GPUs and I/O-intensive applications.
(2) Even with protocol translation, using hardware-accelerated transport within the cluster can substantially reduce latency compared to end-to-end TCP.
(3) Host-device data copies (H2D/D2H) are major bottlenecks and interfere with GPU execution; GDR mitigates this by avoiding copy queues.
(4) Prioritization effectiveness is limited by the coarse-grained interleaving of the GPU copy engine, reducing priority benefits compared to execution engines.
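To make takeaway (1) concrete, an Amdahl-style back-of-the-envelope model (our illustration, not from the paper) bounds the achievable saving by the communication fraction: if communication accounts for a fraction f of end-to-end latency and the accelerated transport speeds that stage up by a factor s, then:

```latex
% Illustrative Amdahl-style bound (our sketch, not the paper's model):
% f = communication fraction of end-to-end latency,
% s = speedup of the communication stage under RDMA/GDR.
\[
  \text{saving} = f\left(1 - \frac{1}{s}\right),
  \qquad\text{e.g.}\quad
  f = 0.6,\; s = 4 \;\Rightarrow\; \text{saving} = 0.6 \times 0.75 = 45\%.
\]
```

This is consistent with the paper's findings below, where workloads spending ~60% of their time in data movement under TCP see end-to-end savings toward the upper end of the reported 15–50% range.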
Literature Review
The paper situates its work within several areas:
(i) Edge offloading and inference-serving systems have emphasized adaptive computation and cluster scheduling to reduce latency and improve predictability and utilization, often overlooking the internal networking fabric's role. Systems such as Clipper, Triton, and TensorFlow Serving focus on TCP-based protocols and support neither RDMA/GDR nor fine-grained pipeline profiling.
(ii) Studies of GPU scheduling reveal non-preemptive, priority-aware scheduling at thread-block granularity across streams and contexts, and mechanisms such as MPS that improve utilization, but the trade-offs between predictability and utilization remain underexplored.
(iii) The hardware-accelerated transport literature has evaluated RDMA, GPUDirect, and GPU interconnects, and systems such as Lynx (SmartNIC offload), GPU-Ether (GPU-native I/O), and FlexDriver (accelerator data-plane driver) demonstrate the performance benefits of specific solutions.
However, prior work does not provide an in-depth, end-to-end analysis of how RDMA/GDR affect model-serving pipelines across realistic deployment scenarios, multi-hop proxying, and GPU sharing. This paper bridges that gap by providing detailed breakdowns and guidelines for when and how to adopt hardware-accelerated transport in model serving.
Methodology
The authors develop a custom model-serving framework in C++ (~4.5k SLOC) that supports multiple transports (TCP via ZeroMQ, RDMA, GPUDirect RDMA), fine-grained profiling, and flexible deployment topologies. The pipeline comprises four stages: request handling, preprocessing (e.g., resizing/formatting with OpenCV), inference (TensorRT on CUDA), and response handling.

Transport implementations (see the sketches at the end of this section):
- RDMA/RoCEv2: explicit queue pairs, work requests (RDMA_WRITE), and work completions; data is DMAed to host RAM, then copied to/from GPU memory with cudaMemcpy (H2D/D2H).
- GDR: the RNIC directly DMAs to/from GPU memory via GPUDirect RDMA, eliminating H2D/D2H copies and CPU staging.
- TCP baseline: ZeroMQ (Router-Dealer) is used to avoid the (de)serialization overheads common to HTTP/gRPC, enabling a fair comparison with RDMA semantics.
Each server thread reuses buffers to minimize allocation overheads.

Metrics and profiling: The system injects CUDA events around GPU stages to measure preprocessing time, inference time, and copy time (H2D+D2H; not applicable to GDR). Transport latencies are captured as request-time and response-time. Total end-to-end latency (total-time), CPU usage (user/kernel), and memory usage (RAM and GPU) are also recorded. Each client issues 1000 closed-loop requests; experiments vary client counts and GPU sharing modes.

Experimental scenarios:
- Transport mechanisms: local (lower bound, no network), RDMA, GDR, TCP.
- Connection modes: direct (gateway to GPU server) and proxied (client→gateway→server), with the combinations RDMA/GDR, RDMA/RDMA, TCP/GDR, TCP/RDMA, and TCP/TCP.
- GPU configurations: concurrency (streams per client), stream priorities, and sharing methods (multi-stream, multi-context, MPS).

Implementation environment: NVIDIA OFED v5.6, ZeroMQ v2.1, CUDA 11.6.2, OpenCV 4.5.5, TensorRT 8.4; Ubuntu 20.04, kernel 5.15. Testbed: three Dell servers (S1 and S2 with NVIDIA A2 16 GB GPUs and dual copy engines, plus S3), all with ConnectX-5 25 GbE RNICs.

Workloads: diverse DNNs (MobileNetV3, ResNet50, EfficientNetB0, WideResNet101, YoloV4, DeepLabV3_ResNet50) covering different GFLOPs, input/output sizes, and tasks; experiments use both raw and preprocessed inputs.
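To make the two accelerated data paths concrete, below is a minimal sketch (our illustration, not the paper's code) using the standard ibverbs and CUDA runtime APIs. Connection setup, completion polling, and error handling are omitted, and all function and variable names are assumptions:

```cpp
// Sketch of the two data paths (illustrative, not the paper's code).
// Plain RDMA: the RNIC DMAs into registered host RAM, then an explicit H2D
// copy stages the data into GPU memory. GDR: the GPU buffer itself is
// registered with the RNIC (GPUDirect RDMA), so no staging copy is needed.
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <cstdint>

// Sender side: post a one-sided RDMA_WRITE into the peer's registered buffer.
void post_rdma_write(ibv_qp* qp, ibv_mr* mr, void* buf, uint32_t len,
                     uint64_t remote_addr, uint32_t rkey) {
    ibv_sge sge{};
    sge.addr   = reinterpret_cast<uint64_t>(buf);
    sge.length = len;
    sge.lkey   = mr->lkey;

    ibv_send_wr wr{}, *bad = nullptr;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   // request a work completion
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    ibv_post_send(qp, &wr, &bad);
}

// Plain-RDMA receiver: once arrival of the request is detected (e.g., via an
// immediate-data completion; omitted here), stage the host data to the GPU.
void stage_to_gpu(void* gpu_buf, const void* host_buf, size_t len,
                  cudaStream_t stream) {
    // This H2D copy (and the mirror D2H on responses) is exactly what the
    // paper identifies as a copy-engine bottleneck; GDR eliminates it.
    cudaMemcpyAsync(gpu_buf, host_buf, len, cudaMemcpyHostToDevice, stream);
}

// GDR receiver setup: register device memory directly with the RNIC so the
// incoming RDMA_WRITE lands in GPU memory with no host staging.
ibv_mr* register_gpu_buffer(ibv_pd* pd, size_t len, void** gpu_buf) {
    cudaMalloc(gpu_buf, len);  // per-client pinned GPU buffer (see Limitations)
    return ibv_reg_mr(pd, *gpu_buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}
```

Note that registering GPU memory with the RNIC requires GPUDirect RDMA support in the driver stack; the per-client registered buffers are also the source of the memory-pinning limitation discussed later.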
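The event-based GPU profiling works by bracketing each enqueued stage with CUDA events; a minimal sketch of that pattern follows (our illustration; the stage launches are placeholders):

```cpp
// Sketch of per-stage GPU timing with CUDA events (our illustration).
#include <cuda_runtime.h>

float time_stages(cudaStream_t stream, void* gpu_buf, const void* host_buf,
                  size_t len) {
    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0, stream);
    cudaMemcpyAsync(gpu_buf, host_buf, len, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(t1, stream);             // end of H2D copy
    // ... enqueue preprocessing and TensorRT inference on `stream` here ...
    cudaEventRecord(t2, stream);             // end of GPU work
    cudaEventSynchronize(t2);                // wait for the stream to drain

    float copy_ms = 0.f, exec_ms = 0.f;
    cudaEventElapsedTime(&copy_ms, t0, t1);  // copy-time (n/a under GDR)
    cudaEventElapsedTime(&exec_ms, t1, t2);  // preprocessing + inference
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    return copy_ms + exec_ms;
}
```

Because events are recorded in stream order on the device, this measures where time is actually spent on the GPU rather than where the host happened to block.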
Key Findings
- Overall gains: GPUDirect RDMA (GDR) saves 15–50% of model-serving latency (≈70–160 ms) compared to TCP across a range of setups.
- Direct connection (ResNet50): with server-side preprocessing, GDR and RDMA reduce latency by 20.3% and 11.4% vs. TCP; without preprocessing, by 23.2% and 15.2%, respectively. GDR adds only ~0.27–0.53 ms over local processing, while TCP adds ~1.2–1.5 ms.
- Latency breakdown: TCP incurs 0.61–0.73 ms more per direction than RDMA/GDR; GDR further saves ~0.2–0.3 ms by eliminating H2D/D2H copies.
- Dependence on model size and I/O: smaller models and larger I/O sizes increase the communication fraction and thus benefit more from RDMA/GDR. Example overheads vs. local: MobileNetV3 adds ≥80.8% (raw) and 48.1% (preprocessed), while WideResNet101 adds only ~4.5% and 2%, respectively.
- Communication fraction examples: for MobileNetV3, time spent in data movement is 62% (TCP), 42% (RDMA), and 30% (GDR). For DeepLabV3 (raw), TCP spends 60% in data movement versus 32% (RDMA) and 23% (GDR); TCP adds 68–71 ms vs. RDMA/GDR.
- CPU usage: TCP has the highest CPU usage due to CPU-driven networking; with DeepLabV3, TCP uses ~100% more CPU than GDR. RDMA adds only minimal CPU overhead for issuing copies.
- Proxied connections (MobileNetV3, raw): replacing only the last hop with hardware-accelerated transport substantially reduces latency versus TCP/TCP: TCP/RDMA saves 23% and TCP/GDR saves 57%; hardware-accelerated links also reduce performance variability.
- Scalability with multiple clients (direct): GDR consistently outperforms RDMA and TCP; with 16 clients, GDR saves ~4.7 ms (MobileNetV3) and ~160 ms (DeepLabV3) vs. TCP. RDMA's advantage over TCP diminishes as clients increase because the H2D/D2H copy engines become the bottleneck; network I/O (request/response) rarely does.
- Scalability with proxied connections: using GDR on the last hop can match end-to-end RDMA/GDR and can outperform RDMA/RDMA; GDR at the last hop saves ~27% vs. TCP/TCP and is within ~4% of RDMA/GDR in the best cases.
- Managing concurrency (ResNet50): limiting concurrency to a single stream increases latency by ~33% vs. full concurrency (one stream per client). Increasing streams reduces latency with diminishing returns, bounded by model and device limits.
- Variability: with 16 clients, the coefficient of variation (CoV) of processing time is lower with GDR (0.11) than with RDMA (0.21), indicating interference from copy engines despite the nominal independence of execution and copy paths.
- Priority clients (YoloV4, preprocessed): with GDR, a high-priority client maintains low latency (e.g., ~54 ms) even as clients are added; with RDMA, priority benefits erode because the copy engines interleave work at coarse granularity, limiting prioritization efficacy (see the stream-priority sketch after this list).
- Sharing methods (EfficientNetB0): MPS outperforms multi-context; with GDR, multi-stream performs on par with MPS; with RDMA, MPS outperforms multi-stream, suggesting that copy engines are interleaved and shared differently across processes than across threads.
- Design takeaways: the communication fraction critically determines the gains; protocol translation to hardware-accelerated transport within the cluster is worthwhile; H2D/D2H copies are the key bottleneck that GDR avoids; the copy engine's coarse interleaving limits prioritization; and GDR on the last hop of proxied setups yields most of the benefits.
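For context on the priority-client result above, per-client streams map naturally onto CUDA stream priorities; a minimal sketch (ours, with illustrative names) is below. CUDA priorities influence which stream's pending work the execution engines schedule next, which is why they help under GDR, while the copy engine's coarse interleaving largely ignores them under RDMA/TCP:

```cpp
// Sketch: per-client CUDA streams with priorities (our illustration).
// Scheduling is non-preemptive at thread-block granularity, and (per the
// paper's findings) the copy engine honors priority only coarsely.
#include <cuda_runtime.h>

void make_client_streams(cudaStream_t* prio_client, cudaStream_t* other) {
    int least = 0, greatest = 0;  // numerically lower value = higher priority
    cudaDeviceGetStreamPriorityRange(&least, &greatest);
    cudaStreamCreateWithPriority(prio_client, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(other, cudaStreamNonBlocking, least);
}
```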
Discussion
The study demonstrates that internal edge network fabrics and GPU memory-transfer paths are pivotal determinants of end-to-end latency in model serving. When communication constitutes a large share of the pipeline (small models, large I/O, faster accelerators), hardware-accelerated transports offer substantial improvements; GDR's avoidance of H2D/D2H copies is decisive, especially under concurrency, where GPU copy engines become bottlenecks that negate RDMA's benefits over TCP. Applying RDMA/GDR even partially (e.g., on the gateway-to-server hop) yields significant reductions relative to end-to-end TCP and reduces variability.

GPU scheduling and sharing strategies substantially modulate transport benefits: prioritization works well for execution engines (GDR) but is limited for copy engines (RDMA/TCP), and concurrency management affects both the mean and the variance of latency, with differing impacts between execution-only (GDR) and execution-plus-copy (RDMA) paths.

These insights guide deployment: use GDR where possible, especially on hops adjacent to GPU servers; anticipate diminishing returns from RDMA if copy engines are saturated; choose sharing modes (e.g., MPS vs. multi-stream) mindful of copy-engine behavior; and consider protocol translation at gateways to unlock benefits within the cluster despite TCP at the edge. Overall, the findings directly answer the research question by quantifying when and how hardware-accelerated transports improve model-serving latency and by exposing the GPU-copy bottlenecks that govern scalability.
Conclusion
The paper introduces a model-serving framework that supports TCP, RDMA, and GPUDirect RDMA with fine-grained pipeline profiling, enabling systematic evaluation across transport mechanisms, connection modes, workloads, and GPU sharing strategies. Results show that hardware-accelerated communication is most beneficial when communication dominates the pipeline, with GDR providing the largest gains by eliminating H2D/D2H copies; RDMA provides benefits over TCP but loses advantage under high concurrency due to copy-engine bottlenecks. Using hardware-accelerated transport within the cluster, even with protocol translation, substantially reduces latency versus end-to-end TCP, and adopting GDR on the last hop captures most of the benefits. The study highlights that GPU copy operations and sharing policies significantly impact latency and variability. Potential future directions include addressing memory pinning and session scalability constraints, improving interoperability across heterogeneous data layouts and accelerators, exploring smarter scheduling/prioritization that accounts for copy-engine behavior, and investigating strategies to balance predictability and utilization under diverse workloads.
Limitations
- Memory overhead: RDMA/GDR often require per-client pinned buffers; GPU memory is limited, constraining the number of concurrent sessions (especially for GDR).
- Homogeneity: RDMA transfers raw bytes and requires consistent data layouts on both ends, limiting interoperability; proxies can help with translation.
- GPU pinning: GDR allocates GPU memory per client, tying sessions to specific GPUs or incurring inter-GPU copy costs.
- GPU inadequacy for some preprocessing: dedicated ASICs (e.g., decoders) may be better suited for certain preprocessing; in such cases, RDMA to host memory may outperform GDR for that stage, while GPUDirect can still move the processed data directly to GPUs.