This paper investigates latency overhead in machine learning (ML) computation pipelines and analyzes the potential benefits of hardware-accelerated communication using RDMA and GPUDirect RDMA (GDR). The authors built a model-serving framework supporting multiple communication mechanisms to identify performance bottlenecks. The study shows that GDR can reduce model-serving latency by 15-50% (70-160ms) relative to TCP, highlighting the roles of the communication fraction, protocol translation, and data-copy optimization.
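The abstract's emphasis on the communication fraction follows an Amdahl's-law-style argument: a faster transport can only shrink the portion of end-to-end serving time spent in communication. The sketch below is a hypothetical illustration of that bound, not code or numbers from the paper; the function name and example values are assumptions for illustration only.

```python
# Hypothetical illustration (not from the paper): the latency a faster
# transport can save is bounded by the fraction of end-to-end serving
# time spent in communication.

def latency_after(total_ms, comm_fraction, speedup):
    """End-to-end latency if only the communication part is sped up."""
    comm = total_ms * comm_fraction      # time spent moving data
    compute = total_ms - comm            # time spent in model execution
    return compute + comm / speedup      # only the comm part shrinks

# Assumed example: a 300 ms request where 50% is communication,
# and the accelerated transport moves data 5x faster.
before = 300.0
after = latency_after(before, comm_fraction=0.5, speedup=5.0)
print(f"saved {before - after:.0f} ms ({(before - after) / before:.0%})")
```

Under these assumed inputs the savings land at 40% of end-to-end latency, consistent in spirit with the 15-50% range reported above: the larger the communication fraction, the more a transport like GDR can help.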
Publisher
This information was not provided in the paper.
Published On
Jan 01, 2023
Authors
Walid A Hanafy, Limin Wang, Hyunseok Chang, Sarit Mukherjee, T V Lakshman, Prashant Shenoy
Tags
latency
machine learning
hardware acceleration
RDMA
GPUDirect RDMA
model-serving
performance optimization