Abstract
This paper investigates the latency overhead in machine learning (ML)-based computation pipelines and analyzes the potential benefits of hardware-accelerated communication using RDMA and GPUDirect RDMA (GDR). The authors built a model-serving framework supporting multiple communication mechanisms to identify performance bottlenecks. The study shows that GDR can reduce model-serving latency by 15-50% (70-160 ms) compared to TCP, highlighting the importance of the communication fraction, protocol translation, and data-copy optimization.
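To make the reported range concrete, the sketch below computes the absolute latency saved for a hypothetical TCP baseline using the 15-50% savings fraction from the abstract. The 400 ms baseline is an illustrative assumption, not a number from the paper:

```python
def gdr_savings_ms(tcp_latency_ms: float, savings_fraction: float) -> float:
    """Absolute model-serving latency saved by GDR, given a TCP baseline
    and a savings fraction (the abstract reports 0.15-0.50)."""
    return tcp_latency_ms * savings_fraction

# Hypothetical 400 ms TCP-based serving latency at the reported range:
low = gdr_savings_ms(400.0, 0.15)   # 60.0 ms saved
high = gdr_savings_ms(400.0, 0.50)  # 200.0 ms saved
```

The larger the fraction of end-to-end latency spent on communication, the closer a deployment sits to the upper end of this range.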
Publisher
This information was not provided in the paper.
Published On
Jan 01, 2023
Authors
Walid A Hanafy, Limin Wang, Hyunseok Chang, Sarit Mukherjee, T V Lakshman, Prashant Shenoy
Tags
latency
machine learning
hardware acceleration
RDMA
GPUDirect RDMA
model-serving
performance optimization