[1812.00669] Hoard: A Distributed Data Caching System to Accelerate Deep Learning Training on the Cloud

(Submitted on 3 Dec 2018)

Abstract: Deep Learning system architects strive to design a balanced system in which the computational accelerator (FPGA, GPU, etc.) is not starved for data. Feeding training data fast enough to keep accelerator utilization high is difficult when using dedicated hardware such as GPUs. As accelerators get faster, the storage media and data buses feeding them have not kept pace, and the ever-increasing size of training data further compounds the problem. We describe the design and implementation of a distributed caching system called Hoard that stripes data across the fast local disks of multiple GPU nodes using a distributed file system, efficiently feeding the data to minimize degradation in GPU utilization due to I/O starvation. Hoard can cache the data from a central storage system before the start of the job or during its initial execution, and it feeds the cached data for subsequent epochs of the same job and for di…
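The abstract's core idea is cache-on-first-read: the first pass over the data pulls each sample from slow central storage onto fast local disk, and later epochs are served from the local copy. Below is a minimal sketch of that idea only, not the paper's implementation; the paths, file layout, and `CachingLoader` class are hypothetical, and the striping/distributed-file-system layer that Hoard adds across GPU nodes is omitted.

```python
# Minimal sketch of cache-on-first-read (hypothetical, not Hoard's actual code).
import os
import shutil


class CachingLoader:
    def __init__(self, central_dir, cache_dir):
        self.central_dir = central_dir  # slow shared storage (e.g. an NFS or object-store mount)
        self.cache_dir = cache_dir      # fast local disk (e.g. local NVMe)
        os.makedirs(cache_dir, exist_ok=True)
        self.files = sorted(os.listdir(central_dir))

    def _fetch(self, name):
        cached = os.path.join(self.cache_dir, name)
        if not os.path.exists(cached):
            # First access: copy from central storage, warming the local cache.
            shutil.copyfile(os.path.join(self.central_dir, name), cached)
        with open(cached, "rb") as f:
            return f.read()

    def epoch(self):
        # Epoch 1 populates the cache as a side effect of reading;
        # subsequent epochs read entirely from local disk.
        for name in self.files:
            yield name, self._fetch(name)


if __name__ == "__main__":
    loader = CachingLoader("/mnt/central/dataset", "/local/nvme/hoard_cache")
    for _ in range(3):
        for name, blob in loader.epoch():
            pass  # feed `blob` to the training step / accelerator
```

In the paper's setting this single-node cache is generalized: the cached data is striped across the local disks of multiple GPU nodes behind a distributed file system, so any node in the job can read any sample at local-disk speed.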

Related Topics: Deep Learning, FPGA