DLion: Decentralized Distributed Deep Learning in Micro-Clouds

Rankyung Hong, Abhishek Chandra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Scopus citations

Abstract

Deep learning (DL) is a popular technique for building models from large quantities of data, such as pictures, videos, and messages generated by edge devices at a rapid pace all over the world. It is often infeasible to migrate large quantities of data from the edges to centralized data center(s) over WANs for training due to privacy, cost, and performance reasons. At the same time, training large DL models on edge devices is infeasible due to their limited resources. An attractive alternative for DL training on distributed data is to use micro-clouds: small-scale clouds deployed near edge devices in multiple locations. However, micro-clouds present the challenges of both computation and network resource heterogeneity as well as dynamism. In this paper, we introduce DLion, a new and generic decentralized distributed DL system designed to address the key challenges in micro-cloud environments, in order to reduce overall training time and improve model accuracy. We present three key techniques in DLion: (1) weighted dynamic batching to maximize data parallelism for dealing with heterogeneous and dynamic compute capacity, (2) per-link prioritized gradient exchange to reduce communication overhead for model updates based on available network capacity, and (3) direct knowledge transfer to improve model accuracy by merging the best-performing model parameters. We build a prototype of DLion on top of TensorFlow and show that DLion achieves up to 4.2X speedup in an Amazon GPU cluster, and up to 2X speedup and 26% higher model accuracy in a CPU cluster, over four state-of-the-art distributed DL systems.
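As a rough illustration of the general idea behind batching weighted by compute capacity, the sketch below splits a global batch across workers in proportion to their measured throughput. It is not taken from the paper and is not DLion's algorithm: the function name `split_batch`, the throughput figures, and the proportional rule itself are assumptions made for illustration only.

```python
from typing import Dict


def split_batch(global_batch: int, throughput: Dict[str, float]) -> Dict[str, int]:
    """Split a global batch across workers in proportion to throughput.

    Hypothetical helper: `throughput` maps a worker name to its measured
    samples per second. A real system would re-measure these values over
    time, since compute capacity in micro-clouds can change.
    """
    if global_batch < 0:
        raise ValueError("global_batch must be non-negative")
    total = sum(throughput.values())
    if total <= 0:
        raise ValueError("total throughput must be positive")

    # Give each worker the floor of its proportional share.
    shares = {w: int(global_batch * t / total) for w, t in throughput.items()}

    # Hand the leftover samples, one each, to the fastest workers.
    remainder = global_batch - sum(shares.values())
    fastest_first = sorted(throughput, key=throughput.get, reverse=True)
    for worker in fastest_first[:remainder]:
        shares[worker] += 1
    return shares


if __name__ == "__main__":
    # Assumed throughputs: one worker three times faster than the other.
    print(split_batch(128, {"gpu-node": 3.0, "cpu-node": 1.0}))
```

Under this assumed rule, a faster worker processes more samples per step, so workers finish each step at roughly the same time instead of waiting on the slowest one.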

Original language: English (US)
Title of host publication: HPDC 2021 - Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery, Inc
Pages: 227-238
Number of pages: 12
ISBN (Electronic): 9781450382175
State: Published - Jun 21 2021
Event: 30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2021 - Virtual, Online, Sweden
Duration: Jun 21 2021 to Jun 25 2021

Publication series

Name: HPDC 2021 - Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing

Conference

Conference: 30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2021
Country/Territory: Sweden
City: Virtual, Online
Period: 6/21/21 to 6/25/21

Bibliographical note

Funding Information:
This work is supported in part by NSF grant CNS-1717834.

Publisher Copyright:
© 2020 ACM.

Keywords

  • deep learning
  • edge computing
  • micro-clouds
  • resource allocation
