CDBB: An NVRAM-based burst buffer coordination system for parallel file systems

Ziqi Fan, Fenggang Wu, Jim Diehl, David H.C. Du, Doug Voigt

Research output: Contribution to journal › Conference article › peer-review

Abstract

For modern HPC systems, failures are treated as the norm instead of exceptions. To avoid rerunning applications from scratch, checkpoint/restart techniques are employed to periodically checkpoint intermediate data to parallel file systems. To increase HPC checkpointing speed, distributed burst buffers (DBB) have been proposed to use node-local NVRAM to absorb the bursty checkpoint data. However, without proper coordination, DBB is prone to suffer from low resource utilization. To solve this problem, we propose an NVRAM-based burst buffer coordination system, named collaborative distributed burst buffer (CDBB). CDBB coordinates all the available burst buffers, based on their priorities and states, to help overburdened burst buffers and maximize resource utilization. We built a proof-of-concept prototype and tested CDBB at the Minnesota Supercomputing Institute. Compared with a traditional DBB system, CDBB can speed up checkpointing by up to 8.4x under medium and heavy workloads and only introduces negligible overhead.

Original language: English (US)
Pages (from-to): 1-12
Number of pages: 12
Journal: Simulation Series
Volume: 50
Issue number: 4
State: Published - 2018
Event: 26th High Performance Computing Symposium, HPC 2018, Part of the 2018 Spring Simulation Multi-Conference, SpringSim 2018 - Baltimore, United States
Duration: Apr 15 2018 to Apr 18 2018

Bibliographical note

Funding Information:
This work is partially supported by the following NSF awards: 1305237, 1421913, 1439622 and 1525617. This work is also supported by Hewlett Packard Enterprise.

Keywords

  • Burst buffer
  • Coordination system
  • Non-volatile memory
  • Parallel file system
