CDBB: An NVRAM-based burst buffer coordination system for parallel file systems

Ziqi Fan; Fenggang Wu; Jim Diehl; David H.C. Du; Doug Voigt

Computer Science and Engineering

Research output: Contribution to journal › Conference article › peer-review

Abstract

For modern HPC systems, failures are treated as the norm instead of exceptions. To avoid rerunning applications from scratch, checkpoint/restart techniques are employed to periodically checkpoint intermediate data to parallel file systems. To increase HPC checkpointing speed, distributed burst buffers (DBB) have been proposed to use node-local NVRAM to absorb the bursty checkpoint data. However, without proper coordination, DBB is prone to suffer from low resource utilization. To solve this problem, we propose an NVRAM-based burst buffer coordination system, named collaborative distributed burst buffer (CDBB). CDBB coordinates all the available burst buffers, based on their priorities and states, to help overburdened burst buffers and maximize resource utilization. We built a proof-of-concept prototype and tested CDBB at the Minnesota Supercomputing Institute. Compared with a traditional DBB system, CDBB can speed up checkpointing by up to 8.4x under medium and heavy workloads and only introduces negligible overhead.

Original language	English (US)
Pages (from-to)	1-12
Number of pages	12
Journal	Simulation Series
Volume	50
Issue number	4
State	Published - 2018
Event	26th High Performance Computing Symposium, HPC 2018, Part of the 2018 Spring Simulation Multi-Conference, SpringSim 2018 - Baltimore, United States
Duration	Apr 15 2018 → Apr 18 2018

Bibliographical note

Funding Information:
This work is partially supported by the following NSF awards: 1305237, 1421913, 1439622 and 1525617. This work is also supported by Hewlett Packard Enterprise.

Keywords

Burst buffer
Coordination system
Non-volatile memory
Parallel file system

Cite this

@article{2f658cd12eb64a79b7fa66528ce84388,

title = "CDBB: An NVRAM-based burst buffer coordination system for parallel file systems",

abstract = "For modern HPC systems, failures are treated as the norm instead of exceptions. To avoid rerunning applications from scratch, checkpoint/restart techniques are employed to periodically checkpoint intermediate data to parallel file systems. To increase HPC checkpointing speed, distributed burst buffers (DBB) have been proposed to use node-local NVRAM to absorb the bursty checkpoint data. However, without proper coordination, DBB is prone to suffer from low resource utilization. To solve this problem, we propose an NVRAM-based burst buffer coordination system, named collaborative distributed burst buffer (CDBB). CDBB coordinates all the available burst buffers, based on their priorities and states, to help overburdened burst buffers and maximize resource utilization. We built a proof-of-concept prototype and tested CDBB at the Minnesota Supercomputing Institute. Compared with a traditional DBB system, CDBB can speed up checkpointing by up to 8.4x under medium and heavy workloads and only introduces negligible overhead.",

keywords = "Burst buffer, Coordination system, Non-volatile memory, Parallel file system",

author = "Ziqi Fan and Fenggang Wu and Jim Diehl and Du, {David H.C.} and Doug Voigt",

note = "Funding Information: This work is partially supported by the following NSF awards: 1305237, 1421913, 1439622 and 1525617. This work is also supported by Hewlett Packard Enterprise.; 26th High Performance Computing Symposium, HPC 2018, Part of the 2018 Spring Simulation Multi-Conference, SpringSim 2018 ; Conference date: 15-04-2018 Through 18-04-2018",

year = "2018",

language = "English (US)",

volume = "50",

pages = "1--12",

journal = "Simulation Series",

issn = "0735-9276",

number = "4",

}

TY - JOUR

T1 - CDBB: An NVRAM-based burst buffer coordination system for parallel file systems

T2 - 26th High Performance Computing Symposium, HPC 2018, Part of the 2018 Spring Simulation Multi-Conference, SpringSim 2018

AU - Fan, Ziqi

AU - Wu, Fenggang

AU - Diehl, Jim

AU - Du, David H.C.

AU - Voigt, Doug

N1 - Funding Information: This work is partially supported by the following NSF awards: 1305237, 1421913, 1439622 and 1525617. This work is also supported by Hewlett Packard Enterprise.

PY - 2018

Y1 - 2018

N2 - For modern HPC systems, failures are treated as the norm instead of exceptions. To avoid rerunning applications from scratch, checkpoint/restart techniques are employed to periodically checkpoint intermediate data to parallel file systems. To increase HPC checkpointing speed, distributed burst buffers (DBB) have been proposed to use node-local NVRAM to absorb the bursty checkpoint data. However, without proper coordination, DBB is prone to suffer from low resource utilization. To solve this problem, we propose an NVRAM-based burst buffer coordination system, named collaborative distributed burst buffer (CDBB). CDBB coordinates all the available burst buffers, based on their priorities and states, to help overburdened burst buffers and maximize resource utilization. We built a proof-of-concept prototype and tested CDBB at the Minnesota Supercomputing Institute. Compared with a traditional DBB system, CDBB can speed up checkpointing by up to 8.4x under medium and heavy workloads and only introduces negligible overhead.

KW - Burst buffer

KW - Coordination system

KW - Non-volatile memory

KW - Parallel file system

UR - http://www.scopus.com/inward/record.url?scp=85055276440&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85055276440&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85055276440

SN - 0735-9276

VL - 50

SP - 1

EP - 12

JO - Simulation Series

JF - Simulation Series

IS - 4

Y2 - 15 April 2018 through 18 April 2018

ER -
