TY - GEN
T1 - Characterizing datasets for data deduplication in backup applications
AU - Park, Nohhyun
AU - Lilja, David J.
PY - 2010
Y1 - 2010
N2 - The compression and throughput performance of a data deduplication system is directly affected by the input dataset. We propose two sets of evaluation metrics, and the means to extract those metrics, for deduplication systems. The first set of metrics represents how the composition of segments changes within the deduplication system over five full backups. This in turn provides more insight into how the compression ratio will change as data accumulate. The second set of metrics represents index table fragmentation caused by duplicate elimination and the arrival rate at the underlying storage system. We show that, while shorter sequences of unique data may be bad for index caching, they provide a more uniform arrival rate, which improves the overall throughput. Finally, we compute the metrics derived from the datasets under evaluation and show how the datasets perform with different metrics. Our evaluation shows that backup datasets typically exhibit patterns in how they change over time and that these patterns are quantifiable in terms of how they affect the deduplication process. This quantification allows us to: 1) decide whether deduplication is applicable, 2) provision resources, 3) tune the data deduplication parameters, and 4) potentially decide which portion of the dataset is best suited for deduplication.
AB - The compression and throughput performance of a data deduplication system is directly affected by the input dataset. We propose two sets of evaluation metrics, and the means to extract those metrics, for deduplication systems. The first set of metrics represents how the composition of segments changes within the deduplication system over five full backups. This in turn provides more insight into how the compression ratio will change as data accumulate. The second set of metrics represents index table fragmentation caused by duplicate elimination and the arrival rate at the underlying storage system. We show that, while shorter sequences of unique data may be bad for index caching, they provide a more uniform arrival rate, which improves the overall throughput. Finally, we compute the metrics derived from the datasets under evaluation and show how the datasets perform with different metrics. Our evaluation shows that backup datasets typically exhibit patterns in how they change over time and that these patterns are quantifiable in terms of how they affect the deduplication process. This quantification allows us to: 1) decide whether deduplication is applicable, 2) provision resources, 3) tune the data deduplication parameters, and 4) potentially decide which portion of the dataset is best suited for deduplication.
UR - http://www.scopus.com/inward/record.url?scp=78751526844&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78751526844&partnerID=8YFLogxK
U2 - 10.1109/IISWC.2010.5650369
DO - 10.1109/IISWC.2010.5650369
M3 - Conference contribution
AN - SCOPUS:78751526844
SN - 9781424492978
T3 - IEEE International Symposium on Workload Characterization, IISWC'10
BT - IEEE International Symposium on Workload Characterization, IISWC'10
T2 - 2010 IEEE International Symposium on Workload Characterization, IISWC'10
Y2 - 2 December 2010 through 4 December 2010
ER -