Efficient and retargetable dynamic binary translation on multicores

Ding Yong Hong; Jan Jan Wu; Pen Chung Yew; Wei Chung Hsu; Chun Chen Hsu; Pangfeng Liu; Chien Min Wang; Yeh Ching Chung

doi:10.1109/TPDS.2013.56

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

Dynamic binary translation (DBT) is a core technology for many important applications such as system virtualization, dynamic binary instrumentation, and security. However, several factors often impede its performance: 1) emulation overhead before translation; 2) translation and optimization overhead; and 3) translated code quality. A further issue is retargetability, that is, supporting guest applications from different instruction-set architectures (ISAs) on host machines that also have different ISAs, an important feature for system virtualization. In this work, we take advantage of ubiquitous multicore platforms and use a multithreaded approach to implement DBT. Running the translator and the dynamic binary optimizer in separate threads on different cores off-loads the overhead that DBT imposes on the target applications, which makes it affordable for DBT to apply more sophisticated optimization techniques while remaining retargetable. Using QEMU (a popular retargetable DBT for system virtualization) and the Low-Level Virtual Machine (LLVM) as our building blocks, we demonstrate with a multithreaded DBT prototype, called Hybrid-QEMU (HQEMU), that it can improve QEMU performance by factors of 2.6× and 4.1× on the SPEC CPU2006 integer and floating-point benchmarks, respectively, for dynamic translation of x86 code to run on x86-64 platforms. For ARM code translated to run on x86-64 platforms, HQEMU achieves a 2.5× speedup over QEMU on the SPEC CPU2006 integer benchmarks. We also address the performance scalability of multithreaded applications across ISAs. We identify two major impediments to performance scalability in QEMU: 1) coarse-grained locks used to protect shared data structures, and 2) inefficient emulation of atomic instructions across ISAs. We propose two techniques to mitigate these problems: 1) using indirect branch translation caching (IBTC) to avoid frequent accesses to locks, and 2) using lightweight memory transactions to emulate atomic instructions across ISAs. Our experimental results show that, for multithreaded applications, HQEMU achieves a 25× speedup over QEMU on the PARSEC benchmarks.

Original language	English (US)
Article number	6471968
Pages (from-to)	622-632
Number of pages	11
Journal	IEEE Transactions on Parallel and Distributed Systems
Volume	25
Issue number	3
DOIs	https://doi.org/10.1109/TPDS.2013.56
State	Published - Mar 2014

Keywords

Dynamic binary translation
feedback-directed optimization
hardware performance monitoring
multicores
traces

Cite this

@article{815a65fecd4d46119085708b694e4dd7,

title = "Efficient and retargetable dynamic binary translation on multicores",

abstract = "Dynamic binary translation (DBT) is a core technology for many important applications such as system virtualization, dynamic binary instrumentation, and security. However, several factors often impede its performance: 1) emulation overhead before translation; 2) translation and optimization overhead; and 3) translated code quality. A further issue is retargetability, that is, supporting guest applications from different instruction-set architectures (ISAs) on host machines that also have different ISAs, an important feature for system virtualization. In this work, we take advantage of ubiquitous multicore platforms and use a multithreaded approach to implement DBT. Running the translator and the dynamic binary optimizer in separate threads on different cores off-loads the overhead that DBT imposes on the target applications, which makes it affordable for DBT to apply more sophisticated optimization techniques while remaining retargetable. Using QEMU (a popular retargetable DBT for system virtualization) and the Low-Level Virtual Machine (LLVM) as our building blocks, we demonstrate with a multithreaded DBT prototype, called Hybrid-QEMU (HQEMU), that it can improve QEMU performance by factors of 2.6× and 4.1× on the SPEC CPU2006 integer and floating-point benchmarks, respectively, for dynamic translation of x86 code to run on x86-64 platforms. For ARM code translated to run on x86-64 platforms, HQEMU achieves a 2.5× speedup over QEMU on the SPEC CPU2006 integer benchmarks. We also address the performance scalability of multithreaded applications across ISAs. We identify two major impediments to performance scalability in QEMU: 1) coarse-grained locks used to protect shared data structures, and 2) inefficient emulation of atomic instructions across ISAs. We propose two techniques to mitigate these problems: 1) using indirect branch translation caching (IBTC) to avoid frequent accesses to locks, and 2) using lightweight memory transactions to emulate atomic instructions across ISAs. Our experimental results show that, for multithreaded applications, HQEMU achieves a 25× speedup over QEMU on the PARSEC benchmarks.",

keywords = "Dynamic binary translation, feedback-directed optimization, hardware performance monitoring, multicores, traces",

author = "Hong, {Ding Yong} and Wu, {Jan Jan} and Yew, {Pen Chung} and Hsu, {Wei Chung} and Hsu, {Chun Chen} and Liu, Pangfeng and Wang, {Chien Min} and Chung, {Yeh Ching}",

year = "2014",

month = mar,

doi = "10.1109/TPDS.2013.56",

language = "English (US)",

volume = "25",

pages = "622--632",

journal = "IEEE Transactions on Parallel and Distributed Systems",

issn = "1045-9219",

publisher = "IEEE Computer Society",

number = "3",

}

TY - JOUR

T1 - Efficient and retargetable dynamic binary translation on multicores

AU - Hong, Ding Yong

AU - Wu, Jan Jan

AU - Yew, Pen Chung

AU - Hsu, Wei Chung

AU - Hsu, Chun Chen

AU - Liu, Pangfeng

AU - Wang, Chien Min

AU - Chung, Yeh Ching

PY - 2014/3

Y1 - 2014/3

N2 - Dynamic binary translation (DBT) is a core technology for many important applications such as system virtualization, dynamic binary instrumentation, and security. However, several factors often impede its performance: 1) emulation overhead before translation; 2) translation and optimization overhead; and 3) translated code quality. A further issue is retargetability, that is, supporting guest applications from different instruction-set architectures (ISAs) on host machines that also have different ISAs, an important feature for system virtualization. In this work, we take advantage of ubiquitous multicore platforms and use a multithreaded approach to implement DBT. Running the translator and the dynamic binary optimizer in separate threads on different cores off-loads the overhead that DBT imposes on the target applications, which makes it affordable for DBT to apply more sophisticated optimization techniques while remaining retargetable. Using QEMU (a popular retargetable DBT for system virtualization) and the Low-Level Virtual Machine (LLVM) as our building blocks, we demonstrate with a multithreaded DBT prototype, called Hybrid-QEMU (HQEMU), that it can improve QEMU performance by factors of 2.6× and 4.1× on the SPEC CPU2006 integer and floating-point benchmarks, respectively, for dynamic translation of x86 code to run on x86-64 platforms. For ARM code translated to run on x86-64 platforms, HQEMU achieves a 2.5× speedup over QEMU on the SPEC CPU2006 integer benchmarks. We also address the performance scalability of multithreaded applications across ISAs. We identify two major impediments to performance scalability in QEMU: 1) coarse-grained locks used to protect shared data structures, and 2) inefficient emulation of atomic instructions across ISAs. We propose two techniques to mitigate these problems: 1) using indirect branch translation caching (IBTC) to avoid frequent accesses to locks, and 2) using lightweight memory transactions to emulate atomic instructions across ISAs. Our experimental results show that, for multithreaded applications, HQEMU achieves a 25× speedup over QEMU on the PARSEC benchmarks.

KW - Dynamic binary translation

KW - feedback-directed optimization

KW - hardware performance monitoring

KW - multicores

KW - traces

UR - http://www.scopus.com/inward/record.url?scp=84894599828&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84894599828&partnerID=8YFLogxK

U2 - 10.1109/TPDS.2013.56

DO - 10.1109/TPDS.2013.56

M3 - Article

AN - SCOPUS:84894599828

SN - 1045-9219

VL - 25

SP - 622

EP - 632

JO - IEEE Transactions on Parallel and Distributed Systems

JF - IEEE Transactions on Parallel and Distributed Systems

IS - 3

M1 - 6471968

ER -
