REFU: Redundant Execution with Idle Functional Units, Fault Tolerant GPGPU architecture
Journal
Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI
ISSN
2159-3469
Date Issued
2022-01-01
Author(s)
Raghunandana, K. K.
Varaprasad, B. K.S.V.L.
Reorda, M. Sonza
Singh, Virendra
Abstract
General-Purpose Graphics Processing Units (GPGPUs) offer energy-efficient execution and are increasingly used in a wide range of applications because of their high performance. These GPGPUs are fabricated with cutting-edge technologies. Shrinking transistor feature sizes and aggressive voltage scaling have increased the susceptibility of devices to intrinsic and extrinsic noise, leading to major reliability issues in the form of transient faults. It is therefore essential to ensure reliable operation of GPGPUs in the presence of transient faults. GPGPUs are designed for high throughput and execute multiple threads in parallel, which brings a new challenge: detecting faults across all threads with minimum overhead. This paper proposes a new fault detection method called REFU, an architectural solution that detects transient faults by temporally redundant re-execution of instructions on the idle functional execution units of the GPGPU. The performance of REFU is evaluated with standard benchmarks. For fault-free runs across different workloads, REFU shows a mean performance overhead of 2%, an average power overhead of 6%, and a peak power overhead of 10%.
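The idea of temporal redundancy on idle functional units can be illustrated with a small software model. This is only a sketch under stated assumptions, not the REFU hardware design: the abstract does not describe the scheduling policy, so the FIFO queue of unchecked instructions, the drain phase at the end, and all names (`Instr`, `run_with_idle_unit_checks`, `faulty_ident`) are hypothetical. Each cycle issues some instructions; any functional unit left idle re-executes an instruction from an earlier cycle and compares the result with the original one.

```python
import operator
from collections import deque
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    ident: int                      # hypothetical instruction id
    op: Callable[[int, int], int]   # the operation a functional unit performs
    a: int
    b: int


def run_with_idle_unit_checks(
    schedule: List[List[Instr]],
    num_units: int,
    faulty_ident: Optional[int] = None,
) -> Tuple[List[int], int, int]:
    """Return (ids with a detected mismatch, scheduled cycles, extra drain cycles).

    Assumption: a transient fault is modelled as a single bit flip in the
    primary result of the instruction whose id equals faulty_ident.
    """
    pending = deque()  # (instr, primary_result) pairs awaiting re-execution
    detected: List[int] = []

    def recheck(slots: int) -> None:
        # Re-execute up to `slots` older instructions and compare results.
        for _ in range(min(slots, len(pending))):
            instr, primary = pending.popleft()
            if instr.op(instr.a, instr.b) != primary:
                detected.append(instr.ident)

    cycles = 0
    for issued in schedule:
        if len(issued) > num_units:
            raise ValueError("more instructions issued than functional units")
        cycles += 1
        # Units not used by this cycle's instructions are idle: use them
        # to verify instructions issued in earlier cycles.
        recheck(num_units - len(issued))
        for instr in issued:
            result = instr.op(instr.a, instr.b)
            if instr.ident == faulty_ident:
                result ^= 1  # injected single-bit transient fault
            pending.append((instr, result))

    # Instructions still unchecked need extra cycles: the performance overhead.
    extra = 0
    while pending:
        extra += 1
        recheck(num_units)
    return detected, cycles, extra


# Two functional units; the second and third cycles leave units idle.
demo_schedule = [
    [Instr(0, operator.add, 2, 3), Instr(1, operator.mul, 4, 5)],
    [Instr(2, operator.sub, 9, 1)],
    [],
]
print(run_with_idle_unit_checks(demo_schedule, 2, faulty_ident=1))
```

In this model, a schedule with enough idle slots verifies every instruction without extra cycles, while a fully occupied schedule pays for the checks in a drain phase. The overhead figures in the abstract come from the authors' evaluation, not from this model.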
Volume
2022-July
Subjects