Scholars@Duke publication: Learning to Train CNNs on Faulty ReRAM-based Manycore Accelerators

Publication, Journal Article

Joardar, BK; Doppa, JR; Li, H; Chakrabarty, K; Pande, PP

Published in: ACM Transactions on Embedded Computing Systems

October 31, 2021

The growing popularity of convolutional neural networks (CNNs) has led to the search for efficient computational platforms to accelerate CNN training. Resistive random-access memory (ReRAM)-based manycore architectures offer a promising alternative to commonly used GPU-based platforms for training CNNs. However, due to the immature fabrication process and limited write endurance, ReRAMs suffer from different types of faults. This makes training of CNNs challenging as weights are misrepresented when they are mapped to faulty ReRAM cells. This results in unstable training, leading to unacceptably low accuracy for the trained model. Due to the distributed nature of the mapping of the individual bits of a weight to different ReRAM cells, faulty weights often lead to exploding gradients. This in turn introduces a positive feedback in the training loop, resulting in extremely large and unstable weights. In this paper, we propose a lightweight and reliable CNN training methodology using weight clipping to prevent this phenomenon and enable training even in the presence of many faults. Weight clipping prevents large weights from destabilizing CNN training and provides the backpropagation algorithm with the opportunity to compensate for the weights mapped to faulty cells. The proposed methodology achieves near-GPU accuracy without introducing significant area or performance overheads. Experimental evaluation indicates that weight clipping enables the successful training of CNNs in the presence of faults, while also reducing training time by 4× on average compared to a conventional GPU platform. Moreover, we demonstrate that weight clipping outperforms a recently proposed error correction code (ECC)-based method when training is carried out using faulty ReRAMs.
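The weight-clipping idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the clipping threshold of 1.0, the learning rate, and the way a fault is modeled (one weight simply reading back as a large value) are all assumptions chosen for illustration. The sketch only shows the general mechanism of clamping weights to a fixed range after each update so that a corrupted weight cannot grow without bound.

```python
# Illustrative sketch only; names, threshold, and fault model are assumptions,
# not taken from the paper.

def clip_weights(weights, threshold):
    """Clamp every weight to the range [-threshold, +threshold]."""
    return [max(-threshold, min(threshold, w)) for w in weights]


def sgd_step_with_clipping(weights, grads, lr, threshold):
    """Apply one plain gradient-descent update, then clip the result."""
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return clip_weights(updated, threshold)


# Hypothetical scenario: the weight at index 1 is read back as a very large
# value (for example, because a high-order bit sits in a faulty cell), and
# its gradient is correspondingly large.
weights = [0.3, 64.0, -0.1]
grads = [0.5, -200.0, 0.2]

new_weights = sgd_step_with_clipping(weights, grads, lr=0.1, threshold=1.0)
print(new_weights)
```

Without the clipping step, the faulty weight in this example would move from 64.0 to roughly 84.0 after the update; with clipping it is held at the threshold, while the healthy weights are updated as usual.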

Duke Scholars

Hai "Helen" Li (Author), Electrical and Computer Engineering

Published In

ACM Transactions on Embedded Computing Systems

DOI

10.1145/3476986

EISSN

1558-3465

ISSN

1539-9087

Publication Date

October 31, 2021

Volume

20

Issue

5s

Start / End Page

1 / 23

Publisher

Association for Computing Machinery (ACM)

Related Subject Headings

Computer Hardware & Architecture
4606 Distributed computing and systems software
4006 Communications engineering
1006 Computer Hardware
0805 Distributed Computing
0803 Computer Software

Citation

APA

Joardar, B. K., Doppa, J. R., Li, H., Chakrabarty, K., & Pande, P. P. (2021). Learning to Train CNNs on Faulty ReRAM-based Manycore Accelerators. ACM Transactions on Embedded Computing Systems, 20(5s), 1–23. https://doi.org/10.1145/3476986

Chicago

Joardar, Biresh Kumar, Janardhan Rao Doppa, Hai Li, Krishnendu Chakrabarty, and Partha Pratim Pande. “Learning to Train CNNs on Faulty ReRAM-based Manycore Accelerators.” ACM Transactions on Embedded Computing Systems 20, no. 5s (October 31, 2021): 1–23. https://doi.org/10.1145/3476986.

ICMJE

Joardar BK, Doppa JR, Li H, Chakrabarty K, Pande PP. Learning to Train CNNs on Faulty ReRAM-based Manycore Accelerators. ACM Transactions on Embedded Computing Systems. 2021 Oct 31;20(5s):1–23.

MLA

Joardar, Biresh Kumar, et al. “Learning to Train CNNs on Faulty ReRAM-based Manycore Accelerators.” ACM Transactions on Embedded Computing Systems, vol. 20, no. 5s, Association for Computing Machinery (ACM), Oct. 2021, pp. 1–23. Crossref, doi:10.1145/3476986.

NLM

Joardar BK, Doppa JR, Li H, Chakrabarty K, Pande PP. Learning to Train CNNs on Faulty ReRAM-based Manycore Accelerators. ACM Transactions on Embedded Computing Systems. Association for Computing Machinery (ACM); 2021 Oct 31;20(5s):1–23.
