Learning to Train CNNs on Faulty ReRAM-based Manycore Accelerators

Publication: Journal Article
Joardar, BK; Doppa, JR; Li, H; Chakrabarty, K; Pande, PP
Published in: ACM Transactions on Embedded Computing Systems
October 31, 2021

The growing popularity of convolutional neural networks (CNNs) has led to the search for efficient computational platforms to accelerate CNN training. Resistive random-access memory (ReRAM)-based manycore architectures offer a promising alternative to commonly used GPU-based platforms for training CNNs. However, due to the immature fabrication process and limited write endurance, ReRAMs suffer from different types of faults. This makes CNN training challenging, as weights are misrepresented when they are mapped to faulty ReRAM cells, resulting in unstable training and unacceptably low accuracy for the trained model. Because the individual bits of a weight are distributed across different ReRAM cells, faulty weights often lead to exploding gradients. This in turn introduces a positive feedback loop in training, resulting in extremely large and unstable weights. In this paper, we propose a lightweight and reliable CNN training methodology that uses weight clipping to prevent this phenomenon and enable training even in the presence of many faults. Weight clipping prevents large weights from destabilizing CNN training and gives the backpropagation algorithm the opportunity to compensate for the weights mapped to faulty cells. The proposed methodology achieves near-GPU accuracy without introducing significant area or performance overheads. Experimental evaluation indicates that weight clipping enables the successful training of CNNs in the presence of faults, while also reducing training time by 4× on average compared to a conventional GPU platform. Moreover, we demonstrate that weight clipping outperforms a recently proposed error correction code (ECC)-based method when training is carried out using faulty ReRAMs.
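The core mechanism described in the abstract — clamping every weight into a fixed range after each update, so that a fault-induced positive feedback loop cannot drive the weights to extreme values — can be sketched as follows. The fault model (a stuck sign bit in the stored weight), the clip bound, and all function names below are illustrative assumptions for this sketch, not the authors' actual implementation.

```python
def clip(w, bound):
    """Clamp a weight into [-bound, +bound] (the weight-clipping step)."""
    return max(-bound, min(bound, w))

def faulty_read(w):
    # Illustrative fault model (assumption): a stuck cell holding the sign
    # bit makes the stored weight read back with the wrong sign.
    return -w

def train(steps, lr=0.5, target=0.2, bound=None):
    """Toy 1-D SGD loop where the gradient is computed from the faulty
    readback, creating the positive feedback the abstract describes."""
    w = 0.1
    for _ in range(steps):
        grad = faulty_read(w) - target   # gradient uses the corrupted value
        w = w - lr * grad                # standard SGD update
        if bound is not None:
            w = clip(w, bound)           # weight clipping after each update
    return w

unclipped = train(50)            # weight explodes: each step amplifies w
clipped = train(50, bound=1.0)   # weight stays bounded inside [-1, +1]
```

Without clipping, each update multiplies the weight's magnitude by a factor greater than one, so it diverges geometrically; with clipping, the weight saturates at the bound, keeping the loss finite and leaving backpropagation room to compensate through the fault-free weights.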

Published In

ACM Transactions on Embedded Computing Systems

DOI

10.1145/3476986

EISSN

1558-3465

ISSN

1539-9087

Publication Date

October 31, 2021

Volume

20

Issue

5s

Start / End Page

1 / 23

Publisher

Association for Computing Machinery (ACM)

Related Subject Headings

  • Computer Hardware & Architecture
  • 4606 Distributed computing and systems software
  • 4006 Communications engineering
  • 1006 Computer Hardware
  • 0805 Distributed Computing
  • 0803 Computer Software
 

Citation

Joardar, B. K., Doppa, J. R., Li, H., Chakrabarty, K., & Pande, P. P. (2021). Learning to Train CNNs on Faulty ReRAM-based Manycore Accelerators. ACM Transactions on Embedded Computing Systems, 20(5s), 1–23. https://doi.org/10.1145/3476986
