The ever-increasing compute needed to run deep neural networks (DNNs) has made hardware latency and energy efficiency a growing concern. However, conventional processor architectures (such as CPUs or GPUs) constantly shuttle data between memory and processing units through the ‘von Neumann bottleneck’. This incurs time and energy overheads that significantly degrade latency and energy efficiency.
Analog in-memory computing (AIMC) using non-volatile memory (NVM) elements is a promising mixed-signal approach for DNN acceleration [1,2,3]. Here, the weights of a DNN are stored in crossbar arrays of tunable conductive elements. This enables approximate matrix-vector computation directly in memory. Activation vectors are applied to the crossbar array (as voltages or pulse durations), and analog physical quantities (instantaneous current or accumulated charge) are read out [4,5,6]. As a ‘non-von Neumann’ architecture, AIMC performs matrix-vector multiplications (MVMs) at the location of the stored weights in a highly parallel, fast, and energy-efficient manner [5]. However, it does so only approximately.
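The following is a minimal, idealized sketch of the physics described above. It is not the toolkit's implementation. Each signed weight is mapped onto a differential pair of non-negative conductances, the inputs are applied as voltages, and the cell currents are summed along each output line. The differential-pair mapping, the `g_max` value of 25 µS, the row-per-output-line layout, and all function names are illustrative assumptions. The model is deliberately noise-free.

```python
def to_conductance_pairs(weights, g_max=25.0):
    """Map signed weights (one row per output line) to (G_plus, G_minus) pairs in uS.

    Assumption: linear scaling so that the largest |w| uses the full g_max.
    """
    w_max = max(abs(w) for row in weights for w in row) or 1.0
    scale = g_max / w_max  # conductance units per weight unit
    g_plus = [[max(w, 0.0) * scale for w in row] for row in weights]
    g_minus = [[max(-w, 0.0) * scale for w in row] for row in weights]
    return g_plus, g_minus, scale


def crossbar_mvm(weights, voltages, g_max=25.0):
    """Ideal analog MVM: Ohm's law per cell, Kirchhoff's current law per output line."""
    g_plus, g_minus, scale = to_conductance_pairs(weights, g_max)
    outputs = []
    for gp_row, gm_row in zip(g_plus, g_minus):
        # Each cell contributes I = G * V; currents on the same line simply add up.
        current = sum((gp - gm) * v for gp, gm, v in zip(gp_row, gm_row, voltages))
        outputs.append(current / scale)  # convert the current back to weight units
    return outputs


if __name__ == "__main__":
    W = [[1.0, -2.0], [0.5, 0.0]]
    x = [1.0, 2.0]
    print(crossbar_mvm(W, x))  # → [-3.0, 0.5]
```

Because every weight is represented by two conductances, negative weights need no negative conductance. In real hardware, each physical quantity in this chain is subject to noise and limited range. That is where the approximation comes from.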
In general, this approximate computing means that DNNs pre-trained on conventional digital hardware suffer a significant accuracy drop when programmed directly onto analog crossbar arrays. The situation is similar to other AI acceleration approaches, such as reduced-precision digital AI accelerators, where DNNs (and thus MVMs) are heavily quantized and thus approximated. One solution is to retrain these DNNs in software with the approximations in mind, which makes their inference more robust to quantization. This approach has been very successful for quantized DNNs, where the approximations are very well defined: they are the inaccuracies induced by quantizing formerly real-valued weights and activations. Typically, the DNN accuracy can be (almost) fully recovered to its original, pre-quantization level.
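The following sketch makes the contrast concrete by showing the kind of well-defined approximation meant here. Real-valued weights are quantized uniformly and symmetrically to a signed n-bit grid. For values inside the clipping range, the rounding error is bounded by half a quantization step. In quantization-aware retraining, a quantizer like this would be applied in the forward pass. The function name and the per-tensor max-abs clipping are assumptions made for illustration.

```python
def quantize_symmetric(values, n_bits=4, clip=None):
    """Round values to 2**(n_bits-1) - 1 positive and negative levels (plus zero).

    Assumption: per-tensor clipping at max |value| unless `clip` is given.
    Returns the quantized values and the quantization step.
    """
    levels = 2 ** (n_bits - 1) - 1  # e.g. 7 positive levels for 4 bits
    if clip is None:
        clip = max(abs(v) for v in values) or 1.0
    step = clip / levels
    quantized = []
    for v in values:
        k = max(-levels, min(levels, round(v / step)))  # round, then clip to range
        quantized.append(k * step)
    return quantized, step


if __name__ == "__main__":
    w = [0.1, -0.52, 0.9, -1.0]
    q, step = quantize_symmetric(w, n_bits=4)
    # The error is deterministic and bounded: |q - w| <= step / 2 inside the clip range.
    print(max(abs(a - b) for a, b in zip(q, w)) <= step / 2)  # → True
```

The error here is a fixed, known function of the weights. The analog case discussed next is harder because its deviations are random and material-dependent.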
In a recent paper [7], we investigated applying a similar approach to analog in-memory computing. The challenge is that AIMC computes MVMs with physical quantities, such as the electric current and conductance of real material substrates. As a result, the approximations are much more difficult to characterize. They are also highly stochastic, both across chip replicas and across repeated computations.
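The two kinds of stochasticity can be illustrated with a toy model. Programming noise is drawn once when the weights are written to a chip and stays fixed for that chip. Read noise, by contrast, is redrawn on every MVM. The Gaussian noise model, the noise magnitudes relative to the largest weight, and the class name `ToyAnalogTile` are illustrative assumptions. This is not the model from [7].

```python
import random


class ToyAnalogTile:
    """Hypothetical analog tile with per-chip and per-repeat Gaussian noise."""

    def __init__(self, weights, prog_noise=0.03, read_noise=0.01, chip_seed=0):
        chip_rng = random.Random(chip_seed)
        w_max = max(abs(w) for row in weights for w in row)
        # Programming noise: sampled once per chip, then frozen.
        self.programmed = [
            [w + chip_rng.gauss(0.0, prog_noise * w_max) for w in row] for row in weights
        ]
        # Read noise: resampled on every MVM call.
        self.read_std = read_noise * w_max
        self.read_rng = random.Random(chip_seed + 1_000_003)

    def mvm(self, x):
        return [
            sum((g + self.read_rng.gauss(0.0, self.read_std)) * xi for g, xi in zip(row, x))
            for row in self.programmed
        ]


if __name__ == "__main__":
    W = [[0.5, -1.0], [1.0, 0.25]]
    x = [1.0, 1.0]
    chip_a, chip_b = ToyAnalogTile(W, chip_seed=1), ToyAnalogTile(W, chip_seed=2)
    print("chip A, two repeats:", chip_a.mvm(x), chip_a.mvm(x))
    print("chip B, one repeat: ", chip_b.mvm(x))
```

Repeating an MVM on chip A gives different results. Chip B additionally carries its own, different frozen weight errors. A robust DNN has to tolerate both.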
In our study, we developed a model that characterizes the approximations typical of AIMC hardware in an abstract manner. The model includes temporal conductance fluctuations, material and electrical effects, quantization and range clipping, and other nonidealities. We then improved on earlier approaches to pre-training with AIMC hardware in mind, an approach we call hardware-aware (HWA) training. One key aspect is to inject into the training process the noise sources expected when the DNN is deployed on analog hardware. At the same time, training must respect the inherent limits of the input and output dynamic ranges and other design constraints.
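The following is a minimal sketch of the noise-injection idea. It assumes a single linear layer trained with plain SGD on a squared-error loss. In every forward pass, Gaussian weight noise (scaled to the largest weight magnitude) is added and the output is clipped to a finite range. The update is applied to the clean weights, passing the gradient straight through the noise and the clip. The noise magnitude, output bound, and function names are illustrative assumptions. HWA training in [7] uses a considerably richer noise model and full DNNs.

```python
import random


def noisy_forward(w, x, rng, weight_noise, out_bound):
    """Forward pass as the hardware might see it: noisy weights, clipped output."""
    w_max = max(abs(wi) for wi in w) or 1.0
    w_noisy = [wi + rng.gauss(0.0, weight_noise * w_max) for wi in w]
    y = sum(wi * xi for wi, xi in zip(w_noisy, x))
    return max(-out_bound, min(out_bound, y))  # limited output dynamic range


def hwa_train(data, n_in, epochs=200, lr=0.05, weight_noise=0.05, out_bound=10.0, seed=0):
    """SGD on clean weights; noise is injected only in the forward pass."""
    rng = random.Random(seed)
    w = [0.0] * n_in
    for _ in range(epochs):
        for x, target in data:
            y = noisy_forward(w, x, rng, weight_noise, out_bound)
            err = y - target
            # Straight-through: gradient of 0.5 * err**2 w.r.t. w, ignoring noise and clip.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


if __name__ == "__main__":
    data_rng = random.Random(42)
    w_true = [0.5, -1.0, 2.0]
    data = []
    for _ in range(50):
        x = [data_rng.uniform(-1.0, 1.0) for _ in w_true]
        data.append((x, sum(a * b for a, b in zip(w_true, x))))
    print(hwa_train(data, n_in=len(w_true)))
```

Keeping a clean copy of the weights while only the forward pass sees the perturbations means the optimizer still converges. The loss it minimizes, however, is the one the noisy hardware would experience.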
While earlier approaches to hardware-aware training for AIMC exist, they were typically limited to one or a few small toy DNN examples. More critically, hardware assumptions vary widely among studies, making the effectiveness of different approaches to hardware-aware training difficult to compare.
Through an extensive simulation effort, we greatly expanded the number and scale of the investigated DNN models while keeping the underlying hardware assumptions identical. For the first time, this allows two quantitative comparisons. The first is the suitability of the HWA training approach for AIMC per se. The second is the suitability of DNNs from different AI domains and with different topologies, such as recurrent DNNs for speech-to-text transcription, convolutional networks for image classification, and transformers for natural language understanding.
Interestingly, we find that HWA training is also very promising for larger DNNs, but robustness to AIMC nonidealities differs depending on the DNN topology. For instance, recurrent DNNs (such as LSTMs and RNN-T) are easily trained to nearly their original accuracy and generally perform very well. Convolutional networks, by contrast, are typically the most challenging.
We further examined how much worse individual nonidealities could become while still supporting high accuracy, thereby measuring the impact of the different noise sources. These insights give chip designers valuable feedback. Future chip designs can then focus on reducing the nonidealities that cause the largest accuracy drops.
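One simple way to run this kind of analysis is a one-at-a-time sweep. Start from a baseline set of nonidealities, scale one source while keeping the others fixed, and record the resulting error. The sketch below does this for a toy MVM with two hypothetical noise knobs: weight noise redrawn per MVM and additive output noise. It uses the mean absolute MVM error as a stand-in for the accuracy drop that [7] measures on full DNNs. All names and magnitudes are assumptions.

```python
import random


def analog_mvm(weights, x, rng, weight_noise=0.0, out_noise=0.0):
    """Toy MVM with two independent nonidealities, both relative to max |w|."""
    w_max = max(abs(w) for row in weights for w in row)
    out = []
    for row in weights:
        y = sum((w + rng.gauss(0.0, weight_noise * w_max)) * xi for w, xi in zip(row, x))
        out.append(y + rng.gauss(0.0, out_noise * w_max))
    return out


def mvm_error(noise, n=8, trials=200, seed=0):
    """Mean absolute deviation from the exact MVM, on fixed weights and inputs."""
    # Separate generators keep weights and inputs identical across noise settings.
    data_rng, noise_rng = random.Random(seed), random.Random(seed + 1)
    weights = [[data_rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    total = 0.0
    for _ in range(trials):
        x = [data_rng.uniform(-1.0, 1.0) for _ in range(n)]
        exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        noisy = analog_mvm(weights, x, noise_rng, **noise)
        total += sum(abs(a - b) for a, b in zip(noisy, exact)) / n
    return total / trials


if __name__ == "__main__":
    baseline = {"weight_noise": 0.02, "out_noise": 0.02}
    for source in baseline:
        for factor in (1, 2, 4):
            scaled = dict(baseline, **{source: baseline[source] * factor})
            print(f"{source} x{factor}: error = {mvm_error(scaled):.4f}")
```

Comparing how steeply the error grows per source shows which nonideality dominates. In a real study, the error metric would be the task accuracy of the HWA-trained DNN.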
We have released our “standard” AIMC model as part of the open-source, PyTorch-based IBM Analog Hardware Acceleration Kit [8]. We hope it will make it easier for the research community to compare and benchmark future algorithmic improvements to HWA training, even without access to specific hardware chip prototypes. Such prototypes (for example, IBM’s low-power AI chip prototypes) are currently being developed by only a few selected large labs and companies.
References
[1] G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler, K. Virwani, M. Ishii, P. Narayanan, A. Fumarola, L. L. Sanches, I. Boybat, M. Le Gallo, K. Moon, J. Woo, H. Hwang, and Y. Leblebici, “Neuromorphic computing using non-volatile memory,” Advances in Physics: X, vol. 2, no. 1, pp. 89–124, 2017.
[2] A. Sebastian, M. Le Gallo, R. Khaddam-Aljameh, and E. Eleftheriou, “Memory devices and applications for in-memory computing,” Nature Nanotechnology, vol. 15, pp. 529–544, 2020.
[3] G. W. Burr, A. Sebastian, T. Ando, and W. Haensch, “Ohm’s law plus Kirchhoff’s current law equals better AI,” IEEE Spectrum, vol. 58, no. 12, pp. 44–49, 2021.
[4] F. Merrikh-Bayat, X. Guo, M. Klachko, M. Prezioso, K. K. Likharev, and D. B. Strukov, “High-performance mixed-signal neurocomputing with nanoscale floating-gate memory cell arrays,” IEEE Transactions on Neural Networks and Learning Systems, 2017.
[5] H.-Y. Chang, P. Narayanan, S. C. Lewis, N. C. P. Farinha, K. Hosokawa, C. Mackin, H. Tsai, S. Ambrogio, A. Chen, and G. W. Burr, “AI hardware acceleration with analog memory: micro-architectures for low energy at high speed,” IBM Journal of Research and Development, vol. 63, no. 6, pp. 1–14, 2019.
[6] B. Murmann, “Mixed-signal computing for deep neural network inference,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 1, pp. 3–13, 2020.
[7] M. J. Rasch, C. Mackin, M. Le Gallo, A. Chen, A. Fasoli, F. Odermatt, N. Li, S. R. Nandakumar, P. Narayanan, H. Tsai, G. W. Burr, A. Sebastian, and V. Narayanan, “Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators,” Nature Communications, vol. 14, no. 1, art. no. 5282, 2023.
[8] M. J. Rasch, D. Moreda, T. Gokmen, M. Le Gallo, F. Carta, C. Goldberg, K. El Maghraoui, A. Sebastian, and V. Narayanan, “A flexible and fast PyTorch toolkit for simulating training and inference on analog crossbar arrays,” IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), pp. 1–4, 2021.