1. Introduction
In the last decade, medical researchers have started using CNNs to analyze images [1]-[3]. State-of-the-art medical image analysis platforms are built on Von Neumann computing architectures, which cause frequent data movement between processors and memory units, creating a performance and energy-efficiency bottleneck [4], [5]. Many edge-side accelerators have been proposed to address this issue [6]-[8]. Among the technologies proposed in recent years for edge-side computing, the resistive random-access memory (ReRAM) compute-in-memory (CIM) macro has been under extensive investigation [9]-[12].
Conventionally, ReRAM CIM macros use a one-transistor-one-resistor (1T1R) cell to store an unsigned weight bit, or two 1T1R cells (2T2R) to store a signed one [13]. MAC operations between the input vector and the weight matrix in the ReRAM array rely on a weighted-current summation mechanism following Ohm's law and Kirchhoff's law [14], [15]. However, the static current, which mainly flows through read paths, rises significantly as computing parallelism increases, causing large power consumption at the array level and weakening the energy efficiency of CIM [16]. Moreover, with the ever-growing range of algorithmic demands, ReRAM CIM macros rely on intricate off-chip circuits for shifting and concatenating operations to output high-resolution results [17]-[19].
While increasing the resistances of the high-resistance state (HRS) and low-resistance state (LRS) of ReRAM reduces the current, it is constrained by the limits of the fabrication technology. Although a sign-weighted 1T2R array can lower the current on the source line (SL) by storing the positive and negative weights in the same column, it is still based on current accumulation, so the static power consumption is almost the same [20]-[22]. Some approaches use complex reference voltage levels [23] or a series of ADCs with different resolutions [24], [25] to achieve high-resolution results, which may not fully leverage the nonvolatile nature of ReRAM in sensing the computation results.
In this work, we propose a charge-domain one-transistor-one-resistor-one-capacitor (1T1R1C) ReRAM-based CIM macro with configurable output resolution for edge medical semantic segmentation. The 1T1R1C structure effectively reduces static current during computation and supports higher parallelism. In addition, it can configure the output resolution by encoding the resistance of the ReRAM, providing a potential solution for general MAC operations and reducing the complexity of off-chip circuits. The simulation results show that our proposed macro can successfully accomplish medical semantic segmentation.
2. ReRAM-based convolutional neural networks for medical semantic segmentation
Deep learning methods for automatic segmentation of organs from Computed Tomography (CT) scans and Magnetic Resonance Imaging (MRI) are providing reliable results, leading to a revolutionary change in the workflow of radiologists [26]. However, successfully training deep networks normally requires thousands of annotated samples. Researchers built an encoder-decoder architecture named U-Net to address this issue [27]. Effectively trained with a limited dataset, U-Net serves as a robust backbone in the field of segmentation, particularly for medical images [28]. Medical image segmentation requires assigning each pixel in the original image to its corresponding semantic class. The binary predicted segmented volume obtained from the original image is denoted as P, and the ground truth obtained from manual annotation is denoted as G. The Sørensen-Dice coefficient (DSC) is used to evaluate the overall performance of the algorithm. The evaluation metric is defined as follows:
\[\begin{equation*} DSC(P,G)=\frac{2\cdot \left | P\cap G \right | }{|P|+|G|} \tag{1} \end{equation*}\]
The formula yields a value between 0 and 1, where ‘0’ indicates poor segmentation performance and ‘1’ indicates perfect segmentation.
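For concreteness, Eq. (1) can be sketched for binary masks as follows (a minimal illustration; the function name and the flat 0/1 list representation are our own choices, not from the paper):

```python
def dice_coefficient(pred, truth):
    """Sørensen-Dice coefficient (Eq. 1) for binary masks given as flat 0/1 lists."""
    intersection = sum(p and g for p, g in zip(pred, truth))  # |P ∩ G|
    total = sum(pred) + sum(truth)                            # |P| + |G|
    return 2.0 * intersection / total if total else 1.0       # empty masks agree trivially
```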
Considering the realistic characteristics of the ReRAM control method [29], we quantize the neural network to match the programming capability of the memristor. We employ a quantization method commonly used for CNNs [30], mapping the weights to a limited number of bits. Even when the memristor cannot stably support many states, the U-Net can still achieve high performance. The weights of the network are distributed across the convolutional and fully connected layers. For each layer, quantization is parameterized by the number of quantization levels and the clamping range, and is performed by applying point-wise the quantization function q defined as follows:
\[\begin{equation*} \begin{split} clamp(W_{i};a,b)&:=\min(\max(W_{i},a),b)\\ s(a,b,n)&:=\frac{b-a}{n-1} \\ q(W_{i};a,b,n)&:=\left \lfloor \frac{clamp(W_{i};a,b)-a}{s(a,b,n)} \right \rfloor s(a,b,n)+a \end{split} \tag{2} \end{equation*}\]
where \(W_{i}\) is the real-valued number to be quantized, \([a,b]\) is the quantization range, and \(n\) is the number of quantization levels. \(\left \lfloor \cdot \right \rfloor\) denotes rounding to the nearest integer. \(n\) is fixed for all layers in our experiments, e.g. \(n=2^{6}=64\) for 6-bit quantization.
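The quantization function of Eq. (2) can be sketched as below (rounding to the nearest integer, as the text specifies; the function name is ours):

```python
def quantize(w, a, b, n):
    """Uniform point-wise quantization q(w; a, b, n) from Eq. (2)."""
    s = (b - a) / (n - 1)        # step size s(a, b, n)
    w_c = min(max(w, a), b)      # clamp(w; a, b)
    return round((w_c - a) / s) * s + a
```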
The weights of the trained matrix must then be converted into resistance values that fall within the limited range of the memristor:
\[\begin{equation*} R_{i}=\frac{R_{max}-R_{min}}{W_{q,max}}\cdot |W_{q,i}|+R_{min} \tag{3} \end{equation*}\]
Equation (3) converts the quantized weights to resistance values; the procedure is implemented on the software platform. \(R_{i}\) refers to the resistance values of the convolutional layers. \(R_{max}\) and \(R_{min}\) denote the maximum and minimum resistances, respectively. \(W_{q,i}\) is a quantized weight of the network, and \(W_{q,max}\) is the maximum of the quantized weights.
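As a minimal sketch of the software-side mapping in Eq. (3) (function name and the sample resistance window are hypothetical, not device values from the paper):

```python
def weight_to_resistance(w_q, w_q_max, r_min, r_max):
    """Map a quantized weight magnitude to a resistance per Eq. (3)."""
    return (r_max - r_min) / w_q_max * abs(w_q) + r_min
```

A zero weight lands on \(R_{min}\) and the largest-magnitude weight on \(R_{max}\), so the whole weight range fits inside the device's programmable window.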
3. Proposed high resolution and configurable charge-domain ReRAM-based CIM macro
3.1 Overall structure of proposed 1T1R1C bit-cell and macro
The proposed 1T1R1C macro targets CNN acceleration in edge-computing devices. Fig. 1 presents the overall architecture of the proposed CIM macro, which is composed of 1T1R1C cells, an input decoder, output buffers and other control logic. The source line (SL) decoder circuits, comprised of NOR gates, NOT gates and AND-OR-INVERT gates, convert the input code into a current pulse. The 1T1R1C cell comprises one selection transistor \(T\), one ReRAM for 5-bit unsigned weight storage, and a capacitor \(C\) for capacitive coupling. The gate and source of \(T\) are connected to the word line (WL) and the source line (SL), respectively. The drain of \(T\) is connected to the top electrodes of the ReRAM and \(C\) at node \(N_{0}\); their bottom electrodes are connected to the complementary bit line (BL) and the coupled bit line (CBL).
3.2 1T1R1C cell for charge-domain computing
After training and quantization are finished, the weight matrix contains some negative weights. Therefore, we use a differential approach that maps a 6-bit weight onto two 1T1R1C cells [31], facilitating the representation of negative weights. Meanwhile, higher weight bits can be mapped by employing additional cells. Fig. 2 is a simple demonstration of performing a convolution in a memristor cell: a two-dimensional input CT image is converted into a one-dimensional electrical signal as the input, and the weights of the convolution kernel are mapped onto the memristor cells.
The advantages of the proposed charge-domain computing enabled by the 1T1R1C cell in terms of static current are analyzed as follows. In the prior 2T2R voltage-division (VD) based architecture [13], differential cells were employed to restrain the read-out current within a cell. However, a low-resistance path across two LRS cells in different selected rows may still form through the common SL during computing, as shown in Fig. 3(a). This low-resistance path can introduce substantial current, as shown below:
\[\begin{equation*} I_{total}=\frac{V_{read}}{R_{L,sel}+R_{R,sel}} \tag{4} \end{equation*}\]
where \(R_{L,sel}\) and \(R_{R,sel}\) denote the parallel resistances of the left and right resistors of the cells in the selected rows. As parallelism increases, \(R_{L,sel}\) and \(R_{R,sel}\) can become much smaller than the LRS resistance. Consequently, a large direct-current path between the complementary bit lines (BL/BLB) arises, exacerbating power consumption and IR-drop concerns. By contrast, the 1T1R1C array reduces these currents through charge-style computing. As shown in Fig. 3(b), the WL is initially connected to VDD, and a current pulse of a designated magnitude is injected from the SL. The current interacts with memristors of distinct resistance values, so a specific charge accumulates on the capacitor during the pre-charge time, as shown below:
\[\begin{equation*} Q_{total}=\sum CI_i(R_{pi}-R_{ni}) \tag{5} \end{equation*}\]
where \(C\) represents the capacitance of the coupling capacitors, and \(I_i\) is the input current pulse, representing the input data of the network. \(R_{pi}\) and \(R_{ni}\) are the memristors representing the positive and negative terminals of the differential pair in the selected column. The accumulated charge in the array is then sensed through the coupled bit line positive terminal (CBLP) and the coupled bit line negative terminal (CBLN). After a short pre-charge time, the WL and SL are turned off, so the array no longer needs to interact with the read voltage or current, which greatly reduces power consumption during computation. Moreover, the capacitive decoupling prevents a large current from flowing directly through the memristor, which would otherwise cause nonlinear reading and reduce computational accuracy.
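As a behavioral sketch of the differential charge accumulation in Eq. (5) (function name and the numeric values below are illustrative, not measured device parameters):

```python
def accumulated_charge(c, inputs, r_pos, r_neg):
    """Differential charge per Eq. (5): Q_total = sum_i C * I_i * (R_pi - R_ni).

    c      -- coupling capacitance [F]
    inputs -- input current pulses I_i [A]
    r_pos  -- resistances of the positive-terminal memristors [ohm]
    r_neg  -- resistances of the negative-terminal memristors [ohm]
    """
    return sum(c * i * (rp - rn) for i, rp, rn in zip(inputs, r_pos, r_neg))
```

Note that the product C·I·R has units of farad·volt, i.e. coulombs, so each differential pair contributes a signed charge proportional to its stored weight.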
3.3 Configurable capacitive MAC operation details
In this paper, we propose an on-chip configurable output mode with adjustable precision. This approach utilizes the programmable and nonvolatile attributes of memristors, eliminating the need for complex off-chip shifting and splicing circuitry. Fig. 4 shows the details of the circuits and waveforms. The configurable output circuit consists of configurable charge units (\(Q_{config}\)), a fully differential operational amplifier (op amp), a comparator and a charge-processing counting module (CPC), as shown in Fig. 4(a). Each \(Q_{config}\) comprises a pair of memristors; the signal ipt0, generated by the CPC, and ipt0b, its inverse, control the interaction with the input signal \(I_{in}\) to produce the least-significant-bit charge integral (\(Q_{LSB}\)) that sets the output resolution. Meanwhile, to accelerate high-bit output, we employ a subranging method that configures two \(Q_{config}\) units with different values of \(Q_{LSB}\): a larger amount of charge is moved for the high bits, and a smaller amount for the low bits, as depicted in equation (6):
\[\begin{equation*} \begin{split} Q_{LSB1}&=CI_{in1}R_{dif1}\\ Q_{LSB2}&=CI_{in2}R_{dif2}\\ \end{split} \tag{6} \end{equation*}\]
where \(Q_{LSB1}\) and \(Q_{LSB2}\) denote the charge units of the two \(Q_{config}\) units, and \(R_{dif}\) is the resistance difference between the memristor pair in a \(Q_{config}\).
Fig. 4 (a) Proposed configurable circuits, (b) the detailed waveform of a 14-bit resolution MAC operation.
Fig. 4(b) shows the four phases required for a MAC operation, i.e., array-charge, \(Q_{config}\) phase 1, \(Q_{config}\) phase 2 and \(Data_{out}\). In the array-charge phase, the 1T1R1C array and EN are activated, accumulating charge in the array and resetting the integrating capacitor connected to the op amp. In \(Q_{config}\) phase 1, the larger charge integral (\(Q_{LSB1}\)) produced by \(Q_{config1}\) is transferred to the integrating capacitor; the CPC counts and generates the high-bit data. Similarly, in \(Q_{config}\) phase 2, the smaller charge integral (\(Q_{LSB2}\)) produced by \(Q_{config2}\) is transferred. The output value is proportional to the number of charge integrals, as follows:
\[\begin{equation*} \begin{split} Q_{total}&=Q_{LSB1} \times M+Q_{LSB2}\times N\\ Data_{out}&=M+N \end{split} \tag{7} \end{equation*}\]
In the \(Data_{out}\) phase, the CPC latches the MAC results for the next stage to read.
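The two-phase counting of Eq. (7) can be sketched behaviorally as below (a simplified model under the assumption that \(Q_{LSB1}\) is an integer multiple of \(Q_{LSB2}\); the function name is ours):

```python
def subrange_count(q_total, q_lsb1, q_lsb2):
    """Subranging readout per Eq. (7): coarse count M in steps of Q_LSB1,
    then fine count N of the remainder in steps of Q_LSB2."""
    m = int(q_total // q_lsb1)                 # Q_config phase 1: high-bit count
    remainder = q_total - m * q_lsb1
    n = int(round(remainder / q_lsb2))         # Q_config phase 2: low-bit count
    return m, n                                # Q_total ≈ Q_LSB1*M + Q_LSB2*N
```

Counting most of the charge in large \(Q_{LSB1}\) steps is what makes the high-resolution conversion fast: only the small residue is resolved at the fine step size.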
4. Experiment
All experiments were conducted on an Intel Xeon(R) Silver 4110 CPU and an NVIDIA RTX 2080 Ti GPU. To better reflect the practical application of the algorithm, data augmentation techniques [27] were employed. The implementation utilized PyTorch, and circuit simulations were performed using Spectre. In the experiments, we chose the CHAOS dataset [32] for verification and normalized the gray values of the images as the input currents, i.e., the input current ranges from 0 to 15 \(\mu\)A. In this paper, we use data from a fabricated \(TaO_{x}\) device [29]. The performance analysis of the proposed macro was conducted using the SMIC 0.18 \(\mu\)m standard CMOS technology.
4.1 Experiments on datasets
The CHAOS dataset includes MR images of 40 patients, with 20 for training and 20 for testing. As ground truth was not provided for the test set due to competition requirements, we divided the training data into a training set (80%, or 16 samples) and a test set (20%, or 4 samples), adhering to common practice in deep learning for evaluating performance on a limited dataset.
As the original data are very limited, data augmentation is necessary; in particular, random elastic deformations of the training samples appear to be crucial for training a segmentation network with a limited number of annotated images. The original configuration file specifies that the liver's gray scale is 63 (between 55 and 70), and based on this, the mask of the original data is extracted. To increase the size of the dataset, we performed various augmentations such as image rotation, left-right flipping, and resizing.
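Two of the augmentations above, flipping and 90° rotation, can be sketched on a 2-D list as follows (a toy illustration; the real pipeline operates on image tensors, and the function name is ours):

```python
def augment(image):
    """Left-right flip and 90-degree clockwise rotation of a 2-D list of pixels."""
    flipped = [row[::-1] for row in image]            # mirror each row
    rotated = [list(r) for r in zip(*image[::-1])]    # transpose of the vertically flipped image
    return flipped, rotated
```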
4.2 Circuit-level evaluation
Due to the non-idealities of transistor devices and process fluctuations, the input decoder's noise should be considered. Fig. 5 shows the distribution of the input current pulse, with transistor mismatch taken into account, for an input code of 1. Repeating the evaluation 200 times resulted in a mean current of 999.971 nA and a standard deviation of 6.90987 nA, which is only 0.7% of the mean.
In the proposed accumulation scheme, many cells are attached to CBLP and CBLN, so the impact of resistance variation should be evaluated. Fig. 6 shows the circuit output value for different weights on the memristors, where \(W_p\) and \(W_n\) represent the cells attached to CBLP and CBLN. For all R-values examined, high linearity is confirmed: even in the worst case, with \(W_n = 0\), the \(R^2\) value is 0.99996. This can be attributed to the normalization introduced by the charge-integral counting method and our memristor control method [29].
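The \(R^2\) figure quoted above is the coefficient of determination of a least-squares line fit of output versus input. A minimal version of that check (names ours, data hypothetical) is:

```python
def r_squared(xs, ys):
    """Coefficient of determination R^2 for a least-squares line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot
```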
4.3 System-level evaluation
When AI algorithms are implemented on edge devices, circuit non-idealities must be taken into consideration. To investigate the influence of the circuits' non-linearity, the U-Net is implemented on the macro to perform inference on the CHAOS dataset, as shown in Fig. 7(a).
Fig. 7 (a) The U-Net model implemented on the macro, (b) the computation accuracy with different weight bits on software and on the macro.
Fig. 7(b) shows the impact of circuit non-idealities on computing accuracy for different weight precisions. The customized U-Net is mapped onto the hardware and tested on the CHAOS dataset. The fixed-point computation before and after mapping, as well as the full-precision computation, are compared. The highest inference accuracy on hardware is 89.7%, which is attributed to the high resolution and linearity of our macro.
Table I compares this brief with previous works. Since the proposed macro effectively reduces the static power during MAC computation, the energy efficiency is improved by \(\sim\)1.3X. The macro consumes 188.7 \(\mu\)W, making it suitable for edge computing devices.
5. Conclusion
In this brief, we introduce a high-resolution and configurable ReRAM CIM macro designed for efficient processing of CNNs used in medical image segmentation. The proposed charge-based operation effectively suppresses the large direct current at both the cell level and the array level during MAC operation, thereby achieving high energy efficiency. The corresponding configurable method for different output precisions contributes to reducing peripheral circuit complexity. The evaluation results demonstrate the promise of the CIM accelerator with our proposed techniques for low-power AI edge applications.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China 92364202 and in part by the CAS Strategic Leading Science and Technology Project XDB44000000.
References
[1] H. Lu, et al.: “Half-UNet: a simplified U-net architecture for medical image segmentation,” Front Neuroinform. 16 (2022) 911679 (DOI: 10.3389/fninf.2022.911679).
[2] V.V. Valindria, et al.: “Multi-modal learning from unpaired images: application to multi-organ segmentation in CT and MRI,” 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) (2018) 547 (DOI: 10.1109/WACV.2018.00066).
[3] F. Aboudi, et al.: “Efficient U-net CNN with data augmentation for MRI ischemic stroke brain segmentation,” 2022 8th International Conference on Control, Decision and Information Technologies (CoDIT) (2022) 724 (DOI: 10.1109/CoDIT55151.2022.9804030).
[4] X. Niu, et al.: “Ferroelectric polymers for neuromorphic computing,” Appl. Phys. Rev. 9 (2022) 021309 (DOI: 10.1063/5.0073085).
[5] V. Bajaj, et al.: High Performance Computing for Intelligent Medical Systems (IOP Publishing, 2021).
[6] S. Yoo, et al.: “Accelerating FPGA-implementations for mobile medical devices with high-level AI libraries: an object detection model for colorectal polyp images,” 2021 IEEE International Conference on Imaging Systems and Techniques (IST) (2021) 1 (DOI: 10.1109/IST50367.2021.9651361).
[7] S. Sağlam, et al.: “FPGA implementation of CNN algorithm for detecting malaria diseased blood cells,” 2019 International Symposium on Advanced Electrical and Communication Technologies (ISAECT) (2019) 1 (DOI: 10.1109/ISAECT47714.2019.9069724).
[8] A. Baroni, et al.: “An energy-efficient in-memory computing architecture for survival data analysis based on resistive switching memories,” Frontiers in Neuroscience 16 (2022) (DOI: 10.3389/fnins.2022.932270).
[9] D. Chen, et al.: “A 1T2R1C ReRAM CIM accelerator with energy-efficient voltage division and capacitive coupling for CNN acceleration in AI edge applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs 70 (2023) 276 (DOI: 10.1109/TCSII.2022.3201367).
[10] W. Li, et al.: “A 40 nm RRAM compute-in-memory macro featuring on-chip write-verify and offset-cancelling ADC references,” ESSCIRC 2021 - IEEE 47th European Solid State Circuits Conference (ESSCIRC) (2021) 79 (DOI: 10.1109/ESSCIRC53450.2021.9567844).
[11] A. Baroni, et al: “An energy-efficient in-memory computing architecture for survival data analysis based on resistive switching memories,” Frontiers in Neuroscience 16 (2022) (DOI: 10.3389/fnins.2022.932270).
[12] Y. Chen, et al.: “Realization of artificial neuron using MXene bi-directional threshold switching memristors,” IEEE Electron Device Lett. 40 (2019) 1686 (DOI: 10.1109/LED.2019.2936261).
[13] L. Wang, et al.: “Efficient and robust nonvolatile computing-in-memory based on voltage division in 2T2R RRAM with input-dependent sensing control,” IEEE Trans. Circuits Syst. II, Exp. Briefs 68 (2021) 1640 (DOI: 10.1109/TCSII.2021.3067385).
[14] S. Yin, et al.: “High-throughput in-memory computing for binary deep neural networks with monolithically integrated RRAM and 90-nm CMOS,” IEEE Trans. Electron Devices 67 (2020) 4185 (DOI: 10.1109/TED.2020.3015178).
[15] Q. Liu, et al.: “33.2 a fully integrated analog ReRAM based 78.4 TOPS/W compute-in-memory chip with fully parallel MAC computing,” 2020 IEEE International Solid-State Circuits Conference (ISSCC) (2020) 500 (DOI: 10.1109/ISSCC19947.2020.9062953).
[16] L. Wang, et al.: “Efficient and robust nonvolatile computing-in-memory based on voltage division in 2T2R RRAM with input-dependent sensing control,” IEEE Trans. Circuits Syst. II, Exp. Briefs 68 (2021) 1640 (DOI: 10.1109/TCSII.2021.3067385).
[17] B. Crafton, et al.: “CIM-SECDED: a 40 nm 64 Kb compute in-memory RRAM macro with ECC enabling reliable operation,” 2021 IEEE Asian Solid-State Circuits Conference (A-SSCC) (2021) 1 (DOI: 10.1109/A-SSCC53895.2021.9634742).
[18] K. Zhou, et al.: “An energy efficient computing-in-memory accelerator with 1T2R cell and fully analog processing for edge AI applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs 68 (2021) 2932 (DOI: 10.1109/TCSII.2021.3065697).
[19] W. Wan, et al.: “A compute-in-memory chip based on resistive random-access memory,” Nature 608 (2022) 504 (DOI: 10.1038/s41586-022-04992-8).
[20] Z. Jing, et al.: “VSDCA: a voltage sensing differential column architecture based on 1T2R RRAM array for computing-in-memory accelerators,” IEEE Trans. Circuits Syst. I, Reg. Papers 69 (2022) 4028 (DOI: 10.1109/TCSI.2022.3186024).
[21] J. Yang, et al.: “A 28 nm 1.5 Mb embedded 1T2R RRAM with 14.8 Mb/mm^{2} using sneaking current suppression and compensation techniques,” 2020 IEEE Symposium on VLSI Circuits (2020) 1 (DOI: 10.1109/VLSICircuits18222.2020.9163035).
[22] K. Zhou, et al.: “An energy efficient computing-in-memory accelerator with 1T2R cell and fully analog processing for edge AI applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs 68 (2021) 2932 (DOI: 10.1109/TCSII.2021.3065697).
[23] W. Wan, et al.: “A compute-in-memory chip based on resistive random-access memory,” Nature 608 (2022) 504 (DOI: 10.1038/s41586-022-04992-8).
[24] R. Jia, et al.: “A RRAM characterization system with flexible readout operations using an integrating ADC,” 2023 18th Conference on Ph.D Research in Microelectronics and Electronics (PRIME) (2023) 245 (DOI: 10.1109/PRIME58259.2023.10161880).
[25] Y. He, et al.: “C-RRAM: a fully input parallel charge-domain RRAM-based computing-in-memory design with high tolerance for RRAM variations,” 2022 IEEE International Symposium on Circuits and Systems (ISCAS) (2022) 3279 (DOI: 10.1109/ISCAS48785.2022.9937513).
[26] N. Altini, et al.: “Liver, kidney and spleen segmentation from CT scans and MRI with deep learning: a survey,” Neurocomputing 490 (2022) 30 (DOI: 10.1016/j.neucom.2021.08.157).
[27] O. Ronneberger, et al.: “U-Net: convolutional networks for biomedical image segmentation,” MICCAI 2015, Lecture Notes in Computer Science 9351 (2015) 234 (DOI: 10.1007/978-3-319-24574-4_28).
[28] M. Viqar, et al.: “Opto-UNet: optimized UNet for segmentation of varicose veins in optical coherence tomography,” 2022 10th European Workshop on Visual Information Processing (EUVIP) (2022) 1 (DOI: 10.1109/EUVIP53989.2022.9922769).
[29] Y. Chen, et al.: China Patent CN115240734A (2022).
[30] B. Jacob, et al.: “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018) 2704 (DOI: 10.1109/CVPR.2018.00286).
[31] V. Milo, et al.: “Accurate program/verify schemes of resistive switching memory (RRAM) for in-memory neural network circuits,” IEEE Trans. Electron Devices 68 (2021) 3832 (DOI: 10.1109/TED.2021.3089995).
[32] A.E. Kavur, et al.: “CHAOS challenge - combined (CT-MR) healthy abdominal organ segmentation,” Medical Image Analysis 69 (2021) 101950 (DOI: 10.1016/j.media.2020.101950).
Authors
Junjia Su
Institute of Semiconductors, Chinese Academy of Sciences
University of Chinese Academy of Sciences
Semiconductor Neural Network Intelligent and Computing Technology Beijing Key Laboratory
Yihao Chen
Institute of Semiconductors, Chinese Academy of Sciences
University of Chinese Academy of Sciences
Semiconductor Neural Network Intelligent and Computing Technology Beijing Key Laboratory
Pengcheng Feng
Institute of Semiconductors, Chinese Academy of Sciences
University of Chinese Academy of Sciences
Semiconductor Neural Network Intelligent and Computing Technology Beijing Key Laboratory
Zhelong Jiang
Institute of Semiconductors, Chinese Academy of Sciences
University of Chinese Academy of Sciences
Semiconductor Neural Network Intelligent and Computing Technology Beijing Key Laboratory
Zhigang Li
Institute of Semiconductors, Chinese Academy of Sciences
Semiconductor Neural Network Intelligent and Computing Technology Beijing Key Laboratory
Gang Chen
Institute of Semiconductors, Chinese Academy of Sciences
University of Chinese Academy of Sciences
Semiconductor Neural Network Intelligent and Computing Technology Beijing Key Laboratory