
REM-CiM: Attentional RGB-Event Fusion Multi-Modal Analog CiM for Area/Energy-Efficient Edge Object Detection during Both Day and Night

Yuya ICHIKAWA, Ayumu YAMADA, Naoko MISAWA, Chihiro MATSUI, Ken TAKEUCHI


Summary

Integrating RGB and event sensors improves object detection accuracy, especially at night, owing to the high dynamic range of event cameras. However, introducing an event sensor increases the required computational resources, which makes it difficult to implement RGB-event fusion multi-modal AI on CiM. To tackle this issue, this paper proposes RGB-Event fusion Multi-modal analog Computation-in-Memory (CiM), called REM-CiM, for multi-modal edge object detection AI. In REM-CiM, two proposals, covering the multi-modal AI algorithm and its circuit implementation, are co-designed. First, Memory capacity-Efficient Attentional Feature Pyramid Network (MEA-FPN), the model architecture for RGB-event fusion analog CiM, is proposed for parameter-efficient RGB-event fusion. Convolution-less bi-directional calibration (C-BDC) in MEA-FPN extracts important features of each modality with attention modules while reducing the number of weight parameters by removing large convolutional operations from conventional BDC. The proposed MEA-FPN with C-BDC achieves a 76% reduction of parameters while keeping the degradation of mean Average Precision (mAP) below 2.3% during both day and night, compared with Attentional FPN fusion (A-FPN), a conventional BDC-adopted FPN fusion. Second, low-bit quantization with clipping (LQC) is proposed to reduce area and energy. The proposed REM-CiM with MEA-FPN and LQC achieves almost the same number of memory cells, 21% less ADC area, 24% less ADC energy and 0.17% higher mAP than conventional FPN fusion CiM without LQC.

Publication
IEICE TRANSACTIONS on Electronics Vol.E107-C No.10 pp.426-435
Publication Date
2024/10/01
Publicized
2024/04/09
Online ISSN
1745-1353
DOI
10.1587/transele.2023CTP0001
Type of Manuscript
Special Section PAPER (Special Section on Analog Circuits and Their Application Technologies)

1.  Introduction

In many kinds of edge applications, such as autonomous driving and robot vision, object detection is an important task. Although object detection has evolved dramatically in recent years with the advancement of deep learning [1], several challenges remain in edge object detection [2]. The area/energy limitation is one of them [3]. To tackle this issue, Computation-in-Memory (CiM) [4], [5] is a promising Neural Network (NN) accelerator owing to its high-speed, low-power multiply-accumulate (MAC) calculation performed as analog approximate computation in the memory array.

Another challenge in edge object detection is adaptability to various environments such as day and night. For example, when only RGB cameras are used at night, accuracy degrades severely [6], [7]. Event cameras [8]-[10] can cope with night conditions when fused with RGB cameras owing to their high dynamic range (e.g., 140 dB compared with about 60 dB for RGB cameras) [11]-[15].

Both the model architecture and the fusion method that exploit the complementary characteristics of RGB and event data affect mean Average Precision (mAP), a metric of object detection accuracy. As for the model architecture, feature-pyramid network fusion (FPN fusion) [11] improves mAP by fusing RGB and event data at each layer of the feature extraction module. As for the fusion method, bi-directional calibration (BDC) [14] improves mAP by extracting important information with an attention mechanism and convolutional calculations. However, BDC-adopted FPN fusion, called Attentional FPN fusion (A-FPN) in this paper, results in a significant increase in the number of parameters of the NN model.

Recently, a 256Gb chalcogenide-based cross-point memory using Phase-Change Memory (PCM) has been proposed [16], and the capacity of standalone emerging non-volatile memory (eNVM) is increasing rapidly. On the other hand, the capacity of embedded eNVM is smaller (1-100MB) than that of standalone eNVM (1-100GB) because of the large overhead of peripheral circuits such as the DAC/ADC [17]. The capacities of recently reported CiMs are around 1-10Mb [4], [18]-[20]. Although embedded NVM CiM capacity will increase as NVM integration technology matures, implementing RGB-event fusion multi-modal AI, which requires a large memory capacity, on NVM CiM remains a big challenge.

In this paper, REM-CiM, RGB-Event fusion Multi-modal analog CiM, is proposed to overcome the trade-off between mAP and the number of parameters and to realize multi-modal edge AI (Fig. 1). In REM-CiM, the proposed multi-modal AI algorithms and circuit implementation are co-designed for area/energy-efficient object detection during day and night. The key contributions are:

Fig. 1  Overview of proposed REM-CiM. To fuse RGB and event data at the edge under area/energy limitations, the model architecture (MEA-FPN) and the RGB-event fusion module (C-BDC) are proposed. Moreover, appropriate clipping ranges of weights and activations are investigated for low-bit quantization.

  • An RGB-event fusion model architecture, Memory capacity-efficient Attentional Feature Pyramid Network fusion (MEA-FPN), is proposed to realize the multi-modal AI on edge CiM. Convolution-less bi-directional calibration (C-BDC) in MEA-FPN achieves a 97% reduction in the number of weight parameters compared with BDC by removing large convolution operations, which leads to memory capacity reduction.

  • The optimal point of the trade-off between mAP and the number of parameters when considering edge analog CiM implementation is explored by comparing mAPs during day/night and the number of weight parameters among models. The proposed MEA-FPN achieves a 76% reduction of parameters compared with A-FPN while keeping mAP degradation to \(<2.3\)%.

  • To pursue area/energy/memory-capacity efficiency of analog CiM and realize multi-modal AI on CiM, low-bit quantization with clipping (LQC) is proposed. REM-CiM with MEA-FPN and LQC achieves a memory capacity of \(<150\) Mb, which indicates the possibility of multi-modal edge CiM.

2.  Background and Motivation

2.1  RGB-Event Fusion Method

To effectively utilize the complementary characteristics of RGB and event data, how to fuse them is crucial. Late fusion has been proposed in [15]. In [11], fusing modalities at multiple stages by utilizing the feature pyramid network (FPN) architecture has been proposed. In these methods, fusion is performed with simple concatenation.

In [12] and [14], an attention module is utilized at the fusion stage to focus on important features and suppress unnecessary ones. In [14], bi-directional calibration (BDC) has been proposed. In BDC, features from one modality are applied to the other. Moreover, important information is extracted in both spatial and channel dimensions with Channel Attention and Spatial Attention [21]. Although BDC improves mAP, its parameter overhead is so large that it is difficult to implement a BDC-adopted NN model on CiM with energy, area, and memory capacity limitations. In this paper, Convolution-less BDC (C-BDC) is proposed to tackle this challenge.

2.2  Limitation and Non-Ideality of NVM CiM

Computation-in-Memory (CiM) [4], [5] is a promising NN accelerator. By utilizing Ohm's and Kirchhoff's laws in the weight-embedded memory array, CiM can perform high-speed, low-power analog MAC calculations. By using analog CiM with non-volatile memory (NVM) such as ReRAM, MRAM, PRAM, and Flash, footprint and power consumption are reduced [22], [23]. However, NVM analog CiM has two major issues.
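As an illustration of this principle, the following minimal sketch computes bit-line currents as analog dot products; the conductance and voltage values are illustrative only and are not taken from any cited chip.

```python
import numpy as np

# Minimal sketch of the analog MAC principle in a crossbar (illustrative only):
# weights are stored as cell conductances G [S], inputs are applied as word-line
# voltages V [V], and each bit-line current is the sum of per-cell currents
# (Ohm's law per cell, Kirchhoff's current law per column).
G = np.array([[1.0e-6, 2.0e-6],      # conductance matrix, shape (inputs, outputs)
              [3.0e-6, 0.5e-6]])
V = np.array([0.2, 0.1])             # input voltages on the word lines

I_column = V @ G                     # bit-line currents = analog dot products
print(I_column)                      # -> [5.0e-07, 4.5e-07] A
```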

The first issue is a trade-off between accuracy and area/energy caused by the bit-resolution of the weight memory cells and of the ADC/DAC. It is reported that the ADC/DAC occupies a significant portion of the total area/energy consumption [24], [25], and that area/energy increases in proportion to the ADC/DAC (i.e., activation) bit-resolution [26]. As for weights, increasing the weight bit-resolution increases the memory area and the power consumption of the write operation. To tackle these issues, low-bit quantization is desired.

As for weights, minimizing the loss during quantization from floating point to int8 and dequantization from int8 back to floating point has been proposed [27]. However, this method does not assume the characteristics of CiM weights (e.g., the symmetrical weight distribution of a differential pair). As for activations, automatic optimization of the clipping range by treating the clipping range as a trainable parameter has been proposed [28]. However, this quantization method does not consider output quantization, which is inevitable when the ADC of CiM is taken into account. As these examples show, weight representations and quantization requirements differ fundamentally between GPUs and CiM, so clipping and quantization methods suitable for CiM must be considered. In this paper, appropriate clipping ranges for the weights and activations (i.e., the values of inputs and outputs) of the proposed MEA-FPN are investigated, and low-bit quantization with clipping (LQC) is conducted to realize multi-modal edge CiM.

The second issue is the non-idealities of NVM, such as write variation [18], conductance shift by data-retention [29], [30], and endurance [31]. Due to approximate analog computation, these errors inevitably affect the accuracy. In this paper, the tolerance against these errors is verified.

3.  Proposed MEA-FPN Architecture w/ C-BDC

To reduce the number of weight parameters and realize multi-modal AI on edge CiM, an RGB-event fusion model architecture, Memory capacity-efficient Attentional FPN fusion (MEA-FPN), is proposed (Fig. 1). Figure 2 shows the diagrams of simple concatenation in FPN fusion [11], conventional BDC [14] and the proposed Convolution-less bi-directional calibration (C-BDC). In the proposed MEA-FPN, C-BDC is adopted as an RGB-event fusion module.

Fig. 2  Diagram of (a) cross Spatial Attention (SA) and cross Channel Attention (CA), (b) simple concatenation in FPN fusion, (c) conventional Bi-directional calibration (BDC) and (d) proposed Convolution-less BDC (C-BDC). In the proposed C-BDC, the heavy convolution calculations in (c) are removed.

Channel Attention (CA) [21] in cross CA [14] generates a channel attention map with average/max-pooling along the spatial dimensions and a multi-layer perceptron. Spatial Attention (SA) in cross SA generates a spatial attention map with average/max-pooling along the channel dimension and a wide (e.g., 7 \(\times\) 7) convolution. By utilizing CA and SA, BDC extracts important features from each modality. Moreover, with the cross-calibration mechanism, BDC transports features from one modality to the other (Figs. 2 (a) and 2 (c)). However, the number of weight parameters in BDC is too large to adopt in CiM (Table 1).

To reduce the memory capacity and enable CiM implementation, C-BDC, a weight parameter-reduced RGB-event fusion module, is proposed (Fig. 2 (d)). In C-BDC, the large convolution calculations in BDC are removed except for the first convolution layer. Note that the first convolution is retained only to adjust the number of channels of the C-BDC output to \(C\), as shown in Fig. 2 (d). On the other hand, the cross attention mechanisms are fully utilized for extracting important information and applying the features of one modality to the other. By removing the large convolutions, the proposed C-BDC achieves a 97% reduction in the number of weight parameters compared with conventional BDC (Table 1). This result shows the memory-capacity efficiency of C-BDC for edge multi-modal CiM.
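To make the structure concrete, the following sketch shows one possible PyTorch-style form of a C-BDC-like block, assuming CBAM-style channel/spatial attention [21]; the module names, the 1×1 channel-adjusting convolution, and the exact composition of the attention maps are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: spatial avg/max pooling + shared MLP."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return torch.sigmoid(avg + mx)[:, :, None, None]          # (B, C, 1, 1)

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel avg/max pooling + wide conv."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))                   # (B, 1, H, W)

class CBDC(nn.Module):
    """Sketch of convolution-less bi-directional calibration (C-BDC):
    cross CA/SA maps computed from one modality recalibrate the other, and
    the only convolution kept is a 1x1 layer adjusting the fused output to C
    channels (an assumption for this sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.ca_rgb, self.ca_evt = ChannelAttention(channels), ChannelAttention(channels)
        self.sa_rgb, self.sa_evt = SpatialAttention(), SpatialAttention()
        self.adjust = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_rgb, f_evt):
        ca_r, sa_r = self.ca_rgb(f_rgb), self.sa_rgb(f_rgb)
        ca_e, sa_e = self.ca_evt(f_evt), self.sa_evt(f_evt)
        # Cross calibration: attention maps of one modality recalibrate the other.
        rgb_cal = f_rgb * ca_e * sa_e
        evt_cal = f_evt * ca_r * sa_r
        return self.adjust(torch.cat([rgb_cal, evt_cal], dim=1))

# Example: fuse one FPN stage with C = 256 channels.
fused = CBDC(256)(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
```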

Table 1  Comparison of parameters among each module

Figures 3 (a)-3 (c) show the simplified model diagrams of FPN fusion, A-FPN, and the proposed MEA-FPN, respectively. In A-FPN, the conventional BDC replaces the simple RGB-event concatenation in FPN fusion. On the other hand, in the proposed MEA-FPN, C-BDC replaces the concatenation. With the middle-fusion architecture, RGB and event features are fused at each stage of the ResNet [32] backbone.

Fig. 3  Diagram of (a) FPN fusion, (b) A-FPN and (c) MEA-FPN. (d) Trade-off between mAP and the number of parameters. MEA-FPN overcomes this trade-off for multi-modal AI on edge CiM.

Fused features are fed into the FPN module (Fig. 1). The output of each stage of the FPN module is then fed into the regression and classification subnetwork, and finally the regression boxes and classification results are output.

There is a trade-off between the number of parameters and mAP (Fig. 3 (d)). FPN fusion requires fewer parameters but results in relatively low mAP. In contrast, A-FPN achieves higher mAP but requires more parameters due to the large convolution layers in BDC. A-FPN is suitable when inference is conducted in the cloud, with sufficient computational resources. On the other hand, the memory capacity of A-FPN is too large to implement on edge CiM. The proposed MEA-FPN achieves parameter reduction while minimizing the mAP degradation compared with A-FPN, and shows the possibility of implementing multi-modal RGB-event fusion AI on edge analog CiM.

4.  Evaluation of Proposed MEA-FPN

4.1  Datasets

The DSEC dataset [33] is a driving-scenario dataset that includes event camera data and contains a large amount of night data. However, its object detection labels are not publicly released. Therefore, in this work, the dataset provided in [11] is utilized, where over 100,000 objects are labeled using YOLO v5 [34]. In this paper, the Car and Pedestrian labels, which are considered to occur frequently in driving scenarios, are used.

The dataset is split into day and night subsets to measure mAP during day and night respectively. The number of night samples in the original test dataset is too small (\(<1500\) labels) to measure mAP at night accurately (Table 2). Therefore, a part of the night data in the original training dataset (about 10,000 labels) is moved to the test dataset. With this dataset, the impact of the high dynamic range of the event camera on mAP improvement in night conditions is evaluated.

Table 2  The number of labels in original dataset [11] and this work

4.2  Appropriate Input Preprocessing

Event data are considered as 4D inputs (x, y, p, t), where (x, y) is the spatial position, p is the polarity, and t is the timestamp. To fuse with RGB frames, an event representation, in which sparse and asynchronous events are converted to dense frames, is necessary. In this work, the voxel-grid method [35] is adopted. The voxel grid retains both temporal and spatial information by dividing each temporal window into several bins.
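A minimal sketch of such a voxel-grid conversion is shown below, assuming events given as arrays (x, y, p, t) and bilinear weighting along the temporal axis; the bin count and frame size in the example are placeholders, not the settings used in the experiments.

```python
import numpy as np

def events_to_voxel_grid(x, y, p, t, num_bins, height, width):
    """Convert a stream of events (x, y, polarity, timestamp) into a dense
    voxel grid of shape (num_bins, height, width), spreading each event's
    polarity over the two nearest temporal bins (bilinear in time)."""
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    t = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    pol = np.where(p > 0, 1.0, -1.0)
    t0 = np.floor(t).astype(int)
    frac = t - t0
    for b, w in ((t0, 1.0 - frac), (np.clip(t0 + 1, 0, num_bins - 1), frac)):
        np.add.at(voxel, (b, y, x), pol * w)
    return voxel

# Example: 5 temporal bins over a 480x640 window (hypothetical sizes).
xs = np.array([10, 11, 12]); ys = np.array([20, 20, 21])
ps = np.array([1, 0, 1]);    ts = np.array([0.00, 0.01, 0.02])
grid = events_to_voxel_grid(xs, ys, ps, ts, num_bins=5, height=480, width=640)
```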

In addition, appropriate preprocessing for RGB and event frames is investigated by comparing the four options below:

  1. Mean & standard deviation (std) of RGB: Whether the mean and std of ImageNet or DSEC is used for the standardization of the RGB frames.

  2. RGB normalization: Whether to divide RGB frames by three times the std calculated by all frames (3 sigma) or by the maximum absolute value of each frame (Max).

  3. Event standardization: Whether to use the mean and std of all event frames (Global) or those of each event frame (Local) for standardization of event frames.

  4. Event clipping: Whether to clip the event frames or not.

Table 3 shows the comparison results of input preprocessing. The top 2 scores are colored red. With the original input processing in [11], the loss becomes too large due to the instability of the output of C-BDC, and the model cannot be trained. For RGB standardization (Type 1 vs. Type 2), using the mean and std of DSEC is better than those of ImageNet. Using the mean and std of a specific domain (i.e., the driving scenario in this work) leads to mAP improvement. The better methods of RGB normalization (Type 2 vs. Type 3) and event standardization (Type 2 vs. Type 4) are 3 sigma and Global, respectively. In both methods, all images are divided by the same value, which means that the intensity ratio among images should be preserved. Event clipping leads to score improvement (Type 2 vs. Type 5). Clipping reduces the instability of the calculation by suppressing outliers. From these results, Type 2 is utilized as the input preprocessing for the following experiments.
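The selected Type 2 preprocessing can be summarized by the following sketch; the statistic values and the clipping threshold are placeholders, not the values actually measured on DSEC.

```python
import numpy as np

# Sketch of the selected Type-2 preprocessing. The constants below are
# placeholders; the real values are computed over the DSEC training frames.
DSEC_MEAN, DSEC_STD = np.array([0.41, 0.42, 0.40]), np.array([0.22, 0.22, 0.23])
RGB_GLOBAL_STD = 0.25          # std over all RGB frames (hypothetical)
EVT_MEAN, EVT_STD = 0.0, 1.3   # mean/std over all event frames (hypothetical)

def preprocess_rgb(rgb):
    """rgb: (H, W, 3) float array in [0, 1]."""
    rgb = (rgb - DSEC_MEAN) / DSEC_STD        # standardize with DSEC statistics
    return rgb / (3.0 * RGB_GLOBAL_STD)       # "3 sigma": divide by a dataset-wide constant

def preprocess_event(voxel, clip=5.0):
    """voxel: (bins, H, W) event voxel grid."""
    voxel = (voxel - EVT_MEAN) / EVT_STD      # "Global" standardization
    return np.clip(voxel, -clip, clip)        # event clipping suppresses outliers
```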

Table 3  AP comparison between input preprocessing

4.3  Evaluation Setups & Metrics

Models are trained to minimize the sum of the focal loss, regression loss, and classification loss. ResNet-50 [32] is selected as the backbone. The initial learning rate is set to 0.0001, and Adam is used as the optimizer. In training, the batch size is set to 16 and the number of epochs is set to 65.

The accuracy of object detection is evaluated using Average Precision (AP), with the Intersection over Union (IoU) threshold set to 50%. In this paper, mAP means the average of the AP of cars and pedestrians. mAPs during day and night are calculated respectively in Sect. 4.4 to investigate the effectiveness of the event camera on mAP under each light condition.
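For reference, a minimal sketch of the IoU computation underlying this metric is shown below; the box format and the worked example are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a true positive when IoU >= 0.5 with an unmatched
# ground-truth box of the same class; AP is the area under the resulting
# precision-recall curve, and mAP averages AP over Car and Pedestrian.
assert abs(iou((0, 0, 10, 10), (5, 0, 15, 10)) - 1 / 3) < 1e-9
```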

4.4  Comparison with Conventional Models

To investigate the effectiveness of the proposed C-BDC on object detection accuracy under each light condition (day or night) and for each label, AP is compared among the models (Fig. 5). Moreover, to verify the computational-resource and memory-capacity/area efficiency of the proposed MEA-FPN, the MACs and the number of weight parameters are also compared among the models (Table 4).

Table 4  Comparison of the number of weight parameters, MACs, and mAP during day and night

To better understand the effectiveness of FPN fusion architecture and multi-modality with RGB-event fusion on mAP improvement, early fusion (Fig. 4 (a)) and RGB-only model (Fig. 4 (b)) are compared with MEA-FPN. To compare the effectiveness of feature-extraction improvement and fusion-method improvement on the increase in mAP and weight parameters, FPN fusion with ResNet-101, which is deeper than ResNet-50, is also compared with MEA-FPN.

Fig. 4  The model architectures of (a) early fusion and (b) the RGB-only model.

The proposed MEA-FPN achieves a 76% parameter reduction compared with A-FPN, while keeping the mAP reduction to \(<2.3\)% (Table 4). Owing to the much smaller number of weight parameters of C-BDC compared with BDC, MEA-FPN is suitable for edge multi-modal CiM. MEA-FPN also achieves more than 3% mAP improvement during both day and night with only 0.3% parameter overhead compared with FPN fusion (Table 4). The function of C-BDC, i.e., the ability to extract important features and transport the features of one modality to the other, plays an important role in the mAP improvement during both day and night. Moreover, the proposed C-BDC is suitable for area-limited edge CiM due to its much smaller number of weight parameters compared with conventional BDC.
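For reference, parameter counts such as those in Table 4 can be obtained directly from the model objects; a minimal sketch follows, in which the model constructor names are placeholders and not defined here.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of weight parameters, as compared in Table 4."""
    return sum(p.numel() for p in model.parameters())

# Hypothetical usage; `build_mea_fpn` and `build_a_fpn` stand for the model
# constructors of MEA-FPN and A-FPN and are not provided in this paper.
# reduction = 1 - count_parameters(build_mea_fpn()) / count_parameters(build_a_fpn())
```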

The proposed MEA-FPN achieves both better mAP and fewer parameters than FPN fusion with the ResNet-101 backbone. Extracting important information from RGB and event features by the proposed C-BDC can achieve better mAP with fewer weight parameters and calculations compared with simply making the feature extraction module deeper. Therefore, C-BDC is necessary for accurate object detection at edge multi-modal CiM, where circuit area and memory capacity are limited.

Early fusion shows a larger mAP improvement during the night (\(+3.6\)%) than during the day (\(+0.5\)%) (Table 4). The high dynamic range of the event camera improves mAP especially at night. On the other hand, the RGB-only model achieves higher AP than early fusion in detecting cars during the day (Fig. 5 (a)). Even when event data are not effective for detection (e.g., when objects to be detected are hidden by other objects), event features are not suppressed in early fusion, which leads to mAP degradation. Therefore, it is important not only to add sensors together, but also to extract important features and suppress unimportant ones with the middle-fusion architecture and C-BDC in MEA-FPN for better multi-modal RGB-event fusion during both day and night.

Fig. 5  Comparison of the AP of (a) car and (b) pedestrian during day and night among models.

4.5  Effectiveness of Cross SA and Cross CA

To better understand the impact of the cross-attention modules (Fig. 2 (a)) on multi-modal object detection accuracy, APs are compared among MEA-FPN, MEA-FPN without cross CA, MEA-FPN without cross SA, and FPN fusion (Fig. 6). Both MEA-FPN without cross SA and MEA-FPN without cross CA achieve higher AP than FPN fusion, but lower AP than MEA-FPN during both day and night. Extracting important features in both spatial and channel dimensions with attentional fusion is therefore necessary to achieve high AP during both day and night.

Fig. 6  AP comparison to verify the effectiveness of cross CA and cross SA. (a) AP of car. (b) AP of pedestrian.

5.  Methodology and Evaluation of Proposed LQC

To implement multi-modal AI on edge analog CiM (Fig. 7), area, energy and memory capacity are required to be small. However, the trade-off between accuracy and area/energy/memory capacity caused by the bit-resolution of weights and activations is a major limitation in analog CiM. To achieve area/energy reduction while maintaining mAP, low-bit quantization with clipping (LQC) is proposed. The novelty of the proposed LQC is as follows. First, the appropriate weight clipping is investigated considering the zero-centered symmetrical characteristics of the differential pairs in CiM. Second, to pursue low-bit quantization while maintaining mAP, the weight bit-precision sensitivity of each module in MEA-FPN (Fig. 1) is investigated. Third, in contrast to [28], quantization is applied to both inputs and outputs to take the DAC/ADC into account.

Fig. 7  (a) Bit-parallel weight cell and (b) bit-serial weight cell in typical CiM array. (c) Mapping of convolution layer to CiM. (d) Memory cell assumed in this paper. (e) Cumulative probability of ReRAM cell current (©2021 IEEE [31]).

To determine the clipping range in LQC, the appropriate clipping ranges for weights and activations are investigated respectively (Fig. 1). Note that quantization and clipping of "activation" in this paper mean those of both inputs and outputs, as described above. Then, the bit-precision sensitivity to weight and activation quantization under the appropriate clipping is compared among models. In addition, to pursue low-bit quantization while maintaining mAP, the weight bit-precision sensitivity of each module in MEA-FPN (Fig. 1) is investigated. From these experiments, the appropriate LQC configuration for the proposed REM-CiM is determined. Moreover, the error tolerance of the proposed and conventional models is investigated by injecting the write variation and data retention errors of analog CiM into the weights (Fig. 8). Finally, the performance of REM-CiM, with MEA-FPN and LQC, is compared with other CiMs without the LQC method.

Fig. 8  (a) Histogram of weight values with clipping and quantization. (b) Gaussian error and (c) shift error applied to clipped and quantized weight.

5.1  Configuration of Proposed REM-CiM

Figure 7 shows the typical weight bit-representation with memory cells, the mapping method of convolution weights, and the representation of weight values in our proposed REM-CiM.

Figure 7 (a) and Fig. 7 (b) show the bit-parallel and bit-serial weight representations, respectively [36], [37]. In these weight representation methods, weight values are represented by multiple 1-bit memory cells to avoid errors due to the limited signal margins and the device variations of multi-level cells (MLC). On the other hand, in the proposed CiM, it is assumed that each memory cell represents a value with its analog conductance, with reference to [18], [38], so multiple binary cells per conductance are not needed. Therefore, the proposed CiM stores weights in analog conductance without the bit-serial or bit-parallel methods shown in Fig. 7 (a) and (b).

\(C_{in}\), \(C_{out}\), and \(K\) represent the number of input channels, the number of output channels, and the kernel size, respectively. As shown in Fig. 7 (c), the \(C_{in}\) weights at one kernel position are mapped to one column, and the convolution weights at the same position of each kernel are mapped to one array, referring to [39].
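The following sketch illustrates this mapping, assuming convolution weights of shape (\(C_{out}\), \(C_{in}\), \(K\), \(K\)); the concrete sizes are placeholders.

```python
import numpy as np

# Sketch of the weight mapping described above: conv weights of shape
# (C_out, C_in, K, K) are split into K*K sub-arrays, one per kernel position,
# each holding a (C_in, C_out) matrix (the C_in weights of one kernel occupy
# one column). Sizes are illustrative.
C_out, C_in, K = 64, 32, 3
W = np.random.randn(C_out, C_in, K, K)

sub_arrays = {
    (i, j): W[:, :, i, j].T            # rows: C_in inputs, columns: C_out kernels
    for i in range(K) for j in range(K)
}
assert sub_arrays[(0, 0)].shape == (C_in, C_out)
```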

Figure 7 (d) shows each weight cell in the CiM array. Each weight of the neural network is represented by a differential pair, and the weight value is represented as the difference of analog conductances: \(G_{ij}^{+}-G_{ij}^{-}\). Therefore, two memory cells are required to represent one weight value. Note that each memory cell retains an analog conductance, and the positive and negative components of a weight are each represented by a single cell. Figure 7 (e) shows the cumulative probability of ReRAM cell current [31].

The weight and activation quantization steps are determined by the peak-to-peak method after clipping (Fig. 8 (a)) [38]. As each weight is represented by a differential pair, the weights are quantized symmetrically around zero.
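A minimal sketch of this clipping and symmetric quantization is shown below; the exact number of quantization levels is an assumption made for illustration.

```python
import numpy as np

def clip_and_quantize(w, n_bits, n_sigma=3.0):
    """Clip weights to +/- n_sigma * std and quantize them symmetrically around
    zero. The step is set by the peak-to-peak range of the clipped values,
    matching the differential-pair representation (G+ - G-). The level count
    below (2**n_bits - 1, zero included) is a sketch-level assumption."""
    bound = n_sigma * w.std()
    w = np.clip(w, -bound, bound)
    n_levels = 2 ** n_bits - 1
    step = (2.0 * bound) / (n_levels - 1)         # peak-to-peak / (levels - 1)
    return np.round(w / step) * step

w = np.random.randn(1024) * 0.1
w_q = clip_and_quantize(w, n_bits=4)              # 4-bit weights with 3-sigma clipping
print(np.unique(w_q).size)                        # at most 2**4 - 1 distinct levels
```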

Write variation is reproduced by adding Gaussian errors with standard deviation \(\sigma\) to the weights (Fig. 8 (b)). Conductance shift is reproduced by adding a constant value to the weights (Fig. 8 (c)). The write-verify operation [18] is assumed to be performed when writing the weights. The baseline of mAP is set to 0.460 in each experiment, which is 1.5% lower than the mAP achieved by MEA-FPN with 32-bit precision. The write-variation experiment is conducted with the assumption that the write variation errors injected into the weights include the impact of non-linearity. Regarding the tolerance against ADC errors, the report on ADC noise in [38] is referenced. In [38], \(\sigma_{\mathrm{ADC}}\) is introduced as the parameter representing ADC noise, and the differential non-linearity (DNL) of each output code follows the normal distribution \(N\)(1.0 LSB, \(\sigma_{\mathrm{ADC}}\) LSB). Considering 4-bit activation in CiM, ADC non-linearity of 0.4 \(\sigma_{\mathrm{ADC}}\) is tolerated. This indicates that DNL with an average of 0.5 LSB and integral non-linearity (INL) with an average of 1.0 LSB are tolerated, respectively. With this consideration, it is assumed that the influence of ADC non-linearity is smaller than that of weight variation errors.
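The error-injection procedure can be sketched as follows, with weights assumed to be already normalized to [-1, 1]; error magnitudes follow the n.s. unit defined in Sect. 5.4, and the model below is a simplification, not the exact simulator used in this work.

```python
import numpy as np

def inject_nvm_errors(w, sigma_ns=0.02, shift_ns=0.0):
    """Add write variation (Gaussian, std sigma_ns) and conductance shift
    (constant shift_ns) to weights normalized to [-1, 1]. Error sizes are in
    "normalized steps" (n.s.); this is a sketch of the error model only."""
    w = w + np.random.normal(0.0, sigma_ns, w.shape)   # write variation per cell
    return w + shift_ns                                # data-retention drift

w = np.random.uniform(-1.0, 1.0, size=1024)            # weights already normalized
w_written = inject_nvm_errors(w, sigma_ns=0.02)                    # after write-verify
w_retained = inject_nvm_errors(w, sigma_ns=0.02, shift_ns=0.002)   # plus retention shift
```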

5.2  Appropriate Clipping Range of Weight & Activation

To achieve weight and activation low-bit quantization while maintaining mAP, the appropriate clipping ranges for the proposed MEA-FPN are investigated (Fig. 9). By setting the weight clipping range to 3\(\sigma\), the weight bit-precision sensitivity improves from 5-bit to 4-bit (Fig. 9 (a)). On the other hand, the activation clipping range needs to be relatively wide (Fig. 9 (b)). With 12\(\sigma\) clipping, the activation bit-precision sensitivity improves from 8-bit to 6-bit. In MEA-FPN, the outputs of C-BDC become large (Fig. 10 (a)) due to the instability of the cross-calibration mechanisms in cross SA and cross CA. In particular, the outliers become much larger than the other values, which seriously degrades bit-precision sensitivity. These observations indicate the importance of appropriate activation clipping in MEA-FPN. From these results, 3\(\sigma\) and 12\(\sigma\) are determined as the appropriate clipping ranges for weights and activations respectively.

Fig. 9  Bit-precision sensitivity of MEA-FPN with various clipping range of (a) weight and (b) activation.

Fig. 10  Input histogram of FPN module in (a) proposed MEA-FPN fusion and (b) conventional FPN fusion.

5.3  Appropriate Bit-Precision of Weight & Activation for Low-Bit Quantization

To compare the impact of RGB-event fusion and the proposed C-BDC on AP degradation under low-bit quantization, the bit-precision sensitivity of weights (Fig. 11 (a)) and activations (Fig. 11 (b)) is compared among three models: MEA-FPN, FPN fusion and the RGB-only model. Based on the results in Sect. 5.2, 3\(\sigma\) clipping is applied to weights, and 12\(\sigma\) clipping is applied to activations. With the appropriate weight and activation clipping, the proposed MEA-FPN maintains high mAP with 4-bit weight and 6-bit activation quantization.

Fig. 11  Comparison of bit-precision sensitivity against (a) weight quantization and (b) activation quantization.

To reduce the number of CiM memory cells while maintaining mAP, the weight bit-precision sensitivity of each module is also investigated (Fig. 12). The RGB module is slightly less tolerant to low-bit quantization than the others. To avoid wasting the rich RGB information, high bit-resolution is required for RGB-feature extraction. To maintain mAP when the weights of all modules are quantized to low bit-precision, the bit-precision of W\(_\mathrm{RGB}\) is set to 5-bit in the proposed LQC.

Fig. 12  Weight bit-precision sensitivity of each module in proposed MEA-FPN.

5.4  Comparison of Error-Tolerance among Models

To compare the impact of RGB-event fusion and the proposed C-BDC on the tolerance against the write variation and data retention errors of analog CiM, the weight error tolerance is compared among the three models (Fig. 13). In this experiment, weights are quantized to 8-bit with 3\(\sigma\) clipping to ensure that quantization does not affect the mAP degradation and to precisely investigate the mAP degradation driven by NVM errors. The unit of error size "n.s." stands for normalized step, meaning the size relative to weights normalized between \(-1\) and 1. In Fig. 13 (a), only the write variation error is injected into the models. In Fig. 13 (b), only the conductance shift error is injected into the models. The results show that MEA-FPN tolerates up to 0.03 n.s. of Gaussian error and 0.002 n.s. of shift error, respectively. In other words, to maintain high mAP, Gaussian errors should be less than 0.03 n.s. and shift errors should be less than 0.002 n.s.

Fig. 13  Comparison of error tolerance when (a) Gaussian or (b) shift errors are injected into each model.

As reported in [18], the write variation of ReRAM is 0.59 \(\mu\)A and the conductance range is 30 \(\mu\)A when the write-verify operation is performed. From this result, the normalized write variation of each ReRAM cell is estimated to be about 0.02 n.s. Reference [38] shows that the variation of a differential pair when the write-verify operation is performed is around 0.03 n.s. With this consideration, it can be said that the proposed MEA-FPN tolerates write variation if the write-verify operation is performed.

Under these errors, MEA-FPN maintains higher mAP than the other models.

5.5  LQC Impact on mAP and CiM Performance

From the results in this section, the appropriate configuration of LQC for MEA-FPN is determined. 3\(\sigma\) and 12\(\sigma\) clipping are adopted for weights and activations respectively. As for weights, W\(_\mathrm{RGB}\) is quantized to 5-bit and the others are quantized to 4-bit. As for activations, 6-bit quantization is adopted uniformly.

Table 5 shows the comparison between the CiM of each model without LQC and the proposed REM-CiM with MEA-FPN and LQC. The mAPs with write variation are compared considering mapping to analog CiM. The mAPs with write variation and data retention error are also compared, considering the case where time has passed since mapping. Let \(ocs\), \(hwif\), \(iob\) and \(ks\) represent the output channel size, the product of the height and width of the input feature, the activation bit-precision, and the kernel size, respectively, in each matrix operation. It is assumed that every 8 columns share one ADC and that weights at different spatial locations of each kernel are mapped to different sub-matrices, referring to [39]. ADC area/energy is assumed to be proportional to the activation bit-precision, referring to [26]. From these assumptions, the relative ADC area/energy is calculated with the following equations:

\[\begin{align} & Area \propto \sum ocs\cdot iob\cdot {ks}^{2} \tag{1} \\ & Energy \propto \sum ocs\cdot hwif\cdot iob\cdot {ks}^{2} \tag{2} \end{align}\]
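A minimal sketch of evaluating Eqs. (1) and (2) over a set of layers is shown below; the layer tuples are hypothetical and not taken from the evaluated networks.

```python
# Sketch of Eqs. (1)-(2): relative ADC area/energy summed over the layers of a
# model. Layer tuples are (ocs, hwif, iob, ks); the example values are
# illustrative placeholders.
layers = [
    (64, 128 * 128, 6, 3),    # (output channels, H*W of input feature, activation bits, kernel)
    (128, 64 * 64, 6, 3),
    (256, 32 * 32, 6, 1),
]

rel_area = sum(ocs * iob * ks ** 2 for ocs, _, iob, ks in layers)
rel_energy = sum(ocs * hwif * iob * ks ** 2 for ocs, hwif, iob, ks in layers)
print(rel_area, rel_energy)
```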

Table 5  Comparison between CiMs of each model

The proposed REM-CiM achieves a 25% reduction of ADC area/energy compared with MEA-FPN CiM without LQC. When the write variation error is added, considering mapping on analog CiM, REM-CiM keeps the mAP reduction to \(<2.8\)% compared with MEA-FPN CiM without LQC. A-FPN CiM achieves the best mAP for all error patterns; however, it requires the most memory cells (\(>500\)Mb) and the largest ADC area/energy. Therefore, it is difficult to implement A-FPN on edge CiM, and A-FPN CiM is not suitable for edge usage. On the other hand, REM-CiM achieves a memory capacity of around 130Mb, which is close to the current memory capacity limitation at the edge (\(<100\)Mb). Considering the rapid evolution of NVM integration technology [16], the capacity of embedded eNVM will become larger in the same way as the capacity of standalone NVM. Therefore, it can be said that the implementation of the proposed REM-CiM is feasible even though the capacity of REM-CiM is a little larger than the current target of 100Mb.

REM-CiM also achieves almost the same number of memory cells, 19% less ADC area, 24% less ADC energy and 0.17% higher mAP than FPN fusion CiM without LQC, as the arrows in Table 5 indicate. By co-designing an area/energy-efficient algorithm and the implementation method of analog CiM, both higher mAP and smaller area/energy computational resources than the conventional method are achieved, and the implementation of accurate multi-modal AI on edge CiM is realized. The higher mAP of REM-CiM is also maintained when the data retention error is added. This result shows that REM-CiM maintains higher mAP even after time has elapsed.

6.  Conclusion

In this paper, REM-CiM, an RGB-Event fusion Multi-modal CiM, is proposed for multi-modal edge object detection during both day and night. In REM-CiM, the multi-modal algorithms and the circuit implementation are co-designed to realize multi-modal AI on edge analog CiM under the memory capacity limitation. First, a memory-capacity-reduced RGB-event fusion model architecture, MEA-FPN, is proposed with C-BDC. C-BDC reduces the number of weight parameters by removing large convolution operations, which leads to memory capacity reduction. MEA-FPN achieves a 76% reduction of parameters compared with A-FPN while keeping the mAP degradation to \(<2.3\)% during both day and night. Second, low-bit quantization with clipping (LQC) is proposed. In LQC, the appropriate clipping ranges of weights and activations for low-bit quantization are explored. By co-designing the algorithms and the analog CiM implementation with MEA-FPN and LQC, multi-modal AI on edge CiM is realized. REM-CiM achieves almost the same number of memory cells, 21% less ADC area, 24% less ADC energy, and 0.17% higher mAP compared with FPN fusion CiM without LQC.

Acknowledgments

This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).

References

[1] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object Detection in 20 Years: A Survey,” Proc. IEEE, vol.111, no.3, pp.257-276, 2023.

[2] Z. Chang, S. Liu, X. Xiong, Z. Cai, and G. Tu, “A Survey of Recent Advances in Edge-Computing-Powered Artificial Intelligence of Things,” IEEE Internet Things J., vol.8, no.18, pp.13849-13875, 2021.

[3] L. Liu, Y. Yao, R. Wang, B. Wu, and W. Shi, “Equinox: A Road-Side Edge Computing Experimental Platform for CAVs,” MetroCAD, pp.41-42, 2020.

[4] M. Chang, A.S. Lele, S.D. Spetalnick, B. Crafton, S. Konno, Z. Wan, A. Bhat, W.-S. Khwa, Y.-D. Chih, M.-F. Chang, and A. Raychowdhury, “A 73.53TOPS/W 14.74TOPS Heterogeneous RRAM In-Memory and SRAM Near-Memory SoC for Hybrid Frame and Event-Based Target Tracking,” ISSCC, pp.426-428, 2023.

[5] N. Verma, H. Jia, H. Valavi, Y. Tang, M. Ozatay, L.-Y. Chen, B. Zhang, and P. Deaville, “In-Memory Computing: Advances and Prospects,” IEEE SSC-M, vol.11, no.3, pp.43-55, 2019.

[6] J. Lin and F. Zhang, “R3LIVE: A Robust, Real-time, RGB-colored, LiDAR-Inertial-Visual tightly-coupled state Estimation and mapping package,” ICRA, pp.10672-10678, 2022.

[7] Z. Wu, S. Gobichettipalayam, B. Tamadazte, G. Allibert, D.P. Paudel, and C. Demonceaux, “Robust RGB-D Fusion for Saliency Detection,” 3DV, pp.403-413, 2022.

[8] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A.J. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza, “Event-Based Vision: A Survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol.44, no.1, pp.154-180, 2022.

[9] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120dB 30mW asynchronous vision sensor that responds to relative intensity change,” ISSCC, pp.2060-2069, 2006.

[10] T. Finateu, A. Niwa, D. Matolin, K. Tsuchimoto, A. Mascheroni, E. Reynaud, P. Mostafalu, F. Brady, L. Chotard, F. LeGoff, H.Takahashi, H. Wakabayashi, Y. Oike, and C. Posch, “5.10 A 1280×720 Back-Illuminated Stacked Temporal Contrast Event-Based Vision Sensor with 4.86μm Pixels, 1.066GEPS Readout, Programmable Event-Rate Controller and Compressive Data-Formatting Pipeline,” ISSCC, pp.112-114, 2020.

[11] A. Tomy, A. Paigwar, K.S. Mann, A. Renzaglia, and C. Laugier, “Fusing Event-based and RGB camera for Robust Object Detection in Adverse Conditions,” ICRA, pp.933-939, 2022.

[12] L. Sun, C. Sakaridis, J. Liang, Q. Jiang, K. Yang, P. Sun, Y. Ye, K. Wang, and L.V. Gool, “Event-based fusion for motion deblurring with cross-modal attention,” ECCV, vol.13678, pp.412-428, 2022.

[13] P. Shi, J. Peng, J. Qiu, X. Ju, F.P.W. Lo, and B. Lo, “EVEN: An Event-Based Framework for Monocular Depth Estimation at Adverse Night Conditions,” arXiv preprint arXiv:2302.03680, 2023.

[14] Z. Zhou, Z. Wu, R. Boutteau, F. Yang, C. Demonceaux, and D. Ginhac, “RGB-Event Fusion for Moving Object Detection in Autonomous Driving,” arXiv preprint arXiv:2209.08323v2.

[15] S. Tulyakov, A. Bochicchio, D. Gehrig, S. Georgoulis, Y. Li, and D. Scaramuzza, “Time lens++: Event-based frame interpolation with parametric non-linear flow and multi-scale fusion,” CVPR, pp.17734-17743, 2022.

[16] F. Pellizzer, A. Pirovano, R. Bez, and R.L. Mayer, “Status and Perspectives of Chalcogenide-based Cross-Point Memories (Invited),” IEEE International Electron Devices Meeting (IEDM), pp.1-4, 2023.

[17] N. Lepri, A. Glukhov, L. Cattaneo, M. Farronato, P. Mannocci, and D. Ielmini, “In-memory computing for machine learning and deep learning,” IEEE Journal of the Electron Devices Society, vol.11, pp.587-601, 2023.

[18] R. Mochida, K. Kouno, Y. Hayata, M. Nakayama, T. Ono, H. Suwa, R. Yasuhara, K. Katayama, T. Mikawa, and Y. Gohou, “A 4M Synapses integrated Analog ReRAM based 66.5 TOPS/W Neural-Network Processor with Cell Current Controlled Writing and Flexible Network Architecture,” VLSI Tech., pp.175-176, 2018.

[19] J. Han, H. Liu, M. Wang, Z. Li, and Y. Zhang, “ERA-LSTM: An Efficient ReRAM-Based Architecture for Long Short-Term Memory,” IEEE Trans. Parallel Distrib. Syst., vol.31, no.6, pp.1328-1342, 2020.

[20] C.-X. Xue, W.-H. Chen, J.-S. Liu, J.-F. Li, W.-Y. Lin, W.-E. Lin, J.-H. Wang, W.-C. Wei, T.-W. Chang, T.-C. Chang, T.-Y. Huang, H.-Y. Kao, S.-Y. Wei, Y.-C. Chiu, C.-Y. Lee, C.-C. Lo, Y.-C. King, C.-J. Lin, R.-S. Liu, C.-C. Hsieh, K.-T. Tang, and M.-F. Chang, “24.1 A 1Mb Multibit ReRAM Computing-In-Memory Macro with 14.6ns Parallel MAC Computing Time for CNN Based AI Edge Processors,” 2019 IEEE ISSCC, pp.388-390, 2019.

[21] S. Woo, J. Park, J.-Y. Lee, and I.S. Kweon, “Cbam: Convolutional block attention module,” ECCV, vol.11211, pp.3-19, 2018.

[22] L. Song et al., “PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning,” HPCA, pp.541-552, 2017.

[23] Q. Liu, B. Gao, P. Yao, D. Wu, J. Chen, Y. Pang, W. Zhang, Y. Liao, C.-X. Xue, W.-H. Chen, J. Tang, Y. Wang, M.-F. Chang, H. Qian, and H. Wu, “33.2 A Fully Integrated Analog ReRAM Based 78.4TOPS/W Compute-In-Memory Chip with Fully Parallel MAC Computing,” ISSCC, pp.500-502, 2020.

[24] P. Chen, M. Wu, Y. Ma, L. Ye, and R. Huang, “RIMAC: An Array-level ADC/DAC-Free ReRAM-Based In-Memory DNN Processor with Analog Cache and Computation,” ASP-DAC, pp.228-233, 2023.

[25] H. Jiang, W. Li, S. Huang, and S. Yu, “A 40nm Analog-Input ADC-Free Compute-in-Memory RRAM Macro with Pulse-Width Modulation between Sub-arrays,” VLSI Technology and Circuits, pp.266-267, 2022.

[26] S. Yu, X. Sun, X. Peng, and S. Huang, “Compute-in-Memory with Emerging Nonvolatile-Memories: Challenges and Prospects,” CICC, pp.1-4, 2020.

[27] H. Wu, P. Judd, X. Zhang, M. Isaev, and P. Micikevicius, “Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation,” arXiv preprint https://arxiv.org/abs/2004.09602.

[28] J. Choi, S. Venkataramani, V.V. Srinivasan, K. Gopalakrishnan, Z. Wang, and P. Chuang, “Accurate and efficient 2-bit quantized neural networks,” Proc. Machine Learning and Systems, vol.1, pp.348-359, 2019.

[29] S. Fukuyama et al., “Comprehensive Analysis of Data-Retention and Endurance Trade-Off of 40nm TaOx-based ReRAM,” IRPS, pp.1-6, 2019.

[30] Y.-H. Lin, C.-H. Wang, M.-H. Lee, D.-Y. Lee, Y.-Y. Lin, F.-M. Lee, H.-L. Lung, K.-C. Wang, T.-Y. Tseng, and C.-Y. Lu, “Performance Impacts of Analog ReRAM Non-ideality on Neuromorphic Computing,” IEEE Trans. Electron Devices, vol.66, no.3, pp.1289-1295, 2019.

[31] K. Taoka, N. Misawa, S. Koshino, C. Matsui, and K. Takeuchi, “Simulated Annealing Algorithm & ReRAM Device Co-optimization for Computation-in-Memory,” 2021 IEEE International Memory Workshop (IMW), Dresden, Germany, pp.1-4, 2021.

[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” CVPR, pp.770-778, 2016.

[33] M. Gehrig, W. Aarents, D. Gehrig, and D. Scaramuzza, “DSEC: A Stereo Event Camera Dataset for Driving Scenarios,” IEEE Robot. Autom. Lett., vol.6, no.3, pp.4947-4954, 2021.

[34] Ultralytics. Yolov5. [Online]. Available: https://github.com/ultralytics/yolov5.

[35] A.Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis, “Unsupervised Event-Based Learning of Optical Flow, Depth, and Egomotion,” CVPR, pp.989-997, 2019.

[36] Q. Liu, B. Gao, P. Yao, D. Wu, J. Chen, Y. Pang, W. Zhang, Y. Liao, C.-X. Xue, W.-H. Chen, J. Tang, Y. Wang, M.-F. Chang, H. Qian, and H. Wu, “33.2 A Fully Integrated Analog ReRAM Based 78.4TOPS/W Compute-In-Memory Chip with Fully Parallel MAC Computing,” 2020 IEEE International Solid-State Circuits Conference - (ISSCC), San Francisco, CA, USA, pp.500-502, 2020.

[37] A. Parmar, K. Prasad, N. Rao, and J. Mekie, “An Automated Approach to Compare Bit Serial and Bit Parallel In-Memory Computing for DNNs,” 2022 IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, pp.2948-2952, 2022.

[38] A. Yamada, N. Misawa, C. Matsui, and K. Takeuchi, “LIORAT: NN Layer I/O Range Training for Area/Energy-Efficient Low-Bit A/D Conversion System Design in Error-Tolerant Computation-in-Memory,” 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, USA, pp.1-9, 2023.

[39] X. Peng, R. Liu, and S. Yu, “Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on Processing-in-Memory Architectures,” IEEE Trans. Circuits Syst. I, vol.67, no.4, pp.1333-1343, 2020.

Authors

Yuya ICHIKAWA
  The University of Tokyo

received the B.S. degree in Information and Communication Engineering from the University of Tokyo in 2022. He is now a master's student in Takeuchi Laboratory in the Department of Electrical Engineering and Information Systems, the University of Tokyo. His current research interests include RGB-event fusion multi-modal AI and Computation-in-Memory systems.

Ayumu YAMADA
  The University of Tokyo

received the B.S. degree in Electrical Engineering from the University of Tokyo in 2022. He is now a master's student in Takeuchi Laboratory in the Department of Electrical Engineering and Information Systems, the University of Tokyo. His current research interests include Computation-in-Memory (CiM) systems, emerging non-volatile memories, neuromorphic computing, and Bayesian machine learning.

Naoko MISAWA
  The University of Tokyo

received the M.S. degree from Imperial College London in 2012. She is currently a member of the academic staff of Takeuchi Laboratory in the Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo. Her research interests include emerging non-volatile memories, neuromorphic computing, and Vision Transformers.

Chihiro MATSUI
  The University of Tokyo

is currently a Project Associate Professor in the Department of Electronics Engineering and Information Systems, Graduate School of Engineering of The University of Tokyo. Her research interests include system, circuit, and device co-design with emerging non-volatile memories for enterprise applications. She earned her B.S. and M.S. degrees in Physics from Ochanomizu University, Tokyo, Japan, in 2003 and 2005, respectively, and her Ph.D. degree in Information Security Sciences from Chuo University, Tokyo, Japan, in 2018. She was a Project Assistant Professor of Research and Development Initiative at Chuo University from 2018 to 2020 and a Project Assistant Professor in the Department of Electronics Engineering and Information Systems, Graduate School of Engineering of The University of Tokyo from 2020 to 2023.

Ken TAKEUCHI
  The University of Tokyo

is currently a Professor at the Department of Electrical Engineering and Information Systems, Graduate School of Engineering of The University of Tokyo. He is now working on data-centric computing such as computation in memory, approximate computing, datacenter-scale computing, AI chip design and brain-inspired memory. He received the B.S. and M.S. degrees in Applied Physics and the Ph.D. degree in Electrical Engineering from The University of Tokyo in 1991, 1993 and 2006, respectively. In 2003, he also received the M.B.A. degree from Stanford University. Since joining Toshiba in 1993, he led Toshiba’s NAND flash memory circuit design for fourteen years. He was an Associate Professor at the Department of Electrical Engineering and Information Systems, Graduate School of Engineering of The University of Tokyo from 2007 till 2012. He was a Professor at the Department of Electrical, Electronic and Communication Engineering, Faculty of Science and Engineering of Chuo University from 2012 till 2020. In 2020, he rejoined The University of Tokyo. He designed six of the world’s highest-density NAND flash memory products, such as 0.7um 16Mbit, 0.4um 64Mbit, 0.25um 256Mbit, 0.16um 1Gbit, 0.13um 2Gbit and 56nm 8Gbit NAND flash memories. He holds 228 patents worldwide including 124 U.S. patents. Especially, with his invention, “multipage cell architecture”, presented at the Symposium on VLSI Circuits in 1997, he successfully commercialized the world’s first multi-level cell NAND flash memory in 2001. He has authored numerous technical papers, one of which won the Takuo Sugano Award for Outstanding Paper at ISSCC 2007. He is serving as the program chair of the Asian Solid-State Circuits Conference (A-SSCC) in 2023. He served as the symposium chair/co-chair of the Symposium on VLSI Circuits in 2021/2020 and as the program chair/co-chair of the Symposium on VLSI Circuits in 2019/2018. He has also served as a program committee member of the International Solid-State Circuits Conference (ISSCC), Custom Integrated Circuits Conference (CICC), Asian Solid-State Circuits Conference (A-SSCC), International Memory Workshop (IMW), International Conference on Solid State Devices and Materials (SSDM) and Non-Volatile Memory Technology Symposium (NVMTS). He served as a tutorial speaker at ISSCC 2008, forum speaker at ISSCC 2015, SSD forum organizer at ISSCC 2009, 3D-LSI forum organizer at ISSCC 2010, Ultra-low voltage LSI forum organizer at ISSCC 2011 and Robust VLSI System forum organizer at ISSCC 2012.
