Xiantao JIANG Tian SONG Takashi SHIMAMOTO Wen SHI Lisheng WANG
The next-generation High Efficiency Video Coding (HEVC) standard achieves high performance by extending the encoding block to 64×64. Several parallel tools are available to improve encoder and decoder efficiency. However, owing to the dependence between the current prediction block and its surrounding blocks, parallel processing at the CU level and sub-CU level is hard to achieve. In this paper, focusing on spatial motion vector prediction (SMVP) and temporal motion vector prediction (TMVP), a parallel improvement to the spatio-temporal prediction algorithms is presented, which removes the dependency between prediction coding units and neighboring coding units. With this proposal, motion estimation can conveniently be processed in parallel, which suits different parallel platforms such as multi-core platforms and compute unified device architecture (CUDA). Simulation results on the HM12.0 test model with different test sequences demonstrate that the proposed algorithm improves advanced motion vector prediction with only a 0.01% BD-rate increase, which is better than previous work, while the BDPSNR is almost the same as that of the HEVC reference software.
Yande XIANG Jiahui LUO Taotao ZHU Sheng WANG Xiaoyan XIANG Jianyi MENG
Arrhythmia classification based on the electrocardiogram (ECG) is crucial in automatic cardiovascular disease diagnosis. The classification methods used in current practice largely depend on hand-crafted features. However, extracting hand-crafted features may introduce significant computational complexity, especially in the transform domains. In this study, an accurate method for patient-specific ECG beat classification is proposed, which adopts morphological features and timing information. For the morphological features of a heartbeat, an attention-based two-level 1-D CNN is incorporated in the proposed method to extract features of different granularity automatically by focusing on various parts of a heartbeat. For the timing information, the difference between the previous and post RR intervals is computed as a dynamic feature. Both the extracted morphological features and the interval difference are used by a multi-layer perceptron (MLP) for classifying ECG signals. In addition, to reduce the memory storage of ECG data and to denoise to some extent, an adaptive heartbeat normalization technique is adopted, which includes amplitude unification, resolution modification, and signal difference. Based on the MIT-BIH arrhythmia database, the proposed classification method achieved sensitivity Sen=93.4% and positive predictivity Ppr=94.9% in ventricular ectopic beat (VEB) detection, sensitivity Sen=86.3% and positive predictivity Ppr=80.0% in supraventricular ectopic beat (SVEB) detection, and overall accuracy OA=97.8% under 6-bit ECG signal resolution. Compared with the state-of-the-art automatic ECG classification methods, these results show that the proposed method achieves comparable heartbeat classification accuracy even though the ECG signals are represented at lower resolution.
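The timing feature described in this abstract can be illustrated with a minimal sketch. The function name `rr_interval_differences`, the use of R-peak sample indices as input, and the sign convention (post minus previous) are assumptions made for illustration; the abstract does not specify them.

```python
def rr_interval_differences(r_peaks):
    """For each interior beat, return (post RR interval) - (previous RR interval).

    r_peaks: sorted R-peak positions in samples (hypothetical input format).
    The first and last beats are skipped because one of their intervals is missing.
    """
    diffs = []
    for i in range(1, len(r_peaks) - 1):
        pre_rr = r_peaks[i] - r_peaks[i - 1]    # interval before beat i
        post_rr = r_peaks[i + 1] - r_peaks[i]   # interval after beat i
        diffs.append(post_rr - pre_rr)
    return diffs


if __name__ == "__main__":
    # A premature beat (short previous interval, long post interval) stands out.
    print(rr_interval_differences([0, 300, 500, 900]))
```

A premature beat typically follows a shortened interval and precedes a lengthened one, so it yields a large positive difference under this sign convention.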
Ying-Yao TING Chi-Wei HSIAO Huan-Sheng WANG
To keep the constraints or defects of a single sensor from causing malfunctions, this paper proposes a fire detection system based on the Dempster-Shafer theory with multi-sensor technology. The proposed system operates in three stages: measurement, data reception and alarm activation, where an Arduino is tasked with measuring and interpreting the readings from three types of sensors. The sensors under consideration cover smoke, light and temperature detection. All the measured data are wirelessly transmitted to the backend Raspberry Pi for subsequent processing. Within the system, the Raspberry Pi is used to determine the probability of fire events using the Dempster-Shafer theory. We investigate moderate settings of the conflict coefficient and how it plays an essential role in ensuring the plausibility of the system's deduced results. Furthermore, a MySQL database with a web server is deployed on the Raspberry Pi for backlog and data analysis purposes. In addition, the system provides three notification services: web browsing, a smartphone app, and short message service. For validation, we collected statistics from field tests conducted in a controllable and safe environment by emulating fire events during both daytime and nighttime. Each experiment goes through the No-fire, On-fire and Post-fire phases. Experimental results show an accuracy of up to 98% in both the No-fire and On-fire phases during the daytime and an accuracy of 97% during the nighttime under reasonable conditions. When all three phases are taken into account, the accuracies in the daytime and nighttime are 97% and 89%, respectively. Field tests validate the efficiency and accuracy of the proposed system.
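As a rough illustration of how evidence from two sensors can be fused, the following sketch implements the standard Dempster rule of combination over a two-hypothesis frame. The mass values, the sensor names, and the function name `combine` are hypothetical; the abstract does not give the system's actual mass assignments or its conflict-coefficient settings.

```python
from itertools import product

FIRE = frozenset({"fire"})
NO_FIRE = frozenset({"no_fire"})
EITHER = FIRE | NO_FIRE  # ignorance: mass not committed to either hypothesis


def combine(m1, m2):
    """Dempster's rule: return (fused mass function, conflict coefficient K)."""
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            fused[common] = fused.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {k: v / (1.0 - conflict) for k, v in fused.items()}, conflict


if __name__ == "__main__":
    smoke = {FIRE: 0.6, EITHER: 0.4}                       # illustrative masses
    temperature = {FIRE: 0.5, NO_FIRE: 0.3, EITHER: 0.2}   # illustrative masses
    fused, k = combine(smoke, temperature)
    print(fused[FIRE], k)
```

A third sensor can be folded in by calling `combine` again on the fused result, since the rule is associative. A conflict coefficient close to 1 means the sensors largely disagree, and the normalized result should then be treated with caution.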
Sanchuan GUO Zhenyu LIU Guohong LI Takeshi IKENAGA Dongsheng WANG
An H.264 video codec system requires a large-capacity, high-bandwidth Frame Store (FS) for buffering reference frames. State-of-the-art three-dimensional (3D) stacked Phase change Random Access Memory (PRAM) is a promising approach for on-chip caching of the reference signals, as 3D stacking offers high memory bandwidth, while PRAM offers high density and low leakage power. However, the write endurance problem, namely that a PRAM cell can tolerate only a limited number of write operations, is the main barrier to practical application. This paper studies wear reduction techniques for the PRAM-based FS in an H.264 codec system. On the basis of rate-distortion theory, content-oriented selective writing mechanisms are proposed to reduce bit updates in the reference frame buffers. With the proposed control parameter a, our methods make a quantitative trade-off between quality degradation and PRAM lifetime prolongation. Specifically, taking a in the range [0.2, 2], experimental results demonstrate that our methods save 29.9–35.5% of bit-wise write operations on average and reduce power by 52–57%, at the cost of a 12.95–20.57% BDBR bit-rate increase.
Shaojing FU Dongsheng WANG Ming XU Jiangchun REN
Remote data possession checking for cloud storage is very important, since it lets data owners check the integrity of outsourced data without downloading a copy to their local computers. In a previous work, Chen proposed a remote data possession checking protocol using algebraic signatures and showed that it can resist various known attacks. In this paper, we find serious security flaws in Chen's protocol and show that it is vulnerable to a replay attack by a malicious cloud server. Finally, we propose an improved version of the protocol to guarantee secure data storage for data owners.
Zhanye WANG Chuanyi LIU Dongsheng WANG
Over the last few years, Apache MapReduce has become the prevailing framework for large scale data processing. Instead of writing MapReduce programs which are too obscure to express, many developers usually adopt high level query languages, such as Hive or Pig Latin, to finish their complex queries. These languages automatically compile each query into a workflow of MapReduce jobs, so they greatly facilitate the querying and management of large datasets. One option to speed up the execution of workflows is to save the results produced previously and reuse them in the future if needed. In this paper we present SuperRack, which uses shared storage devices to store the results of each workflow and allows a new query to reuse these results in order to avoid redundant computation and hasten execution. We propose several novel techniques to improve the access and storage efficiency of the previous results. We also evaluate SuperRack to verify its feasibility and effectiveness. Experiments show that our solution outperforms Hive significantly under TPC-H benchmark and real life workloads.
Xi ZHANG Chuanyi LIU Zhenyu LIU Dongsheng WANG
As the number of concurrently running applications on the chip multiprocessors (CMPs) is increasing, efficient management of the shared last-level cache (LLC) is crucial to guarantee overall performance. Recent studies have shown that cache partitioning can provide benefits in throughput, fairness and quality of service. Most prior arts apply true Least Recently Used (LRU) as the underlying cache replacement policy and rely on its stack property to work properly. However, in commodity processors, pseudo-LRU policies without stack property are commonly used instead of LRU for their simplicity and low storage overhead. Therefore, this study sets out to understand whether LRU-based cache partitioning techniques can be applied to commodity processors. In this work, we instead propose a cache partitioning mechanism for two popular pseudo-LRU policies: Not Recently Used (NRU) and Binary Tree (BT). Without the help of true LRU's stack property, we propose a profiling logic that applies curve approximation methods to derive the hit curve (hit counts under varied way allocations) for an application. We then propose a hybrid partitioning mechanism, which mitigates the gap between the predicted hit curve and the actual statistics. Simulation results demonstrate that our proposal can improve throughput by 15.3% on average and outperforms the stack-estimate proposal by 12.6% on average. Similar results can be achieved in weighted speedup. For the cache configurations under study, it requires less than 0.5% storage overhead compared to the last-level cache. In addition, we also show that profiling mechanism with only one true LRU ATD achieves comparable performance and can further reduce the hardware cost by nearly two thirds compared with the hybrid mechanism.
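For context, the stack property that true LRU offers (and that NRU and BT lack) can be sketched as follows: a single pass over the accesses to one cache set yields the hit count for every possible way allocation at once. This is a minimal illustration of that baseline property only, not the curve-approximation profiling logic proposed in the paper; the function name `lru_hit_curve` is an assumption.

```python
def lru_hit_curve(accesses, max_ways):
    """Return hit counts for 1..max_ways ways of a true-LRU set in one pass.

    The LRU stack is kept with the most recently used tag at index 0. An access
    found at stack depth d hits in any allocation with more than d ways.
    """
    stack = []
    depth_hist = [0] * max_ways
    for tag in accesses:
        if tag in stack:
            depth = stack.index(tag)
            if depth < max_ways:
                depth_hist[depth] += 1
            stack.remove(tag)
        stack.insert(0, tag)  # move (or insert) the tag at the MRU position

    curve, running = [], 0
    for depth in range(max_ways):
        running += depth_hist[depth]
        curve.append(running)  # curve[w - 1] = hits with w ways
    return curve


if __name__ == "__main__":
    print(lru_hit_curve(["A", "B", "A", "C", "B", "A"], 3))
```

Because pseudo-LRU policies do not maintain a total recency order, the depth lookup above has no direct equivalent for them, which is why a different way of deriving the hit curve is needed.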
Chin-Long WEY Shin-Yo LIN Hsu-Sheng WANG Hung-Lieh CHEN Chun-Ming HUANG
In UWB systems, data symbols are transmitted and received continuously. The Fast Fourier Transform (FFT) processor must be able to process input/output data seamlessly. This paper presents the design and implementation of a continuous data flow parallel memory-based FFT (CF-PMBFFT) processor that does not use an input buffer for pre-loading the input data. The processor uses a memory space of two N-words and multiple processing elements (PEs) to achieve seamless data flow and meet the design requirement. The circuit has been fabricated in the TSMC 0.18 µm 1P6M CMOS process with a supply voltage of 1.8 V. Measurement results of the test chip show that the developed CF-PMBFFT processor occupies a core area of 1.97 mm² with a power consumption of 62.12 mW at a throughput rate of 528 MS/s.
Yaping LIU Zhihong LIU Baosheng WANG Qianming YANG
We present the design of a secure identifier-based inter-domain routing protocol, SIR, for the identifier/locator split network. On the one hand, SIR is a distributed path-vector protocol inheriting the flexibility of BGP. On the other hand, SIR separates ASes into several groups called trust groups, which assure the trust relationships among ASes by enforceable control and provide strict isolation properties to localize attacks and failures. Security analysis shows that SIR can provide control plane security that avoids routing attacks, including some smart attacks by which S-BGP/soBGP can be deceived. Meanwhile, emulation experiments based on the current Internet topology with 47,000 ASes from the CAIDA database are presented, in which we compare the number of influenced ASes under attacks that subvert routing policy between SIR and S-BGP/BGP. The results show that the number of influenced ASes decreases substantially when SIR is deployed.
Jun WANG Desheng WANG Yingzhuang LIU
In this paper, we investigate the problem of maximizing the weighted sum outage rate in multiuser multiple-input single-output (MISO) interference channels, where the transmitters have no knowledge of the exact values of channel coefficients, only the statistical information. Unfortunately, this problem is nonconvex and very difficult to deal with. We propose a new, provably convergent iterative algorithm where in each iteration, the original problem is approximated as second-order cone programming (SOCP) by introducing slack variables and using convex approximation. Simulation results show that the proposed SOCP algorithm converges in a few steps, and yields a better performance gain with a lower computational complexity than existing algorithms.
Yue-Bin LUO Bao-Sheng WANG Xiao-Feng WANG Bo-Feng ZHANG Wei HU
Network servers and applications commonly use static IP addresses and communication ports, making themselves easy targets for network reconnaissance and attacks. Moving target defense (MTD) is an innovative and promising proactive defense technique. In this paper, we develop a novel MTD mechanism, called Random Port and Address Hopping (RPAH). The goal of RPAH is to hide network servers and applications and to resist network reconnaissance and attacks by constantly changing their IP addresses and ports. To enhance unpredictability, RPAH integrates source identity, service identity and a temporal parameter in the hopping to provide three hopping frequencies, i.e., source hopping, service hopping and temporal hopping. RPAH provides high unpredictability and maximum hopping diversity by introducing a port and address demultiplexing mechanism, and provides a convenient attack detection mechanism with which messages from attackers using invalid or inactive addresses/ports are readily detected and denied. Our experiments and evaluation on a campus network and PlanetLab show that RPAH is effective in resisting various network reconnaissance and attack models, such as network scanning and worm propagation, while introducing an acceptable operation overhead.
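The general idea of deriving a hopping endpoint from a source identity, a service identity and a time slot can be sketched with a keyed hash. This is not the RPAH algorithm itself, whose details the abstract does not give; the HMAC-SHA256 construction, the function name `hop_endpoint`, the shared key, and the address and port ranges are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import ipaddress


def hop_endpoint(key, source_id, service_id, time_slot, subnet_base, host_bits=8):
    """Map (source, service, time slot) to a pseudo-random (address, port) pair.

    key:         secret shared by the legitimate endpoints (hypothetical)
    subnet_base: first address of the hopping range, as an integer
    host_bits:   number of low-order address bits that are allowed to hop
    """
    message = f"{source_id}|{service_id}|{time_slot}".encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    host = int.from_bytes(digest[:4], "big") % (1 << host_bits)
    port = 1024 + int.from_bytes(digest[4:8], "big") % (65536 - 1024)
    return subnet_base + host, port


if __name__ == "__main__":
    base = int(ipaddress.ip_address("10.0.0.0"))
    addr, port = hop_endpoint(b"shared-secret", "client-7", "web", 12345, base)
    print(ipaddress.ip_address(addr), port)
```

Both ends holding the key compute the same endpoint for a given time slot, while a scanner without the key cannot predict which address and port are currently active; traffic arriving at any other address or port in the range can then be flagged.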
Wei HAN Baosheng WANG Zhenqian FENG Baokang ZHAO Wanrong YU Zhu TANG
Compared with that in terrestrial networks, location management in satellite networks is mainly restricted by three factors: limited on-board processing (OBP), insufficient link resources and long propagation delay. Under these restrictions, the limited OBP can be mitigated by terrestrial gateway-based location management, the constraint on link resources demands a bandwidth-efficient management scheme, and the long propagation delay potentially lowers the management efficiency. Currently, reducing the management cost has been the main direction of existing work, which is based on a centralized management architecture. Centralized management has many defects, such as non-optimal routing, poor scalability and a single point of failure. To address these problems, this paper explores gateway-based distributed location management schemes for Low Earth Orbit (LEO) satellite networks. Three management schemes based on terrestrial gateways are proposed and analyzed: loose location management, precise location management, and grouping location management. The analyses focus on the cost of location queries and show its significant influence on the total cost, which includes both location management and query. Starting from this analysis, we conjecture and prove the existence of an optimum scheme in grouping location management, which has the lowest total cost for query frequencies within a given range. Simulation results validate the theoretical analysis of the cost and show the latency characteristics of location queries, which provide valuable insight into the design of distributed location management schemes in satellite networks.
In this paper, we report a new approach to the parsing and searching problem for a given phonetic lattice. The approach is based on the Divide and Conquer (DC) strategy. By dividing the phonetic lattice, we first construct a PD-tree to represent the lattice, and then parse through this PD-tree to identify the possible sentence that is supposed to be the speech utterance. Next, we propose a new search scheme called the Downward Request (DR) search model to decrease the computation cost; this search model gives the optimal or N-best solutions. Experiments performed on Chinese speech recognition show good results.
Zhenyu LIU Dongsheng WANG Takeshi IKENAGA
Variable block size motion estimation, adopted by the latest video coding standard H.264/AVC, is an efficient approach to reducing temporal redundancies. The intensive computational complexity of the variable block size technique makes a hardwired accelerator essential for real-time applications. The propagate partial sums of absolute differences (Propagate Partial SAD) and SAD Tree hardwired engines outperform their counterparts, especially considering the impact of supporting the variable block size technique. In this paper, the authors apply architecture-level and circuit-level approaches to improve the maximum operating frequency and reduce the hardware overhead of Propagate Partial SAD and SAD Tree, while the other metrics of the original architectures, namely latency, memory bandwidth and hardware utilization, are maintained. Experiments demonstrate that, by using the proposed approaches at an operating frequency of 110.8 MHz, 14.7% and 18.0% of the gate count can be saved for Propagate Partial SAD and SAD Tree, respectively, compared with the original architectures. With TSMC 0.18 µm 1P6M CMOS technology, the proposed Propagate Partial SAD architecture achieves an operating frequency of 231.6 MHz at a cost of 84.1 k gates. Correspondingly, the maximum operating frequency of the optimized SAD Tree architecture is improved to 204.8 MHz, almost twice that of the original, while its hardware overhead is merely 88.5 k gates.
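The reason variable block sizes can share one set of absolute-difference computations is that the SAD of a large block is the sum of the SADs of its sub-blocks. The following software sketch shows that reuse for one 16×16 macroblock; it illustrates the arithmetic only and does not model the Propagate Partial SAD or SAD Tree hardware datapaths. The function names and the dictionary layout are assumptions.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid holding the SAD of each 4x4 sub-block of a 16x16 block."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def merge_sads(grid):
    """Build larger-partition SADs by adding 4x4 SADs, without revisiting pixels."""
    sad_8x8 = [[grid[2 * r][2 * c] + grid[2 * r][2 * c + 1]
                + grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]
                for c in range(2)] for r in range(2)]
    sad_16x8 = [sad_8x8[r][0] + sad_8x8[r][1] for r in range(2)]  # top, bottom
    sad_8x16 = [sad_8x8[0][c] + sad_8x8[1][c] for c in range(2)]  # left, right
    return {"8x8": sad_8x8, "16x8": sad_16x8, "8x16": sad_8x16,
            "16x16": sad_16x8[0] + sad_16x8[1]}


if __name__ == "__main__":
    cur = [[10] * 16 for _ in range(16)]
    ref = [[7] * 16 for _ in range(16)]
    print(merge_sads(sad_4x4_grid(cur, ref))["16x16"])
```

In a full search, this merge is repeated for every candidate motion vector, which is why the adder structure dominates the cost of a hardware engine.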
Shoichiro KAWASHIMA Keizo MORITA Mitsuharu NAKAZAWA Kazuaki YAMANE Mitsuhiro OGAI Kuninori KAWABATA Kazuaki TAKAI Yasuhiro FUJII Ryoji YASUDA Wensheng WANG Yukinobu HIKOSAKA Ken'ichi INOUE
An 8-Mbit 0.18-µm CMOS 1T1C ferroelectric RAM (FeRAM) in a planar ferroelectric technology was developed. Even though the cell area of 2.48 µm² is almost equal to that of a 4-Mbit stacked-capacitor FeRAM (STACK FeRAM), 2.32 µm² [1], the chip size of the developed 8-Mbit FeRAM, including an extra 2 Mbit of parity for the error correction code (ECC), is just 52.37 mm², which is about 30% smaller than twice the size of the 4-Mbit STACK FeRAM device, 37.68 mm² × 2 [1]. This excellent characteristic can be attributed to the large cell matrix architecture of the sectional cyclic word line (WL) that was used to increase the number of columns, and to the 1T1C bit-line GND level sensing (BGS) [2][3] circuit design intended to sense bit lines (BL) that have bit cells 1K long and a large capacitance. An access time of 52 ns and a cycle time of 77 ns in RT at a VDD of 1.8 V were achieved.
Xiantao JIANG Tian SONG Wen SHI Takafumi KATAYAMA Takashi SHIMAMOTO Lisheng WANG
In this work, a high-efficiency coding unit (CU) size decision algorithm is proposed for High Efficiency Video Coding (HEVC) inter coding. CU splitting or non-splitting is modeled as a binary classification problem based on a probabilistic graphical model (PGM). The method incorporates two sub-methods: a CU size termination decision and a CU size skip decision. It focuses on the trade-off between encoding efficiency and encoding complexity, and it performs well. Particularly in high-resolution applications, simulation results demonstrate that the proposed algorithm can reduce encoding time by 53.62%-57.54%, while the BD-rate increase is only 1.27%-1.65%, compared to the HEVC software model.
Guohong LI Zhenyu LIU Sanchuan GUO Dongsheng WANG
As the number of cores and the working sets of parallel workloads increase, shared L2 caches exhibit fewer misses than private L2 caches by making a better use of the total available cache capacity, but they induce higher overall L1 miss latencies because of the longer average distance between the requestor and the home node, and the potential congestions at certain nodes. We observed that there is a high probability that the target data of an L1 miss resides in the L1 cache of a neighbor node. In such cases, these long-distance accesses to the home nodes can be potentially avoided. In order to leverage the aforementioned property, we propose Bayesian Theory based Adaptive Proximity Data Accessing (APDA). In our proposal, we organize the multi-core into clusters of 2x2 nodes, and introduce the Proximity Data Prober (PDP) to detect whether an L1 miss can be served by one of the cluster L1 caches. Furthermore, we devise the Bayesian Decision Classifier (BDC) to adaptively select the remote L2 cache or the neighboring L1 node as the server according to the minimum miss cost. We evaluate this approach on a 64-node multi-core using SPLASH-2 and PARSEC benchmarks, and we find that the APDA can reduce the execution time by 20% and reduce the energy by 14% compared to a standard multi-core with a shared L2. The experimental results demonstrate that our proposal outperforms the up-to-date mechanisms, such as ASR, DCC and RNUCA.
Xiantao JIANG Tian SONG Wen SHI Takashi SHIMAMOTO Lisheng WANG
The purpose of this work is to reduce redundant coding processes in HEVC while balancing encoding complexity against coding efficiency, especially for high-resolution applications. To this end, a CU depth prediction algorithm is proposed for the motion estimation process of HEVC. First, an efficient CTU depth prediction algorithm is proposed to reduce redundant depths. Then, a CU size termination and skip algorithm is proposed based on the neighboring block depth and motion consistency. Finally, the overall algorithm, which offers excellent complexity reduction for high-resolution applications, is presented. Moreover, the proposed method achieves steady performance, and it can significantly reduce the encoding time under different configurations and quantization parameters. Simulation results demonstrate that, in the RA case, the average time saving is about 56% with only a 0.79% BD-bitrate loss for high-resolution sequences, and this performance is better than that of previous state-of-the-art work.
Peisheng WANG Yuan LUO A.J. Han VINCK
The generalized Hamming weight plays an important role in coding theory. In the study of the wiretap channel of type II, the generalized Hamming weight was extended to a two-code format. Two concepts equivalent to the generalized Hamming weight hierarchy and its two-code format are the inverse dimension/length profile (IDLP) and the inverse relative dimension/length profile (IRDLP), respectively. In this paper, the Singleton upper bound on the IRDLP is improved by using a quotient subcode set and a subset with respect to a generator matrix, respectively. If these new upper bounds on the IRDLP are achieved, then in the corresponding coordinated two-party wire-tap channel of type II, the adversary cannot learn more from the illegitimate party.
Xi ZHANG Chongmin LI Zhenyu LIU Haixia WANG Dongsheng WANG Takeshi IKENAGA
Previous research shows that the LRU replacement policy is not efficient when applications exhibit a distant re-reference interval. Recently, the RRIP policy was proposed to improve performance for this kind of workload. However, the lack of access recency information in RRIP prevents the replacement policy from making accurate predictions. To enhance the robustness of RRIP for recency-friendly workloads, we propose a Dynamic Adaptive Insertion and Re-reference Prediction (DAI-RRP) policy, which evicts data based on both the re-reference prediction value and the access recency information. DAI-RRP adaptively adjusts the insertion position and prediction value for different access patterns, which makes the policy robust across different workloads and different phases. Simulation results show that DAI-RRP outperforms LRU and RRIP. For a single-core processor with a 1 MB 16-way set-associative last-level cache (LLC), DAI-RRP reduces CPI over LRU and Dynamic RRIP by an average of 8.1% and 2.7%, respectively. Evaluations on a quad-core CMP with a 4 MB shared LLC show that DAI-RRP outperforms LRU and Dynamic RRIP (DRRIP) on the weighted speedup metric by an average of 8.1% and 15.7%, respectively. Furthermore, compared to LRU, DAI-RRP consumes similar hardware for a 16-way cache, or even less hardware for a high-associativity cache. In summary, the proposed policy is practical and can be easily integrated into existing hardware approximations of LRU.
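For readers unfamiliar with re-reference interval prediction, the following sketch models one cache set under the baseline static RRIP scheme with hit promotion, which the DAI-RRP policy builds on. It does not implement DAI-RRP, whose insertion and recency rules are not detailed in the abstract; the class name `SRRIPSet` and the 2-bit counter width are assumptions.

```python
class SRRIPSet:
    """One cache set under static RRIP with hit promotion (baseline sketch)."""

    def __init__(self, ways, rrpv_bits=2):
        self.max_rrpv = (1 << rrpv_bits) - 1    # "distant" re-reference
        self.tags = [None] * ways
        self.rrpv = [self.max_rrpv] * ways      # empty ways are evicted first

    def access(self, tag):
        """Return True on a hit; on a miss, fill the line and return False."""
        if tag in self.tags:
            self.rrpv[self.tags.index(tag)] = 0     # predict near re-reference
            return True
        # Age every line until at least one is predicted "distant".
        while self.max_rrpv not in self.rrpv:
            self.rrpv = [value + 1 for value in self.rrpv]
        victim = self.rrpv.index(self.max_rrpv)
        self.tags[victim] = tag
        self.rrpv[victim] = self.max_rrpv - 1       # insert with a "long" interval
        return False


if __name__ == "__main__":
    cache_set = SRRIPSet(ways=2)
    print([cache_set.access(t) for t in ["A", "B", "A", "C", "D", "A"]])
```

In this trace the reused line A survives the one-time accesses to C and D, whereas a 2-way true-LRU set would have evicted A before its final access.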