Kazuaki MURAKAMI Hidetaka MAGOSHI
This paper briefly surveys architectural techniques used in recent and future high-performance, low-power processors to improve performance and power/energy consumption simultaneously. Achieving high performance and low power at the same time poses many challenges for processor design, and therefore creates many opportunities for devising new technologies. The paper also offers some insights into future technology directions.
Hideo OHIRA Toshihisa KAMEMARU Hirokazu SUZUKI Ken-ichi ASANO Masahiko YOSHIMOTO
This paper presents the architectural design of a media processor core optimized for MPEG4/H26x video codecs and targeted at mobile multimedia terminals. The architecture consists of a SIMD (Single Instruction Multiple Data) processor with a peak performance of 6.4 GOPS, a RISC processor, a VLC processor, and an intelligent DMA controller. The SIMD processor completes 2-D DCT processing in 132 clock cycles and block matching (16 by 16 pixels) in 24 clock cycles. The VLC processor completes run-level coding of an 8 by 8 block in 10 clock cycles on average at low bit rates. Transpose registers in the SIMD processor, a data sub-sampling technique in the DMA controller, and a data-sliding technique between PEs (Processor Elements) in the SIMD processor eliminate a large number of cycles otherwise lost to data handling, and extract the highest level of performance. With this architecture and a low-power design approach, a CIF 30 frames/s MPEG4 Simple Profile video codec can be achieved at 100 MHz. Estimated power dissipation is as low as 280 mW. The core contains 300 kgates and 16 kBytes of four-port SRAM in a 12 mm² area using 0.18 µm process technology. The combination of the RISC processor and the SIMD processor can also handle the MPEG4 Core Profile (shape coding), which requires both flexibility and performance.
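As an illustration of the run-level coding step that the VLC processor accelerates, the following is a minimal software sketch. It assumes the conventional zigzag scan order and plain (run, level) pairs. The abstract does not specify the scan or the symbol format, and the function names `zigzag_order` and `run_level_encode` are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, col) coordinates of an n-by-n block in zigzag scan order.

    Assumption: the conventional zigzag scan; the abstract does not name one.
    """
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Sort by anti-diagonal; odd diagonals run top-to-bottom,
        # even diagonals run bottom-to-top.
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]),
    )


def run_level_encode(block):
    """Encode a square block of quantized coefficients as (run, level) pairs.

    'run' is the number of zero coefficients preceding each nonzero
    coefficient ('level') in scan order. Trailing zeros are dropped.
    """
    pairs = []
    run = 0
    for r, c in zigzag_order(len(block)):
        coeff = block[r][c]
        if coeff == 0:
            run += 1
        else:
            pairs.append((run, coeff))
            run = 0
    return pairs


# Example: a sparse block, as is typical at low bit rates.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, -3, 5
pairs = run_level_encode(block)
```

At low bit rates most quantized coefficients are zero, so a block reduces to a handful of pairs, which is consistent with the short average cycle count quoted in the abstract.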
Koji INOUE Tohru ISHIHARA Kazuaki MURAKAMI
This paper proposes a new approach to achieving high performance and low energy consumption in set-associative caches. The cache, called the way-predicting set-associative cache, speculatively selects a single way that is likely to contain the data requested by the processor from the set designated by the memory address, before it starts a normal cache access. Because only the predicted way is accessed, instead of all the ways in the set, energy consumption can be reduced. For the way-predicting cache to perform well, the accuracy of way prediction is important. This paper shows that the accuracy of MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.
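The way-prediction idea can be sketched as a small behavioural model. This is an illustrative toy and not the authors' design. It assumes that a misprediction triggers a second probe of the remaining ways, that replacement is LRU, and that energy is approximated by counting way reads. The class name `WayPredictingCache` and all parameters are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WayPredictingCache:
    """Toy model of a set-associative cache with MRU-based way prediction."""
    num_sets: int = 64
    num_ways: int = 4
    line_bytes: int = 32
    tags: list = field(init=False)
    order: list = field(init=False)   # per set: way indices, MRU first
    stats: dict = field(init=False)

    def __post_init__(self):
        self.tags = [[None] * self.num_ways for _ in range(self.num_sets)]
        self.order = [list(range(self.num_ways)) for _ in range(self.num_sets)]
        self.stats = {"first_hit": 0, "second_hit": 0, "miss": 0, "way_reads": 0}

    def access(self, addr: int) -> str:
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        order = self.order[index]

        # Phase 1: probe only the predicted (most recently used) way.
        predicted = order[0]
        self.stats["way_reads"] += 1
        if self.tags[index][predicted] == tag:
            self.stats["first_hit"] += 1
            return "first_hit"

        # Phase 2 (assumed): on a misprediction, probe the remaining ways.
        self.stats["way_reads"] += self.num_ways - 1
        for way in order[1:]:
            if self.tags[index][way] == tag:
                order.remove(way)
                order.insert(0, way)      # this way becomes the MRU way
                self.stats["second_hit"] += 1
                return "second_hit"

        # Miss: replace the least recently used way (assumed LRU policy).
        victim = order.pop()
        self.tags[index][victim] = tag
        order.insert(0, victim)
        self.stats["miss"] += 1
        return "miss"


cache = WayPredictingCache(num_sets=2, num_ways=2, line_bytes=16)
results = [cache.access(a) for a in (0, 0, 32, 0)]
```

In this model a conventional cache would read `num_ways` ways on every access, so the ratio of `way_reads` to `num_ways` times the access count gives a rough feel for the energy saving, while `second_hit` counts the accesses that pay an extra cycle.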
This paper describes an MOS current-mode voltage-controlled oscillator (VCO) circuit that can potentially operate from a 2 V supply with a 500 MHz oscillation frequency and -90 dBc/Hz phase noise at a 1 MHz offset. It also offers improved linearity of oscillation frequency with respect to the control voltage and dissipates 11 mW. The oscillation frequency reached 920 MHz when the supply voltage was increased to 3 V.
In a multisystem data sharing environment (MDSE), the computing nodes are locally coupled via a high-speed network and share a common database at the disk level. To reduce the amount of expensive and slow disk I/O, each node caches database pages in its main memory buffer. This paper focuses on MDSEs that use record-level locking for concurrency control. While record-level locking can provide higher concurrency than page-level locking, it may result in heavy message traffic. We first propose a cache coherency scheme that reduces the message traffic under standard locking. The scheme is then extended to the context where lock caching and lock de-escalation are adopted. Using a distributed database simulation model, we evaluate the performance of the proposed schemes under a wide variety of database workloads.
Akifumi MAKINOUCHI Tetsuro KAKESHITA Hirofumi AMANO
This paper gives an overview of research activities on high-performance databases in Japan. It focuses on parallel algorithms for relational databases and data mining, parallel approaches for object-oriented databases, and parallel disk systems. The studies surveyed in this paper were carried out mainly by database researchers in Japanese universities under the Grant-in-Aid for Scientific Research (1996-1998).
This paper proposes a new method for dynamically controlling the clock speed of a processor in order to reduce power consumption without decreasing system performance. It automatically tunes the processor's speed by monitoring the processor's activity and avoiding useless work, so as not to exhaust the battery. Experiments with performance bottlenecks caused by disk activity show that the proposed method is very effective compared with the traditional approach, in which the processor's speed is fixed.
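The general idea of activity-driven clock control can be sketched as follows. The abstract does not describe the actual control policy, so everything here is an assumption made for illustration: the function name `next_clock_mhz`, the discrete clock levels, and the two utilization thresholds are all hypothetical.

```python
def next_clock_mhz(current_mhz, busy_fraction,
                   levels=(50, 100, 200), low=0.5, high=0.9):
    """Pick the clock level for the next interval (hypothetical policy).

    busy_fraction is the share of the last monitoring interval in which
    the processor did useful work rather than waiting, e.g. on disk I/O.
    """
    i = levels.index(current_mhz)
    if busy_fraction > high and i < len(levels) - 1:
        return levels[i + 1]      # processor is the bottleneck: speed up
    if busy_fraction < low and i > 0:
        return levels[i - 1]      # mostly waiting: slow down to save power
    return current_mhz            # otherwise keep the current speed


# A disk-bound phase followed by a compute-bound phase.
clock = 200
trace = []
for busy in (0.2, 0.3, 0.95, 0.97):
    clock = next_clock_mhz(clock, busy)
    trace.append(clock)
```

When the workload is bound by disk activity, the processor is mostly idle, so a policy of this kind lowers the clock with little effect on completion time. That is the situation the experiments in the abstract target.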
Barbara M. CHAPMAN Piyush MEHROTRA Hans P. ZIMA
Highly parallel scalable multiprocessing systems (HMPs) are powerful tools for solving large-scale scientific and engineering problems. However, these machines are difficult to program since algorithms must exploit locality in order to achieve high performance. Vienna Fortran was the first fully specified data-parallel language for HMPs that provided features for the specification of data distribution and alignment at a high level of abstraction. In this paper we outline the major elements of Vienna Fortran and compare it to High Performance Fortran (HPF), a de facto standard in this area. A significant weakness of HPF is its lack of support for many advanced applications, which require irregular data distributions and dynamic load balancing. We introduce HPF+, an extension of HPF based on Vienna Fortran, that provides the required functionality.
Sam APPLETON Shannon MORTON Michael LIEBELT
In this paper we describe the implementation of complex architectures using a general design approach for two-phase asynchronous systems. This fundamental approach, called Event Controlled Systems (ECS), can be used to widely extend the utility of two-phase systems. We describe solutions we have developed that dramatically improve the performance of static and dynamic-logic asynchronous pipelines, and briefly describe a complex microprocessor designed using ECS.
Masato OGUCHI Hitoshi AIDA Tadao SAITO
Distributed shared memory (DSM) is an attractive option for realizing functionally distributed computing in a wide-area distributed environment because of its simplicity and flexibility in software programming. Until now, however, distributed shared memory has mainly been studied in local environments. In a widely distributed environment, communication latency greatly affects system performance. Moreover, the bandwidth of networks available over wide areas has been increasing dramatically. A DSM architecture for high-performance networks must therefore differ from one designed for low-speed networks. In this paper, distributed shared memory models for a widely distributed environment are discussed and evaluated. First, two existing distributed shared memory models are examined: shared virtual memory and replicated shared memory. Next, an improved replicated shared memory model, which uses internal machine memory, is proposed. This model assumes the existence of a seamless, multicast wide-area network infrastructure, for example an ATM network. A prototype of this model using multithreaded programming has been implemented on multi-CPU SPARCstations and an ATM-LAN. These DSM models are compared with SCRAMNet™, whose mechanism is based on replicated shared memory. The evaluation results show that replicated shared memory is superior to shared virtual memory when the network is long. While replicated shared memory using external memory is influenced by the ratio of local to global accesses, replicated shared memory using internal machine memory is suitable for a wide variety of cases. The replicated shared memory model is considered particularly suitable for applications that require real-time operation in a widely distributed environment, since latency-hiding techniques such as context switching and data prefetching are not effective under real-time demands.