IEICE global.ieice.org Site

Keyword Search Result

[Keyword] shared memory(32hit)

1-20hit(32hit)

Compiler Software Coherent Control for Embedded High Performance Multicore
Boma A. ADHI Tomoya KASHIMATA Ken TAKAHASHI Keiji KIMURA Hironori KASAHARA

PAPER

Vol: E103-C No:3
Page(s): 85-97
The advancement of multicore technology has made it possible to place hundreds or even thousands of processor cores on a single chip. However, on larger-scale multicores, a hardware-based cache coherency mechanism becomes overwhelmingly complicated, hot, and expensive. Therefore, we propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient. It is built into the OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes coarse-grain tasks, analyzes stale data and line sharing in the program, then solves those problems by simple program restructuring and data synchronization. Using our proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, NAS Parallel Benchmark (NPB), and MediaBench II. The compiled binaries were then run on the Renesas RP2, an 8-core SH-4A processor, and on a custom 8-core Altera Nios II system on an Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is only available for up to 4 cores. The RP2's cache coherence hardware can also be turned off for non-coherent cache mode. The Nios II multicore system does not have any hardware cache coherence mechanism; therefore, running a parallel program is difficult without any compiler support. The proposed method performed as well as or better than the hardware cache coherence scheme while still providing the same correct results as the hardware coherence mechanism. This method allows a massive array of shared-memory CPU cores in an HPC setting or a simple non-coherent multicore embedded CPU to be easily programmed. For example, on the RP2 processor, the proposed software-controlled non-coherent-cache (NCC) method gave a 2.6 times speedup for SPEC 2000 “equake” with 4 cores against sequential execution, while 4-core MESI hardware coherent control achieved only a 2.5 times speedup. Also, the software coherence control gave a 4.4 times speedup for 8 cores with no hardware coherence mechanism available.
A Finite Automaton-Based String Matching Engine on Graphic Processing Unit
JinMyung YOON Kang-Il CHOI HyunJin KIM

LETTER-VLSI Design Technology and CAD

Vol: E100-A No:9
Page(s): 2031-2033
A non-deterministic finite automaton (NFA)-based parallel string matching scheme is proposed. To parallelize the operations of NFAs, a graphic processing unit (GPU) is adopted. Considering the resource occupancy of threads and size of the shared memory, the optimized resource allocation is performed in the proposed string matching scheme. Therefore, the performance is enhanced significantly in all evaluations.
Offline Permutation Algorithms on the Discrete Memory Machine with Performance Evaluation on the GPU
Akihiko KASAGI Koji NAKANO Yasuaki ITO

PAPER

Vol: E96-D No:12
Page(s): 2617-2625
The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of the shared memory access of GPUs. Bank conflicts should be avoided for maximizing the bandwidth of the shared memory access. Offline permutation of an array is a task to copy all elements in array a into array b along a permutation given in advance. The main contribution of this paper is to implement a conflict-free permutation algorithm on the DMM in a GPU. We have also implemented straightforward permutation algorithms on the GPU. The experimental results for 1024 double (64-bit) numbers on NVIDIA GeForce GTX-680 show that the straightforward permutation algorithm takes 247.8 ns for the random permutation and 1684 ns for the worst permutation that involves the maximum bank conflicts. Our conflict-free permutation algorithm runs in 167 ns for any permutation including the random permutation and the worst permutation, although it performs more memory accesses. It follows that our conflict-free permutation is 1.48 times faster for the random permutation and 10.0 times faster for the worst permutation.
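As a rough illustration of why the worst permutation is so much slower, the sketch below counts the write congestion of the straightforward permutation b[p[i]] = a[i]. It assumes, beyond what the abstract states, a memory with 32 banks in which address x lives in bank x mod 32, and threads grouped into warps of 32 consecutive indices. The function names are hypothetical, and this is not the authors' conflict-free algorithm.

```python
from collections import Counter


def write_congestion(perm, width=32):
    """Return the worst per-warp bank congestion when writing b[perm[i]].

    Assumption: address x is stored in bank x % width, and threads
    i = k*width .. k*width + width - 1 form one warp that accesses
    memory at the same time. Congestion 1 means conflict-free.
    """
    worst = 1
    for start in range(0, len(perm), width):
        warp = perm[start:start + width]
        # Count how many distinct destinations of this warp hit each bank.
        per_bank = Counter(dest % width for dest in set(warp))
        worst = max(worst, max(per_bank.values()))
    return worst


def transpose_permutation(width=32):
    """Permutation of width*width elements that transposes a square matrix.

    Element at row r, column j moves to row j, column r, so every write
    issued by warp r lands in bank r.
    """
    return [(i % width) * width + i // width for i in range(width * width)]


identity = list(range(1024))
print(write_congestion(identity))                  # conflict-free writes
print(write_congestion(transpose_permutation()))   # every warp hits one bank
```

Under these assumptions the identity permutation has congestion 1, while the transpose of a 32 x 32 array serializes all 32 writes of each warp, which is the kind of access pattern a conflict-free algorithm is designed to avoid.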
Low-Complexity Multi-Mode Memory-Based FFT Processor for DVB-T2 Applications
Kisun JUNG Hanho LEE

PAPER-Digital Signal Processing

Vol: E94-A No:11
Page(s): 2376-2383
This paper presents a low-complexity multi-mode fast Fourier transform (FFT) processor for Digital Video Broadcasting-Terrestrial 2 (DVB-T2) systems. DVB-T2 operations need 1K/2K/4K/8K/16K/32K-point multiple mode FFT processors. The proposed architecture employs a pipelined shared-memory architecture in which radix-2/2^2/2^3/2^4 FFT algorithms, a multi-path delay commutator (MDC), and a novel data scaling approach are exploited. Based on this architecture, a novel low-cost data scaling unit is proposed to increase area efficiency, and an elaborate memory configuration scheme is designed to use single-port SRAM without degrading the throughput rate. Also, a new twiddle factor scheduling method is proposed to reduce the area. The SQNR performance of the 32K-point FFT mode is about 45.3 dB at 11-bit internal word length for 256QAM modulation. The proposed FFT processor has a lower hardware complexity and memory size compared to conventional FFT processors.
A Tight Bound on Online Buffer Management for Two-Port Shared-Memory Switches
Koji KOBAYASHI Shuichi MIYAZAKI Yasuo OKABE

PAPER-Computation and Computational Models

Vol: E91-D No:8
Page(s): 2105-2114
The online buffer management problem formulates the problem of queueing policies of network switches supporting QoS (Quality of Service) guarantee. For this problem, several models are considered. In this paper, we focus on shared memory switches with preemption. We prove that the competitive ratio of the Longest Queue Drop (LQD) policy is (4M-4)/(3M-2) in the case of N=2, where N is the number of output ports in a switch and M is the size of the buffer. This matches the lower bound given by Hahne, Kesselman and Mansour. Also, in the case of arbitrary N, we improve the competitive ratio of LQD from 2 to $2 - \frac{1}{M} \min_{K=1,2,\ldots,N}\{M/K + K - 1\}$.
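To get a feel for the two bounds quoted above, the following sketch evaluates them with exact rational arithmetic. The function names are hypothetical, and the general bound is computed exactly as it is printed in this abstract (with plain M/K); the paper itself should be consulted for the precise statement.

```python
from fractions import Fraction


def lqd_ratio_two_ports(M):
    """Tight competitive ratio (4M-4)/(3M-2) of LQD for N = 2 ports."""
    if M < 2:
        raise ValueError("this sketch assumes a buffer size M of at least 2")
    return Fraction(4 * M - 4, 3 * M - 2)


def lqd_upper_bound(M, N):
    """Upper bound 2 - (1/M) * min over K = 1..N of (M/K + K - 1)."""
    best = min(Fraction(M, K) + K - 1 for K in range(1, N + 1))
    return 2 - best / M


for M in (2, 4, 16, 1024):
    print(M, lqd_ratio_two_ports(M), lqd_upper_bound(M, 2))
```

For N=2 the tight ratio stays below 4/3 for every M and approaches it as the buffer grows, and the general bound evaluated at N=2 is never smaller than the tight ratio, as one would expect of an upper bound.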
A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips
Mariko SAKAMOTO Akira KATSUNO Go SUGIZAKI Toshio YOSHIDA Aiichiro INOUE Koji INOUE Kazuaki MURAKAMI

PAPER-VLSI Architecture for Communication/Server Systems

Vol: E90-C No:10
Page(s): 1972-1982
Broadcast and synchronization techniques are used for cache coherence control in conventional larger scale snoop-based SMP systems. The penalty for synchronization is directly proportional to system size. Meanwhile, advances in LSI technology now enable placing a memory controller on a CPU die. The latency to access directly linked memory is drastically reduced by an on-die controller. Developing an enterprise server system with these CPUs allows us an opportunity to achieve higher performance. Though the penalty of synchronization is counted whenever a cache miss occurs, it is necessary to improve the coherence method to receive the full benefit of this effect. In this paper, we demonstrate a coherence directory organization that fits into DSM enterprise server systems. Originally, a directory-based method was adopted in high performance computing systems because of its huge scalability in comparison with snoop-based method. Though directory capacity miss and long directory access latency are the major problems of this method, the relaxed scalability requirement of enterprise servers is advantageous to us to solve these problems along with an advanced LSI technology. Our proposed directory solves both problems by implementing a full bit vector level map of the coherence directory on an LSI chip. Our experimental results validate that a system controlled by our proposed directory can surpass a snoop-based system in performance even without applying data localization optimization to an online transaction processing (OLTP) workload.
Minimizing the Directory Size for Large-Scale Shared-Memory Multiprocessors
Jinseok KONG Pen-Chung YEW Gyungho LEE

PAPER-Computer Systems

Vol: E88-D No:11
Page(s): 2533-2543
Directory-based cache coherence schemes are commonly used in large-scale shared-memory multiprocessors, but most of them rely on heuristics to avoid large hardware requirements. We propose using physical address mapping on directories to significantly reduce the directory size needed. This approach allows the size of the directory to grow as $O(cn \log_2 n)$, as in optimal pointer-based directory schemes [11], where n is the number of nodes in the system and c is the number of cache lines in each cache memory. Performance aspects of the proposed scheme are studied in detail using simulation.
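The growth rate above can be made concrete with a small back-of-the-envelope calculation. The sketch below assumes one node pointer per cache line in the system, with the constant factor set to 1, and compares that with a textbook full-map directory that keeps an n-bit vector for every memory block. The full-map baseline, the parameter blocks_per_node, and the function names are assumptions for illustration and do not come from the abstract.

```python
def pointer_directory_bits(n, c):
    """Bits for c*n cache lines, each tagged with a ceil(log2 n)-bit node pointer."""
    pointer_bits = max(1, (n - 1).bit_length())  # ceil(log2 n) for n >= 2
    return c * n * pointer_bits


def full_map_directory_bits(n, blocks_per_node):
    """Bits for a full-map directory: one n-bit presence vector per memory block."""
    return n * blocks_per_node * n


n = 64                    # nodes
c = 4096                  # cache lines per cache
blocks_per_node = 1 << 20  # memory blocks per node (hypothetical)

print(pointer_directory_bits(n, c))
print(full_map_directory_bits(n, blocks_per_node))
```

The point of the comparison is that the pointer-style bound depends on the total cache capacity (c times n) rather than on the total memory size, which is why it scales more gently as nodes are added.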
Reconfigurable Information-Sharing Network System Based on a Cyclic-Frequency AWG and Wavelength-Tunable Lasers
Akira OKADA Hiromasa TANOBE Morito MATSUOKA

PAPER-Fiber-Optic Transmission for Communications

Vol: E88-B No:6
Page(s): 2449-2455
We propose an information-sharing network system, capable of forming and dynamically reconfiguring multiple information-sharing groups on the same network platform by using wavelength routing and distributed shared memory technologies. The network system comprises information-sharing terminal nodes equipped with a shared memory and a wavelength-tunable transmitter, network management terminal and an arrayed-waveguide grating (AWG). The information-sharing terminal nodes are connected to an AWG by a pair of optical fibers, forming a star-shaped topology. Information is shared among the information-sharing terminal nodes through the establishment of a logical information-sharing ring. This is accomplished by adjusting the output of the wavelength-tunable transmitter at each terminal node to an appropriate wavelength according to the wavelength-routing characteristics of the AWG wavelength router. We developed a prototype information-sharing network system, in which, as preliminary experiments, HDTV and SDTV videos were used for real-time information sharing. The dynamic reconfiguration of information-sharing groups and a simple automatic restoration technique have been successfully demonstrated. The system is applicable to distributed computer processing systems and high-capacity information-sharing applications such as high-quality videoconferences.
Binding Time in Distributed Shared Memories for Generic Patterns of Memory References
Jinseok KONG Gyungho LEE

LETTER-Computer Systems

Vol: E87-D No:8
Page(s): 2148-2151
Performance of three binding schemes for memory local to a node is evaluated. Since a large number of cache misses can occur in a large (relative to the cache size) working set, binding at a page fault time alone cannot efficiently utilize locality of reference at the local memory. In a small working set, the address bound to the local memory at a node miss time is not effective due to low cache miss rates. Our simulation shows that binding at a cache miss time achieves up to 3.1 times and 2.4 times performance of the schemes of binding at a page fault time and at a node miss time respectively.
TLB Update-Hint: A Scalable TLB Consistency Algorithm for Cache-Coherent Non-uniform Memory Access Multiprocessors
Byeonghag SEONG Donggook KIM Yangwoo ROH Kyuho PARK Daeyeon PARK

PAPER-Networking and System Architectures

Vol: E87-D No:7
Page(s): 1682-1692
Shared memory multiprocessors in which each processor has its own TLB must manage consistency among TLBs and a page table. As large-scale CC-NUMA (cache-coherent non-uniform memory access) shared memory multiprocessors become popular, it is important for TLB consistency management algorithms to be highly scalable. In this paper, we propose a TLB update-hint algorithm as a scalable TLB consistency management solution for CC-NUMA multiprocessors. By using a lazy TLB invalidation approach, we reduced the number of unnecessary processor interruptions and idle-waiting time, and achieved a high level of scalability. Using a shared memory simulator, we evaluated the TLB update-hint algorithm. For performance comparison, we also simulated the TLB shootdown algorithm, one of the most popular TLB consistency algorithms. The simulations demonstrated that the TLB update-hint algorithm scales well in systems with a large number of processors. On 64-node systems, the TLB update-hint algorithm shows 4787% better performance than the TLB shootdown algorithm.
Delaying Coherence Requests to Enhance the Performance of Strict Consistency Models
Young Chul SOHN NaiHoon JEONG Jin-Soo KIM Seung Ryoul MAENG

PAPER-Computer Systems

Vol: E87-D No:3
Page(s): 751-760
Advances in ILP techniques enable strict consistency models to relax memory order through speculative execution of memory operations. However, ordering constraints still hinder the performance because speculatively executed operations cannot be committed out of program order for the possibility of mis-speculation. In this paper, we propose a new technique which allows memory operations to be non-speculatively committed out of order without violating consistency constraints. Consistency constraints are guaranteed through delaying the coherence requests. The proposed technique also improves the performance of spin lock primitives such as TTS lock or MCS lock. Through delaying early acquire requests, the lock transfer time can be improved when there is high contention for a lock.
Highly Concurrent Group Mutual Exclusion Algorithms Based on Ticket Orders
Masataka TAKAMURA Yoshihide IGARASHI

PAPER

Vol: E87-D No:2
Page(s): 322-329
Group mutual exclusion is an interesting generalization of the mutual exclusion problem. This problem was introduced by Joung, and some algorithms for the problem have been proposed by incorporating mutual exclusion algorithms. Group mutual exclusion occurs naturally in a situation where a resource can be shared by processes of the same group, but not by processes of a different group. It is also called the congenial talking philosophers problem. In this paper we propose two algorithms based on ticket orders for the group mutual exclusion problem on the asynchronous shared memory model. These algorithms are some modifications of the Bakery algorithm. They satisfy lockout freedom and a high degree of concurrency performance. Each of these algorithms uses single-writer shared variables together with two multi-writer shared variables that are never concurrently written. One of these algorithms has another desirable property, called smooth admission. By this property, during the period that the resource is occupied by the leader (called the chair), a process wishing to join the same group as the leader's group can be granted use of the resource in constant time.
The Development of the Earth Simulator
Shinichi HABATA Mitsuo YOKOKAWA Shigemune KITAWAKI

INVITED PAPER

Vol: E86-D No:10
Page(s): 1947-1954
The Earth Simulator (ES), developed by the Japanese government's initiative "Earth Simulator project," is a highly parallel vector supercomputer system. In May 2002, the ES was proven to be the most powerful computer in the world by achieving 35.86 teraflops on the LINPACK benchmark and 26.58 teraflops for a global atmospheric circulation model with the spectral method. Three architectural features enabled these great achievements: a vector processor, shared memory, and a high-bandwidth non-blocking crossbar interconnection network. In this paper, an overview of the ES, the three architectural features, and the results of the performance evaluation are described, particularly with regard to the hardware realization of the interconnection among 640 processor nodes.
Simple Mutual Exclusion Algorithms Based on Bounded Tickets on the Asynchronous Shared Memory Model
Masataka TAKAMURA Yoshihide IGARASHI

PAPER-Parallel/Distributed Algorithms

Vol: E86-D No:2
Page(s): 246-254
We propose two simple algorithms based on bounded tickets for the mutual exclusion problem on the asynchronous single-writer/multi-reader shared memory model. These algorithms are modifications of the Bakery algorithm. An unattractive property of the Bakery algorithm is that the size of its shared memory is unbounded. Initially we design a provisional algorithm based on bounded tickets. It guarantees mutual exclusion in the case where a certain condition is satisfied. To remove the condition, we use an additional process that does not correspond to any user. The algorithm with the additional process is a lockout-free mutual exclusion algorithm on the asynchronous single-writer/multi-reader shared memory model. We then modify this algorithm to reduce the shared memory size with the cost of using another additional process. The maximum waiting time using each of the algorithms proposed in this paper is bounded by (n-1)c+O(nl), where n is the number of users, l is an upper bound on the time between two successive atomic steps, and c is an upper bound on the time that any user spends using the resource. The shared memory size needed by the first algorithm and the second algorithm are (n+1)(1+log (4n)) bits and n(1+log (4n-4))+2 bits, respectively.
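For readers unfamiliar with the starting point, here is a minimal sketch of Lamport's classic Bakery algorithm, not the bounded-ticket modifications proposed in this paper. It shows why the ticket values, and hence the shared memory, are unbounded: each new ticket is one more than the largest ticket currently held. The class and function names are hypothetical, and the sketch relies on CPython executing individual list reads and writes atomically.

```python
import threading
import time


class BakeryLock:
    """Classic Bakery lock for n threads; tickets can grow without bound."""

    def __init__(self, n):
        self.n = n
        self.choosing = [False] * n  # choosing[i] is written only by thread i
        self.ticket = [0] * n        # ticket[i] is written only by thread i

    def acquire(self, i):
        # Doorway: take a ticket larger than every ticket currently held.
        self.choosing[i] = True
        self.ticket[i] = 1 + max(self.ticket)
        self.choosing[i] = False
        for j in range(self.n):
            if j == i:
                continue
            # Wait until thread j has finished picking its ticket.
            while self.choosing[j]:
                time.sleep(0)
            # Wait while j holds a smaller ticket (ties broken by thread index).
            while self.ticket[j] != 0 and (self.ticket[j], j) < (self.ticket[i], i):
                time.sleep(0)

    def release(self, i):
        self.ticket[i] = 0


def run_demo(n_threads=3, rounds=100):
    """Increment a shared counter non-atomically under the Bakery lock."""
    lock = BakeryLock(n_threads)
    counter = [0]

    def worker(i):
        for _ in range(rounds):
            lock.acquire(i)
            value = counter[0]
            time.sleep(0)  # invite a thread switch inside the critical section
            counter[0] = value + 1
            lock.release(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


print(run_demo())
```

Each shared variable here has a single writer and many readers, which matches the single-writer/multi-reader model named in the abstract; the bounded-ticket algorithms of the paper replace the ever-growing ticket values with values from a fixed range.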
Shared Page Table: Sharing of Virtual Memory Resources
Young-Woong KO Chuck YOO

PAPER-Software Systems

Vol: E86-D No:1
Page(s): 45-55
Traditionally, UNIX has been weak in data sharing. By data sharing, we mean that multiple cooperative processes concurrently access and update the same set of data. As the degree of sharing (the number of cooperative processes) increases, the existing UNIX virtual memory systems run into page table thrashing, which causes a major performance bottleneck. Once page table thrashing occurs, UNIX performs miserably regardless of the hardware platforms it is running on. This is a critical problem because UNIX is increasingly used in environments such as banking that require intensive data sharing. We consider several alternatives to avoid page table thrashing, and propose a solution of which the main idea is to share page tables in virtual memory. Extensive experiments have been carried out with real workloads, and the results show that the shared page table solution avoids the page table thrashing and improves the performance of UNIX by an order of magnitude.
Scheduling Loop Applications in Software Distributed Shared Memory Systems
Tyng-Yeu LIANG Ce-Kuen SHIEH Deh-Cheng LIU

PAPER-Algorithms

Vol: E83-D No:9
Page(s): 1721-1730
This paper first examines the issues related to scheduling loop applications on a software distributed shared memory (DSM) system. Then, a dynamic scheduling scheme is developed based on the examined issues to enhance the performance of loop applications on DSM. Compared with previous works, the proposed scheme has several specialties. The first is that the workload of processors can be effectively balanced even when the computational capabilities of processors and the computational needs of threads are not identical. The second is it divides thread mapping into two phases, each with one consideration, i.e., load balance or communication cost, and adopts thread migration and exchange in the two phases, respectively. The third is the exploitation of data sharing among threads to reduce data-consistency communication, and the last is to attack the negative effect of the unnecessary inter-node sharing caused by thread re-mapping. The proposed scheme has been implemented on a page-based DSM system called Cohesion. Our experiments show that the proposed scheme is more effective to improve the performance of the test programs than related schemes.
Wait-Free Linearizable Distributed Shared Memory
Sen MORIYA Katsuro SUDA Michiko INOUE Toshimitsu MASUZAWA Hideo FUJIWARA

PAPER-Algorithms

Vol: E83-D No:8
Page(s): 1611-1621
We consider a wait-free linearizable implementation of shared objects on a distributed message-passing system. We assume that the system provides each process with a local clock that runs at the same speed as global time and that all message delays are in the range [d-u,d] where d and u (0 < u ≤ d) are constants known to every process. We present four wait-free linearizable implementations of read/write registers on reliable and unreliable broadcast models. We also present two wait-free linearizable implementations of general objects on a reliable broadcast model. The efficiency of an implementation is measured by the worst-case response time for each operation of the implemented object. The response times of our wait-free implementations of read/write registers on a reliable broadcast model are better than those of a previously known implementation in which wait-freedom is not taken into account.
A False-Sharing Free Distributed Shared Memory Management Scheme
Alexander I-Chi LAI Chin-Laung LEI Hann-Huei CHIOU

PAPER-Computer Systems

Vol: E83-D No:4
Page(s): 777-788
Distributed shared memory (DSM) systems on top of network of workstations are especially vulnerable to the impact of false sharing because of their higher memory transaction overheads and thus higher false sharing penalties. In this paper we develop a dynamic-granularity shared memory management scheme that eliminates false sharing without sacrificing the transparency to conventional shared-memory applications. Our approach utilizes a special threaded splay tree (TST) for shared memory information management, and a dynamic token-based path-compression synchronization algorithm for data transferring. The combination of the TST and path compression is quite efficient; asymptotically, in an n-processor system with m shared memory segments, synchronizing at most s segments takes O(s log m log n) amortized computation steps and generates O(s log n) communication messages, respectively. Based on the proposed scheme we constructed an experimental DSM prototype which consists of several Ethernet-connected Pentium-based computers running Linux. Preliminary benchmark results on our prototype indicate that our scheme is quite efficient, significantly outperforming traditional schemes and scaling up well.
Multi-Threaded Design for a Software Distributed Shared Memory Systems
Jyh-Chang UENG Ce-Kuen SHIEH Su-Cheong MAC An-Chow LAI Tyng-Yue LIANG

PAPER-Software System

Vol: E82-D No:12
Page(s): 1512-1523
This paper describes the design and implementation of a multi-threaded Distributed Shared Memory (DSM) system, called Cohesion, which provides high programming flexibility and latency masking, and supports load balancing. Cohesion offers a parallel programming environment which is very similar to that on a multiprocessor system. Threads can be created recursively in this environment, and users are not required to handle the locations of the threads. Instead of supporting a shared variable model, Cohesion provides a global shared address space among all nodes in the system. The space is further divided into three regions, i.e., release, conventional, and object-based memory, each of which uses a different consistency protocol. In this paper, the design issues in an ordinary thread system, such as thread management, load balancing, and synchronization, have been reconsidered with the memory management provided by the DSM system. Several real applications have been used to evaluate the performance of the system. The results show that multi-threading usually has better performance than single-threading because the network latency can be masked by overlapping communication and computation. However, the gain depends on program behavior and the number of threads executed on each node in the system.
Some Modifications of the Tournament Algorithm for the Mutual Exclusion Problem
Yoshihide IGARASHI Hironobu KURUMAZAKI Yasuaki NISHITANI

PAPER-Algorithm and Computational Complexity

Vol: E82-D No:2
Page(s): 368-375
We propose two lockout-free (starvation-free) mutual exclusion algorithms for the asynchronous multi-writer/reader shared memory model. The first algorithm is a modification of the well-known tournament algorithm for the mutual exclusion problem. The modification speeds up the original algorithm. The running time of the modified algorithm from the entrance of the trying region to the entrance of the critical region is at most (n-1)c+O(nl), where n is the number of processes, l is an upper bound on the time between two successive steps of each process, and c is an upper bound on the time that any user spends in the critical region. The second algorithm is a further modification of the first algorithm. It is designed so that some processes have an advantage of access to the resource over other processes.
