Chien-Hui LIAO Charles H.-P. WEN
Hotspots occur frequently in 3D multi-core processors (3D-MCPs), and they may adversely impact both the reliability and the lifetime of a system. We present a new thermally constrained task scheduler based on thermal-pattern-aware voltage assignment (TPAVA) to reduce hotspots in, and optimize the performance of, 3D-MCPs. By analyzing the temperature profiles of different voltage assignments, TPAVA pre-emptively assigns different initial operating-voltage levels to cores to reduce temperature increases in 3D-MCPs. The proposed task scheduler consists of an on-line allocation strategy and a new voltage-scaling strategy. In particular, the on-line allocation strategy uses the temperature-variation rates of the cores and takes into account two important thermal behaviors of 3D-MCPs, which effectively minimizes occurrences of hotspots in both thermally homogeneous and heterogeneous 3D-MCPs. Furthermore, a new vertical-grouping voltage-scaling (VGVS) strategy that considers thermal correlation in 3D-MCPs is used to handle thermal emergencies. Experimental results indicate that, compared to a previous on-line thermally constrained task scheduler, the proposed scheduler reduces hotspot occurrences by approximately 66% (71%) and improves throughput by approximately 8% (2%) in thermally homogeneous (heterogeneous) 3D-MCPs. These results indicate that the proposed task scheduler is an effective technique for suppressing hotspot occurrences and optimizing throughput in 3D-MCPs subject to thermal constraints.
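A minimal sketch of this kind of on-line allocation (not the authors' implementation): each arriving task is placed on the idle core whose current temperature plus its observed temperature-variation rate predicts the lowest end-of-task temperature, skipping cores that would cross the hotspot threshold. The Core fields, the linear prediction model, and all names are illustrative assumptions.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Core {
    double temperature;    // current sensor reading (deg C)
    double variationRate;  // observed dT/dt (deg C per second)
    bool   busy;
};

// Predicted temperature after running a task of the given length
// (illustrative linear model, not the paper's thermal model).
static double predictedTemp(const Core& c, double taskSeconds) {
    return c.temperature + c.variationRate * taskSeconds;
}

// Pick the idle core with the lowest predicted temperature, skipping
// cores that would exceed the hotspot threshold.
int allocateTask(const std::vector<Core>& cores,
                 double taskSeconds, double hotspotThreshold) {
    int best = -1;
    double bestTemp = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < cores.size(); ++i) {
        if (cores[i].busy) continue;
        double t = predictedTemp(cores[i], taskSeconds);
        if (t < hotspotThreshold && t < bestTemp) {
            bestTemp = t;
            best = static_cast<int>(i);
        }
    }
    return best;  // -1 means no thermally safe core is available
}
```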
Cheol-Ho HONG Young-Pil KIM Seehwan YOO Chi-Young LEE Chuck YOO
Facing practical limits to increasing processor frequencies, manufacturers have resorted to multi-core designs in their commercial products. In multi-core implementations, the cores in a physical package share the last-level caches to improve inter-core communication. To exploit this facility efficiently, operating systems must employ cache-aware schedulers. Unfortunately, virtualization software, which is a foundation technology of cloud computing, is not yet cache-aware or does not fully exploit the locality of the last-level caches. In this paper, we propose a cache-aware virtual machine scheduler for multi-core architectures. The proposed scheduler exploits the locality of the last-level caches to improve the performance of concurrent applications running on virtual machines. For this purpose, we first provide a space-partitioning algorithm that migrates and clusters communicating virtual CPUs (VCPUs) into the same cache domain. Second, we provide a time-partitioning algorithm that co-schedules clustered VCPUs or schedules them in sequence. Finally, we present a theoretical analysis proving that our scheduling algorithm is more efficient in supporting concurrent applications than the default credit scheduler in Xen. We implemented our virtual machine scheduler in the recent Xen hypervisor with para-virtualized Linux-based operating systems. We show that our approach can improve the performance of concurrent virtual machines by up to 19% compared to the credit scheduler.
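The space-partitioning step can be pictured with a minimal sketch (not Xen code): given pairwise communication traffic between VCPUs, a greedy pass clusters the most heavily communicating pairs into the same last-level-cache domain. The Edge type, the capacity model, and the greedy policy are illustrative assumptions.

```cpp
#include <algorithm>
#include <vector>

struct Edge { int a, b; long traffic; };  // traffic between VCPUs a and b

// Returns domainOf[v] for every VCPU; -1 marks a VCPU not yet placed.
std::vector<int> partitionVcpus(int numVcpus, std::vector<Edge> edges,
                                int numDomains, int domainCapacity) {
    std::vector<int> domainOf(numVcpus, -1);
    std::vector<int> load(numDomains, 0);
    // Heaviest-traffic pairs are considered first.
    std::sort(edges.begin(), edges.end(),
              [](const Edge& x, const Edge& y) { return x.traffic > y.traffic; });
    auto place = [&](int v, int d) { domainOf[v] = d; ++load[d]; };
    for (const Edge& e : edges) {
        int da = domainOf[e.a], db = domainOf[e.b];
        if (da == -1 && db == -1) {
            // Put both endpoints into the least-loaded domain with room.
            int d = static_cast<int>(
                std::min_element(load.begin(), load.end()) - load.begin());
            if (load[d] + 2 <= domainCapacity) { place(e.a, d); place(e.b, d); }
        } else if (da == -1 && load[db] < domainCapacity) {
            place(e.a, db);  // migrate the free VCPU next to its peer
        } else if (db == -1 && da != -1 && load[da] < domainCapacity) {
            place(e.b, da);
        }
    }
    // Scatter any remaining VCPUs into domains with free slots.
    for (int v = 0; v < numVcpus; ++v)
        if (domainOf[v] == -1)
            for (int d = 0; d < numDomains; ++d)
                if (load[d] < domainCapacity) { place(v, d); break; }
    return domainOf;
}
```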
Wenhua FAN Chen CHEN Yun CHEN Zhiyi YU Xiaoyang ZENG
This paper presents an efficient implementation of an OFDM inner receiver on a programmable multi-core processor platform, with CMMB as the application. The platform consists of an array of programmable SIMD processors interconnected in a 2-D mesh network, which provides high performance and is well suited to wireless communication applications. Implemented on one cluster with 8 cores, the receiver performs symbol-timing, carrier-frequency-offset, and sampling-frequency-offset synchronization, as well as channel estimation and equalization. Multiple optimization techniques are explored to improve system throughput, such as task-level parallelism across cores, data-level parallelism on the SIMD cores, minimization of memory accesses, and route-length-minimizing task mapping. In addition, an efficient memory strategy and dedicated instructions for complex computation further increase performance. Simulation results show that the inner receiver achieves a throughput of up to 120 Mbps when operating at 750 MHz.
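As an illustration of the channel estimation and equalization stage, here is a minimal sketch (not the CMMB-specific implementation): least-squares channel estimates at known pilot subcarriers followed by one-tap zero-forcing equalization. The pilot layout, the per-subcarrier channel model, and the function names are assumptions.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cf = std::complex<float>;

// Least-squares channel estimate at pilot subcarriers: H[k] = Y[k] / X[k].
std::vector<cf> estimateChannel(const std::vector<cf>& rxPilots,
                                const std::vector<cf>& txPilots) {
    std::vector<cf> h(rxPilots.size());
    for (std::size_t k = 0; k < rxPilots.size(); ++k)
        h[k] = rxPilots[k] / txPilots[k];
    return h;
}

// One-tap zero-forcing equalizer: X_hat[k] = Y[k] / H[k]. A real receiver
// interpolates H between pilots; here H is assumed known per subcarrier.
std::vector<cf> equalize(const std::vector<cf>& rxData,
                         const std::vector<cf>& h) {
    std::vector<cf> x(rxData.size());
    for (std::size_t k = 0; k < rxData.size(); ++k)
        x[k] = rxData[k] / h[k];
    return x;
}
```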
Junichi OHMURA Takefumi MIYOSHI Hidetsugu IRIE Tsutomu YOSHINAGA
In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node connections. For one node with a quad-core Intel Xeon W3520 processor and an NVIDIA Tesla C1060 GPU card, we implement a CPU–GPU parallel double-precision general matrix–matrix multiplication (dgemm) operation, and achieve a performance improvement of 34% compared with the GPU-only case and 64% compared with the CPU-only case. For an entire 16-node cluster, each node of which is the same as the above and is connected with two gigabit Ethernet links, we use a computation-communication overlap scheme with GPU acceleration for the Linpack benchmark, and achieve a performance improvement of 28% compared with the GPU-accelerated high-performance Linpack benchmark (HPL) without overlapping. In our solution, the main inter-node communication and the data transfer to GPU device memory are overlapped with the main computation task on the CPU cores. This approach relies on multi-core processors, which almost all of today's high-performance computers employ. In particular, while one CPU core handles communication tasks, the remaining CPU cores and the GPU simultaneously perform computation tasks. To enable overlap between inter-node communication and computation, we eliminate their close dependence by breaking the main computation task into smaller tasks and rescheduling them. Because part of the CPU computation power is thus diverted to tasks other than computation, we experimentally determine the optimal computation ratio for the CPUs; this ratio differs from that of the single-node parallel dgemm operation.
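The overlap idea can be sketched as follows (a simplified model, not the authors' HPL code): the trailing-matrix update of each step is broken into small independent tiles so that one thread, standing in for the dedicated communication core, runs concurrently with compute threads that stand in for the remaining CPU cores and the GPU. The step/tile structure and the printf stand-ins are illustrative assumptions.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Stand-ins for the real work; in the paper these are inter-node
// communication, transfer to GPU device memory, and dgemm on the
// trailing matrix.
void communicatePanel(int step) { std::printf("comm  step %d\n", step); }
void computeTile(int step, int tile) {
    std::printf("dgemm step %d tile %d\n", step, tile);
}

int main() {
    const int steps = 4, tilesPerStep = 8;
    for (int s = 0; s < steps; ++s) {
        // One core is dedicated to communicating the next step's panel...
        std::thread comm(communicatePanel, s + 1);
        // ...while the other cores (and the GPU) update the current
        // trailing matrix, broken into small independent tiles.
        std::vector<std::thread> workers;
        for (int t = 0; t < tilesPerStep; ++t)
            workers.emplace_back(computeTile, s, t);
        for (auto& w : workers) w.join();
        comm.join();  // the next step starts only when both sides finish
    }
}
```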
Shunsuke OKUMURA Yuki KAGIYAMA Yohei NAKATA Shusuke YOSHIMOTO Hiroshi KAWAGUCHI Masahiko YOSHIMOTO
This paper proposes a 7T SRAM that realizes a block-level simultaneous-copy feature. The proposed SRAM can be used for data transfer between local memories, such as checkpoint-data storage and transactional memory. The 1-Mb SRAM comprises 32-kb blocks, in which 16 kb of data can be copied in 33.3 ns at 1.2 V. The proposed scheme reduces energy consumption during copying by 92.7% compared to the conventional read-modify-write manner. Applying the proposed scheme to transactional memory can reduce the number of write-back cycles by 98.7% compared with a conventional memory system.
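Since the copy feature is a circuit technique, software can only model its effect; the following minimal sketch contrasts the conventional per-word read-modify-write loop with the bulk block copy it replaces. The Block type and both copy primitives are illustrative assumptions; only the 32-kb block size follows the paper.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockBits  = 32 * 1024;  // 32-kb block (per paper)
constexpr std::size_t kBlockWords = kBlockBits / 32;

struct Block { std::array<std::uint32_t, kBlockWords> w{}; };

// Conventional path: every word is read and written back individually,
// paying one read and one write access per word.
void copyReadModifyWrite(const Block& src, Block& dst) {
    for (std::size_t i = 0; i < kBlockWords; ++i)
        dst.w[i] = src.w[i];
}

// 7T-SRAM path: the hardware copies the whole block simultaneously
// (33.3 ns at 1.2 V in the paper); software can only model it as one
// bulk operation.
void copyBlockParallel(const Block& src, Block& dst) {
    dst = src;  // stands in for the in-array parallel copy
}
```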
Hasitha Muthumala WAIDYASOORIYA Daisuke OKUMURA Masanori HARIYAMA Michitaka KAMEYAMA
Heterogeneous multi-core processors are attractive for media-processing applications because of their ability to draw on the strengths of different cores to improve overall performance. However, data-transfer bottlenecks and task-allocation limitations caused by accelerator-incompatible operations prevent us from exploiting the full potential of heterogeneous multi-core processors. This paper presents a task allocation method based on algorithm transformation that increases the freedom of task allocation. We use approximation methods such as CORDIC algorithms to map accelerator-incompatible operations onto accelerator cores. According to experimental results using HOG descriptor computation, the proposed task allocation method reduces the data-transfer time by more than 82% and the total processing time by more than 79% compared to the conventional task allocation method.
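To make the role of CORDIC concrete, the following sketch shows vectoring-mode CORDIC computing the gradient magnitude and orientation needed by the HOG descriptor using only shift-and-add-style operations. This is the generic textbook formulation, not the paper's implementation; the iteration count and the fixed-point emulation via ldexp are assumptions.

```cpp
#include <cmath>
#include <cstdio>

// Vectoring-mode CORDIC: rotates (x, y) onto the x-axis, accumulating the
// rotation angle. After N iterations, x ~= K * sqrt(x0^2 + y0^2) and
// z ~= atan2(y0, x0), where K ~= 1.6468 is the CORDIC gain. Assumes x > 0.
void cordicVectoring(double x, double y, int iterations,
                     double* magnitude, double* angle) {
    double z = 0.0;
    for (int i = 0; i < iterations; ++i) {
        double dx = std::ldexp(x, -i);  // x >> i in fixed-point hardware
        double dy = std::ldexp(y, -i);  // y >> i in fixed-point hardware
        double a  = std::atan(std::ldexp(1.0, -i));  // lookup table in HW
        if (y > 0) { x += dy; y -= dx; z += a; }
        else       { x -= dy; y += dx; z -= a; }
    }
    const double kGain = 1.6467602581210656;  // product of sqrt(1 + 2^-2i)
    *magnitude = x / kGain;
    *angle = z;
}

int main() {
    double mag, ang;
    cordicVectoring(3.0, 4.0, 16, &mag, &ang);
    std::printf("magnitude=%f angle=%f\n", mag, ang);  // ~5.0, ~0.9273
}
```

In hardware, the arctangent table and the gain correction are constants, so the whole operation reduces to adders and shifters, which is what makes such approximations mappable onto accelerator cores that lack native divide or square-root units.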