Harunobu DAIKOKU Hideyuki KAWASHIMA Osamu TATEBE
This paper proposes and examines the three in-memory shuffling methods designed to address problems in MapReduce shuffling caused by skewed data. Coupled Shuffle Architecture (CSA) employs a single pairwise all-to-all exchange to shuffle both blocks, units of shuffle transfer, and meta-blocks, which contain the metadata of corresponding blocks. Decoupled Shuffle Architecture (DSA) separates the shuffling of meta-blocks and blocks, and applies different all-to-all exchange algorithms to each shuffling process, attempting to mitigate the impact of stragglers in strongly skewed distributions. Decoupled Shuffle Architecture with Skew-Aware Meta-Shuffle (DSA w/ SMS) autonomously determines the proper placement of blocks based on the memory consumption of each worker process. This approach targets extremely skewed situations where some worker processes could exceed their node memory limitation. This study evaluates implementations of the three shuffling methods in our prototype in-memory MapReduce engine, which employs high performance interconnects such as InfiniBand and Intel Omni-Path. Our results suggest that DSA w/ SMS is the only viable solution for extremely skewed data distributions. We also present a detailed investigation of the performance of CSA and DSA in various skew situations.
Lei CHEN Wei LU Ergude BAO Liqiang WANG Weiwei XING Yuanyuan CAI
MapReduce is an effective framework for processing large datasets in parallel over a cluster. Data locality and data skew on the reduce side are two essential issues in MapReduce. Improving data locality can decrease network traffic by moving reduce tasks to the nodes where the reducer input data is located. Data skew will lead to load imbalance among reducer nodes. Partitioning is an important feature of MapReduce because it determines the reducer nodes to which map output results will be sent. Therefore, an effective partitioner can improve MapReduce performance by increasing data locality and decreasing data skew on the reduce side. Previous studies considering both essential issues can be divided into two categories: those that preferentially improve data locality, such as LEEN, and those that preferentially improve load balance, such as CLP. However, all these studies ignore the fact that for different types of jobs, the priority of data locality and data skew on the reduce side may produce different effects on the execution time. In this paper, we propose a naive Bayes classifier based partitioner, namely, BAPM, which achieves better performance because it can automatically choose the proper algorithm (LEEN or CLP) by leveraging the naive Bayes classifier, i.e., considering job type and bandwidth as classification attributes. Our experiments are performed in a Hadoop cluster, and the results show that BAPM boosts the computing performance of MapReduce. The selection accuracy reaches 95.15%. Further, compared with other popular algorithms, under specific bandwidths, the improvement BAPM achieved is up to 31.31%.
Md. Al-Amin KHANDAKER Yasuyuki NOGAMI
Scalar multiplication over higher degree rational point groups is often regarded as the bottleneck for faster pairing based cryptography. This paper has presented a skew Frobenius mapping technique in the sub-field isomorphic sextic twisted curve of Kachisa-Schaefer-Scott (KSS) pairing friendly curve of embedding degree 18 in the context of Ate based pairing. Utilizing the skew Frobenius map along with multi-scalar multiplication procedure, an efficient scalar multiplication method for KSS curve is proposed in the paper. In addition to the theoretic proposal, this paper has also presented a comparative simulation of the proposed approach with plain binary method, sliding window method and non-adjacent form (NAF) for scalar multiplication. The simulation shows that the proposed method is about 60 times faster than plain implementation of other compared methods.
Komang OKA SAPUTRA Wei-Chung TENG Takaaki NARA
A network-based remote host clock skew measurement involves collecting the offsets, the differences between sending and receiving times, of packets from the host within a period of time. Although the variant and immeasurable delay in each packet prevents the measurer from getting the real clock offset, the local minimum delays and the majority of delays delineate the clock offset shifts, and are used by existing approaches to estimate the skew. However, events during skew measurement like time synchronization and rerouting caused by switching network interface or base transceiver station may break the trend into multi-segment patterns. Although the skew in each segment is theoretically of the same value, the skew derived from the whole offset-set usually differs with an error of unpredictable scale. In this work, a method called dynamic region of offset majority locating (DROML) is developed to detect multi-segment cases, and to precisely estimate the skew. DROML is designed to work in real-time, and it uses a modified version of the HT-based method [8] both to measure the skew of one segment and to detect the break between adjacent segments. In the evaluation section, the modified HT-based method is compared with the original method and with a linear programming algorithm (LPA) on accumulated-time and short-term measurement. The fluctuation of the modified method in the short-term experiment is 0.6 ppm (parts per million), which is obviously less than the 1.23 ppm and 1.45 ppm from the other two methods. DROML, when estimating a four-segment case, is able to output a skew of only 0.22 ppm error, compared with the result of the normal case.
Koichi FUJIWARA Kazushi KAWAMURA Masao YANAGISAWA Nozomu TOGAWA
Recently, high-level synthesis techniques for FPGA designs (FPGA-HLS techniques) are strongly required in various applications. Both interconnection delays and clock skews have a large impact on circuit performance implemented onto FPGA, which indicates the need for floorplan-driven FPGA-HLS algorithms considering them. To appropriately estimate interconnection delays and clock skews at HLS phase, a reasonable model to estimate them becomes essential. In this paper, we demonstrate several experiments to characterize interconnection delays and clock skews in FPGA and propose novel estimate models called “IDEF” and “CSEF”. In order to evaluate our models, we integrate them into a conventional floorplan-driven FPGA-HLS algorithm. Experimental results demonstrate that our algorithm can realize FPGA designs which reduce the latency by up to 22% compared with conventional approaches.
Zhihong LIU Aimal KHAN Peixin CHEN Yaping LIU Zhenghu GONG
MapReduce still suffers from a problem known as skew, where load is unevenly distributed among tasks. Existing solutions follow a similar pattern that estimates the load of each task and then rebalances the load among tasks. However, these solutions often incur heavy overhead due to the load estimation and rebalancing. In this paper, we present DynamicAdjust, a dynamic resource adjustment technique for mitigating skew in MapReduce. Instead of rebalancing the load among tasks, DynamicAdjust adjusts resources dynamically for the tasks that need more computation, thereby accelerating these tasks. Through experiments using real MapReduce workloads on a 21-node Hadoop cluster, we show that DynamicAdjust can effectively mitigate the skew and speed up the job completion time by up to 37.27% compared to the native Hadoop YARN.
Masayoshi SHIMOKOSHI Jay MOSBRUCKER Kris SCHOUTERDEN
Two-track squeeze and adjacent track interference (ATI) are major barriers to increasing track density in hard disk drives (HDD). These depend on skew angles made by a magnetic head and circumferential direction on a magnetic disk. This paper describes relationships between the skew angle and the magnetic core width (MCW) which affects two-track squeeze and ATI performance. We propose a design concept of a track pitch profile at different skew angles considering MCW. Equivalent robustness of ATI performance on different skew angle conditions is obtained with the optimized track pitch.
Minjia SHI Ting YAO Adel ALAHMADI Patrick SOLÉ
In this article, we study skew cyclic codes over $R=mathbb{F}_{q}+vmathbb{F}_{q}+v^{2}mathbb{F}_{q}$, where $q=p^{m}$, $p$ is an odd prime and v3=v. We describe the generator polynomials of skew cyclic codes over this ring and investigate the structural properties of skew cyclic codes over R by a decomposition theorem. We also describe the generator polynomial of the dual of a skew cyclic code over R. Moreover, the idempotent generators of skew cyclic codes over $mathbb{F}_{q}$ and R are considered.
Yukihide KOHIRA Atsushi TAKAHASHI
Multi-domain clock skew scheduling in general-synchronous framework is an effective technique to improve the performance of sequential circuits by using practical clock distribution network. Although the upper bound of performance of a circuit increases as the number of clock domains increases in multi-domain clock skew scheduling, the improvement of the performance becomes smaller while the cost of clock distribution network increases much. In this paper, a linear time algorithm that finds an optimum two-domain clock skew schedule in general-synchronous framework is proposed. Experimental results on ISCAS89 benchmark circuits and artificial data show that optimum circuits are efficiently obtained by our method in short time.
Susumu KOBAYASHI Fumihiro MINAMI
As the LSI process technology advances and the gate size becomes smaller, the signal delay on interconnect becomes a significant factor in the signal path delay. Also, as the size of interconnect structure becomes smaller, the interconnect process variations have become one of the dominant factors which influence the signal delay and thus clock skew. Therefore, controlling the influence of interconnect process variations on clock skew is a crucial issue in the advanced process technologies. In this paper, we propose a method for minimizing clock skew fluctuations caused by interconnect process variations. The proposed method identifies the suitable balance of clock buffer size and wire length in order to minimize the clock skew fluctuations caused by the interconnect process variations. Experimental results on test circuits of 28nm process technology show that the proposed method reduces the clock skew fluctuations by 30-92% compared to the conventional method.
Soma SHIRAISHI Yaokai FENG Seiichi UCHIDA
This paper proposes a new part-based approach for skew estimation of document images. The proposed method first estimates skew angles on rather small areas, which are the local parts of characters, and subsequently determines the global skew angle by aggregating those local estimations. A local skew estimation on a part of a skewed character is performed by finding an identical part from prepared upright character images and calculating the angular difference. Specifically, a keypoint detector (e.g. SURF) is used to determine the local parts of characters, and once the parts are described as feature vectors, a nearest neighbor search is conducted in the instance database to identify the parts. Finally, a local skew estimation is acquired by calculating the difference of the dominant angles of brightness gradient of the parts. After the local skew estimation, the global skew angle is estimated by the majority voting of those local estimations, disregarding some noisy estimations. Our experiments have shown that the proposed method is more robust to short and sparse text lines and non-text backgrounds in document images compared to conventional methods.
Clock network synthesis is one of the most important and limiting factors in VLSI designs. Hence, the clock skew variation reduction is one of the most important objectives in clock distribution methodology. Cross-link insertion is proposed in [1], however, it is based on empirical methods and does not use variation information for link insertion location choice. [17] considers the delay variation, but it is slow even for small clock trees. In this paper, we propose a fast link insertion algorithm that considers the delay variation information directly during link selection process. Experimental results show that our algorithm is very fast and achieves better skew variability reduction while utilizing considerably lesser routing resources compared with existing methods.
Yanling ZHI Wai-Shing LUK Yi WANG Changhao YAN Xuan ZENG
Yield-driven clock skew scheduling was previously formulated as a minimum cost-to-time ratio cycle problem, by assuming that variational path delays are in Gaussian distributions. However in today's nanometer technology, process variations show growing impacts on this assumption, as variational delays with non-Gaussian distributions have been observed on these paths. In this paper, we propose a novel yield-driven clock skew scheduling method for arbitrary distributions of critical path delays. Firstly, a general problem formulation is proposed. By integrating the cumulative distribution function (CDF) of critical path delays, the formulation is able to handle path delays with any distributions. It also generalizes the previous formulations on yield-driven clock skew scheduling and indicates their statistical interpretations. Generalized Howard algorithm is derived for finding the critical cycles of the underlying timing constraint graphs. Moreover, an effective algorithm based on minimum balancing is proposed for the overall yield improvement. Experimental results on ISCAS89 benchmarks show that, compared with two representative existing methods, our method remarkably improves the yield by 10.25% on average (up to 14.66%).
The impact of clock-skew on circuit timing increases rapidly as technology scales. As a result, it becomes important to deal with clock-skew at the early stages of circuit designs. This paper presents a novel datapath design that aims at mitigating the impact of clock-skew in high-level synthesis, by integrating margin (evaluated as the maximum number of clock-cycles to absorb clock-skew) and ordered clocking into high-level synthesis tasks. As a first attempt to the proposed datapath design, this paper presents a 0-1 integer linear programming formulation that focuses on register binding to achieve the minimum cost (the minimum number of registers) under given scheduling result. Experimental results show the optimal results can be obtained without increasing the latency, and with a few extra registers compared to traditional high-level synthesis design.
Jihoon SON Hyunsik CHOI Yon Dohn CHUNG
MapReduce is a parallel processing framework for large scale data. In the reduce phase, MapReduce employs the hash scheme in order to distribute data sharing the same key across cluster nodes. However, this approach is not robust for the skewed data distribution. In this paper, we propose a skew-tolerant key distribution method for MapReduce. The proposed method assigns keys to cluster nodes balancing their workloads. We implemented our proposed method on Hadoop. Through experiments, we evaluate the performance of the proposed method in comparison with the conventional method.
Yuki YOSHIKAWA Tomomi NUWA Hideyuki ICHIHARA Tomoo INOUE
In this paper, we propose a hybrid test application in partial skewed-load (PSL) scan design. The PSL scan design in which some flip-flops (FFs) are controlled as skewed-load FFs and the others are controlled as broad-side FFs was proposed in [1]. We notice that the PSL scan design potentially has a capability of two test application modes: one is the broad-side test mode, and the other is the hybrid test mode which corresponds to the test application considered in [1]. According to this observation, we present a hybrid test application of the two test modes in the PSL scan design. In addition, we also address a way of skewed-load FF selection based on propagation dominance of FFs in order to take advantage of the hybrid test application. Experimental results for ITC'99 benchmark circuits show that the hybrid test application in the proposed PSL scan design can achieve higher fault coverage than the design based on the skewed-load FF selection [1] does.
Kazuyoshi TAKAGI Yuki ITO Shota TAKESHIMA Masamitsu TANAKA Naofumi TAKAGI
In this paper, we propose a method for layout-driven skewed clock tree synthesis for SFQ logic circuits. For a given logic circuit without a clock tree, our algorithm outputs a circuit with a synthesized clock tree and timing adjustments achieving the given clock period and a rough placement of the clocked gates. In the proposed algorithm, clocked gates are grouped into levels and the clock tree is synthesized for each level. For each level, we estimate the clock timing for all possible placements of each gate, and then we search a placement of all gates that minimizes the total number of delay elements for timing adjustment. Once the placement is obtained, we synthesize a clock tree without wire intersections. We applied the proposed method to a moderate size circuit and confirmed that clock trees satisfying given timing requirements can be synthesized automatically.
A self-calibrating per-pin phase adjuster, which does not require any feedback from the slave chip and a multi-phase clock in the master and slave chips, is proposed for a high speed parallel chip-to-chip interface with a source synchronous double data rate (DDR) signaling. It achieves not only per-pin phase adjustment but also 90° phase shift of a strobe signal for a source synchronous DDR signaling. For this self-calibration, the phase adjuster measures and compensates the only relative mismatched delay among channels by utilizing on-chip time-domain reflectometry (TDR). Thus, variable delay lines, finite state machines, and a test signal generator are additionally required for the proposed phase adjuster. In addition, the power-gating receiver is used to reduce the discontinuity effect of the channel including parasitic components of chip package. To verify the proposed self-calibrating per-pin phase adjuster, the transceivers with 16 data, strobe, and clock signals for the interface with a source synchronous DDR signaling were implemented by using a 60 nm 1-poly 3-metal CMOS DRAM process with a 1.5 V supply. Each phase skew between Strobe and 16 Data was corrected within 0.028UI at 1.6-Gb/s data rate in a point-to-point channel.
Daewoong KIM Kilhyung CHA Soo-Ik CHAE
We propose an optimized scanline filling algorithm for OpenVG two-dimensional vector graphics. For each scanline of a path, it adaptively selects a left or right scanning direction that minimizes the number of pixels visited during scanning. According to the experimental results, the proposed algorithm reduces the number of pixels visited by 6 to 37% relative to that with a constant scanning direction for all the scanlines.
Koji OBATA Kazuyoshi TAKAGI Naofumi TAKAGI
An algorithm for clock scheduling of concurrent-flow clocking rapid single-flux-quantum (RSFQ) digital circuits is proposed. RSFQ circuit technology is an emerging technology of digital circuits. In concurrent-flow clocking RSFQ digital circuits, all logic gates are driven by clock pulses. Appropriate clock scheduling makes clock frequency of the circuits higher. Given a clock period, the proposed algorithm determines the arrival time of clock pulses and the delay that should be inserted. Experimental results show that inserted delay elements by the proposed algorithm are 59.0% fewer and the height of clock trees are 40.4% shorter on average than those by a straightforward algorithm. The proposed algorithm can also be used to minimize the clock period, thus obtaining 19.0% shorter clock periods on average.