The search functionality is under construction.

Keyword Search Result

[Keyword] thread(29hit)

1-20hit(29hit)

  • Implementation of a Multi-Word Compare-and-Swap Operation without Garbage Collection

    Kento SUGIURA  Yoshiharu ISHIKAWA  

     
    PAPER

      Publicized:
    2022/02/03
      Vol:
    E105-D No:5
      Page(s):
    946-954

    With the rapid increase in the number of CPU cores, software that can utilize these many cores is required. A lock-free algorithm based on compare-and-swap (CAS) operations is one of the concurrency control methods to implement such multi-threading software. A multi-word CAS (MwCAS) operation is an extension of a CAS operation to swap multiple words atomically. However, we noticed that the performance of the existing MwCAS implementation is limited because of garbage collection, even in a low-contention environment. To achieve high performance in low-contention workloads, we propose a new MwCAS algorithm without garbage collection. Experimental results show that our approach is three to five times faster than an implementation with garbage collection in low-contention workloads. Moreover, the performance of the proposed method is also superior in a high-contention environment.

  • Driver Status Monitoring System with Body Channel Communication Technique Using Conductive Thread Electrodes

    Beomjin YUK  Byeongseol KIM  Soohyun YOON  Seungbeom CHOI  Joonsung BAE  

     
    PAPER-Wireless Communication Technologies

      Publicized:
    2021/09/24
      Vol:
    E105-B No:3
      Page(s):
    318-325

    This paper presents a driver status monitoring (DSM) system with body channel communication (BCC) technology to acquire the driver's physiological condition. Specifically, a conductive thread, the receiving electrode, is sewn to the surface of the seat so that the acquired signal can be continuously detected. As a signal transmission medium, body channel characteristics using the conductive thread electrode were investigated according to the driver's pose and the material of the driver's pants. Based on this, a BCC transceiver was implemented using an analog frequency modulation (FM) scheme to minimize the additional circuitry and system cost. We analyzed the heart rate variability (HRV) from the driver's electrocardiogram (ECG) and displayed the heart rate and Root Mean Square of Successive Differences (RMSSD) values together with the ECG waveform in real-time. A prototype of the DSM system with commercial-off-the-shelf (COTS) technology was implemented and tested. We verified that the proposed approach was robust to the driver's movements, showing the feasibility and validity of the DSM with BCC technology using a conductive thread electrode.
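    The abstract above reports heart rate and RMSSD values derived from the ECG. As a minimal sketch of how those two quantities are conventionally computed from a list of RR intervals (the function names and the millisecond input format are assumptions for illustration, not taken from the paper):

```python
import math

def rmssd(rr_intervals_ms):
    """Root Mean Square of Successive Differences of RR intervals (ms)."""
    if len(rr_intervals_ms) < 2:
        raise ValueError("need at least two RR intervals")
    # Differences between each pair of adjacent RR intervals.
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def mean_heart_rate_bpm(rr_intervals_ms):
    """Mean heart rate in beats per minute from RR intervals (ms)."""
    mean_rr = sum(rr_intervals_ms) / len(rr_intervals_ms)
    return 60000.0 / mean_rr
```

    How the RR intervals are extracted from the ECG waveform (R-peak detection) is outside this sketch.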

  • Performance Comparison of Training Datasets for System Call-Based Malware Detection with Thread Information

    Yuki KAJIWARA  Junjun ZHENG  Koichi MOURI  

     
    PAPER-Artificial Intelligence, Data Mining

      Publicized:
    2021/09/21
      Vol:
    E104-D No:12
      Page(s):
    2173-2183

    The amount of malware, including variants and new types, has increased dramatically over the years, posing one of the greatest cybersecurity threats today. To counteract such security threats, it is crucial to detect malware accurately and early enough. Recent advances in machine learning technology have brought increasing interest in malware detection, and a number of research studies have been conducted in the field. It is well known that malware detection accuracy largely depends on the training dataset used. Creating a suitable training dataset for efficient malware detection is thus crucial. Different works usually use their own dataset; therefore, a dataset is only effective for one detection method, and strictly comparing several methods using a common training dataset is difficult. In this paper, we focus on how to create a training dataset for efficiently detecting malware. To achieve our goal, the first step is to clarify the information that can accurately characterize malware. This paper concentrates on threads, treating them as important information for characterizing malware. Specifically, on the basis of the dynamic analysis log from Alkanet, a system call tracer, we obtain the thread information and classify the thread information processing into four patterns. Malware detection is then performed using the number of transitions of system calls appearing in the thread as a feature. Our comparative experimental results showed that the primary thread information is important and useful for detecting malware with high accuracy.

  • Which Metric Is Suitable for Evaluating Your Multi-Threading Processors? In Terms of Throughput, Fairness, and Predictability

    Xin JIN  Ningmei YU  

     
    LETTER-VLSI Design Technology and CAD

      Vol:
    E103-A No:9
      Page(s):
    1127-1132

    Simultaneous multithreading (SMT) technology can effectively improve overall throughput and fairness by improving the resource usage efficiency of processors. Previous works have proposed several metrics for evaluation in real systems, each of which strikes a trade-off between fairness and throughput. How to choose an appropriate metric to meet a given demand is still controversial. Therefore, we put forward suggestions on how to select appropriate metrics by analyzing and comparing the characteristics of each metric. In addition, in the new application scenario of cloud computing, data centers place high demands on the quality of service for killer applications, which brings new challenges to SMT in terms of performance guarantees. Therefore, we propose a new metric, P-slowdown, to evaluate the quality of performance guarantees. Based on experimental data, we show the feasibility of P-slowdown for performance evaluation. We also demonstrate the benefit of P-slowdown through two use cases, in which we not only improve the performance guarantee level of SMT processors through the cooperation of P-slowdown and a resource allocation strategy, but also use P-slowdown to predict the occurrence of abnormal behavior caused by security attacks.

  • Logging Inter-Thread Data Dependencies in Linux Kernel

    Takafumi KUBOTA  Naohiro AOTA  Kenji KONO  

     
    PAPER-Software System

      Publicized:
    2020/04/06
      Vol:
    E103-D No:7
      Page(s):
    1633-1646

    Logging is a practical and useful way of diagnosing failures in software systems. The logged events are crucially important to learning what happened during a failure. If key events are not logged, it is almost impossible to track error propagation during diagnosis. Tracking error propagation becomes especially complicated if an inter-thread data dependency is involved. An inter-thread data dependency arises when one thread accesses shared data corrupted by another thread. Since the erroneous state propagates from a buggy thread to a failing thread through the corrupt shared data, the root cause cannot be tracked back solely by investigating the failing thread. This paper presents the design and implementation of K9, a tool that inserts logging code automatically to trace inter-thread data dependencies. K9 is designed to be “practical”; it scales to one million lines of code in C, causes negligible runtime overhead, and provides clues for tracking inter-thread dependencies in real-world bugs. To scale to one million lines of code, K9 forgoes rigorous static analysis of pointers to detect code locations where inter-thread data dependencies can occur. Instead, K9 takes a best-effort approach and finds “most” of those code locations by making use of coding conventions. This paper demonstrates that K9 is applicable to Linux and, in spite of the best-effort approach, captures enough relevant code locations to provide useful clues to root causes in real-world bugs, including a previously unknown bug in Linux. The paper also shows that K9's runtime overhead is negligible: K9 incurs a 1.25% throughput degradation and a 0.18% CPU usage increase, on average, in our evaluation.

  • Accelerating Large-Scale Interconnection Network Simulation by Cellular Automata Concept

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Computer System

      Publicized:
    2018/10/05
      Vol:
    E102-D No:1
      Page(s):
    52-74

    State-of-the-art parallel systems employ a huge number of computing nodes that are connected by an interconnection network. An interconnection network (ICN) plays an important role in a parallel system, since it is responsible for communication capability. In general, an ICN shows non-linear phenomena in its communication performance, most of which are caused by congestion. Thus, designing a large-scale parallel system requires sufficient discussion through repetitive simulation runs. This causes another problem: simulating large-scale systems at a reasonable cost. This paper shows a promising solution by introducing the cellular automata concept, which originated in our prior work. Assuming 2D-torus topologies to simplify the discussion, this paper discusses the fundamental design of router functions in terms of cellular automata, the data structure of packets, alternative modeling of a router function, and miscellaneous optimizations. The proposed models have a good affinity for GPGPU technology and, as representative speed-up results, the GPU-based simulator accelerates simulation by up to about 1264 times over sequential execution on a single CPU. Furthermore, since the proposed models are applicable to the shared memory model, a multithreaded implementation of the proposed methods achieves a speed-up of about 162 times at the maximum.

  • View Priority Based Threads Allocation and Binary Search Oriented Reweight for GPU Accelerated Real-Time 3D Ball Tracking

    Yilin HOU  Ziwei DENG  Xina CHENG  Takeshi IKENAGA  

     
    PAPER-Image Recognition, Computer Vision

      Publicized:
    2018/08/31
      Vol:
    E101-D No:12
      Page(s):
    3190-3198

    In real-time 3D ball tracking for sports analysis in computer vision, complex algorithms that assure accuracy can be time-consuming. A particle-filter-based algorithm has large potential for acceleration, since the per-particle computation can be parallelized on a heterogeneous CPU-GPU platform. Still, with the target multi-view 3D ball tracking algorithm, challenges exist: 1) the serial flow of each step in the algorithm; 2) repeated processing for multiple views; 3) the low degree of parallelism in the reweight and resampling steps due to sequential processing. On the CPU-GPU platform, this paper proposes the double stream system flow, the view priority based threads allocation, and the binary search oriented reweight. Double stream system flow assigns tasks between which no data dependency exists to different streams for each frame, to achieve parallelism at the system structure level. View priority based threads allocation manages threads in the multi-view observation task. The number of threads is the number of views multiplied by the number of particles, and assigning threads with view priority helps both memory access and computation achieve parallelism. Binary search oriented reweight reduces the time complexity by avoiding generation of the cumulative distribution function and uses an unordered array to implement a binary search. The experiment is based on videos recording the final game of an official volleyball match (2014 Inter-High School Games of Men's Volleyball held in Tokyo Metropolitan Gymnasium in Aug. 2014), and the test sequences were taken by a multiple-view system made of 4 cameras located at the four corners of the court. The success rate reaches 99.23%, the same as the target algorithm, while the processing time is reduced from 75.1 ms/frame in the CPU environment to 3.05 ms/frame in the proposed system, a 24.62 times speedup; the proposed system also achieves a 2.33 times speedup compared with the basic GPU implementation.

  • Equivalent Circuit of Yee's Cells and Its Application to Mixed Electromagnetic and Circuit Simulations

    Yuichi TANJI  

     
    PAPER-Microwaves, Millimeter-Waves

      Vol:
    E101-C No:9
      Page(s):
    703-710

    An equivalent circuit of Yee's cells is proposed for mixed electromagnetic and circuit simulations. Using the equivalent circuit, a mixed electromagnetic and circuit simulator can be developed, in which the electromagnetic field and circuit responses are simultaneously analyzed. By representing the electromagnetic system as a circuit, active and passive device models in a circuit simulator can be used for the mixed simulations without any modifications. Hence, the proposed method is very useful for designing various electronic systems. To evaluate the mixed simulations with the equivalent circuit, two implementations, on shared and distributed memory computer systems, are presented. In the numerical examples, we evaluate the performance of the prototype simulators to demonstrate their effectiveness.

  • An Efficient Parallel Coding Scheme in Erasure-Coded Storage Systems

    Wenrui DONG  Guangming LIU  

     
    PAPER-Computer System

      Publicized:
    2017/12/12
      Vol:
    E101-D No:3
      Page(s):
    627-643

    Erasure codes have been considered one of the most promising techniques for data reliability enhancement and storage efficiency in modern distributed storage systems. However, erasure codes often suffer from a time-consuming coding process, which makes them nearly impractical. The opportunity to solve this problem probably relies on the parallelization of erasure-code-based applications on modern multi-/many-core processors to take full advantage of the ample hardware resources on those platforms. However, complicated data allocation and limited I/O throughput pose a great challenge to parallelization. To address this challenge, we propose a general multi-threaded parallel coding approach in this work. The approach consists of a general multi-threaded parallel coding model, named MTPerasure, and two detailed parallel coding algorithms, named sdaParallel and ddaParallel, which adapt to different I/O circumstances. MTPerasure is a general parallel coding model focusing on high-level data allocation; it is applicable to all erasure codes and can be implemented without any modification of the low-level coding algorithms. The sdaParallel algorithm divides the data into several parts, and the data parts are allocated to different threads statically in order to eliminate synchronization latency among multiple threads, which improves the parallel coding performance under the dummy I/O mode. The ddaParallel algorithm employs two threads to execute I/O reading and writing independently on the basis of small pieces, which increases the I/O throughput. Furthermore, the data pieces are assigned to the coding threads dynamically. A special thread scheduling algorithm is also proposed to reduce thread migration latency. To evaluate our proposal, we parallelize the popular open source library jerasure based on our approach. A detailed performance comparison with the original sequential coding program indicates that the proposed parallel approach outperforms the original sequential program with speedups from 1.4x up to 7x, and achieves better utilization of the computation and I/O resources.
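    The static allocation idea described for sdaParallel (split the data into parts, give each thread a fixed part, combine the results with no synchronization in between) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: `split_static`, `xor_bytes`, and `parallel_xor` are hypothetical names, and a simple XOR fold stands in for the real erasure-coding work.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def split_static(data, n_parts):
    """Split data into n_parts contiguous slices of near-equal size."""
    base, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts each take one additional byte.
        end = start + base + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

def xor_bytes(chunk):
    """Stand-in for coding work (assumption): XOR-fold a chunk to one byte."""
    return reduce(lambda acc, b: acc ^ b, chunk, 0)

def parallel_xor(data, n_threads):
    """Statically assign one slice per thread, then combine partial results."""
    parts = split_static(data, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(xor_bytes, parts))
    return reduce(lambda a, b: a ^ b, partials, 0)
```

    Because each thread owns its slice for the whole run, threads never wait on one another until the final combine. In CPython the global interpreter lock limits the actual speedup for CPU-bound work like this, so the sketch shows the allocation pattern rather than the performance.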

  • Efficient Parallel Join Processing Exploiting SIMD in Multi-Thread Environments

    Gilseok HONG  Seonghyeon KANG  Chang soo KIM  Jun-Ki MIN  

     
    PAPER-Data Engineering, Web Information Systems

      Publicized:
    2017/12/14
      Vol:
    E101-D No:3
      Page(s):
    659-667

    In this paper, we study parallel join processing to improve the performance of the merge phase of sort-merge join by integrating all the parallelism provided by mainstream CPUs. Modern CPUs support SIMD instruction sets with wider SIMD registers, which allow multiple data items to be processed per instruction. Thus, we devise an efficient parallel join algorithm, called Parallel Merge Join with SIMD instructions (PMJS). In our proposed algorithm, we utilize data parallelism by exploiting SIMD instructions. We also improve performance by avoiding the use of conditional branch instructions. Furthermore, to take advantage of multiple cores, our proposed algorithm is threaded for multi-thread environments. In our multi-thread algorithm, to distribute the workload evenly to each thread, we devise an efficient workload balancing algorithm based on a kernel density estimator, which allows the workload of each thread to be estimated accurately.

  • A Capacity-Aware Thread Scheduling Method Combined with Cache Partitioning to Reduce Inter-Thread Cache Conflicts

    Masayuki SATO  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-Computer System

      Vol:
    E96-D No:9
      Page(s):
    2047-2054

    Chip multiprocessors (CMPs) improve performance by simultaneously executing multiple threads using integrated multiple cores. However, since these cores commonly share one cache, inter-thread cache conflicts often limit the performance improvement by multi-threading. This paper focuses on two causes of inter-thread cache conflicts. In shared caches of CMPs, cached data fetched by one thread are frequently evicted by another thread. Such an eviction, called inter-thread kickout (ITKO), is one of the major causes of inter-thread cache conflicts. The other cause is capacity shortage that occurs when one cache is shared by threads demanding large cache capacities. If the total capacity demanded by the threads exceeds the actual cache capacity, the threads compete to use the limited cache capacity, resulting in capacity shortage. To address inter-thread cache conflicts, we must take into account both ITKOs and capacity shortage. Therefore, this paper proposes a capacity-aware thread scheduling method combined with cache partitioning. In the proposed method, inter-thread cache conflicts due to ITKOs and capacity shortage are decreased by cache partitioning and thread scheduling, respectively. The proposed scheduling method estimates the capacity demand of each thread with an estimation method used in the cache partitioning mechanism. Based on the estimation used for cache partitioning, the thread scheduler decides thread combinations sharing one cache so as to avoid capacity shortage. Evaluation results suggest that the proposed method can improve overall performance by up to 8.1%, and the performance of individual threads by up to 12%. The results also show that both cache partitioning and thread scheduling are indispensable to avoid both ITKOs and capacity shortage simultaneously. Accordingly, the proposed method can significantly reduce the inter-thread cache conflicts and hence improve performance.

  • Efficient Tracking of News Topics Based on Chronological Semantic Structures in a Large-Scale News Video Archive

    Ichiro IDE  Tomoyoshi KINOSHITA  Tomokazu TAKAHASHI  Hiroshi MO  Norio KATAYAMA  Shin'ichi SATOH  Hiroshi MURASE  

     
    PAPER-Video Processing

      Vol:
    E95-D No:5
      Page(s):
    1288-1300

    Recent advances in digital storage technology have enabled us to archive a large volume of video data. Thanks to this trend, we have archived more than 1,800 hours of video data from a daily Japanese news show over the last ten years. When considering the effective use of such a large news video archive, we assumed that analysis of its chronological and semantic structure becomes important. We also consider that presenting users with the development of news topics is more helpful to their understanding of current affairs than providing a list of relevant news stories, as most current news video retrieval systems do. Therefore, in this paper, we propose a structuring method for a news video archive, together with an interface that visualizes the structure, so that users can efficiently track the development of news topics according to their interest. The proposed news video structure, namely the “topic thread structure”, is obtained as a result of an analysis of the chronological and semantic relations between news stories. Meanwhile, the proposed interface, namely “mediaWalker II”, allows users to track the development of news topics along the topic thread structure, and at the same time watch the video footage corresponding to each news story. Analyses of the topic thread structures obtained by applying the proposed method to actual news video footage revealed interesting and comprehensible relations between news topics in the real world. At the same time, analyses of their size quantified the efficiency of tracking a user's topic-of-interest based on the proposed topic thread structure. We consider this a first step towards facilitating video authoring by users based on existing content in a large-scale news video archive.

  • An Improvement of Twisted Ate Pairing Efficient for Multi-Pairing and Thread Computing

    Yumi SAKEMI  Yasuyuki NOGAMI  Shoichi TAKEUCHI  Yoshitaka MORIKAWA  

     
    PAPER

      Vol:
    E94-A No:6
      Page(s):
    1356-1367

    In the case of Barreto-Naehrig pairing-friendly curves of embedding degree 12 of order r, recent efficient Ate pairings such as R-ate, optimal, and Xate pairings achieve Miller loop lengths of (1/4) ⌊log2 r⌋. On the other hand, the twisted Ate pairing requires (3/4) ⌊log2 r⌋ loop iterations, and thus is usually slower than the recent efficient Ate pairings. This paper proposes an improved twisted Ate pairing using Frobenius maps and a small scalar multiplication. The proposed idea splits the Miller's algorithm calculation into several independent parts, for which multi-pairing techniques apply efficiently. The maximum number of loop iterations in Miller's algorithm for the proposed twisted Ate pairing is equal to the (1/4) ⌊log2 r⌋ attained by the most efficient Ate pairings.

  • Issue Mechanism for Embedded Simultaneous Multithreading Processor

    Chengjie ZANG  Shigeki IMAI  Steven FRANK  Shinji KIMURA  

     
    PAPER

      Vol:
    E91-A No:4
      Page(s):
    1092-1100

    Simultaneous Multithreading (SMT) technology enhances instruction throughput by issuing multiple instructions from multiple threads within one clock cycle. With an in-order pipeline for each thread, SMT processors can issue a number of instructions close to, or surpassing, that of an out-of-order pipeline. In this work, we show an efficient issue logic for predicated instruction sequences with a parallel flag in each instruction, where predicate register based issue control is adopted and consecutive instructions with a parallel flag of '0' are executed in parallel. The flag is pre-defined by a compiler. Instructions from different threads are issued in round-robin order. We also introduce an Instruction Queue skip mechanism that skips a thread if its queue is empty. Using this kind of issue logic, we designed a 6-thread, 7-stage, in-order pipeline processor. Based on this processor, we compare the round-robin issue policy (RR(T1-Tn)) with other policies: thread one always has the highest priority (PR(T1)), and thread one or thread n has the highest priority in turn (PR(T1-Tn)). The results show that the RR(T1-Tn) policy outperforms the others, and that PR(T1-Tn) is almost the same as RR(T1-Tn) in terms of issued instructions per cycle.
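    The round-robin issue order with an empty-queue skip, as described in the abstract above, can be sketched in a few lines. This is a simplified model under stated assumptions, not the paper's issue logic: it issues one instruction per cycle, ignores the parallel flag and predicate registers, and `round_robin_issue` is a hypothetical name.

```python
from collections import deque

def round_robin_issue(queues):
    """Return (thread_id, instruction) pairs in round-robin issue order.

    Threads whose instruction queue is empty are skipped, so an idle
    thread does not cost an issue slot.
    """
    queues = [deque(q) for q in queues]
    issued = []
    start = 0  # thread that has first claim on the next issue slot
    n = len(queues)
    while any(queues):
        for offset in range(n):
            tid = (start + offset) % n
            if queues[tid]:
                issued.append((tid, queues[tid].popleft()))
                start = (tid + 1) % n  # next thread gets the next slot
                break
    return issued
```

    For example, with queues `[["a1", "a2"], [], ["c1"]]` the empty second thread is skipped and the issue order is thread 0, thread 2, thread 0.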

  • Dynamic Programming and Clique Based Approaches for Protein Threading with Profiles and Constraints

    Tatsuya AKUTSU  Morihiro HAYASHIDA  Dukka Bahadur K.C.  Etsuji TOMITA  Jun'ichi SUZUKI  Katsuhisa HORIMOTO  

     
    PAPER

      Vol:
    E89-A No:5
      Page(s):
    1215-1222

    The protein threading problem with profiles is known to be efficiently solvable using dynamic programming. In this paper, we consider a variant of the protein threading problem with profiles in which constraints on distances between residues are given. We prove that protein threading with profiles and constraints is NP-hard. Moreover, we show a strong hardness result on the approximation of an optimal threading satisfying all the constraints. On the other hand, we develop two practical algorithms: CLIQUETHREAD and BBDPTHREAD. CLIQUETHREAD reduces the threading problem to the maximum edge-weight clique problem, whereas BBDPTHREAD combines dynamic programming and branch-and-bound techniques. We perform computational experiments using protein structure data in PDB (Protein Data Bank) using simulated distance constraints. The results show that constraints are useful to improve the alignment accuracy of the target sequence and the template structure. Moreover, these results also show that BBDPTHREAD is in general faster than CLIQUETHREAD for larger size proteins whereas CLIQUETHREAD is useful if there does not exist a feasible threading.

  • High-Level Power Optimization Based on Thread Partitioning

    Jumpei UCHIDA  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-System Level Design

      Vol:
    E87-A No:12
      Page(s):
    3075-3082

    This paper proposes a thread partitioning algorithm in low power high-level synthesis. The algorithm is applied to high-level synthesis systems. In the systems, we can describe parallel behaving circuit blocks (threads) explicitly. First it focuses on a local register file RF in a thread. It partitions a thread into two sub-threads, one of which has RF and the other does not have RF. The partitioned sub-threads need to be synchronized with each other to keep the data dependency of the original thread. Since the partitioned sub-threads have waiting time for synchronization, gated clocks can be applied to each sub-thread. Then we can synthesize a low power circuit with a low area overhead, compared to the original circuit. Experimental results demonstrate effectiveness and efficiency of the algorithm.

  • Development of a High-Performance Web-Server through a Real-Time Compression Architecture

    Byungjo MIN  Euiseok NAHM  June HWANG  Hagbae KIM  

     
    LETTER-Internet

      Vol:
    E87-B No:12
      Page(s):
    3781-3783

    This paper proposes a Real-Time Compression Architecture (RTCA), which maximizes the efficiency of web services, while reducing the response time at the same time. The developed architecture not only guarantees the freshness of compressed contents but also minimizes the time needed to compress the message, especially when the traffic is heavy.

  • Network of Plant Remote Monitoring System Using UDP/IP for Wind-Farms

    Haruhi ETO  Hirofumi MATSUO  Fujio KUROKAWA  

     
    PAPER-Power System Architecture

      Vol:
    E87-B No:12
      Page(s):
    3457-3464

    Wind power generation has come to occupy an important position as a new non-fossil energy source in recent years, and plant scale has been expanding rapidly in the form of wind-farms. Since wind-farms are often built in topographically inconvenient places, remote monitoring systems are required. Although Ethernet has been said to be unsuitable for industrial networks, it is a strong option because of its low cost and ease of application. In this case, it is important to secure enough throughput to update the data of numerous wind turbines within a fixed time. To achieve this, we adopted User Datagram Protocol/Internet Protocol (UDP/IP) and a multi-thread method to make the software overhead as small as possible. This paper presents a scheme for a powerful network using Ethernet with multi-threading and multicast. The relation between the number of threads and the total throughput of the network is clarified, and a design procedure to derive the optimum number of threads is shown. It is demonstrated that this scheme provides the local network of a wind-farm with sufficient performance.

  • Software Implementation of a Secure Socket Layer (SSL) Accelerator Based on Kernel Thread

    Euiseok NAHM  Byungjo MIN  Jinbae PARK  Hagbae KIM  

     
    LETTER-Software Engineering

      Vol:
    E87-D No:1
      Page(s):
    244-245

    We implement an efficient Secure Socket Layer (SSL) accelerator, which is embedded at the kernel level and uses as many kernel threads as there are CPUs. In a comparison of conventional Apache with and without our SSL accelerator, the accelerator significantly improves web-server performance, by up to 200%.

  • Proposal of a Multi-Threaded Processor Architecture for Embedded Systems and Its Evaluation

    Shinsuke KOBAYASHI  Yoshinori TAKEUCHI  Akira KITAJIMA  Masaharu IMAI  

     
    PAPER

      Vol:
    E84-A No:3
      Page(s):
    748-754

    In this paper, an architecture for a multi-threaded processor for embedded systems is proposed and evaluated in comparison with other processors for embedded systems. The experimental results show the trade-off between hardware cost and execution time among the processors. By taking the proposed multi-threaded processor into account as an embedded processor, the design space of embedded systems is enlarged and a more suitable architecture can be selected under given design constraints.
