
Keyword Search Result

[Keyword] big data (16 hits)

Hits 1-16 of 16
  • Modeling Inter-Sector Air Traffic Flow and Sector Demand Prediction

    Ryosuke MISHIMA  Kunihiko HIRAISHI  

     
    PAPER-Mathematical Systems Science
    Publicized: 2022/04/11  Vol: E105-A No:10  Page(s): 1413-1420

    In 2015, the Ministry of Land, Infrastructure, Transport and Tourism began providing information on aircraft flying over Japan, called CARATS Open Data, in order to actively promote research on aviation systems. The airspace is divided into sectors, which are used to limit air traffic so that it can be controlled safely and efficiently. Since the demand for air transportation is increasing, new optimization techniques and efficient control are required to predict and resolve demand-capacity imbalances in the airspace. In this paper, we aim to construct mathematical models of inter-sector air traffic flow from CARATS Open Data. In addition, we develop methods to predict future sector demand. The accuracy of the prediction is evaluated by comparing the predicted sector demand with actual data.

  • Sublinear Computation Paradigm: Constant-Time Algorithms and Sublinear Progressive Algorithms Open Access

    Kyohei CHIBA  Hiro ITO  

     
    INVITED PAPER-Algorithms and Data Structures
    Publicized: 2021/10/08  Vol: E105-A No:3  Page(s): 131-141

    The challenges posed by big data in the 21st century are complex: under the conventional wisdom, polynomial-time algorithms were considered practical; however, when we handle big data, even a linear-time algorithm may be too slow. Thus, sublinear- and constant-time algorithms are required. The academic research project "Foundations of Innovative Algorithms for Big Data," which started in 2014 and will finish in September 2021, aimed at developing various techniques and frameworks for designing algorithms for big data. In this project, we introduce a "Sublinear Computation Paradigm." Toward this purpose, we first survey constant-time algorithms, the most investigated framework in this area, and then present our recent results on sublinear progressive algorithms. A sublinear progressive algorithm first outputs a temporary approximate solution in constant time, then gradually suggests better solutions in sublinear time, and finally finds the exact solution. We present Sublinear Progressive Algorithm Theory (SPA Theory, for short), which enables one to construct a sublinear progressive algorithm for any property that has both a constant-time algorithm and an exact algorithm (an exponential-time one is allowed), without losing any computation time in the big-O sense.
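
    As a toy illustration of the progressive pattern only (this is not the paper's SPA construction; the estimator and the sample-growth schedule below are assumptions), the following Python sketch emits a constant-size sampled estimate first, refines it with growing samples, and ends with the exact answer.

        import random

        def progressive_mean(data, start=100, rounds=3):
            """Toy progressive estimator: a constant-size sample gives the first guess,
            larger samples refine it, and the exact mean is returned last."""
            n = len(data)
            size = start
            for _ in range(rounds):
                k = min(size, n)
                sample = random.sample(data, k)
                yield sum(sample) / k        # temporary approximate solution
                size *= 10                   # grow the sample for the next round
            yield sum(data) / n              # exact solution

        for estimate in progressive_mean(list(range(1_000_000))):
            print(estimate)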

  • Practical Evaluation of Online Heterogeneous Machine Learning

    Kazuki SESHIMO  Akira OTA  Daichi NISHIO  Satoshi YAMANE  

     
    PAPER-Artificial Intelligence, Data Mining
    Publicized: 2020/08/31  Vol: E103-D No:12  Page(s): 2620-2631

    In recent years, big data has attracted increasing attention, and many techniques for data analysis have been proposed. Big data analysis is difficult, however, because such data varies greatly in its regularity. Heterogeneous mixture machine learning is one algorithm for analyzing such data efficiently. In this study, we propose online heterogeneous learning based on an online EM algorithm. Experiments show that this algorithm achieves higher learning accuracy than a conventional method and is practical. The online learning approach will make this algorithm useful in the field of data analysis.
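
    The abstract does not give the update rule, so the sketch below is only a generic stepwise (online) EM for a one-dimensional Gaussian mixture, used here as a stand-in for the proposed online heterogeneous learner; all names and the step-size schedule are assumptions.

        import numpy as np

        def online_em_gmm(stream, k=2, seed=0):
            """Stepwise (online) EM for a 1-D Gaussian mixture (illustrative only)."""
            rng = np.random.default_rng(seed)
            s0 = np.full(k, 1.0 / k)            # running responsibility mass per component
            mu0 = rng.normal(size=k)
            s1 = mu0 * s0                       # running sum of r * x
            s2 = (mu0 ** 2 + 1.0) * s0          # running sum of r * x^2 (unit initial variance)
            for t, x in enumerate(stream, start=2):
                w, mu = s0 / s0.sum(), s1 / s0
                var = np.maximum(s2 / s0 - mu ** 2, 1e-6)
                # E-step: responsibility of each component for the new sample x
                dens = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
                r = w * dens
                r /= max(r.sum(), 1e-300)
                # damped M-step on the sufficient statistics (step size decays with t)
                eta = t ** -0.7
                s0 = (1 - eta) * s0 + eta * r
                s1 = (1 - eta) * s1 + eta * r * x
                s2 = (1 - eta) * s2 + eta * r * x * x
            mu = s1 / s0
            return s0 / s0.sum(), mu, np.maximum(s2 / s0 - mu ** 2, 1e-6)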

  • High-Performance End-to-End Integrity Verification on Big Data Transfer

    Eun-Sung JUNG  Si LIU  Rajkumar KETTIMUTHU  Sungwook CHUNG  

     
    PAPER-Fundamentals of Information Systems
    Publicized: 2019/04/24  Vol: E102-D No:8  Page(s): 1478-1488

    The scale of scientific data generated by experimental facilities and by simulations in high-performance computing facilities has been growing rapidly with the emergence of IoT-based big data. In many cases, this data must be transmitted rapidly and reliably to remote facilities for storage, analysis, or sharing in Internet of Things (IoT) applications. In addition, the data can be verified with a checksum after it has been written to disk at the destination to ensure its integrity. However, this end-to-end integrity verification inevitably creates overhead (extra disk I/O and additional computation), so the overall data transfer time increases. In this article, we evaluate strategies that maximize the overlap between data transfer and checksum computation for astronomical observation data. Specifically, we examine file-level and block-level (with various block sizes) pipelining to overlap data transfer and checksum computation. We analyze these pipelining approaches in the context of GridFTP, a widely used protocol for scientific data transfers. Theoretical analysis and experiments are conducted to evaluate our methods. The results show that block-level pipelining is effective in maximizing this overlap, and it can improve the overall data transfer time with end-to-end integrity verification by up to 70% compared to sequential execution of transfer and checksum, and by up to 60% compared to file-level pipelining.
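
    A minimal sketch of the block-level pipelining idea, using Python threads and SHA-256 as stand-ins (GridFTP itself and the paper's exact scheme are not involved): while a writer thread stores one block at the destination, the main thread is already hashing it, so transfer and checksum computation overlap.

        import hashlib
        import queue
        import threading

        BLOCK_SIZE = 4 * 1024 * 1024  # illustrative block size

        def transfer_with_checksum(src_path, dst_path):
            """Copy src to dst block by block while computing a SHA-256 checksum in parallel."""
            blocks = queue.Queue(maxsize=8)
            digest = hashlib.sha256()

            def writer():
                with open(dst_path, "wb") as dst:
                    while True:
                        block = blocks.get()
                        if block is None:        # sentinel: transfer finished
                            break
                        dst.write(block)

            t = threading.Thread(target=writer)
            t.start()
            with open(src_path, "rb") as src:
                while True:
                    block = src.read(BLOCK_SIZE)
                    if not block:
                        break
                    blocks.put(block)            # hand the block to the writer thread...
                    digest.update(block)         # ...and checksum it while it is being written
            blocks.put(None)
            t.join()
            return digest.hexdigest()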

  • Delay Distribution Based Remote Data Fetch Scheme for Hadoop Clusters in Public Cloud

    Ravindra Sandaruwan RANAWEERA  Eiji OKI  Nattapong KITSUWAN  

     
    PAPER-Network
    Publicized: 2019/02/04  Vol: E102-B No:8  Page(s): 1617-1625

    Apache Hadoop and its ecosystem have become the de facto platform for processing large-scale data, or big data, because they hide the complexity of distributed computing, scheduling, and communication while providing fault tolerance. Cloud-based environments are becoming a popular platform for hosting Hadoop clusters because of their low initial cost and virtually limitless capacity. However, cloud-based Hadoop clusters bring their own challenges due to contradictory design principles: Hadoop is designed on the shared-nothing principle, while the cloud is based on the concepts of consolidation and resource sharing. Most of Hadoop's features are designed for on-premises data centers where the cluster topology is known. Hadoop depends on the rack assignment of servers (configured by the cluster administrator) to calculate the distance between servers, and it uses this distance to find the best remote server from which to fetch non-local data. However, public cloud providers do not share the rack information of virtual servers with their tenants. Without rack information, Hadoop may fetch data from a remote server on the other side of the data center. To overcome this problem, we propose a delay-distribution-based scheme for finding the closest server from which to fetch non-local data in public cloud-based Hadoop clusters. The proposed scheme bases server selection on the delay distributions between server pairs, which are obtained by periodically measuring the round-trip time between servers. Our experiments show that the proposed scheme outperforms conventional Hadoop by nearly 12% in terms of non-local data fetch time. This reduction in data fetch time leads to a reduction in job run time, especially in real-world multi-user clusters where non-local data fetching can happen frequently.
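
    A minimal sketch of the delay-distribution idea (class and method names are hypothetical, not Hadoop's API): keep a sliding window of periodically measured round-trip times per server and fetch non-local data from the server whose RTT distribution is lowest.

        import statistics
        from collections import defaultdict, deque

        class DelayAwareSelector:
            """Pick the 'closest' server based on measured RTT distributions."""

            def __init__(self, window=50):
                self.samples = defaultdict(lambda: deque(maxlen=window))

            def record_rtt(self, server, rtt_ms):
                """Store one periodic RTT measurement for a server."""
                self.samples[server].append(rtt_ms)

            def closest(self, candidates):
                """Return the candidate with the lowest median RTT seen so far."""
                return min(candidates,
                           key=lambda s: statistics.median(self.samples[s])
                           if self.samples[s] else float("inf"))

        selector = DelayAwareSelector()
        selector.record_rtt("dn-1", 0.4)
        selector.record_rtt("dn-2", 2.1)
        print(selector.closest(["dn-1", "dn-2"]))  # -> dn-1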

  • lcyanalysis: An R Package for Technical Analysis in Stock Markets

    Chun-Yu LIU  Shu-Nung YAO  Ying-Jen CHEN  

     
    PAPER-Office Information Systems, e-Business Modeling
    Publicized: 2019/03/26  Vol: E102-D No:7  Page(s): 1332-1341

    With advances in information technology and the development of big data, manual operation is unlikely to be a smart choice for stock market investing. Instead, computer-based investment models are expected to give investors more accurate strategic analysis and more effective investment decisions than human judgment. This paper aims to improve investor profits by mining critical information in stock data, thereby supporting big data analysis. We used the R language to compute technical indicators for the stock market and then applied these indicators to price prediction. The proposed R package includes several analysis toolkits, such as trend line indicators, W-type reversal patterns, V-type reversal patterns, and bull or bear market detection. The simulation results suggest that the developed R package can accurately present price tendencies and enhance the return on investment.
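
    The package itself is written in R; purely as an illustration of the simplest indicator family it covers, the Python sketch below computes a moving-average trend signal (the window lengths and the signal rule are assumptions, not the package's API).

        def sma(prices, window):
            """Simple moving average of a closing-price series."""
            return [sum(prices[i - window + 1:i + 1]) / window
                    for i in range(window - 1, len(prices))]

        def crossover_signals(prices, short=5, long=20):
            """Emit (index, 'buy'/'sell') when the short SMA crosses the long SMA;
            the index refers to the long-SMA series."""
            s, l = sma(prices, short), sma(prices, long)
            offset = long - short
            signals = []
            for i in range(1, len(l)):
                prev = s[i - 1 + offset] - l[i - 1]
                curr = s[i + offset] - l[i]
                if prev <= 0 < curr:
                    signals.append((i, "buy"))
                elif prev >= 0 > curr:
                    signals.append((i, "sell"))
            return signals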

  • Medical Healthcare Network Platform and Big Data Analysis Based on Integrated ICT and Data Science with Regulatory Science Open Access

    Ryuji KOHNO  Takumi KOBAYASHI  Chika SUGIMOTO  Yukihiro KINJO  Matti HÄMÄLÄINEN  Jari IINATTI  

     
    INVITED PAPER
    Publicized: 2018/12/19  Vol: E102-B No:6  Page(s): 1078-1087

    This paper provides perspectives on future medical healthcare social services and businesses that integrate advanced information and communication technology (ICT) and data science. First, we propose a universal medical healthcare platform that consists of a wireless body area network (BAN), a cloud network and edge computers, and a big data mining server and repository with machine learning. Technical aspects of the platform are discussed, including the requirements of reliability, safety, and security, i.e., so-called dependability, and novel technologies for satisfying these requirements are introduced. Then the primary uses of the platform for personalized medicine and regulatory compliance, and its secondary uses for commercial business and sustainable operation, are discussed. We aim to operate the universal medical healthcare platform, which is based on the principles of regulatory science, both regionally and globally. Trials carried out in Kanagawa, Japan, and Oulu, Finland, are described to illustrate a future medical healthcare social infrastructure that can be expanded to the Asia-Pacific region, Europe, and the rest of the world. We also present the activities of the Kanagawa medical device regulatory science center and a joint proposal on security in the dependable medical healthcare platform. Novel schemes of ubiquitous rehabilitation, based on analyses of the training effect through remote monitoring of activities and machine learning of the patient's electrocardiography (ECG) with a neural network, are proposed and briefly investigated.

  • Hadoop I/O Performance Improvement by File Layout Optimization

    Eita FUJISHIMA  Kenji NAKASHIMA  Saneyasu YAMAGUCHI  

     
    PAPER-Data Engineering, Web Information Systems
    Publicized: 2017/11/22  Vol: E101-D No:2  Page(s): 415-427

    Hadoop is a popular open-source MapReduce implementation. In jobs such as TeraSort, in which the large output files of all relevant Map tasks are transmitted to the Reduce tasks, the Reduce tasks become the bottleneck and are I/O bound because they process many large output files. In most such cases, including TeraSort, the intermediate data, which include the output files of the Map tasks, are large and accessed sequentially. To improve the performance of these jobs, it is important to increase sequential access performance. In this paper, we propose methods for improving the performance of the Reduce tasks of such jobs by taking two points into account: these files are accessed sequentially on an HDD, and each zone of an HDD has different sequential I/O performance. The proposed methods control where intermediate data are stored by modifying the block bitmap of the filesystem, which manages the utilization (free or used) of blocks on the HDD. In addition, we propose a striping layout for applying these methods to virtualized environments that use image files. We then present a performance evaluation of the proposed methods and demonstrate that they improve Hadoop application performance.
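
    As a highly simplified illustration of the placement policy (the actual method rewrites the filesystem's block bitmap, which is not shown here; names, numbers, and units below are assumptions), intermediate data could be steered to the HDD zone with the best measured sequential throughput that still has enough free blocks:

        def pick_zone_for_intermediate_data(zone_seq_mb_s, zone_free_blocks, needed_blocks):
            """Choose the HDD zone with the highest sequential throughput that has room."""
            candidates = [z for z, free in zone_free_blocks.items() if free >= needed_blocks]
            if not candidates:
                return None
            return max(candidates, key=lambda z: zone_seq_mb_s[z])

        # Outer zones of an HDD are typically the fastest:
        print(pick_zone_for_intermediate_data({"outer": 190, "middle": 150, "inner": 95},
                                              {"outer": 80_000, "middle": 500_000, "inner": 900_000},
                                              60_000))  # -> outer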

  • Study on Record Linkage of Anonymized Data

    Hiroaki KIKUCHI  Takayasu YAMAGUCHI  Koki HAMADA  Yuji YAMAOKA  Hidenobu OGURI  Jun SAKUMA  

     
    INVITED PAPER
    Vol: E101-A No:1  Page(s): 19-28

    Data anonymization is required before a big-data business can run effectively without compromising the privacy of the personal information it uses. It is not trivial to choose the best algorithm for anonymizing given data securely for a given purpose. Accurately assessing the risk of the data being compromised requires balancing utility and security. Therefore, using common pseudo microdata, we propose a competition for the best anonymization and re-identification algorithms. This paper reports the results of the competition and an analysis of the effectiveness of the anonymization techniques. The competition results reveal that there is a tradeoff between utility and security, and that 20.9% of records were re-identified on average.

  • Fully Parallelized LZW Decompression for CUDA-Enabled GPUs

    Shunji FUNASAKA  Koji NAKANO  Yasuaki ITO  

     
    PAPER-GPU computing
    Publicized: 2016/08/25  Vol: E99-D No:12  Page(s): 2986-2994

    The main contribution of this paper is to present a work-optimal parallel algorithm for LZW decompression and to implement it on a CUDA-enabled GPU. Since sequential LZW decompression creates a dictionary table by reading the codes in a compressed file one by one, it is not easy to parallelize. We first present a work-optimal parallel LZW decompression algorithm on the CREW-PRAM (Concurrent-Read Exclusive-Write Parallel Random Access Machine), which is a standard theoretical parallel computing model with a shared memory. We then present an efficient implementation of this parallel algorithm on a GPU. The experimental results show that our GPU implementation performs LZW decompression in 1.15 milliseconds for a grayscale TIFF image with 4096×3072 pixels stored in the global memory of a GeForce GTX 980. In contrast, sequential LZW decompression of the same image stored in the main memory of an Intel Core i7 CPU takes 50.1 milliseconds. Thus, our parallel LZW decompression on the global memory of the GPU is 43.6 times faster than sequential LZW decompression on the main memory of the CPU for this image. To show the applicability of our GPU implementation of LZW decompression, we evaluated the SSD-GPU data loading time for three scenarios. The experimental results show that the scenario using our LZW decompression on the GPU is faster than the others.
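
    For contrast with the parallel algorithm of the paper (which is not reproduced here), a standard sequential LZW decoder shows why the task is hard to parallelize: each code may refer to a dictionary entry created by the code immediately before it.

        def lzw_decompress(codes):
            """Standard sequential LZW decoding over a list of integer codes."""
            table = {i: bytes([i]) for i in range(256)}
            prev = table[codes[0]]
            out = [prev]
            for code in codes[1:]:
                if code in table:
                    entry = table[code]
                else:                                  # the "cScSc" special case
                    entry = prev + prev[:1]
                out.append(entry)
                table[len(table)] = prev + entry[:1]   # dictionary grows one entry per code
                prev = entry
            return b"".join(out)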

  • Max-Min-Degree Neural Network for Centralized-Decentralized Collaborative Computing

    Yiqiang SHENG  Jinlin WANG  Chaopeng LI  Weining QI  

     
    PAPER
    Vol: E99-B No:4  Page(s): 841-848

    In this paper, we propose an undirected model of learning systems, named the max-min-degree neural network, to realize centralized-decentralized collaborative computing. The basic idea of the proposal is a max-min-degree constraint, which extends the k-degree constraint to reduce the communication cost, where k is a user-defined degree of the neurons. The max-min-degree constraint requires the degree of each neuron to lie between kmin and kmax. Accordingly, the Boltzmann machine is a special case of the proposal with kmin=kmax=n, where n is the fully connected degree of the neurons. Evaluations show that the proposal is much better than a state-of-the-art deep learning model with respect to communication cost. The price of this improvement is a slower convergence speed with respect to data size, but this is not a major concern in big data processing.
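
    A minimal sketch of the constraint itself (the graph representation and function name are assumptions): every neuron's degree must fall within [kmin, kmax], and kmin = kmax = n recovers the fully connected Boltzmann machine.

        def satisfies_max_min_degree(adjacency, k_min, k_max):
            """adjacency maps each neuron to the set of neurons it is connected to."""
            return all(k_min <= len(neigh) <= k_max for neigh in adjacency.values())

        ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}  # every neuron has degree 2
        print(satisfies_max_min_degree(ring, 2, 3))          # -> True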

  • k-Degree Layer-Wise Network for Geo-Distributed Computing between Cloud and IoT

    Yiqiang SHENG  Jinlin WANG  Haojiang DENG  Chaopeng LI  

     
    PAPER
    Vol: E99-B No:2  Page(s): 307-314

    In this paper, we propose a novel architecture for a deep learning system, named the k-degree layer-wise network, to realize efficient geo-distributed computing between the cloud and the Internet of Things (IoT). Geo-distributed computing extends the cloud to the geographical edge of the network, in the neighborhood of IoT devices. The basic ideas of the proposal are a k-degree constraint and a layer-wise constraint. The k-degree constraint requires the degree of each vertex on the h-th layer to be exactly k(h), which extends existing deep belief networks and controls the communication cost. The layer-wise constraint requires the layer-wise degrees to be monotonically decreasing in the forward direction, so that the dimension of the data is gradually reduced. We prove that the k-degree layer-wise network is sparse, whereas a typical deep neural network is dense. In an evaluation on the M-distributed MNIST database, the proposal is superior to a state-of-the-art model in terms of communication cost and learning time, with good scalability.
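
    To see why the decreasing layer-wise degree constraint makes the network sparse (the layer sizes and degrees below are made-up numbers, not the paper's configuration), compare the edge count against a dense fully connected counterpart:

        def layer_wise_edges(layer_sizes, k):
            """Edges when every vertex on layer h keeps only k[h] links to the next layer,
            versus a dense fully connected network with the same layer sizes."""
            sparse = sum(n * k[h] for h, n in enumerate(layer_sizes[:-1]))
            dense = sum(layer_sizes[h] * layer_sizes[h + 1]
                        for h in range(len(layer_sizes) - 1))
            return sparse, dense

        print(layer_wise_edges([784, 256, 64, 10], k=[32, 16, 8]))  # far fewer edges than dense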

  • Managing the Synchronization in the Lambda Architecture for Optimized Big Data Analysis Open Access

    Thomas VANHOVE  Gregory VAN SEGHBROECK  Tim WAUTERS  Bruno VOLCKAERT  Filip DE TURCK  

     
    INVITED PAPER
    Vol: E99-B No:2  Page(s): 297-306

    In a world of continuously expanding amounts of data, retrieving interesting information from enormous data sets becomes more complex every day. Solutions for precomputing views on these big data sets mostly follow either an offline approach, which is slow but can take the entire data set into account, or a streaming approach, which is fast but relies only on the latest data entries. A hybrid solution was introduced through the Lambda architecture concept: it combines both approaches by analyzing data in a fast speed layer first and in a slower batch layer later. However, this introduces a new synchronization challenge: once the data has been analyzed by the batch layer, the corresponding information needs to be removed from the speed layer without introducing redundancy or loss of data. In this paper, we propose a new approach to implementing the Lambda architecture concept independently of the technologies used for offline and stream computing. A universal solution is provided to manage the complex synchronization introduced by the Lambda architecture, together with techniques to provide fault tolerance. The proposed solution is evaluated by means of detailed experimental results.
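
    A minimal sketch of the synchronization problem being addressed (the data structures and function names are hypothetical): a read merges the batch view with the speed-layer view, and after each batch run the speed layer must drop exactly the entries the batch layer has absorbed, so nothing is counted twice and nothing is lost.

        def query(batch_view, speed_view, key):
            """Serve a read by combining the precomputed batch view with the speed-layer view."""
            return batch_view.get(key, 0) + speed_view.get(key, 0)

        def batch_handoff(batch_view, speed_view, absorbed_keys):
            """After a batch run, trim the speed-layer entries the batch layer now covers."""
            for key in absorbed_keys:
                speed_view.pop(key, None)

        batch, speed = {"page:/home": 1000}, {"page:/home": 7}
        print(query(batch, speed, "page:/home"))      # 1007
        batch["page:/home"] += 7                      # the batch layer catches up...
        batch_handoff(batch, speed, ["page:/home"])   # ...and the speed layer is trimmed
        print(query(batch, speed, "page:/home"))      # still 1007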

  • k-Dominant Skyline Query Computation in MapReduce Environment

    Md. Anisuzzaman SIDDIQUE  Hao TIAN  Yasuhiko MORIMOTO  

     
    PAPER
    Publicized: 2015/01/21  Vol: E98-D No:5  Page(s): 1027-1034

    Filtering out uninteresting data is important for utilizing "big data". The skyline query is a popular filtering technique: it selects, from a given large database, the set of objects that are not dominated by any other object. However, a skyline query often retrieves too many objects to analyze intensively, especially for high-dimensional datasets. To solve this problem, k-dominant skyline queries have been introduced. Moreover, databases sometimes become too large to process in a centralized environment, and conventional algorithms for computing k-dominant skyline queries are not well suited to parallel and distributed environments such as the MapReduce framework. In this paper, we propose an efficient parallel algorithm for processing k-dominant skyline queries in the MapReduce framework. Extensive experiments demonstrate the scalability of the proposed algorithm for synthetic big datasets under different settings of data distribution, dimensionality, and cardinality.
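
    For reference, the k-dominance test itself and a naive single-machine skyline (not the MapReduce algorithm of the paper; smaller attribute values are assumed to be better):

        def k_dominates(p, q, k):
            """p k-dominates q if p is no worse than q on at least k dimensions
            and strictly better on at least one of those dimensions."""
            better_or_equal = [i for i in range(len(p)) if p[i] <= q[i]]
            if len(better_or_equal) < k:
                return False
            return any(p[i] < q[i] for i in better_or_equal)

        def k_dominant_skyline(points, k):
            """Keep the points that no other point k-dominates (naive reference)."""
            return [p for p in points
                    if not any(k_dominates(q, p, k) for q in points if q != p)]

        print(k_dominant_skyline([(1, 2, 9), (2, 1, 1), (3, 3, 3)], k=2))  # -> [(2, 1, 1)]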

  • The History of and Prospects for ITS with a Focus on Car Navigation Systems

    Tsuneo TAKAHASHI  

     
    INVITED PAPER
    Vol: E98-A No:1  Page(s): 251-258

    ITS refers to advanced transportation systems in which control technology and information and communication technology are applied to cope with issues concerning safety, congestion, the environment, resource usage, and so on. Here, we review the history of ITS and consider its prospects for the future, with a focus on the rise of car navigation systems in Japan.

  • Tuning GridFTP Pipelining, Concurrency and Parallelism Based on Historical Data

    Jangyoung KIM  

     
    LETTER-Information Network
    Publicized: 2014/07/28  Vol: E97-D No:11  Page(s): 2963-2966

    This paper presents a prediction model based on historical data for achieving optimal values of pipelining, concurrency, and parallelism (PCP) in GridFTP data transfers in cloud systems. Setting the correct values for these three parameters is crucial to achieving high throughput in end-to-end data movement. However, predicting and setting the optimal values for these parameters is a challenging task, especially under shared and unpredictable network conditions. Several factors can affect the optimal values of these parameters, such as background network traffic, available bandwidth, round-trip time (RTT), TCP buffer size, and file size. Existing models either fail to provide accurate predictions or come with very high prediction overheads. The author shows that the new history-based model can achieve high accuracy with low overhead.
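
    The letter does not specify the model, so the sketch below is just one plausible history-based approach (all field names are hypothetical): find past transfers with a similar file size, RTT, and bandwidth, and reuse the PCP setting that achieved the best throughput among them.

        import math

        def predict_pcp(history, file_size, rtt_ms, bandwidth_mbps):
            """Nearest-neighbour lookup over logged transfers; each history entry is a dict."""
            def distance(entry):
                return math.sqrt(
                    (math.log10(entry["file_size"]) - math.log10(file_size)) ** 2 +
                    ((entry["rtt_ms"] - rtt_ms) / 100.0) ** 2 +
                    ((entry["bandwidth_mbps"] - bandwidth_mbps) / 1000.0) ** 2)
            neighbours = sorted(history, key=distance)[:5]
            best = max(neighbours, key=lambda e: e["throughput_mbps"])
            return best["pipelining"], best["concurrency"], best["parallelism"]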