Keyword Search Result

[Keyword] lossless data compression (9 hits)

Results 1-9 of 9
  • Compression by Substring Enumeration Using Sorted Contingency Tables

    Takahiro OTA  Hiroyoshi MORITA  Akiko MANADA

    PAPER-Information Theory, Vol. E103-A, No. 6, pp. 829-835

    This paper proposes two improved variants of Compression by Substring Enumeration (CSE) with a finite alphabet. In previous studies on CSE, the encoder uses inequalities that bound the number of occurrences of the substring or minimal forbidden word (MFW) to be encoded; the inequalities are derived from a contingency table that contains the occurrence counts of substrings and MFWs. The codeword length of a substring or an MFW grows with the gap between the upper and lower bounds deduced from the inequalities, but the lower bound is not tight. We therefore derive a new, tight lower bound from the contingency table and propose a CSE algorithm that uses the resulting inequality. We also propose a new encoding order for substrings and MFWs based on a sorted contingency table, in which both the row and column marginal totals are sorted in descending order instead of the lexicographic order used in previous studies, and we propose the first CSE algorithm that uses this encoding order. Experimental results show that, on every file of the Calgary corpus, the compression ratios of the proposed algorithms are better than those of a previous study on CSE with a finite alphabet. Moreover, the second proposed algorithm achieves compression ratios better than or equal to those of a well-known compressor on 11 of the 14 files in the corpus.
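
    As a rough illustration of the data structure involved, the following sketch builds a contingency table of occurrence counts N(awb) for a fixed substring w and reorders its rows and columns so that both marginal totals are descending, in the spirit of the sorted encoding order described above. This is a toy sketch under assumed names, not the authors' encoder: MFW bookkeeping, cyclic occurrence counting, and the coding step itself are all omitted.

      def contingency_table(x, w, alphabet):
          """Count occurrences N(awb): substring w extended by a left
          symbol a and a right symbol b (interior occurrences only)."""
          table = {a: {b: 0 for b in alphabet} for a in alphabet}
          m = len(w)
          for i in range(len(x) - m - 1):
              if x[i + 1:i + 1 + m] == w:
                  table[x[i]][x[i + 1 + m]] += 1
          return table

      def sorted_table(table):
          """Reorder rows and columns by descending marginal totals,
          mimicking the sorted contingency table of the abstract."""
          rows = sorted(table, key=lambda a: -sum(table[a].values()))
          cols = sorted(next(iter(table.values())),
                        key=lambda b: -sum(table[a][b] for a in table))
          return rows, cols, [[table[a][b] for b in cols] for a in rows]

      print(sorted_table(contingency_table("mississippi", "s", "imps")))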

  • A Variable-to-Fixed Length Lossless Source Code Attaining Better Performance than Tunstall Code in Several Criterions

    Mitsuharu ARIMURA

    PAPER-Information Theory, Vol. E101-A, No. 1, pp. 249-258

    The Tunstall code is known to be an optimal variable-to-fixed length (VF) lossless source code under the criterion of average coding rate, defined as the codeword length divided by the average phrase length. In this paper we define the average coding rate of a VF code as the expectation of the pointwise coding rate, i.e., the codeword length divided by the phrase length, and call it the average pointwise coding rate. We propose a new VF code, built by an incremental parsing-tree construction algorithm similar to the one that builds the Tunstall parsing tree. We prove that the code is optimal under the average pointwise coding rate criterion, that its average pointwise coding rate converges asymptotically to the entropy of the stationary memoryless source emitting the data, and that it attains a better worst-case coding rate than the Tunstall code.
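
    For reference, the classical greedy Tunstall construction that the abstract's incremental algorithm resembles can be sketched in a few lines: starting from the single-symbol phrases, the most probable leaf is repeatedly split into its children until the leaf budget is exhausted. This is the textbook algorithm, not the paper's proposed variant; names and parameters are our own.

      import heapq

      def tunstall_tree(probs, max_leaves):
          """Greedy Tunstall parsing tree for a memoryless source.
          probs maps each symbol to its probability; returns the
          complete set of (phrase, probability) leaves."""
          # Max-heap via negated probabilities, seeded with one-symbol phrases.
          heap = [(-p, s) for s, p in probs.items()]
          heapq.heapify(heap)
          # Splitting a leaf removes it and adds len(probs) children.
          while len(heap) + len(probs) - 1 <= max_leaves:
              neg_p, phrase = heapq.heappop(heap)
              for s, p in probs.items():
                  heapq.heappush(heap, (neg_p * p, phrase + s))
          return [(phrase, -neg_p) for neg_p, phrase in heap]

      # 8 leaves for a Bernoulli(0.7) source, hence 3-bit codewords.
      print(tunstall_tree({"a": 0.7, "b": 0.3}, 8))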

  • Average Coding Rate of a Multi-Shot Tunstall Code with an Arbitrary Parsing Tree Sequence

    Mitsuharu ARIMURA

    LETTER-Source Coding and Data Compression, Vol. E99-A, No. 12, pp. 2281-2285

    The average coding rate of a multi-shot Tunstall code, a variation of variable-to-fixed length (VF) lossless source codes, is investigated for stationary memoryless sources. A multi-shot VF code parses a given source sequence into variable-length blocks and encodes them into fixed-length codewords. When the parsing count is fixed, the overall multi-shot VF code can be treated as a one-shot VF code. In this setting, the compression performance of the Tunstall code is evaluated under two criteria: the average coding rate, defined as the codeword length divided by the average block length, and the expectation of the pointwise coding rate. It is proved that both average coding rates converge to the entropy of a stationary memoryless source under the assumption that the geometric mean of the leaf counts of the multi-shot Tunstall parsing trees tends to infinity.
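
    The two criteria are easy to compare numerically. Under assumed notation (fixed codeword length of ceil(log2 m) bits for m leaves), the sketch below evaluates both rates for a complete parsing tree given as (phrase, probability) pairs; by Jensen's inequality the pointwise version is never smaller.

      from math import ceil, log2

      def rate_criteria(leaves):
          """leaves: (phrase, probability) pairs of a complete parsing
          tree. Returns (average coding rate, expected pointwise rate)."""
          L = ceil(log2(len(leaves)))                 # fixed codeword length, bits
          avg_phrase = sum(p * len(w) for w, p in leaves)
          rate_avg = L / avg_phrase                          # criterion 1
          rate_pw = sum(p * L / len(w) for w, p in leaves)   # criterion 2
          return rate_avg, rate_pw

      # Complete 3-leaf parsing tree for a Bernoulli(0.7) source.
      print(rate_criteria([("aa", 0.49), ("ab", 0.21), ("b", 0.3)]))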

  • Lossless Data Compression via Substring Enumeration for k-th Order Markov Sources with a Finite Alphabet

    Ken-ichi IWATA  Mitsuharu ARIMURA

    PAPER-Source Coding and Data Compression, Vol. E99-A, No. 12, pp. 2130-2135

    A generalization of compression via substring enumeration (CSE) to k-th order Markov sources with a finite alphabet is proposed, together with an upper bound on the codeword length of the proposed method. We analyze the worst-case maximum redundancy of CSE for k-th order Markov sources with a finite alphabet. The compression ratio of the proposed method converges asymptotically to the optimum for such sources as the length n of the source string tends to infinity.
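
    As a concrete benchmark for the optimality claim, the sketch below estimates the k-th order conditional empirical entropy of a string, the per-symbol rate that an optimal code for a k-th order Markov source approaches. Boundary effects are handled naively and the function name is our own.

      from collections import Counter
      from math import log2

      def empirical_entropy(x, k):
          """Naive k-th order conditional empirical entropy of x,
          in bits per symbol (the first k positions are ignored)."""
          ctx = Counter(x[i:i + k] for i in range(len(x) - k))
          ext = Counter(x[i:i + k + 1] for i in range(len(x) - k))
          n = len(x) - k
          return sum(c * log2(ctx[w[:k]] / c) for w, c in ext.items()) / n

      print(empirical_entropy("abababababab", 1))  # 0.0: deterministic given context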

  • Almost Sure Convergence Coding Theorems of One-Shot and Multi-Shot Tunstall Codes for Stationary Memoryless Sources

    Mitsuharu ARIMURA

    PAPER-Source Coding, Vol. E98-A, No. 12, pp. 2393-2406

    Almost-sure convergence coding theorems for one-shot and multi-shot Tunstall codes are proved for stationary memoryless sources. The theorem for the one-shot Tunstall code is proved for the case in which the leaf count of the Tunstall tree increases, whereas the theorem for the multi-shot Tunstall code is proved for an increasing parsing count, under the assumption that the Tunstall tree grows as parsing proceeds. It is clarified that the one-shot theorem is not a corollary of the multi-shot theorem. In the multi-shot case, the coding theorem can be regarded as covering the sequential algorithm in which parsing and coding are carried out repeatedly. The Cartesian concatenation of trees and the geometric mean of the leaf counts of trees are newly introduced, and both play crucial roles in the analysis of the multi-shot Tunstall code.
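
    For concreteness, if the successive parsing trees have leaf counts m_1, ..., m_t, the geometric mean referred to above is (in assumed notation, not necessarily the paper's)

      \bar{m}_t = \Bigl(\prod_{i=1}^{t} m_i\Bigr)^{1/t},

    and the companion letter above states its convergence result under a growth condition of exactly this type, namely \bar{m}_t \to \infty as the parsing count t \to \infty.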

  • Evaluation of Maximum Redundancy of Data Compression via Substring Enumeration for k-th Order Markov Sources

    Ken-ichi IWATA  Mitsuharu ARIMURA  Yuki SHIMA

    PAPER-Information Theory, Vol. E97-A, No. 8, pp. 1754-1760

    Dubé and Beaudoin proposed a lossless data compression technique called compression via substring enumeration (CSE) in 2010. We derive an upper bound on the number of bits the CSE technique uses to encode any binary string emitted by an unknown member of a known class of k-th order Markov processes. We then compare the worst-case maximum redundancy achieved by the CSE technique on any binary string with the least possible worst-case maximum redundancy achieved by the best fixed-to-variable length code satisfying the Kraft inequality.
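
    For orientation, a common formalization of the quantity being compared (our notation, not necessarily the paper's): for a code c with length function \ell_c and the class \mathcal{M}_k of k-th order Markov sources, the worst-case maximum redundancy on binary strings of length n is

      R_n(c) = \max_{P \in \mathcal{M}_k} \max_{x \in \{0,1\}^n} \bigl( \ell_c(x) - \log_2 \tfrac{1}{P(x)} \bigr),

    and the benchmark is \min_c R_n(c) over fixed-to-variable length codes whose lengths satisfy the Kraft inequality \sum_x 2^{-\ell_c(x)} \le 1.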

  • On the Average Coding Rate of the Tunstall Code for Stationary and Memoryless Sources

    Mitsuharu ARIMURA

    PAPER-Source Coding, Vol. E93-A, No. 11, pp. 1904-1911

    The coding rate of a one-shot Tunstall code for stationary and memoryless sources is investigated in the non-universal setting, in which the probability distribution of the source is known to both the encoder and the decoder. In studies of variable-to-fixed length codes, the average coding rate has been defined as (i) the codeword length divided by the average block length. We instead define the average coding rate as (ii) the expectation of the pointwise coding rate, and prove that (ii) converges to the same value as (i).
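
    In symbols (assumed notation, not necessarily the paper's): for a parsing tree with m leaves, fixed codeword length \lceil \log_2 m \rceil, and random phrase W, the two definitions read

      (i) \; r = \frac{\lceil \log_2 m \rceil}{E[|W|]}, \qquad (ii) \; r' = E\!\left[ \frac{\lceil \log_2 m \rceil}{|W|} \right],

    and the paper proves that (ii) converges to the same value as (i).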

  • Unequal Error Protection in Ziv-Lempel Coding

    Eiji FUJIWARA  Masato KITAKAMI

    PAPER-Dependable Communication, Vol. E86-D, No. 12, pp. 2595-2600

    Data compression is widely used in computer and communication systems, and lossless compression in particular is applied to text. Since compressed data are highly sensitive to errors, several error-control methods have been proposed for compression schemes based on probability models, such as arithmetic coding. This paper proposes applying an unequal error protection (UEP) scheme to LZ77 coding and LZW coding. It investigates the structure of the compressed data and, through theoretical analysis and computer simulation, identifies the part that is more sensitive to errors than the rest. The UEP scheme then protects this error-sensitive part more strongly than the others. Computer simulation shows that the proposed scheme recovers from errors in the compressed data more effectively than conventional methods.
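
    The asymmetry in error sensitivity is easy to reproduce with a toy LZW codec: corrupting an early code poisons the decoder's dictionary and garbles everything downstream, while a late error stays local. The sketch below is our own minimal illustration, not the paper's simulation setup.

      def lzw_encode(text):
          """Minimal LZW encoder over single characters (toy sketch)."""
          dic = {chr(i): i for i in range(256)}
          w, out = "", []
          for c in text:
              if w + c in dic:
                  w += c
              else:
                  out.append(dic[w])
                  dic[w + c] = len(dic)
                  w = c
          if w:
              out.append(dic[w])
          return out

      def lzw_decode(codes):
          """Matching decoder; rebuilds the dictionary on the fly."""
          dic = {i: chr(i) for i in range(256)}
          w = dic[codes[0]]
          out = [w]
          for k in codes[1:]:
              entry = dic[k] if k in dic else w + w[0]  # LZW special case
              out.append(entry)
              dic[len(dic)] = w + entry[0]
              w = entry
          return "".join(out)

      codes = lzw_encode("abababababababab")
      early = codes[:]; early[1] ^= 1     # flip a bit in an early code
      late = codes[:]; late[-1] ^= 1      # flip a bit in a late code
      print(lzw_decode(codes))            # original text
      print(lzw_decode(early))            # dictionary poisoned: widespread damage
      print(lzw_decode(late))             # damage confined to the tail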

  • Asymptotic Optimality of the Block Sorting Data Compression Algorithm

    Mitsuharu ARIMURA  Hirosuke YAMAMOTO

    PAPER-Source Coding, Vol. E81-A, No. 10, pp. 2117-2122

    In this paper the performance of the block sorting algorithm proposed by Burrows and Wheeler is evaluated theoretically. It is proved that the block sorting algorithm is asymptotically optimal for stationary ergodic finite-order Markov sources. Our proof rests on two facts: symbols that share the same Markov state (or context) in the original data sequence are grouped together in the output of the Burrows-Wheeler transform, and the codeword length of each group can be bounded by a function of the frequencies of the symbols in the group.
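
    The grouping property the proof relies on is visible even in a naive implementation of the transform: rotations beginning with the same context sort next to each other, so their preceding symbols, which form the output, cluster into runs. A minimal sketch (educational O(n^2 log n) construction; production implementations use suffix arrays):

      def bwt(s, eos="\x00"):
          """Burrows-Wheeler transform via sorted rotations. A unique
          end-of-string sentinel keeps the transform invertible."""
          s += eos
          rot = sorted(s[i:] + s[:i] for i in range(len(s)))
          return "".join(r[-1] for r in rot)

      print(repr(bwt("abracadabra")))  # equal-context symbols end up adjacent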