16-bit Asian-language character codes cannot be compressed well by conventional text compression schemes that sample 8 bits at a time. Previously, we reported the application of a word-based text compression method using 16-bit sampling to Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method supports a multilingual environment: the word dictionary and the canonical Huffman code table are simply replaced with those for the language at hand. Computer simulations showed that the method is effective for both languages, yielding a compression ratio slightly below 0.5 without Markov context and around 0.4 when the first-order Markov context is taken into account.
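The core component named in the abstract, a static canonical Huffman code over a word dictionary, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper's 16-bit sampling, word segmentation, and Markov-context modeling are omitted, and the toy word list is invented for demonstration. A canonical code is useful here because only the code lengths (not the codes themselves) need to be stored alongside the dictionary.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    # Build a Huffman tree over (frequency, tiebreak, symbols) and
    # return the resulting code length for each symbol.
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {next(iter(freqs)): 1}
    lengths = {s: 0 for s in freqs}
    tick = len(heap)  # unique tiebreaker so tuples never compare lists
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:       # every symbol under the merged
            lengths[s] += 1           # node sinks one level deeper
        heapq.heappush(heap, (f1 + f2, tick, syms1 + syms2))
        tick += 1
    return lengths

def canonical_codes(lengths):
    # Assign canonical codes: sort symbols by (length, symbol) and hand
    # out consecutive codewords, left-shifting when the length grows.
    code, prev_len, codes = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

# Toy "word dictionary" built from a tiny sample text (illustrative only).
words = "the cat sat on the mat the cat".split()
codes = canonical_codes(huffman_code_lengths(Counter(words)))
encoded = "".join(codes[w] for w in words)
```

Swapping `Counter(words)` for frequencies gathered from a Japanese or Chinese word dictionary changes nothing in the coder itself, which is the multilingual property the abstract points at.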
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Shigeru YOSHIDA, Takashi MORIHARA, Hironori YAHAGI, Noriko ITANI, "Application of a Word-Based Text Compression Method to Japanese and Chinese Texts" in IEICE TRANSACTIONS on Fundamentals,
vol. E85-A, no. 12, pp. 2933-2938, December 2002.
Abstract: 16-bit Asian-language character codes cannot be compressed well by conventional text compression schemes that sample 8 bits at a time. Previously, we reported the application of a word-based text compression method using 16-bit sampling to Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method supports a multilingual environment: the word dictionary and the canonical Huffman code table are simply replaced with those for the language at hand. Computer simulations showed that the method is effective for both languages, yielding a compression ratio slightly below 0.5 without Markov context and around 0.4 when the first-order Markov context is taken into account.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/e85-a_12_2933/_p
@ARTICLE{e85-a_12_2933,
author={Shigeru YOSHIDA and Takashi MORIHARA and Hironori YAHAGI and Noriko ITANI},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Application of a Word-Based Text Compression Method to Japanese and Chinese Texts},
year={2002},
volume={E85-A},
number={12},
pages={2933--2938},
abstract={16-bit Asian-language character codes cannot be compressed well by conventional text compression schemes that sample 8 bits at a time. Previously, we reported the application of a word-based text compression method using 16-bit sampling to Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method supports a multilingual environment: the word dictionary and the canonical Huffman code table are simply replaced with those for the language at hand. Computer simulations showed that the method is effective for both languages, yielding a compression ratio slightly below 0.5 without Markov context and around 0.4 when the first-order Markov context is taken into account.},
keywords={},
doi={},
ISSN={},
month={December},}
TY - JOUR
TI - Application of a Word-Based Text Compression Method to Japanese and Chinese Texts
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 2933
EP - 2938
AU - Shigeru YOSHIDA
AU - Takashi MORIHARA
AU - Hironori YAHAGI
AU - Noriko ITANI
PY - 2002
DO -
JO - IEICE TRANSACTIONS on Fundamentals
SN -
VL - E85-A
IS - 12
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - December 2002
AB - 16-bit Asian-language character codes cannot be compressed well by conventional text compression schemes that sample 8 bits at a time. Previously, we reported the application of a word-based text compression method using 16-bit sampling to Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method supports a multilingual environment: the word dictionary and the canonical Huffman code table are simply replaced with those for the language at hand. Computer simulations showed that the method is effective for both languages, yielding a compression ratio slightly below 0.5 without Markov context and around 0.4 when the first-order Markov context is taken into account.
ER -