16-bit Asian-language character codes cannot be compressed well by conventional text compression schemes that sample 8 bits at a time. Previously, we reported the application of a word-based text compression method using 16-bit sampling to Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method supports a multilingual environment: the word dictionary and the canonical Huffman code table are simply replaced with those for the language at hand. Computer simulations showed that the method is effective for both languages, yielding a compression ratio slightly below 0.5 without Markov context and around 0.4 when the first-order Markov context is taken into account.
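The core component named in the abstract, a static canonical Huffman code over a word dictionary, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper's 16-bit sampling, word segmentation, and Markov-context modeling are omitted, and the toy word list is invented for demonstration. A canonical code is useful here because only the code lengths (not the codes themselves) need to be stored alongside the dictionary.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    # Build a Huffman tree over (frequency, tiebreak, symbols) and
    # return the resulting code length for each symbol.
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {next(iter(freqs)): 1}
    lengths = {s: 0 for s in freqs}
    tick = len(heap)  # unique tiebreaker so tuples never compare lists
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:       # every symbol under the merged
            lengths[s] += 1           # node sinks one level deeper
        heapq.heappush(heap, (f1 + f2, tick, syms1 + syms2))
        tick += 1
    return lengths

def canonical_codes(lengths):
    # Assign canonical codes: sort symbols by (length, symbol) and hand
    # out consecutive codewords, left-shifting when the length grows.
    code, prev_len, codes = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

# Toy "word dictionary" built from a tiny sample text (illustrative only).
words = "the cat sat on the mat the cat".split()
codes = canonical_codes(huffman_code_lengths(Counter(words)))
encoded = "".join(codes[w] for w in words)
```

Swapping `Counter(words)` for frequencies gathered from a Japanese or Chinese word dictionary changes nothing in the coder itself, which is the multilingual property the abstract points at.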
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Shigeru YOSHIDA, Takashi MORIHARA, Hironori YAHAGI, Noriko ITANI, "Application of a Word-Based Text Compression Method to Japanese and Chinese Texts" in IEICE TRANSACTIONS on Fundamentals,
vol. E85-A, no. 12, pp. 2933-2938, December 2002.
Abstract: 16-bit Asian-language character codes cannot be compressed well by conventional text compression schemes that sample 8 bits at a time. Previously, we reported the application of a word-based text compression method using 16-bit sampling to Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method supports a multilingual environment: the word dictionary and the canonical Huffman code table are simply replaced with those for the language at hand. Computer simulations showed that the method is effective for both languages, yielding a compression ratio slightly below 0.5 without Markov context and around 0.4 when the first-order Markov context is taken into account.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/e85-a_12_2933/_p
@ARTICLE{e85-a_12_2933,
author={Shigeru YOSHIDA and Takashi MORIHARA and Hironori YAHAGI and Noriko ITANI},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Application of a Word-Based Text Compression Method to Japanese and Chinese Texts},
year={2002},
volume={E85-A},
number={12},
pages={2933--2938},
abstract={16-bit Asian-language character codes cannot be compressed well by conventional text compression schemes that sample 8 bits at a time. Previously, we reported the application of a word-based text compression method using 16-bit sampling to Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method supports a multilingual environment: the word dictionary and the canonical Huffman code table are simply replaced with those for the language at hand. Computer simulations showed that the method is effective for both languages, yielding a compression ratio slightly below 0.5 without Markov context and around 0.4 when the first-order Markov context is taken into account.},
keywords={},
doi={},
ISSN={},
month={December},}
TY - JOUR
TI - Application of a Word-Based Text Compression Method to Japanese and Chinese Texts
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 2933
EP - 2938
AU - Shigeru YOSHIDA
AU - Takashi MORIHARA
AU - Hironori YAHAGI
AU - Noriko ITANI
PY - 2002
DO -
JO - IEICE TRANSACTIONS on Fundamentals
SN -
VL - E85-A
IS - 12
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - December 2002
AB - 16-bit Asian-language character codes cannot be compressed well by conventional text compression schemes that sample 8 bits at a time. Previously, we reported the application of a word-based text compression method using 16-bit sampling to Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method supports a multilingual environment: the word dictionary and the canonical Huffman code table are simply replaced with those for the language at hand. Computer simulations showed that the method is effective for both languages, yielding a compression ratio slightly below 0.5 without Markov context and around 0.4 when the first-order Markov context is taken into account.
ER -