Author Search Result

[Author] Gen HU (9 hits)

1-9 of 9 hits
  • Sentence-Embedding and Similarity via Hybrid Bidirectional-LSTM and CNN Utilizing Weighted-Pooling Attention

    Degen HUANG  Anil AHMED  Syed Yasser ARAFAT  Khawaja Iftekhar RASHID  Qasim ABBAS  Fuji REN  

     
    PAPER-Natural Language Processing

      Publicized:
    2020/08/27
      Vol:
    E103-D No:10
      Page(s):
    2216-2227

    Neural networks have received considerable attention in sentence similarity measurement systems due to their efficiency in dealing with semantic composition. However, existing neural network methods are not sufficiently effective at capturing the most significant semantic information buried in an input. To address this problem, a novel weighted-pooling attention layer is proposed to retain the most salient attention vector. It has already been established that long short-term memory and a convolutional neural network have a strong ability to accumulate enriched patterns of whole-sentence semantic representation. First, a sentence representation is generated by employing a siamese structure based on bidirectional long short-term memory and a convolutional neural network. Subsequently, a weighted-pooling attention layer is applied to obtain an attention vector. Finally, the attention vector pair is leveraged to calculate the sentence similarity score. Combining bidirectional long short-term memory with a convolutional neural network yields a model with enhanced information extraction and learning capacity. Investigations show that the proposed method outperforms state-of-the-art approaches on datasets for two tasks, namely semantic relatedness and Microsoft Research paraphrase identification. The new model improves learning capability and boosts similarity accuracy.
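
    For readers unfamiliar with the weighted-pooling attention idea, the following minimal PyTorch-style sketch shows one plausible reading of it: a learned scoring vector weights each BiLSTM time step and the hidden states are pooled by the normalized weights. The layer name, dimensions, and the omission of the CNN branch are simplifications for illustration, not the authors' exact formulation.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class WeightedPoolingAttention(nn.Module):
            """Score each time step with a learned vector, then pool the
            hidden states by the normalized scores (illustrative only)."""
            def __init__(self, hidden_dim):
                super().__init__()
                self.score = nn.Linear(hidden_dim, 1, bias=False)

            def forward(self, h):                                        # h: (batch, seq_len, hidden_dim)
                weights = F.softmax(self.score(h).squeeze(-1), dim=-1)   # (batch, seq_len)
                return torch.bmm(weights.unsqueeze(1), h).squeeze(1)     # (batch, hidden_dim)

        # Siamese use: encode both sentences with a shared BiLSTM (the CNN branch is
        # omitted here), pool each with the attention layer, then compare the vectors.
        encoder = nn.LSTM(input_size=300, hidden_size=128, bidirectional=True, batch_first=True)
        attend = WeightedPoolingAttention(hidden_dim=256)

        def sentence_vector(x):                                          # x: (batch, seq_len, 300) embeddings
            h, _ = encoder(x)
            return attend(h)

        def similarity(x1, x2):
            return F.cosine_similarity(sentence_vector(x1), sentence_vector(x2), dim=-1)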

  • Voting-Based Ensemble Classifiers to Detect Hedges and Their Scopes in Biomedical Texts

    Huiwei ZHOU  Xiaoyan LI  Degen HUANG  Yuansheng YANG  Fuji REN  

     
    PAPER-Artificial Intelligence, Data Mining

      Vol:
    E94-D No:10
      Page(s):
    1989-1997

    Previous studies in pattern recognition have shown that classifier ensemble approaches can lead to better recognition results. In this paper, we apply the voting technique to the CoNLL-2010 shared task on detecting hedge cues and their scope in biomedical texts. Six machine-learning-based systems are combined through three different voting schemes. We demonstrate the effectiveness of classifier ensemble approaches and compare the performance of the three voting schemes for hedge cue and scope detection. Experiments on the CoNLL-2010 evaluation data show that our best system achieves F-scores of 87.49% on the hedge detection task and 60.87% on the scope finding task, which are significantly better than those of previous systems.
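
    As a rough illustration of the simplest form a voting combination can take (token-level majority voting; the paper combines six systems under three schemes whose details are not given here), consider the following sketch.

        from collections import Counter

        def majority_vote(predictions):
            """predictions: one label sequence per base classifier, all aligned
            to the same tokens. Returns the per-token majority label."""
            assert len({len(p) for p in predictions}) == 1, "sequences must be aligned"
            return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

        # Example: three systems tagging hedge cues (B-CUE / I-CUE / O) on four tokens.
        systems = [
            ["O", "B-CUE", "I-CUE", "O"],
            ["O", "B-CUE", "O",     "O"],
            ["O", "B-CUE", "I-CUE", "O"],
        ]
        print(majority_vote(systems))   # ['O', 'B-CUE', 'I-CUE', 'O']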

  • An Active Transfer Learning Framework for Protein-Protein Interaction Extraction

    Lishuang LI  Xinyu HE  Jieqiong ZHENG  Degen HUANG  Fuji REN  

     
    PAPER-Natural Language Processing

      Publicized:
    2017/10/30
      Vol:
    E101-D No:2
      Page(s):
    504-511

    Protein-Protein Interaction Extraction (PPIE) from the biomedical literature is an important task in biomedical text mining and has achieved great success on public datasets. However, in real-world applications, existing PPI extraction methods are limited by labeling effort. Therefore, transfer learning is applied to reduce the cost of manual labeling. Current transfer learning methods suffer from negative transfer and lower performance. To tackle this problem, an improved TrAdaBoost algorithm is proposed in which a relative distribution is introduced to initialize the weights of TrAdaBoost, overcoming the negative transfer caused by domain differences. To further improve the performance of transfer learning, an approach combining active learning with the improved TrAdaBoost is presented. Experimental results on publicly available PPI corpora show that our method outperforms TrAdaBoost and SVM when labeled data are insufficient, and results on document classification corpora show that the proposed approaches ultimately achieve better performance than TrAdaBoost and TPTSVM, which verifies the effectiveness of our methods.
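
    Since the abstract builds on TrAdaBoost, a generic TrAdaBoost-style weight-update loop is sketched below for orientation. The relative-distribution initialization that constitutes the paper's contribution is not reproduced; it is only marked as a hypothetical hook, and the SVC base learner and all constants are illustrative assumptions.

        import numpy as np
        from sklearn.svm import SVC

        def tradaboost_sketch(Xs, ys, Xt, yt, rounds=10, init_source_weight=None):
            """Generic TrAdaBoost-style loop, for orientation only. `init_source_weight`
            is a hypothetical hook where a relative-distribution-based initialization
            (the paper's contribution) could plug in; the default is uniform."""
            ns = len(ys)
            w = np.ones(ns + len(yt))
            if init_source_weight is not None:
                w[:ns] = init_source_weight                  # not the paper's formula
            X = np.vstack([Xs, Xt])
            y = np.concatenate([ys, yt])
            beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(ns) / rounds))
            for _ in range(rounds):
                p = w / w.sum()
                clf = SVC(kernel="linear").fit(X, y, sample_weight=p)
                miss = (clf.predict(X) != y).astype(float)
                eps = np.clip(np.sum(p[ns:] * miss[ns:]) / p[ns:].sum(), 1e-10, 0.49)
                beta_tgt = eps / (1.0 - eps)
                w[:ns] *= beta_src ** miss[:ns]              # shrink misclassified source weights
                w[ns:] *= beta_tgt ** (-miss[ns:])           # grow misclassified target weights
            return clf                                       # simplified: return the last learner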

  • Multi-Level Attention Based BLSTM Neural Network for Biomedical Event Extraction

    Xinyu HE  Lishuang LI  Xingchen SONG  Degen HUANG  Fuji REN  

     
    PAPER-Natural Language Processing

      Publicized:
    2019/04/26
      Vol:
    E102-D No:9
      Page(s):
    1842-1850

    Biomedical event extraction is an important and challenging task in Information Extraction that plays a key role in medical research and disease prevention. Most existing event detection methods are based on shallow machine learning and rely mainly on domain knowledge and elaborately designed features. Another challenge is that crucial information, as well as the interactions among words or arguments, may be ignored, since most works treat all words and sentences equally. Therefore, we employ a Bidirectional Long Short-Term Memory (BLSTM) neural network for event extraction, which avoids handcrafted, complex feature extraction. Furthermore, we propose a multi-level attention mechanism, including word-level attention, which determines the importance of words in a sentence, and sentence-level attention, which determines the importance of relevant arguments. Finally, we train dependency word embeddings and add sentence vectors to enrich the semantic information. The experimental results show that our model achieves an F-score of 59.61% on the commonly used biomedical event extraction dataset (MLEE), outperforming other state-of-the-art methods.
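
    To make the two attention levels concrete, here is a small NumPy sketch of generic dot-product attention applied once over word states and once over argument/sentence vectors. The shapes, random vectors, and scoring function are assumptions for illustration only, not the trained model.

        import numpy as np

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        def attend(vectors, query):
            """Generic dot-product attention: weight each vector by its score
            against a query vector, then take the weighted sum."""
            weights = softmax(np.array([v @ query for v in vectors]))
            return weights @ np.stack(vectors)

        # Two levels, with random placeholder vectors instead of trained parameters.
        rng = np.random.default_rng(0)
        word_states = [rng.normal(size=200) for _ in range(12)]     # BLSTM states for one sentence
        word_query, sent_query = rng.normal(size=200), rng.normal(size=200)

        sentence_vec = attend(word_states, word_query)              # word-level attention
        argument_vecs = [sentence_vec] + [rng.normal(size=200) for _ in range(3)]
        context_vec = attend(argument_vecs, sent_query)             # sentence-level attention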

  • Creating Chinese-English Comparable Corpora

    Degen HUANG  Shanshan WANG  Fuji REN  

     
    PAPER-Natural Language Processing

      Vol:
    E96-D No:8
      Page(s):
    1853-1861

    Comparable corpora are valuable resources for many NLP applications, and extensive research has been done in recent years on information mining based on comparable corpora. Since there are not enough large-scale, publicly available comparable corpora at present, this paper presents a bi-directional CLIR-based method for creating comparable corpora from two independent news collections in different languages. The original Chinese and English document collections are crawled from XinHuaNet and formatted in a consistent manner. For each document in the two collections, the best query keywords are extracted to represent the essential content of the document, and these keywords are then translated into the language of the other collection. The translated queries are run against the collection in that language to pick up candidate documents, and candidates are aligned based on their publication dates and similarity scores. Results show that our approach significantly outperforms previous approaches to the construction of Chinese-English comparable corpora.
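
    The final alignment step (matching retrieved candidates by publication date and similarity score) can be pictured with the toy sketch below; the Doc structure, the two-day window, and the score threshold are invented illustrative values, not the paper's settings.

        from dataclasses import dataclass
        from datetime import date, timedelta

        @dataclass
        class Doc:
            doc_id: str
            pub_date: date

        def align_candidates(src, candidates, window=timedelta(days=2), min_score=0.2):
            """Keep retrieved candidates whose publication date lies within `window`
            of the source document and whose CLIR score clears `min_score`;
            both values are invented for illustration."""
            kept = [(src.doc_id, cand.doc_id, score)
                    for cand, score in candidates
                    if abs((src.pub_date - cand.pub_date).days) <= window.days
                    and score >= min_score]
            return sorted(kept, key=lambda t: -t[2])

        src = Doc("zh_001", date(2012, 5, 3))
        cands = [(Doc("en_104", date(2012, 5, 4)), 0.61), (Doc("en_377", date(2012, 6, 1)), 0.75)]
        print(align_candidates(src, cands))     # only the date-compatible en_104 survives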

  • Recognition of Collocation Frames from Sentences

    Xiaoxia LIU  Degen HUANG  Zhangzhi YIN  Fuji REN  

     
    PAPER-Natural Language Processing

      Publicized:
    2018/12/14
      Vol:
    E102-D No:3
      Page(s):
    620-627

    Collocation is a ubiquitous phenomenon in language, and accurate collocation recognition and extraction are of great significance to many natural language processing tasks. Collocations range from simple bigram collocations to collocation frames (distant multi-gram collocations). So far, little attention has been paid to collocation frames. Oriented to translation and parsing, this study aims to recognize and extract the longest possible collocation frames from given sentences. We first extract bigram collocations with a distributional-semantics-based method by introducing collocation patterns and integrating several state-of-the-art association measures. Based on the bigram collocations extracted by the proposed method, we obtain the longest collocation frames according to the recursive nature and linguistic rules of collocations. Compared with the baseline systems, the proposed method performs significantly better in bigram collocation extraction in both precision and recall. In extracting collocation frames, the proposed method performs even better, with precision similar to that of its bigram collocation extraction results.
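
    As one example of the association measures such a pipeline can integrate, the sketch below scores adjacent bigrams with plain pointwise mutual information (PMI). The collocation patterns and distributional-semantics component of the proposed method are not reproduced here.

        import math
        from collections import Counter
        from itertools import islice

        def pmi_bigrams(tokens, min_count=3):
            """Score adjacent bigrams with pointwise mutual information, one of the
            standard association measures such a pipeline can integrate."""
            unigrams = Counter(tokens)
            bigrams = Counter(zip(tokens, islice(tokens, 1, None)))
            n = len(tokens)
            scores = {}
            for (w1, w2), c in bigrams.items():
                if c < min_count:
                    continue                                 # discard unreliable low counts
                p_xy = c / (n - 1)
                p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
                scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
            return sorted(scores.items(), key=lambda kv: -kv[1])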

  • Analysis/Synthesis of Speech Using the Short-Time Fourier Transform and a Time-Varying ARMA Process

    Andreas SPANIAS  Philipos LOIZOU  Gim LIM  Ye CHEN  Gen HU  

     
    PAPER-Speech

      Vol:
    E76-A No:4
      Page(s):
    645-652

    A speech analysis/synthesis system that relies on a time-varying Autoregressive Moving Average (ARMA) process and the Short-Time Fourier Transform (STFT) is proposed. The narrowband components of speech are represented in the frequency domain by a set of harmonic components, while the broadband random components are represented by a time-varying ARMA process. The time-varying ARMA model has a dual function: it creates a spectral envelope that accurately fits the harmonic STFT components, and it provides the spectral representation of the broadband components of speech. The proposed model essentially combines the features of waveform coders, by employing the STFT, with the features of traditional vocoders, by incorporating an appropriately shaped noise sequence.
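
    The STFT analysis/synthesis step that the system builds on can be sketched with SciPy as below; the time-varying ARMA modelling of the broadband component is not attempted here, and the toy signal and frame length are arbitrary choices.

        import numpy as np
        from scipy.signal import stft, istft

        # Toy signal standing in for speech: a 200 Hz "harmonic" plus broadband noise.
        fs = 8000
        t = np.arange(0, 1.0, 1.0 / fs)
        x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.random.randn(t.size)

        f, frames, Z = stft(x, fs=fs, nperseg=256)     # analysis: harmonic lines appear in |Z|
        _, x_hat = istft(Z, fs=fs, nperseg=256)        # synthesis: invert the STFT

        n = min(x.size, x_hat.size)
        print(np.max(np.abs(x[:n] - x_hat[:n])))       # reconstruction error is tiny under COLA windows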

  • Corpus Expansion for Neural CWS on Microblog-Oriented Data with λ-Active Learning Approach

    Jing ZHANG  Degen HUANG  Kaiyu HUANG  Zhuang LIU  Fuji REN  

     
    PAPER-Natural Language Processing

      Publicized:
    2017/12/08
      Vol:
    E101-D No:3
      Page(s):
    778-785

    Microblog data contains rich information about real-world events with great commercial value, so microblog-oriented natural language processing (NLP) tasks have attracted considerable attention from researchers. However, the performance of microblog-oriented Chinese Word Segmentation (CWS) based on deep neural networks (DNNs) is still not satisfactory. One critical reason is that the existing microblog-oriented training corpus is inadequate for training effective weight matrices for DNNs. In this paper, we propose a novel active learning method to extend the scale of the training corpus for DNNs. However, because of the large number of partially overlapping sentences in microblogs, it is difficult to select samples with high annotation value from raw microblogs during the active learning procedure. To select samples with higher annotation value, a parameter λ is introduced to control the number of repeatedly selected samples. Meanwhile, various strategies are adopted to measure the overall annotation value of a sample during the active learning procedure. Experiments on the NLPCC 2015 benchmark datasets show that our λ-active learning method outperforms the baseline system and the state-of-the-art method. The results also demonstrate that the performance of DNNs trained on the extended corpus is significantly improved.
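
    One plausible reading of how λ limits repeatedly selected, heavily overlapping samples is sketched below; the uncertainty and overlap-grouping helpers are hypothetical stand-ins, not the strategies evaluated in the paper.

        def select_batch(candidates, uncertainty, group_of, lam=2, batch_size=100):
            """Rank candidate sentences by model uncertainty, but admit at most `lam`
            sentences from any group of heavily overlapping microblog posts.
            `uncertainty` and `group_of` are hypothetical helpers; this is only one
            plausible reading of the λ idea, not the paper's strategies."""
            picked, per_group = [], {}
            for s in sorted(candidates, key=uncertainty, reverse=True):
                g = group_of(s)                         # e.g. crude bucketing of near-duplicates
                if per_group.get(g, 0) >= lam:
                    continue                            # this duplicate group is already covered
                picked.append(s)
                per_group[g] = per_group.get(g, 0) + 1
                if len(picked) == batch_size:
                    break
            return picked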

  • Detecting New Words from Chinese Text Using Latent Semi-CRF Models

    Xiao SUN  Degen HUANG  Fuji REN  

     
    PAPER-Natural Language Processing

      Vol:
    E93-D No:6
      Page(s):
    1386-1393

    Chinese new words and their part-of-speech (POS) tags are particularly problematic in Chinese natural language processing. With the rapid development of the internet and information technology, it is impossible to build a complete system dictionary for Chinese natural language processing, as new words outside the basic system dictionary are constantly being created. A latent semi-CRF model, which combines the strengths of the LDCRF (Latent-Dynamic Conditional Random Field) and the semi-CRF, is proposed to detect new words of any type, together with their POS tags, synchronously from Chinese text that has not been pre-segmented. Unlike the original semi-CRF, the LDCRF is applied to generate the candidate entities for training and testing the latent semi-CRF, which accelerates training and decreases the computational cost. The complexity of the latent semi-CRF can be further adjusted by tuning the number of hidden variables in the LDCRF and the number of candidate entities taken from the N-best outputs of the LDCRF. A new-word-generating framework is proposed for model training and testing, under which the definitions and distributions of the new words conform to those found in real text. Specific features called "Global Fragment Information" for new word detection and POS tagging are adopted in model training and testing. The experimental results show that the proposed method is capable of detecting even low-frequency new words together with their POS tags. The proposed model performs competitively with the state-of-the-art models presented.
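
    The decoding step that any semi-CRF-style model shares, searching over candidate segments with a Viterbi-like dynamic program, is sketched below; the scoring function, label set, and maximum segment length are placeholders, and the LDCRF-based candidate generation is not modelled.

        def semi_markov_decode(n, labels, seg_score, max_len=6):
            """Viterbi-style search over labelled segments of a length-n sentence.
            `seg_score(i, j, y)` scores span [i, j) with label y; in the paper the
            candidate spans come from LDCRF N-best output and the scores from learned
            features, neither of which is modelled here."""
            best = [float("-inf")] * (n + 1)
            back = [None] * (n + 1)
            best[0] = 0.0
            for j in range(1, n + 1):
                for i in range(max(0, j - max_len), j):
                    for y in labels:
                        s = best[i] + seg_score(i, j, y)
                        if s > best[j]:
                            best[j], back[j] = s, (i, y)
            segments, j = [], n                          # follow back-pointers to recover segments
            while j > 0:
                i, y = back[j]
                segments.append((i, j, y))
                j = i
            return list(reversed(segments))

        # Usage sketch: semi_markov_decode(len(chars), ["WORD", "NEW"], my_score_fn)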