Keyword Search Results

[Keyword] multimodal fusion (2 hits)

Hits 1-2 of 2
  • Multimodal Named Entity Recognition with Bottleneck Fusion and Contrastive Learning

    Peng WANG, Xiaohang CHEN, Ziyu SHANG, Wenjun KE

    PAPER-Natural Language Processing

    Publicized: 2023/01/18
    Vol: E106-D No:4
    Page(s): 545-555

    Multimodal named entity recognition (MNER) is the task of recognizing named entities in a multimodal context. Existing methods focus on co-attention mechanisms to discover relationships between modalities, but they still have two deficiencies. First, they fail to fuse the multimodal representations in a fine-grained way, which may introduce noise from the visual modality. Second, they do not bridge the semantic gap between heterogeneous modalities. To address these issues, we propose a novel MNER method with bottleneck fusion and contrastive learning (BFCL). Specifically, we first incorporate a transformer-based bottleneck fusion mechanism, so that information can be exchanged between modalities only through a small set of bottleneck tokens, thereby reducing noise propagation. We then propose two decoupled image-text contrastive losses to align the unimodal representations, bringing semantically related representations closer together and pushing semantically unrelated ones farther apart. Experimental results demonstrate that our method is competitive with state-of-the-art models, achieving F1-scores of 74.54% and 85.70% on the Twitter-2015 and Twitter-2017 datasets, respectively.
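
    To make the bottleneck-fusion idea concrete, here is a minimal sketch, assuming PyTorch; the names BottleneckFusion, num_bottleneck, and info_nce are invented for illustration and this is not the authors' implementation. Each modality self-attends only over its own tokens plus a small set of shared bottleneck tokens, so cross-modal information can flow only through those tokens, and an InfoNCE-style helper hints at how an image-text contrastive term could be added.

      # Illustrative sketch only (not the authors' code): a generic attention-bottleneck
      # fusion module plus an InfoNCE-style contrastive helper. Assumes PyTorch; the
      # names BottleneckFusion, num_bottleneck and info_nce are invented here.
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class BottleneckFusion(nn.Module):
          """Text and image tokens never attend to each other directly; they exchange
          information only through a small set of shared bottleneck tokens."""

          def __init__(self, dim=256, num_bottleneck=4, num_heads=4, num_layers=2):
              super().__init__()
              self.bottleneck = nn.Parameter(torch.randn(1, num_bottleneck, dim))
              self.text_layers = nn.ModuleList(
                  [nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
                   for _ in range(num_layers)])
              self.image_layers = nn.ModuleList(
                  [nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
                   for _ in range(num_layers)])

          def forward(self, text_tokens, image_tokens):
              b = text_tokens.size(0)
              btl = self.bottleneck.expand(b, -1, -1)   # shared bottleneck tokens
              n = btl.size(1)
              for t_layer, v_layer in zip(self.text_layers, self.image_layers):
                  # Each modality self-attends over its own tokens plus the bottleneck.
                  t_out = t_layer(torch.cat([text_tokens, btl], dim=1))
                  v_out = v_layer(torch.cat([image_tokens, btl], dim=1))
                  text_tokens, t_btl = t_out[:, :-n], t_out[:, -n:]
                  image_tokens, v_btl = v_out[:, :-n], v_out[:, -n:]
                  # Cross-modal exchange happens only via the bottleneck tokens.
                  btl = 0.5 * (t_btl + v_btl)
              return text_tokens, image_tokens

      def info_nce(text_vec, image_vec, temperature=0.07):
          """One direction of an InfoNCE-style image-text contrastive loss:
          matched (text, image) pairs are pulled together, mismatched pairs pushed apart."""
          t = F.normalize(text_vec, dim=-1)
          v = F.normalize(image_vec, dim=-1)
          logits = t @ v.t() / temperature
          targets = torch.arange(t.size(0), device=t.device)
          return F.cross_entropy(logits, targets)

      if __name__ == "__main__":
          fusion = BottleneckFusion()
          txt = torch.randn(2, 12, 256)   # (batch, text tokens, dim)
          img = torch.randn(2, 49, 256)   # (batch, image patches, dim)
          t_out, v_out = fusion(txt, img)
          print(t_out.shape, v_out.shape)

    Averaging the two modality-specific bottleneck states after each layer is just one simple way to share them; the paper's exact fusion scheme and the decoupled formulation of its two contrastive losses are not reproduced here.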

  • Multimodal Affect Recognition Using Boltzmann Zippers

    Kun LU, Xin ZHANG

    LETTER-Image Recognition, Computer Vision

    Vol: E96-D No:11
    Page(s): 2496-2499

    This letter presents a novel approach to automatic multimodal affect recognition. The audio and visual channels provide complementary information for recognizing human affective states, and we use Boltzmann zippers for model-level fusion to learn the intrinsic correlations between the two modalities. We extract effective audio and visual feature streams at different time scales and feed them into two component Boltzmann chains, respectively. The hidden units of the two chains are interconnected to form a Boltzmann zipper, which effectively avoids local energy minima during training. Second-order methods are applied to the Boltzmann zippers to speed up the learning and pruning processes. Experimental results on audio-visual emotion data, both recorded by ourselves in Wizard-of-Oz scenarios and collected from the SEMAINE naturalistic database, demonstrate that our approach is robust and outperforms state-of-the-art methods.
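
    As a rough, hypothetical illustration of the coupled-chain idea, the sketch below (plain NumPy, toy sizes, invented variable names) defines an energy over two chains of binary hidden units, one per modality, with within-chain couplings plus cross-links that "zip" corresponding hidden units together, and runs one Gibbs sweep over the audio chain. The paper's feature extraction, second-order learning, and pruning are not reproduced.

      # Illustrative sketch only -- a toy Boltzmann-zipper-style energy and one
      # Gibbs sweep; names and coupling layout are assumptions, not the paper's code.
      import numpy as np

      rng = np.random.default_rng(0)

      T = 10                      # number of time steps in each chain
      H = 4                       # hidden units per time step and modality
      W_a = rng.normal(scale=0.1, size=(T - 1, H, H))   # audio within-chain couplings
      W_v = rng.normal(scale=0.1, size=(T - 1, H, H))   # visual within-chain couplings
      W_c = rng.normal(scale=0.1, size=(T, H, H))       # cross-links audio <-> visual

      def energy(h_a, h_v):
          """Energy of a joint configuration of the two hidden chains."""
          e = 0.0
          for t in range(T - 1):
              e -= h_a[t] @ W_a[t] @ h_a[t + 1]     # audio chain, step t -> t+1
              e -= h_v[t] @ W_v[t] @ h_v[t + 1]     # visual chain, step t -> t+1
          for t in range(T):
              e -= h_a[t] @ W_c[t] @ h_v[t]         # cross-modal "zipper" link
          return e

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def gibbs_sweep(h_a, h_v):
          """One Gibbs sweep over the audio hidden chain, conditioned on the visual chain."""
          for t in range(T):
              field = h_v[t] @ W_c[t].T             # input from the visual chain
              if t > 0:
                  field += h_a[t - 1] @ W_a[t - 1]  # input from the previous time step
              if t < T - 1:
                  field += W_a[t] @ h_a[t + 1]      # input from the next time step
              h_a[t] = (rng.random(H) < sigmoid(field)).astype(float)
          # (A symmetric sweep would update h_v given h_a in the same way.)
          return h_a, h_v

      h_a = rng.integers(0, 2, size=(T, H)).astype(float)
      h_v = rng.integers(0, 2, size=(T, H)).astype(float)
      print("energy before sweep:", energy(h_a, h_v))
      h_a, h_v = gibbs_sweep(h_a, h_v)
      print("energy after sweep :", energy(h_a, h_v))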