Cross-Domain Deep Feature Combination for Bird Species Classification with Audio-Visual Data

Naranchimeg BOLD; Chao ZHANG; Takuya AKASHI

doi:10.1587/transinf.2018EDP7383

IEICE TRANSACTIONS on Information

Summary:

In the past decade, many state-of-the-art algorithms for image classification and audio classification have achieved notable success thanks to the development of deep convolutional neural networks (CNNs). However, most of these works exploit only a single type of training data. In this paper, we present a study on classifying bird species by combining visual (image) and audio (sound) data using CNNs, an approach that has been sparsely treated so far. Specifically, we propose CNN-based multimodal learning models with three types of fusion strategies (early, middle, and late) to address the problem of combining training data across domains. The advantage of our proposed method lies in the fact that we can use CNNs not only to extract features from image and audio (spectrogram) data but also to combine the features across modalities. In the experiments, we train and evaluate the network structure on the comprehensive CUB-200-2011 standard data set combined with our originally collected audio data set, matched by species. We observe that a model that uses the combination of both data types outperforms models trained on only one type of data. We also show that transfer learning can significantly increase classification performance.

Publication: IEICE TRANSACTIONS on Information Vol.E102-D No.10 pp.2033-2042

Publication Date: 2019/10/01

Publicized: 2019/06/27

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2018EDP7383

Type of Manuscript: PAPER

Category: Multimedia Pattern Processing

Authors

Naranchimeg BOLD
  Iwate University
Chao ZHANG
  University of Fukui
Takuya AKASHI
  Iwate University

Keyword

bird species classification, multimodal learning, feature combination, spectrogram feature, convolutional neural networks

Cite this

Naranchimeg BOLD, Chao ZHANG, Takuya AKASHI, "Cross-Domain Deep Feature Combination for Bird Species Classification with Audio-Visual Data" in IEICE TRANSACTIONS on Information, vol. E102-D, no. 10, pp. 2033-2042, October 2019, doi: 10.1587/transinf.2018EDP7383.
Abstract: In the past decade, many state-of-the-art algorithms for image classification and audio classification have achieved notable success thanks to the development of deep convolutional neural networks (CNNs). However, most of these works exploit only a single type of training data. In this paper, we present a study on classifying bird species by combining visual (image) and audio (sound) data using CNNs, an approach that has been sparsely treated so far. Specifically, we propose CNN-based multimodal learning models with three types of fusion strategies (early, middle, and late) to address the problem of combining training data across domains. The advantage of our proposed method lies in the fact that we can use CNNs not only to extract features from image and audio (spectrogram) data but also to combine the features across modalities. In the experiments, we train and evaluate the network structure on the comprehensive CUB-200-2011 standard data set combined with our originally collected audio data set, matched by species. We observe that a model that uses the combination of both data types outperforms models trained on only one type of data. We also show that transfer learning can significantly increase classification performance.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2018EDP7383/_p

@ARTICLE{e102-d_10_2033,
author={Naranchimeg BOLD and Chao ZHANG and Takuya AKASHI},
journal={IEICE TRANSACTIONS on Information},
title={Cross-Domain Deep Feature Combination for Bird Species Classification with Audio-Visual Data},
year={2019},
volume={E102-D},
number={10},
pages={2033--2042},
abstract={In the past decade, many state-of-the-art algorithms for image classification and audio classification have achieved notable success thanks to the development of deep convolutional neural networks (CNNs). However, most of these works exploit only a single type of training data. In this paper, we present a study on classifying bird species by combining visual (image) and audio (sound) data using CNNs, an approach that has been sparsely treated so far. Specifically, we propose CNN-based multimodal learning models with three types of fusion strategies (early, middle, and late) to address the problem of combining training data across domains. The advantage of our proposed method lies in the fact that we can use CNNs not only to extract features from image and audio (spectrogram) data but also to combine the features across modalities. In the experiments, we train and evaluate the network structure on the comprehensive CUB-200-2011 standard data set combined with our originally collected audio data set, matched by species. We observe that a model that uses the combination of both data types outperforms models trained on only one type of data. We also show that transfer learning can significantly increase classification performance.},
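keywords={bird species classification, multimodal learning, feature combination, spectrogram feature, convolutional neural networks},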
doi={10.1587/transinf.2018EDP7383},
ISSN={1745-1361},
month={October},
}

TY - JOUR
TI - Cross-Domain Deep Feature Combination for Bird Species Classification with Audio-Visual Data
T2 - IEICE TRANSACTIONS on Information
SP - 2033
EP - 2042
AU - Naranchimeg BOLD
AU - Chao ZHANG
AU - Takuya AKASHI
PY - 2019
DO - 10.1587/transinf.2018EDP7383
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E102-D
IS - 10
JA - IEICE TRANSACTIONS on Information
Y1 - October 2019
AB - In the past decade, many state-of-the-art algorithms for image classification and audio classification have achieved notable success thanks to the development of deep convolutional neural networks (CNNs). However, most of these works exploit only a single type of training data. In this paper, we present a study on classifying bird species by combining visual (image) and audio (sound) data using CNNs, an approach that has been sparsely treated so far. Specifically, we propose CNN-based multimodal learning models with three types of fusion strategies (early, middle, and late) to address the problem of combining training data across domains. The advantage of our proposed method lies in the fact that we can use CNNs not only to extract features from image and audio (spectrogram) data but also to combine the features across modalities. In the experiments, we train and evaluate the network structure on the comprehensive CUB-200-2011 standard data set combined with our originally collected audio data set, matched by species. We observe that a model that uses the combination of both data types outperforms models trained on only one type of data. We also show that transfer learning can significantly increase classification performance.
ER -
