A CNN-Based Multi-Scale Pooling Strategy for Acoustic Scene Classification

Rong HUANG; Yue XIE

doi:10.1587/transinf.2023EDL8048

IEICE TRANSACTIONS on Information


Summary:

Acoustic scene classification (ASC) is a fundamental classification task in artificial intelligence. ASC systems commonly use models based on convolutional neural networks (CNNs) that take log-Mel spectrograms as input to extract acoustic features. In this paper, we design a CNN-based multi-scale pooling (MSP) strategy for ASC. The input log-Mel spectrogram is partitioned into four segments along the frequency axis, and four CNN channels each take their input from a distinct frequency range. The high-level features extracted from the different frequency bands are integrated through frequency pyramid average pooling layers at multiple levels. A softmax classifier then classifies the scenes. Experiments on two acoustic datasets show that the proposed model significantly improves classification performance.
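
The data flow described in the summary can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the pyramid levels `(1, 2)`, the averaging over time before pooling, and the toy input are all assumptions made for illustration, and the four CNN channels are replaced by an identity stand-in because the summary does not give their layer configuration.

```python
from math import exp


def split_frequency_bands(spec, n_bands=4):
    """Split a spectrogram (rows = mel bins, columns = frames) into equal bands."""
    n_mels = len(spec)
    if n_mels % n_bands != 0:
        raise ValueError("number of mel bins must be divisible by n_bands")
    size = n_mels // n_bands
    return [spec[i * size:(i + 1) * size] for i in range(n_bands)]


def frequency_pyramid_avg_pool(feature_map, levels=(1, 2)):
    """Average-pool a feature map over frequency at several granularities.

    Assumption: time is averaged out first, then each pyramid level splits
    the frequency axis into `level` equal parts and averages each part.
    """
    n_freq = len(feature_map)
    row_means = [sum(row) / len(row) for row in feature_map]
    pooled = []
    for level in levels:
        if n_freq % level != 0:
            raise ValueError("frequency size must be divisible by each level")
        size = n_freq // level
        for i in range(level):
            chunk = row_means[i * size:(i + 1) * size]
            pooled.append(sum(chunk) / len(chunk))
    return pooled


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy log-Mel spectrogram: 8 mel bins x 4 frames, each row filled with its bin index.
spec = [[float(m)] * 4 for m in range(8)]

features = []
for band in split_frequency_bands(spec, n_bands=4):
    branch_output = band  # identity stand-in for one CNN channel (assumption)
    features.extend(frequency_pyramid_avg_pool(branch_output, levels=(1, 2)))

# Hypothetical logits standing in for a trained classifier layer.
probabilities = softmax([1.0, 2.0, 3.0])
print(len(features), probabilities)
```

With four bands and pyramid levels `(1, 2)`, each band contributes three pooled values, so the concatenated feature vector has twelve entries. In a real model the pooled features would pass through a learned layer before the softmax.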

Publication: IEICE TRANSACTIONS on Information Vol.E107-D No.1 pp.153-156

Publication Date: 2024/01/01

Publicized: 2023/10/17

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2023EDL8048

Type of Manuscript: LETTER

Category: Speech and Hearing

Authors

Rong HUANG
Nanjing University of Posts and Telecommunications
Yue XIE
Nanjing Institute of Technology

Keywords

log-Mel spectrograms, convolutional neural network, pyramid average pooling, deep learning

Cite this

Rong HUANG, Yue XIE, "A CNN-Based Multi-Scale Pooling Strategy for Acoustic Scene Classification" in IEICE TRANSACTIONS on Information, vol. E107-D, no. 1, pp. 153-156, January 2024, doi: 10.1587/transinf.2023EDL8048.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2023EDL8048/_p

@ARTICLE{e107-d_1_153,
author={Rong HUANG and Yue XIE},
journal={IEICE TRANSACTIONS on Information},
title={A CNN-Based Multi-Scale Pooling Strategy for Acoustic Scene Classification},
year={2024},
volume={E107-D},
number={1},
pages={153--156},
abstract={Acoustic scene classification (ASC) is a fundamental domain within the realm of artificial intelligence classification tasks. ASC-based tasks commonly employ models based on convolutional neural networks (CNNs) that utilize log-Mel spectrograms as input for gathering acoustic features. In this paper, we designed a CNN-based multi-scale pooling (MSP) strategy for ASC. The log-Mel spectrograms are utilized as the input to CNN, which is partitioned into four frequency axis segments. Furthermore, we devised four CNN channels to acquire inputs from distinct frequency ranges. The high-level features extracted from outputs in various frequency bands are integrated through frequency pyramid average pooling layers at multiple levels. Subsequently, a softmax classifier is employed to classify different scenes. Our study demonstrates that the implementation of our designed model leads to a significant enhancement in the model's performance, as evidenced by the testing of two acoustic datasets.},
keywords={log-Mel spectrograms, convolutional neural network, pyramid average pooling, deep learning},
doi={10.1587/transinf.2023EDL8048},
ISSN={1745-1361},
month={January}
}

TY - JOUR
TI - A CNN-Based Multi-Scale Pooling Strategy for Acoustic Scene Classification
T2 - IEICE TRANSACTIONS on Information
SP - 153
EP - 156
AU - Rong HUANG
AU - Yue XIE
PY - 2024
DO - 10.1587/transinf.2023EDL8048
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E107-D
IS - 1
JA - IEICE TRANSACTIONS on Information
Y1 - January 2024
AB - Acoustic scene classification (ASC) is a fundamental domain within the realm of artificial intelligence classification tasks. ASC-based tasks commonly employ models based on convolutional neural networks (CNNs) that utilize log-Mel spectrograms as input for gathering acoustic features. In this paper, we designed a CNN-based multi-scale pooling (MSP) strategy for ASC. The log-Mel spectrograms are utilized as the input to CNN, which is partitioned into four frequency axis segments. Furthermore, we devised four CNN channels to acquire inputs from distinct frequency ranges. The high-level features extracted from outputs in various frequency bands are integrated through frequency pyramid average pooling layers at multiple levels. Subsequently, a softmax classifier is employed to classify different scenes. Our study demonstrates that the implementation of our designed model leads to a significant enhancement in the model's performance, as evidenced by the testing of two acoustic datasets.
ER -
