Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

Jordan J. Bird; Diego R. Faria; Cristiano Premebida; Anikó Ekárt; George Vogiatzis

doi:10.48550/arxiv.2007.10175

arXiv (Cornell University) · 2020

Conference paper

Abstract

The novelty of this study lies in a multi-modality approach to scene classification, in which image and audio complement each other through deep late fusion. The approach is demonstrated on a difficult classification problem consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and the accompanying audio at one-second intervals. The image and audio datasets are then classified independently, using a fine-tuned VGG16 and an evolutionarily optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. The two neural networks are then combined by late fusion to enable a higher-order function, leading to an accuracy of 96.81% for the multi-modality classifier on synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are treated as feature generators. We show that cases in which a single modality is confused by anomalous data points are corrected through the emerging higher-order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone, and a densely crowded street misclassified as a forest by the image classifier alone; both are correctly classified by our multi-modality approach.
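
As a rough illustration of the late fusion idea described in the abstract, the sketch below concatenates the outputs of two per-modality classifiers and passes them through one further dense layer with a softmax. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which layer outputs are fused or how the tertiary network is structured, so the 8-dimensional per-modality vectors, the single fusion layer, the random untrained weights, and all function names here are hypothetical.

```python
import math
import random

NUM_CLASSES = 8  # the abstract reports 8 environments


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def dense(inputs, weights, bias):
    """One fully connected layer: weights has one row per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def late_fusion_predict(image_features, audio_features, weights, bias):
    """Concatenate the two modality outputs, then classify the fused vector.

    Assumption: each primary network acts as a feature generator that
    emits one vector per one-second sample.
    """
    fused = image_features + audio_features  # list concatenation
    return softmax(dense(fused, weights, bias))


# Stand-in outputs for one synchronised frame/audio pair (random, not real data).
rng = random.Random(0)
image_features = softmax([rng.gauss(0, 1) for _ in range(NUM_CLASSES)])
audio_features = softmax([rng.gauss(0, 1) for _ in range(NUM_CLASSES)])

# Untrained fusion layer; in practice these weights would be learned.
weights = [
    [rng.gauss(0, 0.1) for _ in range(2 * NUM_CLASSES)]
    for _ in range(NUM_CLASSES)
]
bias = [0.0] * NUM_CLASSES

probs = late_fusion_predict(image_features, audio_features, weights, bias)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
print(predicted, round(sum(probs), 6))
```

Because the fusion layer sees both modalities at once, a trained version can learn to down-weight one modality when the other is more reliable for a given class, which is the kind of correction the abstract describes for the water feature and crowded street examples.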

Citation

Jordan J. Bird, Diego R. Faria, Cristiano Premebida, Anikó Ekárt and George Vogiatzis. “Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines.” arXiv (Cornell University), pp. 10380–10385. 2020.

BibTeX

@inproceedings{bird2020,
  title     = {Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines},
  author    = {Jordan J. Bird and Diego R. Faria and Cristiano Premebida and Anikó Ekárt and George Vogiatzis},
  booktitle = {arXiv (Cornell University)},
  pages     = {10380--10385},
  year      = {2020},
  doi       = {10.48550/arxiv.2007.10175},
}
