
Deep Neural Networks for Music and Audio Tagging

Title: Deep Neural Networks for Music and Audio Tagging
Publication Type: PhD Thesis
Year of Publication: 2019
University: Universitat Pompeu Fabra
Authors: Pons, J.
Advisor: Serra, X.
Academic Department: Information and Communication Technologies
Number of Pages: 216+xxiii
Date Published: 11/2019
City: Barcelona, Spain

Automatic music and audio tagging can help increase the retrieval and re-use possibilities of many audio databases that remain poorly labeled. In this dissertation, we tackle the task of music and audio tagging from the deep learning perspective and, within that context, we address the following research questions:

i. Which deep learning architectures are most appropriate for (music) audio signals?

ii. In which scenarios is waveform-based end-to-end learning feasible?

iii. How much data is required for carrying out competitive deep learning research?

To answer research question (i), we propose musically motivated convolutional neural networks as a domain-knowledge-based alternative for designing deep learning models, and we evaluate several deep learning architectures for audio at a low computational cost with a novel methodology based on non-trained (randomly weighted) convolutional neural networks. Throughout our work, we find that employing music and audio domain knowledge during the model’s design can help improve the efficiency, interpretability, and performance of spectrogram-based deep learning models.
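
As a rough illustration of the randomly weighted idea, the sketch below extracts features from a spectrogram with convolutional filters that are drawn at random and never trained. It is a minimal NumPy sketch and not the thesis implementation: the function name `random_cnn_features`, the filter shape, the ReLU and the global max-pooling are all assumptions made for this example, and the input is synthetic data. In such a setup, a separate and cheap classifier would presumably be trained on these features, so that differences in accuracy reflect the architecture rather than the learned weights; that step is omitted here.

```python
import numpy as np

def random_cnn_features(spec, filters):
    """Apply fixed random filters to a (freq, time) spectrogram.

    Returns one feature per filter: cross-correlation, ReLU, then a
    global max-pool over all frequency/time positions.
    """
    n_filters, fh, fw = filters.shape
    # All (fh, fw) patches, shape (F - fh + 1, T - fw + 1, fh, fw).
    patches = np.lib.stride_tricks.sliding_window_view(spec, (fh, fw))
    # Response of every filter at every patch position.
    responses = np.einsum("ijkl,nkl->nij", patches, filters)
    return np.maximum(responses, 0.0).max(axis=(1, 2))

# Hypothetical setup: 8 random 3x3 filters, synthetic 40x100 "spectrogram".
rng = np.random.default_rng(0)
filters = rng.standard_normal((8, 3, 3))  # weights stay untrained
spec = rng.random((40, 100))
feats = random_cnn_features(spec, filters)
print(feats.shape)
```

Because the filters are never updated, the only cost is a forward pass, which is what makes this kind of comparison cheap.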

For research questions (ii) and (iii), we perform a study with the Sample CNN, a recently proposed end-to-end learning model, to assess its viability for music audio tagging when variable amounts of training data (ranging from 25k to 1.2M songs) are available. We compare the Sample CNN against a musically motivated spectrogram-based architecture and conclude that, given enough data, end-to-end learning models can achieve better results. Finally, in addressing research question (iii), we also investigate whether a naive regularization of the solution space, prototypical networks, transfer learning, or their combination can help deep learning models better leverage a small number of training examples. Results indicate that transfer learning and prototypical networks are powerful strategies in such low-data regimes.
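
For readers unfamiliar with prototypical networks, the classification step can be sketched as follows: each class is represented by the mean of its embedded training examples (its prototype), and a new example is assigned to the class of the nearest prototype. This is a minimal sketch under stated assumptions, not the thesis code: the helper names `class_prototypes` and `classify` are hypothetical, squared Euclidean distance is assumed, and the toy 2-D embeddings are hand-written, whereas in practice they would come from a trained neural network.

```python
import numpy as np

def class_prototypes(support_emb, support_labels):
    """Return the class ids and the mean embedding (prototype) of each class."""
    classes = np.unique(support_labels)
    protos = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in classes]
    )
    return classes, protos

def classify(query_emb, classes, protos):
    """Assign each query to the class of its nearest prototype."""
    # Squared Euclidean distances, shape (n_queries, n_classes).
    dists = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy example: two classes with two labeled examples each.
support_emb = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]])
support_labels = np.array([0, 0, 1, 1])
classes, protos = class_prototypes(support_emb, support_labels)

query_emb = np.array([[1.0, 1.0], [9.0, 9.0]])
predictions = classify(query_emb, classes, protos)
print(predictions)  # [0 1]
```

Since a prototype is just an average, it can be computed from only a few examples per class, which is why the approach suits low-data regimes.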
