Unsupervised Incremental Learning and Prediction of Audio Signals

Marxer, R.; Purwins, H.

Note: This bibliographic page is archived and will no longer be updated. For an up-to-date list of publications from the Music Technology Group, see the Publications list.

Title	Unsupervised Incremental Learning and Prediction of Audio Signals
Publication Type	Conference Paper
Year of Publication	2010
Conference Name	International Symposium on Music Acoustics
Authors	Marxer, R., & Purwins, H.
Abstract	The artful play with the listener's expectations is one of the supreme skills of a gifted musician. We present a system that analyzes an audio signal in an unsupervised manner in order to generate a musical representation of it on the fly. The system performs the task of next-note prediction using the emergent representation. The main difference between our system and other existing music prediction systems is that it dynamically creates the necessary representations as needed. It can therefore adapt itself to any type of sound, with as many timbre classes as there may be. The system consists of a conceptual clustering algorithm coupled with a modified hierarchical N-gram. The main flow of the system can be summarized in the following processing steps: 1) segmentation by transient detection, 2) timbre representation of each segment by Mel-cepstrum coefficients, 3) discretization by conceptual clustering, yielding a number of different sound classes (e.g. instruments) that can incrementally grow or shrink depending on the context, resulting in a discrete sequence of sound events, 4) extraction of statistical regularities using hierarchical N-grams (Pfleger 2002), 5) prediction of continuation, and 6) sonification. The system is tested on voice recordings. We assess the robustness of the performance with respect to the complexity and noise of the signal. Given that the number of estimated timbre classes is not necessarily the same as in the ground truth, we propose a performance measure (F-recall) based on pairwise matching. Finally, we sonify the predicted sequence in order to evaluate the system from a qualitative point of view. We evaluate the individual steps of the process separately, then the interacting components, and finally the system as a whole. Onset detection performs with an F-measure of 98.6% on a data set of a singing voice. Clustering in isolation yields an F-recall of 88.5%. Onset detection combined with clustering achieves an F-recall of 91.4%. The prediction of the entire system yields an F-recall of 51.3%.
preprint/postprint document	http://www.mtg.upf.edu/files/publications/MarxerPurwinsUnsupervisedIncrementalPredictionAudioISMA2010.pdf