Unsupervised Incremental Learning and Prediction of Audio Signals

Title: Unsupervised Incremental Learning and Prediction of Audio Signals
Publication Type: Conference Paper
Year of Publication: 2010
Conference Name: International Symposium on Music Acoustics
Authors: Marxer, R., & Purwins, H.
Abstract: The artful play with the listener’s expectations is one of the supreme skills of a gifted musician. We present a system that analyzes an audio signal in an unsupervised manner in order to generate a musical representation of it on-the-fly. The system performs the task of next-note prediction using the emerged representation. The main difference between our system and other existing music prediction systems is that it dynamically creates the necessary representations as needed. It can therefore adapt itself to any type of sound, with an arbitrary number of timbre classes. The system consists of a conceptual clustering algorithm coupled with a modified hierarchical N-gram. The main flow of the system can be summarized in the following processing steps: 1) segmentation by transient detection, 2) timbre representation of each segment by Mel-cepstrum coefficients, 3) discretization by conceptual clustering, yielding a number of different sound classes (e.g. instruments) that can incrementally grow or shrink depending on the context, resulting in a discrete sequence of sound events, 4) extraction of statistical regularities using hierarchical N-grams (Pfleger 2002), 5) prediction of continuation, and 6) sonification. The system is tested on voice recordings. We assess the robustness of the performance with respect to the complexity and noise of the signal. Given that the number of estimated timbre classes is not necessarily the same as in the ground truth, we propose a performance measure (F-recall) based on pairwise matching. Finally, we sonify the predicted sequence in order to evaluate the system from a qualitative point of view. We evaluate the individual processing steps separately and then the complete system as a whole. Onset detection performs with an F-measure of 98.6% on a data set of a singing voice. Clustering in isolation yields an F-recall of 88.5%. Onset detection jointly with clustering achieves an F-recall of 91.4%. The prediction of the entire system yields an F-recall of 51.3%.
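The prediction stage (steps 4–5) can be illustrated with a minimal sketch. This is not the authors' modified hierarchical N-gram (Pfleger 2002); it is a plain fixed-order N-gram over an already-discretized event sequence (the output of step 3), with hypothetical sound-class labels, showing how statistical regularities yield a predicted continuation:

```python
from collections import defaultdict, Counter

def train_ngram(events, n=2):
    """Count, for each (n-1)-length context, how often each event follows it."""
    model = defaultdict(Counter)
    for i in range(len(events) - n + 1):
        context = tuple(events[i:i + n - 1])
        model[context][events[i + n - 1]] += 1
    return model

def predict_next(model, context):
    """Return the most frequent continuation of the context, or None if unseen."""
    counts = model.get(tuple(context))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy discrete sequence of sound-class labels (hypothetical step-3 output).
sequence = ["kick", "hat", "snare", "hat", "kick", "hat", "snare", "hat"]
model = train_ngram(sequence, n=2)
print(predict_next(model, ["kick"]))  # -> hat
```

A hierarchical N-gram would additionally back off across context lengths; the sketch keeps a single fixed order for clarity.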
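One plausible reading of the pairwise-matching idea behind F-recall, sketched here as the standard pair-counting F-measure for comparing two labelings with different numbers of classes (this is an assumption about the measure, not the paper's exact definition): every pair of items that shares a class in a labeling counts as a positive pair, and precision/recall are computed over those pairs.

```python
from itertools import combinations

def pairwise_f(truth, pred):
    """Pair-counting F-measure: a pair of items is positive in a labeling
    when both items carry the same class label; compare the two pair sets."""
    same_truth = {p for p in combinations(range(len(truth)), 2)
                  if truth[p[0]] == truth[p[1]]}
    same_pred = {p for p in combinations(range(len(pred)), 2)
                 if pred[p[0]] == pred[p[1]]}
    if not same_truth or not same_pred:
        return 0.0
    tp = len(same_truth & same_pred)
    precision = tp / len(same_pred)
    recall = tp / len(same_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = ["a", "a", "b", "b"]
pred = [0, 0, 1, 2]  # one ground-truth class split into two estimated clusters
print(round(pairwise_f(truth, pred), 3))  # -> 0.667
```

Because it compares pairs rather than class labels directly, the measure is insensitive to the estimated number of classes, matching the motivation given in the abstract.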
Preprint/postprint document: http://www.mtg.upf.edu/files/publications/MarxerPurwinsUnsupervisedIncrementalPredictionAudioISMA2010.pdf