Clustering of multiple-event online sound collections with the codebook approach

Lluis Suros

Note: This bibliographic page is archived and will no longer be updated. For an up-to-date list of publications from the Music Technology Group see the Publications list.

Title	Clustering of multiple-event online sound collections with the codebook approach
Publication Type	Master Thesis
Year of Publication	2019
Authors	Suros, L.
Abstract	In massive online audio databases such as Freesound, automatic methods to encode, process, compare and organize the content are relevant topics of research. For instance, a properly organized presentation of these audio collections can improve the user experience when browsing for sounds. When dealing with collections of diverse audio content, some items can contain multiple sonic events. These multi-event audio clips can be challenging to encode numerically in a concise yet representative manner. However, such numerical representations are essential for any further processing of the signal. Multiple-event audio clips might be misrepresented by statistical aggregation methods such as computing the mean over the features; in this regard, techniques that retain the elemental blocks of a given signal are potentially beneficial. This thesis aims to explore the contribution of the codebook approach to the clustering of large collections of multiple-event audio content. The codebook approach might make it possible to automatically obtain "acoustic words", which are discrete partitions of the feature space that could be roughly representative of the underlying acoustic events present in the original sounds. These "acoustic words" can be understood and processed analogously to natural language words, thus giving access to a variety of Natural Language Processing techniques, such as Bag-of-Words, TF-IDF or neural network word embeddings. We believe that this approach could help to improve the similarity assessment between multi-event audio clips, and potentially increase the accuracy of a subsequent clustering stage. In future work, this text analogy might also enable interesting artistic research. In order to experiment with the presented concepts, a corresponding end-to-end processing pipeline has been implemented. Freesound Datasets and a brand new custom dataset of acoustic scenes have been employed to perform the experiments.
The codebase and the custom dataset are both delivered in the accompanying GitHub project ( https://github.com/lluissuros/codebook-approach ). The experimental results are reported, discussed and expanded with ideas for future work.
Final publication	https://doi.org/10.5281/zenodo.3475480
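
Illustrative sketch (not taken from the thesis or its repository): the abstract describes quantizing frame-level features into "acoustic words" and then treating each clip as a bag of words. The toy example below shows one way this idea could look, under several assumptions: the features are made-up 2-D points rather than real audio descriptors, the codebook is learned with a plain k-means (the abstract does not state which quantizer the thesis uses), and every function name is hypothetical.

```python
import math
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(frame, codebook):
    """Index of the codeword closest to a frame."""
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

def init_codebook(frames, k):
    """Deterministic farthest-first initialization (an assumption of this sketch)."""
    codebook = [frames[0]]
    while len(codebook) < k:
        codebook.append(max(frames, key=lambda f: min(sq_dist(f, c) for c in codebook)))
    return codebook

def learn_codebook(frames, k, iters=10):
    """Plain k-means; the resulting centroids play the role of 'acoustic words'."""
    codebook = init_codebook(frames, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in frames:
            groups[nearest(f, codebook)].append(f)
        # Recompute each centroid; keep the old one if its group is empty.
        codebook = [
            [sum(col) / len(g) for col in zip(*g)] if g else codebook[i]
            for i, g in enumerate(groups)
        ]
    return codebook

def bag_of_words(clip, codebook):
    """Count how often each acoustic word occurs in one clip."""
    counts = Counter(nearest(f, codebook) for f in clip)
    return [counts[i] for i in range(len(codebook))]

def tf_idf(histograms):
    """Term frequency times a smoothed inverse document frequency."""
    n, k = len(histograms), len(histograms[0])
    df = [sum(1 for h in histograms if h[j] > 0) for j in range(k)]
    idf = [math.log((1 + n) / (1 + df[j])) + 1 for j in range(k)]
    return [[h[j] / sum(h) * idf[j] for j in range(k)] for h in histograms]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy frames for three hypothetical sonic events.
event_a = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]]
event_b = [[10.0, 10.0], [10.2, 9.9], [9.8, 10.1]]
event_c = [[0.0, 10.0], [0.1, 9.8], [0.3, 10.2]]

# Two clips contain the same two events in a different order; the third has one event.
clips = [event_a + event_b, event_b + event_a, event_c + event_c]

all_frames = [f for clip in clips for f in clip]
codebook = learn_codebook(all_frames, k=3)
histograms = [bag_of_words(clip, codebook) for clip in clips]
vectors = tf_idf(histograms)

sim_same_events = cosine(vectors[0], vectors[1])
sim_different_events = cosine(vectors[0], vectors[2])
print(histograms, sim_same_events, sim_different_events)
```

In this toy setting, the mean feature of the first clip would fall between events A and B and resemble neither, whereas the word histogram keeps both events visible. The resulting TF-IDF vectors could then be passed to any clustering algorithm; which one the thesis uses is not stated in the abstract.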