Expressive Singing Voice Processing based on Deep Learning

Contact: jordi [dot] bonada [at] upf [dot] edu, emilia [dot] gomez [at] upf [dot] edu

Humans use singing to express emotion, tell stories, create identity, exercise creativity, and connect with each other while singing together. Vocal music makes up an important part of our cultural heritage. In the last decade, singing voice synthesis has become a topic of great interest and is now used in a wide range of applications. The clearest example is found in Japan, where singing voice synthesis with Vocaloid (the result of more than a decade of collaboration between the MTG and Yamaha) has given birth to a cultural revolution. In some cases, virtual singers have become as popular as real ones, with hundreds of thousands of fans and even live concert performances alongside real musicians [1]. Furthermore, a growing community of amateur and professional musicians creates synthetic songs with the available synthesizer editors.

However, the current state of the art in singing synthesis lacks much of the naturalness and emotion found in real singing. Furthermore, users inevitably have to spend a lot of time tweaking synthesizer parameters to enhance expression. Recent advances in deep learning have been shown to significantly reduce the gap between human and synthetic speech, and we believe similar techniques could also help to reduce the gap between human and synthetic singing. Moreover, a semi-supervised or automatic system able to generate synthesizer controls that emulate the singing style and expression of real singers is also of paramount importance: besides easing musical creation, it would open the doors of virtual singing to a much broader user community. Research in this direction would help to better understand and model the different aspects involved in the expression of the singing voice. Part of the research would be devoted to developing open datasets for singing voice research.