Synthesis of the Singing Voice by Performance Sampling and Spectral Models

Title: Synthesis of the Singing Voice by Performance Sampling and Spectral Models
Publication Type: Journal Article
Year of Publication: 2007
Authors: Bonada, J., & Serra, X.
Journal Title: IEEE Signal Processing Magazine
Abstract: Among the many existing approaches to the synthesis of musical sounds, the ones that have had the biggest success are without any doubt the sampling-based ones, which sequentially concatenate samples from a corpus database [1]. Strictly speaking, we could say that sampling is not a synthesis technique, but from a practical perspective it is convenient to treat it as such. From what we explain in this article it should become clear that, from a technology point of view, it is also appropriate to include sampling as a type of sound synthesis model.
The success of sampling relies on the simplicity of the approach (it just samples existing sounds), but most importantly it succeeds in capturing the naturalness of the sounds, since they are real sounds. However, sound synthesis is far from being a solved problem and sampling is far from being an ideal approach. Lack of flexibility and lack of expressivity are two of the main problems, and there are still many issues to be worked on if we want to reach the level of quality that a professional musician expects to have in a musical instrument.
Sampling-based techniques have been used to reproduce practically all types of sounds and to model the sound space of essentially all musical instruments. They have been particularly successful for instruments that have discrete excitation controls, such as percussion or keyboard instruments. For these instruments it is feasible to reach an acceptable level of quality by using large sample databases, that is, by sampling a sufficient portion of the sound space produced by a given instrument. This is much more difficult for continuously excited instruments, such as bowed strings, wind instruments or the singing voice, and therefore recent sampling-based systems consider a trade-off between performance modeling and sample reproduction (e.g. [2]). For these instruments there are numerous control parameters and many ways to attack, articulate or play each note. The control parameters are constantly changing, and the sonic space covered by a performer could be considered to be much larger than for the discretely excited instruments. The synthesis approaches based on physical models have the advantage of having the right parameterization for being controlled like a real instrument, and thus they have great flexibility and the potential to play expressively. One of the main open problems relates to the control of these models, in particular how to generate the physical actions that excite the instrument. In sampling these actions are embedded in the recorded sounds.
We have worked on the synthesis of the singing voice for many years now, mostly together with Yamaha Corp., and part of our results has been incorporated into the Vocaloid software synthesizer. Our goal has always been to develop synthesis engines that could sound as natural and expressive as a real singer (or choir [3]) and whose inputs could be just the score and the lyrics of the song. This is a very difficult goal and there is still a lot of work to be done, but we believe that our proposed approach can reach that goal. In this article we give an overview of the key aspects of the technologies developed so far and identify the open issues that still need to be tackled. The core of the technologies is based on spectral processing, and over the years we have added performance actions and physical constraints in order to convert the basic sampling approach into a more flexible and expressive technology while maintaining its inherent naturalness.
In the first part of the article we introduce the concept of synthesis based on performance sampling and the specific spectral models that we have developed and used for the singing voice. In the second part we go over the different components of the synthesizer and we conclude by identifying the open issues of this research work.
Published document: files/publications/IEEESP-SingingVoiceSynthesis_FINAL.pdf