Abstract

Latent generative models can learn higher-level underlying factors
from complex data in an unsupervised manner. Such models
can be used in a wide range of speech processing applications,
including synthesis, transformation and classification.
While there have been many advances in this field in recent
years, the application of the resulting models to speech processing
tasks is generally not explicitly considered. In this paper we
apply the variational autoencoder (VAE) to the task of modeling
frame-wise spectral envelopes. The VAE model has many
attractive properties, such as continuous latent variables, a prior
distribution over these latent variables, a tractable lower bound
on the marginal log likelihood, both generative and recognition
models, and end-to-end training of deep models. We consider
different aspects of training such models for speech data and
compare them to more conventional models such as the Restricted
Boltzmann Machine (RBM). While evaluating generative models is
difficult, we try to obtain a balanced picture by considering both
reconstruction error and performance on a series of modeling and
transformation tasks, thereby gauging the quality of the learned
features.
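For reference, the tractable lower bound mentioned above is the standard VAE evidence lower bound (ELBO). The notation below is illustrative rather than taken from the paper: x denotes an observed datum (e.g., a spectral envelope frame), z the latent variables, q_phi the recognition model, p_theta the generative model, and p(z) the prior over the latents:

\[
\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
\]

Maximizing this bound jointly over the generative parameters theta and the recognition parameters phi is what permits the end-to-end training of both models noted above.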