Abstract

Latent generative models can learn higher-level underlying factors
from complex data in an unsupervised manner. Such models
can be used in a wide range of speech processing applications,
including synthesis, transformation and classification.
While there have been many advances in this field in recent
years, the application of the resulting models to speech processing
tasks is generally not explicitly considered. In this paper we
apply the variational autoencoder (VAE) to the task of modeling
frame-wise spectral envelopes. The VAE model has many
attractive properties, such as continuous latent variables, a prior
distribution over these latent variables, a tractable lower bound
on the marginal log likelihood, both generative and recognition
models, and end-to-end training of deep models. We consider
different aspects of training such models for speech data and
compare them to more conventional models such as the Restricted
Boltzmann Machine (RBM). While evaluating generative models is
difficult, we try to obtain a balanced picture by considering both
reconstruction error and performance on a series of modeling and
transformation tasks, thereby gauging the quality of the learned
features.
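For reference, the tractable lower bound mentioned above is the standard VAE evidence lower bound (ELBO). The notation below is illustrative rather than taken from the paper: x denotes an observed datum (e.g., a spectral envelope frame), z the latent variables, q_phi the recognition model, p_theta the generative model, and p(z) the prior over the latents:

\[
\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
\]

Maximizing this bound jointly over the generative parameters theta and the recognition parameters phi is what permits the end-to-end training of both models noted above.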