Modeling of Phoneme Durations for Alignment between Polyphonic Audio and Lyrics

Title: Modeling of Phoneme Durations for Alignment between Polyphonic Audio and Lyrics
Publication Type: Conference Paper
Year of Publication: 2015
Conference Name: Sound and Music Computing Conference 2015
Authors: Dzhambazov, G., & Serra, X.
Conference Location: Maynooth, Ireland
Abstract: In this work we propose how to modify a standard scheme for text-to-speech alignment for the alignment of lyrics and singing voice. To this end we model the duration of phonemes specific to the case of singing. We rely on a duration-explicit hidden Markov model (DHMM) phonetic recognizer based on mel frequency cepstral coefficients (MFCCs), which are extracted in a way robust to background instrumental sounds. The proposed approach is tested on polyphonic audio from the classical Turkish music tradition in two settings: with and without modeling phoneme durations. Phoneme durations are inferred from sheet music. In order to assess the impact of the polyphonic setting, alignment is also evaluated on an a cappella dataset, compiled especially for this study. We show that the explicit modeling of phoneme durations improves alignment accuracy by an absolute 10 percent at the level of lyrics lines (phrases) and performs on par with state-of-the-art aligners for other languages.
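The abstract's central idea is feeding score-inferred phoneme durations into a duration-explicit HMM decoder. A minimal sketch of one ingredient of that pipeline, with hypothetical function and parameter names (not taken from the paper), is expanding per-phoneme durations from the sheet music into a frame-level grid that such a decoder could use as a duration prior:

```python
# Hypothetical sketch: map score-derived phoneme durations (seconds)
# to runs of analysis frames, the granularity at which MFCC features
# are computed and a DHMM decoder operates. Names are illustrative.
def durations_to_frames(phonemes, durations_sec, frame_rate=100):
    """Expand each phoneme into round(duration * frame_rate) frames,
    with a minimum of one frame per phoneme."""
    alignment = []
    for ph, dur in zip(phonemes, durations_sec):
        n_frames = max(1, round(dur * frame_rate))
        alignment.extend([ph] * n_frames)
    return alignment

# Example: a short syllable with note-derived durations; at 100 frames/s,
# 0.05 s -> 5 frames, 0.30 s -> 30 frames.
frames = durations_to_frames(["g", "ih", "t"], [0.05, 0.30, 0.05])
```

In a full DHMM aligner these expected durations would parameterize explicit state-duration distributions rather than fix the path outright; this sketch only illustrates the duration-to-frame bookkeeping.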