TTS-Transducer: End-to-end Speech Synthesis With Neural Transducer

We present audio examples for our paper TTS-Transducer: End-to-end Speech Synthesis With Neural Transducer. To perform robust text-to-speech synthesis, we propose TTS-Transducer, a model that learns robust text-speech alignment without modifications to the model architecture and without ground-truth text durations. Experiments demonstrate that our alignment learning procedure improves the reliability of TTS synthesis, especially for challenging text inputs, and that our model outperforms prior LLM-based TTS models in both intelligibility and naturalness. We present audio samples generated by the TTS-Transducer model for both seen and unseen speakers.
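To illustrate the idea, below is a minimal PyTorch sketch of the joint network at the heart of a neural transducer. All class names, dimensions, and shapes are our illustrative assumptions rather than the paper's exact implementation; the key point is the extra blank symbol, which lets the model learn a monotonic text-speech alignment during training instead of relying on externally supplied durations.

import torch
import torch.nn as nn

class TransducerJoint(nn.Module):
    """Sketch of a transducer joint network (hypothetical names/shapes)."""
    def __init__(self, enc_dim: int, pred_dim: int, joint_dim: int, num_codec_tokens: int):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        # +1 output for the blank symbol, which advances along the text axis
        self.out = nn.Linear(joint_dim, num_codec_tokens + 1)

    def forward(self, enc: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
        # enc:  (B, T_text, enc_dim)   states from the text encoder
        # pred: (B, U_audio, pred_dim) states from the autoregressive
        #                              prediction network over codec tokens
        # Broadcast-add to form a (B, T_text, U_audio, joint_dim) lattice.
        joint = self.enc_proj(enc).unsqueeze(2) + self.pred_proj(pred).unsqueeze(1)
        # Logits over codec tokens plus blank: (B, T_text, U_audio, V + 1).
        return self.out(torch.tanh(joint))

Training would then marginalize over all monotonic alignments with a transducer loss such as torchaudio.functional.rnnt_loss, which is what removes the need for ground-truth durations.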

TTS from Challenging Texts

We evaluate different LLM-based TTS models on a set of 92 challenging texts (the full list can be found here). For each text, we synthesize audio with each model, using speakers from that model's voice presets. Some of the audio samples from this experiment are presented below.

Transcript | VALLE-X | Bark | SpeechT5 | TTS-Transducer Large (ours with EnCodec) | TTS-Transducer Large (ours with NeMo-Codec RVQ)

We also evaluate our TTS models on the same set of 92 challenging texts (the full list can be found here), this time tokenizing the text with an IPA-based tokenizer instead of BPE tokens. For each text, we synthesize audio with each model, using speakers from that model's voice presets. Some of the audio samples from this experiment are presented below, followed by a short sketch of IPA tokenization.

Transcript | TTS-Transducer Large with IPA (ours with EnCodec) | TTS-Transducer Large with IPA (ours with NeMo-Codec RVQ)
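For readers unfamiliar with IPA input, here is a small sketch of converting graphemes to IPA symbols using the open-source phonemizer package with its espeak backend. This backend choice is our illustration; the paper's exact grapheme-to-phoneme pipeline may differ.

from phonemizer import phonemize

text = "The quick brown fox jumps over the lazy dog."
# Convert graphemes to an IPA string (espeak handles English well).
ipa = phonemize(text, language="en-us", backend="espeak", strip=True)
print(ipa)  # roughly: "ðə kwɪk bɹaʊn fɑːks dʒʌmps oʊvɚ ðə leɪzi dɔːɡ"

# Each IPA symbol (rather than a BPE subword) then becomes one input token
# for the text encoder; "|" marks an explicit word boundary here.
tokens = list(ipa.replace(" ", "|"))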

TTS-Transducer Generated Audio using Different Codecs for Seen Speakers

In this section, we present TTS results where the reference audio is from a seen speaker and the text comes from unseen holdout sentences. We train TTS-Transducer Large on three different audio codecs: EnCodec, DAC, and NeMo-Codec RVQ. Results on holdout utterances from the LibriTTS-R train-clean-100 set (seen speakers, unseen texts) are presented below, with a sketch of codec-token extraction after the table.

Transcript | Target Audio | EnCodec Predicted Audio | DAC Predicted Audio | NeMo-Codec RVQ Predicted Audio
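As a concrete example of the discrete tokens these models predict, here is a sketch of extracting residual-vector-quantized (RVQ) codes with the public EnCodec package. The file path and target bandwidth are illustrative assumptions; the DAC and NeMo codecs expose analogous encode/decode interfaces.

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.eval()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 RVQ codebooks at 24 kHz

wav, sr = torchaudio.load("sample.wav")  # hypothetical input file
# Resample/remix to the codec's expected format and add a batch dimension.
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)  # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)  # (B, n_codebooks, T)
print(codes.shape)  # these discrete tokens are what the transducer predicts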

TTS-Transducer Generated Audio using Different Codecs for Unseen Speakers

We also evaluate our models on the zero-shot TTS task, where the context audio comes from an unseen speaker. We present audio samples generated by the TTS-Transducer model for unseen speakers from the LibriTTS-R dev set, comparing TTS-Transducer Large models trained on three different audio codecs: EnCodec, DAC, and NeMo-Codec RVQ. A sketch of zero-shot speaker conditioning follows the table below.

Transcript | Target Audio | EnCodec Predicted Audio | DAC Predicted Audio | NeMo-Codec RVQ Predicted Audio
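For intuition, below is a minimal sketch of how zero-shot speaker conditioning can work in a codec-token TTS model: codec tokens extracted from the reference (context) audio warm up the autoregressive prediction network before decoding begins, so generation continues in the reference speaker's voice. Every method name here (text_encoder, prediction_net.step, greedy_decode) is a hypothetical placeholder, not the released API.

import torch

def zero_shot_tts(model, text_tokens: torch.Tensor, context_codes: torch.Tensor) -> torch.Tensor:
    """text_tokens: (1, T_text); context_codes: (1, n_codebooks, T_ctx)."""
    enc = model.text_encoder(text_tokens)  # encode the target text
    state = model.prediction_net.init_state()
    # Feed the reference speaker's codec tokens through the prediction
    # network so its state captures the speaker's voice characteristics.
    for t in range(context_codes.shape[-1]):
        _, state = model.prediction_net.step(context_codes[..., t], state)
    # Greedy transducer decoding then emits new codec tokens conditioned on
    # both the text encoding and the speaker prompt.
    return model.greedy_decode(enc, state)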