We present audio examples for our paper TTS-Transducer: End-to-end Speech Synthesis With Neural Transducer. To perform robust text-to-speech synthesis, we propose TTS-Transducer, a model that learns robust text-and-speech alignment without modifying the model architecture or requiring ground-truth text durations. Experiments demonstrate that our alignment learning procedure improves the reliability of TTS synthesis, especially for challenging text inputs, and outperforms prior LLM-based TTS models in both intelligibility and naturalness. We present audio samples generated by the TTS-Transducer model for both seen and unseen speakers.
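For a concrete picture of the transducer objective, the minimal sketch below shows how a joint network over a text encoder and an autoregressive codec-token prediction network can be trained with the RNN-T loss, whose blank symbol is what allows alignment to be learned without ground-truth durations. This is an illustration under stated assumptions, not the paper's implementation: the module names, dimensions, and the choice of the first codebook as the only target stream are all assumptions, and torchaudio's `rnnt_loss` stands in for whichever loss implementation is actually used.

```python
# Minimal sketch of a transducer-style TTS loss (illustrative assumptions,
# not the authors' code): text tokens feed the encoder, first-codebook
# codec tokens are the transducer targets, and a blank symbol lets the
# RNN-T loss marginalize over alignments.
import torch
import torch.nn as nn
import torchaudio

VOCAB_TEXT = 256        # assumed text-token vocabulary size
VOCAB_CODEC = 1024      # assumed codec codebook size
BLANK = VOCAB_CODEC     # extra class used as the transducer blank
DIM = 256

class TTSTransducerSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB_TEXT, DIM)
        self.encoder = nn.LSTM(DIM, DIM, batch_first=True)    # text encoder
        self.pred_emb = nn.Embedding(VOCAB_CODEC + 1, DIM)    # +1 for blank
        self.predictor = nn.LSTM(DIM, DIM, batch_first=True)  # codec-token LM
        self.joint = nn.Linear(2 * DIM, VOCAB_CODEC + 1)      # joint network

    def forward(self, text, codec):
        # text: (B, T) token ids; codec: (B, U) first-codebook ids
        enc, _ = self.encoder(self.text_emb(text))            # (B, T, D)
        # prepend blank as the start symbol for the prediction network
        start = torch.full((codec.size(0), 1), BLANK, dtype=codec.dtype)
        pred, _ = self.predictor(self.pred_emb(torch.cat([start, codec], 1)))
        # combine every (text step, codec step) pair: (B, T, U+1, classes)
        joint = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
             pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
        return self.joint(joint)

model = TTSTransducerSketch()
text = torch.randint(0, VOCAB_TEXT, (2, 12))    # toy batch of text tokens
codec = torch.randint(0, VOCAB_CODEC, (2, 40))  # toy codec-token targets
logits = model(text, codec)                     # (2, 12, 41, 1025)
loss = torchaudio.functional.rnnt_loss(
    logits, codec.int(),
    logit_lengths=torch.full((2,), 12, dtype=torch.int32),
    target_lengths=torch.full((2,), 40, dtype=torch.int32),
    blank=BLANK)
print(loss)
```

Because the loss sums over all monotonic paths through the text-by-codec lattice, no external aligner or duration predictor is needed during training.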
We evaluate different LLM-based TTS models on a set of 92 challenging texts (the full list can be found here). For each text, we synthesize audio with each model, using speakers from that model's voice presets. We present some of the audio samples from this experiment below.
Transcript | VALLE-X | Bark | SpeechT5 | TTS-Transducer Large (ours with EnCodec) | TTS-Transducer Large (ours with NeMo-Codec RVQ) |
---|---|---|---|---|---|
We also evaluate our TTS models on the same set of 92 challenging texts (the full list can be found here), this time tokenizing the text with an IPA-based tokenizer instead of BPE tokens. For each text, we synthesize audio with each model, using speakers from that model's voice presets. We present some of the audio samples from this experiment below.
Transcript | TTS-Transducer Large with IPA (ours with EnCodec) | TTS-Transducer Large with IPA (ours with NeMo-Codec RVQ) |
---|---|---|
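As a point of reference, the sketch below converts text to IPA with the open-source `phonemizer` library and its espeak backend; this particular library and the per-character token scheme are assumptions for illustration, not necessarily the tokenizer used in our experiments.

```python
# Illustrative IPA tokenization via the phonemizer library (requires the
# espeak-ng backend to be installed); an assumed stand-in, not the exact
# tokenizer from the paper.
from phonemizer import phonemize

text = "The quick brown fox jumps over the lazy dog."
ipa = phonemize(text, language="en-us", backend="espeak", strip=True)
tokens = list(ipa)  # simplest scheme: one token per IPA character
print(ipa)
print(tokens)
```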
In this section, we present TTS results when the reference audio is from a seen speaker and the text is from unseen holdout sentences. We train TTS-Transducer Large on three different audio codecs: EnCodec, DAC, and NeMo-Codec RVQ. We present results on holdout utterances from the Libri-TTS-R train-clean-100 set (seen speakers, unseen texts) below.
Transcript | Target Audio | EnCodec Predicted Audio | DAC Predicted Audio | NeMo-Codec RVQ Predicted Audio |
---|---|---|---|---|
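To illustrate what the codec targets look like, the sketch below extracts residual vector quantization (RVQ) token streams with the open-source EnCodec 24 kHz checkpoint; DAC and NeMo-Codec expose similar encode interfaces, and the exact preprocessing used in training is an assumption not shown here.

```python
# Sketch of extracting RVQ codec tokens with EnCodec
# (facebookresearch/encodec). The bandwidth setting and waveform are
# placeholders, assumed for illustration only.
import torch
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)          # 6 kbps -> 8 codebooks at 24 kHz
model.eval()

wav = torch.randn(1, 1, 24000)           # placeholder 1 s mono waveform
with torch.no_grad():
    frames = model.encode(wav)           # list of (codes, scale) per chunk
codes = torch.cat([c for c, _ in frames], dim=-1)  # (B, n_codebooks, T)
print(codes.shape)                       # e.g. torch.Size([1, 8, 75])
```

Each second of audio thus becomes a short grid of discrete tokens (here roughly 75 frames across 8 codebooks), which is what the transducer learns to predict from text.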
We also evaluate our models on the zero-shot TTS task, where the context audio is from an unseen speaker. We present audio samples generated by the TTS-Transducer model for unseen speakers from the Libri-TTS-R dev set, comparing TTS-Transducer Large models trained on three different audio codecs: EnCodec, DAC, and NeMo-Codec RVQ.
Transcript | Target Audio | EnCodec Predicted Audio | DAC Predicted Audio | NeMo-Codec RVQ Predicted Audio |
---|---|---|---|---|
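For intuition on how zero-shot voice cloning can work in a transducer framework, the sketch below performs greedy transducer decoding while prompting the prediction network with codec tokens from the reference audio. It reuses the illustrative `TTSTransducerSketch` model and `BLANK` symbol defined earlier, and the prompting mechanism is an assumption borrowed from common codec-LM practice, not the paper's exact inference procedure.

```python
# Hedged sketch of zero-shot prompting with greedy transducer decoding:
# the reference speaker's codec tokens warm up the prediction network so
# that generation continues in that voice. Assumes the TTSTransducerSketch
# model, BLANK, VOCAB_TEXT, and VOCAB_CODEC defined in the earlier sketch.
import torch

@torch.no_grad()
def greedy_decode(model, text, prompt_codec, max_tokens=500):
    enc, _ = model.encoder(model.text_emb(text))           # (1, T, D)
    # warm up the prediction network on blank + speaker prompt
    start = torch.full((1, 1), BLANK, dtype=torch.long)
    hist = torch.cat([start, prompt_codec], dim=1)
    pred, state = model.predictor(model.pred_emb(hist))
    pred_last = pred[:, -1:]                               # (1, 1, D)
    out, t = [], 0
    while t < enc.size(1) and len(out) < max_tokens:
        joint = model.joint(torch.cat([enc[:, t:t+1], pred_last], dim=-1))
        token = joint.argmax(-1).item()
        if token == BLANK:
            t += 1                                         # advance along text
        else:
            out.append(token)                              # emit a codec token
            emb = model.pred_emb(torch.tensor([[token]]))
            pred_last, state = model.predictor(emb, state)
    return out

codec_prompt = torch.randint(0, VOCAB_CODEC, (1, 150))     # ~2 s toy prompt
tokens = greedy_decode(model, torch.randint(0, VOCAB_TEXT, (1, 12)), codec_prompt)
print(len(tokens))
```

The blank token advances the decoder along the text, so the output length adapts to the input without any duration model; the emitted codec tokens would then be decoded back to a waveform by the codec's decoder.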