TTS Demo - Multilingual Samples - Erik Beltrán Lobato

This demo showcases a text-to-speech model I trained as a side project. I fine-tuned the Qwen3-4B language model with the WavTokenizer neural codec to enable multilingual speech generation in Spanish, Catalan, and English. To do this, I preprocessed a 300-hour multi-speaker audio dataset, extended the model's tokenizer with new discrete speech-code tokens, and trained the model to predict neural codec sequences. The system can condition on audio context to imitate speech patterns. All datasets used are in the public domain, and the resulting model is intended for research and educational purposes only. Hugging Face page: https://huggingface.co/ebellob/qwav3_4B
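
To make the tokenizer-extension step more concrete, here is a minimal sketch of how discrete codec indices could be mapped to new token IDs placed after the text vocabulary. It is not the code used for this model: the token format `<|speech_N|>`, the function names, the codebook size of 4096, and the toy base vocabulary size of 1000 are all assumptions chosen for illustration. In a Hugging Face setup, the same idea would usually be realised with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
from typing import Dict, List

# Assumed codebook size for the neural codec (illustrative, not confirmed by the source).
CODEBOOK_SIZE = 4096


def speech_token(code: int) -> str:
    """Return the (hypothetical) token string for one codec index."""
    return f"<|speech_{code}|>"


def build_speech_vocab(base_vocab_size: int,
                       codebook_size: int = CODEBOOK_SIZE) -> Dict[str, int]:
    """Assign each codec index a new token ID right after the text vocabulary."""
    return {speech_token(c): base_vocab_size + c for c in range(codebook_size)}


def codes_to_ids(codes: List[int], base_vocab_size: int,
                 codebook_size: int = CODEBOOK_SIZE) -> List[int]:
    """Map codec indices to token IDs the language model can be trained to predict."""
    ids = []
    for c in codes:
        if not 0 <= c < codebook_size:
            raise ValueError(f"codec index {c} outside [0, {codebook_size})")
        ids.append(base_vocab_size + c)
    return ids


def ids_to_codes(ids: List[int], base_vocab_size: int,
                 codebook_size: int = CODEBOOK_SIZE) -> List[int]:
    """Recover codec indices from generated IDs, skipping any non-speech tokens."""
    upper = base_vocab_size + codebook_size
    return [i - base_vocab_size for i in ids if base_vocab_size <= i < upper]


if __name__ == "__main__":
    base = 1000  # toy text-vocabulary size, not the real Qwen3 value
    vocab = build_speech_vocab(base)
    print(len(vocab))                          # 4096 new tokens
    print(codes_to_ids([0, 5, 4095], base))    # [1000, 1005, 5095]
    print(ids_to_codes([3, 1000, 5095], base)) # [0, 4095]
```

Keeping the speech tokens in one contiguous ID range makes decoding simple: any generated ID inside that range is converted back to a codec index and handed to the codec decoder, while everything else is treated as text.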
