TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling

Abstract

Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations: (1) dependence on multi-layer residual vector quantization structures or high frame rates, (2) reliance on auxiliary pre-trained models for semantic distillation, and (3) a need for complex two-stage training processes. In this work, we introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach designed to overcome these challenges. TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS). Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. We also validate the compatibility of TaDiCodec with language-model-based zero-shot text-to-speech using both autoregressive modeling and masked generative modeling, demonstrating its effectiveness and efficiency for speech language modeling, as well as a notably small gap between reconstruction and generation quality. We will open-source our code and model checkpoints.
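The frame rate and bitrate quoted in the abstract can be related with simple arithmetic. The sketch below assumes the standard relation bitrate = frame rate × bits per frame, with one token per frame from a single codebook. Under that assumption, 0.0875 kbps at 6.25 Hz works out to 14 bits per token, which would correspond to a codebook of 2^14 = 16384 entries. The codebook size is inferred here, not stated on this page, and the function names are illustrative only.

```python
import math


def bitrate_kbps(frame_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bitrate in kbps, assuming each frame emits one index per codebook."""
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate_hz * bits_per_frame / 1000.0


def bits_per_token(kbps: float, frame_rate_hz: float) -> float:
    """Bits carried by each token for a single-codebook tokenizer."""
    return kbps * 1000.0 / frame_rate_hz


# Figures taken from the abstract: 6.25 Hz and 0.0875 kbps.
bits = bits_per_token(0.0875, 6.25)

# ASSUMPTION: the codebook size is inferred from the bit count, not given on the page.
implied_codebook_size = 2 ** round(bits)

# Number of frames (tokens) for a 10-second utterance at 6.25 Hz.
tokens_for_10_s = 6.25 * 10

print(f"bits per token: {bits:.2f}")
print(f"implied codebook size: {implied_codebook_size}")
print(f"bitrate check: {bitrate_kbps(6.25, implied_codebook_size):.4f} kbps")
print(f"tokens for 10 s of speech: {tokens_for_10_s}")
```

At this frame rate a 10-second utterance is represented by roughly 62 tokens, which is why such a low-rate tokenizer keeps sequences short for language modeling.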

This page is for research demonstration purposes only.

TaDiCodec Architecture

Figure 1: Comparison between TaDiCodec and other speech tokenizers. A three-dimensional coordinate system displays performance along three dimensions: the x-axis represents WER, the y-axis represents UTMOS, and the z-axis represents SIM. Marker size is proportional to the bitrate in kbps.

Speech Reconstruction Comparison with Other Systems

This section compares the speech reconstruction quality of various systems with different bitrates.

Sample | Ground Truth | Mimi (1.1 kbps) | WavTokenizer (0.9 kbps) | BiCodec (0.65 kbps) | DualCodec (0.925 kbps) | Ours (0.0875 kbps)
In-the-Wild, EN
In-the-Wild, ZH
Singing
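To put the bitrates in the column headers above on a common scale, the following sketch divides each baseline bitrate by the 0.0875 kbps reported for TaDiCodec. The numbers are copied from the table header; the variable names are illustrative, and the ratios describe bitrate only, not reconstruction quality.

```python
# Bitrates in kbps, copied from the comparison table header.
baseline_kbps = {
    "Mimi": 1.1,
    "WavTokenizer": 0.9,
    "BiCodec": 0.65,
    "DualCodec": 0.925,
}
tadicodec_kbps = 0.0875

# How many times more bits per second each baseline uses than TaDiCodec.
ratios = {name: kbps / tadicodec_kbps for name, kbps in baseline_kbps.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.1f}x the bitrate of TaDiCodec")
```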

Zero-Shot TTS

This section demonstrates zero-shot TTS built on TaDiCodec across four case types: regular, code-switching, cross-lingual, and articulatory (tongue twisters and repeated syllables).

Case Type | Target Text | Prompt Speech | Generated Speech
Regular
  • You think you can just waltz in here and cause chaos? Well, I've got news for you. This time, there's no escaping the consequences. So, one by one, step forward, and let's see who’s bold enough to face the music. It's time for a little dose of reality—prepare to be dealt with!
  • Let it envelop you like a cozy blanket, reminding you of the moments shared and the lessons learned. Embrace the wisdom that flows from those experiences, letting it guide you through challenges and inspire your dreams. Each memory is a treasure, a flicker of light that can illuminate the darkest paths. Let the love of your elder resonate within, fueling your journey forward with strength and compassion.
  • It might give you the space to collect your thoughts and emotions, to reflect on everything we've been through. After all, sometimes a little distance can provide clarity, and I want you to feel comfortable.
  • I can't shake this feeling of guilt, this notion that happiness is something I have to earn, not something I can simply take. Every laugh feels forced, every smile feels like a mask, hiding the turmoil within. I keep reminding myself of the struggles around me, the battles that so many face daily. How could I possibly revel in joy when there’s so much sorrow in the world? It's a constant tug-of-war between wanting to feel free and the weight of responsibility pressing down on my heart.
  • 实在令人心动,仿佛整个世界都在她的笑容中变得温柔。那一瞬间,时光仿佛凝固,周围的一切都失去了颜色,只有她的笑容如同春日的阳光,温暖而明亮。朕不禁想,是否有幸能常伴在她身侧,共享这份宁静与美好。人生如梦,唯愿这一刻能长久停留。
  • 臣愿以忠诚之心,竭尽所能,捧心事于朝廷,效力于国家。愿以微薄之力,与众同心,共同开创繁荣昌盛的新局面。若有不妥之处,恳请陛下明示,我必当勤修焉。
  • 一缕温暖的阳光透过树梢洒下,将雪地的白色映衬得愈加明亮。茶香袅袅升起,伴随着微风,仿佛连空气都变得甜美起来。这样恬静的午后,似乎时间都静止了,只有心中的宁静与这片白色世界相互交融。
Code-Switching
  • It's truly heartwarming,仿佛 the entire world becomes gentle in her smile. In that瞬间, time seems to freeze, and everything around失去了颜色, only her smile shines like春日的阳光, warm and bright. I can't help but wonder,是否有幸 to always be by her side,分享这份宁静与美好. Life is like a dream, I only wish this moment能长久停留.
  • I am willing to以忠诚之心, do my utmost, and捧心事于朝廷, serve the国家. I wish to以微薄之力, join hands with everyone, and一起开创繁荣昌盛的新局面. If there are any不妥之处, I humbly ask your majesty to明示, and I shall certainly勤修焉.
  • 和各种 stage performances。在这段时间里,我努力提升自己的能力,希望能够通过 my music 和 dance,将正能量传递给每一个人。无论未来的 road 多么艰辛,我都会坚持追求自己的 dreams。期待在未来的 stage 上与大家见面,感谢大家的 support!
  • But her mood仍然难以释怀,friends 对此感到担忧。Gradually, 一些人开始主动联系她,trying to understand her recent situation,甚至提议一起出去散心。After几次的邀请,她终于 agreed,决定让自己走出阴霾,重新面对生活。That day,她走出家门,sunlight洒在脸上,似乎带来了一丝久违的 warmth与希望。
  • In this fleeting passage of time, 我们常常忙于追寻, yet we overlook the beauty around us. 每一刻都是独一无二的记忆, worth savoring. Cherish the present, 感恩生活, and let our souls find light in the ordinary. No matter how the future changes, may we迎接每一个新的日出 with a calm heart.
Cross-lingual
  • Yes, usually people choose to face life with more positive emotions, after all, happy times are always yearning. However, sometimes slowing down and experiencing the details of life can bring deeper joy and satisfaction. What do you think?
  • I know that this matter is of great importance and must not be taken lightly. We can never let her safety be compromised. Therefore, I quickly began to gather all the resources that might be needed, contacting those who could help, ensuring that she can be brought back safely no matter what. In my heart, I quietly prayed, hoping everything would go smoothly.
  • 在那之后,技术、科学和文化等各个领域的新发展和见解不断涌现。保持最新信息非常重要,以保持我们的理解与时俱进且相关。参与最近的出版物、讨论和更新将增强我们对不断变化的话题的知识和视角。在这个信息不断发展和变化的世界中,持续学习的过程至关重要。
  • 这可能会给你空间来整理你的想法和情感,反思我们经历的一切。毕竟,有时候一点距离可以提供清晰度,我希望你感到自在。
Articulatory
  • Jittery Jack's jam jars jiggled jauntily, jolting Jack's jumbled jelly-filled jars joyously.
  • Cindy's circular cymbals clanged cheerfully, clashing crazily near Carla's crashing crockery.
  • 是啊,通常大家都会选择更积极的情绪来面对对生活,毕竟竟竟竟快乐的时光总是让人向往。不过,有时时时候慢下来,去感受生活中的细节,反而能能能带来更深刻的乐趣趣趣和满足感。你觉得呢呢呢呢?
  • 谢老爹和薛大爷谢老爹在街上扫雪,薛大爷在屋里打铁。薛大爷见谢老爹扫雪,就放下手里打着的铁,到街上帮谢老爹扫雪。谢老爹扫完街上的雪,进屋去帮薛大爷打铁。两人同扫雪,两人同打铁。
  • 寡妇马华莎,光汉贾家嘉。马华莎脸上麻,贾家嘉独眼瞎,两人登记成了家。贾家嘉不嫌马华莎麻,马华莎不嫌贾家嘉瞎。