| | <!DOCTYPE html> |
| | <html lang="en"> |
| |
|
| | <head> |
| | <meta charset="UTF-8" /> |
| | <meta name="viewport" content="width=device-width, initial-scale=1.0" /> |
| | <title>MiniMax-Speech Tech Report | Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder</title> |
| | <meta name="description" |
| | content="MiniMax-Speech is an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech." /> |
| | <meta name="keywords" content="MiniMax-Speech,text-to-speech,TTS,zero-shot voice cloning,speaker encoder,Flow-VAE" /> |
| | <meta property="og:title" |
| | content="MiniMax-Speech Tech Report | Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder" /> |
| | <meta property="og:url" content="https://minimax-ai.github.io/tts_tech_report" /> |
| | <meta property="og:description" |
| | content="MiniMax-Speech is an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech." /> |
| | <meta property="og:type" content="website" /> |
| |
|
| | <link rel="stylesheet" href="style.css" /> |
| | </head> |
| |
|
| | <body id="top" class="text-justify"> |
| | <header |
| | style="background-image: url('assets/images/header-bg.jpeg'); background-size: cover; background-position: center; padding: 1rem 0; border-radius: 1rem;"> |
| | <h1>MiniMax-Speech</h1> |
| | <h4 style="font-size: 1.3rem; line-height: 1; text-align: center;">Intrinsic Zero-Shot Text-to-Speech |
| | with a |
| | Learnable Speaker |
| | Encoder</h4> |
| | <p class="author"> |
| | MiniMax Team <span class="date">May 2025</span><br /> |
| | <a style="font-size: 1.1rem;" target="_blank" href="https://arxiv.org/abs/2505.07916">[Tech |
| | Report]</a> |
| | <a style="font-size: 1.1rem; margin-left: 1rem;" target="_blank" |
| | href="https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set">[Multilingual Test Set]</a> |
| | <a style="font-size: 1.1rem; margin-left: 1rem;" target="_blank" href="https://github.com/MiniMax-AI">[GitHub]</a> |
| | </p> |
| | </header> |
| |
|
| | <div class="abstract"> |
| | <h2>Abstract</h2> |
| | <p style="text-align: left;"> |
| | We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates |
| | high-quality |
| | speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio |
| | without |
| | requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre |
| | consistent with |
| | the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high |
| | similarity to |
| | the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed |
| | Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and |
| | subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning |
| | metrics |
| | (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. |
| | Another |
| | key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, |
| | is its |
| | extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion |
| | control |
| | via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional |
| | voice |
| | cloning (PVC) by fine-tuning timbre features with additional data. |
| | </p> |
| | </div> |
| |
|
| | <nav role="navigation" class="toc"> |
| | <h2>Explore MiniMax-Speech</h2> |
| | <p>Visit |
| | <a href="https://www.minimax.io/audio">MiniMax Audio</a> to |
| | explore our powerful TTS features. |
| | </p> |
| | <h2>Contents</h2> |
| | <ol> |
| | <li> |
| | <a href="#architecture-overview">Architecture Overview</a> |
| | </li> |
| | <li> |
| | <a href="#expressiveness-demonstrations">Expressiveness Demonstrations</a> |
| | <ol> |
| | <li><a href="#showcase-with-high-versatility">Showcase with High Versatility</a></li> |
| | <li><a href="#showcase-with-multiple-generation-attempts">Showcase with Multiple Generation Attempts</a></li> |
| | </ol> |
| | </li> |
| | <li><a href="#zero-shot-vs-one-shot-demonstrations">Zero-Shot vs. One-Shot Demonstrations</a></li> |
| | <li><a href="#multilingual-and-cross-lingual-capabilities-demonstrations">Multilingual and Cross-Lingual |
| | Capabilities Demonstrations</a></li> |
| | <li><a href="#flow-vae-vs-vae-comparisons">Flow-VAE vs. VAE Comparisons</a></li> |
| | <li><a href="#professional-voice-clone-pvc-demonstrations">Professional Voice Clone (PVC) Demonstrations</a></li> |
| | <li><a href="#emotion-control-demonstrations">Emotion Control Demonstrations</a></li> |
| | <li><a href="#text-prompted-voice-generation-demonstrations">Text-Prompted Voice Generation Demonstrations</a> |
| | </li> |
| | <li><a href="#comparison-of-voice-naturalness">Comparison of voice |
| | naturalness with the previous generation products</a></li> |
| | <li><a href="#citation">Citation</a></li> |
| | </ol> |
| | </nav> |
| |
|
| | <main> |
| | <article> |
| | <div class="article-block"> |
| | <h2 id="architecture-overview">Architecture Overview</h2> |
| | <figure> |
| | <img src="assets/images/system-overview.jpg" loading="lazy" alt="System Architecture" width="100%" |
| | height="auto" /> |
| | <figcaption> |
| | An overview of the architecture of MiniMax-Speech. |
| | </figcaption> |
| | </figure> |
| | </div> |
| |
|
| | <div class="article-block"> |
| | <h2 id="expressiveness-demonstrations">Expressiveness Demonstrations</h2> |
| | <h3 id="showcase-with-high-versatility">Showcase with High Versatility</h3> |
| | <div class="scroll-wrapper"> |
| | <table style="width: 100%;"> |
| | <tbody> |
| | <tr class="border-bottom-thin"> |
| | <th scope="col" style="width: 40%;">Description</th> |
| | <th scope="col" style="width: 30%; text-align: center;">Source Audio</th> |
| | <th scope="col" style="width: 30%; text-align: center;">Generated Audio</th> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | A Compelling and Persuasive Speaker Voice |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Marketing_Voice_Sourse.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Compelling%20and%20Persuasive.wav" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | A Clear and Explanatory Voice with Broad Emotional Dynamics Across Different Texts |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Science_Voice_Sourse.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Explanatory%20Broad%20Emotional.wav" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | Another Explanatory Voice with Exceptionally Natural Prosody, <br> |
| | Featuring Distinct Ethnic and Age Characteristics |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Sociology_Sourse.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Explanatory%20Supernatural%20Prosody.MP3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | A Warm and Magnetic Voice that Brings Comfort |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Warm%20and%20Magnetic_Sourse.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Warm%20and%20Magnetic.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | An ASMR Whispering Voice with Generated Breathing and Sound Effects |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Breathy%20ASMR_Sourse.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Breathy%20ASMR.MP3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | A Robotic Voice with Rich Bass Resonance and Spatial Presence |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Lucky%20Robot_Sourse.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Lucky%20Robot.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | A Sardonic Mature Female Voice |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Onee-san_Sourse.MP3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Onee-san.wav" controls></audio> |
| | </td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | </div> |
| |
|
| | <h3 id="showcase-with-multiple-generation-attempts">Showcase with Multiple Generation Attempts, Post-Processing |
| | Audio Effects and Added Sound Effects</h3> |
| | <div class="scroll-wrapper"> |
| | <table style="width: 100%;"> |
| | <tbody> |
| | <tr class="border-bottom-thin"> |
| | <th scope="col" style="width: 50%;">Description</th> |
| | <th scope="col" style="width: 50%; text-align: center;">Generated Audio</th> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | A Husky Male Voice: From Soft Murmur to Excitement to Anger, then to Whispers |
| | </td> |
| | <td> |
| | <audio class="audio-lg" src="assets/audios/Murmur-Excitement-Anger-%20Whispers.MP3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | An Angry Female Voice: From Soft Murmur to Rage to Reminiscence, then to Weeping |
| | </td> |
| | <td> |
| | <audio class="audio-lg" src="assets/audios/Neutral-Rage-Reminiscence-Weeping.MP3" controls></audio> |
| | </td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | </div> |
| | </div> |
| |
|
| | <div class="article-block"> |
| | <h2 id="zero-shot-vs-one-shot-demonstrations">Zero-Shot vs. One-Shot Demonstrations</h2> |
| | <p> |
| | Zero-Shot preserves speaker identity while generating more natural emotions, pauses, and other expressive |
| | features based |
| | on the text content, whereas One-Shot adheres more closely to the reference speaker's characteristics (prosody, speech |
| | rate, |
| | emotions, etc.). For details on Zero-Shot and One-Shot, refer to the <a |
| | href="https://arxiv.org/abs/2505.07916" target="_blank">technical report</a>. |
| | </p> |
| | <div class="scroll-wrapper" style="margin-top: 2rem;"> |
| | <table style="width: 100%;"> |
| | <tbody> |
| | <tr class="border-bottom-thin"> |
| | <th scope="col">Source Audio</th> |
| | <th scope="col">Text</th> |
| | <th scope="col">Zero-Shot Version</th> |
| | <th scope="col">One-Shot Version</th> |
| | <th scope="col">ElevenLabs Multilingual_v2</th> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Lyrical%20Cantonese_Prompt.WAV" controls></audio> |
| | </td> |
| | <td> |
| | 命运就算颠沛流离,<br> |
| | 命运就算曲折离奇,<br> |
| | 命运就算恐吓着你,<br> |
| | 做人没趣味。<br> |
| | 别流泪,心酸,更不应舍弃。<br> |
| | 我愿能,一生永远陪伴你。 |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Lyrical%20Cantonese_ZeroShot.mp3" controls></audio> |
| | Preserving Distinctive Voice<br> |
| | Timbre and Expressive <br> |
| | Prosody with Regularized <br> |
| | Pausing and Speech Rate |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Lyrical%20Cantonese_Oneshot.mp3" controls></audio> |
| | Better Reproduction of<br> |
| | Prompt's Exaggerated Speech<br> |
| | Rate and Characteristic<br> |
| | Phrase-Initial Pauses |
| | </td> |
| | <td> |
| | Cantonese not supported |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Breaking%20Down%20Mandarin_Prompt.WAV" controls></audio> |
| | </td> |
| | <td> |
| | 你们这些躲在道德高地的懦夫,<br> |
| | 敢承认自己对本我的恐惧吗?<br> |
| | 回答我!嗯?你回答我!<br> |
| | Look in my eyes!<br> |
| | 老子写梦的解析时<br> |
| | 你们还在玩泥巴,<br> |
| | 我精神分析引论每个字母都能<br> |
| | 刺穿文明社会的虚伪面具,<br> |
| | 我解剖潜意识就像<br> |
| | 外科医生划开皮肤。<br> |
| | 是不是啊?说话! |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Breaking%20Down%20Mandarin_ZeroShot.mp3" controls></audio> |
| | Capable of Generating<br> |
| | Relatively Calmer Emotions<br> |
| | while Preserving Voice<br> |
| | Identity |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Breaking%20Down%20Mandarin_OneShot.mp3" controls></audio> |
| | Consistently Reproducing the<br> |
| | Angry Emotion from Prompt<br> |
| | in Every Utterance |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/ElevenLabs_Breaking%20Down%20Mandarin.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Quirky%20Female%20English_Prompt.MP3" controls></audio> |
| | </td> |
| | <td> |
| | Would you believe what happened at the<br> |
| | grocery store today? My goodness! The<br> |
| | avocados were on sale - half price! Half<br> |
| | price! I bought twenty of them! |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Quirky%20Female%20English_ZeroShot.MP3" controls></audio> |
| | Effectively follows textual cues<br> |
| | for both longer and shorter<br> |
| | inter-sentence pauses |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Quirky%20Female%20English_OneShot.MP3" controls></audio> |
| | Better reproduces the<br> |
| | exaggerated high pitch<br> |
| | characteristic of anime voices<br> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/ElevenLabs_Quirky%20Female%20English.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Neurotic%20Teenage%20English_Prompt.MP3" controls></audio> |
| | </td> |
| | <td> |
| | Oh my gosh, like, I literally can't believe<br> |
| | what just happened! Um, so basically, I was,<br> |
| | you know, just sitting there in class,<br> |
| | right? And then, ugh, this totally weird<br> |
| | thing happened - like, seriously weird! Wait,<br> |
| | wait... Should I even be talking about this?<br> |
| | Ugh, whatever. |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Neurotic%20Teenage%20English_ZeroShot.MP3" |
| | controls></audio> |
| | Effectively follows textual cues<br> |
| | for both longer and shorter<br> |
| | inter-sentence pauses |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Neurotic%20Teenage%20English_OneShot.MP3" controls></audio> |
| | Better reproduces the<br> |
| | exaggerated high pitch<br> |
| | characteristic of anime voices<br> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/ElevenLabs_Neurotic%20Teenage%20English.mp3" |
| | controls></audio> |
| | </td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | </div> |
| | </div> |
| |
|
| | <div class="article-block"> |
| | <h2 id="multilingual-and-cross-lingual-capabilities-demonstrations">Multilingual and Cross-Lingual Capabilities |
| | Demonstrations</h2> |
| | <p>Speech-02-HD maintains high naturalness in less common languages while demonstrating significant advantages |
| | in |
| | Standard |
| | Chinese pronunciation accuracy.</p> |
| | <div class="scroll-wrapper" style="margin-top: 2rem;"> |
| | <table style="width: 100%;"> |
| | <tbody> |
| | <tr class="border-bottom-thin"> |
| | <th scope="col">Languages</th> |
| | <th scope="col">Source Audio</th> |
| | <th scope="col">Text</th> |
| | <th scope="col">MiniMax<br>Speech_02_HD</th> |
| | <th scope="col">ElevenLabs<br>Multilingual_v2</th> |
| | <th scope="col">OpenAI<br>TTS_1_HD<br>(*not cloned voice)</th> |
| | </tr> |
| | |
| | <tr class="border-bottom-thin"> |
| | <th>Thai</th> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Thai_Male_Sourse.wav" controls></audio> |
| | </td> |
| | <td> |
| | สวัสดีค่ะ วันนี้อากาศดีมากเลย<br> |
| | คุณจะไปทานอาหารกลางวันที่ไหนคะ<br> |
| | ฉันกำลังคิดว่าจะไปร้านอาหารไทยแถวนี้<br> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Thai.mp3" controls></audio> |
| | </td> |
| | <td> |
| | Thai not fully supported |
| | <audio class="audio-sm" src="assets/audios/ElevenLabs_Thai.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/OpenAI_Thai.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | |
| | <tr class="border-bottom-thin"> |
| | <th>Vietnamese</th> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Vietnamese_Female_Sourse.wav" controls></audio> |
| | </td> |
| | <td> |
| | Tôi đang đọc một cuốn sách rất hay về lịch sử Việt Nam.<br> |
| | Những câu chuyện về văn hóa truyền<br> |
| | thống thật sự rất thú vị.<br> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Vietnamese.mp3" controls></audio> |
| | </td> |
| | <td> |
| | Vietnamese not fully supported |
| | <audio class="audio-sm" src="assets/audios/ElevenLabs_Vietnamese.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/OpenAI_Vietnamese.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | |
| | <tr class="border-bottom-thin"> |
| | <th>Czech</th> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Czech_Female_Sourse.wav" controls></audio> |
| | </td> |
| | <td> |
| | Ranní mlha se pomalu zvedá nad řekou,<br> |
| | zatímco první paprsky slunce prosvítají mezi stromy.<br> |
| | Ptáci začínají svůj ranní koncert.<br> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Czech.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/ElevenLabs_Czech.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/OpenAI_Czech.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | |
| | <tr class="border-bottom-thin"> |
| | <th>Polish</th> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Polish_Male_Sourse.wav" controls></audio> |
| | </td> |
| | <td> |
| | Młoda sowa siedzi cicho na gałęzi sosny,<br> |
| | obserwując leśną polanę w świetle księżyca.<br> |
| | Wiatr delikatnie porusza liśćmi drzew.<br> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Polish.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/ElevenLabs_Polish.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/OpenAI_Polish.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | |
| | <tr class="border-bottom-thin"> |
| | <th>Japanese</th> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Japanese_DominantMan_Sourse.mp3" controls></audio> |
| | </td> |
| | <td> |
| | 電車が遅延している影響で、渋谷駅がとても混雑<br> |
| | しています。次の山手線は約10分後に到着<br> |
| | 予定です。お急ぎのお客様は、他の路線も<br> |
| | ご利用ください。 |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Japanese.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/ElevenLabs_Japanese_Dominant_Man.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/OpenAI_Japanese.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | </div> |
| | <p style="margin-top: 4rem;">Speech-02-HD has superior performance in zero-shot cross-lingual scenarios.</p> |
| | <div class="scroll-wrapper" style="margin-top: 2rem;"> |
| | <table style="width: 100%;"> |
| | <tbody> |
| | <tr class="border-bottom-thin"> |
| | <th scope="col">Original Language</th> |
| | <th scope="col">Source Audio</th> |
| | <th scope="col">Mixed Language</th> |
| | <th scope="col">Text</th> |
| | <th scope="col">MiniMax<br>Speech_02_HD</th> |
| | <th scope="col">ElevenLabs<br>Multilingual_v2</th> |
| | <th scope="col">OpenAI<br>TTS_1_HD<br>(*not cloned voice)</th> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td>English</td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Wong_Sourse.mp3" controls></audio> |
| | </td> |
| | <td>English + Mandarin</td> |
| | <td> |
| | Kiddo! Come come come, 学如逆水行舟,不进则退。<br> |
| | I see you're using AI tools already - so smart!<br> |
| | But eh, cannot just rely on tools only lah!<br> |
| | The future belongs to those who can work alongside AI,<br> |
| | not those scared of it. |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/English-Mandarin.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/ElevenLabs_English-Mandarin.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/OpenAI_English-Mandarin.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td>Mandarin</td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/ShiBanYu_Sourse.mp3" controls></audio> |
| | </td> |
| | <td>Mandarin + Cantonese</td> |
| | <td> |
| | 老铁啊,多谢晒你送我呢本,广州话正音字典,咁好嘢喎!<br> |
| | 我呢个大老爷们儿学广州话真系好难㗎!成日都分唔清声调啊。<br> |
| | 嗱,而家有咗呢本书,什么都好啦。 |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Mandarin-Cantonese.MP3" controls></audio> |
| | </td> |
| | <td> |
| | Cantonese not supported |
| | </td> |
| | <td> |
| | Cantonese not supported |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td>Mandarin</td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/ShuanQ_Sourse.mp3" controls></audio> |
| | </td> |
| | <td>Mandarin + English</td> |
| | <td> |
| | The people said, 桂林's scenery is the first under heaven.<br> |
| | Yet in my opinion, 阳朔 scenery is better than 桂林。<br> |
| | 群峰倒影山浮水,无水无山不入神。 |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Mandarin-English.WAV" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/ElevenLabs_Mandarin-English.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/OpenAI_Mandarin-English.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td>English</td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/CoCo_Sourse.mp3" controls></audio> |
| | </td> |
| | <td>English + Spanish</td> |
| | <td> |
| | Mi abuelita always told me "el que persevera, alcanza".<br> |
| | If you persevere, you'll achieve your dreams!<br> |
| | Guess what! They choose me to play the lead role in our BIG show! |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/English-Spanish.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/ElevenLabs_English-Spanish.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/OpenAI_English-Spanish.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td>Japanese</td> |
| | <td> |
| | <audio class="audio-sm" src="assets/audios/Powerful_Girl_Sourse.mp3" controls></audio> |
| | </td> |
| | <td>Japanese + Korean</td> |
| | <td> |
| | 最近の天気予報によりますと、今週末は桜の開花に最適<br> |
| | な気温になる予定です。<br> |
| | 東京都内の各公園では花見客で賑わうことが予想されますが、<br> |
| | 서울에서도 벚꽃이 피기 시작했다고 하네요.<br> |
| | 이번 주말에는 여의도 공원에서 벚꽃 축제가 열린다고 하니<br> |
| | 많은 분들이 찾아오실 것 같습니다. |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Japanese-Korean.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/ElevenLabs_Japanese-Korean.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/OpenAI_Japanese-Korean.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | </div> |
| | <p>*OpenAI does not currently support voice cloning, so its samples use a non-cloned voice. We include them |
| | in the comparative listening tests |
| | as a reference for naturalness.</p> |
| | </div> |
| |
|
| | <div class="article-block"> |
| | <h2 id="flow-vae-vs-vae-comparisons">Flow-VAE vs. VAE Comparisons</h2> |
| | <p>Flow-VAE is less likely to produce the following instabilities.</p> |
| | <div class="scroll-wrapper" style="margin-top: 2rem;"> |
| | <table style="width: 100%;"> |
| | <tbody> |
| | <tr class="border-bottom-thin"> |
| | <th scope="col" style="text-align: center;">Source Audio</th> |
| | <th scope="col" style="text-align: center;">Flow-VAE</th> |
| | <th scope="col" style="text-align: center;">VAE</th> |
| | <th scope="col" style="text-align: center;">Differences</th> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td style="width: 25%"> |
| | <audio src="assets/audios/Condition1.wav" controls></audio> |
| | </td> |
| | <td style="width: 25%"> |
| | <audio src="assets/audios/FlowVAE1.wav" controls></audio> |
| | </td> |
| | <td style="width: 25%"> |
| | <audio src="assets/audios/VAE1.wav" controls></audio> |
| | </td> |
| | <td> |
| | Flow-VAE reproduces more continuous<br> |
| | and natural reverberation |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio src="assets/audios/Condition2.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio src="assets/audios/FlowVAE2.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio src="assets/audios/VAE2.wav" controls></audio> |
| | </td> |
| | <td> |
| | VAE introduces unwanted<br> |
| | high-frequency components |
| | </td> |
| | </tr> |
| | <tr> |
| | <td> |
| | <audio src="assets/audios/Conditon3.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio src="assets/audios/FlowVAE3.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio src="assets/audios/VAE3.wav" controls></audio> |
| | </td> |
| | <td> |
| | VAE produces electronic-sounding<br> |
| | artifacts at the beginning |
| | </td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | </div> |
| | </div> |
| |
|
| | <div class="article-block"> |
| | <h2 id="professional-voice-clone-pvc-demonstrations">Professional Voice Clone (PVC) Demonstrations</h2> |
| | <p>For more complex dialectal accents and tonal characteristics, PVC can reproduce these features while |
| | maintaining high |
| | naturalness based on the text content.</p> |
| | <div class="scroll-wrapper" style="margin-top: 2rem;"> |
| | <table style="width: 100%;"> |
| | <tbody> |
| | <tr class="border-bottom-thin"> |
| | <th scope="col" style="text-align: center;">Source Audio</th> |
| | <th scope="col" style="text-align: center;">Zero-Shot</th> |
| | <th scope="col" style="text-align: center;">PVC</th> |
| | <th scope="col" style="text-align: center;">Differences</th> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td style="width: 25%"> |
| | <audio src="assets/audios/JosephBrodsky_Source.wav" controls></audio> |
| | </td> |
| | <td style="width: 25%"> |
| | <audio src="assets/audios/JosephBrodsky_Fast.mp3" controls></audio> |
| | </td> |
| | <td style="width: 25%"> |
| | <audio src="assets/audios/JosephBrodsky_PVC.mp3" controls></audio> |
| | </td> |
| | <td> |
| | Like the Zero-Shot version, the PVC<br> |
| | version has rising sentence-final intonation,<br> |
| | but distinctively sustains this<br> |
| | elevated pitch instead of the typical<br> |
| | pitch declination found in common<br> |
| | declarative sentences |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio src="assets/audios/TianJin_Source.wav" controls></audio> |
| | </td> |
| | <td> |
| | <audio src="assets/audios/TianJin_Fast.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio src="assets/audios/TianJin_PVC.mp3" controls></audio> |
| | </td> |
| | <td> |
| | With more data, the model not only<br> |
| | reproduces the speaker's voice characteristics<br> |
| | but also accurately captures more<br> |
| | dialectal features |
| | </td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | </div> |
| | </div> |
| |
|
| | <div class="article-block"> |
| | <h2 id="emotion-control-demonstrations">Emotion Control Demonstrations</h2> |
| | <h3>Source Audio for Refreshing Young Man</h3> |
| | <audio src="assets/audios/Mandarin_Refreshing_Young_Man_Sourse.mp3" controls></audio> |
| | <h3>DEMO</h3> |
| | <div class="scroll-wrapper"> |
| | <table style="width: 100%;"> |
| | <tbody> |
| | <tr class="border-bottom-thin"> |
| | <th scope="col">Neutral</th> |
| | <th scope="col" style="min-width: 120px;">Emotion</th> |
| | <th scope="col">Text</th> |
| | <th scope="col">Emotion Control Audio</th> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Neutral1.mp3" controls></audio> |
| | </td> |
| | <td> |
| | Surprised |
| | </td> |
| | <td> |
| | 天哪!我完全没想到会在这里遇见你,<br> |
| | 都过去这么多年了,你一点都没变! |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Surprised.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Neutral2.mp3" controls></audio> |
| | </td> |
| | <td> |
| | Disgusted |
| | </td> |
| | <td> |
| | 这个地方实在太脏乱了,到处都是垃圾和难闻的气味儿,<br> |
| | 我一秒钟都不想多待。 |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Disgusted.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Neutral3.mp3" controls></audio> |
| | </td> |
| | <td> |
| | Fearful |
| | </td> |
| | <td> |
| | 深夜回家的路上,我清楚地听见身后有脚步声在跟着我,<br> |
| | 可是回头却什么都看不见。 |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Fearful.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Neutral4.mp3" controls></audio> |
| | </td> |
| | <td> |
| | Angry |
| | </td> |
| | <td> |
| | 我付出了这么多,换来的却是这样的背叛!<br> |
| | 你怎么可以这样对待我的信任! |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Angry.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Neutral5.mp3" controls></audio> |
| | </td> |
| | <td> |
| | Sad |
| | </td> |
| | <td> |
| | 躺在床上翻来覆去,心里压着说不出的难过和沮丧,<br> |
| | 昨天晚上又失眠了。 |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Sad.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Neutral6.mp3" controls></audio> |
| | </td> |
| | <td> |
| | Happy |
| | </td> |
| | <td> |
| | 和好朋友一起在院子里烧烤,聊着有趣的故事,<br> |
| | 享受着美食和欢乐的时光。 |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Happy.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | </div> |
| | </div> |
| |
|
| | <div class="article-block"> |
| | <h2 id="text-prompted-voice-generation-demonstrations">Text-Prompted Voice Generation Demonstrations</h2> |
| | <div class="scroll-wrapper"> |
| | <table style="width: 100%;"> |
| | <tbody> |
| | <tr class="border-bottom-thin"> |
| | <th scope="col">Prompt</th> |
| | <th scope="col">Text</th> |
| | <th scope="col" style="text-align: center;">Audio</th> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | 男性中年声音,说中文,音色浑厚醇厚,带有自然的磁性,<br> |
| | 语速偏慢,音量适中,音调偏低沉。声音整体给人沉稳可靠的感觉,<br> |
| | 在深度访谈场景中表现出专业性和亲和力,音质清晰,吐字规整有力。 |
| | </td> |
| | <td> |
| | 在这个安静的夜晚,让我们一起走进《人生笔记》这本书。<br> |
| | 作者用平实的文字记录下生活中的点点滴滴,<br> |
| | 让我们看到平凡中的真善美。<br> |
| | 今天,我们先来读第一章:'生活的痕迹'...... |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/深度访谈男中年.wav" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | 说中文的女青年,音色偏甜美,语速比较快,<br> |
| | 说话时带着一种轻快的感觉,整体音调较高,像是在直播带货,<br> |
| | 整体氛围比较活跃,声音清晰,听起来很有亲和力。 |
| | </td> |
| | <td> |
| | 亲爱的宝宝们,等了好久的神仙面霜终于到货啦!<br> |
| | 你们看这个包装是不是超级精致?<br> |
| | 我自己已经用了一个月了,效果真的绝绝子!<br> |
| | 而且这次活动价真的太划算了,错过真的会后悔的哦~ |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/直播带货女青年.wav" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | 中国男性声音,听着像是青年,音色清亮,语速比较快,<br> |
| | 说话很有激情,像是在解说比赛,声音中带着紧张和兴奋的感觉。 |
| | </td> |
| | <td> |
| | 漂亮!这个进攻太精彩了!张伟突破防线,<br> |
| | 一个漂亮的转身,球传到禁区,王超跟上,射门!<br> |
| | 球进了!难以置信的精彩配合,现场观众都沸腾了! |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/体育解说男青年.wav" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | 中国女青年的声音,音色清脆,说话速度偏快,语调活泼,<br> |
| | 像是在做游戏直播,声音中带着愉快的感觉,整体音调较高,<br> |
| | 整体氛围比较轻松。 |
| | </td> |
| | <td> |
| | 啊!这里有个宝箱!让我们看看里面是什么~<br> |
| | 哇!是传说中的紫色装备!运气也太好了吧!<br> |
| | 谢谢小伙伴们的打赏,我们继续往前探索...... |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/游戏主播女青年.wav" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | English-speaking female voice, sounding relatively young,<br> |
| | with a sweet and pleasant tone. Speaking at a moderate pace<br> |
| | with a touch of energy, similar to someone narrating a<br> |
| | beauty/makeup tutorial video. The overall atmosphere is<br> |
| | relaxed and cheerful. |
| | </td> |
| | <td> |
| | Hi everyone! Today I'll be sharing a soft, romantic<br> |
| | makeup look that's perfect for dates. Many of you have <br> |
| | been asking how to apply this eyeshadow naturally - the<br> |
| | key is using gentle techniques. Let's go through the<br> |
| | steps together... |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/美妆女博主.wav" controls></audio> |
| | </td> |
| | </tr> |
| | <tr> |
| | <td> |
| | English-speaking middle-aged male voice, slightly husky, <br> |
| | speaking at a moderate-to-slow pace with a deep tone. Like<br> |
| | someone telling an old story, conveying a nostalgic feeling,<br> |
| | with a relaxed and composed manner of speaking. |
| | </td> |
| | <td> |
| | That was back in the late 1970s. I remember when our <br> |
| | village first got electricity - everyone was so excited. <br> |
| | In the evenings, people would bring their stools and <br> |
| | gather under the big banyan tree by the village committee <br> |
| | office to watch movies projected on the wall. Even now, <br> |
| | thinking back to those moments still fills me with warmth. |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/回忆男中年.wav" controls></audio> |
| | </td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | </div> |
| | </div> |
| |
|
| | <div class="article-block"> |
| | <h2 id="comparison-of-voice-naturalness">Comparison of Voice Naturalness |
| | with Previous-Generation Products</h2> |
| | <p>Compared with previous-generation products (Microsoft Azure TTS and AWS Polly in the samples below), the new model produces noticeably more natural speech.</p> |
| | <h3 style="margin-top: 2rem;">Source Audio for Radiant_Girl</h3> |
| | <audio src="assets/audios/English_Radiant_Girl_Sourse.wav" controls></audio> |
| | <h3>DEMO</h3> |
| | <div class="scroll-wrapper"> |
| | <table style="width: 100%;"> |
| | <tbody> |
| | <tr class="border-bottom-thin"> |
| | <th scope="col">Text</th> |
| | <th scope="col" style="text-align: center;">MiniMax<br>Speech_02_HD</th> |
| | <th scope="col" style="text-align: center;">Microsoft<br>Azure TTS</th> |
| | <th scope="col" style="text-align: center;">AWS<br>Polly</th> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | I sat alone in the empty room, staring at the old photographs,<br> |
| | wondering how everything could change so quickly,<br> |
| | how a lifetime of memories could fade away just like that. |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Radiant_Girl_1.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Emma_1.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Joanna_1.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | <tr class="border-bottom-thin"> |
| | <td> |
| | The moment I held my acceptance letter, my heart burst with joy - <br> |
| | all those sleepless nights finally paid off, and I couldn't stop<br> |
| | dancing around the room, calling everyone I knew to share this amazing news! |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Radiant_Girl_2.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Emma_2.mp3" controls></audio> |
| | </td> |
| | <td> |
| | <audio class="audio-md" src="assets/audios/Joanna_2.mp3" controls></audio> |
| | </td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | </div> |
| | </div> |
| |
|
| | <div class="article-block"> |
| | <h2 id="citation">Citation</h2> |
| | <div> |
| | <pre> |
| | <code> |
| | @misc{minimax2025minimaxspeechintrinsiczeroshottexttospeech, |
| | title={MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder}, |
| | author={Bowen Zhang and Congchao Guo and Geng Yang and Hang Yu and Haozhe Zhang and Heidi Lei and Jialong Mai and Junjie Yan and Kaiyue Yang and |
| | Mingqi Yang and Peikai Huang and Ruiyang Jin and Sitan Jiang and Weihua Cheng and Yawei Li and Yichen Xiao and Yiying Zhou and Yongmao Zhang and |
| | Yuan Lu and Yucen He}, |
| | year={2025}, |
| | eprint={2505.07916}, |
| | archivePrefix={arXiv}, |
| | primaryClass={eess.AS}, |
| | url={https://arxiv.org/abs/2505.07916}, |
| | }</code> |
| | </pre> |
| | </div> |
| | </div> |
| | </article> |
| | </main> |
| |
|
| | <script> |
| | MathJax = { |
| | tex: { |
| | inlineMath: [['$', '$'],], |
| | }, |
| | } |
| | |
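| | // Toggle the dark theme; skip wiring if the toggle button is absent from the page |
| | const darkModeToggle = document.getElementById('dark-mode-toggle') |
| | if (darkModeToggle) { |
| | darkModeToggle.addEventListener('click', () => { |
| | document.body.classList.toggle('latex-dark') |
| | }) |
| | } |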
| | </script> |
| | <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script> |
| | </body> |
| |
|
| | </html> |