Essential insights from Hacker News discussions

VibeVoice: A Frontier Open-Source Text-to-Speech Model

Here's a summary of the themes expressed in the Hacker News discussion:

Quality and Realism of Voices

A primary theme is the discussion around the quality and realism of the voices generated by VibeVoice and other TTS models. While many users find the technology impressive, there's debate on how close it is to human speech.

  • One user, baal80spam, expressed amazement: "Wow. I admit that I am not a native speaker, but this looks (or rather, sounds) VERY impressive and I could mistake it for hearing two people talking."
  • However, simiones found the intonation lacking and the modulation robotic: "The voices are decent, but the intonation is off on almost every phrase, and there is a very clear robotic-sounding modulation. It's generally very impressive compared to many text-to-speech solutions from a few years ago, but for today, I find it very uninspiring."
  • Others noted specific sonic artifacts they associate with computer generation. malnourish stated, "This is clearly high quality but there's something about the voices, the male voices in particular, which immediately register as computer generated. My audio vocabulary is not rich enough to articulate what it is."
  • lvncelot elaborated, "After hearing them myself, I think I know what you mean. The voices get a bit warbly and sound at times like they are very mp3-compressed."
  • heeton offered a technical guess: "I'm no audio engineer either, but those computer voice sound "saw-tooth"y to me. From what I understand, it's more basic models/techniques that are undersampling, so there is a series of audio pulses which give it that buzzy quality."
  • codebastard described it as "blockly, as if we visualise the sound wave it seems to be without peaks and cut upwards and downwards producing a metallic boxy echo."
  • jofzar likened it to "super low bitrate... reminds me of someone on Bluetooth microphone."

Distinguishing AI from Human Speech

Users identified several tells that distinguish AI-generated speech from human speech, even with advanced models.

  • x187463 pointed out a simple giveaway: "The giveaway is they will never talk over each other. Only one speaker at a time, consistently."
  • kaptainscarlet added, "Also the lack of stutter and perfect flow of speech are a dead giveaway"
  • kridsdale1 mentioned, "And longer pause between turns than humans would do."
  • tracker1 acknowledged the progress but noted it's still not perfect: "Yeah, a lot of the TTS has gotten really impressive in general. Definitely a clear leap from the TTS stuff I worked with for training simulations a bit over a decade ago." He also identified an issue with how numbers and dates are spoken: "When numbers are spoken. Dates/ages/timelines are just off as far as story generation, and really should be human tweaked. As to the voice gen, saying a year or measurement is just not how English speakers (US or otherwise) speak."
  • simiones reiterated the intonation and modulation complaints from the quality discussion, and additionally found the singing segment "painfully bad."

Accent Reproduction and Multilingual Capabilities

A notable area of discussion was the models' ability to reproduce specific accents, particularly British ones, and to handle multiple languages.

  • wewewedxfgdf expressed a desire for better British accents: "I'm really hoping one day there will be TTS does that does really nice British accents - I've surveyed them all deeply, none do. Most that claim to do a British accent end up sounding like Kelsey Grammer - sort of an American accent pretending to be British."
  • specproc specifically requested, "I'd like one that really nails Brummie."
  • The multilingual capabilities, especially with Chinese, were also highlighted. crvdgc was impressed: "Very impressive that it can reproduce the Mandarin accent when speaking English and English accent when speaking Mandarin."
  • ascrobic agreed, adding, "The Chinese is good. The Mandarin to English example she sounds native. The English to Mandarin sounds good too but he does have an English speaker's accent, which I think is intentional."
  • iansinnott elaborated on the impressive English/Mandarin segment: "The English/Mandarin section was VERY impressive. The accents of both the woman speaking English and the man speaking Chinese were spot on. Both sound very convincingly like they are speaking a second language, which anyone here can hear from the Chinese woman speaking English voice. I'd like to add that the foreigner speaking Chinese was also spot on."

Gendered Voice Quality and Investment

A significant portion of the conversation focused on the perceived difference in quality between male and female voices, and the underlying reasons for this disparity.

  • malnourish's earlier observation that the male voices in particular "immediately register as computer generated" kicked off this sub-thread.
  • IshKebab agreed: "I agree. For some reason the female voices are waaay more convincing than the male ones too, which sound barely better than speech synthesis from a decade ago."
  • selkin attributed this to investment: "Results correlate to investment, and there's more in synthesizing female coded voices. As for the why female coded voices gets more investments, we all know, only difference is in attitude towards that (the correct answer, of course, is "it sucks")"
  • kadoban provided a detailed explanation linking investment to gender and sexual desires: "When you say male voices are better, and there's more investment there: There's a lot of money and effort spent in satisfying the sexual desires of (predominantly straight) men. There's not typically quite as much interest in doing the same for women. (...) I would expect that this is not as pronounced an effect in the world generating speech, but it must still exist."
  • lacy_tinpot offered a counter-argument, suggesting it's more complex than just sex: "I think this is a very lazy kind of cultural analysis. The reason female voices are being chosen over male ones is a little more multifaceted than just SEX. Heterosexual women also tend to prefer female voices over male ones. Female voices are often rated as being clearer, easier to understand, "warmer", etc. Why this is the case is still an open question, but it's definitely more complex than just SEX."
  • selkin pushed back, tying it to gender perception: "That you consider it sex (rather than gender), is exactly why there’s a preference for female coded voices. Consider where we do hear male recorded voices used as default."

Licensing and Open Source Concerns

What the MIT license means in practice for a model that depends on an LLM, and whether such a model can be run fully offline, were points of contention.

  • Havoc noted the MIT license positively: "MIT license - very nice!"
  • em-bee questioned its practical meaning: "what does that mean in this context? it seems to depend on an LLM. so can i run this completely offline? if i have to sign up and pay for an LLM to make it work, then it's not really more useful than any other non-free system"
  • ComputerGuru was critical of applying FOSS licenses to "binary-only release[s]": "The application of known FOSS licenses to what is effectively a binary-only release is misleading and borderline meaningless."
  • Havoc defended its utility for compliance: "If you're in a company and need a model which one do you think you're getting past compliance & legal - the one that says MIT or the one that says 'non-commercial use only'?"
  • Meneth questioned the source of training data: "Open-source, eh? Where's the training data, then?"
  • Joel_Mckay highlighted potential issues with training data: "Most scraped data is often full of copyright, usage agreement, and privacy law violations. Making it 'open' would be unwise for a commercial entity."
  • zoobab lamented the misuse of "open source": "Open source is being abused to not provide the actual source. Stop this."
  • nullc added: "Perhaps, but it is not Open Source in the traditional sense if they do not provide the preferred form for modifications."

Comparison to Existing TTS Services (ElevenLabs)

Several users directly compared VibeVoice to established commercial TTS services, most notably ElevenLabs.

  • echelon speculated about market impact: "With stuff like this coming out in the open, I wonder if ElevenLabs will maintain its huge ARR lead in the field. I really don't see how they can continue to maintain a lead when their offering is getting trounced by open models."
  • kamranjon asked for comparisons with Dia, finding parakeet-based models more realistic in pacing.
  • watsonmusic simply stated, "11labs is facing a real competitor."
  • mclau157 disagreed, finding ElevenLabs superior: "ElevenLabs has a much more convincing voice model."
  • Uehreka argued that generally, local TTS models struggle to match closed-source offerings like ElevenLabs: "It’s tough to name the best local TTS since they all seem to trade off on quality and features and none of them are as good as ElevenLabs’ closed-source offering." However, they praised Kokoro-82M as a local alternative.
  • refulgentis critiqued an apparent bias against open models: "There's a certain know-nothing feeling I get that makes me worried if we start at the link (which has data showing it > ElevenLabs quality), jump to eh it's actually worse than anything I've heard then last 2 years, and end up at 'none are as good as ElevenLabs' - the recommendation and commentary on it, of course, has nothing to do with my feeling, cheers"
  • jofzar commented, "Still not at the astonishing level of Google Notebook text to speech which has been out for a while now. I still can't believe how good that one is."

Integration of Singing and Background Music

The inclusion of singing and background music in the demo prompted specific reactions and discussions.

  • rafaelmn felt the singing was unnecessary: "They could have skipped the singing part, it would be better if the model did not try to do that :)"
  • kridsdale1 noted that the singing didn't impress but did drive engagement: "It did get me to look up the song [1] again though, which is a great stimulator of emotion. The robot singing has a long way to go."
  • lyu07282 and phildougherty discussed, with skepticism, the FAQ's explanation that spontaneously appearing background music is a deliberate "feature" rather than a bug.
  • lagniappe offered a clear opinion: "Bots should never sing."

Control and Customization (SSML, IPA, Voice Cloning)

Users expressed a desire for more control over the generated speech, with specific requests for SSML, IPA input, and better voice cloning.

  • throwaw12 asked, "Will there be a support for SSML to have more control of conversation?"
  • amelius echoed this need for markup for emotional control: "For example, it would be nice to do something like: Hey look! [enthusiastic] Should we tell the others? Maybe not ... [giggles] etc. In fact, I think this kind of thing is absolutely necessary if you want to use this to replace a voice actor."
  • data-ottawa confirmed ElevenLabs supports this: "Eleven labs has some models with support for that."
  • rcarmo highlighted the model's voice cloning capability: "One of the things this model is actually quite good at is voice cloning. Drop a recorded sample of your voice into the voices folder, and it just works."
  • weeb specifically inquired about IPA support for TTS: "does anyone know of recent TTS options that let you specify IPA rather than written words? Azure lets you do this, but something local (and better than existing OS voices) would be great for my project."
  • andybug confirmed Kokoro's support for phonemes, which appeared to be IPA-like.
  • viggity desired more granular control: "I would love to have a model that can make sense of things like stressing particular syllables or phonemes to make a point."
  • TheAceOfHearts suggested the need for an intermediate step for annotation: "Someone else mentioned in this thread that you cannot add annotations to the text to control the output. I think for these models to really level up there will have to be an intermediate step that takes your regular text as input and it generates an annotated output, which can be passed to the TTS model. That would give users way more control over the final output, since they would be able to inspect and tweak any details instead of expecting the model to get everything correctly in a single pass."
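The kind of markup commenters are asking for already exists as a W3C standard: SSML, which services like Azure accept (including IPA input via the `<phoneme>` element). The sketch below builds an SSML fragment covering the requests above: emphasis, pauses, prosody, and an IPA pronunciation. The tag names come from the SSML 1.1 specification, not from VibeVoice; whether VibeVoice will ever accept such markup is exactly the open question in this thread.

```python
# Minimal SSML sketch illustrating the controls requested in the thread.
# Element names (<emphasis>, <break>, <prosody>, <phoneme>) are from the
# W3C SSML 1.1 spec; nothing here is a VibeVoice API.
import xml.etree.ElementTree as ET

ssml = """
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">
  Hey look!
  <emphasis level="strong">Should we tell the others?</emphasis>
  <break time="500ms"/>
  <prosody rate="slow" volume="soft">Maybe not ...</prosody>
  I said <phoneme alphabet="ipa" ph="t\u0259\u02c8m\u0251\u02d0t\u0259\u028a">tomato</phoneme>.
</speak>
""".strip()

# Parsing confirms the markup is well-formed before handing it to a TTS API.
root = ET.fromstring(ssml)
ns = "{http://www.w3.org/2001/10/synthesis}"
print(root.tag)                          # namespaced <speak> element
print(root.find(ns + "phoneme").get("ph"))  # the IPA string
```

TheAceOfHearts's proposed intermediate step would amount to generating this kind of annotated document automatically and letting the user tweak it before synthesis.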

Utility and Use Cases

The discussion touched on the practical applications and value of advanced TTS, from accessibility to content creation.

  • tempodox expressed skepticism about the "hype" around AI TTS, contrasting it with macOS's long-standing TTS: "This is ludicrous. macOS has had text-to-speech for ages with acceptable quality, and they never needed energy- and compute-expensive models for it. And it reacts instantly, not after ridiculous delays. I cannot believe this hype about 'AI', it’s just too absurd."
  • NitpickLawyer countered that Apple's TTS is not "acceptable quality in any modern understanding of SotA": "Compared to IBMs Steven Hawking's chair, maybe. But apple tts is not acceptable quality in any modern understanding of SotA, IMO." They also shared their positive experience with Google's podcast AI, finding it capable of overcoming the uncanny valley.
  • Ukv and crazygringo discussed the value in translation, dubbing, creating personalized content (audiobooks, articles), allowing people with voice impairments to communicate in their own voice, and creating more interactive media. crazygringo likened the need for better TTS to the preference for higher DPI or better fonts in reading.
  • anarticle noted that some examples sounded like a "cry for help," but acknowledged the potential benefit of large context windows in improving performance.
  • ml_basics inquired about its relation to other Microsoft AI voice models.
  • ehutch79 felt the examples were "off-putting" and in "uncanny valley territory."
  • baxuz lamented the limited support for less prevalent languages: "Looking forward to the day when tts and speech recognition will work on Croatian, or other less prevalent languages. It seems that it's only variants of English, Spanish and Chinese which are somewhat working."
  • stuffoverflow praised VibeVoice-Large for producing convincing Finnish speech: "VibeVoice-Large is the first local TTS that can produce convincing Finnish speech with little to no accent. I tinkered with it yesterday and was pleasantly surprised at how good the voice cloning is and how it 'clones' the emotion in the speech as well."

Other Observations and Naming Conventions

  • The origin of the name "VibeVoice" was questioned, with some suggesting it might be a derivative of "VibeCode."
  • Users discussed Microsoft's naming conventions for its AI products, with humorous suggestions like "Microsoft VibeCode" or "Zunega."
  • The presence of Chinese comments in the HTML source code was noted as interesting.
  • Some demo sites had rendering issues (invisible text in Firefox).
  • There was a brief tangent about the cultural impact of movie soundtrack songs.
  • Users shared resources for comparing TTS models, such as Hugging Face leaderboards.