Essential insights from Hacker News discussions

FFmpeg 8.0 adds Whisper support

Here's a summary of the themes from the Hacker News discussion, with direct quotes:

Integration of Whisper into FFmpeg for Transcription

A central theme is the excitement and discussion around the integration of OpenAI's Whisper model directly into FFmpeg for audio transcription. Users are exploring the implications, capabilities, and potential uses of this new feature.
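
The new capability is exposed as an audio filter. As a minimal sketch of what an invocation looks like, assuming an FFmpeg 8.0 build configured with --enable-whisper and a whisper.cpp ggml model already downloaded (file names are illustrative; the model itself is not bundled with FFmpeg):

    ffmpeg -i input.mp4 -vn \
        -af "whisper=model=ggml-base.en.bin:destination=output.srt:format=srt" \
        -f null -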

Performance, Hardware, and Latency

A significant portion of the discussion concerns Whisper's performance, particularly its hardware requirements and latency, and the trade-offs between speed and accuracy.

  • "If you have enough processing power. Without a GPU it's going to lag," noted regularfry.
  • KeplerBoy countered, "Whisper is pretty fast."
  • miki123211 suggested a method for better latency and accuracy (sketched after this list): "The right way to do this would be to use longer, overlapping chunks. E.g. do transcription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording). This would increase processing requirements significantly, though."
  • superluserdo described an implementation: "I basically implemented exactly this on top of whisper since I couldn't find any implementation that allowed for live transcription. https://tomwh.uk/git/whisper-chunk.git/"
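
To make miki123211's overlapping-window idea concrete, here is a rough shell sketch over a finished recording (this is not the FFmpeg filter; whisper-cli and the model path are assumptions based on whisper.cpp): every STEP seconds of the timeline it re-transcribes the trailing WINDOW seconds, so later audio can revise earlier words, at the cost of redundant compute.

    #!/bin/sh
    # Re-transcribe the most recent WINDOW seconds every STEP seconds.
    IN=recording.wav
    STEP=3
    WINDOW=15
    # Whole-second duration of the input file.
    DUR=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$IN" | cut -d. -f1)
    t=$STEP
    while [ "$t" -le "$DUR" ]; do
        start=$(( t > WINDOW ? t - WINDOW : 0 ))
        # Extract the trailing window as 16 kHz mono, the input format whisper.cpp expects.
        ffmpeg -v error -y -ss "$start" -t $(( t - start )) -i "$IN" -ar 16000 -ac 1 chunk.wav
        ./whisper-cli -m models/ggml-base.en.bin -f chunk.wav --no-timestamps
        t=$(( t + STEP ))
    done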

Subtitle Generation and Format Support

The ability to generate subtitles, specifically in formats like SRT, is a key aspect of the integration. Users are discussing how this feature can be utilized and the nuances of subtitle formatting.

  • jeroenhd quoted the filter documentation: "@item format: The destination format string; it could be "text" (only the transcribed text will be sent to the destination), "srt" (subtitle format) or "json". Default value: @code{"text"}."
  • jeroenhd also commented on the convenience: "Of course, you could already do that by just manually calling whisper on files, but now you don't need to export parts or transformed media files to feed into whisper."
  • preisschild suggested an alternative: "They could also just upload those transcriptions as normal closed-captioning srt subtitles..."
  • jimkleiber explained the need for "burned-in" subtitles (see the sketch after this list): "...not all social media will show subtitles/captions tho, which is the challenge. YouTube Shorts, TikTok videos, IG reels, FB reels, Whatsapp statuses, and more."
  • Rkomorn noted issues with subtitle timing: "...as someone who not infrequently has to rewind content on just about all streaming apps because it decided one particular subtitle only needed to be displayed for less than 200ms this time around..."
  • LorenDB criticized burned-in subtitles: "The other other problem with burned-in subtitles is that they normally have horrible formatting. Who wants to try to read single words that only flash on-screen while they are being spoken?"
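
For the burned-in case jimkleiber describes, a Whisper-generated SRT can be rendered into the video frames with FFmpeg's long-standing subtitles filter (requires a build with libass; file names are illustrative):

    ffmpeg -i input.mp4 -vf "subtitles=output.srt" -c:a copy burned.mp4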

Whisper Model Capabilities and Limitations (Hallucination, Context, Multilingualism)

Users are sharing their experiences with the Whisper model's capabilities, including its multilingual support, its tendency to "hallucinate" or invent text, and its handling of context.

  • "Whisper works on 30 second chunks. So yes it can do that and that’s also why it can hallucinate quite a bit," explained ph4evers.
  • londons_explore raised a concern about context: "Eg. If I say "I scream", it sounds phonetically identical to "Ice cream". Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert"."
  • ph4evers elaborated on chunking: "The ffmpeg code seems to default to three second chunks..."
  • londons_explore followed up: "so if "I scream" is in one chunk, and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!"
  • anonymousiam shared a personal experience: "Whisper is excellent, but not perfect. I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem.""
  • JohnKemeny suggested a solution for context (illustrated after this list): "Whisper supports adding a context, and if you're transcribing a phone call, you should probably add "Transcribe this phone call with Gem", in which case it would probably transcribe more correctly."
  • bondarchuk asked about multilingual support: "Can whisper do multilingual yet? Last time I tried it on some mixed dutch/english text it would spit out english translations for some of the dutch text."
  • ph4evers replied: "Whisper-v3 works well for multi-lingual. I tried it with Dutch, German and English."
  • jeroenhd noted: "I found that it works quite well for Dutch+English as long as you use one of the larger models."
  • MaKey pointed to the documentation: "It's even better for some languages other than English (e. g. Spanish), see: https://github.com/openai/whisper?tab=readme-ov-file#availab..."
  • trenchpilgrim stated: "Whisper has quite bad issues with hallucination. It will inject sentences that were never said in the audio. It's decent for classification but poor at transcription."
  • prmoustache echoed this: "My personal experience trying to transcribe (not translate) was a complete failure. The thing would invent stuff. It would also be completely lost when more than one language is used."
  • ethan_smith provided a command for translation: "Whisper can indeed transcribe Japanese and translate it to English, though quality varies by dialect and audio clarity. You'll need the "large-v3" model for best results, and you can use ffmpeg's new integration with a command like ffmpeg -i movie.mp4 -af whisper=model=large-v3:task=translate output.srt."
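
JohnKemeny's suggestion corresponds to Whisper's initial-prompt mechanism. A hedged illustration using whisper.cpp's CLI, since it is unclear whether the FFmpeg filter exposes an equivalent option (model and file names are placeholders):

    ./whisper-cli -m models/ggml-large-v3.bin -f call.wav \
        --prompt "Transcript of a phone call with Gem."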

Bot Filters and Website Access

Several users encountered issues accessing the linked FFmpeg commit page due to its bot protection (Anubis), leading to a discussion about how long the proof-of-work challenge takes on different hardware, website security and usability, and ways around the filter.

  • boutell requested: "Shut off the broken bot filter so we can read it please."
  • jeroenhd explained the filter's configurability: "Anubis has config for that: https://anubis.techaro.lol/docs/admin/policies#request-weight... It's up to the site admin to configure it that way..."
  • ta1243 shared their experience: "Cloudflare often complains and blocks me. This page loaded pretty much instantly... But then ffmpeg is written by old school engineers with old school ways of working."
  • politelemon reported: "Took me zero seconds to be blocked with invalid response."
  • diggan shared a specific experience: "Took my iPhone 12 Mini a whole of 0.1 seconds to pass it. What hardware/OS are you using?"
  • johnisgood had a different experience: "Took me 8 seconds on my shitty desktop."
  • londons_explore observed: "Took about 30 secs for me (5 yr old intel cpu). Looked like there was a progress bar, but it didn't progress. Maybe the difficulty varies depending on IP address?"
  • miloignis stated: "It also instantly blocks me on GrapheneOS, both Firefox and Vanadium."
  • shaky-carrousel offered a counterpoint: "GrapheneOS here, with Vanadium in incognito, it doesn't block me..."
  • QuantumNomad_ provided alternative access: "Archived snapshots of the linked page: https://web.archive.org/web/20250813104007/https://code.ffmp... https://archive.is/dmj17 You can read it on one of these without having to pass that specific bot check."
  • majewsky defended bot filters: "From experience, these bot filters are usually installed because the site would be down entirely without rejecting AI scrapers, so the argument to shut it off to improve usability is rather silly."

Software Ecosystem and Alternatives

The discussion touches upon the broader software ecosystem, including related projects and alternative tools for speech-to-text and transcription.

  • demurgos commented on the relationship between VLC and FFmpeg: "I'm not very familiar with them, but I always assumed that there is a lot of overlap between the maintainers of both projects." SSLy clarified: "VLC and ffmpeg are unrelated projects."
  • kwar13 expressed enthusiasm for a specific application: "Fantastic! I am working on a speech-to-text GNOME extension that would immensely benefit from this. https://github.com/kavehtehrani/gnome-speech2text"
  • oezi asked about other implementations: "Isn't WhisperX the canonical choice for running Whisper?"
  • sampullman replied: "Maybe for running locally? whisper.cpp is nice because you can embed it pretty easily in apps for various targets like iOS, OSX, Android, wasm, etc."
  • 0points compared performance: "While whisper and whisperx are Python implementations, whisper.cpp wins the benchmarks."
  • yewenjie mentioned an alternative: "I have recently found that parakeet from NVIDIA is way faster and pretty much as correct as Whisper, but it only works with English."
  • kmfrk highly recommended Subtitle Edit: "People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription."
  • tossit444 compared Subtitle Edit and Aegisub: "SE is much better for actual transcription. Aegisub still does the heavy lifting for typesetting and the like."
  • shrx shared a personal benefit: "As a hard of hearing person, I can now download any video from the internet (e.g. youtube) and generate subtitles on the fly, not having to struggle to understand badly recorded or unintelligible speech."
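
shrx's workflow can be approximated end to end; a sketch assuming yt-dlp for the download and the same illustrative model path as above (the URL is a placeholder):

    yt-dlp -f mp4 -o video.mp4 "https://www.youtube.com/watch?v=VIDEO_ID"
    ffmpeg -i video.mp4 -vn \
        -af "whisper=model=ggml-base.en.bin:destination=video.srt:format=srt" \
        -f null -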

Bloat and Necessity of the Feature in FFmpeg

A minority opinion questions the inclusion of this AI-driven feature within FFmpeg, viewing it as potential bloat, especially given the current limitations of the Whisper model.

  • dncornholio stated: "I was expecting a lot more comments on if this is a necessary feature or if this even belongs in a library like ffmpeg. I think this is bloat, especially when the feature doesn't work flawlessly; whisper is very limited."
  • MrGilbert pointed to a discussion about subtitle workflow: "The only item that was discussed was that the subtitle workflow does not seem to be that good, afaict: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20022#issuecomment-2513."
  • zoobab asked about packaging: "so 'apt install ffmpeg' won't be enough to have the feature?" SahAssar clarified: "You'd have the feature, but you also need to supply the model. The feature seems to just be that ffmpeg has the ability to run the model, it does not include the model."
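
SahAssar's point is worth spelling out: the filter gives ffmpeg the ability to run a model, but the model weights are fetched separately. A sketch of the one-time setup, assuming the helper script shipped in the whisper.cpp repository (your distribution's ffmpeg must also have been built with the filter enabled):

    # Check whether this ffmpeg build includes the whisper filter.
    ffmpeg -hide_banner -filters | grep whisper
    # Fetch a ggml model with whisper.cpp's download script.
    git clone https://github.com/ggerganov/whisper.cpp
    sh whisper.cpp/models/download-ggml-model.sh base.en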

Future Potential and Advanced Use Cases

Users speculate on the future development and advanced applications, such as real-time translation, speaker diarization, and improving the core model's behavior.

  • instagraham inquired: "Does this mean that any software which uses ffmpeg can now add a transcription option? Audacity, Chrome, OBS etc."
  • Lio expressed hope for fewer burned-in subtitles: "Once local transcription is in more places hopefully we can persuade content creators not to burn bouncing subtitles into their videos."
  • HPsquared added: "The other problem with burned-in subtitles is you can't change the language."
  • waltbosz considered speaker recognition: "I think in this context speaker recognition would be important."
  • martzoukos wondered about LLM integration: "I guess that there is no streaming option for sending generated tokens to, say, an LLM service to process the text in real-time."
  • nomad_horse offered an alternative for streaming: "Whisper has the encoder-decoder architecture, so it's hard to run streaming efficiently, though whisper-streaming is a thing. https://kyutai.org/next/stt is natively streaming STT."
  • donatj inquired about translation capabilities: "I know nothing about Whisper, is this usable for automated translation? I own a couple very old and as far as I'm aware never translated Japanese movies."
  • poglet confirmed: "Yep, whisper can do that. You can also try whisperx (https://github.com/m-bain/whisperX) for a possibly better experience with aligning of subtitles to spoken words."
  • pmarreck wished for more: "Now if it only did separate speaker identification (diarization)."