This Hacker News discussion revolves around a new tool called `owhisper`, which aims to provide a real-time, headless, Deepgram-compatible API for speech-to-text (STT) models, particularly Whisper and Moonshine. The primary themes emerging from the conversation are:
Real-time Transcription and Streaming
A key point of interest is the tool's ability to handle real-time audio streams and output text continuously, contrasting with the traditional chunk-based processing of many STT models.
- Users are seeking a solution that processes audio as a continuous stream and outputs text in the same manner.
- The implementation of Voice Activity Detection (VAD) for chunking audio is mentioned as a method to achieve more granular processing (a toy chunker sketch follows this list).
- The Moonshine model is highlighted for its advantage in processing shorter audio segments faster than traditional Whisper, which is often constrained by 30-second chunks. yujonglee states, concerning Moonshine: "Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER."
- The ability to pipe audio into the tool and receive streaming text output is highly desired for integration with other command-line tools. mijoharas expresses this need: "I was actually integrating some whisper tools yesterday. I was wondering if there was a way to get a streaming response, and was thinking it'd be nice if you can." and asks, "Same question for streaming, is there a way to get a streaming text output from owhisper?"
- Deepgram API compatibility is seen as a positive step towards establishing a standard for real-time transcription services, filling a gap left by OpenAI's lack of a real-time offering (a hedged client sketch also follows this list). solarkraft notes, "I’m excited to try this out and see your API (there seems to be a standard vaccuum here due to openai not having a real time transcription service, which I find to be a bummer)!" and later, "Edit: They seem to emulate the Deepgram API (https://developers.deepgram.com/reference/speech-to-text-api...), which seems like a solid choice. I’d definitely like to see a standard emerging here."
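To make the VAD-based chunking idea concrete, here is a toy energy-gate chunker in Python. This is a minimal sketch under stated assumptions, not owhisper's implementation: the frame size, silence threshold, and silence duration are illustrative, and production systems typically use a trained VAD such as Silero or WebRTC VAD rather than raw energy.

```python
import struct
import sys

FRAME_BYTES = 960          # 30 ms of 16 kHz 16-bit mono PCM (illustrative)
SILENCE_RMS = 500.0        # energy threshold; needs tuning per source
MAX_SILENT_FRAMES = 10     # ~300 ms of silence ends an utterance

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a little-endian int16 PCM frame."""
    n = len(frame) // 2
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return (sum(s * s for s in samples) / max(n, 1)) ** 0.5

def vad_chunks(stream):
    """Yield utterance-sized PCM chunks, split at runs of silence."""
    buffer, silent = bytearray(), 0
    while frame := stream.read(FRAME_BYTES):
        buffer.extend(frame)
        silent = silent + 1 if rms(frame) < SILENCE_RMS else 0
        # Emit once we have speech followed by enough trailing silence.
        # (Leading silence is kept for simplicity; a real chunker trims it.)
        if silent >= MAX_SILENT_FRAMES and len(buffer) > silent * FRAME_BYTES:
            yield bytes(buffer)
            buffer, silent = bytearray(), 0
    if buffer:
        yield bytes(buffer)

if __name__ == "__main__":
    # Raw 16 kHz 16-bit mono PCM on stdin; report each utterance's length.
    for i, chunk in enumerate(vad_chunks(sys.stdin.buffer)):
        print(f"utterance {i}: {len(chunk) / 32000:.2f}s of audio")
```

Chunking like this is exactly what makes Moonshine's length-proportional compute pay off: short, utterance-sized segments are transcribed faster than fixed 30-second windows.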
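Since owhisper emulates the Deepgram streaming protocol, a generic Deepgram-style WebSocket client illustrates the piping workflow mijoharas asks about. This is a hedged sketch: the localhost URL and port are assumptions, and the message shapes follow Deepgram's documented `/v1/listen` behavior rather than confirmed owhisper behavior.

```python
import asyncio
import json
import sys

import websockets  # pip install websockets

# Hypothetical local endpoint; adjust to wherever the server actually listens.
OWHISPER_URL = "ws://localhost:8080/v1/listen"

async def transcribe_stdin() -> None:
    async with websockets.connect(OWHISPER_URL) as ws:

        async def send_audio():
            # Forward raw 16-bit PCM from stdin in ~100 ms chunks (16 kHz mono).
            loop = asyncio.get_running_loop()
            while chunk := await loop.run_in_executor(None, sys.stdin.buffer.read, 3200):
                await ws.send(chunk)
            # Deepgram's protocol ends a stream with a CloseStream text message.
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def print_transcripts():
            async for message in ws:
                result = json.loads(message)
                # Deepgram-style results nest the text under channel.alternatives.
                alts = result.get("channel", {}).get("alternatives", [])
                if alts and alts[0].get("transcript"):
                    print(alts[0]["transcript"], flush=True)

        await asyncio.gather(send_audio(), print_transcripts())

asyncio.run(transcribe_stdin())
```

Because each transcript segment is printed as a plain line on stdout, the output pipes cleanly into grep, tee, or any other command-line tool, which is precisely the headless workflow being requested.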
Feature Development and Future Roadmap
Several users ask about specific features, most prominently speaker diarization, and discuss the project's roadmap.
- The development team acknowledges the demand for speaker diarization and confirms it is on their roadmap. yujonglee states, "For splitting speaker within channel, we need AI model to do that. It is not implemented yet, but I think we'll be in good shape somewhere in September." Later confirming, "yeah that is on the roadmap!"
- Users suggest methods for speaker diarization, including using libraries like `pyannote` and incorporating vector embeddings for speaker recognition (see the sketch after this list). clickety_clack advises, "Please find a way to add speaker diarization, with a way to remember the speakers. You can do it with pyannote, and get a vector embedding of each speaker that can be compared between audio samples, but that’s a year old now so I’m sure there’s better options now!"
- The tool's integration with the `hyprnote` project, which focuses on meeting notes, is highlighted, particularly for use cases involving multiple speakers. JP_Watts asks, "I’d like to use this to transcribe meeting minutes with multiple people. How could this program work for that use case?" yujonglee responds, "If your use-case is meeting, https://github.com/fastrepl/hyprnote is for you. OWhisper is more like a headless version of it."
- The need for plain-text output rather than a TUI (Text User Interface) is raised for easier piping into other command-line tools. mijoharas inquires, "also, it looks like the `owhisper run` command gives it's output as a tui. Is there an option for a plain text response so that we can just pipe it to other programs?"
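As a rough illustration of the pyannote-based approach clickety_clack describes, the sketch below diarizes a recording and builds one averaged embedding per speaker so that speakers can be matched across recordings. Treat the model names as assumptions tied to current pyannote releases; both require a Hugging Face access token, and the 0.5 distance threshold is a placeholder to tune on real data.

```python
import numpy as np
from pyannote.audio import Pipeline, Inference
from pyannote.core import Segment
from scipy.spatial.distance import cosine

# Pretrained diarization pipeline and speaker-embedding model (gated on HF).
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
embedder = Inference("pyannote/embedding", window="whole")

diarization = pipeline("meeting.wav")

# Collect one embedding per diarized turn, grouped by local speaker label.
turns: dict[str, list[np.ndarray]] = {}
for turn, _, speaker in diarization.itertracks(yield_label=True):
    emb = embedder.crop("meeting.wav", Segment(turn.start, turn.end))
    turns.setdefault(speaker, []).append(np.asarray(emb))

# Average into a per-speaker "voiceprint" that can be stored between meetings.
voiceprints = {spk: np.mean(vecs, axis=0) for spk, vecs in turns.items()}

def same_speaker(a: np.ndarray, b: np.ndarray, threshold: float = 0.5) -> bool:
    # Small cosine distance suggests the same person across recordings.
    return cosine(a, b) < threshold
```

Comparing a new meeting's voiceprints against stored ones is what would let a tool "remember" speakers between sessions, which is the feature being requested here.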
Model Support and Accessibility
The discussion touches on the range of models supported by `owhisper` and its availability across different operating systems.
- A list of supported local models, including various Whisper and Moonshine variants (e.g., `whisper-cpp-base-q8`, `moonshine-onnx-tiny-q4`), is provided.
- The tool's availability and ease of use on different platforms, specifically Linux, are discussed. mijoharas notes, "Can you help me out to find where the code you've built is? I can see the folder in github[0], but I can't see the code for the cli for instance? unless I'm blind." yujonglee provides the CLI entry point: "This is CLI entry point: https://github.com/fastrepl/hyprnote/blob/8bc7a5ee586254d2c042e02580e1a9fe0fa/owhisper/owhisper-server/src/main.rs#L35" and later confirms Linux support with a download link: "I didn't tested on Linux yet, but we have linux build: http://owhisper.hyprnote.com/download/latest/linux-x86_64".
- There's a suggestion to improve discoverability by listing supported models on the project's website. alkh mentions, "Sorry, maybe I missed it but I didn't see this list on your website. I think it is a good idea to add this info there." yujonglee clarifies that the info is available via `owhisper pull --help`.
Community Collaboration and Project Ethics
Concerns are raised about project licensing and community engagement, drawing parallels with the `llama.cpp` ecosystem.
- A user cautions against branding the project as an "Ollama for X" due to perceived commercialization and ethical concerns regarding FOSS (Free and Open Source Software) washing observed in other projects. DiabloD3 warns, "I suggest you don't brand this 'Ollama for X'. They've become a commercial operation that is trying to FOSS-wash their actions through using llama.cpp's code and then throwing their users under the bus when they can't support them."
- The developers clarify their stance, emphasizing the community-focused nature of the project and their use of `whisper.cpp`. yujonglee responds, "yeah we use whisper.cpp for whisper inference. this is more like a community-focused project, not a commercial product!"
- The project's reliance on `whisper.cpp` is noted, and there's an implicit expectation for them to be good community members. DiabloD3 further states, "I see that you are also using llama.cpp's code? That's cool, but make sure you become a member of that community, not an abuser." (Note: the discussion mentions `whisper.cpp`, not `llama.cpp`, directly for the STT part of `owhisper`, but the user draws a parallel to the `llama.cpp` community dynamics.)
- The question of how `owhisper` differentiates itself from the existing `whisper.cpp` stream example is posed. wanderingmind asks, "Thank you for taking the time to build something and share it. However what is the advantage of using this over whisper.cpp stream that can also do real time conversion?"