Real-time Voice AI with Local LLMs: A Promising but Resource-Intensive Approach
The Hacker News discussion revolves around RealtimeVoiceChat, an open-source system designed for real-time, local voice conversations with LLMs. The primary goal is to reduce latency and enable more natural interactions. The discussion highlights the project's potential and explores the challenges and optimizations related to speech-to-text (STT), text-to-speech (TTS), hardware requirements, and wake word implementation.
STT/TTS Model Selection and Performance
A significant portion of the discussion centers on the selection and performance of STT and TTS models. Users are keenly interested in identifying the best-in-class options and understanding the trade-offs between quality, speed, and resource consumption.
- Whisper as a Baseline: Whisper is acknowledged as a strong contender, especially with the CTranslate2 implementation, which surprised some users; as ivape admitted, "I was sure whisper-cpp was the best." (See the sketch after this list.)
- Emerging STT Models: koljab mentions that "nvidia open sourced Parakeet TDT today and it instantly went no 1 on open asr leaderboard" and notes that the newest models still need to be evaluated.
- TTS Innovation: oezi believes that "The core innovation is happening in TTS at the moment."
- Model Performance Considerations: When kristopolous suggested an alternative STT option, koljab replied, "Tried that one. Quality is great but sometimes generations fail and it's rather slow. Also needs ~13 GB of VRAM, it's not my first choice for voice agents tbh." This highlights the importance of speed and reliability in real-time applications.
- Monolingual Model Optimization: kristopolous proposed stripping unnecessary languages from a model for potential speed gains, but koljab replied: "You can't easily 'unlearn' things from the model weights... To gain speed you'd have to bring the parameter count down and train the model from scratch with a single language only."
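To make the Whisper/CTranslate2 point concrete, here is a minimal sketch of local transcription with faster-whisper, the CTranslate2-based Whisper implementation referenced in the thread. The model size, device, and compute type are illustrative choices, not the project's actual settings:

```python
from faster_whisper import WhisperModel

# Load a Whisper model through CTranslate2; float16 on GPU trades a little
# accuracy for the low latency that real-time voice agents need.
model = WhisperModel("base.en", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments plus audio metadata.
segments, info = model.transcribe("utterance.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```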
Hardware Requirements and Optimization
The project demands substantial computing resources, particularly a CUDA-enabled GPU. This limitation prompts discussion about hardware requirements, optimization strategies, and compatibility with AMD GPUs.
- High GPU Requirement: koljab notes the requirement of "a decent CUDA-enabled GPU for good performance due to the STT/TTS models" and states "I only tested it on my 4090 so far".
- Challenges with AMD GPUs: dankwizard comments "I've given up trying to locally use LLMs on AMD", highlighting the challenges faced by users with non-Nvidia hardware.
- AMD GPU Solutions/Workarounds: lhl offered advice, stating "Basically anything llama.cpp (Vulkan backend) should work out of the box w/o much fuss (LM Studio, Ollama, etc)." and provided a resource for AMD GPU optimization (see the sketch after this list).
- Mac Performance: karimf inquired about Mac performance: "I'm curious how fast it will run if we can get this running on a Mac. Any ballpark guess?" This indicates interest in portability beyond high-end GPUs.
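For the LLM stage itself, the llama.cpp-based servers lhl mentions (Ollama, LM Studio) expose a simple local HTTP API regardless of GPU vendor. Below is a hedged sketch of streaming a reply from Ollama's /api/chat endpoint; the model name is a placeholder for whatever has been pulled locally:

```python
import json
import urllib.request

payload = {
    "model": "llama3",   # illustrative; any locally pulled model works
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": True,      # stream tokens to reduce perceived latency
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for line in resp:    # Ollama streams one JSON object per line
        if not line.strip():
            continue
        chunk = json.loads(line)
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
```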
System Architecture and Model Details
The discussion delves into the specific models and components used within RealtimeVoiceChat, emphasizing the project's commitment to local, open-source solutions.
- Local Model Focus: echelon praises the project's fully local, open-source composition: "That's excellent. Really amazing bringing all of these together like this."
- Specific Model Choices: koljab provided a detailed list of the models used at each stage: VAD, transcription, turn detection, LLM, and TTS (a simplified pipeline sketch follows this list).
- Coqui XTTS Uncertainty: peterldowns sought clarification on the Coqui XTTS Lasinya models: "Can you explain more about the 'Coqui XTTS Lasinya' models that the code is using? What are these, and how were they trained/finetuned?" optimog added that the company behind them appears defunct: "Seems like they are out of business. Their homepage mentions 'Coqui is shutting down'."
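To visualize how those stages fit together, here is a simplified, hypothetical sketch of the pipeline koljab describes. The interfaces and function names are placeholders for illustration, not the project's actual API:

```python
def conversation_loop(mic, vad, stt, turn_detector, llm, tts, speaker):
    """Hypothetical glue for a fully local voice pipeline:
    VAD -> STT (transcription) -> turn detection -> LLM -> TTS."""
    buffered_speech = []
    while True:
        frame = mic.read()                       # raw PCM audio frame
        if vad.is_speech(frame):                 # VAD gates what we keep
            buffered_speech.append(frame)
        elif buffered_speech and turn_detector.finished(buffered_speech):
            text = stt.transcribe(b"".join(buffered_speech))  # speech -> text
            buffered_speech.clear()
            reply = llm.generate(text)           # text -> response text
            speaker.play(tts.synthesize(reply))  # response -> audio out
```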
Turn Detection and Natural Conversation Flow
Users emphasize the importance of accurate turn detection and natural conversation flow, identifying it as a key area for improvement in voice AI systems.
- Smart Turn Detection: The project aims for "Smart turn detection to avoid cutting the user off mid-thought."
- Pauses and Natural Speech: smusamashah points out a common issue with current systems: "With these tools, AI starts talking as soon as we stop. Happens both in text and voice chat tools... I saw a demo on twitter a few weeks back where AI was waiting for the person to actually finish what he was saying. Length of pauses wasn't a problem." (A toy heuristic along these lines follows this list.)
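smusamashah's observation suggests the silence threshold should depend on whether the utterance sounds finished. Here is a toy heuristic in that spirit, with illustrative thresholds; real systems (including, presumably, this project's trained turn-detection model) use learned classifiers rather than string matching:

```python
# Endings that suggest the speaker is mid-thought; crude by design.
UNFINISHED_ENDINGS = (" and", " but", " so", " because", ",", "...")

def silence_needed(transcript_so_far: str) -> float:
    """Seconds of silence to wait before treating the turn as finished."""
    text = transcript_so_far.rstrip().lower()
    if text.endswith(UNFINISHED_ENDINGS):
        return 2.0   # trailing conjunction or comma: likely mid-thought
    if text.endswith((".", "?", "!")):
        return 0.5   # clear sentence boundary: respond quickly
    return 1.2       # ambiguous ending: wait a moderate amount

def turn_finished(transcript_so_far: str, silence_seconds: float) -> bool:
    return silence_seconds >= silence_needed(transcript_so_far)
```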
Alternative Approaches and Future Features: Wake Words
The discussion touches upon alternative approaches to real-time voice AI and explores potential future features. A prominent topic is the implementation of wake word detection.
- Wake Word Engines: riquito asks, "any thought about wake word engines, to have something that listen without consuming all the time? The landscape for open solutions doesn't seem good" (see the wake-word sketch after this list).
- Always Listening vs. Wake Words: TeMPOraL argues for an "always listening" approach for a more natural interaction: "FWIW, wake words are a stopgap; if we want to have a Star Trek level voice interfaces, where the computer responds only when you actually meant to call it...the computer needs to be constantly listening."
- Home Assistant Integration: Dr4kn highlights Home Assistant as an example of a system that blends wake words with proactive assistance.
- External Service Integration: riquito also asks about "any plan to allow using external services for stt/tts for the people who don't have a 4090 ready (at the cost of privacy and sass providers)?", indicating interest in making the project accessible beyond high-end hardware.
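On the wake-word question, openWakeWord is one of the open-source engines in the space riquito describes. Below is a hedged sketch of always-on detection with it; the pretrained model name, detection threshold, and frame size follow openWakeWord's documentation as I understand it, so treat the details as assumptions:

```python
import numpy as np
import pyaudio
from openwakeword.model import Model

oww = Model(wakeword_models=["hey_jarvis"])  # a bundled pretrained model

CHUNK = 1280  # 80 ms of 16 kHz mono audio, openWakeWord's expected frame
pa = pyaudio.PyAudio()
mic = pa.open(rate=16000, channels=1, format=pyaudio.paInt16,
              input=True, frames_per_buffer=CHUNK)

while True:
    frame = np.frombuffer(mic.read(CHUNK), dtype=np.int16)
    scores = oww.predict(frame)        # {model_name: confidence}
    if max(scores.values()) > 0.5:     # threshold is illustrative
        print("Wake word detected; hand audio off to the STT stage.")
        break
```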