Essential insights from Hacker News discussions

Experimenting with Local LLMs on macOS

This discussion revolves around the feasibility and current state of running Large Language Models (LLMs) locally, particularly within web browsers, and touches on hardware considerations, model performance, and Apple's strategic product decisions.

Running LLMs in the Browser

A primary theme is the desire to run LLMs directly in the browser, letting users select and use local models without relying on a server. The original poster, mg, suggested a JavaScript-based solution using WebGL, later clarifying a preference for WebGL over WebGPU for broader compatibility. This sparked a conversation about existing projects and the technical challenges involved.

  • mg initially posited: "Is anyone working on software that lets you run local LLMs in the browser? In theory, it should be possible, shouldn't it?... The page could hold only the software in JavaScript that uses WebGL to run the neural net. And offer an 'upload' button that the user can click to select a model from their file system."
  • SparkyMcUnicorn pointed to mlc-ai/web-llm-chat and related projects, noting that MLC's inference engine runs on WebGPU/WASM.
  • mg clarified their preference: "Yeah, something like that, but without the WebGPU requirement. Neither FireFox nor Chromium support WebGPU on Linux. Maybe behind flags. But before using a technology, I would wait until it is available in the default config."
  • simonw highlighted that "Firefox Nightly on macOS now supports WebGPU, and the documentation says the Linux build supports it too."
  • simonw later demonstrated a more advanced implementation using transformers.js and WebGPU that allows loading models from a local folder (not just network). He detailed: "Now click 'Browse folder' and select the folder you just checked out with Git. Click the confusing 'Upload' confirmation (it doesn't upload anything, just opens those files in the current browser session). Now click 'Load local model' - and you should get a full working chat interface."

While truly browser-native, local model execution remains an ambitious goal, several projects such as web-llm and wllama were mentioned as steps in this direction, with transformers.js showing significant progress in enabling local file loading.
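
To make the transformers.js route concrete, the following is a minimal sketch (not simonw's code) of in-browser generation on WebGPU. It assumes transformers.js v3 loaded from a CDN and uses an illustrative small ONNX model; simonw's local-folder variant adds a directory picker on top of essentially the same pipeline call.

```js
// Minimal sketch (illustrative): in-browser text generation on WebGPU.
// Runs inside a <script type="module"> on a static page; the model ID is an example.
import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.0";

// Download (and cache) a small ONNX model, targeting the WebGPU backend.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct",
  { device: "webgpu" }
);

const messages = [
  { role: "user", content: "In one sentence, why run an LLM locally?" },
];
const output = await generator(messages, { max_new_tokens: 64 });

// For chat-style input, generated_text is the message list; the last entry is the reply.
console.log(output[0].generated_text.at(-1).content);
```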

Software and Tooling for Local LLM Execution

A significant portion of the discussion focuses on the tools and software available for running LLMs locally, beyond the browser context. This includes discussions on ease of use, setup, and underlying technologies.

  • samsolomon introduced "Open WebUI" as a potential solution, though the original poster mg clarified that they were seeking something "without asking users to install Docker etc."
  • bravetraveler and Jemaclus debated whether Open WebUI fit the description, with Jemaclus stating, "From a UI perspective, it's exactly what you described. There's a dropdown where you select the LLM, and there's a ChatGPT-style chatbox. You just docker-up and go to town."
  • craftkiller clarified the distinction: "The end-user interface is what they are describing but it sounds like they want the actual LLM to run in the browser (perhaps via webgpu compute shaders). Open WebUI seems to rely on some external executor like ollama/llama.cpp, which naturally can still be self-hosted but they are not executing INSIDE the browser."
  • ngxson/wllama and mlc-ai/web-llm were frequently cited as projects aiming for this direct browser execution.
  • vonneumannstan shared Mozilla-Ocho/llamafile, a project that compiles GGUF models directly into executables that can open a browser interface.
  • simonw provided extensive links and examples for transformers.js, including demos and his own implementation for loading local models.
  • a-dub recommended ollama as a user-friendly wrapper around llama.cpp for downloading and managing LLM instances, though frontsideair noted a concern about ollama's paid cloud offering detracting from its local focus.
  • Olshansky and deepsquirrelnet praised LM Studio for its educational value in helping users understand LLM configurations and for providing an OpenAI-compatible server.
  • jondwillis shared a command (sudo sysctl iogpu.wired_limit_mb=184320) for raising the limit on how much unified memory the GPU may wire on macOS (here roughly 180 GB), useful when loading large models.

The consensus is that while specialized projects exist, the ecosystem for easy-to-use, truly local, browser-based LLM execution is still evolving.
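
As a concrete illustration of the "OpenAI-compatible server" point about LM Studio (ollama exposes the same style of endpoint), a local chat request is an ordinary HTTP call. A minimal sketch, assuming ollama's default port 11434 and an illustrative model name; LM Studio's bundled server listens on port 1234 by default.

```js
// Minimal sketch: chat completion against a local OpenAI-compatible endpoint.
// Assumes ollama's default port 11434; LM Studio's server defaults to port 1234.
// The model name is whatever you have locally, e.g. after `ollama pull llama3.1:8b`.
// Runnable with Node 18+ as an ES module (built-in fetch, top-level await).
const res = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.1:8b",
    messages: [
      { role: "user", content: "Write a one-line commit message: fix off-by-one in pagination" },
    ],
    temperature: 0.2,
  }),
});

const data = await res.json();
console.log(data.choices[0].message.content);
```

The same request works against LM Studio by swapping the port and model name, which is part of why the OpenAI-compatible interface was singled out as useful.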

Hardware and Performance: Apple Silicon, NPUs, and GPU Power

A substantial part of the conversation is dedicated to the performance and suitability of various hardware, particularly Apple Silicon (M-series chips), for running LLMs locally. This includes discussions of NPUs vs. GPUs, memory, and the limitations of current architectures, especially Apple's Neural Engine (ANE).

  • paulirish introduced the WebNN API as a standardized approach to machine learning on the web.
  • coffeecoders noted that LLMs on Apple Silicon often run on the GPU via Metal rather than the ANE, stating, "Core ML isn't great for custom runtimes and Apple hasn't given low-level developer access to the ANE afaik."
  • GeekyBear and elpakal debated whether the MLX framework (mlx-framework.org) supports the ANE, with y1n0 and jychang asserting that MLX runs primarily on the GPU and does not utilize the ANE.
  • aurareturn and bigyabai discussed the limitations of NPUs in general for LLM inference, with bigyabai stating, "Most NPUs are almost universally too weak to use for serious LLM inference." aurareturn added that "Nvidia does not optimize for mobile first... AMD and Intel were forced by Microsoft to add NPUs... Turns out the kind of AI that people want to run locally can't run on an NPU. It's too weak." They also noted that "Apple is in this NPU boat because they are optimized for mobile first" and that Apple "just needs to add matmul hardware acceleration into their GPUs."
  • zozbot234 provided detailed technical insights into the Apple Neural Engine, explaining it's "optimized for power efficiency, not performance" and "provides exclusively for statically scheduled MADD's of INT8 or FP16 values," which "wastes a lot of memory bandwidth on padding." They concluded, "The NPU/ANE is still potentially useful for lowering power use in the context of prompt pre-processing."
  • j45 and saagarjha debated the efficiency of Apple Silicon versus datacenter GPUs for LLMs, with saagarjha stating that "A datacenter GPU is going to be an order of magnitude more efficient."
  • Discussions also touched on the substantial RAM requirements for larger models, with users mentioning 48GB or more for models like Qwen3.
  • Concerns were raised about prompt processing speed and battery drain on laptops when running LLMs.

The consensus is that while Apple Silicon's unified memory is beneficial, the ANE's limitations mean LLMs primarily leverage the GPU, and for state-of-the-art performance or very large models, dedicated datacenter hardware remains superior.
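
On the memory point, a common rule of thumb is that weights alone take roughly parameter count times bytes per parameter (0.5 bytes at 4-bit quantization, 2 bytes at FP16), plus headroom for the KV cache and runtime. A back-of-the-envelope sketch; the 20% overhead factor is an illustrative assumption, not a measured figure.

```js
// Back-of-the-envelope memory sizing for model weights; the 1.2x overhead
// factor (KV cache, runtime buffers) is an illustrative assumption.
function estimateMemoryGB(paramCount, bitsPerParam, overhead = 1.2) {
  const weightBytes = paramCount * (bitsPerParam / 8);
  return (weightBytes * overhead) / 1e9;
}

// A 32B-parameter model at 4-bit quantization: roughly 19 GB.
console.log(estimateMemoryGB(32e9, 4).toFixed(0));
// The same model at FP16: roughly 77 GB, beyond most laptops' unified memory.
console.log(estimateMemoryGB(32e9, 16).toFixed(0));
```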

Apple's Strategic Approach to AI and Local LLMs

A recurring theme is the critique of Apple's perceived slow or conservative approach to AI, particularly concerning local LLMs and competitive hardware strategies.

  • giancarlostoro expressed a desire for a different CEO at Apple, suggesting they should have "embraced local LLMs and built an inference engine that optimizes models that are designed for Nvidia" and considered selling "server-grade Apple Silicon processors." They characterized Tim Cook as a COO rather than a visionary.
  • bigyabai agreed, stating, "Apple has dropped the ball so hard here that it's dumbfounding... The only organizational roadblock that I can see is the executive vision, which has been pretty wishy-washy on AI for a while now."
  • jen20 pushed back, arguing Apple's focus is on its core products and that entering the datacenter market is not a trivial task, stating, "Apple not being in a particular industry is a perfectly valid choice."
  • saagarjha defended Apple's strategy based on profitability, noting, "Apple makes more profit on iPhones than Nvidia does on its entire datacenter business. Why would they want to enter a highly competitive market that they have no expertise in on a whim?"
  • moduspol praised Apple's "measured and evidence-based approach to LLMs" as potentially making them less susceptible to an AI bubble pop.
  • spease highlighted the advantage of Apple's unified memory for LLM applications and their existing platform and infrastructure.
  • jychang cited Apple's historical issues with Nvidia hardware as a reason for their focus on internal solutions.
  • jbverschoor and bigyabai discussed the idea of a powerful HomePod acting as a local AI server; bigyabai dismissed it, pointing to the current HomePod's poor sales relative to its price.

The debate centers on whether Apple's current, more cautious approach to AI is a strategic strength or a missed opportunity, with many users wishing for more aggressive hardware and software support for local AI.

Model Hallucinations and Reliability

Several users discussed the issue of LLM hallucinations and reliability, particularly in the context of local models.

  • daoboy shared a frustrating experience with an LLM hallucinating an entire interview with Sun Tzu, despite the prompt only asking for transcript cleanup. They concluded, "Having to meticulously check for weird hallucinations will be far more time consuming than just doing the editing myself."
  • simonh commented on the difficulty of training LLMs for accuracy, suggesting the issue stems from training them on human communicative behavior, possibly including "Reddit as a source."
  • smallmancontrov, HankStallone, and sandbags discussed how fine-tuning and data sources (like Reddit) influence model behavior, noting that the "bias-to-action" this produces is useful but also a source of potential inaccuracies.
  • dragonwriter offered a technical perspective, stating that LLMs represent "an instance of extreme deliberate compromise of accuracy, correctness, and controllability to get some useful performance."
  • root_axis attributed hallucinations to the probabilistic nature of LLM outputs: "the generative outputs are statistical, not logical."

This theme underscores the ongoing challenge of ensuring LLM reliability and grounding, even for specific local tasks, and the need for users to be aware of and mitigate these potential inaccuracies.

Practical Use Cases for Local LLMs

The discussion also touched upon specific, practical applications for running LLMs locally, moving beyond theoretical possibilities to tangible benefits.

  • Users like daoboy and vorticalbox discussed using local LLMs for personal data processing, such as transcribing and summarizing audio journals or personal notes stored in Obsidian, driven by privacy concerns.
  • crazygringo saw local LLMs mainly for "automation as opposed to factual knowledge": classification, summarization, search, and things like grammar checking, with OS tools integrating and prompting LLMs in the background.
  • luckydata mentioned using local models for embeddings, useful for tasks like building a screenshot manager.
  • dxetech highlighted the utility of local LLMs in situations with limited or no internet access.
  • ivape and j45 mentioned using smaller local models for terminal command autocompletion and code commit messages, suggesting "a solid 7-8b model can do this locally."
  • kergonath found the continue.dev extension for VS Code promising for local code completion with Mistral models.
  • bityard detailed several reasons for preferring local LLMs: avoiding SEO-optimized search engine results, the potential for future advertising in hosted models, data privacy, developing custom tools, and staying current with AI advancements.
  • rukuu001 shared using a small Gemma model for summarizing emails, demonstrating a practical workflow.
  • kolmogorovcomp and frontsideair discussed the challenge of selecting the right model size and type for specific hardware and tasks, balancing speed versus reasoning quality.
  • jokoon inquired about local image captioning models.

These examples illustrate that while advanced LLMs might still require powerful hardware, smaller models are already valuable for focused, local tasks, particularly those involving privacy or offline functionality.
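
The embeddings use case luckydata mentions is straightforward to prototype against a local server. A minimal sketch, assuming ollama's OpenAI-compatible /v1/embeddings endpoint and nomic-embed-text as an illustrative embedding model; the cosine-similarity ranking itself is generic.

```js
// Minimal sketch: local embeddings plus cosine similarity for a tiny semantic search.
// Assumes ollama is running with an embedding model pulled, e.g. `ollama pull nomic-embed-text`.
async function embed(text) {
  const res = await fetch("http://localhost:11434/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", input: text }),
  });
  const data = await res.json();
  return data.data[0].embedding;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank a few note titles against a query, entirely on-device.
const notes = ["Screenshot of flight itinerary", "Grocery list", "Meeting notes: Q3 roadmap"];
const [queryVec, ...noteVecs] = await Promise.all(["travel plans", ...notes].map((t) => embed(t)));
notes
  .map((note, i) => ({ note, score: cosine(queryVec, noteVecs[i]) }))
  .sort((a, b) => b.score - a.score)
  .forEach(({ note, score }) => console.log(score.toFixed(3), note));
```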