Essential insights from Hacker News discussions

DeepSeek-v3.1

The Hacker News discussion about DeepSeek V3.1 reveals several key themes:

Model Performance and Benchmarking Skepticism

A significant portion of the conversation revolves around the performance of DeepSeek V3.1, with many users expressing skepticism about the reliability and comprehensiveness of benchmarks. While some users report positive personal experiences, others point out discrepancies between benchmark results and real-world usage.

  • "Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice." - hodgehog11
  • "My personal experience is that it produces high quality results." - coliveira
  • "Vine is about the only benchmark I think is real. We made objective systems turn out subjective answers… why the shit would anyone think objective tests would be able to grade them?" - SV_BubbleTime
  • "tbh companies like anthropic, openai, create custom agents for specific benchmarks" - guluarte
  • "Aren't good benchmarks supposed to be secret?" - amelius
  • "garbage benchmark, inconsistent mix of "agent tools" and models. if you wanted to present a meaningful benchmark, the agent tools will stay the same and then we can really compare the models. there are plenty of other benchmarks that disagree with these, with that said. from my experience most of these benchmarks are trash. use the model yourself, apply your own set of problems and see how well it fairs." - segmondy
  • "Clearly, this is a dark harbinger for Chinese AI supremacy /s" - lenerdenator
  • "As a hobbyist, I have yet to put together a good heuristic for better-quant-lower-params vs. smaller-quant-high-params." - jkingsman
  • "Benchmarks can be a starting point, but you really have to see how the results work for you." - mdp2021
  • "Is it foot at tool use? For me tool use is table stakes, if a model can't use tools then its almost useless." - fariszr
  • "I don't think it's necessarily wrong, but your source is currently only showing a single provider. Comparing: [links to OpenRouter] for the same providers is probably better, although gpt-oss-120b has been around long enough to have more providers, and presumably for hosters to get comfortable with it / optimize hosting of it." - petesergeant

Hallucinations and Factual Accuracy

The issue of hallucinations and factual accuracy is a recurring theme. While some users found DeepSeek V3.1 to be more precise than other models in certain scenarios, others reported it to be prone to hallucinations, especially for factual queries.

  • "I remember asking for quotes about the Spanish conquest of South America because I couldn't remember who said a specific thing. The GPT model started hallucinating quotes on the topic, while DeepSeek responded with, 'I don't know a quote about that specific topic, but you might mean this other thing.' or something like that then cited a real quote in the same topic, after acknowledging that it wasn't able to find the one I had read in an old book. i don't use it for coding, but for things that are more unique i feel is more precise." - imachine1980_
  • "Seems to hallucinate more than any model I've ever worked with in the past 6 months." - xmichael909
  • "DeepSeek is bad for hallucinations in my experience. I wouldn't trust its output for anything serious without heavy grounding. It's great for fantastical fiction though. It also excels at giving characters 'agency'." - Leynos
  • "My experience is that gpt-oss doesn't know much about obscure topics, so if you're using it for anything except puzzles or coding in popular languages, it won't do well as the bigger models. It's knowledge seems to be lacking even compared to gpt3." - okasaki
  • "Smaller quants are for folks who won't be able to run anything at all, so you run the largest you can run relative to your hardware." - segmondy
  • "It still cant name all the states in India" - donbreo
  • "All this points to "personality" being a big -- and sticky -- selling point for consumer-facing chat bots. People really did like the chatty, emoji-filled persona of the previous ChatGPT models. So OpenAI was ~forced to adjust GPT-5 to be closer to that style. It raises a funny "innovator's dilemma" that might happen. Where an incumbent has to serve chatty consumers, and therefore gets little technical/professional training data. And a more sober workplace chatbot provider is able to advance past the incumbent because they have better training data. Or maybe in a more subtle way, chatbot personas give you access to varying market segments, and varying data flywheels." - pradn

Tool Use and Formatting Issues

The discussion highlights challenges with DeepSeek V3.1's tool use capabilities, specifically its inconsistent adherence to standardized formats like JSON tool calls. This forces users building agents to add custom support for the model's non-standard output formats; a minimal parsing sketch follows the quotes below.

  • "It's a hybrid reasoning model. It's good with tool calls and doesn't think too much about everything, but it regularly uses outdated tool formats randomly instead of the standard JSON format. I guess the V3 training set has a lot of those." - seunosewa
  • "Sometimes it will randomly generate something like this in the body of the text: ... or this: ... Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT5, and GLM 4.5, don't do that. To accomodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats." - seunosewa
  • "Is it foot at tool use? For me tool use is table stakes, if a model can't use tools then its almost useless." - fariszr
  • "So, is the output price there why most models are extremely verbose? Is it just a ploy to make extra cash? It's super annoying that I have to constantly tell it to be more and more concise." - guerrilla

Pricing and Value Proposition

The pricing of DeepSeek V3.1 is a point of interest, with users debating its cost-effectiveness against other models, especially in light of the recent price increase and the removal of the off-peak discount. A back-of-the-envelope cost comparison based on the quoted figures follows the list below.

  • "Yeah but the pricing is insane, I don't care about SOTA if its not break my bank" - tonyhart7
  • "Claude's Opus pricing is nuts. I'd be surprised if anyone uses it without the top max subscription." - rapind
  • "Some people have startup credits" - memothon
  • "FWIW I have the €20 Pro plan and exchange maybe 20 messages with Opus (with thinking) every day, including one weeks-long conversation. Plus a few dozen Sonnet tasks and occasionally light weight CC. I'm not a programmer, though - engineering manager." - tmoravec
  • "Cheep! $0.56 per million tokens in — and $1.68 per million tokens out. ... That's actually a big bump from the previous pricing: $0.27/$1.10" - drmidnight
  • "The next cheapest and capable model is GLM 4.5 at $0.6 per million tokens in and $2.2 per million tokens out. Glad to see DeepSeek is still be the value king. But I am sti disappointed with the price increase." - manishsharan
  • "Sad to see the off peak discount go. I was able to crank tokens like crazy and not have it cost anything. That said the pricing is still very very good so I can't complain too much." - vitaflo
  • "Looks to be the ~same intelligence as gpt-oss-120B, but about 10x slower and 3x more expensive?" - rsanek
  • "Reminder DeepSeek is a Chinese company whose headstart is attributed to stealing IP from American companies. Without the huge theft, they'd be nowhere." - hereme888
  • "how can deepseek be so cheap yet so effective? pricing: MODEL deepseek-chat deepseek-reasoner 1M INPUT TOKENS (CACHE HIT) $0.07 1M INPUT TOKENS (CACHE MISS) $0.56 1M OUTPUT TOKENS $1.68" - niteshpant

Technical Implementation and Packaging Concerns (Unsloth)

A substantial part of the discussion is dedicated to the Unsloth library's packaging and dependency management, specifically its attempt to automatically install llama.cpp. This sparked a debate about security, best practices, and user experience; a sketch of the consent-gated approach commenters propose follows the quotes below.

  • "This industry is currently burning billions a month. With that much money around I don't think any secrets can exist." - wkat4242
  • "garbage benchmark, inconsistent mix of "agent tools" and models." - segmondy
  • "Dude, this is NEVER ok. What in the world??? A third party LIBRARY running sudo commands? That’s just insane. You just fail and print a nice error message telling the user exactly what they need to do, including the exact apt command or whatever that they need to run." - elteto
  • "They say the SWE bench verified score is 66%. Claude Sonnet 4 is 67%. Not sure if the 1% difference here is statistically significant or not." - aussieguy1234
  • "I'm doing this model" - loog5566
  • "Don't apologize, you are doing amazing work. I appreciate the effort you put. Usually you don't make assumptions on the host OS, just try to find the things you need and if not, fail, ideally with good feedback. If you want to provide the "hack", you can still do it, but ideally behind a flag, allow_installation or something like that. This is, if you want your code to reach broader audiences." - woile
  • "IMO the correct thing to do to make these people happy, while being sane, is - do not build llama.cpp on their system. Instead, bundle a portable llama.cpp binary along with unsloth, so that when they install unsloth with pip (or uv) they get it." - rfoo
  • "I think for a compromise solution I'll allow the permission asking to install. I'll definitely try investigating pre built binaries though" - danielhanchen
  • "I like it when software does work for me. Quietly installing stuff at runtime is shady for sure, but why not if I consent?" - solarkraft
  • "Please, please, never silently attempt to mutate the state of my machine, that is not a good practice at all and will break things more often than it will help because you don't know how the machine is set up in the first place." - Balinares
  • "My current solution is to pack llama.cpp as a custom nix formula (the one in nixpkgs has the conversion script broken) and run it myself. I wasn't able to run unsloth on ROCM nor for inference nor for conversion, sticking with peft for now but I'll attempt again to re-package it." - pshirshov

Hardware and Local Deployment

The conversation also touches upon running LLMs locally, with discussions about hardware requirements, GPU compatibility (AMD vs. NVIDIA), quantization trade-offs, and optimization techniques. A short sketch combining the throughput rule of thumb with the llama.cpp server setup follows the quotes below.

  • "I've been running LLM models on my Radeon 7600 XT 16GB for past 2-3 months without issues (Windows 11). I've been using llama.cpp only. The only thing from AMD I installed (apart from latest Radeon drivers) is the 'AMD HIP SDK' (very straight forward installer). After unzipping (the zip from GitHub releases page must contain hip-radeon in the name) all I do is this: llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf And then connect to llamacpp via browser to localhost:8080 for the WebUI (its basic but does the job, screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has OpenAI-compatible API." - DarkFuture
  • "The P40 has memory bandwidth of 346GB/s which means it should be able to do around 14+ t/s running a 24 GB model+context." - coolspot
  • "For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good perf for dynamic 2bit (2bit MoE, 6-8bit rest) - can also do SSD offloading but it'll be slow." - danielhanchen
  • "Yes we're working on Docker! https://hub.docker.com/r/unsloth/unsloth" - danielhanchen
  • "In my experience LLMs can do Nix very well, even the models I run locally. I just instruct them to pull dependencies through flake.nix and use direnv to run stuff." - pshirshov
  • "if you are running a 2bit quant, you are not giving up performance but gaining 100% performance since the alternative is usually 0%. Smaller quants are for folks who won't be able to run anything at all, so you run the largest you can run relative to your hardware." - segmondy