Here's a summary of the themes discussed in the Hacker News thread, along with direct quotes:
Emphasis on Speed vs. Quality
A central debate in the discussion revolves around whether speed or quality of output is more important for LLMs, particularly in coding contexts. Some argue that speed enables faster iteration and a better development flow, while others prioritize accuracy and quality, even if it means slower responses.
Pro-Speed Arguments:
- "If an LLM is often going to be wrong anyway, then being able to try prompts quickly and then iterate on those prompts, could possibly be more valuable than a slow higher quality output." (eterm)
- "It simply enables a different method of interactive working." (eterm)
- "Latency can have critical impact on not just user experience but the very way tools are used." (eterm)
- "For autocompleting simple functions (string manipulation, function definitions, etc), the quality bar is pretty easy to hit, and speed is important." (peab)
- "But if you know what you're doing, I find having a dumber fast model is often nicer than a slow smart model that you still need to correct a bit, because it's easier to stay in flow state." (peab)
- "If gpt5 takes 3 minutes to output and qwen3 does it in 10 seconds and the agent can iterate 5 times to finish before gpt5, why do I care if gpt5 one shot it and qwen took 5 iterations" (jml78)
- "Fast can buy you a little quality by getting more inference on the same task." (furyofantares)
- "Right now you craft a prompt, hit send and then wait, and wait, and then wait some more, and after some time (anywhere from 30 seconds to minutes later) the agent finishes its job. It's not long enough for you to context switch to something else, but long enough to be annoying and these wait times add up during the whole day." (M4v3R)
- "It also discourages experimentation if you know that every prompt will potentially take multiple minutes to finish. If it instead finished in seconds then you could iterate faster." (M4v3R)
- "We already know that in most software domains, fast (as in, getting it done faster) is better than 100% correct." (defen)
- "For agentic workflows, speed and good tool use are the most important thing." (CuriousC)
- "It's fast. I tested it in EU tz, so ymmv" (NitpickLawyer)
- "It does agentic in an interesting way. Instead of editing a file whole or in many places, it does many small passes." (NitpickLawyer)
- "Had a feature take ~110k tokens (parsing html w/ bs4). Still finished the task. Didn't notice any problems at high context." (NitpickLawyer)
- "When things didn't work first try, it created a new file to test, did all the mocking / testing there, and then once it worked edited the main module file. Nice. GPT5-mini would often times edit working files, and then get confused and fail the task." (NitpickLawyer)
- "At the price point it's at, I could see it as a daily driver. Even agentic stuff w/ opus + gpt5 high as planners and this thing as an implementer. It's fast enough that it might be worth setting it up in parallel and basically replicate pass@x from research." (NitpickLawyer)
- "IMO it's good to have options at every level. Having many providers fight for the market is good, it keeps them on their toes, and brings prices down. GPT5-mini is at 2$/MTok, this is at 1.5$/MTok. This is basically "free", in the great scheme of things. I ndon't get the negativity." (NitpickLawyer)
Pro-Quality Arguments:
- "I would have thought it uncontroversial view among software engineers that token quality is much important than token output speed." (boole1854)
- "For 10% less time you can get 10% worse analysis? I donāt understand the tradeoff." (fmbb)
- "What if it's 10% less time and 3% worse analysis? Maybe that's valuable." (kelnos)
- "Fast but dumb models donāt progressively get better with more iterations." (wahnfrieden)
- "For me, I found Sonnet 3.5 to be a clear step up in coding, I thought 3.7 was worse, 2.5 pro equivalent, and 4 sonnet equal maybe tiny better than 3.5. Opus 4.1 is the first one to me that feels like a solid step up over sonnet 3.5. This of course required me to jump to Claude code max plan, but first model to be worth that (wouldnāt pay that much for just sonnet)." (mchusma)
- "Opus 4.1 is by far the best right now for most tasks. Itās the first model I think will almost always pump out āgood codeā. I do always plan first as a separate step, and I always ask it for plans or alternatives first and always remind it to keep things simple and follow existing code patterns." (mchusma)
- "I use Opus 4.1 exclusively in Claude Code but then I also use zen-mcp server to get both gpt5 and gemini-2.5-pro to review the code and then Opus 4.1 responds." (furyofantares)
- "I'd love to hear how you have this set up." (dotancohen)
- "I suspect most of the problems opus has for me are more context related, and Iām not sure more models would help." (mchusma)
- "I'm more curious if its based on Grok 3 or what, I used to get reasonable answers from Grok 3. If that's the case, the trick that works for Grok and basically any model out there is to ask for things in order and piecemeal, not all at once." (giancarlostoro)
- "I have a feeling its based on Grok 3 since Grok 3 has insane speeds then a heavy focus on programming." (giancarlostoro)
- "But that's not what OP said - they said they can be as useful as the large models by iterating them." (wahnfrieden)
- "Do you use them successfully in cases where you just had to re-run them 5 times to get a good answer, and was that a better experience than going straight to GPT 5?" (wahnfrieden)
- "It does still compare well against the others: https://vals.ai/benchmarks/swebench-2025-08-27" (hrdwdmr8)
- "But quality matters more for me." (mchusma)
- "What about promoting renewable energy, space exploration, frontier physics and advanced engineering makes you concerned?" (epa)
-
Nuance/Contextual Speed:
- "It depends how fast. If an LLM is often going to be wrong anyway, then being able to try prompts quickly and then iterate on those prompts, could possibly be more valuable than a slow higher quality output." (eterm)
- "For certain use-cases where legitimately speed would be much more interesting such as generating a massive amount of HTML. Tough I agree this makes it look like even more of a joke for anything serious." (6r17)
- "Different models for different things. Not everyone is solving complicated things every time they hit cmd-k in Cursor or use autocomplete, and they can easily switch to a different model when working harder stuff out via longer form chat." (dmix)
- "This is exactly what I use it for. It's my go-to "dumb tedious things" model. And it fills that role very well. You don't need the smartest slow model for every task. I've used it all week for tedious things nobody wants to do and gotten a ton done in less time." (Jcampuzano2)
- "The only thing I've had issues with is if you're not a level more specific than you might be with smarter models it can go off the rails. But give it a tedious task and a very clear example and it'll happily get the job done." (Jcampuzano2)
Perceived Quality and Reliability Issues
Several users raised concerns about Grok's coding quality, reliability, and tendency to hallucinate or produce incorrect outputs. Some report that the model deleted their code, or that it wrote tests which don't actually verify the intended behavior.
- "My experience is subpar so far" (cft)
- "Sonic [Grok's previous iteration] put the guy in the middle of the desk, and the laptop floating over his head. Sonic was very fast, though!" (johnfn)
- "I've been testing Grok for a few days, and it feels like a major step backward. It randomly deleted some of my code - something I haven't had happen in a long time." (RedMist)
- "While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "I have had very poor results with it so far. Much less reliable than GPT 5 Mini, which was also faster, ironically." (ewoodrich)
- "Why is deleting code a big problem? You have version control, right?" (Retr0id)
- "Deleting extra code is easier than verifying deleted lines and restoring the ones that seem like an accident. It's just annoying." (markerz)
- "It created tests and then iterate on those tests. The tests it wrote don't actually verify intended behavior. It only verified that mocks were called with the intended inputs while missing the larger picture of how it is used." (cendyne)
- "The issue seemed to be that it didn't explain anything and just made some changes, which like, I said, were pretty good." (Demiurge)
- "The issues became apparent very early that the prompt changes were leading to issues but reversion seemed to be something that X had to be pressured into - they were unwilling to treat it as a problem until the mechahitler thread." (jameshart)
- "And this model is arguably their least impressive model." (seunosewa)
- "I use grok a lot on the web interface (grok.com) and never had any weird incidents. It's a run-of-the-mill SOTA model with good web search and less safety training" (wongarsu)
- "It doesn't just cause confusion, it's also hard to sort. To confirm my suspicion of sloppy coding, I tried to sort the date column and to my surprise I got this madness: ... Which is sorting by the day column -- the bit in the middle -- instead of the year!" (jiggawatts)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "I tested this yesterday with Cline. It's fast, works well with agentic flows, and produces decent code. No idea why this thread is so negative (also got flagged while I was typing this?) but it's a decent model." (NitpickLawyer)
- "It works well with agentic flows, and produces decent code." (NitpickLawyer)
- "When things didn't work first try, it created a new file to test, did all the mocking / testing there, and then once it worked edited the main module file. Nice. GPT5-mini would often times edit working files, and then get confused and fail the task." (NitpickLawyer)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The model said some unsavoury things and the problem was admitted and fixed - why is this making people lose their minds?" (simianwords)
- "I've had good long trips with my Model Y where I didnāt need to intervene once. 4+ hour end of summer road trips." (cebert)
- "When you don't have the capacity to process a large amount of code at once, it's better to have a smaller context window that is less prone to making mistakes during the analysis of that larger context." (mkd)
- "My bottleneck currently is waiting for agent to scan/think/apply changes." (hu3)
- "I have been testing it since yesterday in VS Code and it seemed fine so far. But I am also happy with all the GPT-4 variants, so YMMV." (threeducks)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
User Experience and Workflow Integration
Users discussed how LLMs fit into their coding workflows, with some appreciating the ability to use them for scaffolding, debugging, or understanding existing code, while also noting limitations with large codebases. The idea of prompt engineering and step-by-step interaction was also highlighted.
- "I have a different approach to LLMs and coding, I want to understand their proposed solutions and not just paste garbled up code (unless its scaffolded) if you treat every LLM as a piecemeal thing when designing code (or really trying to figure out anything) and go step by step, you get better results from most models." (giancarlostoro)
- "The benchmark they are choosing to emphasize (in the one chart they show and even in the "fast" name of the model) is token output speed." (boole1854)
- "I mostly use LLMs: Scaffolding, Ask it what's wrong with the code, Ask it for improvements I could make, Ask it what the code does (amazing for old code you've never seen), Ask it to provide architect level insights into best practices" (giancarlostoro)
- "One area where they all seem to fail is lesser known packages they tend to either reference old functionality that is not there anymore, or never was, they hallucinate." (giancarlostoro)
- "amazing for old code you've never seen" - "not if you have too much! a few hundred thousand lines of code and you can't ask shit!" (dingnuts)
- "plus, you just handed over your company's entire IP to whoever hosts your model" (dingnuts)
- "It's a fair trade off for smaller companies where IP or the software is necessary evil, not the main unique value added. It's hard to see what evil would anyone do with crappy legacy code. The IP risks taken may be well worth of productivity boosts." (miohtama)
- "I hope in the future tooling and MCP will be better so agents can directly check what functionality exists in the installed package version instead of hallucinations." (miohtama)
- "That's phase 1, ask it to "think deeply" (Claude keyword, only works with the anthropic models) while doing that. Then ask it to make a detailed plan of solving the issue and write that into current-fix.md and ask it to add clearly testable criteria when the issuen is solved. Now you can manually check the criteria wherever they sound plausible, if not - it's analysis failed and its output was worthless. But if it sounds good, you can start a new session and ask it to read the-markdown-file and implement the change. Now you can plausibility check the diff and are likely done" (ffsm8)
- "But as the sister comment pointed out, agentic coding really breaks apart with large files like you usually have in brownfield projects." (ffsm8)
- "You just need to scale out more. As you approach infinite monkeys, sorry - models, you'll surely get the result you need." (_kb)
- "Otherwise, it's like measuring how fast your car can go by counting how often you clean the upholstery." (ori_b)
- "Itās like measuring how fast your car can go by counting how often you clean the upholstery. There's nothing wrong with doing it, but it's entirely unrelated to performance." (ori_b)
- "I don't think he was saying their release cadence is a direct metric on their model performance. Just that the team iterates and improves the app user experience much more quickly than on other teams." (Rover222)
- "He seems to be stating that app release cadence correlates with internal upgrades that correlate with model performance. There is no reason for this to be true. He does not seem to be talking about user experience." (jdiff)
- "How many times a day do you need to ship an update?" (ori_b)
- "Maybe I'm making rapid updates to my app because I'm a terrible coder and I keep having to push out fixes to critical bugs. Maybe I'm bored and keep making little tweaks to the UI, and for some reason think that's worth people's time to upgrade. (And that's another thing: frequent upgrades can be annoying!)" (kelnos)
- "But sure, ok, maybe it could mean making much faster progress than competitors. But then again, it could also mean that competitors have a much more mature platform, and you're only releasing new things so often because you're playing catch-up. (And note that I'm not specifically talking about LLMs here. This metric is useless for pretty much any kind of app or service.)" (kelnos)
- "Coding faster than humans can review it is pointless. Between fast, good, and cheap, I'd prioritize good and cheap. Fast is good for tool use and synthesizing the results." (esafak)
- "If you can't make the code more understandable or more reliable, then you're not going to get a speed advantage." (kchou)
- "If you have to keep your hands near the wheel and maintain attention to the road then... shrugs not really the same. IMHO we're in the "uncanny valley" of vehicular automation" (vunderba)
- "To me, "full self driving" means you can hop in the back seat and have a nap. If you have to keep your hands near the wheel and maintain attention to the road then... shrugs not really the same. IMHO we're in the "uncanny valley" of vehicular automation" (vunderba)
- "Fast is cool! Totally has its place. But I use Claude code in a way right now where itās not a huge issue and quality matters more." (mchusma)
- "I do always plan first as a separate step, and I always ask it for plans or alternatives first and always remind it to keep things simple and follow existing code patterns. Sometimes I just ask it to double check before I look at it and it makes good tweaks. This works pretty well for me." (mchusma)
- "For me, I found Sonnet 3.5 to be a clear step up in coding, I thought 3.7 was worse, 2.5 pro equivalent, and 4 sonnet equal maybe tiny better than 3.5. Opus 4.1 is the first one to me that feels like a solid step up over sonnet 3.5." (mchusma)
- "It's fast. I tested it in EU tz, so ymmv" (NitpickLawyer)
- "It does agentic in an interesting way. Instead of editing a file whole or in many places, it does many small passes." (NitpickLawyer)
- "When things didn't work first try, it created a new file to test, did all the mocking / testing there, and then once it worked edited the main module file. Nice. GPT5-mini would often times edit working files, and then get confused and fail the task." (NitpickLawyer)
- "When things didn't work first try, it created a new file to test, did all the mocking / testing there, and then once it worked edited the main module file. Nice. GPT5-mini would often times edit working files, and then get confused and fail the task." (NitpickLawyer)
Trust and Ethical/Environmental Concerns
A significant portion of the discussion centered on concerns about Grok and its parent company, xAI, stemming from Elon Musk's public persona, past controversies, and the environmental impact of data centers. Discussions about political leanings and "virtue signaling" also emerged in response to these concerns.
- "Absolutely not, but that's a personal decision due to not wanting anything to do with X, rather than a purely rational decision." (eterm)
- "This is a forum for tech-related discussions, not a venue for your virtue signaling." (cft)
- "LOL. Says the guy who wrote, "Modern local religion (at least in the US) is neomarxism": https://news.ycombinator.com/item?id=31025588" (subsection1h)
- "This will probably be a unpopular, wet blanket opinion... But anytime I hear of Grok or xAI, the only thing I can think about is how it's hoovering up water from the Memphis municipal water supply and running natural gas turbines to power all for a chat bot. Looks like they are bringing even more natural gas turbines online...great!" (disposition2)
- "They started operating the turbines without permits and they were not equipped with the pollution controls normally required under federal rules. Worse, they are in an area that already led the state in people having to get emergency treatment for breathing problems. In their first 11 months they became one of the largest polluters in an area already noted for high pollution." (tzs)
- "It's not the water that is the big problem here. It is the gas turbines and the location." (tzs)
- "Most of the people bearing the brunt of all this local pollution are poor and Black." (tzs)
- "Burning natural gas or methane is considered pretty clean, and produces mostly CO2 and water, which aren't toxic pollutants or a cause of breathing problems. That's why it's used inside homes in gas stoves." (Geee)
- "Gas stoves in homes do cause breathing problems (because of non-CO2/water products)." (minitech)
- "The turbines at the xAi Memphis datacenter are rentals. I believe they are intended to be temporary while the grid is improved to provide more power." (tzs)
- "If the Grok brand wasnāt terminally tarnished for you by the āmechahitlerā incident, Iām not sure what more it would take." (jameshart)
- "This is an offering being produced by a company whose idea of responsible AI use involves prompting a chatbot that āYou spend a lot of time on 4chan, watching InfoWars videosā - https://www.404media.co/grok-exposes-underlying-prompts-for-... A lot of people rightly donāt want any such thing anywhere near their code." (jameshart)
- "I don't use Twitter, I don't use X, I don't buy Tesla. It's not hard to understand why I don't use Grok either." (monsieurbanana)
- "Its the equivalent of "voting with your wallet". Or "giving market share with your wallet". Context matters, not just for LLMs themselves. And Grok/X/Twitter's context is tarnished indeed for a lot of us." (czottmann)
- "Grok has its own reputation issues." (cosmicgadget)
- "Mostly by name association. The LLMs named Grok are good LLMs. The twitter bot of the same name, using those models and a custom prompt, has a habit of creating controversy. Usually after somebody modified the system prompt." (wongarsu)
- "I use grok a lot on the web interface (grok.com) and never had any weird incidents. It's a run-of-the-mill SOTA model with good web search and less safety training" (wongarsu)
- "How does somebody modify the system prompt over an x message to the chat bot?" (anukin)
- "I think somebody here refers to a very specific meddling CEO" (jameshart)
- "Grok is owned by Elon Musk. Anything positive that is even tangentially related to him will be treated negatively by certain people here. Additionally, it is an AI coding tool which is seen as a threat to some peopleās livelihoods here. Itās a double whammy, so Iām not surprised by the reaction to it at all." (dlachausse)
- "Claude Code threads are full of excited people so Iām not sure the second part is true." (dewey)
- "If we accept the "broken windows" theory, it'd seem that people love to pile onto a thread that already has negativity. See also the Microsoft threads on HN where everyone threatens to switch to Linux, and by reading them you'd think Linux is finally about to have its infamous glory year on the desktop." (supriyo-biswas)
- "Grok is doing some terrible things to the environment and to the community surrounding its data center, especially the disadvantaged in the area. Nobody, anywhere should be okay with that." (imglorp)
- "The location of the Colossus datacenter is well known. It happens to be located in an industrial area, nestled between an active steel manufacturing plant (apparently scrap metal with an electric blast furnace, which should mean enormous power draw but no coke coal at least?), and an active industrial scale natural gas power plant." (ACCount37)
- "With that, I just don't buy that it's the datacenter that is somehow the most notable consumer of fossil fuel power (or, for that matter, water) in the area." (ACCount37)
- "It is forgivable because there is no real understanding in an llm. And other llm can also be prompted to say ridiculous things, so what? If a llm would accept a name of a Viking or Khan of the steppes it doesnāt mean it wants to rape and pillage." (ralfd)
- "It's not about the model, it's about the ethics of the company intentionally building the model, and what they might do in the future." (tenuousemphasis)
- "What was the alternative? This was clearly an oversight and this much was admitted. Your suggestion that an oversight like this is reason enough to not use the model? I donāt get the big problem over here. The model said some unsavoury things and the problem was admitted and fixed - why is this making people lose their minds? It has to be performative because I canāt explain it in any other way." (simianwords)
- "Yes, it is performative. As is most of the outrage in this thread." (bhauer)
- "From the outside, the Grok mechahitler incident appeared very much to be the embodiment of Muskās top-down āfree speech absolutistā drive to strip āpolitical correctnessā shackles from grok; the prompting changes were driven by his setting that direction. The issues became apparent very early that the prompt changes were leading to issues but reversion seemed to be something that X had to be pressured into - they were unwilling to treat it as a problem until the mechahitler thread. This all speaks to his having a particular vision for what he wants xAI agents to be ā something which continues to be expressed in things like the ani product and other bot personas." (jameshart)
- "Microsoft had Tay. Google Gemini had āBlack George Washingtonā. I think that pinning your entire view of a model forever on a single incident is not a reasonable approach, but you do you." (efitz)
- "It's not just the model, it's Elon Musk's view of the world and business in general. Neither Microsoft nor Google nor their leadership--though admittedly imperfect--make it a habit of trolling people, openly embroiling themselves in politics, and committing blatant legal and societal transgressions. You reap what you sow; and if you live for controversy, you can't expect people not to want to do business with you." (otterley)
- "HN had severe TDS" (FirmwareBurner)
- "Derangement suggests a complete lack of factual and reasoning capability. Do you honestly think we're unaware of the facts and circumstances that support our judgment?" (otterley)
- "Elon Musk Charged With Securities Fraud for Misleading Tweets: https://www.sec.gov/newsroom/press-releases/2018-219 SEC Charges Elon Musk for Failing to Timely Disclose Beneficial Ownership of Twitter: https://www.debevoise.com/insights/publications/2025/01/sec-... Musk Sued for Calling Thai Cave Rescuer Pedophile: https://www.voanews.com/a/tesla-s-musk-sued-for-calling-thai... Elon Musk salute controversy: https://en.wikipedia.org/wiki/Elon_Musk_salute_controversy" (otterley)
- "What about promoting renewable energy, space exploration, frontier physics and advanced engineering makes you concerned?" (epa)
- "Donating to orphanages after committing a genocide resets your karma only in videogames." (badsectoracula)
- "What genocide did Musk commit?" (FirmwareBurner)
- "The karma system some games use (e.g. Fallout 3 where you can nuke an entire city that puts your karma in negatives and then give fresh water to beggars to reset your karma) was what i was reminded of." (badsectoracula)
- "Musk didn't commit any genocide (that i'm aware of) but that wasn't what i wrote. The point of my comment is that you can't offset doing -what some people perceive as- bad things by doing -what some people perceive as- good things later." (badsectoracula)
- "That single incident was only the worst of the bunch. This is on top of all heaps of context which paints Grok, X, and Elon Musk in general as something any decent human being should not touch with a 10 foot pole." (runarberg)
- "The Anti-Defamation League stated it wasn't a salute and that they weren't offended. Rabbi Ari Lamm wrote that Musk has repeatedly shown he's a friend to the Jewish community. David Greenfield suggested people should focus on actual antisemitism instead. Netanyahu highlighted the absurdity of the accusations and pointed to Musk's aid and engagement after the October 7th attacks. And yes, Musk became a victim. I don't see what his current wealth has to do with it. It's hard to ignore the imbalance where one man drew the world's anger and became public enemy #1. If you call him a snowflake, I don't know what to call all those who might have been offended by his gesture" (ribelo)
- "I reproduced it and nothing happened. The problem might be that I'm my own manager so need to went to the mirror and did it, but if any of my 20 employees did the same, I wouldn't take any action against them. The real reason is that I don't live in the West. Where I live, we don't suffer from the plague of misunderstood political correctness. At least not all of us yet." (ribelo)
- "Yes, unfortunately. Even liberal commentators like Jon Stewart and Bill Maher have said the obsession with Trump was overblown and even dangerous in its own right." (elcritch)
- "There should be a limit to slander and hatred and I believe someone should stand up to the crowd. In the name of a better tomorrow and to ensure history never repeats itself." (ribelo)
- "So in your view, the true victim of Elon's nazi salute was.. Elon?" (gizzlon)
- "How do you come to that conclusion? Because the backlash was "too much" ? He is still (one of) the richest people in the world, and controls several huuge companies. But he got his feelings hurt, I guess? And that was "too much" ?? Poor snowflake Elon." (gizzlon)
- "I think Netanyahu had a bit of a conflict of interest here--he couldn't afford to get on Trump's bad side!" (Wowfunhappy)
- "Almost half of the countries hates Netanyahu and he's only in charge because of the support from far-right." (Wytwwww)
- "If you don't think it was a nazi salute, study the video so you can reproduce the gesture exactly, then go into your work and do it in front of your manager. See what happens." (ryandrake)
- "So? Does that means nobody else is allowed to have an opinion about the salute that he made. Sure he's pro Israel, that's not uncommon at all amongst the far right these days. What about the people who seem to be highly offended by people who have been offended by his gesture. What do you call them?" (Wytwwww)
- "Everyone should be free to have whatever opinion they like, or at least, they ought to be. The difference is this, some try to impose their opinions on society, while the rest couldnāt care less and refuse to lose sleep over it. The ability to mind our own business is a virtue, a real one. The world went downhill the moment people started obsessing over others instead of focusing on themselves. And anyone who truly cares about societyās well-being should stop meddling." (ribelo)
- "So literally Musk and his pals? So again, Musk et al.? I'm really confused... what are you trying to say. That only some people are allowed to meddle while everyone else should shut up and mind their own business? How do you determine that? Wealth? Political opinions? Class? Race?" (Wytwwww)
- "Gosh yeah, all that... getting rid of slavery, and women's rights, and disability support and awareness... Truly, the world is far better off!" (squigz)
- "Elon Musk is not a nazi." (jjangkke)
- "Then why did he publicly do a Nazi salute?" (Revisional_Sin)
- "He did not do a Nazi salute because otherwise ADL would've been all over him. ADL came out saying he didn't do a Nazi salute." (jjangkke)
- "The ADL is not the final arbitrator of what is a Nazi salute and what isn't." (iamdelirium)
- "And they were criticised for this by many other Jewish organisations." (Revisional_Sin)
- "While this point might be open to debate, the original claim, which I definitely stand by, was not that Musk is a Nazi, but rather that xAI have put out a product under the grok brand which manifestly promoted nazi ideas. If Musk is not in favor of those ideas he might need to work a bit harder to make that clear, because he does tend to leave people with the impression heās okay with it." (jameshart)
- "A prompt was edited by an xAI staff that caused xAI to ignore the politically correct filter, how is Elon Musk responsible for this ?" (jjangkke)
- "Isnāt he the CEO and owner? I thought their massive wealth and control was morally ok because they carried the responsibility for the companies actions at the end of the day. Guess you can have the power and no responsibility! Always someone elseās fault!" (lovich)
- "He's the CEO, and there's now been a few "oh geez some rogue employee made grok say white supremacist stuff, we totally didn't mean for it to say that!" moments. If the management isn't fixing the problems that led to those events, the management is responsible." (sjsdaiuasgdia)
- "He's not very good at demonstrating that. I don't know man. For like... the other 7 billion people on Earth it seems preeeetty easy for them not to be confused with a Nazi. Seems to me just Elon has that issue. I've never had that issue. I don't know anyone who's has that issue. So, it makes you wonder." (const_cast)
- "Itās not a winning argument." (simianwords)
- "Netanyahu is a wanted war criminal for major crimes against humanity. Whatever he thinks should be dismissed as irrelevant." (runarberg)
- "While Israel is home to the largest number of Jews on Earth, most Jews do not live in Israel. And Israel is also home to a large number non-Jews whom Netanyahu is also the prime minister of. It is in fact important that he is not representative of Jewish people." (runarberg)
- "The elected representative of the country made for Jews which is the country that has highest Jewish population and has historical ties to Judaism has exonerated Elon. It has symbolic meaning and fretting over a salute and boycotting the company seems performative." (simianwords)
- "That point was void to begin with. It is an appeal to authority in which the validity of the authority is on extremely shaky grounds." (runarberg)
- "Performative actions are still actions, and sometimes deliver results. If those results are as little as make some people feel better, those are still results. That said, it is hard to be more performative than the gesture it self. So if you want to criticize HN users for being performative, you should apply the same standard to Elon Musk." (runarberg)
- "By many standards (including those used by the Israeli Law of Return), the US has more Jewish people than Israel. EDIT: To be clear, I am not, in noting this fact, arguing against the parent's argument that (this is a paraphrase) the opinion of the head of a state with a large Jewish population (whether or not it is actually the largest in the world) does not itself constitute the response of world Judaism, either in general or specifically as an exoneration of an alleged expression of fascist sympathies; that position is absolutely correct, irrespective of which country happens to have the largest Jewish population." (dragonwriter)
- "And the elected leader of the US was also supportive of Elonās āgestureā so I guess that settles it - Jewish people worldwide, as embodied through the representative voices of their elected leaders, must agree that it was not a Nazi salute. And that firmly settles it because nobody else gets to have any opinion about it because Nazis never bothered anyone else. /s, just in case." (jameshart)
- "Judaism doesnāt have a single spiritual leader like Catholicism or Tibetan Buddhism (the Pope and the Dalai Lama respectively). This is like saying that anti-Tibetan racism can be absolved if Yan Jinhai (the chairman of Tibet Autonomous region) or Gombojavyn Zandanshatar (the prime minister of Mongolia) has said nice things about the racist because most Tibetan Buddhists live in Tibet or Mongolia." (runarberg)
- "Democracy doesn't guarantee the election or ongoing approval of a person who is morally unimpeachable. If it did, Donald Trump wouldn't be president (and he's hardly the only one)." (otterley)
- "Why does it matter to you when Netanyahu - one of the most important prime ministers and a representative of Jews - made a whole X post exonerating Elon over the salute? "@elonmusk is being falsely smeared. Elon is a great friend of Israel. He visited Israel after the October 7 massacre in which Hamas terrorists committed the worst atrocity against the Jewish people since the Holocaust. He has since repeatedly and forcefully supported Israelās right to defend itself against genocidal terrorists and regimes who seek to annihilate the one and only Jewish state. I thank him for this. Kudos: Famously great guy that Netanyahu." (simianwords)
- "Famously great guy that Netanyahu." (Kudos)
- "Whatever you think of him it is important that the elected representative of Jewish people exonerated him." (simianwords)
Benchmarking and Model Performance
Users discussed how LLMs are benchmarked, with some questioning the choice of speed as a primary metric. There was also a degree of skepticism about Grok's performance relative to other models and a call for transparency in benchmarking methodologies.
- "It's interesting that the benchmark they are choosing to emphasize (in the one chart they show and even in the "fast" name of the model) is token output speed." (boole1854)
- "On the full subset of SWE-Bench-Verified, grok-code-fast-1 scored 70.8% using our own internal harness." (esafak)
- "Let's see this harness, then, because third party reports rate it at 57.6%" (esafak)
- "It does still compare well against the others: https://www.vals.ai/benchmarks/swebench-2025-08-27" (hrdwdmrbl)
- "Is this the model that is the "Coding" version of Grok-4 promised when Grok-4 had awful coding benchmarks? I guess if you cannot do well in benchmarks, instead pick an easier to pump up one and run with that - speed." (Workaccount2)
- "Looking online for benchmarks the first thing that came up was a reddit post from an (obvious) spam account[1] gloating about how amazing it was on a bunch of subs." (Workaccount2)
- "I use Opus 4.1 exclusively in Claude Code but then I also use zen-mcp server to get both gpt5 and gemini-2.5-pro to review the code and then Opus 4.1 responds." (furyofantares)
- "I'd love to hear how you have this set up." (dotancohen)
- "I suspect most of the problems opus has for me are more context related, and Iām not sure more models would help. Speculation on my part." (mchusma)
- "I'm more curious if its based on Grok 3 or what, I used to get reasonable answers from Grok 3. If that's the case, the trick that works for Grok and basically any model out there is to ask for things in order and piecemeal, not all at once." (giancarlostoro)
- "Opus 4.1 is by far the best right now for most tasks. Itās the first model I think will almost always pump out āgood codeā." (mchusma)
- "I found Sonnet 3.5 to be a clear step up in coding, I thought 3.7 was worse, 2.5 pro equivalent, and 4 sonnet equal maybe tiny better than 3.5. Opus 4.1 is the first one to me that feels like a solid step up over sonnet 3.5." (mchusma)
- "After trying Cerebras free API (not affiliated) which delivers Qwen Coder 480b and gpt-oss-120b a mind boggling ~3000 tps, that output speed is the first thing I checked out when considering a model for speed. I just wish Cerebras had a better overall offering on their cloud, usage is capped at 70M tokens / day and people are reporting that it's easily hit and highly crippling for daily coding." (ojosilva)
- "As a bit of a side note, I want to like Cerebras, but using any of the models through OpenRouter that uses them has lead to, too many throttling responses. Like you can't seem to make a few calls per minute. I'm not sure if Cerebras is throttling OpenRouter or if they are throttling everybody. If somebody from Cerebras is reading this, are you having capacity issues?" (sdesol)
- "You can get your own key with cerebras and then use it in openrouter. Its a little hidden, but for each provider you can explicitly provide your own key. Then it won't be throttled." (gompertz)
- "There is a national superset of āNIHā bias that I think will impede adoption of Chinese-origin models for the foreseeable future. Thatās a shame because by many objective metrics theyāre a better value." (stocksinsmocks)
- "In my case it's not NIH, but rather that I don't trust or wish to support my nation's largest geopolitical adversary." (dlachausse)
- "Your loss. Qwen3 A3B replaced ChatGPT for me entirely, it's hard for me to imagine going back using remote models when I can load finetuned and uncensored models at-will. Maybe you'd find consolation in using Apple or Nvidia-designed hardware for inference on these Chinese models? Sure, the hardware you own was also built by your "nation's largest geopolitical adversary" but that hasn't seemed to bother you much." (bigyabai)
- "Genuine question: how does downloading an open-weight model (Qwen in this case) and running it either locally or via a third-party service benefit China?" (mft_)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I tested this yesterday with Cline. It's fast, works well with agentic flows, and produces decent code. No idea why this thread is so negative (also got flagged while I was typing this?) but it's a decent model. I'd say it's at or above gpt5-mini level, which is awesome in my book (I've been maining gpt5-mini for a few weeks now, does the job on a budget)." (NitpickLawyer)
- "It's fast. I tested it in EU tz, so ymmv" (NitpickLawyer)
- "It does agentic in an interesting way. Instead of editing a file whole or in many places, it does many small passes." (NitpickLawyer)
- "Had a feature take ~110k tokens (parsing html w/ bs4). Still finished the task. Didn't notice any problems at high context." (NitpickLawyer)
- "When things didn't work first try, it created a new file to test, did all the mocking / testing there, and then once it worked edited the main module file. Nice. GPT5-mini would often times edit working files, and then get confused and fail the task." (NitpickLawyer)
- "At the price point it's at, I could see it as a daily driver. Even agentic stuff w/ opus + gpt5 high as planners and this thing as an implementer. It's fast enough that it might be worth setting it up in parallel and basically replicate pass@x from research." (NitpickLawyer)
- "IMO it's good to have options at every level. Having many providers fight for the market is good, it keeps them on their toes, and brings prices down. GPT5-mini is at 2$/MTok, this is at 1.5$/MTok. This is basically "free", in the great scheme of things. I ndon't get the negativity." (NitpickLawyer)
- "Qwen3-Coder-480B hosted by Cerebras is $2/Mtok (both input and output) through OpenRouter. OpenRouter claims Cerebras is providing at least 2000 tokens per second, which would be around 10x as fast, and the feedback I'm seeing from independent benchmarks indicates that Qwen3-Coder-480B is a better model." (coder543)
- "On the full subset of SWE-Bench-Verified, grok-code-fast-1 scored 70.8% using our own internal harness. Let's see this harness, then, because third party reports rate it at 57.6% https://vals.ai/models/grok_grok-code-fast-1" (esafak)
- "It does still compare well against the others: https://vals.ai/benchmarks/swebench-2025-08-27" (hrdwdmrbl)
- "It's fast. I tested it in EU tz, so ymmv" (NitpickLawyer)
- "It does agentic in an interesting way. Instead of editing a file whole or in many places, it does many small passes." (NitpickLawyer)
- "When it works well with agentic flows, and produces decent code. No idea why this thread is so negative (also got flagged while I was typing this?) but it's a decent model." (NitpickLawyer)
- "You also have Liberia following your āstandardsā! Thereās two of you! Must be nice." (jiggawatts)
- "There are many ways to skin a cat. Often all it takes is to reset to a checkpoint or undo and adjust the prompt a bit with additional context and even dumber models can get things right. I've used grok code fast plenty this week alongside gpt 5 when I need to pull out the big guns and it's refreshing using a fast model for smaller changes or for tasks that are tedious but repetitive during things like refactoring." (Jcampuzano2)
- "This is exactly what I use it for. It's my go-to "dumb tedious things" model. And it fills that role very well. You don't need the smartest slow model for every task. I've used it all week for tedious things nobody wants to do and gotten a ton done in less time." (Jcampuzano2)
- "The only thing I've had issues with is if you're not a level more specific than you might be with smarter models it can go off the rails. But give it a tedious task and a very clear example and it'll happily get the job done." (Jcampuzano2)
- "Also, that LLM has a reputation for being untrustworthy and prone to making up stories about things which are not true." (babar.)
- "For me, I found Sonnet 3.5 to be a clear step up in coding" (mchusma)
- "Opus 4.1 is by far the best right now for most tasks." (mchusma)
- "It's the first model I think will almost always pump out āgood codeā." (mchusma)
- "I do always plan first as a separate step, and I always ask it for plans or alternatives first and always remind it to keep things simple and follow existing code patterns." (mchusma)
- "What the company behind it. XAI, and the person named Elon Musk." (Wongarsu)
- "I use grok a lot on the web interface (grok.com) and never had any weird incidents. It's a run-of-the-mill SOTA model with good web search and less safety training" (wongarsu)
Broader AI Capabilities and Future Directions
Users touched upon the broader implications of AI in coding, including the potential for agentic workflows, analogies between "full self-driving" and fully autonomous coding, and the importance of running LLMs locally.
- "Besides being a faster slot machine, to the extent that they're any good, a fast agentic LLM would be very nice to have for codebase analysis." (postalcoder)
- "Now, will I try Grok? Absolutely not, but that's a personal decision due to not wanting anything to do with X, rather than a purely rational decision." (eterm)
- "It's like measuring how fast your car can go by counting how often you clean the upholstery. There's nothing wrong with doing it, but it's entirely unrelated to performance." (ori_b)
- "If Apple keeps improving things, you can run the model locally. I'm able to run models on my Macbook with an M4 that I can't even run on my 3080 GPU (mostly due to VRAM constraints) but they run reasonably fast, would the 3080 be faster? Sure, but its also plenty fast to where I'm not sitting there waiting longer than I wait for a cloud model to "reason" and look things up." (giancarlostoro)
- "I think the biggest thing for offline LLMs will have to be consistency for having them search the web with an API like Google's or some other search engines API, maybe Kagi could provide an API for people who self-host LLMs (not necessarily for free, but it would still be useful)." (giancarlostoro)
- "For agentic workflows, speed and good tool use are the most important thing. Agents should use tools for things by design, and that can include reasoning tools and oracles. The agent doesn't need to be smart, it just needs a line to someone who is that can give the agent a hyper-detailed plan to follow." (CuriousC)
- "It's like measuring how fast your car can go by counting how often you clean the upholstery. There's nothing wrong with doing it, but it's entirely unrelated to performance." (ori_b)
- "Full Self Coding? No, making edits to an exiting codebase." (RedMist)
- "Full Self Coding by next year at the latest" (mplewis)
- "And soon after that, self coding in a Mars colony" (antonvs)
- "To me, "full self driving" means you can hop in the back seat and have a nap. If you have to keep your hands near the wheel and maintain attention to the road then... shrugs not really the same. IMHO we're in the "uncanny valley" of vehicular automation" (vunderba)
- "I think this is a very good description of where autonomous vehicles are right now." (rkomorn)
- "Everything a layman would call "AI" is in the "uncanny valley" at the moment!" (bpavuk)
- "The LLM writing and code is oh-so-easy to spot" (sebastiennight)
- "Maybe it's because we get use to it and therefore recognize it easier, but it does seem to get more and more recognizable instead of the opposite, doesn't it?" (sebastiennight)
- "I think I could recognize a ChatGPT email way easier in 2025 than if you showed me the same email written by gpt-3.5." (sebastiennight)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "Okay, so are we going back to the 'full self-driving' debate here? Because that's how you get back to there." (mwigdahl)
- "When you don't have the capacity to process a large amount of code at once, it's better to have a smaller context window that is less prone to making mistakes during the analysis of that larger context." (mkd)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "Opus 4.1 is by far the best right now for most tasks. Itās the first model I think will almost always pump out āgood codeā." (mchusma)
- "I do always plan first as a separate step, and I always ask it for plans or alternatives first and always remind it to keep things simple and follow existing code patterns. Sometimes I just ask it to double check before I look at it and it makes good tweaks. This works pretty well for me." (mchusma)
- "For me, I found Sonnet 3.5 to be a clear step up in coding, I thought 3.7 was worse, 2.5 pro equivalent, and 4 sonnet equal maybe tiny better than 3.5. Opus 4.1 is the first one to me that feels like a solid step up over sonnet 3.5." (mchusma)
- "I'm more curious if its based on Grok 3 or what, I used to get reasonable answers from Grok 3. If that's the case, the trick that works for Grok and basically any model out there is to ask for things in order and piecemeal, not all at once." (giancarlostoro)
- "I think the biggest thing for offline LLMs will have to be consistency for having them search the web with an API like Google's or some other search engines API, maybe Kagi could provide an API for people who self-host LLMs (not necessarily for free, but it would still be useful)." (giancarlostoro)
- "For agentic workflows, speed and good tool use are the most important thing. Agents should use tools for things by design, and that can include reasoning tools and oracles. The agent doesn't need to be smart, it just needs a line to someone who is that can give the agent a hyper-detailed plan to follow." (CuriousC)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I tested this yesterday with Cline. It's fast, works well with agentic flows, and produces decent code." (NitpickLawyer)
- "IMO it's good to have options at every level. Having many providers fight for the market is good, it keeps them on their toes, and brings prices down." (NitpickLawyer)
- "Also, that LLM has a reputation for being untrustworthy and prone to making up stories about things which are not true." (babar.)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "It's like measuring how fast your car can go by counting how often you clean the upholstery. There's nothing wrong with doing it, but it's entirely unrelated to performance." (ori_b)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "Its like measuring how fast your car can go by counting how often you clean the upholstery. There's nothing wrong with doing it, but it's entirely unrelated to performance." (ori_b)
- "You also have Liberia following your āstandardsā! Thereās two of you! Must be nice." (jiggawatts)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "It's like measuring how fast your car can go by counting how often you clean the upholstery. There's nothing wrong with doing it, but it's entirely unrelated to performance." (ori_b)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "It's like measuring how fast your car can go by counting how often you clean the upholstery. There's nothing wrong with doing it, but it's entirely unrelated to performance." (ori_b)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)
- "The benchmarks are not great. They're not great for this model and they're not for the others either. The first thing I want to know is: are the benchmarks accurate?" (Workaccount2)
- "I also noticed that the model randomly deleted some of my code - something I haven't had happen in a long time. While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it." (RedMist)