Here's a breakdown of the key themes from the Hacker News discussion, supported by quotes from the users:
Cost and Efficiency of LLM-Based Observability
A major concern revolves around the cost-effectiveness of using LLMs for observability compared to traditional methods. Several users argue that while LLMs may offer potential benefits, the expense of processing vast amounts of telemetry data could be prohibitive; a rough back-of-envelope calculation after the quotes below shows the scale involved.
- Expense compared to traditional alerting: "In terms of identifying the problems, shoving all your data into an LLM to spot irregularities would be exceptionally expensive vs traditional alerting, even though it may be much more capable at spotting potential issues without explicit alerting thresholds being set up." (danpalmer)
- The "10x your observability costs" argument: "I feel like the alternate title of this could be âhow to 10x your observability costs with this one easy trickâ. It didnât really show a way to get rid of all the graphs, the prompt was âshow me why my latency spikes every four hoursâ. Thatâs really cool, but in order to generate that prompt you need alerts and graphs." (techpineapple)
- Expense beyond the LLM call: "the article did not describe an LLM agent intuiting a hypothesis, checking with data, and coming up with a narrative all for 60 cents. The author did 80% of the work, and had the LLM do the final stretch. Maybe there's value there, but the article did not present a workflow that obviated the need for graphs." (nemothekid)
- Unified data storage cost: "its the end of observability as we know it). he goes on to say it's AI constantly analyzing data in a Unified sub-second database? Even without the AI that's expensive." (techpineapple)
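To give a sense of scale for the "shove all your data into an LLM" objection, here is a rough back-of-envelope calculation. Every figure in it (event volume, tokens per event, per-token price) is an illustrative assumption, not a number taken from the article or the thread.

```python
# Back-of-envelope sketch of the cost argument above. All numbers are
# illustrative assumptions, not quoted prices or a measured workload.
EVENTS_PER_SECOND = 5_000           # assumed telemetry volume
TOKENS_PER_EVENT = 100              # assumed size of a log line/span rendered as text
PRICE_PER_MILLION_TOKENS = 1.0      # assumed LLM input price, USD

tokens_per_day = EVENTS_PER_SECOND * TOKENS_PER_EVENT * 86_400
cost_per_day = tokens_per_day / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"{tokens_per_day:,} tokens/day -> ${cost_per_day:,.0f}/day")
# ~43.2 billion tokens/day -> roughly $43,200/day under these assumptions,
# which is why commenters contrast "analyze everything continuously" with
# targeted, on-demand investigation of an already-flagged anomaly.
```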
The Importance of Existing Observability Infrastructure
Several commentators emphasize that the usefulness of LLMs for observability is contingent on having a well-established telemetry pipeline and existing observability tools. LLMs are seen as tools to enhance existing systems, not replace them.
- LLMs as an addition, not a replacement: "That's really cool, but in order to generate that prompt you need alerts and graphs. How do you know you're latency is spiking to generate the prompt?" (techpineapple)
- Telemetry pipeline is key: "Can you be clearer about what you're defining as "80% of the work"? If it's "setting up a telemetry pipeline capable of serving an LLM agent loop", I'm going to disagree, but I don't want to disagree preemptively." (tptacek)
LLMs as Hypothesis Generators and Cross-Data Source Integrators
Some argue that the real value of LLMs lies in their ability to generate hypotheses, integrate information from disparate data sources, and accelerate root cause analysis. This allows for a more comprehensive understanding of system behavior and potential issues; a minimal tool-call sketch after the quotes below illustrates the pattern being described.
- Integrating disparate data sources: "One of the interesting things an agent can do that no individual telemetry tool does effectively is make deductions and integrate information across data sources. It's a big open challenge for us here; in any given incident, we're looking at Honeycomb traces, OpenSearch for system logs, and Prometheus metrics in a VictoriaMetrics cluster. Given tool calls for each of these data sources, an agent can generate useful integrated hypotheses without any direct integration between the data sources. That's pretty remarkable." (tptacek)
- Expediting problem solving: "But even with Honeycomb, we are sitting on an absolute mountain of telemetry data, in logs, in metrics, and in our trace indices at Honeycomb...We can solve problems by searching and drilling down into all those data sources; that's how everybody solves problems. But solving problems takes time. Just having the data in the graph does not mean we're near a solution!...An LLM agent can chase down hypotheses, several at a time, and present them with collected data and plausible narratives." (tptacek)
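A minimal sketch of what "tool calls for each of these data sources" could look like in practice. The host names, function names, and tool schema below are illustrative assumptions; the Prometheus and OpenSearch calls use their standard HTTP APIs, and a Honeycomb tool is omitted here rather than guessing at its query API.

```python
# Sketch of per-data-source tools an agent loop could call, assuming a
# Prometheus-compatible metrics store and an OpenSearch log cluster.
# Host names are placeholders, not real configuration.
import requests

PROM_URL = "http://prometheus.internal:9090"        # hypothetical host
OPENSEARCH_URL = "http://opensearch.internal:9200"  # hypothetical host

def query_metrics(promql: str) -> dict:
    """Run an instant PromQL query against the metrics store (e.g. VictoriaMetrics)."""
    r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
    r.raise_for_status()
    return r.json()

def query_logs(index: str, lucene_query: str, size: int = 50) -> dict:
    """Full-text search over system logs in OpenSearch."""
    r = requests.post(
        f"{OPENSEARCH_URL}/{index}/_search",
        json={"size": size, "query": {"query_string": {"query": lucene_query}}},
    )
    r.raise_for_status()
    return r.json()

# Tool schema in the common "function calling" shape; an agent loop passes these
# to the model, runs whichever tool it requests, and feeds the JSON result back.
TOOLS = [
    {"type": "function", "function": {
        "name": "query_metrics",
        "description": "Instant PromQL query against the metrics store",
        "parameters": {"type": "object",
                       "properties": {"promql": {"type": "string"}},
                       "required": ["promql"]}}},
    {"type": "function", "function": {
        "name": "query_logs",
        "description": "Full-text search over system logs",
        "parameters": {"type": "object",
                       "properties": {"index": {"type": "string"},
                                      "lucene_query": {"type": "string"}},
                       "required": ["index", "lucene_query"]}}},
]

DISPATCH = {"query_metrics": query_metrics, "query_logs": query_logs}
```

In this shape the model never needs the backends to be integrated with each other: it asks for metrics and logs separately and does the correlation itself, which is the property tptacek calls out as remarkable.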
Challenges with Anomaly Detection and Alerting
The discussion highlights the inherent difficulties of anomaly detection and alerting, noting that anomalies are often the norm, which leads to alert fatigue and to critical issues being ignored.
- Anomalies as the norm: "Additionally, I wonder if any of this fixes the fact that anomaly detection in alerting is traditionally a really hard problem, and one I've hardly seen done well. Of any set of packaged or recommended alerts, I probably only use 1% of them because anomalies are often the norm." (techpineapple)
- Alert fatigue: "> or worse teams learn to ignore it and it becomes useless" (zdragnar)
Tuning Alert Systems and Logging Practices
Several users touch on the importance of carefully tuning alert systems and establishing clear logging practices, both to avoid alert fatigue and to ensure that logs are actually useful for troubleshooting; a small smoothing-and-derivative sketch after the quotes below illustrates the kind of tuning being described.
- Tuning alerts is key: "You fix these issues or you tune your alert system to make it clear that they aren't actionable. Otherwise you end up turning them off so your system doesn't turn into the boy who cried wolf (or worse teams learn to ignore it and it becomes useless)...Bayesian filters, and basic dirivative functions (think math) can do a lot to tame output from these systems. These arent "product features" so in most orgs they dont get the attention they need or deserve." (zer00eyz)
- Consistent logging practices are necessary: "There seem to be two schools of thought, just enough to tell something is wrong but not what it is - OR - you get to drink from the firehose. And most orgs go from the first to the second... If you can't match a log entry to a running environment... well." (zer00eyz)
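One concrete reading of zer00eyz's point about derivative functions and filtering is to smooth the raw signal and gate alerts on both level and rate of change, rather than firing on every spike. The sketch below is a minimal illustration with made-up thresholds and sample data, not a prescription from the thread.

```python
# Smooth a noisy metric, then alert only when it is both elevated and still
# rising, so a lone spike does not page anyone but a sustained climb does.
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average of a metric series."""
    smoothed, prev = [], None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

def alerts(values, level=150.0, slope=5.0):
    """Indices where the smoothed signal exceeds `level` and is still rising."""
    s = ewma(values)
    return [i for i in range(1, len(s))
            if s[i] > level and (s[i] - s[i - 1]) > slope]

latency_ms = [90, 95, 250, 92, 96, 110, 130, 160, 210, 280]  # one spike, then a trend
print(alerts(latency_ms))  # -> [8, 9]: the lone spike is damped, the sustained climb fires
```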
The Nuance of LLM Output and the Need for Human Expertise
Concerns are raised that LLMs may confidently produce incorrect conclusions, potentially leading to misdiagnosis and operational problems, particularly if users blindly trust the AI's output. The value of human expertise in interpreting the LLM's findings is emphasized.
- Risk of confidently incorrect output: "As somebody who's good at RCA, I'm worried all my embarrassed coworkers are going to take at face value a tool that's confidently incorrect 10% of the time and screw stuff up more instead of having to admit they don't know something publicly." (zug_zug)
- Expertise required for leveraging AI: "I think the commenter might have been saying that you need experts I'm the field to leverage AI here, in which case your response is supporting their point." (NewJazz, referring to ok_dad's comment on experts)
- LLM as a starting point only: "Also the LLM would only point to a direction and I'm still going to have to use the UI to confirm." (stlava)
Skepticism Regarding Marketing Hype and Thinly Veiled Promotion
Several commentators express skepticism regarding the marketing hype surrounding AI-powered observability, suggesting that the article in question is a thinly veiled promotion for a specific product.
- Thinly veiled marketing promo: "This post is a thinly veiled marketing promo." (AdieuToLogic)
- Focusing on fast feedback loops for marketing: AdieuToLogic dismantles the argument about speed: "'AI thrives' on many things, but 'speed' is not one of them. Note the false consequence ('it'll outrun you every time') used to set up the epitome of vacuous sales pitch drivel...Honeycomb's entire modus operandi is predicated on fast feedback loops, collaborative knowledge sharing, and treating everything as an experiment."
- Marketing is a given: "Is it even attempting to be veiled at all? You know you're reading a company's blog post, written about a feature the company is building for their product, right? It is explicitly marketing." (pgwhalen)
Usability Gaps in Existing Observability Tools
Some users point out frustrations with existing observability tools like Datadog: the data is available, but navigating between different views or products for related events is often cumbersome. LLMs could potentially bridge these usability gaps, though ideally the tooling would improve natively.
- Usability gaps in existing observability tools: "I feel that if you need an LLM to help pivot between existing data it just means the operability tool has gaps in user functionality. This is by far my biggest gripe with DataDog today. All the data is there but going from database query to front end traces should be easy but is not." (stlava)
- Problem of siloed Datadog tools: "The whole point of Datadog is that you can seamlessly go between products for the same event source...Doesn't it just highlight a setup issue?" (JyB)