Essential insights from Hacker News discussions

Scaling our observability platform by embracing wide events and replacing OTel

Data Retention and the Value of Logs

A central theme is how much log data should be retained, and for how long. Some argue for minimizing log data to reduce waste, while others champion long-term retention for comprehensive analysis. The tension comes down to balancing storage and processing costs against the insights that historical data can yield.

  • Reducing Data Waste: "Whenever I read things like this I think: You are doing it wrong. I guess it is an amazing engineering feat for Clickhouse but I think we (as in IT or all people) should really reduce the amount of data we create. It is wasteful." - ofrzeta
  • GDPR's Influence: "One nice side effects of the GDPR is that you're not allowed to keep logs indefinitely if there is any chance at all that they contain personal information. The easiest way to comply is to throw away logs after a month (accepted as the maximum justifiable for general error analysis) and be more deliberate about what you keep longer." - brazzy
  • Long-Term Value: "Access logs and payment information for compliance, troubleshooting and evaluating trends of something you didn't know existed until months or years later, finding out if an endpoint got exploited in the past for a vulnerability that you only now discovered, tracking events that may span across months. Logs are a very useful tool in many non-dev or longer term uses." - Sayrus
  • Refining Data: "It makes sense to keep a high fidelity history of what happened and why. However, I think the issue is more that this data is not refined correctly...If your logs are important data, maybe logging is the wrong way to go about it. Instead think about how to clean, refine and persist the data you need like your other application data." - surelymop
  • Compression Efficiency: "Sure, we should cut waste, but compression exists for a reason. Dropping valuable observability data to save space is usually shortsighted...Tiered storage with S3 or similar backends is cheap and lets you keep full-fidelity data without breaking the budget." - CSDude (a sketch of what this can look like in practice follows this list)
  • Over-Logging: "That's a bit of a blanket statement, too :) I've seen many systems where a lot of stuff is logged without much thought. 'Connection to database successful' - does this need to be logged on every connection request? Log level info, warning, debug? Codebases are full of this." - ofrzeta
  • Unpredictable Usefulness: "There's always another log that could have been key to getting to the bottom of an incident. It's impossible to know completely what will be useful in advance." - throwaway0665
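
brazzy's 30-day window and CSDude's tiered storage both map onto concrete database features. A minimal sketch of what this could look like in ClickHouse (the database from the article), via the clickhouse-connect Python driver; the logs table schema and the s3_tiered storage policy are illustrative assumptions, and the policy would need to be defined separately in the server configuration:

    # Sketch: 30-day hard delete (brazzy's GDPR-friendly window) plus
    # tiered storage that ages data onto S3 (CSDude's suggestion).
    # Table name, columns, and the 's3_tiered' policy are assumptions.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    client.command("""
    CREATE TABLE IF NOT EXISTS logs
    (
        timestamp DateTime CODEC(Delta, ZSTD),
        service   LowCardinality(String),
        level     LowCardinality(String),
        message   String CODEC(ZSTD)
    )
    ENGINE = MergeTree
    ORDER BY (service, timestamp)
    TTL timestamp + INTERVAL 7 DAY TO VOLUME 'cold',  -- age onto cheap storage
        timestamp + INTERVAL 30 DAY DELETE            -- hard-delete after a month
    SETTINGS storage_policy = 's3_tiered'
    """)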

Data Representation and Optimization

A significant discussion point centers on how data is represented and stored, highlighting the potential for optimization and the perceived inefficiency of common practices such as using JSON for everything.

  • Wasteful Data Representations: "...data can be represented wastefully, which is often ignored...Most 'wide' log formats are implemented... naively. Literally just JSON REST APIs or the equivalent...I've experimented with Open Telemetry, and its flagrantly wasteful data representations make me depressed. Why must everything be JSON!?" - jiggawatts
  • Binary Differencing and RLE: jiggawatts describes an experiment applying binary differencing and run-length encoding (RLE) to Windows Server metrics, achieving large compression ratios and suggesting that far more efficient representations are possible (a toy sketch of the idea follows this list).
  • Serialization/Deserialization Costs: "Every time you have to serialize/de-serialize, and every time you have to perform disk/network I/O, you introduce a lot of performance cost and therefore overall cost to your wallet." - munchbunny
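
jiggawatts doesn't share code, but the core idea is easy to sketch. A toy Python version, assuming regularly sampled integer gauge values (the sample data is invented): delta-encode consecutive samples, then run-length encode the deltas, so flat or slowly drifting metrics collapse to a handful of pairs.

    # Toy sketch of delta encoding followed by run-length encoding (RLE).
    # Flat metrics (common for gauges sampled every few seconds) collapse
    # to a few (delta, run_length) pairs.
    from itertools import groupby

    def delta_rle_encode(samples: list[int]) -> list[tuple[int, int]]:
        """Encode samples as run-length-encoded deltas: [(delta, run_length), ...]."""
        deltas = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]
        return [(d, len(list(run))) for d, run in groupby(deltas)]

    def delta_rle_decode(encoded: list[tuple[int, int]]) -> list[int]:
        """Invert the encoding back to the original samples."""
        samples, acc = [], 0
        for delta, count in encoded:
            for _ in range(count):
                acc += delta
                samples.append(acc)
        return samples

    # A mostly idle CPU gauge: 1000 samples with long flat stretches.
    raw = [12] * 400 + [13] * 300 + [12] * 300
    encoded = delta_rle_encode(raw)
    assert delta_rle_decode(encoded) == raw
    print(encoded)  # [(12, 1), (0, 399), (1, 1), (0, 299), (-1, 1), (0, 299)]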

Kubernetes Log Aggregation Challenges

Several comments address the challenges of aggregating and viewing logs in Kubernetes environments.

  • Frustration with Kubernetes Logging: "one of my immense frustrations with kubernetes - none of the commands for viewing logs seem to accept logical aggregates like 'show me everything from this deployment'." - XorNot
  • Workarounds and Tools: knutzui notes that it is "rather trivial to build this, by simply combining all log streams from pods of a deployment," and points to k9s as a tool that "supports this directly." Sayrus mentions Stern, which can tail deployments and filter logs, and ofrzeta suggests "kubectl logs deploy/mydep --all-containers=true" as well as kubetail.com. A rough sketch of the combine-all-streams approach follows.
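
Combining the log streams of a deployment's pods is indeed scriptable. A rough sketch using the official Kubernetes Python client; the deployment name "mydep" and namespace "default" are placeholders, and this fetches a snapshot rather than tailing live streams:

    # Rough sketch: aggregate logs across all pods and containers of a
    # deployment, roughly what `kubectl logs deploy/mydep --all-containers=true`
    # does. 'mydep' and 'default' are placeholder names.
    from kubernetes import client, config

    config.load_kube_config()
    apps, core = client.AppsV1Api(), client.CoreV1Api()

    # Resolve the deployment's label selector, then list matching pods.
    dep = apps.read_namespaced_deployment("mydep", "default")
    selector = ",".join(f"{k}={v}" for k, v in dep.spec.selector.match_labels.items())

    for pod in core.list_namespaced_pod("default", label_selector=selector).items:
        for container in pod.spec.containers:
            logs = core.read_namespaced_pod_log(
                pod.metadata.name, "default", container=container.name, tail_lines=50
            )
            for line in logs.splitlines():
                print(f"{pod.metadata.name}/{container.name}: {line}")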

ClickHouse vs. Other Technologies

The discussion includes opinions on ClickHouse's strengths and weaknesses compared to other database systems, particularly Postgres and Elasticsearch.

  • ClickHouse Strengths: ClickHouse is favored for analytics and immutable, append-only data scenarios like log storage. "For things like logs I can 100% see the value." - joshstrange
  • ClickHouse Weaknesses: Concerns are voiced about ClickHouse's usability and its limitations around ETL processes and data updates. "Every time I use Clickhouse I want [to] blow my brains out, especially knowing that Postgres exists...Unless you are using it in a very specific, and in my opinion, limited way, it feels worse than Postgres in every way." - joshstrange
  • ClickHouse for Scale: iw7tdb2kqo9 asks, "Can you search log data in this volume? ElasticSearch has query capabilities for small scale log data I think. Why would I use ClickHouse instead of storing log data as json file for historical log data?" sethammons answers by pointing to immense cost savings at scale: "A naive 'push json into splunk' will cost us over $6M/year, but I can only get maybe 5-10% of that approved...In the article, they talk about needing 8k cpu to process their json logs, but only 90 cpu afterward." A sketch of the kind of ad-hoc search this enables follows.
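
On iw7tdb2kqo9's question: ad-hoc search over large volumes is the core use case for a columnar store, and it is exactly where flat JSON files struggle, since a columnar engine reads only the touched columns. A sketch of such a query via clickhouse-connect, reusing the hypothetical logs table from the retention sketch above; the query itself is illustrative, not from the article:

    # Sketch: the kind of ad-hoc search that is painful over raw JSON files
    # but cheap in a columnar store. Reuses the hypothetical `logs` table
    # from the retention sketch above.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    result = client.query(
        """
        SELECT service, count() AS errors
        FROM logs
        WHERE level = 'error'
          AND timestamp >= now() - INTERVAL 90 DAY
          AND message ILIKE '%timeout%'
        GROUP BY service
        ORDER BY errors DESC
        LIMIT 10
        """
    )
    for service, errors in result.result_rows:
        print(service, errors)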

OpenTelemetry (OTel) Scrutiny

OpenTelemetry (OTel) draws both praise and skepticism, particularly regarding its efficiency and its data collection model.

  • OTel Efficiency Concerns: Referenced above under "Data Representation and Optimization," jiggawatts expresses frustration with what they perceive as "flagrantly wasteful data representations" in OTel.
  • OTel Push vs. Pull: mrbluecoat's quote about OTel operating in a passive fashion is directly refuted: "Everything OTel I ever did was fully active. So I wouldn't say this is very noteworthy. Instead it is wrong/incomplete information." - fuzzy2
  • OTel I/O Overhead: munchbunny notes "at the petabyte scale, the amount of money you save by throwing away a single hop can more than pay for an engineer whose only job is to write serializer/deserializer logic," implying that the benefits of the OTel collector can be offset by the added I/O and serialization costs (a back-of-envelope illustration follows this list).
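
munchbunny's trade-off can be made concrete with rough arithmetic. A back-of-envelope sketch in which every number (volume, serialization throughput, prices) is an illustrative assumption rather than a figure from the thread:

    # Back-of-envelope sketch of munchbunny's argument: at petabyte scale,
    # one extra hop (serialize + send + deserialize) has a real price tag.
    # Every number below is an illustrative assumption, not data from the thread.

    PB_PER_DAY = 1                   # assumed telemetry volume
    GB_PER_DAY = PB_PER_DAY * 1_000_000

    CPU_SEC_PER_GB = 10              # assumed serde round trip (~100 MB/s per core)
    USD_PER_CPU_HOUR = 0.04          # assumed cloud vCPU price
    USD_PER_GB_TRANSFER = 0.01       # assumed inter-AZ transfer price

    cpu_usd_per_year = GB_PER_DAY * CPU_SEC_PER_GB / 3600 * USD_PER_CPU_HOUR * 365
    net_usd_per_year = GB_PER_DAY * USD_PER_GB_TRANSFER * 365

    print(f"extra serde compute: ~${cpu_usd_per_year:,.0f}/year")  # ~$40,556/year
    print(f"extra network hop:   ~${net_usd_per_year:,.0f}/year")  # ~$3,650,000/year

Under these assumptions the network transfer of the extra hop, not the CPU, is what dominates at petabyte scale.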

The Importance of Logs from Failing Services

There's agreement that capturing logs even when services have crashed is crucial.

  • Capturing Logs During Failure: Thaxll states, "I mean if you don't get the logs when the service is down the entire solution is useless." The article excerpt highlights the advantage of OpenTelemetry for capturing logs even when a service is crash-looping.