Essential insights from Hacker News discussions

450× Faster Joins with Index Condition Pushdown

The Role and Effectiveness of Query Optimizations, Particularly Index Condition Pushdown

A significant portion of the discussion revolves around the technical merits and implementation of query optimizations, with a particular focus on Index Condition Pushdown (ICP). Users debate whether ICP is a novel concept or a fundamental optimization that should be standard in all database systems.

  • "marceloaltmann: Straddled joins were still a bottleneck in Readyset even after switching to hash joins. By integrating Index Condition Pushdown into the execution path, we eliminated the inefficiency and achieved up to 450× speedups."
  • "jamesblonde: We call these pushdown joins in rondb. They only support an equality condition for the index condition. Joins with index condition pushdown is a bit of a mouthful. We also went from like 6 seconds to 50ms. Huge speedup."
  • "Sesse__: This isn't really the same as MySQL's ICP; it seems more like what MySQL would call a “ref” or “eq_ref” lookup, i.e. a simple lookup on an indexed value on the right side of a nested-loop join. It's bread and butter for basically any database optimizer."
  • "rotis: Exactly. This is a basic optimization technique and all the dinosaur era databases should have that. But if you build a new database product you have to implement these techniques from scratch. There is no way you shortcut that."
  • "HDThoreaun: Yea this is pretty fucking basic stuff. Any competent optimization engine should be doing this. "push down indexes as much as possible" is literally the first thing a query planner should be trying to do"
  • "ncruces: Yes. But here they are deciding between "pushdown o.status==shipped" and "pushdown u.email==address@", in parallel both, then join (which they already did) or first doing "u.email==address@" then pushing down "u.id==o.user_id" mostly. This is a judgment call. Their planner is pretty dumb to not know which one is better, but “push down as much as possible” doesn't cut it: you need to actually decide what to push down and why."
  • "HDThoreaun: No, it is not a judgement call. The query planner should be storing the distributions of the values in every index. This makes it obvious which pushdown to do here. Again, basic stuff. You're right though, it's not quite as simple as "push down as much as possible", it is one step past that."
  • "jmalicki: Without storing the joint distribution of the values corresponding to the conditions that span multiple tables, it can be hard to know what's a win."
  • "Seattle3503: Postgres by default computes univariate stats for each column and uses those. If this is producing bad query plans, you can extend the statistics to be multivariate for select groups of columns manually. But to avoid combinatorial growth of stats-related storage and work, you have to pick the columns by hand."
  • "Jupe: Agreed. Isn't this precisely why key statistics (table statistics) are maintained in many DB systems? Essentially, always "push down" the predicate with the worst statistics and always execute (early) the predicates with high selectivity. I'd be very surprised if virtually every RDBMS doesn't do this already."
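The optimization being debated can be sketched in a few lines. This is an illustrative toy, not Readyset's implementation: two in-memory tables with a hand-built secondary index stand in for indexed base tables, and the "straddled" query filters on both sides of the join. Pushing the join key (and the residual status condition) into the index probe replaces a full scan of the right side with a lookup:

```python
from collections import defaultdict

# Toy tables: users filtered by email, orders filtered by status.
users = [{"id": i, "email": f"u{i}@example.com"} for i in range(1000)]
orders = [{"user_id": i % 1000, "status": "shipped" if i % 2 else "pending"}
          for i in range(10000)]

# Secondary index on orders.user_id (what a B-tree index gives you for free).
orders_by_user = defaultdict(list)
for o in orders:
    orders_by_user[o["user_id"]].append(o)

def join_without_pushdown(email):
    # Evaluate each table's predicate independently, then join the results.
    left = [u for u in users if u["email"] == email]
    right = [o for o in orders if o["status"] == "shipped"]  # full scan
    return [(u, o) for u in left for o in right if o["user_id"] == u["id"]]

def join_with_pushdown(email):
    # Start from the highly selective predicate, then push the join key
    # into an index probe and evaluate the residual condition there.
    out = []
    for u in users:
        if u["email"] == email:
            for o in orders_by_user[u["id"]]:   # index lookup, not a scan
                if o["status"] == "shipped":    # condition checked at the index
                    out.append((u, o))
    return out

assert join_without_pushdown("u7@example.com") == join_with_pushdown("u7@example.com")
```

The two plans return identical rows; the difference is that the second never materializes the low-selectivity `status = 'shipped'` scan, which is the judgment call ncruces and HDThoreaun argue a statistics-aware planner should make automatically.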

Database Technology and Evolution (Row vs. Columnar, KV Stores, Modern Optimizers)

The discussion touches upon the broader landscape of database technologies, including comparisons between row-based and columnar databases, the use of Key-Value stores (like RocksDB) as a foundation for databases, and the evolution of query optimizers.

  • "vjerancrnjak: Another example of row based dbs somehow being insanely slow compared to column based. Just an endless sequence of misbehavior and we’re waving it off as rows work good for specific lookups but columns for aggregations, yet here it is all the other stuff that is unreasonably slow."
  • "tharkun__: It's an example. But not of that. It's an example of old things being new again maybe. Or reinventing the wheel because the wheel wasn't known to them. Yes I know, nobody wants to pay that tax or make that guy richer, but databases like Oracle have had JPPD for a long time. It's just something the database does and the optimizer chooses whether to do it or not depending on whether it's the best thing to do or not."
  • "zmmmmm: I had to dig through to see the details of what database was really in play here, and sure enough, it's a wrapper around a key-value store (RocksDB). So while I'll confess I know little about RocksDB it does sound an awful lot like they threw out a mature relational database engine with built in optimization and now are in the process of paying the price for that by manually optimizing each query (against a key-value store no less, which probably fundamentally limits what optimizations can be done in any general way)."
  • "hxtk: > against a key-value store no less, which probably fundamentally limits what optimizations can be done in any general way I would disagree with this assumption for two reasons: first, theoretically, a file system is a key value store, and basically all databases run on file systems, so it stands to reason that any optimization Postgres does can be achieved as an abstraction over a key-value store with a good API because Postgres already did."
  • "atombender: I think part of the problem is that the people working on Postgres for the most part aren't PhDs, and Postgres isn't very state of the art. Postgres implements the ancient Volcano model from the 1980s, but there's been a ton of query optimization research since then, especially from the database groups at TUM Munich, University of Washington, and Carnegie Mellon. Systems like HyPer and Umbra (both at TUM) are state of the art query planners that would eat Postgres' lunch. Lots of work on making planners smarter about rearranging joins to be more optimal, improving cache locality and buffer management, and so on."
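hxtk's point — that a relational index is just a convention layered over an ordered key space — can be sketched with a minimal ordered key-value store. The class below is an illustrative stand-in, not RocksDB's API (RocksDB provides the same ordered iteration via its iterator seek/prefix facilities): secondary-index entries are encoded as keys whose prefix is the indexed value, so an "index lookup" becomes one prefix scan.

```python
import bisect

class OrderedKV:
    """Minimal ordered KV store: sorted keys plus range scans."""
    def __init__(self):
        self.keys, self.vals = [], []

    def put(self, key, val):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.vals[i] = val          # overwrite existing key
        else:
            self.keys.insert(i, key)
            self.vals.insert(i, val)

    def scan(self, prefix):
        # Seek to the first key >= prefix, then iterate while it matches.
        i = bisect.bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            yield self.keys[i], self.vals[i]
            i += 1

kv = OrderedKV()
# Primary rows live under "row/" keys; a secondary index on status lives
# under "idx/status/" keys that point back at the primary key.
kv.put("row/order/1", {"status": "shipped", "user_id": 7})
kv.put("row/order/2", {"status": "pending", "user_id": 7})
kv.put("idx/status/shipped/order/1", "row/order/1")
kv.put("idx/status/pending/order/2", "row/order/2")

# "Index lookup" = one prefix scan over the secondary-index keyspace.
shipped = [pk for _, pk in kv.scan("idx/status/shipped/")]
```

Whether a planner on top of such a store can match a mature relational optimizer is exactly the point zmmmmm and hxtk disagree on; the encoding itself, at least, loses nothing.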

The Nature and Role of the "Readyset" Product

There's confusion and discussion about what Readyset actually is, with some users speculating that it acts as a cache or a proxy, while others clarify its function as a system that materializes query results and keeps them fresh through incremental view maintenance.

  • "flufluflufluffy: I read their website landing page but it’s still kinda confusing — what exactly is readyset? It all sounds like it’s a cache you can set up in front of MySQL/postgres. But then this article is talking about implementing joins which is what the database itself would do, not a cache. But then the blurbs talk about it like it’s a “CDN for your database” that brings your data to the edge. What the heck is it?!"
  • "Sesse__: It seems to be some sort of read-only reimplementation of MySQL/Postgres that can ingest their replication streams and materialize views (for caching). Complete with a really primitive optimizer, if the article is to be believed."
  • "aseipp: ReadySet is basically "incremental view maintenance" but applied to arbitrary SQL queries. It acts like a caching proxy for your database, but it simultaneously ingests the replication log from the system in order to see things happen. Then it uses that information to perform "incremental" updates of data it has cached, so that if you requery something, it is much faster."
  • "LtdJorge: Seems like they are caching MySQL with their own layer built on RocksDB."
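aseipp's description of incremental view maintenance can be sketched as follows. The names are illustrative, not Readyset's internals: a cached aggregate is kept up to date by applying each replication-log event as a delta, so a re-read is a dictionary lookup rather than a query re-execution.

```python
class CountView:
    """A materialized COUNT(*) ... GROUP BY user_id, maintained incrementally."""
    def __init__(self):
        self.counts = {}

    def apply(self, event):
        # A replication-log event: ("insert" | "delete", row).
        op, row = event
        key = row["user_id"]
        delta = 1 if op == "insert" else -1
        self.counts[key] = self.counts.get(key, 0) + delta
        if self.counts[key] == 0:
            del self.counts[key]        # drop empty groups

    def read(self, key):
        # Cache hit: no query re-execution, just a lookup.
        return self.counts.get(key, 0)

view = CountView()
for ev in [("insert", {"user_id": 7}), ("insert", {"user_id": 7}),
           ("insert", {"user_id": 9}), ("delete", {"user_id": 9})]:
    view.apply(ev)
```

Generalizing this from a single `COUNT` to arbitrary SQL (joins, nested aggregates) is the hard part, and is where the dataflow machinery over RocksDB that LtdJorge mentions comes in.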

Developer Practices and the Role of DBAs

A segment of the conversation addresses developer responsibilities regarding database optimization, the perceived inadequacy of traditional DBA roles for application-specific performance tuning, and the value of developers acquiring deep database knowledge.

  • "futevolei: Do the db guys at your company help you optimize queries and table set up at all? Ours basically don’t at all. Their job is to maintain the db apparently and us devs are left to handle this and it seems wrong. I’ve been partitioning tables and creating indexes the past few weeks trying to speed up a view and running explain analyze and throwing the results in Gemini and my queries are still slow af. I had one sql class in college, it’s not my thing. Seems like if dbas would spend a few minutes with me asking about the data and what we are trying to do they could get this guy's results relatively easily. Am I wrong?"
  • "dinobones: I always see these fancy DB engines and data lake blog posts and I am curious… why? At every place I’ve worked at this is a solved problem: Hive+Spark, just keep everything sharded across a ton of machines. It’s cheaper to pay for a Hive cluster that does dumb queries than paying these expensive DB licenses, data engineers building arbitrary indices, etc… just throw compute at the problem, who cares. 1TB of RAM /flash is so cheap these days."
  • "richardw: We didn’t use the DBA’s for this but my last few teams, we got good at DB’s, performance etc. DBA’s were too general and they kept the lights on, but for real performance you should get one or two people who know what they’re doing for your applications. Or learn. I took on juniors who are now fantastic."
  • "ww520: Nope. If they didn't actively work against us, we would thank the lucky stars."
  • "code_runner: If you make it your thing and keep on being good at your other thing, you’re gonna be 90% more valuable than most of your coworkers. I totally lose respect for sr engineers who can’t write sql to find even simple answers to questions. It’s never bad to have another arrow in your quiver"

Discussion Dynamics and Perceptions of LLM Usage

A notable portion of the thread is a meta-discussion about how the initial comment was presented, including accusations that it was LLM-generated and a debate over whether such accusations are valid or fair.

  • "airstrike: Reads like an ad written by an LLM, is my guess. It could just be that they translated from their original language to English and got that as a byproduct. Many such cases."
  • "Sesse__: It also does not add anything interesting to the discussion. Like, why add a bland summary of the article?"
  • "magicalhippo: So you don't have to read the article to figure out if you want to read the article? I for one appreciate such comments, given the guidelines to avoid submission summaries."
  • "nchmy: its literally the author of the article."
  • "OutOfHere: It is completely disingenuous and unfair to claim that something, especially a small blurb, is written by an LLM. And so what if it actually was written by an LLM. If you want to criticize something, do so on the merits or demerits of the points in it. You don't get a free pass by claiming it's LLM output, irrespective of whether it is or not."
  • "airstrike: I'm puzzled by this reply. It's perfectly fine for me to hypothesize on the reason for downvotes in response to someone else asking why it has been downvoted. You're free to opine on the reason for downvotes too. This metacomment, however, is more noise than signal."
  • "OutOfHere: What you had claimed is not even a potential reason in the universe of reasons. It is a demonstration of bias, an excuse to refrain from reason. One line summaries of comprehensible articles can get downvoted because they don't add value beyond what's already very clear from the article."
  • "airstrike: it is objectively a potential reason in the universe of reasons, but you're 100% free to believe whatever you want, even if it's wrong and the fact that multiple people upvoted my comment at a minimum suggests others also believe it to be a possible explanation"
  • "OutOfHere: What happens is that some people routinely use your purported reason "it's LLM generated" as an excuse to try to discredit anything at all, and it's not right, irrespective of whether the material is LLM generated or not. Any material should be critiqued on the basis of its own merits and demerits, irrespective of who or what authored it. We need to shed the pro-human bias."
  • "airstrike: Hard disagree. In fact, I'm very much pro-human and anti-unqualified "we need to..." statements Either way, I didn't even downvote the OP so you're beeping at the wrong human"
  • "OutOfHere: I am pro-truth. Being pro-truth is more pro-human in the long term via indirect effect, than is being pro-human directly. Focusing on being pro-human can reward bad behavior among masses of humans, leading to their ultimate downfall. I will leave it at that."