Essential insights from Hacker News discussions

Event – Fast, In-Process Event Dispatcher

This discussion on Hacker News revolves around a newly presented Go event bus library that claims significant performance improvements over Go's built-in channels for pub/sub scenarios. The core themes emerging are: the nature and limitations of Go channels, potential reasons for the new library's performance gains, and the context and validity of performance benchmarks.

Go Channels: A Double-Edged Sword

A significant portion of the discussion focuses on the perception and actual behavior of Go's channels. While acknowledged as a powerful primitive for concurrent programming, many users express reservations about their suitability for high-throughput pub/sub patterns and their perceived performance cost compared to more specialized implementations.

Some users view channels primarily as a "memory-sharing mechanism" rather than a direct pub/sub tool. As atombender puts it, "I prefer to think of channels as a memory-sharing mechanism. In most cases where you want to send data between concurrent goroutines, channels are a better primitive, as they allow the sender and receiver to safely and concurrently process data without needing explicit locks." However, this perspective is tempered by the observation that channels are not always efficient for pub/sub.

A recurring point is that channels can be "fiendishly hard to debug once you chain more than a couple of them," with atombender warning, "Something forgets to close a channel, and your whole pipeline can stall." Furthermore, the performance overhead of channels is a major concern for those seeking maximum throughput. atombender states, "Channels are also slow, requiring mutex locking even in scenarios where data isn't in need of locking and could just be passed directly between functions." This leads to the sentiment that "Channels should remain a low-level primitive to build pipelines, but they're not what you should use as your main API surface."

This historical context is supported by MathMonkeyMan, who recalls, "I remember hearing (not sure where) that this is a lesson that was learned early on in Go. Channels were the new hotness, so let's use them to do things that were not possible before. But it turned out that Go was better for doing what was already possible before, but more cleanly."

The underlying mechanics of why channels might be slower are also explored. oefrha speculates, "Channel isn't a broadcast mechanism (except when you call close on the channel), so a naive channel-based broadcaster implementation like the one you find in bench/main.go here uses one channel for each subscriber; every event has to be sent on every subscriber channel. Condition variable on the other hand is a native broadcast mechanism." jerf elaborates on this, stating, "Channels allow for many-to-many communication. To a first approximation, you can imagine any decently optimized concurrency primitives as being extremely highly optimized, which means on the flip side that no additional capability, like 'multi-to-multi thread communication', ever comes for free versus something that doesn't offer that capability. The key to high-performance concurrency is to use as little 'concurrency power' as possible."
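To make the contrast concrete, here is a minimal sketch of the naive per-subscriber-channel broadcaster oefrha describes. This is not the benchmark's actual code, and the names are illustrative; the point is that every publish performs one channel send, and therefore one internal lock acquisition, per subscriber.

```go
package main

import "fmt"

// Broadcaster fans events out by keeping one channel per subscriber.
type Broadcaster struct {
	subs []chan int
}

// Subscribe registers a new buffered channel for a consumer.
// (Not goroutine-safe; kept simple for illustration.)
func (b *Broadcaster) Subscribe() <-chan int {
	ch := make(chan int, 128)
	b.subs = append(b.subs, ch)
	return ch
}

// Publish pays the channel-send cost once per subscriber, per event.
func (b *Broadcaster) Publish(ev int) {
	for _, ch := range b.subs {
		ch <- ev // each send takes the channel's internal lock
	}
}

func main() {
	var b Broadcaster
	sub := b.Subscribe()
	go b.Publish(42)
	fmt.Println(<-sub) // 42
}
```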

Claimed Performance Gains and Underlying Mechanisms

The primary motivation for the discussion is the claimed performance improvement of the new event bus over Go channels. The author, kelindar, states that the library "delivers events in 10-40ns, roughly 4-10x faster than the plain channel loop." This significant speed-up naturally sparks curiosity and some skepticism.

zx2c4 expresses a desire to "learn why/how and what the underlying structural differences are that make this possible." tombert echoes this, admitting, "I never thought that the channels were 'slow', so getting 4-10x the speed is pretty impressive. I wonder if it shares any design with LMAX Disruptor..."

singron offers a detailed hypothesis based on a "brief skim" of the code, suggesting the new implementation is "highly optimized for throughput and broadcasts whereas a channel has many other usecases." They explain: "Consumers subscribing to the same event type are placed in a group. There is a single lock for the whole group. When publishing, the lock is taken once and the event is replicated to each consumer's queue. Consumers take the lock and swap their entire queue buffer, which lets them consume up to 128 events per lock/unlock. Since channels each have a lock and only take 1 element at a time, they would require a lot more locking and unlocking."
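A rough sketch of the batching scheme singron infers is below. The type and method names are hypothetical, not taken from the library; the idea is that the publisher locks once per group, and each consumer swaps out its entire buffer so that one lock/unlock is amortized over the whole batch.

```go
package main

import (
	"fmt"
	"sync"
)

type consumer struct {
	queue []int // pending events, guarded by group.mu
}

type group struct {
	mu        sync.Mutex
	consumers []*consumer
}

// Publish takes the group lock once and replicates the event to
// every consumer's queue.
func (g *group) Publish(ev int) {
	g.mu.Lock()
	for _, c := range g.consumers {
		c.queue = append(c.queue, ev)
	}
	g.mu.Unlock()
}

// Drain swaps the consumer's buffer under the lock, then processes the
// whole batch lock-free, amortizing one lock/unlock over many events.
func (g *group) Drain(c *consumer, handle func(int)) {
	g.mu.Lock()
	batch := c.queue
	c.queue = make([]int, 0, cap(batch)) // fresh buffer for new events
	g.mu.Unlock()

	for _, ev := range batch {
		handle(ev)
	}
}

func main() {
	g := &group{}
	c := &consumer{}
	g.consumers = append(g.consumers, c)
	g.Publish(1)
	g.Publish(2)
	g.Drain(c, func(ev int) { fmt.Println(ev) }) // 1, 2
}
```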

oefrha also points towards the implementation leveraging sync.Cond, noting, "It’s a fairly standard broadcaster based on sync.Cond." They suggest that a channel-based approach might achieve similar performance by "leverag[ing] channel close as a broadcast mechanism."
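For reference, the channel-close trick oefrha alludes to looks roughly like this (a sketch, not the library's code): closing a channel wakes every goroutine receiving from it, so a publisher can broadcast by closing a shared channel and installing a fresh one for the next round.

```go
package main

import (
	"fmt"
	"sync"
)

// Signal broadcasts by closing a shared channel; each Broadcast wakes
// all current waiters at once.
type Signal struct {
	mu sync.Mutex
	ch chan struct{}
}

func NewSignal() *Signal {
	return &Signal{ch: make(chan struct{})}
}

// Wait returns a channel that will be closed on the next Broadcast.
func (s *Signal) Wait() <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ch
}

// Broadcast wakes all current waiters by closing the shared channel,
// then installs a fresh one for the next generation of waiters.
func (s *Signal) Broadcast() {
	s.mu.Lock()
	close(s.ch)
	s.ch = make(chan struct{})
	s.mu.Unlock()
}

func main() {
	s := NewSignal()
	done := s.Wait()
	go s.Broadcast()
	<-done // returns once Broadcast closes the channel
	fmt.Println("woken")
}
```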

hinkley touches upon optimizations like double buffering: "Seems that the trick would be detecting if there is a queue building up and dispatching multiple events per lock if so. Double buffering is a common enough solution here. The reader gets one buffer to write to and the writer gets another, and when the read buffer is drained the write buffer is swapped." This amortizes the overhead, a point hinkley ties to Amdahl's Law and Little's Law: "However TCP and this library both seem to be aware that while you cannot eliminate messaging overhead, you can amortize it over dozens and dozens of messages at a time and that reduces that bit of serialized cost by more than an order of magnitude."
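To put illustrative numbers on that amortization: if a lock/unlock round trip costs on the order of 100ns and a drained batch holds 128 events, the serialized locking cost falls below 1ns per event. The figures are hypothetical, not measured from the library, but they show how batching alone can account for an order-of-magnitude gap against one-lock-per-element channel sends.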

Validity and Context of Benchmarks

A critical theme is the interpretation and reliability of performance benchmarks, particularly when dealing with claims of order-of-magnitude improvements. Some users are quick to question the conditions under which these benchmarks were run and their applicability to real-world scenarios.

st3fan expresses a common concern about benchmark purity: "'Processes millions of events per second' - yes, sure, when there is nothing to process. But that is not representative of a real app. Add a database call or some simple data processing and then show some numbers comparing between channels or throughput. I hate these kinds of claims. Similar with web frameworks that show reqs/s for an empty method."

catlifeonmars elaborates on why adding I/O would invalidate the comparison: "if you add a database call the I/O blocking time will completely eclipse CPU time. It would not be a useful comparison, similar to if you added a time.Sleep to each event handler."
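A quick benchmark sketch illustrates that point (the names are invented, and the 1ms sleep is a hypothetical stand-in for a database round trip): whether dispatch costs 10ns or 100ns, the handler's millisecond of blocking dominates everything the benchmark measures.

```go
// bus_test.go - run with: go test -bench=.
package bus

import (
	"testing"
	"time"
)

func handleWithIO(ev int) {
	time.Sleep(time.Millisecond) // stand-in for a database call
}

func BenchmarkDispatchPlusIO(b *testing.B) {
	ch := make(chan int, 128)
	done := make(chan struct{})
	go func() {
		for ev := range ch {
			handleWithIO(ev)
		}
		close(done)
	}()
	for i := 0; i < b.N; i++ {
		ch <- i // the ~100ns send is noise next to the 1ms handler
	}
	close(ch)
	<-done
}
```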

However, hinkley offers a counterpoint, suggesting that the benchmark might still be indicative of performance even with processing: "The reviews by some other people here lead me to believe that it works fairly well both when there is something to process and when it's just chattiness."

The specific use case for such high performance is also discussed. absolute_unit22, who might try the library in production, states, "I don’t really need the insane performance benefits as I don’t have much traffic lol - but I always like experimenting with new open source libraries - especially while the site isn’t very large yet."

scripturial champions the relevance of such microbenchmarks for specific application types: "How else do you compare “web frameworks” except for comparing their overhead? Not everyone wants to write a database application. There are absolutely other types of applications in the world. Applications can be CPU and/or memory bound."

Finally, there's a call for better documentation and comparison in the library's README. minaguib requests, "OP: the readme could really benefit from a section describing the underlying methodology, and comparing it to other approaches (Go channels, LMAX, etc...)"