Essential insights from Hacker News discussions

UTF-8 is a brilliant design

This Hacker News discussion covers the design and history of character encodings, primarily UTF-8 and its relationship with UTF-16, as well as the broader implications for software development and internationalization.

The Ingenuity and Limitations of UTF-8

A significant portion of the discussion revolves around the brilliant design of UTF-8, particularly its backwards compatibility with ASCII, its efficiency for Western languages, and its self-synchronizing properties. However, users also highlight certain limitations and historical compromises made in its design.

  • The design is lauded for its elegance and efficiency: "UTF-8 is simply genius. It entirely obviated the need for clunky 2-byte encodings (and all the associated nonsense about byte order marks)."
  • Its self-synchronizing nature is a key technical advantage: "Having the continuation bytes always start with the bits 10 also make it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character." A sketch of this boundary scan appears after this list.
  • A notable aspect is how it leverages the unused bits in ASCII: "...it's worth noting that we didn't always have 8 general purpose bits. 7 bits + 1 parity bit or flag bit or something else was really common... In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte..."
  • However, some limitations are pointed out, chiefly the cap at 21 bits of code space, accepted for compatibility with UTF-16: "It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code units up to 2^21-1."
  • The existence of redundant "overlong" encodings, which strict decoders must reject rather than put to use, is seen by some as a long-term flaw: "A true flaw of UTF-8 in the long run. They should have biased the values of multibyte sequences to remove redundant encodings." The sketch below also shows an overlong form being rejected.
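
To make the self-synchronization and overlong points concrete, here is a minimal Python sketch (the helper names and test strings are illustrative, not from the thread): it scans backwards past 10xxxxxx continuation bytes to find a character boundary, reads the sequence length from the lead byte's high bits, and shows a strict decoder rejecting an overlong two-byte encoding of "/".

    def is_continuation(byte: int) -> bool:
        # Continuation bytes always match the bit pattern 10xxxxxx.
        return byte & 0b1100_0000 == 0b1000_0000

    def start_of_char(data: bytes, i: int) -> int:
        # From an arbitrary byte offset, scan backwards past continuation
        # bytes to the lead byte of the character containing offset i.
        while i > 0 and is_continuation(data[i]):
            i -= 1
        return i

    def sequence_length(lead: int) -> int:
        # The lead byte's high bits encode the total sequence length.
        # (Only meaningful for lead bytes, not continuation bytes.)
        if lead < 0b1000_0000: return 1   # 0xxxxxxx: ASCII
        if lead < 0b1110_0000: return 2   # 110xxxxx
        if lead < 0b1111_0000: return 3   # 1110xxxx
        return 4                          # 11110xxx

    data = "héllo→🌍".encode("utf-8")
    i = start_of_char(data, 7)            # land mid-character on purpose
    n = sequence_length(data[i])
    print(data[i:i + n].decode("utf-8"))  # "→", recovered from offset 7

    # Overlong check: b"\xc0\xaf" would encode "/" in two bytes instead
    # of one; strict decoders must reject it (a classic security issue).
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError as exc:
        print("rejected overlong form:", exc.reason)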

The Case Against UTF-16 and the "16-bit Delusion"

UTF-16 receives considerable criticism throughout the discussion; commenters deride it as an "abortion" and a product of an early misunderstanding of Unicode's scope.

  • The core issue with UTF-16 is its surrogate mechanism and how it complicates handling characters outside the Basic Multilingual Plane (BMP): "The vast majority of bugs I remember having to fix that were directly related to encoding were related to surrogate pairs." A sketch of the surrogate arithmetic follows this list.
  • This design is often attributed to an initial underestimation of how many characters Unicode would need: "The real problem was Unicode's first version getting released at a critical time and thus its 16-bit delusion ending up baked into a bunch of important software."
  • Users contrast UTF-16's complexity with UTF-8's transparent handling of ASCII and its more manageable multi-byte sequences.
  • Despite its flaws, UTF-16's persistence is noted due to the widespread adoption by key operating systems and programming languages: "This won’t happen soon, of course, given JavaScript and Windows, but the situation might be different in a hundred or thousand years."
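
As a concrete illustration of why surrogate pairs invite bugs, here is a minimal Python sketch of the pair arithmetic (the function name is illustrative): a supplementary-plane code point is split across two 16-bit code units. The 20 payload bits on top of U+10000 are also why UTF-16 compatibility caps Unicode at U+10FFFF, the roughly 21-bit ceiling mentioned earlier.

    def to_surrogate_pair(cp: int) -> tuple[int, int]:
        # Only code points beyond the BMP (U+10000..U+10FFFF) are split;
        # BMP code points occupy a single 16-bit unit.
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000                       # 20 payload bits remain
        return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

    high, low = to_surrogate_pair(ord("🌍"))   # U+1F30D
    print(hex(high), hex(low))                 # 0xd83c 0xdf0d

    # The bug-prone consequence: code that equates one 16-bit unit with
    # one character silently breaks for non-BMP text (e.g. "🌍".length
    # is 2 in JavaScript). Python confirms the code-unit count:
    print(len("🌍".encode("utf-16-le")) // 2)  # 2 code units, 1 character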

The Complexity and Challenges of Unicode and Character Encoding Evolution

The discussion highlights the immense complexity underlying Unicode and the ongoing challenges in its implementation and evolution, including issues like CJK unification, compatibility with legacy encodings, and the practical difficulties of string manipulation.

  • CJK Unification: A major pain point discussed is "Han unification," in which characters from Chinese, Japanese, and Korean are mapped to the same code points, leading to rendering issues and difficulties in cross-language documents: "One pain point of unicode is CJK unification."
  • Legacy Encodings: The persistence of legacy encodings like Shift-JIS is noted, along with the headaches they still cause: "Mojibake is not as common as it once was, but still a problem." Size is part of why they persist: "When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two)." A short mojibake demonstration follows this list.
  • String Manipulation and Indexing: The ability to index into strings and the implications for performance and complexity are debated. Some argue that O(1) indexing is an outdated assumption, while others highlight the need for efficient operations with variable-length encodings. "Variable width encodings like UTF-8 and UTF-16 cannot be indexed in O(1), only in O(N). But this is not really a problem! Instead of indexing strings we need to slice them, and generally we read them forwards, so if slices (and slices of slices) are cheap, then you can parse textual data without a problem." The slicing approach is sketched after this list.
  • Historical Context and Design Decisions: The design of ASCII and its 7-bit nature are explored, with users debating whether this was historical luck, a deliberate choice for extensibility, or a consequence of the era's hardware: "ASCII was kept to 7 bits primarily so 'extended ASCII' sets could exist, with additional characters for various purposes (such as other languages, but also for things like mathematical symbols)." Others consider the counterfactual: "The use of 8-bit extensions of ASCII (like the ISO 8859-x family) was ubiquitous for a few decades, and arguably still is to some extent on Windows (the standard Windows code pages). If ASCII had been 8-bit from the start, but with the most common characters all within the first 128 integers, which would seem likely as a design, then UTF-8 would still have worked out pretty well." The early days of computing, with 36-bit words and different character-packing schemes, are also referenced.
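
Mojibake itself is easy to reproduce. In this minimal Python sketch (the sample string and the Latin-1 pairing are illustrative; the same effect occurs with Shift-JIS and other legacy encodings), UTF-8 bytes are misinterpreted as a single-byte encoding, the classic mismatch behind garbled text:

    # Encode as UTF-8, then decode with the wrong (single-byte) encoding.
    garbled = "café".encode("utf-8").decode("latin-1")
    print(garbled)                    # "cafÃ©": each UTF-8 byte was read
                                      # as its own Latin-1 character.

    # Reversible here only because Latin-1 maps all 256 byte values 1:1.
    print(garbled.encode("latin-1").decode("utf-8"))   # "café" again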
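
The indexing-versus-slicing point also lends itself to a short demonstration. The following Python sketch (variable and function names are illustrative) shows that finding the Nth character in UTF-8 takes a linear scan, while forward, slice-based processing never needs character indices at all:

    text = "naïve 🌍 café".encode("utf-8")

    def nth_char_offset(data: bytes, n: int) -> int:
        # O(N): count lead bytes (anything not 10xxxxxx) until the nth.
        seen = 0
        for i, b in enumerate(data):
            if b & 0b1100_0000 != 0b1000_0000:  # start of a character
                if seen == n:
                    return i
                seen += 1
        raise IndexError(n)

    print(nth_char_offset(text, 6))   # byte offset of the 7th character

    # Forward processing sidesteps indexing entirely: split into cheap
    # subslices and read front to back, as the quoted comment suggests.
    for field in text.split(b" "):
        print(field.decode("utf-8"))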

Backwards Compatibility: A Double-Edged Sword

The initial comment sets the stage for a complex relationship with backwards compatibility, which is a recurring theme throughout the discussion.

  • While the "mess" of maintaining backwards compatibility is acknowledged and disliked, the benefits of advancement are appreciated. "I have a love-hate relationship with backwards compatibility. I hate the mess - I love when an entity in a position of power is willing to break things in the name of advancement."
  • UTF-8's success is heavily attributed to its ASCII compatibility, which facilitated adoption. Conversely, UTF-16's issues are often linked to its attempt to accommodate an evolving Unicode standard while maintaining compatibility with early, 16-bit assumptions.
  • The broader topic of breaking changes in software is brought up, with Python's minor versioning and perceived disregard for semantic versioning being cited as an example of "breaking things in the name of advancement" that can be frustrating. "Or minor versions of python... Honestly python is probably one of the worst offender in this as they combine happily making breaking changes for low value rearranging of deck chairs with a dynamic language where you might only find out in runtime."