Essential insights from Hacker News discussions

Writing a C compiler in 500 lines of Python (2023)

The Hacker News discussion revolves around the feasibility and challenges of writing a Python compiler in a limited amount of C code, touching upon language design, compiler implementation strategies, and the role of libraries.

Feasibility of a 500-Line Python Compiler in C

A central theme is the ambitious nature of implementing a Python compiler within a strict 500-line C code limit. While some express skepticism due to Python's complexity, others believe it might be achievable with careful design and by targeting older Python versions.

  • "A python VM that consumes bytecode might be doable in not-ludicrous-amounts of C. Not 500 lines I suppose. But something manageable I think?" - "wyldfire"
  • "500 lines is aggressive, but IOCCC gives me plenty of tricks to get the line count down and the language isn't that big." - "bluGill"
  • "Not to be that guy, but Python is an interpreted language. That said, I guess technically you could make something that compiles python to an executable? This is hacker news after all" - "TZubiri"
  • "Maybe 500 lines of Pythonic, macro-heavy C. If the macros' LOC don't count. Maybe." - "nickpsecurity"

Data Structures and Performance Trade-offs

The choice of data structures in C, particularly for implementing Python's dictionary type, is a significant point of discussion. Users debate whether a linear search or a hash table is more appropriate within the line limit, considering potential performance implications and cache friendliness.

  • "My dictionaries would be a linked-list, looking for a key becomes a linear search... (if you gave me C++ I'd use std::map)" - "bluGill"
  • "A hash table in C is about 30 lines of code, so I don't think you have to stick to linked lists for dictionaries." - "dekhn"
  • "Indeed, a decent closed hash table is maybe 30 lines. An open hash table with linear probing is even less, especially if you don't need to remove entries. It's almost identical to a linear search through an array; you just change where you start iterating." - "ludocode"
  • "Sure but a linear search is 5. when my limit is 500 lines of C I don't dare spare those lines." - "bluGill"
  • "a linear search may be faster because it is cache and branch prediction frienly. Benchmarks on real world data is needed to make a final call." - "bluGill"

The Importance of Python's Standard Library

A crucial aspect highlighted is that Python's extensive and powerful standard library is a major contributor to its usefulness. The discussion acknowledges that a 500-line C compiler would likely be severely limited without access to these libraries, many of which are themselves implemented in C for performance.

  • "Note that most of what makes python great isn't the language, it is the library. I believe that large parts of the python library are also written in C (for speed), and thus you won't be able to use my 500 line python compiler for anything useful because you won't have any useful libraries." - "bluGill"

The Nature of "Compilers" and Wrapper Scripts

The thread also touches on what constitutes a "compiler" versus a "wrapper script." One user's two-line solution to compile Python code using subprocess to call gcc is met with the clarification that this is merely a wrapper, sparking a discussion about the historical evolution of compilers like cc and gcc, which themselves often leverage underlying tools.

  • "I wrote one in 2 lines: import sys, subprocess subprocess.run(["gcc", sys.argv[1], "-o", "a.out"])" - "tvickery"
  • "Now do it without imports." - "rossant"
  • "That is not a compiler. That is called a wrapper script. But funny none the less." - "emilbratt"
  • "The original cc was just a wrapper like this Python example around a bunch of external programs, calling c00, c01, until something could be fed to as and then linked using ld. GCC does basically the same thing even today," - "amszmidt"

Compiler Design: Single-Pass vs. Multi-Pass and ASTs

The discussion veers into compiler design principles, specifically comparing single-pass compilers with traditional lexer-parser-AST-emitter pipelines. Users ponder whether a single-pass approach is inherently simpler and the trade-offs involved, such as the difficulty of optimization without an Abstract Syntax Tree (AST).

  • "I find it surprising that a single-pass compiler is easier to implement than a traditional lexer->parser->AST->emitter. (I'm not a compiler expert, though.) I'd have expected that generating an AST would be at least as simple, if not simpler." - "MarsIronPI"
  • "Not sure if fewer LoC necessarily implies easier!" - "arjvik"
  • "A single-pass compiler is easier to implement in part because you're not going to do any of that optimization. You're writing a single-pass compiler either because you're banging out a quick sketch of an idea, and you don't care about production use, or because you've time-traveled back to the '70s or the '80s..." - "marssaxman"

Language Choice for Compiler Implementation

The suitability of Python for writing compilers is also debated, with comparisons drawn to languages like ML (OCaml) which are often considered more idiomatic and less verbose for such tasks, particularly regarding data structure definition and pattern matching.

  • "Historically, at least, it's pretty verbose to define a data type in Python compared to languages that are more designed for writing compilers. Consider these definitions from my prototype Bicicleta interpreter, which is written in ML, specifically OCaml:" - "kragen"
  • "Python does have GC and dynamic dispatch, though, and those count for a lot." - "kragen"
  • Quoting Andy Chu from https://andychu.net/projects/: "Python is not the right language for [implementing] languages. I will use OCaml for subsequent projects like this." - "kragen"

The Link Between Compilers and Linguistics/Formal Grammars

An interesting tangent explores the connection between compiler construction and linguistics, particularly the role of formal grammars and the Chomsky hierarchy in defining programming language structures. This connection is observed to be similar to aspects of DNA/genome analysis.

  • "this article breaks it down well enough to make me feel like I could write my own C compiler targeting AVR. (I probably could... but it would not be easy.) Never actually looked into how compilers work before, it's surprisingly similar/related to linguistics." - "Liftyee"
  • "It's b/c when Chomsky invented the theory of formal grammars he was studying natural languages & the universality of abstract grammar¹. Computer scientists realized later that they could use the same theory as a foundation for formalizing the grammatical structures of programming languages." - "measurablefunc"
  • Similar experience in DNA/genome analysis. A large part of DNA analysis was based on parser theory." - "dekhn"