This discussion on Hacker News explores the capabilities and limitations of `awk`, particularly in comparison to other tools and languages for data manipulation, with a significant focus on handling JSON.
The Utility and Limitations of awk
A central theme is the appreciation for `awk`'s power and elegance, but also a strong sentiment that its feature set is often lacking for more complex tasks. Users acknowledge its strengths for "short one liners" and parsing delimited data, but point out that as soon as tasks become more involved, `awk`'s limitations become apparent.
- Users highlight specific missing features that hinder its utility: "Like: printing all but one column somewhere in the middle. It turns into long, long commands that really pull away from the spirit of fast fabrication unix experimentation," according to chaps.
- The "extra space in the middle" issue when deleting a field is a common sticking point, with workarounds like
sed 's/ / /g'
or the cleverawk '{$1=$1};1'
alias being shared. - The divergence in features across different
awk
implementations (mawk, gawk, nawk) is also noted as a complication: "...the several awks in existence have diverging sets of additional features," states mauvehaus.
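A minimal sketch of the issue and the workaround described above (illustrative; the exact commands in the thread may differ):

```sh
# Blanking a middle field leaves a doubled separator behind:
printf 'one two three four five\n' | awk '{$3=""; print}'
# -> "one two  four five"   (extra space where "three" was)

# Piping through the '{$1=$1};1' idiom re-splits on runs of whitespace and
# rebuilds the record with single spaces:
printf 'one two three four five\n' | awk '{$3=""; print}' | awk '{$1=$1};1'
# -> "one two four five"
```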
The Quest for More Robust Tooling: Perl, Python, and Shell Alternatives
The conversation frequently turns to alternative tools and languages when `awk`'s inherent limitations are encountered, with Perl and Python being prominent suggestions.
- Perl is repeatedly invoked as a more capable alternative, with SoftTalker simply stating, "Whence perl." jcynix adds, "That's why I use Perl instead (besides some short one liners in awk, which in some cases are even shorter than the Perl version) and do my JSON parsing in Perl."
- The ubiquity of pre-installed languages is discussed. saghm suspects that "the awk script linked to here was picked more for its ubiquity than elegance," suggesting that this is why tools like Perl might be favored. However, chaps counters that Perl's ubiquity isn't always guaranteed and can lead to infrastructure issues if not universally present.
- Python is also mentioned as a common alternative, though chaps notes that "IME, python is much, much more universally installed on the hosts I've worked on." The anecdote about a PR being rejected for using a bash builtin because "python is more likely to be installed than bash" illustrates the sometimes humorous debates around tool selection.
- jcynix compares Perl and `awk` for file comparison tasks, noting that "the Perl one-liner would be conceptually identical, the same length, but more performant (no calling out to rm): `diff -rs a/ b/ | perl -ane '/identical$/ && unlink $F[3]'`" (see the sketch after this list).
- 8n4vidtmkvmk prefers Perl over `sed` due to PCRE support, stating, "I've been using perl instead of sed because PCRE is just better and it's the same regex that PHP uses which I've been coding in for nearly 20 years."
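For context, a hedged sketch of the comparison jcynix draws: an `awk` version of the same cleanup has to shell out to `rm` for each matching file, whereas Perl's `unlink` stays in-process. The `awk` line is illustrative (not quoted from the thread) and assumes filenames without spaces:

```sh
# Delete files that `diff -rs` reports as identical between two trees.
diff -rs a/ b/ | awk '/identical$/ { system("rm " $4) }'     # spawns one rm per matching file
diff -rs a/ b/ | perl -ane '/identical$/ && unlink $F[3]'    # jcynix's one-liner: no rm needed
```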
The Challenges of Parsing JSON in the Shell Environment
A significant portion of the discussion is dedicated to the difficulties of parsing JSON within traditional Unix shell environments, including `awk`.
- chubot identifies the core problem: "JSON is not a friendly format to the Unix shell -- it's hierarchical, and cannot be reasonably split on any character." This leads to the conclusion that "shell is definitely too weak to parse JSON!"
- The argument is made that `awk` is also unsuitable for JSON parsing: "I'd argue that Awk is ALSO too weak to parse JSON," with the lack of support for arbitrarily nested data structures being the primary reason. comex elaborates, "Awk has a grand total of 2 data types: strings, and associative arrays mapping strings to strings. There is no support for arbitrarily nested data structures."
- The reliance on external tools like `jq` is acknowledged as a common and often effective solution (see the sketch after this list). packetlost states, "the ecosystem of tools is just fairly immature as most of the shells common tools predate JSON by at least a decade. `jq` being a pretty reasonable addition to the standard set of tools included in environments by default."
- The inherent weakness of standard shell data types (strings and arrays) for handling hierarchical data is reiterated by comex, who notes that "bash and zsh have a limited number of data types (the same two as `awk` plus non-associative arrays)."
- The Oilshell project (OSH and YSH) is presented as a potential solution, with built-in JSON support and more capable data structures. The example shows `var d = { date: $(date --iso-8601) }` and `json write (d)`.
- The quality of error handling in JSON parsing is brought up, with chubot noting, "If you don't reject invalid input, you're not really parsing."
- packetlost offers a different perspective on the "too weak" argument, suggesting that "the real problem is that JSON doesn't work very well as a [...] because its core abstraction is objects. It's a pain to deal with in pretty much every statically typed non-object oriented language unless you parse it into native, predefined data structures."
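A small illustration (mine, not from the thread) of why `jq` fills this gap: it addresses nested fields directly and, per chubot's point about real parsing, rejects malformed input with a non-zero exit status:

```sh
# Extract a nested value, something awk's flat string/record model cannot
# reasonably do without hand-rolling a JSON parser:
echo '{"user": {"name": "alice", "logins": [3, 7]}}' | jq -r '.user.name'
# -> alice

# Invalid JSON is rejected rather than silently mangled:
echo '{"user": ' | jq -r '.user.name'
echo "exit status: $?"    # non-zero: jq refused to parse the broken input
```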
Reinventing Shell Tooling: The Case for OSH/YSH
The discussion pivots towards the development of new shell languages that aim to address the shortcomings of traditional shells, particularly in handling complex data formats like JSON.
- The Oilshell project is highlighted as an effort to address these limitations. chubot states, "One reason I started https://oils.pub is because I saw that bash completion scripts try to parse bash in bash, which is an even worse idea than trying to parse JSON in bash."
- OSH and YSH are presented as having native JSON support and the data structures needed to offer APIs similar to those of Python and JavaScript.
- The performance improvements of OSH are noted by alganet: "According to my tests, this is true. Congratulations!" in response to OSH being faster than bash.
- The fundamental power of YSH is attributed to garbage-collected data structures, with a link provided to a blog post about GC.
- comex, while acknowledging the limitations of `awk` and bash for JSON, also points out the potential for highly optimized shell builtins: "If you stick to shell builtins, your script will run much faster... However, this also rules out `jq`, and neither shell has any remotely comparable builtin." The question is posed: "But you might reasonably object that if I care about speed, I would be better off using a real programming language!" (A brief illustration of the builtins point follows below.)
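A hedged illustration of comex's point (my example, not one from the thread): staying inside the shell avoids a fork/exec per operation, which is where most of the speed difference comes from:

```sh
line='one two three four five'

# Builtin parameter expansion edits the string in-process, with no fork/exec.
echo "${line// three/}"              # -> one two four five

# The equivalent edit via an external tool spawns a sed process each time,
# which dominates the cost inside a tight loop.
echo "$line" | sed 's/ three//'      # -> one two four five
```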
The Enduring Appeal of Classic Unix Tools
Despite the emergence of more powerful languages and the complexities of modern data formats, there's an underlying appreciation for the simplicity and ubiquity of classic Unix tools like `cut`.
- toddm reminisces about using `cut` for column manipulation: `cut -d " " -f1-2,4-5 file.txt`, where file.txt is `one two three four five` and the return is `one two four five`. This showcases a more direct and potentially less error-prone method for simple columnar tasks than some of the `awk` workarounds; a runnable version appears below. The quote "Us old UNIX guys would likely go for cut for this sort of task" captures this sentiment.
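A runnable restatement of that example, contrasted with the `awk` workaround discussed earlier (the comparison is mine, assembled from points already made in the thread):

```sh
printf 'one two three four five\n' > file.txt

# toddm's cut approach: select exactly the columns you want.
cut -d ' ' -f1-2,4-5 file.txt                       # -> one two four five

# The awk route needs a second pass to squeeze the extra space left behind
# when a field is blanked out.
awk '{$3=""; print}' file.txt | awk '{$1=$1};1'     # -> one two four five
```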