(2023-04-29) AWK is underrated, even in the POSIX variant
---------------------------------------------------------
I want to take a little break from writing new stuff. But still, there is one
thing that bothers me a lot. Whenever I search for any information on how to
do this or that with AWK, especially on StackOverflow-like forums, I
constantly stumble upon "solutions" using Bash, Coreutils, sed, Perl and even
Python or Ruby. Anything but AWK, which is what the question authors asked
about in the first place. I don't
know, maybe forum know-it-alls think it's a kind of "XY problem" (which 
bears a bag of bullshit on its own, but that's another topic) and whoever 
asked the question chose the wrong tool for the job and the tool they offer 
is better and so on, but damn! I'm fluent in Bash, Dash, Python 3.6+, JS 
(from ES3 to ES6 and whatever was next), C89 and VTL-2, and as such, I have 
a lot of options to choose from when writing new stuff, but I want to get 
fluent in AWK as well. So, if I (hypothetically) ask about how to do 
something in AWK, I want an answer about AWK, not about Bash or Python which 
I already can write just about everything in, or about Perl which honestly 
must already die. The know-it-alls can't even consider the situation where
someone is left with Busybox and nothing else, and wants to learn how to
solve problems with AWK alone for that reason (it is the only proper
programming language available on some systems, and Busybox sed is much
more limited compared to GNU sed too), not because they don't know Perl or
whatever.

This is why I have given up on trying to find answers on forums and turned to
the sole point of authority: POSIX.1-2017, 2018 edition ([1]). It has some 
external links (e.g. for printf/sprintf format specifiers ([2]) or for 
extended regular expressions format ([3])) but this is where everything 
becomes crystal clear in terms of features we can use: anything not in there 
is some non-standard extension. Compared to the real-life AWK versions I'm
using right now (Busybox and GAWK), the standard lacks bitwise operations
but, to be honest, they are not necessary everywhere and can be emulated
with normal integer arithmetic if required, although it would definitely be
slower. To make sure you're on the safe side (mostly), GAWK even has a 
--posix (or -P) flag to turn on the POSIX compatibility mode. I say "mostly" 
because no matter which options you set, different implementations handle 
null bytes in strings differently, and POSIX states the behavior is 
undefined in this case, so no one is to blame. For instance, in Busybox, you 
can't have null bytes inside any string as they automatically truncate its 
contents, while in GAWK they are handled normally even if you don't 
explicitly pass the -b flag (treat all characters as raw bytes regardless of 
locale). The POSIX specification is also missing GAWK's epic TCP/UDP socket 
pseudo-filenames (starting with /inet) and bidirectional process 
communication operator (|&). Yet, despite all this, I consider even the 
standard AWK criminally underrated.
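
As a rough illustration of the emulation mentioned above, here is a minimal
sketch of bitwise AND, OR and XOR built on nothing but standard arithmetic.
The bitop name and its shell wrapper are made up for this example, and it
assumes non-negative integers small enough to be stored exactly as
floating-point numbers (below 2^53).

```shell
# Hypothetical helper: bitop OP A B, where OP is one of: and, or, xor.
bitop() {
  awk -v OPNAME="$1" -v LHS="$2" -v RHS="$3" '
  # r, bit, x and y are locals, declared as extra function parameters
  function bitop(op, a, b,    r, bit, x, y) {
    r = 0; bit = 1
    while (a > 0 || b > 0) {
      x = a % 2; y = b % 2              # lowest bit of each operand
      if (op == "and" && x + y == 2) r += bit
      else if (op == "or" && x + y >= 1) r += bit
      else if (op == "xor" && x + y == 1) r += bit
      a = int(a / 2); b = int(b / 2)    # shift both operands right by one
      bit *= 2                          # weight of the next result bit
    }
    return r
  }
  BEGIN { print bitop(OPNAME, LHS, RHS) }'
}

bitop and 12 10   # prints 8
bitop or  12 10   # prints 14
bitop xor 12 10   # prints 6
```

It walks the operands one bit at a time, so it is much slower than a native
operator, which is the price mentioned above.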

Why? Well, think about how much programming around us really boils down to
processing text in one way or another. Rendering templates, parsing logs, 
scraping web pages, collecting reports, emulating terminals, marshalling 
objects between client and server, most popular client-server protocols and 
APIs themselves... Not to mention how much smaller Bopher-NG could become if
rewritten in AWK, but first, it couldn't be called Bopher anymore, and
second, I don't have time for this effort for now. But you get the idea,
right? Any task involving text where using C is too tedious is a job for AWK
with its record- and field-oriented engine with extended regular expressions 
available out of the box. And, if you really need it, basic math is already 
there too, up to square roots, logarithms, sines, cosines and arctangents, 
as well as your basic built-in PRNG with rand() and srand(). I don't really 
know what kept them from adding bitwise operations to the standard but it's
already pretty functional for such a tiny package (and I already mentioned
that even Busybox AWK, which has them, is just under 3K SLOC long). Of
course, this tininess comes at the cost of some convenience: no way of
explicitly declaring variables as local (only implicitly, as unused function
parameters), 1-based string indexing (as opposed to C-like languages where
0-based indexing is commonplace), no multi-assignment in the initializing
clause of for loops (Busybox supports it, but even GAWK doesn't),
a single format for numbers (stored as floating-point, even when explicitly 
cast to integers with int()), a single format for arrays (strictly 
associative and all keys are cast to strings), but all these are minor 
quirks compared to what this language is really capable of.
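
To make a few of these quirks concrete, here is a small sketch. The
awk_quirks wrapper and the count_chars function are hypothetical names
invented purely for this illustration; everything inside the AWK program
should be plain POSIX.

```shell
# Hypothetical wrapper that runs a tiny POSIX AWK program showing the quirks.
awk_quirks() {
  awk '
  # i and n are never passed by the caller, so they act as local variables
  function count_chars(s, c,    i, n) {
    n = 0
    for (i = 1; i <= length(s); i++)     # string positions start at 1
      if (substr(s, i, 1) == c) n++
    return n
  }
  BEGIN {
    i = 42                               # global i, untouched by the function
    print count_chars("banana", "a")     # 3
    print i                              # 42
    print index("banana", "n")           # 3, not 2 as in 0-based languages
    arr[1] = "x"                         # the numeric key is stored as "1"
    if ("1" in arr) print "same key"
    print int(7.9) / 2                   # 3.5: int() still yields a float
  }'
}

awk_quirks
```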

Another thing I'd like to mention is that the AWK specification, while
getting some minor updates to clarify things from time to time, has stayed
like this for a good 35 years or so, and this means as long as you adhere to
POSIX, your programs will run on some ancient systems just as successfully 
as on the current ones. Yes, you may struggle to replicate the behavior of 
old C compilers and runtime libraries, you may find incompatibilities across 
various versions of Perl (not even to mention Bash, Lua and Python), you 
might have issues with compiling J2ME or other old Java 2/3 code on OpenJDK 
higher than 8 or running REXX on anything modern non-IBM, you can find your 
entire JS code not working on KaiOS 2.x because of some ES6 feature not yet 
present in Gecko 48 back then... but as long as you have an AWK there and an
AWK here and you're not using any non-standard extensions or null bytes in
your strings, you can be sure your program will be fully portable to any
standard-compatible implementation from 35 years ago and probably 35 years
from now. And this is probably where the lack of big-market interest is
even somewhat good: no one is going to try to shove in fancy useless 
"features" like OOP, template-based programming, decorators and other BS 
that breaks all compatibility and makes the codebase even slower and much 
bulkier.

And, as a good example of "don't try to fix what's not broken", AWK is
definitely worth learning and using as an everyday tool.

--- Luxferre ---

[1]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/utilities/awk.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap05.html#tag_05
[3]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap09.html#tag_09_04