(2023-04-19) Human-scale programming languages and the problem with them
------------------------------------------------------------------------
I started writing this post while looking at the source code of Equi ([1]),
probably my most ambitious stack-based VM to date that isn't _fully_
esoteric, as it allows writing compact but still human-readable machine code
that ideally would even work on an Apple IIe. I remember I promised to write
more about this VM, but not today. Today, I just want to mention the equi.c 
file now has 716 SLOC of pure ANSI C89 code. Is this a lot? Well, compared 
to most modern programming language implementations, it might not seem a lot 
(even busybox's awk.c is currently about 2900 SLOC), but 700 SLOC already is 
around the upper limit of my comprehension. Of course I understand this code 
because I wrote and tested it, and I hope everyone else will understand it 
because it is well-structured and well-commented, but still, it's just too 
much. And the sad truth is, nothing else can be taken away from there 
without sacrificing either compatibility or usability. I don't want this VM 
to grow in size, but the only realistic way to further shrink it would be 
dropping multitasking support and returning to the old(-ish) memory 
structure which I spent so much time moving away from. And yet again, this
would reduce the codebase by around 100 SLOC at most and wouldn't 
fundamentally change the overall picture of things.

From this point of view, it's interesting to analyze various programming
languages and their particular dialects or implementations that are usually 
presented as "minimal". For instance, regardless of how small Lua, Red, 
Boron and MicroPython are, I wouldn't consider them "minimal" because their
codebases are still huge. As I have already mentioned, busybox awk doesn't
look minimal either. Well, what does? There seem to be just three major 
language families that do not belong to esoteric or narrow-spec (like dc) 
classes that, although not at all small in their canonical implementations, 
_can_ have really minimal flavors: Forth, Lisp and Tcl. I say "families" 
because the implementations themselves may be so different one couldn't 
recognize the original concept in them. For instance, both MINT and my Equi 
are Forth-like although neither of them fully qualifies as a variant of 
Forth. Apart from these three families, there also are some long-forgotten 
specimens like Tiny BASIC (the world's first piece of software to popularize
the word "copyleft", by the way) and VTL/VTL-2, with the canonical
implementation of the latter being famous for being able to fit into 768 
_bytes_ of Altair's ROM. And, as some advanced versions of this language are 
still being developed for 6502- and Z80-based machines, with the latest 
Apple IIe compatible variant (VTLC02) having 644 SLOC of **assembly**
language and fitting into 962 bytes of machine code, this continues to be a 
textbook example of true programming minimalism.

Now, here is the main problem that is true for all minimal programming
languages to a greater or lesser extent: the simpler you make the core
interpreter, the harder you make programming in it. As someone said, 
complexity has to live somewhere. If it's not in your interpreter, then it 
either resides at the lower level (OS, VM or even the Web browser runtime if 
your language targets such an environment), or the upper level (the standard 
library, as is usually the case with Forths and Lisps), or you put all
the extra burden on the programmers themselves and every one of them has to 
reinvent the same wheels. To me, the main challenge in picking or even 
designing such a language would not be in moving the complexity around but 
eliminating it altogether. How, you may ask? After all, it's the tools we
can change, not the tasks we must do with them... Well, here are three 
recommendations I could give about complexity reduction.

1. Adjust your requirements. This is much easier to do if it happens before
even picking the tools. Think along the lines of "Do I, or the tools I choose,
really need to be able to do X in order for me to do Y?" Don't be afraid to
cut off unnecessary requirements with Occam's razor.

2. Decompose your tasks into a set of smaller ones and only pick the tools
necessary to do each part, not anything extra. A good example would be 
typesetting software: you could use all-in-one packages like Kile, Scribus
or even some proprietary monstrosities from Adobe whose names I have already
forgotten, or you could use something modular like troff with eqn, pic, bib,
dformat etc., but only the parts you actually need. If you only need to use 
formulas in your documents, you don't need anything except eqn with 
troff/groff/nroff. If you also need images, you add in pic, and so on. 
Although they both perform the same tasks, guess which approach is less 
complex? The second one. Same with software design, as well as programming 
language design itself. I have always been amazed at how Plan 9, which never
gained any serious traction, was far more Unix-way than the actual Unix-like
OSes that did.

3. Don't assume growth. This is what I already wrote about in my
DevOps-related rant: most of the complexity in the software world arises
completely prematurely from the blind assumption that everything that starts
small will grow large. Only focus on what you need to do right here and now.
When your code needs to grow, refactor it accordingly. When, not before.
Accordingly, not beyond the scale.

Now, how do these recommendations and the thoughts before them translate into
my vision of truly minimal programming languages? Well, there must be some 
kind of "lowest common denominator" both in terms of implementation 
complexity and in terms of usage complexity, as well as self-sufficiency. 
So, here are my criteria. To me, a particular implementation of a 
programming language is minimal if all of the following conditions apply:

1. Its full source, along with the standard library, must not exceed 500 SLOC
of well-formatted and readable ANSI C89 code. If the implementation is 
provided in another programming language, the SLOC count of a hypothetical 
ANSI C89 translation replicating identical behavior of the language must be 
estimated. If the implementation provides a VM and a compiler is used to 
compile the code for this VM, then both the VM's and the compiler's source
code are counted.
2. The implementation must provide I/O. If it targets the platforms that
support standard input, it must support standard input too. If it targets 
the platforms that support standard output, it must support standard output 
too. If it only targets the platforms that have neither, it must provide a 
way to return the computation results without having to use any kind of 
debugger, tracer or monitor.
3. The source code in the language itself must be human-readable and, apart
from spaces and tabs, consist only of printable characters. Also, any
whitespace characters used in the code must not differ semantically, i.e. a
single space (0x20), a tab character (0x09) or any combination of them must
serve as a single delimiter or bear no semantics at all. An exception could
be Python-like languages where the amount of leading whitespace characters 
on each line is significant, but that must be clearly stated in the language 
specification.
4. The language in this particular implementation must be Turing-complete.
This might not be so obvious at first glance, so it's better to
explicitly specify this requirement.
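
The first criterion depends on what exactly is counted as a SLOC, and
definitions vary between tools. Below is a rough C89 sketch of one possible
counting rule, not the rule behind any figure quoted in this post: a line
counts if it holds at least one non-blank character outside a /* */ comment.
The function name count_sloc is hypothetical, and comment markers inside
string literals are deliberately not special-cased.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count lines that hold at least one non-blank character outside
   block comments. Simplifying assumption: a comment opener inside a
   string or character literal is treated as a real comment. */
size_t count_sloc(const char *src)
{
    size_t sloc = 0;
    int in_comment = 0; /* currently inside a block comment */
    int has_code = 0;   /* current line already holds some code */
    const char *p;

    assert(src != NULL);
    for (p = src; *p != '\0'; p++) {
        if (in_comment) {
            if (p[0] == '*' && p[1] == '/') {
                in_comment = 0;
                p++; /* skip the closing slash */
            } else if (*p == '\n') {
                sloc += has_code; /* code may precede the comment */
                has_code = 0;
            }
        } else if (p[0] == '/' && p[1] == '*') {
            in_comment = 1;
            p++; /* skip the opening asterisk */
        } else if (*p == '\n') {
            sloc += has_code;
            has_code = 0;
        } else if (!isspace((unsigned char)*p)) {
            has_code = 1;
        }
    }
    return sloc + has_code; /* the last line may lack a newline */
}
```

Feeding a whole source file (read into a NUL-terminated buffer) to such a
function and comparing the result against 500 would be one way to check the
criterion mechanically, under the counting rule assumed above.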

Now, I understand that languages like Brainfuck will also meet all these
criteria. Well, yes, Brainfuck is cryptic but still minimal. Its full 
implementation in C89 can fit into well below 500 SLOC, it provides standard 
I/O and its source code is human-readable. Whether or not you can understand 
it is another question for another discussion. But, on the scale of 
complexity, I'd put BF far lower than anything like modern Java. At least I 
can imagine how I could even integrate BF programs into my Unix pipelines 
for daily routines. With Java, I'm not so sure.
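
To give a feeling for the scale involved, here is a minimal C89 sketch of a
Brainfuck interpreter core. It is an illustration under my own assumptions,
not a reference implementation: the name bf_run, the 30000-cell tape, the
"store 0 on end of input" convention and the string-based I/O (chosen so the
function can be exercised without a terminal) are all hypothetical choices.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BF_TAPE 30000 /* conventional tape size, assumed here */

/* Run a Brainfuck program held in a string.
   input:  bytes served to ',' (end of input stores 0 in the cell)
   output: buffer filled by '.', kept NUL-terminated at all times
   Returns 0 on success, -1 if a needed matching bracket is missing,
   the data pointer leaves the tape or the output buffer is full. */
int bf_run(const char *prog, const char *input, char *output, size_t outcap)
{
    static unsigned char tape[BF_TAPE];
    size_t ptr = 0, pc, outlen = 0;
    int depth;

    assert(prog != NULL && input != NULL && output != NULL && outcap > 0);
    memset(tape, 0, sizeof tape);
    output[0] = '\0';

    for (pc = 0; prog[pc] != '\0'; pc++) {
        switch (prog[pc]) {
        case '>': if (++ptr >= BF_TAPE) return -1; break;
        case '<': if (ptr == 0) return -1; ptr--; break;
        case '+': tape[ptr]++; break; /* wraps modulo 256 */
        case '-': tape[ptr]--; break;
        case '.':
            if (outlen + 1 >= outcap) return -1;
            output[outlen++] = (char)tape[ptr];
            output[outlen] = '\0';
            break;
        case ',':
            tape[ptr] = (unsigned char)(*input ? *input++ : 0);
            break;
        case '[':
            if (tape[ptr] == 0) { /* skip forward to the matching ']' */
                for (depth = 1; depth > 0; ) {
                    pc++;
                    if (prog[pc] == '\0') return -1;
                    if (prog[pc] == '[') depth++;
                    else if (prog[pc] == ']') depth--;
                }
            }
            break;
        case ']':
            if (tape[ptr] != 0) { /* jump back to the matching '[' */
                for (depth = 1; depth > 0; ) {
                    if (pc == 0) return -1;
                    pc--;
                    if (prog[pc] == ']') depth++;
                    else if (prog[pc] == '[') depth--;
                }
            }
            break;
        default: break; /* any other character is a comment */
        }
    }
    return 0;
}
```

Swapping the two buffers for getchar() and putchar() would presumably take
only a few more lines, which is what would make such an interpreter usable
as a filter in a pipeline. Either way, the whole thing stays far below the
500 SLOC limit.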

About 10 years ago (if not 15 already), I read a quote by some anonymous
author that reflects the overall situation described in this post pretty
accurately: "If everyone out there knew bash, find, vi(m), grep, sed and
awk, millions of software products would never need to be created". Only
fairly recently did I start to understand how damn right he was.

--- Luxferre ---

[1]: https://git.sr.ht/~luxferre/Equi