(2023-04-12) On structured text data formats
--------------------------------------------
From time to time, I come across various articles about why this or that
format rulez or suxx, endless debates like XML vs. JSON vs. YAML vs. TOML 
and so on. I'm astounded by the fact that these debates are held in all
seriousness, as if the readability of the format were solely determined by
the syntax itself and not by the way you choose to write your own documents
in it. In my practice, I have seen a lot, including nearly unreadable YAMLs
and perfectly readable JSONs and XMLs, although, of course, the opposite can
be seen a bit more often. To me, none of these discussions make sense. When
choosing an export/configuration format for your project, you only have to
answer three simple questions:

1) Does the data need to be readable/editable by humans?
2) How critical is the performance of saving/loading the data by the machine?
3) What is the deepest level of hierarchy that _really_ needs to be stored?

Trust me, everything else is not as important as you think. Moreover, once
you answer these questions, the results might surprise you, as they most
probably won't match your first assumptions. But it's not about the format,
it's about how you arrange your data. And even then, you can make an
informed choice towards BOTH readability AND ease of parsing. Here, I'm
going to overview two of the most easily overlooked and underrated
structured text data storage formats that would make our lives so much
easier if everyone used them in appropriate situations instead of all this
zoo.

The first format is something that you, since you're reading this on Gopher,
must already be familiar with: TSV, tab-separated values. Yes, Gophermaps 
are just TSV files and nothing else. The name makes it look like the format
was derived from CSV, but in fact it's far more ingenious in its simplicity.
Unlike CSV, where you have to escape commas (and don't even get me started
on the fucking M$ that actually allowed _semicolon_ instead of comma as a
delimiter for some locales, and still calls that abomination CSV), ideally
quote all strings containing whitespace and commas, escape all the quotes
inside them and so on, TSV allows you to just write things as is. Because
no one uses the tab character, CR or LF in their tabular data or 
configuration values anyway. If they really have to be used there, they are 
just replaced with \t, \r and \n respectively, with the backslash itself 
being escaped as \\ in this case. But that's really it. This format is 
extremely easy to parse and write, and probably offers the highest 
machine-friendliness to readability ratio. The only precaution you must take 
if you write TSV files manually in a text editor and not programmatically is 
to make sure tabs are saved as tabs and your editor doesn't automatically 
convert them to spaces. Other than that, it just works. Nothing to complain 
about, really.
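
Just to show how little machinery this takes, here's a rough sketch of this
escaping scheme in Python (purely illustrative, the function names are my
own):

def tsv_escape(value):
    # protect the backslash first, then the three forbidden characters
    return (value.replace('\\', '\\\\')
                 .replace('\t', '\\t')
                 .replace('\r', '\\r')
                 .replace('\n', '\\n'))

def tsv_unescape(field):
    # walk the string and undo the \t, \r, \n and \\ sequences
    out, i = [], 0
    while i < len(field):
        if field[i] == '\\' and i + 1 < len(field):
            out.append({'t': '\t', 'r': '\r', 'n': '\n',
                        '\\': '\\'}.get(field[i + 1], field[i + 1]))
            i += 2
        else:
            out.append(field[i])
            i += 1
    return ''.join(out)

def tsv_write_row(fields):
    # a row is just the escaped fields joined with real tabs
    return '\t'.join(tsv_escape(f) for f in fields) + '\n'

def tsv_read_row(line):
    return [tsv_unescape(f) for f in line.rstrip('\n').split('\t')]

That's the whole reader and writer: no quoting rules, no dialect detection.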

Except one thing. The TSV format, exactly because it's so simple, doesn't
offer a straightforward way to store hierarchical data any deeper than
"list of key-value objects" or "key - fixed-size list of values". Sure,
just like with Gophermaps, you can use the first field to store the value
(and optionally the type) and all the subsequent fields to store keys and
subkeys, but a variable number of fields in each row would greatly reduce
both readability and parsing efficiency, and that's not what we want. So,
in case these levels of hierarchy are not enough, we must find something
even more ingenious that describes more complex structures without
introducing another JSON, YAML or, gods forbid, XML level of complexity for
machine parsing, while keeping the format entirely human-readable AND
writeable.

Enter Recfiles. This is a format created under the GNU umbrella to describe
relational database-like structures using simple plaintext files. By the 
way, I won't talk about using Recutils here, I don't care much about them. 
For now, I'd like to focus on the format itself. As far as I understand,
it's about as simple as Gemtext (a small example follows the rules below).

1. Comments:
* Any line starting with the # character is a comment
* Comments can only be written on a separate line where # is the first character

2. Fields:
* They are name-value pairs separated with colon and space (": " or ":\t")
* Field names are case-sensitive
* Any field name must match this regexp: ^[a-zA-Z%][a-zA-Z0-9_]*$
* Field names starting with % denote metadata, not data
* Any field value must be terminated with LF (except \LF or LF+ cases)
* If a line ends with \LF instead of LF, the next line continues the value
* Newlines in values are encoded as LF+ and a single optional whitespace
* Fully blank lines are allowed and not counted as fields
* In all other cases, the line after LF must begin with a valid field name

3. Records:
* A record is a group of fields written one after another
* Can contain multiple fields with identical names and/or values
* Records are separated by one or more blank lines (like paragraphs in MD)
* Record size is the number of fields it contains
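
Here's a tiny made-up example that exercises most of these rules (the field
names mean nothing in particular, they're just for illustration):

# a comment line
Title: On structured text data formats
Tags: formats
Tags: plaintext
Note: a value too long for one source line can be \
continued after a trailing backslash
Body: and a real newline inside a value
+ is written as a line starting with +

Title: a blank line above started this second record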

And this is where the syntax itself ends. Everything else documented about
Recfiles, including the notion of record sets and how we describe them using 
record descriptors (which are just records containing metadata fields only, 
something like database table schemas) is completely optional, built upon 
this syntax and constitutes implementation details specific to a particular 
set of tools (GNU Recutils). If you're interested in diving deeper into
the canonical implementation, GNU Recutils itself, I recommend its full
manual ([1]) for further reading. It really is fascinating. However, with
Recfiles being a fully open format, its implementations are not limited to 
just one, and some other tools adopt simpler modes of operation. The 
reference recfile parser in Tcl ([2]), for instance, only recognizes the 
%rec metadata field in the descriptor to turn its value (record type) into a 
top-level key in the output dictionary.

I really like this format for the same reason I like Gemtext among others:
because it is fully line-oriented. That is, after splitting your text by LF, 
you can unambiguously determine the type of each line based on the character 
it starts with. In fact, a single record parser can be defined with a very
simple informal algorithm (a rough code sketch of it follows a bit below):

1. Initialize an empty string buffer BUF. Set the literal reading mode to off.
2. Read the next line L.
3. If the literal reading mode is on, append the contents of L to BUF and go
to step 9.
4. If L is empty, go to step 11.
5. If L starts with #, go to step 2.
6. If L starts with +, skip a single optional whitespace after it, append an
LF and the rest of L to BUF, emit a flag to update the previously emitted
field, then go to step 9.
7. Read all characters until the first colon (:) in L as NAME. If NAME
matches the ^[a-zA-Z%][a-zA-Z0-9_]*$ regexp, then save it, otherwise discard 
it.
8. Clear the BUF value. Read all characters after the first colon (:) in L
into BUF. If BUF now starts with a whitespace (0x20 or 0x9), remove this 
whitespace.
9. If BUF ends with a backslash, then remove it from BUF, turn on the literal
reading mode and go to step 2.
10. Turn off the literal reading mode and emit NAME as the current field name
and BUF as the current field value. Go to step 2.
11. Report the end of the record. End of algorithm.

Note that the algorithm doesn't tell us what to do with duplicate field
names. We determine this ourselves, as well as how to concatenate the +
lines.
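
And here's a rough sketch of the same algorithm in Python (my own
illustration, not part of any official tooling). Instead of emitting and
then updating fields, it simply flushes each field when the next one starts
or the record ends, which amounts to the same thing; duplicate names are
kept in order and the + lines are concatenated with a bare LF:

import re

FIELD_NAME = re.compile(r'^[a-zA-Z%][a-zA-Z0-9_]*$')

def read_record(line_iter):
    # Read the next record from an iterator of lines (e.g. an open file)
    # and return it as a list of (name, value) pairs; [] means end of input.
    fields = []
    name, buf = None, None
    literal = False                      # inside a backslash continuation?

    def flush():
        nonlocal name, buf
        if name is not None:
            fields.append((name, buf))
        name, buf = None, None

    for raw in line_iter:
        line = raw.rstrip('\n')
        if literal:                      # step 3: continue the previous value
            buf += line
        elif line == '':                 # step 4: blank line ends the record
            if fields or name is not None:
                break
            continue                     # skip blank lines between records
        elif line.startswith('#'):       # step 5: comment, ignore
            continue
        elif line.startswith('+'):       # step 6: embedded newline
            rest = line[1:]
            if rest[:1] in (' ', '\t'):
                rest = rest[1:]
            buf = (buf or '') + '\n' + rest
        else:                            # steps 7-8: a new "name: value" field
            flush()
            head, sep, rest = line.partition(':')
            if not sep or not FIELD_NAME.match(head):
                continue                 # discard lines with invalid names
            name = head
            buf = rest[1:] if rest[:1] in (' ', '\t') else rest
        # step 9: a trailing backslash means the next line continues the value
        if buf is not None and buf.endswith('\\'):
            buf = buf[:-1]
            literal = True
        else:
            literal = False

    flush()                              # steps 10-11: emit what's left
    return fields

Calling read_record repeatedly on the same file object then yields one
record per call until it returns an empty list.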

Now, even if we aren't aiming for a full SQLite3 or TextQL replacement, what
can we use this bare format for?

Tabular data? That's the native mode of recfiles, although just using them as
a drop-in replacement for CSV or TSV probably won't showcase their full 
potential. Every record with the same set of unique field names naturally 
corresponds to a table row. For example, any Gophermap line has a 1-to-1 
mapping to a Recfile record.
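
For instance, a Gophermap line carries an item type fused with the display
string, then a selector, a host and a port in tab-separated fields; the
same information could be laid out as a record like this (the field names
here are my own choice, not any standard):

Type: 1
Display: Phlog archive
Selector: /phlog
Host: example.org
Port: 70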

INI/TOML/.properties-style configuration? Easy! Just use the metadata fields
like %rec to name your sections and unique field names in each record. 
Everything else is the same key-value structure.
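
For instance, a hypothetical INI snippet like

[server]
host = 127.0.0.1
port = 7070

maps directly onto a record set:

%rec: server

host: 127.0.0.1
port: 7070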

JSON/YAML-style configuration of any depth? Also easy:
* all objects and lists of objects are named via the %rec descriptor;
* objects with primitive values are just records with unique key fields;
* lists of primitive values are just records (or parts of records) with
non-unique key fields;
* nesting is done by referencing something like '%rec/[name]' instead of the
primitive value.

Let's take a look at the example YAML from some CloudBees tutorial:

---
doe: "a deer, a female deer"
ray: "a drop of golden sun"
pi: 3.14159
xmas: true
french-hens: 3
calling-birds:
  - huey
  - dewey
  - louie
  - fred
xmas-fifth-day:
  calling-birds: four
  french-hens: 3
  golden-rings: 5
  partridges:
    count: 1
    location: "a pear tree"
  turtle-doves: two

Now, here's how it might look as a Recfile (note that this is just one option out
of many):

# top record descriptor - may be omitted
%rec: top

doe: a deer, a female deer
ray: a drop of golden sun
pi: 3.14159
xmas: true
french-hens: 3
calling-birds: huey
calling-birds: dewey
calling-birds: louie
calling-birds: fred
xmas-fifth-day: %rec/xmas-fifth-day

# subrecord descriptor
%rec: xmas-fifth-day

calling-birds: four
french-hens: 3
golden-rings: 5
partridges: %rec/partridges
turtle-doves: two

# another subrecord descriptor
%rec: partridges

count: 1
location: a pear tree


Another example might be this highly nested JSON that contains some arrays of
objects:

{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters": {
    "batter":
      [
        { "id": "1001", "type": "Regular" },
        { "id": "1002", "type": "Chocolate" },
        { "id": "1003", "type": "Blueberry" },
        { "id": "1004", "type": "Devil's Food" }
      ]
  },
  "topping": [
    { "id": "5001", "type": "None" },
    { "id": "5002", "type": "Glazed" },
    { "id": "5005", "type": "Sugar" },
    { "id": "5007", "type": "Powdered Sugar" },
    { "id": "5006", "type": "Chocolate with Sprinkles" },
    { "id": "5003", "type": "Chocolate" },
    { "id": "5004", "type": "Maple" }
  ]
}

And here is how I'd represent it as a Recfile (omitting the toplevel
descriptor and comments for brevity this time):

id: 0001
type: donut
name: Cake
ppu: 0.55
batters: %rec/batter
topping: %rec/topping

%rec: batter

batter: %rec/batterlist

%rec: batterlist

id: 1001
type: Regular

id: 1002
type: Chocolate

id: 1003
type: Blueberry

id: 1004
type: Devil's Food

%rec: topping

id: 5001
type: None

id: 5002
type: Glazed

id: 5005
type: Sugar

id: 5007
type: Powdered Sugar

id: 5006
type: Chocolate with Sprinkles

id: 5003
type: Chocolate

id: 5004
type: Maple


Although this example and even more complex ones are obviously
machine-generated (i.e. the data came as a result of calling some API) and
don't resemble anything to be used for configuration purposes, such data
still is more human-manageable in this format (which remains easy to write
and read programmatically) than when you're trying to guess where the
closing bracket went missing and which one exactly, curly or square, it
was. It also doesn't cause eyestrain from the abundance of quotation marks
and the need to escape them whenever they are encountered inside your
string values. I could also provide an example of how to handle
XML-structured data, but I hope you already get the idea.
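
To show that the '%rec/[name]' reference convention used in these examples
is just as easy to consume programmatically, here's a rough Python sketch
(building upon the read_record function above; the resolution logic and all
the names are mine, not part of any recfile spec) that folds such a file
back into nested structures:

def load_nested(stream):
    # group the records by their %rec type; the first group is the top level
    groups, current, top = {}, '', None
    while True:
        record = read_record(stream)
        if not record:
            break
        if record[0][0] == '%rec':       # a descriptor names the next records
            current = record[0][1]
            record = record[1:]
        if record:
            groups.setdefault(current, []).append(record)
            if top is None:
                top = current
    return resolve(groups, top) if top is not None else {}

def resolve(groups, rec_type):
    out = []
    for record in groups.get(rec_type, []):
        obj = {}
        for name, value in record:
            if isinstance(value, str) and value.startswith('%rec/'):
                value = resolve(groups, value[len('%rec/'):])
            if name in obj:              # duplicate names become lists
                if not isinstance(obj[name], list):
                    obj[name] = [obj[name]]
                obj[name].append(value)
            else:
                obj[name] = value
        out.append(obj)
    # a single record collapses into one object, several stay a list
    return out[0] if len(out) == 1 else out

Run on the donut example above, this yields roughly the same nesting as the
original JSON; the only genuine ambiguity is that a referenced set with
exactly one record can't be told apart from a plain object without extra
metadata.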

Regarding non-Recutils implementations of Recfiles, it was also interesting
for me to find out that Bash itself, being a GNU project, was supposed to 
include a readrec builtin command that would facilitate reading a whole 
record from a file as opposed to parsing the lines obtained via the read 
builtin. In fact, however, this "builtin" never became a real builtin 
shipped within Bash. For it to work, you still need to install Recutils 
separately (and on my Arch, I had to do this from AUR) and then plug in the
readrec.so library like this:

enable -f /usr/lib/readrec.so readrec

Even without the entire package overhead, this particular library, on x86_64
architecture, weighs about 14K. I'm not really sure whether all this is
really necessary just to parse a simple record format and handle the
special newline cases within field values, especially since the command
itself doesn't do much else. Also, contrary to GNU's own specification,
this command doesn't enforce a whitespace character after the colon to
delimit field values from their names in the input (although it does
insert one in the output). That's why I created a sourceable script for
modern Bash versions (4.3 and up) with my own version of the command,
readreclx, that mimics readrec's behavior (although it doesn't set the
REPLY_REC variable) and weighs under 3K bytes. You can
consider it a reference implementation of the algorithm mentioned above. And 
it looks like it deals with edge cases just fine, although more thorough 
testing might be required. As usual, I have published this script in my 
downloads section on hoi.st.

Why did I do this? Because such formats really deserve more attention, more
love and more independent implementations.

--- Luxferre ---

[1]: https://www.gnu.org/software/recutils/manual/index.html
[2]: https://wiki.tcl-lang.org/page/recfile