Insidious Optimizations II: Machine Text

Human thought is generally stored as human language which is generally stored as human writing.  The
written word, at its most basic, refers to any impressions of symbols, drawn from an alphabet or
other derivation rules, made by a human onto a medium; this isn't amenable to automated analysis,
however.  Printing
machines led to the mass production of written works, but with a natural decrease in what was simple
and feasible to express through them, and a morphing of the ideas to suit then-available technology.

Machine text most commonly refers to a stream of octets representing character codes, and this might
seem reasonable; I realized it wasn't.  I considered why character codes were chosen: clearly due to
limited memory and old hardware which physically stored the characters.  There's no reason why those
machines featuring even mere millions of times more storage couldn't manipulate text in better ways.

I've given a better system, of my own design, the name ``Elision''.

The predominant character encoding is known as ASCII, and can trace a lineage to mechanical devices;
one quarter of it is dedicated to ``control characters'', behaving differently from those characters
which truly represent symbols, and with many behaving differently from the other control characters;
another huge segment is dedicated to storing both the upper case and lower case versions of letters.
The most pleasant aspect of ASCII is its simplicity, owed purely to its small size, but extensions to
it have shown that this smallness was merely a product of technical limitations; the character sets
following it are much more complicated in every way.  Such sets work against programmers by appearing
to be stable foundations.

It's very important that the oddities of one medium not be used as a weapon to discourage adopting
another which permits more.  These characters afford useless freedoms, while the true freedoms are
forgotten.  Acknowledge that what currently passes for machine text lacks many qualities which true
text can feature.

These character encodings can clearly approximate text, but fail to adequately represent it.  Actual
text can be stylized, arranged outside of a rigid grid, and can contain handwriting, custom symbols,
or many other things found in books, yet lacking from computing since its premiere.  True text can no
more be meaningfully thought of as one-dimensional than a human can be meaningfully considered as
atoms.

A system being universal is justification for neither considering it sufficient nor reasonable.

Storing text character-by-character is very inefficient.  For English text, words are being encoded,
not random sequences of characters.  It would be simple to store these words and their variants in a
dictionary, internally represented there as character sequences, and then to represent text as
indices into this dictionary.  I believe a twenty-four bit code, so sized due to the tyranny of the
octet, would easily be sufficient, and would store all but the smallest of words at least as
efficiently, and usually more so.
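
As a sketch of this encoding, consider the following C program; the toy dictionary, the sentence,
and the function names are merely mine for illustration, and a true dictionary would be vastly
larger:

    #include <stdio.h>
    #include <string.h>

    /* A toy dictionary; a real one would hold every English word and its variants. */
    static const char *dictionary[] = { "a", "better", "system", "of", "text" };
    #define WORDS (sizeof dictionary / sizeof *dictionary)

    /* Find a word's index; the toy sentence below uses only words which are present. */
    static unsigned long lookup(const char *word)
    {
        unsigned long i;
        for (i = 0; i < WORDS; i++)
            if (strcmp(dictionary[i], word) == 0)
                break;
        return i;
    }

    /* Pack an index into three octets: the twenty-four bit code. */
    static void encode(unsigned long index, unsigned char out[3])
    {
        out[0] = index >> 16 & 0xFF;
        out[1] = index >>  8 & 0xFF;
        out[2] = index       & 0xFF;
    }

    int main(void)
    {
        const char *sentence[] = { "a", "better", "system", "of", "text" };
        unsigned char code[3];
        for (int i = 0; i < 5; i++) {
            encode(lookup(sentence[i]), code);
            printf("%-6s -> %02X %02X %02X\n", sentence[i], code[0], code[1], code[2]);
        }
        return 0;
    }

Every word costs three octets here, however long it may be.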

A machine could be made which knows and can process text as Elision, perhaps using a systolic array.
I believe there to be no better way to refer to such a machine than to refer to it as an ``Elider''.

The particular form of the dictionary should be treated as unimportant, but one strategy is for each
word to have an entry containing its length, characters, and perhaps other data, such as whether the
word is a proper noun, or its category such as article, adverb, adjective, and whatnot; importantly,
these characters shouldn't be stored in any common character set, but instead in a more efficient
form, which should prohibit characters no words contain.  Separating words from their characters
enables a
more efficient storage method, in which common characters are shared, tending the cost towards zero.
An example is eliminating explicit storage for words which are subsets of others, as in ``another''.
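
One possible shape for such an entry follows, in C; the field names, their widths, and the shared
pool are assumptions of mine, and a real pool would hold the more efficient character form rather
than these plain characters:

    #include <stdio.h>
    #include <stdint.h>

    /* Categories an entry might note; merely illustrative. */
    enum category { ARTICLE, NOUN, VERB, ADJECTIVE, ADVERB, OTHER };

    /* The characters of every word live in one shared pool; a word is a length and
       an offset into it, so words which are subsets of others cost nothing extra:
       ``an'', ``other'', and ``her'' all point into ``another''. */
    static const char pool[] = "another";

    struct entry {
        uint32_t offset;          /* where the characters begin in the pool  */
        uint8_t  length;          /* how many characters the word has        */
        unsigned proper   : 1;    /* whether the word is a proper noun       */
        unsigned category : 4;    /* article, adverb, adjective, and whatnot */
    };

    static const struct entry dictionary[] = {
        { 0, 7, 0, OTHER },       /* another */
        { 0, 2, 0, ARTICLE },     /* an      */
        { 2, 5, 0, ADJECTIVE },   /* other   */
        { 4, 3, 0, OTHER },       /* her     */
    };

    int main(void)
    {
        /* Print ``other'' by following its entry into the shared pool. */
        const struct entry *e = &dictionary[2];
        printf("%.*s\n", e->length, pool + e->offset);
        return 0;
    }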

Certain languages, such as Latin, could be optimized by encoding diphthongs using leftover codespace.
Latin, in particular, could be stored in the dictionary as bases and infinitives, to be declined and
conjugated based on an additional code paired with each word.  The consequence of this is obvious:
the dictionary need hold only the bases, with the many inflected forms costing nothing beyond that
code.
Irregular words would need to be noted as such and could be held in the dictionary less efficiently.
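
As a sketch of this Latin handling, assume the dictionary stores only the base of a regular first
conjugation verb, and that the paired code merely selects an ending; the table and the names here
are mine alone:

    #include <stdio.h>
    #include <string.h>

    /* The present active indicative endings of the first conjugation; the code
       paired with the word selects among them. */
    static const char *endings[] = { "o", "as", "at", "amus", "atis", "ant" };

    static void conjugate(const char *base, unsigned code, char *out)
    {
        strcpy(out, base);
        strcat(out, endings[code]);
    }

    int main(void)
    {
        char word[32];
        for (unsigned code = 0; code < 6; code++) {
            conjugate("am", code, word);    /* the base of amare, to love */
            printf("%s\n", word);
        }
        return 0;
    }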

An important consideration is giving meaning to the dictionary ordering, such as organizing by case.
This would enable comparing words in constant time, by their indices alone, and so sorting them
without needing to know the words represented.
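
A small C sketch of that follows; the indices are invented, and the dictionary ordering is taken to
be meaningful, as described:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    /* With a meaningful dictionary ordering, two words compare in constant time by
       their indices alone; no characters need be consulted. */
    static int compare(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        uint32_t text[] = { 40213, 7, 988, 40213, 31 };    /* word indices, not characters */
        qsort(text, 5, sizeof *text, compare);
        for (int i = 0; i < 5; i++)
            printf("%u\n", (unsigned)text[i]);
        return 0;
    }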

It's deadly important this proposed system not restrict human thought in pursuit of its other goals.
I've thought long over how to best allow for words not in the dictionary to be used, and now know an
auxiliary dictionary is the best approach; a bit is necessary to choose from a pair of dictionaries,
and this approach compresses repetitive usage of such words the same as others, also maintaining the
constant size of words, easily making it better than merely inlining them, an approach I'd considered.
This auxiliary dictionary mechanism must be the second layer of a multi-layered Elision application.
Further freeing human thought goes beyond just Elision, but I believe Elision can serve as the base.

Concerning layering further systems, I've noticed combining several can lead to a nicer result.  For
English, a twenty-three bit index, a dictionary alternation bit, a capitalization bit, and an
inter-word punctuation code resulted in a better system than when I'd considered each layer separately.
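
One possible packing of those layers into thirty-two bits follows, in C; the seven bits given to the
punctuation code, and the exact positions of the fields, are merely my assumptions to round out the
example:

    #include <stdio.h>
    #include <stdint.h>

    /* A twenty-three bit index, the dictionary alternation bit, the capitalization
       bit, and an inter-word punctuation code; seven bits of punctuation fill the
       thirty-two. */
    static uint32_t pack(uint32_t index, unsigned auxiliary, unsigned capital, unsigned punctuation)
    {
        return (index & 0x7FFFFF)
             | ((uint32_t)(auxiliary   & 1)    << 23)
             | ((uint32_t)(capital     & 1)    << 24)
             | ((uint32_t)(punctuation & 0x7F) << 25);
    }

    int main(void)
    {
        /* A capitalized word from the primary dictionary, supposing punctuation
           code two means a following comma. */
        uint32_t word = pack(1905, 0, 1, 2);
        printf("index %u, auxiliary %u, capital %u, punctuation %u\n",
               (unsigned)(word & 0x7FFFFF), (unsigned)(word >> 23 & 1),
               (unsigned)(word >> 24 & 1), (unsigned)(word >> 25 & 0x7F));
        return 0;
    }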

I've given thought to customizing primary dictionaries, and now believe using a customized auxiliary
dictionary to be the best approach; there would be plenty of space to be reserved for such purposes.

Loanwords and the like would ideally be handled by another layer, but this would require there to be
an Elision targeting each such language, which is unrealistic for the early system.  Further, those
who use a
loanword aren't necessarily able to recall or recognize the true version, due to foreign scripts, to
diacritics, or whatnot; the most reasonable way to handle these is to simply include them as written
in the borrowing language, and tables associating these different Elision words could be made later.

Greater mixing of languages would require another layer, to identify them.  An important consequence
of requiring text to identify its languages is avoiding the so-called homoglyph issue, regarding the
foolish decision to add different characters with identical appearances.  This could be avoided with
interfaces properly supporting character recognition, but fools believe this problem is intractable.

Other niceties of the system include: spell-checking due to the dictionary, optimized word-searching
due to the constant size of the words, and even obfuscating texts due to the dictionary indirection.
When lacking any system dictionary, this becomes a mere domain-specific compression scheme, in which
a dictionary is tailored to and paired with the text, and thus needn't have an auxiliary dictionary.
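
A sketch of that word-searching follows, reusing the packing assumed above; masking away the
capitalization and punctuation layers, so a word is found however it was written, is likewise my
assumption:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* The lower twenty-four bits are the word proper: its index and the dictionary
       alternation bit; the upper bits carry capitalization and punctuation, which a
       search should ignore. */
    #define WORD(code) ((code) & 0xFFFFFFu)

    /* Every word is the same size, so searching is a single pass of integer
       comparisons; no characters are ever examined. */
    static size_t count_word(const uint32_t *text, size_t length, uint32_t word)
    {
        size_t count = 0;
        for (size_t i = 0; i < length; i++)
            if (WORD(text[i]) == WORD(word))
                count++;
        return count;
    }

    int main(void)
    {
        /* Word seven appears thrice: plainly, capitalized, and with punctuation. */
        uint32_t text[] = { 12, 7, 7u | 1u << 24, 9000, 7u | 2u << 25, 31 };
        printf("word 7 occurs %zu times\n", count_word(text, 6, 7));
        return 0;
    }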

I believe creating a programming language which allows for the creation of custom notation will have
a similar effect to that which interactive languages have had; both qualities must be a part of the system
from its inception to properly work, and any attempt to do otherwise is always but a poor emulation.
It disgusts me that equality, addition, multiplication, and select others are so privileged as to
be given slots in the character sets, whereas other notations aren't made so omnipresent and innate.
The decimal digits are included in this disgust; such things shouldn't be embedded into foundations.

This also has a large effect on programming languages.  The mathematicians can truly create notation
that suits its context.  The programmers cannot.  I'm unable to include my crest in a text except as
an image, yet humanity has a proud tradition of including such things in letters and whatnot.  These
machines are unable to offer anything grand enough to justify destroying these and other traditions.

As modern computing is rather based around manipulating text as characters, even embedding such into
boot firmware, it's clear to me the current state of machines is insufficient to progress beyond it.

A problem posed by Elision is the need to occasionally update the dictionary in an incompatible way,
to add new words and maintain any ordering qualities, which requires updating texts indexing into an
older dictionary.  This requires texts to identify the dictionaries used, which isn't hard, and also
prohibits relying on the precise bits of the message, which is both an acceptable cost and a sign of
a higher abstraction forming.  The mapping could be achieved by a basic table approach, or better by
a piecewise function incrementing indices appropriately, based upon the insertions.  Regardless, the
updating can be automated trivially.  Checksumming may be best done on the extracted character data.
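
A sketch of that piecewise function follows, in C; the insertion points are hypothetical, and only
the bare index is mapped, for simplicity:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Old-dictionary positions before which new words were inserted, kept sorted. */
    static const uint32_t insertions[] = { 14, 14, 907 };
    #define INSERTIONS (sizeof insertions / sizeof *insertions)

    /* The piecewise function: an old index grows by the number of insertions at or
       before it, so every updated text still names the very same words. */
    static uint32_t update(uint32_t old)
    {
        uint32_t shift = 0;
        while (shift < INSERTIONS && insertions[shift] <= old)
            shift++;
        return old + shift;
    }

    int main(void)
    {
        uint32_t text[] = { 3, 14, 907, 2000 };
        for (size_t i = 0; i < 4; i++)
            printf("%u becomes %u\n", (unsigned)text[i], (unsigned)update(text[i]));
        return 0;
    }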

I've made my case for believing the current state of what passes for machine text to be an insidious
optimization.  I can trace its origin back to the inception of the teletype and related hardware.  I
recognize it as historical inertia, but am aware of few others who do, making it insidious.  As with
other insidious optimizations, it's no longer necessary, isn't viewed as such, and it only inhibits.
For programming, I seek that same type of freedom of notation Ken Iverson had with APL, but greater.
I'm disgusted to be expected to use the same tools for both human and programming languages.  I thus
view programming languages to be inferior for programming, when compared to specialized tools.
Elision is the answer I give for that other half of this.  Now see this situation the same way I do.