This is documentation for the SYMTAB.4TH file. The code is intended to be a supplement to the NOVICE.4TH file. The novice should first read NOVICE.TXT and understand the code in NOVICE.4TH. You don't have to understand every detail, though, as SYMTAB.4TH mostly only uses the FIELD word from NOVICE.4TH.

*******************************************************************************
*******************************************************************************
*******************************************************************************

Symtab is a symbol table, such as would be used in an assembler or Forth compiler. It is a binary tree that restructures itself such that the more-often-accessed nodes are toward the top. The .COUNT field keeps track of how many times the node has been accessed. The .COUNT is negated if the node is smudged. There is a non-negative DAMPER value that dampens the restructuring. The higher it is set, the more dampening there is. The default is 5, which should work for most everybody.
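
The node layout and the damped promotion test can be sketched in Python rather than Forth. The field names mirror .LEFT, .RITE and .COUNT, but the exact form of the test is an assumption of this sketch, not SYMTAB.4TH's actual code:

```python
DAMPER = 5  # default dampening margin

class Node:
    def __init__(self, key, count=1):
        self.key = key
        self.count = count   # negated when the node is smudged
        self.left = None
        self.rite = None

def deserves_promotion(child, parent):
    """A child is promoted above its parent only when its access
    count exceeds the parent's by more than DAMPER (assumed rule)."""
    return abs(child.count) > abs(parent.count) + DAMPER
```

With DAMPER at 5, a child with count 10 outranks a parent with count 4, but a child with count 6 does not, so nodes that are nearly tied don't churn.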

Symtab works well when there is a wide variance between how often the words get accessed. This is typical of symbol tables, in which some of the words are more common than others. If some words are accessed 10 to 100 times more often than other words, symtab will move them toward the root, making their access up to 10 times faster (assuming a tree of 1000 words that is 10 levels deep). This makes the overall speed of compilation at least an order of magnitude faster than a balanced binary tree that makes no effort to move the common words toward the root.

If all the words are being accessed with the same frequency, symtab will tend to churn. It will promote a node ahead of another, and then demote it again a little bit later --- because none of the nodes really deserve to be promoted in the sense that they are more commonly accessed than the others. Because of this symtab is not a good choice for an associative array, in which nothing is known about how often the words will be accessed. For that, see ASSOCIATION.4TH (which uses an LLRB tree) also included in the novice package.

Symtab is pretty much balanced (assuming that there is no relation between alphabetic ordering and frequency of use). Also, symtab does get rid of "stringers" (nodes with one NIL child and one valid child, and a valid grandchild on the same side as the child). In Forth it is common to have a lot of helper words that only get used once or twice and never again. These all sink to the leaves, where they get flattened out insofar as they don't form stringers.
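
The "stringer" shape is easy to recognize in a sketch (Python rather than Forth; the node layout is assumed): a node with one NIL child whose single valid child has a valid child on the same side, i.e. three nodes in a straight line.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.rite = None

def is_stringer(node):
    """True when node, its only child, and a grandchild on the same
    side form a straight line -- the shape that flattening removes."""
    if node.left is None and node.rite is not None:
        return node.rite.rite is not None
    if node.rite is None and node.left is not None:
        return node.left.left is not None
    return False
```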

DAMPER prevents a lot of churning between nodes that are close together in their .COUNT values. The higher that DAMPER is set, the less churning that is done. The damping primarily has an effect in the early stages of dictionary building. This prevents a lot of churning at a time when we don't really have enough information to make decisions. As time passes, the difference between the uncommon and the common nodes will widen, and DAMPER will become increasingly irrelevant.

INIT-SYMTAB is used for initializing nodes with a default COUNT of 1. If the user has some a priori knowledge of the frequency with which words will be accessed, he can preinitialize COUNT using <INIT-SYMTAB>. A compiler-writer could collect some statistics on typical Forth programs and give all the nodes preset COUNT values, and then OPTIMIZE the tree so that it would be reasonably structured out of the box. If the use of the tree differs from "typical" usage, the tree will gradually restructure itself accordingly. Assuming that the use of the tree is fairly typical though, not much restructuring will be necessary.

FIND-KEY is used for finding a node in the tree that matches a specific key. Note that FIND-KEY returns a new root pointer. This is because the tree may get restructured by FIND-KEY and the root will not be the same node as before. If you are holding the root pointer in a variable, be sure to store the new root pointer into that variable after FIND-KEY returns. 
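
The caller's obligation to keep the new root, together with one plausible promotion mechanism (a single rotation along the search path whenever a child's count exceeds its parent's by more than DAMPER), can be sketched in Python. This illustrates the idea only; it is not SYMTAB.4TH's actual restructuring code:

```python
DAMPER = 5

class Node:
    def __init__(self, key, count=1):
        self.key = key
        self.count = count
        self.left = None
        self.rite = None

def rotate_rite(node):           # promote node.left above node
    child = node.left
    node.left = child.rite
    child.rite = node
    return child

def rotate_left(node):           # promote node.rite above node
    child = node.rite
    node.rite = child.left
    child.left = node
    return child

def find_key(root, key):
    """Return (new_root, found_node_or_None). The root can change
    because a frequently accessed node is rotated upward."""
    if root is None:
        return None, None
    if key == root.key:
        root.count += 1
        return root, root
    if key < root.key:
        root.left, hit = find_key(root.left, key)
        if hit is not None and root.left.count > root.count + DAMPER:
            root = rotate_rite(root)
    else:
        root.rite, hit = find_key(root.rite, key)
        if hit is not None and root.rite.count > root.count + DAMPER:
            root = rotate_left(root)
    return root, hit
```

The caller always writes the returned root back: `root, node = find_key(root, "dup")`. A deep node that keeps getting accessed climbs one rotation at a time until it sits at the top.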

If INSERT-NODE is inserting a node whose key matches an existing node, then the old node will be smudged. The old node's COUNT value is transferred over to the new node though, because the new node is presumably going to be just as popular as the node it is replacing. INSERT-NODE returns the new root node, which may be different from the old root node if the tree was restructured. INSERT-NODE also returns either the old node or FALSE if there was no old node, so the user can know if a redefinition occurred or not. Also, the user can undo the insertion by unsmudging the old node and smudging the new node, if the redefinition was a mistake.
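
The redefinition behavior might be sketched like this (Python rather than Forth, showing leaf insertion for brevity; where the smudged duplicate is parked in the tree is an assumption of the sketch, not SYMTAB.4TH's actual layout):

```python
class Node:
    def __init__(self, key, count=1):
        self.key = key
        self.count = count   # negated when smudged
        self.left = None
        self.rite = None

def insert_node(root, new):
    """Return (new_root, old_node_or_False). On a redefinition the
    new node takes the old node's place and inherits its count; the
    old node is smudged and kept in the tree for a later PRUNE."""
    if root is None:
        return new, False
    if new.key == root.key and root.count > 0:
        old = root
        new.count = old.count          # presume equal popularity
        old.count = -old.count         # smudge the old definition
        new.left, new.rite = old.left, old.rite
        old.left = old.rite = None
        new.left, _ = insert_node(new.left, old)   # park the duplicate
        return new, old
    if new.key < root.key:
        root.left, old = insert_node(root.left, new)
    else:
        root.rite, old = insert_node(root.rite, new)
    return root, old
```

Because the old node is returned (or FALSE), the caller can detect the redefinition, and could undo it by unsmudging the old node and smudging the new one.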

The word WALK> traverses the tree from left to right alphabetically, <WALK traverses from right to left, and WALK from bottom to top. In all cases, the xt of a word that touches each node must be provided. It is okay for the toucher to access data underneath the node. EQUALIZE is a typical example of a function that traverses the tree, using the toucher <EQUALIZE>. We also have CRAWL>, <CRAWL and CRAWL that provide the toucher with the depth as well as the node itself.
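
The three traversal orders, with a count-resetting toucher in the spirit of <EQUALIZE> as the example, can be sketched in Python (the function names are invented for the sketch):

```python
class Node:
    def __init__(self, key, count=1):
        self.key = key
        self.count = count
        self.left = None
        self.rite = None

def walk_fwd(node, toucher):          # WALK>: left to right (in-order)
    if node is not None:
        walk_fwd(node.left, toucher)
        toucher(node)
        walk_fwd(node.rite, toucher)

def walk_rev(node, toucher):          # <WALK: right to left
    if node is not None:
        walk_rev(node.rite, toucher)
        toucher(node)
        walk_rev(node.left, toucher)

def walk_up(node, toucher):           # WALK: bottom to top (post-order)
    if node is not None:
        walk_up(node.left, toucher)
        walk_up(node.rite, toucher)
        toucher(node)

def equalize(root):
    """Reset every count to 1, preserving the smudge sign."""
    def toucher(n):
        n.count = 1 if n.count > 0 else -1
    walk_up(root, toucher)
```

The bottom-to-top order matters when the toucher deletes or restructures: every node is touched after its children, so the toucher never sees a child it has already destroyed.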

Nodes are smudged with <SMUDGE>, and can be unsmudged with <UNSMUDGE>. The word PRUNE removes all of the smudged nodes. Until PRUNE is called, any smudged node can be unsmudged; smudged nodes never get deleted automatically, but must be explicitly deleted with PRUNE. The word DELETE is also available for deleting arbitrary nodes right away. This isn't recommended because it doesn't work if the tree contains nodes of various sizes. Also, it doesn't always delete the node, but sometimes just smudges it, so you still have to call PRUNE periodically to get rid of memory leaks. All in all, it is best to smudge nodes and then call PRUNE later on.
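
The smudge/unsmudge/PRUNE life cycle might look like this in Python. Rebuilding the tree from an in-order walk is an assumption made to keep the sketch short; the real PRUNE presumably deletes nodes in place:

```python
class Node:
    def __init__(self, key, count=1):
        self.key = key
        self.count = count   # negative count marks a smudged node
        self.left = None
        self.rite = None

def smudge(node):
    node.count = -abs(node.count)

def unsmudge(node):
    node.count = abs(node.count)

def prune(root):
    """Return a tree containing only the unsmudged nodes."""
    keep = []
    def gather(n):               # in-order walk skipping smudged nodes
        if n is not None:
            gather(n.left)
            if n.count > 0:
                keep.append(n)
            gather(n.rite)
    gather(root)
    def build(lo, hi):           # sorted nodes -> balanced subtree
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        node = keep[mid]
        node.left = build(lo, mid)
        node.rite = build(mid + 1, hi)
        return node
    return build(0, len(keep))
```

Until prune() runs, a smudged node can simply be unsmudged and nothing is lost, which is why smudge-then-PRUNE is safer than deleting right away.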

In Forth, it is very rare to want to get rid of a single arbitrary node. It is more common to want to wipe out a slew of words in one stroke. A good system I have used is to have both a soft-smudge flag and a smudge flag. You have a LOCALIZE word that soft-smudges the latest definition. This is used throughout a source-file on all of the helper words that aren't general-purpose and aren't intended to be used outside of that source-file. At the end of the source-file, you call the END-MODULE word that traverses the entire dictionary and smudges all of the soft-smudged words that it finds. END-MODULE will then call PRUNE to delete the whole slew of smudged words. I don't have LOCALIZE and END-MODULE in the symtab package, but these can easily be added when symtab is implemented into a Forth system. I do have the word AXE that removes all of the nodes in the tree. This is typically used for wiping out an entire word-list.

There is a compiler directive INSERT-AT-ROOT? that determines if nodes will be inserted at the root or at the leaves. If the nodes are inserted at the leaves, then the nodes will start out in their optimal place, which is at the bottom. A newly inserted node has a .COUNT value of 1, so it can't be more common than any other node. From there, it will work its way up toward the root as it gets accessed over time, and will eventually end up in its appropriate place. On the other hand, if a node is inserted at the root, then it will sink downward into the tree during the automatic restructuring done by FIND-KEY, and will eventually end up in its appropriate place. The default is to insert at the root. The assumption here is that a word will get used immediately after it is defined. In Forth, there are a lot of helper words that get used within the definition of a word immediately following their own definition, but never get used again. These words will have fast access immediately after they are defined, but will then sink down to the leaves where they ultimately belong, because they aren't general-purpose words that get used over and over again.

The tree is not always optimized in the sense that every parent node has a higher .COUNT value than its children. One reason is that the nodes are getting inserted at the root, and they have a much lower .COUNT value than pretty much everything below them. Another reason is that nodes get smudged or deleted, and this results in nodes with low .COUNT values being above nodes with higher .COUNT values. Over time, the automatic restructuring done by FIND-KEY will repair all of this damage and optimize the tree.

The word OPTIMIZE optimizes the tree in one fell swoop. The user generally never does this however. After a compiler is built, and before it is released to the public, START-FRESH should be called. START-FRESH calls PRUNE, OPTIMIZE, FLATTEN and EQUALIZE. It calls PRUNE first to get rid of any memory leaks, and also because this speeds up OPTIMIZE. It calls OPTIMIZE next to restructure the tree so that every parent has a higher .COUNT value than its children (within the DAMPER margin). It then calls FLATTEN to flatten out any stringers that OPTIMIZE created. Finally it calls EQUALIZE. This resets the .COUNT values of all the nodes to 1. The purpose of this is to allow an application to be compiled, and the new words to be able to compete equally with the old words for being considered common.


******
****** Hash tables.
******

Symtab has 3 words of overhead per node (LEFT, RITE and COUNT). Assuming that linked lists are used to resolve collisions, a hash table has N+S words of overhead in total. N is the number of words, each of which has a one-cell overhead for its link pointer. S is the number of one-cell "buckets" in the hash table. S is equal to N/L, with L being the "load factor" of the hash table. The load factor is defined as the total number of nodes (N) divided by the size (S) of the hash table. The Wikipedia article says this about load factor:

"The performance of most collision resolution methods does not depend directly on the number n of stored entries, but depends strongly on the table's load factor, the ratio n/s between n and the size s of its bucket array. With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases."

A lot of people rebuild their hash table when the load factor passes 0.5, although the Wikipedia article says that 0.7 is a good threshold. If the load factor is 0.5, then the memory usage of the hash table is N+N/0.5 = 3N, which is 3 words per node. This is exactly the same memory usage as symtab, which has 3 words per node (LEFT, RITE and COUNT). This is assuming that there are no collisions at all. If there are any collisions, then we have empty buckets that are using a cell each but aren't doing any good, so the memory usage will be worse than 3 cells per node (worse than symtab). At a load factor of 0.5, symtab is at least as good (assuming no collisions) or better (given the more realistic assumption of having some collisions) in memory usage than a hash table. At a load factor of 0.7, the memory usage of a hash table is N+N/0.7 = 2.43N, which is 2.43 cells per node. If the rebuilding involves doubling the size of the hash table when L=1/2, then L will equal 1/4 after the rebuilding. At this time, the hash table uses 5 words per node (1+1/0.25), which is much worse than symtab's 3 words per node.
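
The arithmetic above is easy to check: a chained hash table costs one link cell per node plus 1/L of a bucket cell, i.e. 1 + 1/L cells per node, ignoring collisions.

```python
def hash_words_per_node(load_factor):
    """Total overhead is N link cells plus S = N/L bucket cells,
    which works out to 1 + 1/L cells per node."""
    return 1 + 1 / load_factor

SYMTAB_WORDS_PER_NODE = 3   # LEFT, RITE and COUNT
```

At L = 0.5 this gives 3 cells per node (the break-even point with symtab), at L = 0.7 it gives about 2.43, and right after a doubling (L = 0.25) it gives 5.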

The problem with hash tables is that the user has to predict ahead of time how many entries there will be in his dictionary. If he predicts too high, then he wastes memory. If he predicts too low, then his load factor will hit the threshold and he will have to "stop the world" and rebuild his hash table. Rebuilding a hash table is a very slow process, because each entry has to be rehashed. This rebuilding can occur at unpredictable times, which can cause the system to inexplicably freeze up while the hash table is rebuilt. By comparison, symtab grows smoothly, with the time-cost amortized evenly over all of the accesses. The upshot is that symtab is a good choice when memory is in short supply, but hash tables are a good choice when there is enough memory available that the table can be made large enough from the get-go that the 1/2 load-factor threshold is never approached. Modern computers have a lot of memory, so large hash tables are generally the best choice. Symtab is really a throwback to the bad old days of the 8088, when the entire symbol table had to fit in a single 64K segment. Symtab is not all that useful in modern times unless there is a shortage of memory for some reason. Still, it is a reasonable choice. With symtab the programmer doesn't have to predict ahead of time how big his symbol table will become, but can just let it fill up the heap during usage. Symtab is more of a maintenance-free solution than the hash table.


******
****** MARKER words and FORGET.
******

ANS-Forth has MARKER words, whereas Forth-83 had FORGET. They are essentially the same, except that MARKER words forget themselves, and it is not possible to forget an arbitrary word as was done with FORGET. The advantage of MARKER is that it restores the search-order, which FORGET did not do.

The problem with both MARKER words and FORGET, is that redefinitions may have occurred. These need to be undone. The basic symtab system doesn't do this. When a redefinition is done, the old word is smudged. Being smudged, it will get deleted the next time that PRUNE is called. When the MARKER word is called, the new definition gets smudged and deleted, but the old definition is still smudged and also gets deleted. This is not a true restoration of the dictionary prior to the MARKER word being defined.

MARKER words (and FORGET) were designed on the assumption that linked lists would be used, and that new nodes would be prepended into the lists with no restructuring of the list being done. In this case, the old node is still in the list and is deeper into the list than the new node. When the new node gets deleted, the old node becomes valid again. This also works in hash tables assuming that they resolve collisions with simple linked-lists as described.

It is possible for symtab to support MARKER words (and FORGET). When a redefinition is done, the old word is not smudged, but is "redef-smudged" instead (a different flag). This means that it can't be found by FIND-KEY, but it doesn't get deleted by PRUNE either. Also, the old word is given a pointer to the word that redefined it. When the MARKER word is executed, it searches the dictionary for words that were defined after it was defined, and deletes them. This can be accomplished by checking that the words' code-field-address is higher in memory than the MARKER word's code-field-address. The MARKER word also searches the dictionary for words with their redef-smudge flag set, and which point to a new word that is on the MARKER word's hit list. These words will get their redef-smudge flag unset and will thus be reinstated.

It is possible to support MARKER words as described above, but it is a major hassle. A simpler solution is to not support MARKER at all. Normally, the MARKER word is defined at the very beginning of the application. The purpose of this is so the application can be recompiled repeatedly during development, without redefining itself over and over again, which is a waste of memory. My TRY word in the novice package is designed to facilitate this. Another way to do this, however, is to put the entire application in its own word-list. My symtab package has an AXE word that deletes every node in a symtab tree. Assuming that the application is in its own word-list, and that each word-list gets its own symtab tree, AXE can be used to delete the entire application. If any words in the application are redefinitions of words in the base system or in previously compiled libraries (each library also gets its own word-list), the old word is still valid in that other word-list and will be reinstated. The reason why FIND-KEY wasn't finding the old word was because the search-order had the new word's word-list ahead of the old word's word-list. If we wipe out the new word's word-list and remove it from the search order, the old word will start to get found again. All in all, this is a much more straightforward solution than supporting MARKER words. With MARKER words, everything is in the same big symtab tree and we get involved with a lot of weird redef-smudged words that are in limbo between being smudged and not being smudged, and which can get reinstated again. All of this complexity can be dispensed with if the libraries and applications are each in their own word-list, and we have a way to delete all of the words in a word-list. ANS-Forth is badly designed in that it lacks a word for deleting all of the words in a word-list --- if we had this, we could get rid of MARKER and all of its weirdness.