Who Killed Gopher?
An Extensible Murder Mystery

From: https://web.archive.org/web/20210405164806/https://www.ics.uci.edu/~rohit/IEEE-L7-http-gopher.html

    By Rohit Khare // December 23, 1998

   As best our forensics team can reconstruct, a serial killer
   first surfaced in the beginning of 1991, born of an
   academic's midnight hack gone awry. NeXTstep users, you see
   -- precisely the kind of object-oriented revolutionaries
   who would think they could get away with inventing a brand
   new protocol for a mature Internet.

   The first victim was the campus phonebook: it dismembered
   the whole corpus into tiny bits of `hypertext', left
   hanging together by search strings alone. Here we see its
   modus operandi: not content to slice information into neat
   lists or trees, it left behind completely unstructured
   graphs littered with redirects hither and yon across the
   Internet to co-conspirators' servers.

   Hegemonic fantasies drove it to assimilate more and more
   multimedia formats at each node. Crude at first, shoveling
   any old file through an 8-bit clean channel, it grew more
   explicit until it could label text, images, audio, even
   compound documents. By now, it can even masquerade behind
   several renderings, negotiating whatever face -- or
   language! -- pleases the victim.

   As a protocol, it was a blunt weapon: aim a socket at the
   target, send a one-line demand, and slurp down the response
   until the connection died. The crazy way it grew from there
   is proof itself that the killer dropped out of Application
   Layer Protocol design school at an early age: it shows no
   understanding of the basics of TCP, botching handshakes,
   slow-start, server-sends-first-byte, even tripping over the
   Nagle timer. Instead, in a stroke of streetwise punk
   genius, it stayed 'simple' enough that any two-bit Telnet
   client could take a hit.

   And what an addictive product it was! Workgroupies were
   seduced by the ease of installing new servers and authoring
   new content -- and pushing free software helped the lock-in
   cycle along. The killer preyed on individual users'
   self-esteem by promising instant fame^*, in the form of a
   backlink from the master list of servers.

     ^*Anyone have firm attribution for the famous quote, "On
     the Internet, everyone will be famous for 15 people"?

   Soon, it was assimilating more than bags of bits: it was a
   gateway drug to harder services. Scripts, database lookups,
   even long-established news, file, and email access
   transactions were reduced to mere documents in a menu. The
   address sent in the 'demand' grew into a miniature RPC.

   All the while, with every new user, every new server, every
   nugget of information it wrested from 'legacy' protocols,
   the killer bled off a little more of the Internet's
   bandwidth -- getting away with murder with every handshake!

   Such profligacy is no idle threat: for a while, the killer
   was the largest share of application packet traffic on the
   national NSFNET backbone.

   Or was that the victim?

   Did the killer come knocking on port 70 or port 80? From
   boombox.micro.umn.edu or info.cern.ch? Bearing the
   imprimatur of campus administrators or physicists? Are we
   hunting gophers or spiders?

Dial 'H' For Murder

   In fact, the profile in Table 1 fits both suspects to a
   'T'. In the early '90s, an interesting drama played out
   between two contenders for the first integrated,
   interactive internet information retrieval protocols --
   neither of which, arguably, had any technical reason to
   exist.

   That's not to say the systems weren't inevitable: everyone
   suspected there'd have to be an easier way to use the
   Internet in the '90s than the UNIX shell experience. That
   encompasses innovations in user interface, authoring
   formats, and above all, resource identification. It does
   not, however, motivate the original editions of the Gopher
   and HyperText Transfer Protocols. They were painfully
   trivial hacks that did nothing FTP didn't already handle.

   So here begins our mysterious tale: why either protocol
   ever rose to prominence in the first place; and how the
   fratricidal drama eventually played out. Understanding how
   HTTP killed gopher may lead us to the forces which may, in
   turn, topple HTTP.

   Science proves a blind alley, though. Their fates were
   decided not on technical merits, but on economic and
   psychological advantages. The 'Postellian' school of
   protocol design focused on engineering 'right' solutions
   for core applications (batch file transfer, interactive
   terminals, mail and news relays) anchored in unique
   transport layer adaptations (slow-start, Nagle timers, and
   routing as respective examples). Our two specimens are
   'post-Postel', in their details and in their adoption
   dynamics. They are stateless; they either lack (Gopher) or
   dilute (HTTP) the theory of reply codes; they scale poorly,
   imperiling the health of the Internet; and they are
   'luxuries' for publishing discretionary information, not
   Host Requirements which must be compiled into every node.

Burrowing into Gopher

   While the Web was arguably invented a few months before
   Gopher, the latter was the first to garner public notice.
   By the fall of 1994, RFC 1689's census of information
   retrieval tools estimated there were 4,800 Gophers, and
   1,200 anonymous FTP archives, but only 600 Webs. Early and
   "explosive" adoption had a lot to do with Gopher's
   "fiercely simple" design, and hence its wide availabilty on
   a number of students' (DOS, Windows, Mac, NeXT) and campus
   IS departments' (UNIXes, VMS, VM/CMS, MVS) platforms.

   It was designed by the microcomputer support folks at the
   University of Minnesota for a teletext-like Campus-Wide
   Information System (CWIS). The original prototype was a
   multicolumn tree-browser in the style pioneered by the
   NeXTstep file browser. Each scrolling column had a list of titles;
   clicking on an entry would either display that file in the
   pane below, or load the child directory listing in the
   column to the right.

   Naturally, the messages on the wire follow directly: open a
   connection to the target server on port 70; send a one-line
   'selector' string; get back either the file itself, or a
   list of further selector lines, each prefixed by a
   character indicating the type of file it points to. Table 2
   lists the rudimentary type system Gopher shipped with,
   illustrating on one hand its ambitions to assimilate new
   media types and on the other what a thin skein it was
   around protocol- and platform-specific data formats.
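
   To make the modus operandi concrete, here is a minimal sketch, in
   Python, of one such transaction as RFC 1436 describes it; the
   host name and selector are invented for illustration:

     # Minimal Gopher client round-trip: one selector line out, raw
     # bytes back until the server hangs up.
     import socket

     def gopher_fetch(host, selector, port=70):
         """Send a one-line selector and read until the connection dies."""
         with socket.create_connection((host, port)) as sock:
             sock.sendall(selector.encode("ascii") + b"\r\n")
             chunks = []
             while True:
                 data = sock.recv(4096)
                 if not data:          # the server's close is the only EOF
                     break
                 chunks.append(data)
         return b"".join(chunks)

     # An empty selector asks for the root menu. Each menu line starts
     # with a one-character type code (see Table 2), then tab-delimited
     # display string, selector, host, and port; a lone '.' ends it.
     menu = gopher_fetch("gopher.example.edu", "")
     for line in menu.decode("ascii", "replace").splitlines():
         if not line or line == ".":
             break
         itemtype, fields = line[0], line[1:].split("\t")
         print(itemtype, fields[0])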

   The flip side of 'thin skein' is 'ease of implementation',
   though. In effect, Gopher was a (very!) poor man's version
   of FTP, sans authentication, navigation, authoring, or
   bulk/restartable transfers. It only used TCP as a
   half-duplex fopen() call. It just happened that some of the
   "files" it transferred were menus, in a Gopher-specific,
   tab-delimited, US-ASCII format. So a Gopher server did
   little more than read files out of a specified directory
   tree; and clients could be hacked out of spare Telnet
   client code.
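
   The server side is scarcely more work. A sketch of that
   file-serving loop, again in Python, with a hypothetical document
   root and none of the sanity checks a real server owes its
   filesystem:

     # Toy single-shot Gopher server: accept, read one selector line,
     # shovel the named file back, close the connection. DOCROOT is
     # hypothetical; nothing here defends against '..' in a selector.
     import os
     import socket

     DOCROOT = "/var/gopher"

     def serve_once(port=70):
         with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
             srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
             srv.bind(("", port))
             srv.listen(1)
             conn, _addr = srv.accept()
             with conn:
                 selector = conn.recv(1024).split(b"\r\n")[0].decode("ascii")
                 # a real server answers an empty selector with a generated,
                 # tab-delimited menu; this sketch only vends plain files
                 path = os.path.join(DOCROOT, selector.lstrip("/"))
                 with open(path, "rb") as f:
                     conn.sendall(f.read())
                 # closing the connection is the only end-of-transfer signal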

   A little further up the evolutionary tree, servers allowed
   scripts and other programs to be served as well. Rather
   than copying a file back across the socket, the server connected
   it to some process's output. Similarly, multiprotocol clients
   could recognize a non-Gopher selector and switch to another
   language (as for phonebook queries). "Gateways exist to
   seamlessly access a variety of non-Gopher services such as
   ftp, WAIS, USENET news, Archie, Z39.50 (1992 rev), X.500
   directories, Sybase and Oracle SQL servers, etc."

   The base protocol didn't leverage TCP well, though. For
   example, TCP's three-way handshake ends with a packet from
   the server to the client -- so even when the client opens
   the connection, the server can send data first for free.
   Gopher did not have a server-hello message, and thus no
   version number to key upgrades off of. To signal the end of
   a particular transmission, Gopher borrowed the SMTP/NNTP
   convention of <CR><LF>.<CR><LF>. This only worked for its
   own menu format, though -- as soon as an actual file was
   transferred, TCP FIN had to be used as the EOF.

   This left lots of socket buffers in the FIN_WAIT state at
   the server side, just like the Web. At the same time, the
   Gopher Team was planning to scale to Internet-wide use.
   They did plan for redundant backup servers: a '+' line was
   an alternate for the selector on the previous line. They
   maintained a master list of publicly-accessible servers,
   which was in turn indexed regularly by Veronica (Very Easy
   Rodent-Oriented Net-wide Index to Computerized Archives, by
   analogy to the Archie FTP server index).

   In the long run, though, no one ever plastered Gopher
   selectors on shirts and lunch trucks and golf tees...

Spinning the Web

     "The key insight I credit Tim Berners-Lee with, is the
     URL: the idea that there's a Uniform Resource Locator
     that says I can point at any bit of information on the
     Internet." -- some obscure Web scholar in Stephen
     Segaller's new book Nerds 2.01: A Brief History of the
     Internet

   If anything, Tim's original Web manifesto was even more
   ambitious than Gopher's. Its hook for assimilating "the
   entire universe of network-accessible information" was the
   URI, which is able to subsume entire namespaces as merely
   one more scheme (telnet:, mailto:, and so on). URIs reduced
   operational instructions on how to access a resource to a
   declarative address. For example, contrast the one-line
   ftp: URL with the MIME message/external-body part for FTP
   access -- logins, passwords, alternate sites, alternate
   directories, alternate formats. As RFC 1630 intended, "URIs
   may, if necessary, be passed using pen and ink."

   But while URIs were the innovative essence of the Web, the
   other two legs of the triad were less steady. HTML is
   merely a markup language which makes it natural to embed
   URIs as hyperlinks. URIs could just as well have been
   popularized as a pointer format within Rich Text Format,
   PostScript, LaTeX, or any other language. HTTP is merely a
   lookup protocol which makes it natural to resolve http:
   URLs or gateway to other schemes. URIs could just as well
   have been popularized as a naming interface to existing
   clients for FTP, Gopher, News, etc.

   Somehow, though, HTTP/0.9 took root. It is a famously
   'simple' protocol, arguably even less useful than Gopher.
   Here is its entire grammar:

     Simple-Request = "GET" SP Request-URI CRLF
     Simple-Response = [ Entity-Body ]

   That's right: it only supported one method, GET, and
   required a new TCP connection for every transaction. Even
   the 1.0 edition, which eventually added authentication and
   navigation and simultaneous transfers of several files for
   a single "page", had no excuse to be born when FTP was on
   the shelf. Except that HTTP had broken authentication,
   nonstandard conventions for navigating directories and
   welcome/index/home pages, and ran spectacularly afoul of
   slow-start for each of its typically-small file transfers.
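
   Exercising the entire 0.9 grammar takes only a few lines; here is
   a sketch in Python, with an illustrative host and path:

     # All of HTTP/0.9 from the client's chair: one GET line out, the
     # bare entity body back -- no status line, no headers, and one
     # TCP connection per request.
     import socket

     def http09_get(host, path, port=80):
         with socket.create_connection((host, port)) as sock:
             sock.sendall(f"GET {path}\r\n".encode("ascii"))
             body = b""
             while True:
                 chunk = sock.recv(4096)
                 if not chunk:         # again, FIN doubles as end-of-entity
                     break
                 body += chunk
         return body

     page = http09_get("info.cern.ch", "/hypertext/WWW/TheProject.html")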

   If HTTP/0.9 was really shipped for no better reason than
   that programming separate FTP-Control and FTP-Data channels
   seemed too hard, why did it survive? Many of the earliest
   "web" sites were, in fact, just ftp: URLs. The saving grace
   was 1.0's new metadata headers, which augmented the one-line
   request with additional fields for authentication,
   arguments modifying the method, and information about the
   user-agent's preferred languages and formats. The entity
   bodies it returned were prefixed by MIME Content-Types,
   last-modified dates, base-URI, and more. This information,
   finally, was what took HTTP beyond the realm of file
   transfer to hypertext transfer.
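
   On the wire, that metadata looks something like the following
   sketch of a 1.0 exchange (the host and path are invented), with a
   MIME-style header block prefixing the entity:

     # Sketch of an HTTP/1.0 exchange: the request carries the
     # user-agent's preferences, and the response prefixes the entity
     # with Content-Type, Last-Modified, and friends.
     import socket

     REQUEST = (
         "GET /index.html HTTP/1.0\r\n"
         "User-Agent: toy-client/0.1\r\n"
         "Accept: text/html, image/gif\r\n"
         "Accept-Language: en, fr\r\n"
         "\r\n"
     )

     with socket.create_connection(("www.example.edu", 80)) as sock:
         sock.sendall(REQUEST.encode("ascii"))
         raw = b""
         while True:
             chunk = sock.recv(4096)
             if not chunk:
                 break
             raw += chunk

     head, _, body = raw.partition(b"\r\n\r\n")
     status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
     headers = dict(line.split(": ", 1) for line in header_lines if ": " in line)
     print(status_line)                  # e.g. "HTTP/1.0 200 OK"
     print(headers.get("Content-Type"))  # the 'hypertext' in hypertext transfer
     print(headers.get("Last-Modified"))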

   The headers are how HTTP's adoption cycle diverged from
   Gopher's. Both servers were simple to install and author
   content for -- there was even a hybrid gn server which
   could vend files over both ports 70 and 80. There were kudos
   for new sites, in the form of the Web Virtual Library
   project at CERN, and later W3C, and early search engines
   like Brian Pinkerton's NeXTstep DigitalLibrarian-derived
   WebCrawler. The client was the defining wedge. Graphical
   web browsers leveraged rich HTML text with MIME media type
   headers to display inline images, launch helper
   applications, and present multilingual content.

Extensibility: the fingerprint of a killer

   The straw that broke the Gopher's back wasn't even a
   feature of HTTP per se. The synergy of HTML and HTTP in the
   graphical browser began with FORMs, continued through refresh
   META tags, and grew up through scripting and applet OBJECTs.
   While one protocol stayed lean, the other opened up so much
   headroom that it grew into a beast so complex that even today's
   base 1.1 specification runs to 160+ pages.

   An avalanche of new headers modified GET to only retrieve
   the full entity if-modified-since an existing copy, or only
   a specified byte-range; enhanced security with keyed-digest
   passwords; even reverse-engineered state back into HTTP
   using 'cookies.' HTTP reply codes were added for payments,
   for cache validation, for upgrading to new protocols.
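
   Two of those headers are enough to show the flavor; a sketch
   using Python's standard library, with a made-up URL and cache
   date:

     # Conditional, partial GET: only send the entity if it changed
     # since our cached copy, and even then only its first 2 KB.
     import urllib.error
     import urllib.request

     req = urllib.request.Request(
         "http://www.example.edu/big-report.html",
         headers={
             "If-Modified-Since": "Sat, 19 Dec 1998 08:00:00 GMT",
             "Range": "bytes=0-2047",
         },
     )
     try:
         with urllib.request.urlopen(req) as resp:
             print(resp.status, len(resp.read()))  # 206 Partial Content, if honored
     except urllib.error.HTTPError as err:
         print(err.code)                           # 304 Not Modified surfaces here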

   Entirely new functionality could be grafted on HTTP using
   new methods. WebDAV added LOCK and PROPFIND; collections, which
   allowed MOVE and COPY on entire swaths of URLspace; and
   even went beyond new headers to encode additional
   parameters in XML. Conversely, HTTP was borrowed wholesale
   for new protocols for printing (IPP), multiparty call setup
   (SIP), and event notification (RVP).

   Politically, any server or client could inaugurate a new
   method; third-parties could even thread themselves into the
   proxy chain and filter messages in transit. Proxies exist
   to add public annotations (Crit-Link Mediator), anonymize
   access (Lucent Personal Web Assistant), convert file
   formats (ProxiNet for PalmIIIs), translate languages
   (AltaVista's BabelFish can be so hacked), and filter
   content (PICS). Decentralized extensibility was key to
   HTTP's adoption. Rather than waiting for the IETF standards
   process to revise a canonical table of one-letter content
   codes, for example, MIME headers could deploy new
   content-types on the spot.

   At the same time, uncontrolled divergence reduces the value
   of the Web platform, since the odds that any two independently
   extended implementations will match up keep declining. New
   versions of HTTP would
   appear to be one solution, but there is little political
   will to empanel a standing working group to review a
   grab-bag of extensions for 1.2, 1.3, ... 39.7 and so on.
   What is necessary? Some hook for identifying extensions,
   the level of support required, and which parts of URLspace
   are so extended.

   W3C has been promoting such mechanisms for years -- three
   calendar years! -- to better "Lead the [Controlled]
   Evolution of the World Wide Web." At first, the motivating
   scenarios were complex economic and social negotiations
   (payments, content selection, privacy, cryptographic
   algorithms, &c). Early revisions of PEP advertised
   preference lists of alternative extensions, applied in
   pipelined sequence. PEP could also transport metadata about
   which extensions applied to other resources, and in what
   order they were required.

   Though PEP remained a W3C experiment, the basic ideas of 1)
   packaging an entire extension -- methods, headers,
   encodings and all -- behind a single name and 2) indicating
   clearly whether an extension was required or optional for 3)
   the next hop or end-to-end were reincarnated as the
   Mandatory proposal. The current draft defines four header
   fields enumerating the URIs of the extensions which apply
   (Man: and Opt: for client and server; C-Man: and C-Opt: for
   the next hop). To interoperate with the installed base, any
   request with mandatory extensions prefixes the method name
   with 'M-', which forces 1.0 and 1.1 servers to fail with
   "501 Not Implemented."

Transport: the Achilles' heel

   By April 1995, Web usage passed Gopher's share, as well as
   that of every other Internet application protocol, as a
   fraction of NSFnet backbone traffic. Even patching its
   bandwidth leaks with 1.1's persistent connections can't mask
   HTTP's wasteful use of Internet resources. Its content base
   has grown so immense that caching alone can't reduce
   bandwidth demand by a significant margin for the average
   user or ISP. Former strengths have now become liabilities:
   full 'English' headers, stateless transactions, and a full
   enumeration of client capabilities, once critical for
   debugging, now inflate request packets, while progressive
   rendering of composite web pages places a premium on
   simultaneous downloads (aggravating the share of packets
   wasted on handshakes and delayed by slow-start).

   Individually, these phenomena have appeared in application
   protocols before -- and were nipped in the bud by active
   involvement by transport protocol folks, or coevolved with
   TCP. When Telnet generated a packet per keystroke, even on
   links where you could type faster than the ACKs returned,
   the Nagle timer throttled TCP down to batching keystrokes
   into a single packet per round-trip. When file transfers
   immediately swamped LANs, slow-start eased up to limit
   congestion.

   So when HTTP presents its demand for many small, immediate,
   parallel transfers, there are solutions on the shelf.
   Some vendors are already marketing "accelerators" which
   rely on matched encoders at server and client, like
   Sitara's SpeedServer. Unfortunately, these approaches
   aren't backward compatible with HTTP as-is -- nor as simple
   to code.

   W3C's HTTP-NG (Next Generation) project sees it as a
   three-layer problem: message transport, remote invocation,
   and "The Classic Web Application" (TCWA). At the bottom,
   they propose a multiplexing protocol which combines all the
   traffic destined for a single web server into a single TCP
   connection. This allows the conversations regarding
   individual transfers to be diced up independently: an
   intelligent server might respond with the first few
   critical bytes of all the images in the first packet. A
   fully specified W3Mux layer still needs to address flow
   control (gating the relative priority of the subchannels
   without starving them), simultaneous open (which TCP does
   allow), and lots of interoperability testing.

   The middle layer 'compresses' the data to be transferred by
   marshalling arguments in native data types rather than
   English. So numbers would be sent as integers; dates as
   binary fields; and strings as reusable tokens. The tip of
   their proposal (reaching into the cloudy heights of
   improbability, some critics might say) is reengineering the
   Web as a distributed object system, modeled as operations
   on data structures rather than documents.

Building the Perfect Beast

   Having slain Gopher (and a host of other competitors), the
   Web appears to be expanding in all directions today. Just
   as the SMTP state machine defined a style of protocol
   design for a generation, every new application these days
   seems to ape HTTP's stateless style -- if only to
   piggyback a free ride through firewalls at port 80. The
   Internet Architecture Board is on the verge of issuing
   guidelines to rein in this tendency. First, the HTTP port
   should be reserved for protocols which manipulate the kind
   of information already in web servers. WebDAV is OK, since
   it extends documents; IPP is not, since it uses POST as
   a barn door for its printing procedure calls. Second, HTTP
   should not be referenced by-copy. Lots of other efforts
   want to 'borrow' a few headers for authentication or
   navigation and end up copying text into other specs,
   raising the specter of divergent versions. Third, HTTP is
   not a valid precedent for MIME users -- it plays fast and
   loose with certain rules which enable absolute
   interoperability through mail gateways. HTTP allows bare
   <CR> and <LF>, and invents its own Content-Encoding:
   process.

   The IAB guidelines are a reminder that HTTP still occupies
   a limited niche in the space of possible Transfer
   Protocols. True, its message format can encode any MIME
   entity, even live data streams -- but only from the server
   to the client. True, its address format can point at
   anything -- but even then it can't encode whether the
   resource is expected to be secured, paid for, or otherwise
   reprocessed. And it is most certainly restricted to
   one-to-one, synchronous, request-response (client-pull)
   interaction. All the various hacks to 'push' data to groups
   of subscribers are, well, hacks: automatic polling, shared
   caches, and "we will mail your response within two working
   days".

   The moral, then, for would-be successors to HTTP is to pick
   another niche and follow the network effects.
   Internet-scale event notification, for example, requires
   true push and subscriber management. Current efforts to
   shove presence information through HTTP may succeed, but
   any larger shift to event-based integration could catalyze
   a new, symmetric protocol which allows servers to initiate
   connections back to clients. Simply compressing existing
   HTTP traffic without adding new functionality faces a
   powerful entrenched competitor. Adding programmability,
   whether betting on HTTP-NG or the Mandatory scheme, is an
   attempt to harness network effects between all the
   communities that have been angling to extend HTTP to their
   application. Yet, the rewards for cooperation are spread
   thin: only the server which simultaneously supports HTTP,
   WebDAV, and Internet Printing Protocol would gain from a
   common syntax for those extensions.

   And now we are left back where we started: the mystery of
   protocol adoption. Reuse, extensibility, and simplicity are all
   tricks to reduce the cost of implementing a new protocol --
   but that doesn't mean the market as a whole will make the
   most efficient decision. If you're a Gopher, it may just
   bury you :-)

   The morbid conceit of this article was inspired by the
   first-rate book Aramis by French historian of technology
   Bruno Latour. It told the tale of two decades of R&D on a
   Parisian peoplemover system in the first person -- the
   train itself speaking of its struggle to become, and its
   death.
     ______________________________________________________

  Table 1. Fingerprints of a murderer -- Spider or Gopher?

   Vitals: Academic NeXTstep hack, about 8 years old; birthmark at
   port 70 or 80

   First Victim: Notorious primary use was for mere phonebooks;
   gatewayed queries to existing campus database and represented
   results as hypertext nodes

   Disguise: Began without any meaningful type system at all, but
   soon grew past file extensions to vend multimedia information,
   even negotiating formats

   Modus Operandi: Open a new connection, send a one-line demand for
   the information at a given address, then receive the response
   until connection dies

   Come-on: Infiltrate small workgroups with small, easy, and free
   tools, promising acceptance with a simple `backlink' from a
   master list of servers

   Serial Victims: Reaching beyond the degenerate case of static
   information, its addresses became entry points to scripts,
   databases, and other `stateless' servers

   Motive: Even its first manifesto laid out hegemonic plans: to
   incorporate, by reference or by proxy, every other Internet
   information service.
     ______________________________________________________

  Table 2. Gopher's rudimentary type system used character codes to
  prefix selector lines. Gopher+ later added content negotiation.

   0   File
   1   Directory
   2   CSO phone-book server
   3   Error
   4   BinHexed Mac file
   5   DOS binary file
   6   Uuencoded Unix file
   7   Index-Search server
   8   Text-based telnet session
   9   Binary file
   +   Redundant server
   T   Text-based tn3270 session
   g   GIF image file
   I   Any image file
     ______________________________________________________

  Table 3. RFCs and Internet-Drafts discussed in this issue.

   RFC 1436 (Mar 1993): The Internet Gopher Protocol
     Described a `fiercely simple' request-response protocol and
     menu format

   RFC 1630 (June 1994): Universal Resource Identifiers
     Raised the stakes from Gopher's own selectors to any namespace

   RFC 1689 (Aug 1994): Status Report on Networked Information
   Retrieval: Tools and Groups
     A wonderful archaeological census of competing protocols, many
     now extinct

   RFC 1945 (May 1996): HyperText Transfer Protocol 1.0
     Also defined the more primitive 0.9 GET. Even 1.0 only added
     HEAD and POST

   RFC 2068 (Jan 1997): HyperText Transfer Protocol 1.1
     A vastly expanded contender with better caching, security, and
     extensibility

   RFC 2227, Proposed Standard (Oct 1997): Simple Hit-Metering and
   Usage-Limiting for HTTP
     Entire RFC defined a single Meter: header

   RFC 2396, Draft Standard (Aug 1998): Uniform Resource Identifiers:
   Generic Syntax
     The deceptively small definitive grammar took years to hammer
     out

   RFC TBA, Proposed Standard (Dec 1998): Extensions to HTTP for
   Distributed Authoring
     Adding locks, collection-resources, and namespace operations
     took major reengineering

   I-Draft (Nov 1997): PEP -- an Extension Mechanism for HTTP
     `Protocol Extension Protocol' had negotiation policies for
     pipelining message processors

   I-Draft (July 1998): HTTP State Management Mechanism
     Cookies took three years from a quick, risky hack to a
     reasonably secure mechanism

   I-Draft (Aug 1998): On the use of HTTP as a Substrate for Other
   Protocols
     Official -- conservative -- IESG and IAB guidance on port
     assignment, URI schemes, and HTTP security

   I-Draft (Sep 1998): HTTP Authentication: Basic and Digest
     A companion specification for nonexistent (Base64-encoded
     passwords!) and new security

   I-Draft (Nov 1998): HTTP Extension Framework
     Pared down to headers which label `Mandatory' and optional
     extensions of each message

   I-Draft (Nov 1998): Upgrading to TLS Within HTTP/1.1
     Using the Upgrade: header for Transport Layer Security, in lieu
     of a dedicated port like https: at 443