https://www.etymonline.com/columns/post/who-lusts-for-certainty-lusts-for-lies

Who Lusts for Certainty Lusts for Lies

September 21, 2023 at 1:40 pm

We need to talk about the Google Ngram Viewer n-grams. They are
wrong. [D.R.H.]

Here's the Ngram's idea of the frequency of the word said:

[Image: Google Ngram Viewer chart for "said"]

It doesn't look like an indicator of the diachronic change in the
popularity of a very common English verb during the 20th century. It
looks like the temperature graph of the last ice age. Younger people,
rest assured that English authors in the 1970s did not all stop using
"said" and then start again.

Talia Felix and I plow in Google Books every day, researching. It is
a marvel of a resource, but we know by experience how ineptly
assembled that database is. And how many booby-traps lie hidden in
it.

I cannot tell you much about how AI works. But I can tell you how AI
handles something I know how to handle. Here's another example: "The
Great Toast Famine of '77."

[Image: Google Ngram Viewer chart for "toast"]

Ngram says toast almost vanishes from the English language by 1980,
and then it pops back up.


WHY THAT'S WRONG

There's a long-documented flaw in the Ngram formula, inherited from
Google Books. The error makes a vast number of English words appear
to be diminishing in use through the 20th century only to revive
around 1980.

The rough explanation seems to be that Google Books' corpus is heavily
academic. The printed matter Google sucked up from universities held a
disproportionate share of modern scientific and academic journals. The
articles in those journals and textbooks lean on the same few words
(as academics are wont to do when they write).

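An Ngram score is a share, not a count: a word's uses divided by all
the words in the corpus for that year. So the glut not only bloats the
scores for those few words, it falsely drives down the other words.
That creates the mid-20th-century "dip" in the Ngram of almost every
word.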
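
A toy calculation makes the arithmetic plain. The figures below are
invented for illustration, not taken from Google's data; the one
assumption is that an Ngram score is a word's count divided by all the
words counted for that year.

```python
def relative_frequency(word_count, total_words):
    """Share of the year's corpus taken up by one word."""
    return word_count / total_words

# Invented figures: "said" is printed exactly as often in both years.
said_count = 5_000
general_words = 1_000_000   # novels, newspapers, and the like

# Year A: the corpus holds general writing only.
share_before = relative_frequency(said_count, general_words)

# Year B: the same general writing, plus a glut of journals
# that (in this toy case) never use "said" at all.
journal_words = 1_000_000
share_after = relative_frequency(said_count, general_words + journal_words)

print(f"{share_before:.4%}")   # 0.5000%
print(f"{share_after:.4%}")    # 0.2500%
```

Nobody stopped writing "said" between Year A and Year B. Its score
halved only because the pile it is measured against doubled.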

Said likely appears less often in academic writing than in other
writing, such as a novel or a newspaper. But academic papers use
words such as, say, graph, a good deal more often. And here's what
the Ngram for graph looks like in the 20th century:

[Image: Google Ngram Viewer chart for "graph"]

See? No dip.

That's just one error. Here's another: If you look at an Ngram for
the F-word, you'll see very little use of it until modern times,
which is expected. But the number of hits for it jumps up as you go
back past about 1820, and keeps rising into the late 1700s (if you
could see it). Those are all the word suck, written with the old
"long" -s- -- the printer's -s-. It looks like a lower-case -f- in
worn-down fonts on cheap paper in old libraries. The use of that
character faded out about 1820. Sometimes only context tells you
whether it is an -f- or an -s-. AI has no clue.
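
Here is a rough sketch of why spelling alone cannot settle it. It
treats every -f- in a scanned word, except a final one (the long -s-
was not used at the end of a word), as a possible -s-, then keeps the
readings found in a word list. The function names and the tiny word
list are made up for illustration; this is not how Google's scanner
works.

```python
def long_s_candidates(token):
    """All readings of a scanned token where a non-final 'f' may be a long s."""
    readings = {""}
    last = len(token) - 1
    for i, ch in enumerate(token):
        options = [ch]
        # A final 'f' stays 'f': the long s did not end words.
        if ch == "f" and i != last:
            options.append("s")
        readings = {r + o for r in readings for o in options}
    return readings

# Stand-in word list (an assumption; a real check needs a period dictionary).
KNOWN_WORDS = {"ship", "some", "fat", "sat", "fight", "sight"}

def plausible_readings(token):
    """Candidate readings that are real words, in alphabetical order."""
    return sorted(long_s_candidates(token) & KNOWN_WORDS)

print(plausible_readings("fhip"))   # ['ship']
print(plausible_readings("fat"))    # ['fat', 'sat']
```

The first case resolves itself, since "fhip" is no word. The second
does not: both readings are good English, and only the sentence around
the word can say which one the printer set.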

Here's another: Google Books fails to recognize identity in variant
spellings. The Ngram for authorise is different from that for
authorize, and neither counts authorizes. Nor does Google fold plural
forms into the Ngrams for nouns. It can't tell that dog and dogs are
one word.
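
To show what folding the variants together would involve, here is a
minimal sketch. The counts are invented, and the hand-written table of
variants is a stand-in for real dictionary work; nothing here reflects
how Google Books operates internally.

```python
from collections import Counter

# Hypothetical raw counts, one entry per exact spelling,
# the way an Ngram tallies them.
raw_counts = Counter({
    "authorise": 40, "authorize": 60, "authorizes": 25,
    "dog": 300, "dogs": 200,
})

# Hand-made mapping from each printed form to one headword
# (an assumption, standing in for a proper lemmatizer).
HEADWORD = {
    "authorise": "authorize", "authorize": "authorize",
    "authorizes": "authorize",
    "dog": "dog", "dogs": "dog",
}

def fold_variants(counts, headword_map):
    """Sum the counts of every spelling and inflection under its headword."""
    folded = Counter()
    for form, n in counts.items():
        # Forms missing from the map are kept under their own spelling.
        folded[headword_map.get(form, form)] += n
    return folded

folded = fold_variants(raw_counts, HEADWORD)
print(folded["authorize"])   # 125
print(folded["dog"])         # 500
```

The hard part is the table, not the sum. Someone has to know that
authorise and authorize are the same word before any machine can add
them up.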

Worse, many of Google Books' files are misdated. On a battered
library book, an "1896" on the cover page can look like "1800" to a
digital scanner. A stack of Bible tracts from the 1910s long appeared
in Google Books as published in 1799. That date did appear on all
their covers -- on the logo of the Bible tract society that printed
them, as the date of its founding. I hardly trust Google Books' dated
search results to be right five times in a row. We even made a video
about it.


BUT PEOPLE WANT THEM

The text of Etymonline is built entirely from print sources, and is
done entirely by human beings. Ngrams are not. They are unreliable, a
sloppy product of an ignorant technology, one made to sell and
distract, one never taught the difference between "influence" and
"inform."

Why are they on the site at all? Because now, online, pictures win
and words lose. The war is over; they won. Just remember: Ngrams are
unreliable. Even if the world now prefers Ngram reality, where the
word "said" went into eclipse with Jimmy Carter, you're allowed to be
smarter than that.

When you see an Ngram on etymonline or anywhere else, admire it as
decorative, whimsical, a gourami tank in a restaurant, abstract art
on hotel walls, blueprints for roller-coasters. And where the Ngrams
disagree with etymonline on a first date, presume we're right and
they're wrong.

(c) 2001-2023 Douglas Harper