Recently I was talking to an academic friend about avoiding
LLM scraping, and they submitted my question to Google's
chatbot, a GPT-2 derivative. There were several interesting
points in the response, the gist of which was that it is an
arms race.

Google suggested "honeypotting" with garbage content.

The most useful kind of garbage content I can think of comes
from using the dark net. "Dark net" refers to SOCKS-proxying
your traffic, which makes it cryptographically homogeneous
and ambiguous. The term mostly refers to Tor onion routing,
started as a US Navy project and now run by a charity, or to
the smaller, university-born i2p.

For example, when I ssh to a tilde, I route through Tor. Both
the tilde and I know who the other is, so this is not useful
for hiding simple crime in general, despite the widely
promoted myth. Since the tilde has a clearnet address, one
exit node (gateway to the clearnet) knows that some Tor user
has sshed to the tilde. My ISP can only harvest that their
user is connected to onion routing, and nothing else.
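As a minimal sketch, routing an ssh session through Tor can be
done with torsocks against the default local SOCKS port; the
user and host names here are hypothetical placeholders:

```shell
# Wrap ssh so its traffic goes through the local Tor
# SOCKS proxy (by default 127.0.0.1:9050).
torsocks ssh user@tilde.example
```

The ISP then sees only a connection into the Tor network, not
the tilde at the other end.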

This way, much of the information a person leaks about their
internet usage becomes unavailable to data merchants. I
suppose this is only metadata for constructing a large
language model, though.

Aside: In general one would never need or want to touch the
clearnet, whose connections are scraped. It's just that it's
hard for people to find out how not to use it, because
capitalists don't profit from people being safe from them.

A eusocial advantage of the dark net is that its emphasis on
safety makes self-hosting safer and easier, since all
connections are started by you connecting outwards to
participate.
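As a sketch of what that outward-only self-hosting looks like,
i2pd exposes a local service as a server tunnel in its
tunnels.conf; the section name, port, and key file below are
hypothetical, only the field names come from i2pd:

[my-site]
type = http
host = 127.0.0.1
port = 8080
keys = my-site.dat

i2pd then announces the site inside i2p, so no inbound
clearnet port ever needs to be opened.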

My experience was that i2pd, the C++ i2p implementation, was
confusing and unreliable, but it seems to be working well now
that I have installed and configured it on OpenBSD via
pkg_add i2pd

and then followed the instructions in

/usr/local/share/doc/pkg-readmes/i2pd
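A minimal sketch of the remaining steps, assuming the stock
OpenBSD rc.d service name i2pd:

```shell
# Enable the i2pd daemon at boot and start it now,
# using OpenBSD's rcctl as root.
rcctl enable i2pd
rcctl start i2pd
```
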

So we can replace browsing tracking with opaque garbage as a
first step in fighting capitalist scraping, and we can
self-host to avoid leaking our own, and our visitors', data
to proprietary hosting.