http://rachelbythebay.com/w/2021/11/06/sql/

Writing

Software, technology, sysadmin war stories, and more. Feed
Saturday, November 6, 2021

A terrible schema from a clueless programmer

The first time you encounter something, you're probably going to make
some mistakes. There's a post going around tonight about how someone
forgot to put an index on some database thing and wound up doing full
table scans (or something like that). The rub is that instead of just
being slow, it also cost a fair amount of money because this crazy
vendor system charged by the row or somesuch. So, by scanning the
whole table, they touched all of those rows, and oh hey, massive
amounts of money just set ablaze!

The usual venues are discussing it, and I get the impression some
people have the wrong approach to this. I want to describe a truly
bad database schema I encountered, and then tell you a little about
what it did to the system performance.

This system was something that was intended to slow down or stop a
lot of the "open proxy" spam that was going around at the time - late
2002. It was apparent that these spammer scum would grab an open
proxy and then would connect outward to your mail exchanger and would
try to spew a spam at you. Blocking them did little since they forged
everything - the HELO, the MAIL FROM, you name it. They also didn't
use that same open proxy twice, so blocking it also didn't help much.

There were lists of these open proxy hosts, but plenty of "legit"
organizations which were mailing us were sufficiently clueless to
have landed on those lists, and the boss had decreed there would be
no use of those lists. We needed something else to do the job.

The observation was that we could probably store the IP address, HELO
string, FROM address and TO address in a table, and send back a 4xx
"temporary failure" error the first time we saw that particular tuple
(or "quad"). A real mail server which did SMTP properly would retry
at some point, typically 15 minutes to an hour later. If it did retry
and enough time had elapsed, we would allow it through.

Simple enough, right? Well, the database used for this thing was
absolutely terrible. It was bog-standard MySQL but that's not the bad
part. The bad part is the schema. The rows in the table looked like
this:

id | ip | helo | m_from | m_to | time | ...

"Big deal", you might think. Isn't it supposed to be those four
things plus a time? Yes, it is, but the problem was HOW it was
stored. Those things were STRINGS (you know, varchars in
database-speak). So, every damn row in the single "quads" table
looked like this:

1 | ip1 | blahblah | evil@spammer.somewhere | victim1@our.domain | ...
2 | ip2 | foobar   | another@evil.spammer   | victim2@our.domain | ...
3 | ip3 | MAILSERV | legit@person           | user@our.domain    | ...
4 | ip4 | foobar   | another@evil.spammer   | victim1@our.domain | ...

Every row was extremely heavyweight: it had the actual raw string in
it. Now think about what the database had to do to check for a match.
We'd say something like this:

SELECT whatever FROM quads WHERE ip='ip1' AND HELO=
'thing_they_sent_us' AND m_from='whatever' AND m_to='some_user'

The poor database had to go through and start doing string
comparisons all the way down the table! This meant it had to read
every single row in the table, and then it had to start comparing
characters from the query to what it found in that row. It probably
aborted a row once one of the columns didn't match, but still, it was
slow and nasty as hell.

You know what it should have been? It should have been normalized.
*Someone* should have heard about things like "third normal form" and
all of that stuff. Nobody had heard of that, and so it went, bumbling
along, chewing memory and CPU and disk space, and being oh-so-slow.

The person who designed and programmed it really screwed up that
initial implementation, and it kind of went like this for a while -
maybe a couple of months? I'd have to dig through the logs to see how
long the system was groaning under the weight of this thing.

Still, eventually, things changed. The whole system was reworked to
have a table for each set of values. There was now a table of IP
addresses, a table of HELO strings, a table of FROM addresses, and a
table of TO addresses. Each table was simple enough, with a schema
like this:

id | value

Each table was setup such that "id" was the primary key, naturally,
and "value" was unique, so it got an index. This meant you could ask
the database to do something involving finding the id given the
value, and it would efficiently find it for you without reading the
whole thing.

Then the "quads" table was changed to use those numeric values for
each row, where any given entry was one of the ids from the other
tables (ip, helo, from, to). It's a foreign key relation, but the
programmer didn't know that, either.

In any case, the first four rows in the table might've now looked
like this:

1 | 10 | 11 | 15 | 2 | ...
2 | 84 | 99 | 44 | 3 | ...
3 | 91 | 71 | 16 | 4 | ...
4 | 89 | 99 | 44 | 2 | ...

Then, a multi-column unique constraint was set up using the (ip,
helo, from, to) as the inputs, and so that too added an index. Now,
you could ask the database to look up (10, 11, 15, 2) or whatever,
and it would do it relatively efficiently thanks to that index.

We came out the other side with a system that quickly and efficiently
processed this data and which did its intended purpose of slowing
down or stopping most of the open proxy abuses that had been headed
our way.

Now, what do you suppose happened to that clueless programmer who put
in the schemas with raw strings (yes, varchars) galore and didn't
know anything about foreign key relationships?

Well, that's easy. She just wrote this post for you. That's right, I
was that clueless newbie who came up with a completely ridiculous
abuse of a SQL database that was slow, bloated, and obviously wrong
at a glance to anyone who had a clue.

My point is: EVERYONE goes through this, particularly if operating in
a vacuum with no mentorship, guidance, or reference points.
Considering that we as an industry tend to chase off anyone who makes
it to the age of 35, is it any surprise that we have a giant flock of
people roaming around trying anything that'll work?

It's a massive problem, and we're all partly responsible. I'm trying
to take a bite out of it now by writing stuff like this. What about
you?

---------------------------------------------------------------------
November 7, 2021: This post has an update.

More writing | Contact / send feedback