[HN Gopher] A terrible schema from a clueless programmer
___________________________________________________________________
 
A terrible schema from a clueless programmer
 
Author : zdw
Score  : 475 points
Date   : 2021-11-07 14:55 UTC (8 hours ago)
 
web link (rachelbythebay.com)
w3m dump (rachelbythebay.com)
 
| arein3 wrote:
| To me the first implementation seems the best; it just needed a
| multi-column index.
| 
| The 3NF "optimized" form seems worse and unreadable. I would
| advise normalizing only when it's needed (complex/evolving
| model), and in this case it was not needed.
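| 
| A minimal sketch of what I mean (assuming MySQL, and guessing the
| table/column names since the post doesn't give them):
| 
|     -- one composite index covering the whole lookup tuple
|     CREATE INDEX idx_tuple
|         ON mail_log (ip, helo, mail_from, rcpt_to);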
 
| biglost wrote:
| Or store the IP address as an int64 with an index, or better,
| both solutions together.
 
| stickfigure wrote:
| Assuming this needs to be optimized for massive scale, just hash
| the values to a single indexed field.
| 
| And use something other than an RDBMS. Put the hash in Redis and
| expire the key; your code simply does an existence check for the
| hash. You could probably handle gmail with a big enough cluster.
| 
| That super-normalized schema looks terrible.
 
  | kodt wrote:
  | I too recommend using Redis in 2002.
 
    | stickfigure wrote:
    | Fair enough, but I think the issue here is recognizing that
    | the underlying business operation is a hit test on an
    | expiring hash key. You could have used MySQL as a key/value
    | store or looked for something more specialized.
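    | 
    | A sketch of that key/value shape in plain MySQL (hypothetical
    | names; the hash of ip+helo+from+to is computed in the app):
    | 
    |     CREATE TABLE seen (
    |         tuple_hash BINARY(20) NOT NULL PRIMARY KEY,
    |         expires_at DATETIME NOT NULL
    |     );
    | 
    |     -- hit test: the row exists and hasn't expired yet
    |     SELECT 1 FROM seen
    |      WHERE tuple_hash = ? AND expires_at > NOW();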
 
| kazinator wrote:
| > _The whole system was reworked to have a table for each set of
| values. There was now a table of IP addresses, a table of HELO
| strings, a table of FROM addresses, and a table of TO addresses._
| 
| Lisp interning, in database tables. A.k.a. the Flyweight pattern.
 
| ccleve wrote:
| Sorry, no. The original schema was correct, and the new one is a
| mistake.
| 
| The reason is that the new schema adds a great deal of needless
| complexity, requires the overhead of foreign keys, and makes it a
| hassle to change things later.
| 
| It's better to stick to the original design and add a unique
| index _with key prefix compression_, which all major databases
| do these days. This means that the leading values get compressed
| out and the resulting index will be no larger and no slower than
| the one with foreign keys.
| 
| If you include all of the keys in the index, then it will be a
| covering index and all queries will hit the index only, and not
| the heap table.
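| 
| Roughly like this (a sketch; the table and column names are
| invented, since the post doesn't show the DDL):
| 
|     -- one unique index across all four columns; prefix compression
|     -- squeezes out the repeated leading values, and the equality
|     -- lookup is satisfied from the index alone
|     CREATE UNIQUE INDEX idx_tuple
|         ON mail_log (ip, helo, mail_from, rcpt_to);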
 
  | silisili wrote:
  | I think it depends a lot on the data. If those values like IP,
  | From, To, etc keep repeating, you save a lot of space by
  | normalizing it as she did.
  | 
  | But strictly from a performance aspect, I agree it's a wash if
  | both were done correctly.
 
    | philliphaydon wrote:
    | Space is cheap now though. Better to duplicate some data and
    | avoid a bunch of joins than to worry about saving a few GB of
    | space.
 
      | wvenable wrote:
      | Using space to avoid joins will not necessarily improve
      | performance in an RDBMS -- it might even make it worse.
 
      | kasey_junk wrote:
      | This system was written in 2002.
 
      | acdha wrote:
      | It's not that easy: you need to consider the total size and
      | cardinality of the fields potentially being denormalized,
      | too. If, say, the JOINed values fit in memory and,
      | especially, if the raw value is much larger than the key it
      | might be the case that you're incurring a table scan to
      | avoid something which stays in memory or allows the query
      | to be satisfied from a modest sized index. I/O isn't as
      | cheap if you're using a SAN or if you have many concurrent
      | queries.
 
  | nerdponx wrote:
  | And even if this is somehow not an ideal use of the database,
  | it's certainly far from "terrible", and to conclude that the
  | programmer who designed it was "clueless" is insulting to that
  | programmer. Even if that programmer was you, years ago (as in
  | the blog post).
  | 
  | Moreover, unless you can prove with experimental data that the
  | 3rd-normal-form version of the database performs significantly
  | better or solves some other business problem, then I would
  | argue that refactoring it is strictly worse.
  | 
  | There are good reasons not to use email addresses as primary or
  | foreign keys, but those reasons are conceptual ("business
  | logic") and not technical.
 
  | late2part wrote:
  | Deleted
 
    | ccleve wrote:
    | If you're suggesting that we should double the size of the
    | hardware rather than add a single index, then I respectfully
    | disagree.
    | 
    | But your point is well-taken. Hardware is cheap.
 
    | emerongi wrote:
    | To a degree, this is true though. An engineer's salary is a
    | huge expense for a startup, it's straight up cheaper to spend
    | more on cloud and have the engineer work on the product
    | itself. Once you're bigger, you can optimize, of course.
 
      | late2part wrote:
      | Deleted
 
        | emerongi wrote:
        | If this is a case from real life that you have seen, you
        | should definitely elaborate more on it. Otherwise, you
        | are creating an extreme example that has no significance
        | in the discussion, as I could create similarly ridiculous
        | examples for the other side.
        | 
        | There are always costs and benefits to decisions. It
        | seems that you are only looking at the costs and none of
        | the benefits?
 
        | late2part wrote:
        | Deleted
 
    | bigbillheck wrote:
    | > compute is cheap
    | 
    | There was a whole thread yesterday about how a dude found out
    | that it isn't: https://briananglin.me/posts/spending-5k-to-
    | learn-how-databa... (also mentioned in the RbtB post)
 
      | rustc wrote:
      | Depends on the provider. I've checked the pricing of 4 top
      | relational database cloud vendors (Google, Amazon, Azure,
      | IBM) and they all charge for the number of CPU cores and
      | RAM you buy, not by the number of rows scanned like
      | PlanetScale did in the post you refer to. They'll be more
      | expensive than buying your own server but not nearly what
      | PlanetScale charges for full table scan queries.
 
      | late2part wrote:
      | When do the rest of the folks figure it out?
 
  | kasey_junk wrote:
  | She doesn't mention the write characteristics of the system but
  | she implies that it was pretty write heavy.
  | 
  | In that case it's not obvious to me that putting a key prefix
  | index on every column is the correct thing to do, because that
  | will get toilsome very quickly under high write loads.
  | 
  | Given that she herself wrote the before and after systems 20
  | years ago, and that the story was more about everyone making
  | dumb mistakes when they are inexperienced, perhaps we should
  | assume the best about her second design?
 
    | ccleve wrote:
    | It's not an index on every column. It's a single index across
    | all columns.
 
      | noisy_boy wrote:
      | Which will be updated every time there is a new
      | combination.
 
        | Animats wrote:
        | Which is OK. It's not like the whole index has to be
        | rebuilt. It's cheaper than O(log N).
 
  | [deleted]
 
  | maddynator wrote:
  | One thing worth taking into consideration is that this happened
  | in 2002, when databases were not in the cloud, ops was done by
  | DBAs, and the key prefix compression that's omnipresent today
  | was likely not that common, or potentially not even
  | implemented/available.
  | 
  | But I don't think the point of the post is the right/wrong way
  | of doing it. The point, as mentioned by a few here, is that
  | programmers make mistakes. Those mistakes are costly and will
  | remain costly if, as a tech industry, we continue to boot
  | experienced engineers... the tacit knowledge those engineers
  | have gained will not be passed on, and this means more people
  | have to figure things out by themselves.
 
    | Philip-J-Fry wrote:
    | >One thing that worth taking into consideration is that this
    | happened in 2002. When the databases were not in cloud, the
    | ops was done by dba's
    | 
    | Are you implying that isn't the case today? Thousands of
    | (big) companies are still like that and will continue to be
    | like that.
    | 
    | I write my own SQL, design tables and stuff, submit it for a
    | review by someone 10x more qualified than myself, and at the
    | end of the day I'll get a message back from a DBA saying "do
    | this, it's better".
 
      | mixmastamyk wrote:
      | The question of it being in the cloud or not is highly
      | orthogonal to schema design.
 
      | WJW wrote:
      | If you have one, more power to you. I'm currently making a
      | good living as a freelance DBA/SRE for startups. With a
      | team of 3-5 devs, some of whom are frontend and/or mobile
      | devs, proper database knowledge is definitely thin on the
      | ground in some places.
 
    | hinkley wrote:
    | In 2002, you started seeing major increases in query run time
    | with as few as 5 joins. Once you hit six, you had to start
    | thinking about rearchitecting your data or living with slow
    | response times.
    | 
    | There was a lot of pressure to relax 3NF as being too
    | academic and not practical.
    | 
    | Around then, I had a friend who was using a pattern of
    | varchar primary keys so that queries that just needed the
    | (unique) name and not the metadata could skip the join. We
    | all acted like he was engaging in the Dark Arts.
 
    | adfgaertyqer wrote:
    | You don't really need compression. Rows only need to persist
    | for about an hour. The table can't be more than a few MiB.
    | 
    | We can debate the Correct Implementation all day long. The
    | fact of the matter is that adding _any_ index to the original
    | table, even the wrong index, would lead to a massive speedup.
    | We can debate 2x or 5x speedups from compression or from
    | choosing a different schema or a different index, but we get
    | 10,000x from adding any index at all.
 
      | ipaddr wrote:
      | Just to make this fun.
      | 
      | Adding an index now increases the insert operation
      | cost/time and adds additional storage.
      | 
      | If insert speed/volume is more important than reads keep
      | the indexes away. Replicate and create an index on that
      | copy.
 
  | dudeinjapan wrote:
  | I agree. In the "new" FK-based approach, you'll still need to
  | scan indexes to match the IP and email addresses (now in their
  | own tables) to find the FKs, then do one more scan to match the
  | FK values. I would think this would be significantly slower
  | than a single compound index scan, assuming index scans are
  | O(log(n))
  | 
  | Key thing is to use EXPLAIN and benchmark whatever you do. Then
  | the right path will reveal itself...
 
  | msqlqthrowaway wrote:
  | > all major databases do these days
  | 
  | Did everyone on HN miss that the database in question was
  | whichever version of MySQL existed in 2002?
 
    | mst wrote:
    | Which I was using in production at the time, and, yeah, what
    | she ended up doing was a much better plan for that thing.
 
    | jeremyjh wrote:
    | It actually doesn't matter. Even simple indexes would have
    | made that schema work just fine, and MySQL could certainly
    | index strings in 2002.
 
  | omegalulw wrote:
  | > Sorry, no. The original schema was correct, and the new one
  | is a mistake.
  | 
  | Well, you also save space by doing this (though presumably
  | you only need to index the emails, as IPs are already 128
  | bits).
  | 
  | But other than that, I'm also not sure why the original schema
  | was bad.
  | 
  | If you were to build individual indexes on all four columns,
  | you would essentially build the four id tables implicitly.
  | 
  | You can calculate the intersection of hits on all four indexes
  | that match your query to get your result. This is linear in the
  | number of hits across all four indexes in the worst case, but
  | if you are careful about which index you look at first, you
  | will probably be a lot more efficient, e.g. (from, ip, to,
  | helo).
  | 
  | Even with a multi index on the new schema, how do you search
  | faster than this?
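  | 
  | In other words, something like this (sketch, invented names);
  | the planner can then intersect the hit lists or just start
  | from the most selective index:
  | 
  |     CREATE INDEX idx_from ON mail_log (mail_from);
  |     CREATE INDEX idx_ip   ON mail_log (ip);
  |     CREATE INDEX idx_to   ON mail_log (rcpt_to);
  |     CREATE INDEX idx_helo ON mail_log (helo);
  | 
  |     SELECT 1 FROM mail_log
  |      WHERE ip = ? AND helo = ? AND mail_from = ? AND rcpt_to = ?;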
 
  | [deleted]
 
| an9n wrote:
| Wow what a twist. Somebody call M Night Shyamalan!
 
| welder wrote:
| The ending is the most important part.
| 
| > Now, what do you suppose happened to that clueless programmer
| who didn't know anything about foreign key relationships?
| 
| > Well, that's easy. She just wrote this post for you. That's
| right, I was that clueless newbie who came up with a completely
| ridiculous abuse of a SQL database that was slow, bloated, and
| obviously wrong at a glance to anyone who had a clue.
| 
| > My point is: EVERYONE goes through this, particularly if
| operating in a vacuum with no mentorship, guidance, or reference
| points. Considering that we as an industry tend to chase off
| anyone who makes it to the age of 35, is it any surprise that we
| have a giant flock of people roaming around trying anything
| that'll work?
 
  | deathanatos wrote:
  | ... I appreciate her sentiment at the end there that we should
  | try to change the status quo (which I think does suck), but I'm
  | not sure how much power the average employee has over it. Most
  | employers I've worked at seem loath to retain anyone past
  | about 2 years. (E.g., currently I'm in the 95th percentile,
  | after just over 2 years.) IME it takes about 6 months to really
  | learn how a company's systems work to the point of proficiency
  | (yes, _six months_), which means we're spending ~25% of the
  | time "training", where "training" is usually someone trying
  | something crazy and failing, and getting corrected as opposed
  | to some structured form of learning. Oftentimes the systems
  | that these engineers are stumbling over are, I think, the internal
  | ones; they need more features, or refactoring to account for
  | organic growth, but since they're not shiny new customer facing
  | features they'll get exactly 0 priority from PMs. The engineers
  | who build experience working with these systems and actually
  | know what needs to change and how without breaking the existing
  | use cases ... are about to leave the company, since nobody is
  | retained beyond ~2 years.
  | 
  | I also find my own team is usually strapped for resources,
  | normally people. (Usually politely phrased as "time".) Yes,
  | one has to be wary of mythical man-month'ing it, but at my
  | last employer we had essentially 2 of us on a project that
  | could easily have used at least one, if not two, more people.
  | Repeat across every project and that employer was understaffed
  | by 50-100%, essentially.
  | 
  | Some company just went for S-1, and they were bleeding cash.
  | But they _weren't_ bleeding it into new ventures: they were
  | bleeding it into marketing. Sure, that might win you one or two
  | customers, but I think you'd make much stronger gains with new
  | products, or less buggy products that don't drive the existing
  | customers away.
  | 
  | Also there's an obsession with "NIT" -- not invented there --
  | that really ties my hands as an engineer. Like, everything has
  | to be outsourced to some cloud vendor whose product only
  | barely fits some of the needs, and whose "support"
  | department appears to be unaware of what a computer is. I'm a
  | SWE, let me do my thing once in a while? (Where there's a
  | good fit for an external product, then yes, by all means. But these
  | days my job is 100% support tickets, and like 3% actual
  | engineering.)
 
  | 5faulker wrote:
  | Seems like the definition of hacker to me.
 
  | shrimpx wrote:
  | > tend to chase off anyone who makes it to the age of 35
  | 
  | That's definitely not true anymore, if it ever was.
  | Senior/staff level engs with 20+ years of experience are sought
  | after and paid a ton of money.
 
  | jjice wrote:
  | The way she kept referring to the programmer made me suspicious
  | that that would be the ending. I personally really like that as
  | a way of conveying the message of us all having growing pains
  | and learning the hard way from some inefficient code.
  | 
  | I know I've also had these encounters (specifically in DBs),
  | and I'm sure there are plenty more to come.
 
  | carapace wrote:
  | (It seems a lot of folks are getting nerd-sniped by the set-up
  | and missing the moral of the story, eh?)
  | 
  | I don't know the actual numbers, but it's been pointed out that
  | at any given time something like half of all programmers have
  | been doing it less than five years, for decades now.
  | 
  | That, plus the strident ignorance of past art and practice,
  | seem to me to bring on a lot of issues.
 
    | lamontcg wrote:
    | > It seems a lot of folks are getting nerd-sniped
    | 
    | that is an excellent term.
 
      | jacobolus wrote:
      | Origin: https://xkcd.com/356/
 
        | wwweston wrote:
        | Old enough some people under 35 may not have encountered
        | it... ;)
 
        | sseagull wrote:
        | Well, then, they are part of the 10,000
        | 
        | https://xkcd.com/1053/
 
        | misnome wrote:
        | Only if they were born in the US
 
        | lamontcg wrote:
        | I'm 49, I don't know how I never saw that xkcd.
 
        | egeozcan wrote:
        | I feel like that phrase has been there forever, and I'm
        | exactly 35. Will this happen more often as I grow older?
        | Ugh. Feels weird. Maybe also a bit depressing.
 
        | lamontcg wrote:
        | It was weird seeing all the spongebob squarepants memes
        | take over the internet when I was too old to ever grow up
        | with that. I turned 28 in 1999 when that first aired.
        | That was my "holy shit I'm so old" moment when that
        | finally trickled up into my awareness as someone nearly
        | turning 40 or so.
 
        | yesenadam wrote:
        | Same age as you. I have a lot of friends in the
        | Philippines. I swear the Facebook employment info of half
        | the people in the Philippines says they work at the
        | Krusty Krab. (And for most filipinos, the internet =
        | facebook.) For some reason I never asked what that meant,
        | and for many years I thought that was just some odd joke,
        | and was vaguely puzzled about how widespread it is.
        | Eventually I happened on the Spongebob connection!
 
    | beering wrote:
    | > (It seems a lot of folks are getting nerd-sniped by the
    | set-up and missing the moral of the story, eh?)
    | 
    | The narrative should lead to the conclusion. If I told you
    | the story of the tortoise and the hare in which the hare gets
    | shot by a hunter, then said the moral is "perseverance wins",
    | you'd be rightfully confused.
 
    | [deleted]
 
    | twic wrote:
    | I don't think it's purely nerd-sniping. If your story is "at
    | first you are bad, but then you get good", but your example
    | is of a case where you did something fine but then replaced
    | it with something worse, that rather undermines the story.
 
      | mwcampbell wrote:
      | If even someone with Rachel's level of experience still
      | doesn't know all the minutiae of database optimization, I
      | think that just amplifies her point about the importance of
      | mentoring novices.
 
        | duxup wrote:
        | I recall reading a rant on another site about someone so
        | upset they had to deal with clueless noobs at work.
        | 
        | They then went on to list the errors that this brand new
        | entry level employee had made when writing ... an
        | authentication system ...
        | 
        | I was more than a little shocked when I realized they
        | were serious and hadn't realized the issue was sending
        | the new entry level guy to do that job alone.
 
        | HahaReally wrote:
        | OR we could demand they get expensive irrelevant degrees
        | and be geniuses at coding by the time they graduate with
        | no relevant skills! Hey, in fact, let's just do that
        | instead. :)
 
      | adolph wrote:
      | Chronology alone only progresses; progress sometimes
      | regresses.
 
  | kofejnik wrote:
  | IMO she's still pretty clueless (sorry!), but over the years
  | got better at self-promotion
  | 
  | So the real lesson is to be very careful as to who you listen
  | to
 
    | crispyambulance wrote:
    | She has never been "clueless" and has written insightfully
    | for years.
 
      | globular-toast wrote:
      | Well she's clueless about what actually sped up this
      | database query.
 
        | Lammy wrote:
        | imo "clueful"/clueless is more about being able to sense
        | situations where there is something to be known rather
        | than knowing everything upfront all the time (which is
        | obviously impossible)
 
        | yuliyp wrote:
        | Without knowing which database engine (MySQL comes with
        | more than one) she was using, nor the testing that she
        | actually performed, what makes you say that?
 
        | mst wrote:
        | I remember versions of mysql whose string indexing was
        | sufficiently limited on every available backend that this
        | would've been the correct change.
        | 
        | Hell, back in ... 2001? I think? I ended up making
        | exactly the same set of changes to an early database
        | design of mine, for pretty much the same stated reasons
        | and with very similarly pleasant effects on its
        | performance.
 
    | mwcampbell wrote:
    | The point isn't to determine who is or isn't clueless. The
    | point is how we deal with each other's cluelessness.
 
  | crispyambulance wrote:
  | Yep. Folks are getting lost in the weeds discussing indexing of
  | database tables. That's _totally_ beside the point here.
  | 
  | The thing is, the first implementation was a perfectly fine
  | "straight line" approach to solve the problem at hand. One
  | table, a few columns, computers are pretty fast at searching
  | for stuff... why not? In many scenarios, one would never see a
  | problem with that schema.
  | 
  | Unfortunately, "operating in a vacuum with no mentorship,
  | guidance, or reference points", is normal for many folks. She's
  | talking about the trenches of SV, it's even worse outside of
  | that where the only help you might get is smug smackdowns on
  | stackoverflow (or worse, the DBA stackexchange) for daring to
  | ask about such a "basic problem".
 
    | otterley wrote:
    | I'd be a lot more sympathetic if the major RDBMSes didn't have
    | outstanding and thorough reference manuals, or if there
    | weren't a mountain of books on the subject that cover, among
    | other things, the topic of indexing and its importance.
    | MySQL's manual, for example, has covered this subject from
    | the very beginning:
    | https://dev.mysql.com/doc/refman/5.6/en/mysql-indexes.html (I
    | don't have the 3.x manuals handy but it was there back then
    | too).
    | 
    | It's not clear from the article whether the author spent any
    | time studying the problem before implementing it. If she
    | failed to do so, it is both problematic and way more common
    | than it ought to be.
    | 
    | "Ready, fire, aim"
    | 
    | But you know what they say: good judgment comes from
    | experiences and experience comes from bad judgment.
    | 
    | Compounding the problem here is that the author now has much
    | more experience and in a reflective blog post, _still_ got
    | the wrong answer.
    | 
    | IMO the better lesson to take away here would have been to
    | take the time getting to know the technology before putting
    | it into production instead of jumping head first into it.
    | That would be bar raising advice. The current advice doesn't
    | advise caution; instead it perpetuates the status quo and
    | gives "feel good" advice.
 
      | jraph wrote:
      | You can't really research a problem if you don't suspect its
      | existence. At that point you might feel you have everything
      | you need to start the implementation, so how do you guess?
 
      | dasil003 wrote:
      | I couldn't disagree more with this comment. No amount of
      | reading documentation teaches you how to build production
      | systems. You progress much faster by getting your hands
      | dirty, making mistakes, and learning from them.
      | 
      | The challenge of writing good software is not about knowing
      | and perfecting every gory detail; it's about developing the
      | judgement to focus on the details that
      | actually matter. Junior engineers who follow your advice
      | will be scared to make mistakes and their development will
      | languish compared to those who dive in and learn from their
      | mistakes. As one gains experience it's easy to become
      | arrogant and dismissive of mistakes that junior engineers
      | make, but this can be extremely poisonous to their
      | development.
 
        | isoskeles wrote:
        | I couldn't disagree more with this comment. No amount of
        | getting your hands dirty and making mistakes teaches you
        | how to learn from your mistakes, nor changes the impact
        | of the mistakes. You also progress much faster by
        | learning from other peoples' mistakes, this is why
        | written language is so powerful.
        | 
        | Honestly, I think your comment is okay minus the fact
        | that you're trying to highlight such hard disagreement
        | with a sentiment that people should read the fucking
        | manual. Really, you couldn't disagree MORE?
        | 
        | There is definitely value in RTFM. There are cases where
        | making mistakes in production is not acceptable, and
        | progressing by "making mistakes" is a mistake on its own.
        | I don't think the case in this article sounds like one of
        | those, but they do exist (e.g. financial systems (think
        | about people losing money on crypto exchanges),
        | healthcare, or, say, sending rockets into space). In many
        | cases, making mistakes is fine in test systems, but
        | completely, absolutely, catastrophically unacceptable in
        | production. Although, I refuse to blame the junior
        | engineer for such mistakes, I blame management that "sent
        | a boy to do a man's job" (apologies for a dated idiom
        | here).
        | 
        | (As an aside, overall, I disagree with many of the
        | comments here nitpicking the way the author solved the
        | problem, calling her "clueless", etc. I really don't care
        | about that level of detail, and while I agree the
        | solution does not seem ideal, it worked for them better
        | than the previous solution.)
 
        | crispyambulance wrote:
        | > No amount of getting your hands dirty and making
        | mistakes teaches you how to learn from your mistakes...
        | 
        | Categorically wrong.
        | 
        | Mistakes, failure, and trial and error are very much a
        | part of developing skills. If you're not making mistakes,
        | you're also not taking enough risks and thus missing out
        | on opportunities for growth.
 
        | isoskeles wrote:
        | Yes, mistakes are required to learn from them, but making
        | mistakes does not imply that a person will learn from
        | their mistakes.
 
        | otterley wrote:
        | Respectfully, I believe you are disagreeing with an
        | argument that I am not making. It's not an either-or
        | scenario. Both access to and reliance on reference
        | materials _and_ the ability to safely experiment are part
        | of the professional engineer's toolbox. I can say this
        | with confidence because I was a junior engineer once -
        | it's not like I'm speaking without experience.
        | 
        |  _Everyone_ - new and experienced developers alike -
        | should have a healthy fear of causing pain to customers.
        | Serving customers - indirectly or otherwise - is what
        | makes businesses run. As a principal engineer today my
        | number one focus is on the customer experience, and I try
        | to instill this same focus in every mentee I have.
        | 
        | Does that mean that junior developers should live in
        | constant fear and decision paralysis? Of course not.
        | 
        | That's where the mentor - the more experienced and
        | seasoned engineer - comes in. The mentor is roughly
        | analogous to the teacher. And the teaching process uses a
        | _combination_ of a mentor, reference materials, and the
        | lab process. The reference materials provide the details;
        | the lab process provides a safe environment in which to
        | learn; and the mentor provides the boundaries to prevent
        | the experiments from causing damage, feedback to the
        | mentee, and wisdom to fill in the gaps between them all.
        | If any of these are absent, the new learner's
        | development will suffer.
        | 
        | Outstanding educational systems were built over the last
        | 3 centuries in the Western tradition using this
        | technique, and other societies not steeped in this
        | tradition have sent their children to us (especially in
        | the last half-century) to do better than they ever could
        | at home. It is a testament to the quality of the
        | approach.
        | 
        | So no, I'm not advocating that junior developers should
        | do nothing until they have read all the books front to
        | back. They'd never gain any experience at all or make
        | essential mistakes they could learn from if they did
        | that. But junior developers should not work in a vacuum -
        | more senior developers who both provide guidance _and_
        | refer them to reference materials to study and try again,
        | in my experience, lead to stronger, more capable
        | engineers in the long term.  "Learning to learn" and
        | "respect what came before" are skills as important as the
        | engineering process itself.
 
        | crispyambulance wrote:
        | > That's where the mentor [...] comes in.
        | 
        | The thing is, specifically, that the OP did NOT have a
        | mentor. They had a serious problem to solve, pronto, and
        | knew enough to take a crack at it. OK, the initial
        | implementation was suboptimal. So what? It's totally
        | normal, learn and try again. Repeat.
        | 
        | It would be nice if every workplace had an orderly
        | hierarchy of talent where everyone takes care to nurture
        | and support everyone else (especially new folks). And
        | where it's OK to ask questions and receive guidance even
        | across organizational silos, where there are guardrails
        | to mitigate accidents and terrible mistakes. If you work
        | in such an environment, you are lucky.
        | 
        | It is far more common to have sharp-elbow/blamestorm
        | workplaces which pretend to value "accountability" but
        | then don't lift a finger to support anyone who takes
        | initiative. I suspect that at the time Rachelbythebay
        | worked in exactly this kind of environment, where it
        | simply wasn't possible for an OPS person to go see the
        | resident database expert and ask for advice.
 
        | otterley wrote:
        | Right. Hence my concern about the lack of helpful advice
        | other than "accept yourself"[1]: Neophyte engineers
        | lacking guardrails and local mentors should fall back to
        | what they learned in school: experiment, use your books,
        | and ask questions from the community. Mentors can be
        | found outside the immediate workplace, after all.
        | 
        | And when it comes to taking production risks, measure
        | twice and cut once, just like a good carpenter.
        | 
        | [1] Of course you should accept yourself. But that alone
        | won't advance the profession or one's career.
 
        | noduerme wrote:
        | But it's also about constantly asking yourself "how can
        | this be done better? What do I not know that could
        | improve it?"
        | 
        | I learned probably 80% of what I know about DB
        | optimization from reading the Percona blog and S.O. The
        | other 20% came from splashing around and making mistakes,
        | and testing tons of different things with EXPLAIN. That's
        | basically learning in a vacuum.
 
        | [deleted]
 
    | damagednoob wrote:
    | Yes and no, and this post highlights a subtle issue with
    | mentorship (which I think is important): technology does not
    | stand still. What was true in 2002 might not be true today.
    | While adopting the naive approach was detrimental back then,
    | today's databases recognise that this happens and provide
    | easy workarounds that didn't exist at the time.
    | 
    | I've experienced this just by switching languages. C# had
    | many articles dedicated to how much better StringBuilder was
    | compared to String.Concat and yet, other languages would do
    | the right thing by default. I would give advice in a totally
    | different language about a problem that the target language
    | did not have.
    | 
    | As the song goes:
    | 
    | "Be careful whose advice you buy but be patient with those
    | who supply it
    | 
    | Advice is a form of nostalgia, dispensing it is a way of
    | fishing the past
    | 
    | From the disposal, wiping it off, painting over the ugly
    | parts
    | 
    | And recycling it for more than it's worth"
 
      | nightski wrote:
      | That might be true if you just take blanket advice. The key
      | is to find out _why_ say StringBuilder is better than
      | String.Concat. If you understand the implementation details
      | and tradeoffs involved, this makes the knowledge much more
      | applicable in the future. The core concepts in technology
      | do not move nearly as fast as individual projects,
      | libraries, frameworks, languages, etc...
 
  | unclebucknasty wrote:
  | What's funny is there seems to be a lot of debate among
  | grizzled veterans here about whether her old solution is
  | actually better than the new. We're getting into index types,
  | engine implementations, normal forms, etc. And really, this is
  | a relatively simple development problem that I can easily
  | imagine on an interview test.
  | 
  | Now, imagine a new engineer, just getting started trying to
  | make sense of it all, yet being met with harsh criticism and
  | impatience.
  | 
  | Maybe that was exactly her point--to get everyone debating over
  | such a relatively trivial problem to communicate the messages:
  | Good engineering is hard. Everyone is learning. Opinions can
  | vary. Show some humility and grace.
 
  | jraph wrote:
  | I was reading her article like "yeah, well... wow, she's tough
  | / hard with this programmer, it's not that unreasonable
  | anyway!", until this passage.
  | 
  | I would not expect her to be like this from what I've read on
  | her blog so I was surprised. Well written!
 
| IceDane wrote:
| The moral of this post really falls flat coming from this author,
| most of whose posts are snarky, passive-aggressive rants about
| someone's alleged incompetence or something similar.
 
  | enraged_camel wrote:
  | This is a really weird ad hominem that is unrelated to the post
  | itself.
 
    | IceDane wrote:
    | Ad hominem?
    | 
    | It seems pretty relevant to me that the author mostly writes
    | snarky articles which directly contradict the moral of this
    | story.
    | 
    | For the record, I totally agree with the moral of the story.
    | But we can't just blindly ignore context, and in this case
    | the context is that the author regularly writes articles
    | where someone else is ridiculed because the author has deemed
    | them incompetent.
    | 
    | In a similar vein, no one would just ignore it and praise
    | Zuckerberg if he started discussing the importance of
    | personal privacy on his personal blog.
 
| avl999 wrote:
| The responses in this thread are classic pedantic HN with the
| actual point whoooshing over so many people's head.
 
  | jeremyjh wrote:
  | I think it actually makes quite a bit of a difference if 20
  | years later the author still does not understand the actual
  | problem and solution she is using as an example. It does
  | undermine her argument that people make mistakes, learn
  | from them, and grow, if the lack of growth is still in
  | evidence 20 years later in the singular example under
  | consideration.
 
  | bizzarojerry wrote:
  | No I'd say the responses in this thread are classic pedantic HN
  | with users like you taking the high moral ground and looking
  | down on people who actually are trying to address the technical
  | issues with a post in a technical forum.
 
    | dang wrote:
    | We've banned this account for breaking the site guidelines.
    | Creating accounts to do that with will eventually get your
    | main account banned as well, so please don't!
    | 
    | https://news.ycombinator.com/newsguidelines.html
 
  | [deleted]
 
| [deleted]
 
| darkwater wrote:
| I'd really love to be snarky here but I'll try to be polite: all
| those comments about the example situation are missing the whole
| point of the post. And it really worries me that there is a good
| chunk of tech workers who just ignore the real meaning of
| something and nitpick stupid implementation details instead.
| The post is about managing rookie errors, being empathetic, and
| also warning about the ageism that pervades the sector. TBH
| about this last point IDK the situation nowadays in Silicon
| Valley, but in Europe my limited experience is that ageism is
| not that bad; it's actually difficult to find seasoned
| developers.
| 
| Edit: typos
 
  | Lewton wrote:
  | I'm happy the HN comments are nitpicking. I'm not particularly
  | well versed in database schemas, but while reading it I was
  | going "???????", and it's good to know I'm not going crazy.
 
  | 08-15 wrote:
  | In this blog post, senior engineer Rachel talks nonsense about
  | normalization and promotes a bafflingly complicated solution as
  | superior to a simple one, without identifying the actual fix,
  | which appears to have happened accidentally.
  | 
  | In other words, with enough empathy and patience, a clueless
  | rookie can grow into a clueless senior engineer!
  | 
  | Rachel usually makes more sense than that. That's why people
  | are nitpicking implementation details.
 
    | mixmastamyk wrote:
    | Hah, pretty accurate and kinda funny, but could be nicer. I
    | still make mistakes.
 
      | 08-15 wrote:
      | Meanwhile, Rachel posted an update. Apparently, the whole
      | section about how "the system got reworked" doesn't
      | describe the proper solution, but her own (Rookie Rachel's)
      | attempt to fix it.
      | 
      | And in _that_ context, everything makes sense.
 
  | hn_throwaway_99 wrote:
  | The reason "the details are important" here are _not_ because
  | of the nitty gritty around what mistakes a  "novice" programmer
  | made.
  | 
  | They are important because the _present_ incarnation of the
  | author is making _all the wrong_ diagnoses about the problems
  | with the original implementation, _despite_ doing it with an
  | air of  "Yes, younger me was so naive and inexperienced, and
  | present me is savvy and wise".
 
    | newaccount74 wrote:
    | I think she inadvertently made a different point, which is
    | that even experienced developers sometimes misunderstand the
    | problem and make mistakes.
    | 
    | Or an even better argument: you don't need to actually
    | understand the problem to fix it, often you accidentally fix
    | the problem just by using a different approach.
 
      | mst wrote:
      | I would argue instead that this comment thread is making
      | the point that people forget that things that work now
      | wouldn't've worked then and design decisions have to be
      | made based on the database engine you're running in
      | production.
 
        | jeremyjh wrote:
        | MySQL could index strings in 2002. It would have worked
        | fine.
 
        | mst wrote:
        | "mysql could index strings" and "using a compound index
        | over four varchar columns would've worked out well" are
        | significantly different propositions.
 
        | jeremyjh wrote:
        | Yes, we won't know because she didn't try it. And we know
        | she didn't try it because the problem she described is
        | table scans, not "indexed strings are somehow slow in
        | MySQL circa 2002".
 
        | mst wrote:
        | Or she did try it and it didn't work - or researched the
        | question and figured, quite reasonably based on my
        | experience of mysql in that era, it probably wouldn't
        | work - and kept the post to only the original problem and
        | the final approach to avoid having even _more_ technical
        | details in a post that wasn 't really about those.
        | 
        | I agree that we don't know, but it seems a little unfair
        | to her to treat every unknown as definitely being the
        | most negative of the possibilities.
 
        | hn_throwaway_99 wrote:
        | I really don't understand your general thrust here. MySql
        | certainly had lots of issues in 2003, but being able to
        | support multi-column indexes was not one of them. Her
        | analysis is simply wrong - it is wrong now and was wrong
        | then. Here is the doc from MySql v3.23:
        | 
        | 7.4.2 Multiple-Column Indexes MySQL can create composite
        | indexes (that is, indexes on multiple columns). An index
        | may consist of up to 16 columns. For certain data types,
        | you can index a prefix of the column (see Section 7.4.1,
        | "Column Indexes"). A multiple-column index can be
        | considered a sorted array containing values that are
        | created by concatenating the values of the indexed
        | columns. MySQL uses multiple-column indexes in such a way
        | that queries are fast when you specify a known quantity
        | for the first column of the index in a WHERE clause, even
        | if you do not specify values for the other columns
 
        | mst wrote:
        | As I said already: "mysql could index strings" and "using
        | a compound index over four varchar columns would've
        | worked out well" are significantly different
        | propositions.
        | 
        | To be more verbose about it - there is an important
        | difference between "can be created" and "will perform
        | sufficiently well on whatever (likely scavenged) hardware
        | was assigned to the internal IT system in question."
        | 
        | I wouldn't be surprised if the "server" for this system
        | was something like a repurposed Pentium 233 desktop with
        | a cheap spinning rust IDE drive in it, and depending on
        | just how badly the spammers were kicking the shit out of
        | the mail system in question that's going to be a fun
        | time.
 
        | lern_too_spel wrote:
        | They are not significantly different. It just means
        | sorting using more than one column.
 
    | darkwater wrote:
    | She explained the problem, the first not working solution and
    | the second working solution as they really happened in 2002
    | _as an example_. The real point is the last part of the post.
    | And it was _not_ "how to properly implement an sql based
    | filter for open proxies in your MTA".
 
      | hn_throwaway_99 wrote:
      | I get all that. But 2020 version of this person _still does
      | not understand the problem_ , and she is arguing that she
      | does while oddly self-deprecating the inexperienced version
      | of herself, who arguably had a better solution to begin
      | with.
 
        | mixmastamyk wrote:
        | Yes, I think an alternate design was found that didn't
        | hit as many MySQL limitations as the previous one. This
        | improved performance and was a win. But the post-mortem
        | diagnosis was lacking.
 
  | shawnz wrote:
  | Just because the author was intending to get across a certain
  | point, doesn't mean the implementation details aren't worth
  | discussing too. I don't think many people here would disagree
  | with the central point, so what's there to discuss about it? I
  | think it's uncharitable to assume that everyone talking about
  | the implementation details is "missing the point".
 
  | rmk wrote:
  | I think you are leaping to conclusions here. It is possible
  | that most people want to keep HN focused on technical
  | conversations and use this as an opportunity to learn something
  | (I certainly do), which is why you are seeing many more
  | comments on the technical aspects.
 
  | [deleted]
 
  | Cpoll wrote:
  | I don't buy it. If the only point was managing rookie errors,
  | etc., the blog post shouldn't have been 15~ paragraphs of
  | technical discussion and 2 paragraphs at the bottom of "my
  | point is ..." You can't advocate that people just ignore 80% of
  | the article because they're "not the point of the post."
  | 
  | I'm sure a good chunk of tech workers are worried that when an
  | influential blogger writes something like this a couple hundred
  | developers will Google 3NF and start opening PRs.
 
    | [deleted]
 
    | unethical_ban wrote:
    | It's a narrative. A story. She told it in that way to lure
    | people who get hooked on technical discussion, and then make
    | the point that they probably would have missed or ignored.
    | Without a narrative example, she could have just said "I made
    | mistakes too, live and learn" but she chose to provide an
    | example before presenting the thesis.
    | 
    | And hey, what's so bad about people learning about 3NF? Are
    | you not supposed to know what that is until you're some
    | mythical ninth level DBA?
 
  | globular-toast wrote:
  | The real meaning being that people learn and gain experience
  | over time? Is this really something we need to read two pages
  | of text to find out? I think people are justifiably miffed at
  | having read two pages for such a trivial message. Not only
  | that, but these "stupid implementation details" could seriously
  | mislead someone who really is a clueless newbie.
 
  | [deleted]
 
| hn_throwaway_99 wrote:
| This post is bizarre, precisely because there _is nothing
| particularly wrong_ with the original schema, and the author
| seems to believe that the problem is that the column values were
| stored as strings, or that the schema wasn't in "third normal
| form".
| 
| Which is nonsense. The problem with the original DB design is
| that the appropriate columns weren't indexed. I don't know enough
| about the problem space to really know if a more normalized DB
| structure would be warranted, but from her description I would
| say it wasn't.
 
  | mewpmewp2 wrote:
  | I think if anything, all of it could've been put into a single
  | indexed column since the query was AND ... AND ... not OR.
  | 
  | So you could've had an indexed column of "fingerprint", like
  | 
  |     ip1_blahblah_evil@spammer.somewhere_victim1@our.domain
  | 
  | and indexed this, with only a single WHERE clause in the query.
  | 
  | I don't understand at all how the multiple-tables thing would
  | help compared to indexes, and the whole post seemed kind of
  | crazy to me for that reason. In fact, if I had to guess,
  | multiple tables would've performed worse.
  | 
  | That is if I'm understanding the problem correctly at all.
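  | 
  | Something like this is what I have in mind (sketch; names are
  | invented, and the separator should be a character that can't
  | appear in the values):
  | 
  |     ALTER TABLE mail_log ADD COLUMN fingerprint VARCHAR(255) NOT NULL;
  |     CREATE INDEX idx_fingerprint ON mail_log (fingerprint);
  | 
  |     -- fingerprint = CONCAT_WS('|', ip, helo, mail_from, rcpt_to),
  |     -- set on insert
  |     SELECT 1 FROM mail_log WHERE fingerprint = ?;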
 
    | gavinray wrote:
    | This has been the advice given to me by Postgres experts in a
    | similar scenario:
    | 
    |     "If you want to efficiently fuzzy-search through a
    |     combination of firstname + lastname + (etc), it's faster
    |     to make a generated column which concatenates them and
    |     index the generated column and do a text search on that."
    | 
    | (Doesn't have to be fuzzy-searching, but just a search in
    | general, as there's a single column to scan per row rather
    | than multiple)
    | 
    | But also yeah I think just a compound UNIQUE constraint on
    | the original columns would have worked
    | 
    | I'm pretty sure that the degree of normalization given in the
    | end also goes well beyond 3rd Normal Form.
    | 
    | I think that is 5th Normal Form/6th Normal Form or so, almost
    | as extreme as you can get:
    | 
    | https://en.wikipedia.org/wiki/Database_normalization#Satisfy.
    | ..
    | 
    | The way I remember 3rd Normal Form is _"Every table can
    | stand alone as its own coherent entity/has no cross-cutting
    | concerns"_.
    | 
    | So if you have a "product" record, then your table might have
    | "product.name", "product.price", "product.description", etc.
    | 
    | The end schema shown could be described as a "Star Schema"
    | too, I believe (see image on right):
    | 
    | https://en.wikipedia.org/wiki/Star_schema#Example
    | 
    | This Rachel person is also much smarter than I am, and you
    | can make just about anything work, so we're all bikeshedding
    | anyways!
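    | 
    | For what it's worth, the generated-column version of that
    | advice looks roughly like this in modern Postgres (12+);
    | table and column names are made up:
    | 
    |     ALTER TABLE people
    |         ADD COLUMN full_name text
    |         GENERATED ALWAYS AS (firstname || ' ' || lastname) STORED;
    | 
    |     CREATE INDEX idx_people_full_name ON people (full_name);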
 
      | tkot wrote:
      | Would it not prevent some optimizations based on statistics
      | regarding the data distribution, like using the most
      | selective attribute to narrow down the rows that need to be
      | scanned? I'm assuming there are 2 indexes, 1 for each
      | column that gets combined.
      | 
      | Let's say you know the lastname (Smith) but only the first
      | letter of the firstname (A) - in the proposed scenario only
      | the first letter of the firstname helps you narrow down the
      | search to records starting with the letter (WHERE combined
      | LIKE "A%SMITH"), you will have to check all the rows where
      | firstname starts with "A" even if their lastname is not
      | Smith. If there are two separate indexed columns the WHERE
      | clause will look like this:
      | 
      | WHERE firstname like "A%" AND lastname = "Smith"
      | 
      | so the search can be restricted to a smaller number of rows.
      | 
      | Of course having 2 indexes will have its costs like
      | increased storage and slower writes.
      | 
      | Overall the blog post conveys a useful message but the
      | table example is a bit confusing, it doesn't look like
      | there are any functional dependencies in the original table
      | so pointing to normalization as the solution sounds a bit
      | odd.
      | 
      | Given that the query from the post is always about concrete
      | values (as opposed to ranges), it sounds like the right way
      | to index the table is to use a hash index (which might not
      | have been available back in 2002 in MySQL).
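      | 
      | In Postgres that's just (sketch, hypothetical names):
      | 
      |     CREATE INDEX idx_combined_hash
      |         ON people USING hash (combined);
      | 
      | In MySQL, hash indexes are really only available on MEMORY
      | tables (InnoDB's adaptive hash index is automatic), so in
      | practice you'd end up with a B-tree anyway.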
 
        | gavinray wrote:
        | I won't pretend to be an expert in this realm, but see
        | here:
        | 
        | https://www.postgresql.org/docs/current/textsearch-
        | tables.ht...
        | 
        | Specifically, the part starting at the below paragraph,
        | the explanation for which continues to the bottom of the
        | page:
        | 
        |     "Another approach is to create a separate tsvector
        |     column to hold the output of to_tsvector. To keep this
        |     column automatically up to date with its source data,
        |     use a stored generated column. This example is a
        |     concatenation of title and body, using coalesce to
        |     ensure that one field will still be indexed when the
        |     other is NULL:"
        | 
        | I have been told by PG wizards this same generated,
        | concatenated single-column approach with an index on each
        | individual column, plus the concatenated column, is the
        | most effective way for things like ILIKE search as well.
        | 
        | But I couldn't explain to you why
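        | 
        | If I remember the example on that page correctly, it goes
        | roughly like this:
        | 
        |     ALTER TABLE pgweb
        |         ADD COLUMN textsearchable_index_col tsvector
        |         GENERATED ALWAYS AS (to_tsvector('english',
        |             coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
        | 
        |     CREATE INDEX textsearch_idx
        |         ON pgweb USING GIN (textsearchable_index_col);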
 
    | kelnos wrote:
    | Right, and when you do that, you don't even need an RDBMS. A
    | key-value store would suffice. This essentially just becomes
    | a set! Redis or memcached are battle-tested workhorses that
    | would work even better than a relational DB here.
    | 
    | But this was also back in the early '00s, when "data store"
    | meant "relational DB", and anything that wasn't a RDBMS was
    | probably either a research project or a toy that most people
    | wouldn't be comfortable using in production.
 
      | dehrmann wrote:
      | > this was also back in the early '00s, when "data store"
      | meant "relational DB", and anything that wasn't a RDBMS was
      | probably either a research project or a toy that most
      | people wouldn't be comfortable using in production.
      | 
      | Indeed; her problem looks to have been pre-memcached.
 
    | ryandrake wrote:
    | Also, the "better" solution assumes IPv4 addresses and is not
    | robust to a sudden requirements change to support IPv6. Best
    | to keep IP address as a string unless you really, REALLY need
    | to do something numerical with one or more of the octets.
 
      | johannes1234321 wrote:
      | Keeping IP addresses as string isn't trivial. You can write
      | those in many ways. Even with IPv4. IPv6 just brings new
      | variations. And when talking about migration to IPv6 you
      | have to decide if you keep IPv4 Addresses as such or prefix
      | with ::ffff:.
      | 
      | In the end you have to normalize anyways. Some databases
      | have specific types. If not you have to pick a scheme and
      | then ideally verify.
 
      | pilif wrote:
      | Based on my experience I disagree. If there is a native
      | data type that represents your data, you should really use
      | it.
      | 
      | It will make sure that incorrect data is detected at the
      | time of storage rather than at some indeterminate time in
      | the future and it will likely be more efficient than
      | arbitrary strings.
      | 
      | And in case of IP addresses, when you are using Postgres,
      | it comes with an inet type that covers both IPv4 and IPv6,
      | so you will be safe from requirement changes.
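      | 
      | For example (a sketch):
      | 
      |     CREATE TABLE connections (src inet NOT NULL);
      | 
      |     -- rejects malformed addresses at insert time,
      |     -- and understands prefixes/CIDR:
      |     SELECT * FROM connections WHERE src << inet '10.0.0.0/8';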
 
    | dehrmann wrote:
    | Yeah, I'm not clear on how multiple tables fixed it other
    | than allowing you to scan through the main table faster.
    | 
    | Multiple tables could be a big win if long email addresses
    | are causing you a data size problem, but for this use case, I
    | think a hash of the email would suffice.
 
      | jeremyjh wrote:
      | Well she did say she had indexes on the new table. It could
      | have been fixed with an index on the previous table, but a
      | new table with indexes also fixed it.
 
    | riquito wrote:
    | That way you can get false positives unless you concatenate
    | using a non-valid character in the protocol
    | 
    |     "foo@example.com" + "other@example.net"
    |     "foo@example.co"  + "mother@example.net"
    | 
    | Btw, storing just the domains, or inverting the email
    | strings, would have sped up the comparison.
 
  | erikerikson wrote:
  | Clearly it worked.
  | 
  | However...
  | 
  | Consider the rows `a b c d e` and `f b h i j`. How many times
  | will `b` be stored in the two formulations? 2 and 1, right? The
  | data volume has a cost.
  | 
  | Consider the number of cycles to compare a string versus an
  | integer. A string is, of course, a sequence of numbers. Given
  | the alphabet has a small set of symbols you will necessarily
  | have a character repeated as soon as you store more values than
  | there are characters. Therefore the database would have to
  | perform at least a second comparison. I imagine you're
  | correctly objecting that in this case the string comparison
  | happens either way but consider how this combines with the
  | first point. Is the unique table on which the string
  | comparisons are made smaller? Do you expect that the table with
  | combinations of values will be larger than the table storing
  | comparable values? Through this, does the combinations table
  | using integer comparisons represent a larger proportion of the
  | comparison workload? Clearly she found that to be the case. I
  | would expect it to be the case as well.
 
    | globular-toast wrote:
    | > Clearly it worked.
    | 
    | No it didn't. Someone else put an index on the main table and
    | _that_ worked.
 
      | erikerikson wrote:
      | The article declared that it was running in production and
      | satisfying the requirement. Do you have a different
      | definition of working? I say that to clarify what I was
      | attempting to communicate by "it worked".
      | 
      | I think you're remarking about the relationship between
      | normalized tables and indexes. That relationship does exist
      | in some databases.
      | 
      | Please say more if I'm missing your point.
 
        | lern_too_spel wrote:
        | GP's point is that the final schema has a primary key
        | index on the main table that solved the problem, and then
        | it has a bunch of useless additional tables. The solution
        | was just to add a primary key index on the main table.
        | Adding the additional tables just slows down the query
        | because it now has to do five index lookups instead of
        | one.
 
  | noctune wrote:
  | I especially feel the proposed solution could be problematic
  | given that the data is supposed to be transient. How are keys
  | deleted from the "id | value" tables when rows are deleted from
  | the main table? Does it just keep accumulating rows or is there
  | something like a GC query running occasionally that deletes all
  | unreferenced values?
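  | 
  | If there is a GC, I'd imagine it's something roughly like this
  | (made-up table and column names), run periodically for each of
  | the value tables:
  | 
  |     DELETE FROM ips
  |      WHERE id NOT IN (SELECT ip_id FROM quads);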
 
  | bitexploder wrote:
  | Yeah... technically that's (the solution) normalization, but
  | really not a bad design originally. If you never want a
  | particular atom of data repeated, yes you have to normalize
  | each column, and that is efficient in the long run for storage
  | size, but in more normal data storage that still means a lot of
  | joins and you still want indices. On a modern SSD you could use
  | that one table, throw on some indices as you suggest and not
  | notice it again until you hit millions of rows.
  | 
  | Anything you are doing conditional logic or joins on in SQL
  | usually needs some sort of index.
 
  | themgt wrote:
  | Also you could just store IPv4 as an unsigned INT directly,
  | rather than strings in a different table and an unsigned INT as
  | a foreign key.
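  | 
  | Rough sketch (MySQL has INET_ATON/INET_NTOA for this; table and
  | column names invented):
  | 
  |     CREATE TABLE greylist (
  |       ip   INT UNSIGNED NOT NULL,
  |       time DATETIME     NOT NULL
  |     );
  |     INSERT INTO greylist VALUES (INET_ATON('192.0.2.1'), NOW());
  |     SELECT INET_NTOA(ip) FROM greylist
  |      WHERE ip = INET_ATON('192.0.2.1');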
 
    | tptacek wrote:
    | I have regretted every single time I've gotten cute about
    | storing IP addresses as scalars in SQL.
 
      | ryukafalz wrote:
      | What issues did you run into? Having admittedly never done
      | this before, it feels like whenever you need the dotted-
      | quad IP you could just query:
      | 
      |     SELECT INET_NTOA(ip) FROM ips WHERE...
      | 
      | ...in MySQL at least. (Which might be why MySQL has this
      | function built in?)
      | 
      | I guess if you ever need to match on a specific prefix
      | storing the numeric IP might be an issue?
 
        | soheil wrote:
        | Because, as shocking as this might sound, unlike the
        | characters in the movie The Matrix I don't have a built-in
        | INET_NTOA function in my retina to see the string form of
        | the IP addresses when glancing at the rows of a large
        | table looking for patterns or a certain IP.
 
        | ryukafalz wrote:
        | How often are you looking at the full contents of a table
        | vs. looking at the results of a query that you could
        | throw an INET_NTOA onto?
 
        | soheil wrote:
        | Many. You also expect me to remember how to spell
        | INET_NTOA and which fields to use it on. What if I
        | wanted to do a quick "SELECT * FROM"? I barely know how to
        | spell English words; what makes you think I'm going to
        | remember how to spell INET_NTOA?
 
        | mixmastamyk wrote:
        | dbeaver CE is free.
 
        | Lammy wrote:
        | I realize you're describing a general problem and not
        | just this particular example, but fwiw NTOA and friends
        | got a lot easier for me once I learned enough that I
        | stopped trying to "remember how to spell" them and
        | started thinking about them in terms of the underlying
        | concepts, like "Network (as in big-endianness) to ASCII".
        | Learning the significance of the term "address family"
        | was a big clue for me for self-directed learning in this
        | space :)
 
        | Ginden wrote:
        | > I guess if you ever need to match on a specific prefix
        | storing the numeric IP might be an issue?
        | 
        | If you want to search for a range, you can do:
        | 
        |     SELECT INET_NTOA(ip) FROM ips
        |     WHERE ip > INET_ATON(:startOfRange)
        |       AND ip < INET_ATON(:endOfRange)
 
  | havkd wrote:
  | Is this correct? Would indexing the columns instead of moving
  | the values to another table lead to the same increase in
  | performance?
 
    | mst wrote:
    | Given this is a "from the start of her career" story, I'm
    | guessing she was running similar versions of mysql to the
    | versions I started with, and if my guess is correct then
    | probably not.
    | 
    | On anything you're likely to be deploying today, just
    | throwing a compound index at it is likely quite sufficient
    | though.
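    | 
    | i.e. something like this (sketch; column names from the
    | article, table name made up):
    | 
    |     CREATE INDEX idx_quad ON greylist (ip, helo, m_from, m_to);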
 
    | kofejnik wrote:
    | It would be faster, I believe. As we're only looking for
    | existence, a single index traversal is all the I/O we would
    | need to do.
 
    | taeric wrote:
    | An index is often easily viewed as another table. So, it should.
    | 
    | (Some subtleties on projected values and such, but the point
    | stands.)
 
      | havkd wrote:
      | Then this post is very misleading.
 
        | taeric wrote:
        | I wouldn't get too vehement on it. Knowing how to
        | normalize isn't a bad thing. And, if you intricately know
        | your write and read needs, you may be able to assist the
        | work more appropriately.
 
      | jameshart wrote:
      | Yes, the proposed 'better' structure basically amounts to
      | building your own indexes. Normally, I'd assume it is
      | better to use the RDBMS's own engine to do that, don't roll
      | your own.
      | 
      | There may well be some subtlety to the indexing
      | capabilities of MySQL I'm unaware of though - I could easily
      | imagine myself making rookie mistakes like assuming that it
      | has the same indexing capabilities. So, to the post's point -
      | if I were working on a MySQL db I would probably benefit
      | from an old hand's advice to warn me away from dangerous
      | assumptions.
      | 
      | On the other hand I also remember an extremely experienced
      | MS SQL Server DBA giving me some terrible advice because
      | what he had learned as a best practice on SQL Server 7
      | turned out to be a great way to not get any of the benefits
      | of a new feature in SQL Server 2005.
      | 
      | Basically, we're all beginners half the time.
 
        | rrrrrrrrrrrryan wrote:
        | As a former SQL Server DBA, most SQL Server DBAs are
        | absolute gurus on 1 or 2 specific versions of SQL server,
        | and massively ignorant about anything newer.
        | 
        | It's a strange role where, in order to do your job well,
        | you kind of have to hyper-specialize in the specific
        | version of the tech that your MegaCorp employer uses,
        | which is usually a bit older, because upgrading databases
        | can be extremely costly and difficult.
 
    | loeg wrote:
    | The same O(N) -> O(log N) improvement for queries, yes. The
    | constant factor on the separate tables _might_ be better.
    | It's also a more complicated design.
 
    | adfgaertyqer wrote:
    | Sure would! I think it would be marginally better, in fact,
    | because you would just need to look at the index rather than
    | five tables. Access would be more local.
 
    | rustc wrote:
    | Moving the data into another table would still require
    | indexes: one on the original table's column (which now stores
    | the new id) and one on the new table's primary key.
    | 
    | In most cases I'd expect just adding an index to the original
    | table to be more efficient, but it depends on the type of the
    | original column and if some data could be de-duplicated by
    | the normalization.
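    | 
    | Concretely, something like this per normalized column (sketch,
    | names invented):
    | 
    |     CREATE TABLE ips (
    |       id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,  -- new PK
    |       ip VARCHAR(45) NOT NULL
    |     );
    |     -- plus an index on the id column the original table holds:
    |     CREATE INDEX idx_quads_ip_id ON quads (ip_id);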
 
  | ignoramous wrote:
  | > _This post is bizarre_ ... _Which is nonsense_ _...but from
  | her description I would say it wasn't._
  | 
  | This _is_ what the post is about.
  | 
  | And btw, if they hadn't normalized those tables, it'd instead
  | have been pointed out to them that their "post is bizarre..."
 
  | someonewhocar3s wrote:
  | I was kinda surprised by the last chapter: suddenly this is
  | about 'advancing in programming as a woman', and it's over!
  | The last few chapters do make sense in that view, though.
  | 
  | What I want to know is how she went from a strong database
  | format (ID + data, keyed) to a weak database model (not really
  | a database... it's a key-value store) which is likely to be
  | much slower. But also, this is what happens to varchars under
  | the hood already (the rows only hold pointers to where the
  | strings/objects are placed, rather than inlining the varchars),
  | at least in litesql, so optimally it's only two dereferences
  | instead of one...
  | 
  | Like, eh, this programmer went through an anti-learning
  | process somehow, and came up with a senseless optimisation.
  | The post ends midway through the learning!
 
  | [deleted]
 
  | jeroenhd wrote:
  | Based on this piece of old documentation I found [0] for MySQL
  | 3.23 (the most recent version in 2002 as far as I can tell),
  | certain types of indices were only available on certain types
  | of engines. Furthermore, columns were restricted to 255
  | characters, which may be too short for some of the fields saved
  | in the database.
  | 
  | Modern databases abstract away a lot of database complexity for
  | things like indices. It's true that these days you'd just add
  | an index on the text column and go with it. Depending on your
  | index type and data, the end result might be that the database
  | turns the table into third normal form by creating separate
  | lookup tables for strings, but hides it from the user. It could
  | also create a smarter index that's less wasteful, but the end
  | result is not so dissimilar. Manually doing these kinds of
  | optimisations these days is usually a waste of effort or can
  | even cause performance issues (e.g. that post on the front page
  | yesterday about someone forgetting to add an index because
  | mysql added them automatically).
  | 
  | All that doesn't mean it was probably a terrible design back
  | when it was written. We're talking database tech of two decades
  | ago, when XP had just come out and was considered a memory hog
  | because it required 128MB of RAM to work well.
  | 
  | [0]:
  | http://download.nust.na/pub6/mysql/doc/refman/4.1/en/create-...
 
    | mixmastamyk wrote:
    | > All that doesn't mean it was probably a terrible design
    | back when it was written. We're talking database tech of two
    | decades ago, when XP had just come out and was considered a
    | memory hog because it required 128MB of RAM to work well.
    | 
    | A lot of this stuff was invented in the 70s, and was quite
    | sophisticated by 2000. It just wasn't _free_ , rather quite
    | expensive. MySQL was pretty braindead at the time, and my
    | recollection is that even postgres was not that hot either.
    | We're very lucky they've come so far.
 
    | hn_throwaway_99 wrote:
    | The fact remains that whether the text column existed on her
    | original table, or whether it was pulled out to a normalized
    | table, literally all of the same constraints would apply
    | (e.g. max char length, any other underlying limitations of
    | indexing).
    | 
    | The issue is that her analysis of what the issue was with her
    | original table is _completely wrong_ , and it's very weird
    | given that the tone of her "present" self is that it's so much
    | more experienced and wise than her "inexperienced, naive"
    | self.
    | 
    | My point is that she should give her inexperienced self a
    | break; all that was missing from her original implementation
    | was some indexes.
 
      | mst wrote:
      | Multiple-long-column compound indices sucked in old mysqls
      | if you could even convince it to use them in the first
      | place.
      | 
      | Being able to look up each id via a single-string unique
      | index would've almost certainly worked much better in those
      | days.
 
        | hn_throwaway_99 wrote:
        | I used MySql a lot back then, had many multi-column
        | indexes, and never had an issue.
        | 
        | More importantly, given the degree of uniqueness likely
        | to be present in many of those columns (like the email
        | addresses), she could have gotten away with not indexing
        | on every column.
 
        | mst wrote:
        | Your first sentence was, I'm afraid, very much not a
        | universal experience.
        | 
        | Your second is quite possibly true, but would have
        | required experimentation to be sure of, and at some point
        | "doing the thing that might be overkill but will
        | definitely work" becomes a more effective use of
        | developer time, especially when you have a groaning
        | production system with live users involved.
 
    | e12e wrote:
    | Based on: https://web.mit.edu/databases/mysql/mysql-3.23.6-al
    | pha/Docs/...
    | 
    | I'd say there's very little performance gain in normalizing
    | (it usually goes the other way anyway: normalize for good
    | design, avoiding storing multiple copies of individual
    | columns; _de_ -normalize for performance).
    | 
    | I'm a little surprised by the tone of the article - sure,
    | there were universities that taught computer science without
    | a database course - but it's not like there weren't practical
    | books on DB design in the 90s and onward?
    | 
    | I guess it's meant as a critique of the mentioned, but not
    | linked, other article "being discussed in the usual places".
 
    | willvarfar wrote:
    | Knowing only what we can guess about the actual problem, I'd
    | say indexing the IP ought to be enough to make everything
    | super fast.
 
  | gfodor wrote:
  | The point is she was being hyper critical of her past schema to
  | make a larger point.
 
    | dehrmann wrote:
    | I have mixed feelings about this. On one hand, I appreciate
    | the larger point, and pointing out mistakes _again_ goes
    | against that point. On the other, the post had so many
    | technical details it overshadowed the larger point,
    | especially for something that wasn 't obviously broken.
 
    | mwcampbell wrote:
    | And IIUC, the larger point, made by the twist ending, is that
    | we shouldn't be so critical, but try to actually help newbies
    | learn what we insist that they should know.
 
      | erikerikson wrote:
      | It seemed to me the point was that if the industry remains
      | designed to push out experienced engineers, there won't
      | be anyone to mentor the engineers who are just beginning
      | their careers; further, that we will bend the productivity
      | of the field downward and have less effective systems.
 
        | sokoloff wrote:
        | In what way is the industry designed that way now? Maybe
        | some get sucked into management roles, but I don't see
        | any evidence of a broad design to push them out.
 
        | erikerikson wrote:
        | Role count pyramids, volume and inflexibility of working
        | hours, abusive management techniques, a constant
        | treadmill of the new hotness, more...
        | 
        | Some offices avoid many of the challenging patterns but
        | not the vast majority. A lot of it is a natural result of
        | the newness of the industry and the high stakes of
        | investment. Many of the work and organizational model
        | assumptions mean that offering flexibility to employees
        | complicates the already challenging management role.
        | Given the power of those roles, this means workers
        | largely must fit into the mold rather than designing a
        | working relationship that meets all needs. As the needs
        | change predictably over time a mounting pressure
        | develops.
        | 
        | I'm only 17 years in but it has gotten harder and harder
        | to find someone willing to leave me to simply code in
        | peace and quiet and dig my brain into deep and
        | interesting problems.
 
        | wbl wrote:
        | Look at the way we screen resumes, or valorize work
        | conditions incompatible with having family obligations.
        | Ageism in the tech industry is rife.
 
        | bialpio wrote:
        | Or how leetcode heavy hiring favors folks that tend to be
        | more fresh out of college & folks who have time to
        | prepare for them.
 
  | tomc1985 wrote:
  | Idunno the real villain is that bullshit, costly "per-row"
  | pricing
 
  | codazoda wrote:
  | As a 25 year "veteran" programmer, I'm glad I'm not the only
  | one that likes the original schema (with indexes added). A
  | fully normalized table is a lot more difficult to read,
  | troubleshoot, and reason about. You would need a pretty good
  | query to poke around the data.
 
    | mst wrote:
    | When I first started using postgres I was amazed at how much
    | of what I thought was "database knowledge" was actually
    | "working around limitations of the versions of mysql I'd been
    | using" knowledge.
    | 
    | These days, sure, the original schema plus indices would work
    | fine on pretty much anything I can think of. Olde mysqls were
    | a bit special though.
 
  | [deleted]
 
  | zhoujianfu wrote:
  | Phew.. I was reading this post thinking "is basically every
  | table I've ever made wrong?!"
 
    | mst wrote:
    | For modern databases, the extra normalisation is an
    | incremental optimisation that you usually won't need.
    | 
    | But I think she's if anything been in the industry longer
    | than I have and the versions of mysql I first ran in
    | production were a different story.
 
    | topher_t wrote:
    | I definitely got a spike of imposter syndrome thinking, oh
    | fuck, I've been in software this long and haven't learned this
    | yet?!
 
  | hoppla wrote:
  | I usually store IPv4 as unsigned int, and IPv6 as two 64 bit
  | unsigned ints. But supporting both variants requires that I have
  | a table for each type and a glue table to reference the
  | record in the correct table. Storing simple stuff quickly turns
  | complicated...
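  | 
  | Roughly this shape, in case it's not obvious (all names
  | invented):
  | 
  |     CREATE TABLE ipv4_addr (
  |       id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  |       addr INT UNSIGNED NOT NULL UNIQUE
  |     );
  |     CREATE TABLE ipv6_addr (
  |       id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  |       hi BIGINT UNSIGNED NOT NULL,
  |       lo BIGINT UNSIGNED NOT NULL,
  |       UNIQUE (hi, lo)
  |     );
  |     -- glue table: exactly one of the two references is non-NULL
  |     CREATE TABLE ip_ref (
  |       id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  |       v4_id INT UNSIGNED NULL,
  |       v6_id INT UNSIGNED NULL
  |     );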
 
  | da39a3ee wrote:
  | OK so change the article's technical solution from "normalize"
  | to "add index". That doesn't change the point of the article.
 
| rmk wrote:
| Why not store the IP Address as a 32-bit number (IPv4 addresses)?
| Why store it as a string in the first place? This is something I
| did not quite get. Also, wouldn't it be better to split out the
| domain from the email address for ease of filtering?
| 
| Also, how does performing joins give a performance advantage
| here? I'm assuming there would be queries (at least one, but up to
| 4) to get at the IDs of the items in the quad. Then there would be
| a lookup in the mapping table.
| 
| I have worked for some time in this industry, but I have never
| had to deal with relational databases (directly; query tuning and
| such were typically the domain of expert db people). It would be
| interesting to see an explanation of this aspect.
| 
| EDIT: To people who may want to comment, "You missed the point of
| the article!": no, I did not, but I want to focus on the
| technical things I can learn from this. I agree that ageism is a
| serious problem in this industry.
 
  | pg_bot wrote:
  | You could store the IP as a 32 bit unsigned int. There's no
  | issue with that, but I would probably recommend nowadays to use
  | an internal cidr or inet type if your database supports that.
  | It probably wouldn't be better to split the email address from
  | the domain for filtering since email address formats tend to be
  | a bit tricky.
  | 
  | Joins give a performance advantage due to the fact that you
  | aren't duplicating data unnecessarily. The slow query in
  | question becomes five queries (4 for the data, one for the
  | final lookup), each of which can be done quickly, and if any one
  | of them returns nil, you can return a timeout.
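  | 
  | Roughly this shape (sketch; table and column names invented):
  | 
  |     SET @ip_id   = (SELECT id FROM ips   WHERE ip = '192.0.2.1');
  |     SET @helo_id = (SELECT id FROM helos WHERE helo = 'foobar');
  |     SET @from_id = (SELECT id FROM froms WHERE addr = 'a@x.com');
  |     SET @to_id   = (SELECT id FROM tos   WHERE addr = 'b@y.net');
  |     -- a NULL above means that value has never been seen, so the
  |     -- quad is new; otherwise check whether the quad is known:
  |     SELECT time FROM quads
  |      WHERE ip_id = @ip_id AND helo_id = @helo_id
  |        AND from_id = @from_id AND to_id = @to_id;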
 
    | rmk wrote:
    | Yes, some of the databases support native ip address types
    | that can be manipulated efficiently. Better to use that (like
    | inbuilt json querying capabilities in postgres) than come up
    | with your own.
    | 
    | It is still not clear how it is better to do up to 5 separate
    | queries, or maybe a few joins, than to store the strings and
    | construct indices and such on them. Is the idea that the cost
    | of possibly establishing a new socket connection for some
    | concurrent query execution or sequentially executing 5
    | queries is still < possible scans of the columns (even with
    | indexing in effect)? Also, even if you had integers, don't
    | you need some sort of an index on integer fields to avoid
    | full table scans anyway?
 
| Aqueous wrote:
| There is a time and a place for a denormalized schema. If you are
| trying to search a large data-set a denormalized schema in a
| single table with some good indexing is much preferable to
| joining across 30 tables.
| 
| Whether you choose normalized or denormalized depends very much
| on a) what you need to do now b) what you may need to do in the
| future. Both are considerations.
 
| bizzarojerry wrote:
| What a bizarre post from an influential blogger so confidently
| making assertions that are wildly contested even among HN users.
 
| chrischapman wrote:
| I think the 'terrible schema' thing is a secondary issue. The
| important take away for me was this:
| 
| > Considering that we as an industry tend to chase off anyone who
| makes it to the age of 35, is it any surprise that we have a
| giant flock of people roaming around trying anything that'll
| work?
 
  | alisonkisk wrote:
  | That canard isn't true.
 
  | mixmastamyk wrote:
  | Unfortunately it is the penultimate sentence in a sizable post.
 
| temporallobe wrote:
| This was great. Clever twist at the end.
| 
| I cringed a little because some of those mistakes look like
| stuff I would do even now. I have a ton of front/back end
| experience in a huge variety of languages and platforms, but I am
| NOT a database engineer. That's a specific skillset I never
| picked up nor, admittedly, am I passionate about.
| 
| Sorry to go off on a tangent, but it also brings to mind that
| even so-called experts make mistakes. I watch a lot of live
| concert footage from famous bands from the 60s to the 90s, and as
| a musician myself, I spot a lot of mistakes even among the most
| renowned artists...with the sole exception of Rush. As far as I
| can tell, that band was flawless and miraculously sounded better
| live than on record!
 
| mceachen wrote:
|  Actually, the Correct Answer is a bloom
| filter.
| 
| (And, yes, we had math in the early 2000s.)
| 
| Snark aside, I'm frustrated for the author. Her completely-
| reasonable schema wasn't "terrible" (even in archaic MySQL)--it
| just needed an index.
| 
| There's always more than one way to do something. It's a folly of
| the less experienced to think that there's only One Correct Way,
| and it discourages teammates when there's a threat of labeling a
| solution as "terrible."
| 
| Not to say there aren't infinite terrible approaches to any given
| problem: but the way you guide someone to detect why a given
| solution may not be optimal, and how you iterate to something
| better, is how you grow your team.
 
  | Jolter wrote:
  | To be fair, the "correct way" to do databases at that time was
  | to use third normal form. Putting indices on string columns
  | would have been considered a hack, much like using MySQL was.
 
    | mst wrote:
    | Plenty of databases at the time had pretty lame string
    | indexing because it simply wasn't something people relied on
    | if they wanted performance, and memory sayeth mysql's were
    | both not particularly efficient and had some nasty technical
    | limitations.
    | 
    | On the mysql she was using, breaking things out so it only
    | needed to index ints was almost certainly a much better idea.
    | On anything I'd be deploying to today, I'd start by throwing
    | a compound index at the thing with the expectation that'd
    | probably get me to Good Enough.
 
| ed25519FUUU wrote:
| > id | ip | helo | m_from | m_to | time | ...
| 
| I'm not a database person, but this doesn't seem _too bad_ if
| there was at least an index on IP. The performance there would
| probably be good enough for a _long time_ depending on the usage.
| 
| But is 6 tables (indexed on the value) really better than 1 table
| with 6 indexes?
 
| rektide wrote:
| The popular advice in these comments to use indexes sounds really
| good. I haven't thought it out fully, but my first thought was
| "this sounds like a DIY columnar store". A columnar database does
| a lot of this work for you. I'm not 100% sure it's a good option in
| this case, because things like email subjects are not good
| columnar data. I would like to know more about how problematic
| that is.
 
| ummonk wrote:
| I was preparing to rant about the tone that the post was taking
| towards the "clueless programmer" until I got to the end. Well
| done...
| 
| Still not sure why adding an index didn't work instead of
| breaking out into multiple tables indexed by IDs.
| 
| Besides, this use case screams for "just store the hashes and
| call it a day".
 
| kpommerenke wrote:
| While the article is not really about the database schema, could
| you do a table with just two fields?
| 
| 1) hash of concat(ip & helo & m_from & m_to)
| 2) time
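| 
| Something like this could work (sketch, names invented; the
| delimiter in CONCAT_WS guards against concatenation collisions):
| 
|     CREATE TABLE seen (
|       quad_hash BINARY(16) PRIMARY KEY,
|       time      DATETIME   NOT NULL
|     );
|     INSERT INTO seen VALUES
|       (UNHEX(MD5(CONCAT_WS(';', @ip, @helo, @m_from, @m_to))),
|        NOW());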
 
| daly wrote:
| We had a test database that contained 1 record. Nobody paid much
| attention since the focus was on the problem, not the database.
| 
| The database included several newly developed "stored
| procedures".
| 
| Time elapsed... and it was nearing the time to ship the code. So
| we tried to populate the database. But we could not. It turned
| out that the stored procedures would only allow a single record
| in the database.
| 
| Since a portion of the "business logic" depended on the stored
| procedures... well, things got "delayed" for quite a while... and
| we ended up having a major re-design of the back end.
| 
| Fun times.
 
  | hinkley wrote:
  | Why do so many of us "forget" to populate the database with
  | representative quantities of data until after we've baked in
  | all of our poor design decisions?
 
| globular-toast wrote:
| > That's right, I was that clueless newbie
| 
| Questionable use of past tense here.
 
  | mwcampbell wrote:
  | Why? She's clearly not a clueless newbie now.
 
    | framecowbird wrote:
    | I think the OP was sarcastically implying that she still is.
 
      | inopinatus wrote:
      | Such remarks generally backfire onto the author, as indeed
      | in this case.
 
    | kofejnik wrote:
    | Right, not a newbie anymore
 
| [deleted]
 
| zephrx1111 wrote:
| I think this post showed us what an engineer on Mt. Stupid looks
| like. Apparently this design is wrong. We should rely on the
| indexing ability of the database itself, instead of inventing our
| own indexing system.
 
| tyingq wrote:
| Feels like you could just concatenate and hash the 4 values with
| MD5 and store the hash and time.
| 
| Edit: I guess concatenate with a delimiter if you're worried
| about false positives with the concat. But it does read like a
| cache of _" I've seen this before"_. Doing it this way would be
| compact and indexed well. MD5 was fast in 2002, and you could
| just use a CRC instead if it weren't. I suppose you lose some
| operational visibility to what's going on.
 
  | ok123456 wrote:
  | For the 2021 version, you'd just generate a bloom filter/cuckoo
  | hash from all the data gathered by your spamhaus database
  | periodically. Make a separate one for each value in your tuple
  | and your score would be the number of the sub-hashes that
  | matched.
 
  | kingcharles wrote:
  | Isn't that exactly what adding an index would do internally?
 
    | tyingq wrote:
    | 2002 MySQL had some limitations on indexing variable length
    | string (varchar) columns. That's the gist of the linked
    | story.
 
  | Moto7451 wrote:
  | Yup, we do this at work for similar purposes and it works a-ok.
  | We also have some use cases that follow the original "problem"
  | schema and they work fine with the correct indices involved.
  | 
  | My guess is that in 2002 there were some issues making those
  | options unappealing to the Engineering team.
  | 
  | When we do this in the realm of huge traffic, we run the
  | data through a log stream and it ends up in either a KV store,
  | Parquet with a SQL/query layer on top of it, or hashed and rolled
  | into a database (and all of the above if there are a lot of
  | disparate consumers. Weee, Data Lakes).
  | 
  | This is also the sort of thing I'd imagine Elastic would love
  | you to use their search engine for.
 
| shusson wrote:
| I would add a counterpoint to foreign key constraints: they make
| deletions and updates a lot more costly [1].
| 
| [1] https://shusson.info/post/postgres-experiments-deleting-
| tabl...
 
| wdr1 wrote:
| Related to database indexes, but not the post: a busted database
| index brought down ticket sales for the 2008 Olympic Games.
| 
| This was the first time regular people could go buy tickets for
| events & they had been lining up overnight at Bank of China
| locations through the country. We were down for over a day before
| we called it off. Apparently this led to minor upheaval at
| several locations in Beijing & riot police were called in.
| 
| We were pretty puzzled as we _had_ an index and had load tested
| extensively. We had Oracle support working directly with us  &
| couldn't figure out why queries had started to become table
| scans.
| 
| The culprit? A point upgrade to DBD::Oracle (something like X.Y.3
| to X.Y.4) introduced a subtle bug in character sets. So the index
| required using a particular Unicode character set, and we were
| specifying it, but when it was translated into the actual
| query, it wasn't exactly the right one, so the DB assumed it
| couldn't use the index. Then, when all the banks opened & a large
| portion of a very populous country tried to buy tickets at the same
| time, things just melted.
| 
| Not a fun day.
 
  | hinkley wrote:
  | The one that always got us is much more mundane: deleting a row
  | requires an index in every table with a FK to it, to avoid table
  | scans.
  | 
  | Near as I can tell, we assume there is some bit of magic built
  | into the foreign key concept that handles this for us, but that
  | is not the case.
 
    | necovek wrote:
    | Foreign keys are naturally referencing primary keys, which
    | have their own (unique) indexes by default. And there's some
    | extra magic with CASCADE.
    | 
    | But it seems like I am missing your point: care to expand on
    | it so I can learn about the gotcha as well? What are you
    | attempting to do, and what's making the performance slow?
 
      | nickkell wrote:
      | If I've understood correctly it's the foreign key column
      | itself that isn't indexed by default. Deleting a record
      | then requires a table scan to ensure that the constraint
      | holds
 
        | dan-robertson wrote:
        | Example: if you delete a record from the customers table,
        | you want an index on the foreign key in the orders table
        | to delete the corresponding entries. This also means the
        | performance of the delete can have the unfortunate
        | property where the small work of deleting a single row
        | cascades into the work of deleting many rows.
 
    | zeroimpl wrote:
    | Yeah I've run into this a few times. In Postgres, you can't
    | add a foreign key referencing columns that aren't indexed. I
    | assume this is for performance reasons so inserts/updates are
    | efficient, but there is no such check that the source columns
    | are also indexed, so updates/deletes on the referenced table
    | can be terribly slow.
 
  | superjan wrote:
  | At my previous job, a particular database was designed using an
  | ASCII string type for a particular field rather than
  | Unicode. If you then queried with a string in Unicode format, the
  | database decided that the comparison should be done in Unicode.
  | The only way to do that was to table scan and convert all the
  | ASCII fields to Unicode on the fly. It was found only in
  | production.
  | 
  | Given that you mention China in your story, did GB 18030 have
  | anything to do with your problems?
  | 
  | https://en.wikipedia.org/wiki/GB_18030
 
  | path411 wrote:
  | Can't imagine the pain to even discover this problem. It
  | normally takes me a long time to be like, "maybe I didn't make
  | a mistake, and I have a third party bug?". Third party bugs are
  | always the most annoying to me, usually the longest to diagnose
  | and then ultimately I have to just tell my boss I can't do
  | anything to fix it, but hopefully I can find a way to avoid it.
 
  | mst wrote:
  | I love DBD::Oracle so much I've always arranged to have other
  | people test oracle related DBIx::Class features for me.
  | 
  | Writing the pre-ANSI-JOIN-support SQL dialect code was pretty
  | good fun though.
 
  | taberiand wrote:
  | The moral of this story, to me, is always lock in all
  | dependencies to the exact specific versions tested for
  | production release.
 
| [deleted]
 
| throwawayFanta wrote:
| I think this blog fails to take into account that 2021 is not
| 2002. Computer Science is a much more formal/mainstream field of
| study now, and people don't operate "in a vacuum with no
| mentorship, guidance, or reference points."
| 
| Some comments on the previous blog post raised important
| questions regarding minimum understanding/knowledge of technology
| one utilizes as part of their day job. And I would agree that
| indexing is a fundamental aspect while using relational
| databases.
| 
| But unfortunately this isn't very uncommon these days; way too
| often I have heard people in $bigco say "let's use X, everyone
| uses X" without completely understanding the
| implications/drawbacks/benefits of that choice.
 
  | [deleted]
 
| savant_penguin wrote:
| I've more or less thrown in the towel on that "not being clueless
| anymore"
| 
| Sure, I'll try and study to keep myself on my feet, but there is
| soo much to learn that I'd rather focus on the technologies
| needed right _now_ to get things done
| 
| Way more often than not, "thing A has to be done with tech B" and
| the expectation of getting it done without multiple foot guns is
| gone
 
| havocsupreme wrote:
| I'm pretty sure all the people who are criticizing the database
| design haven't read the ending. The article isn't about the
| schema, it's about helping those entry-level programmers.
 
  | loeg wrote:
  | It would be more effective at making that point if it didn't
  | have the confused and arguably incorrect section in the middle.
  | If as a writer, the audience misses your point, maybe it's a
  | problem with your writing rather than the audience.
 
  | empthought wrote:
  | You don't think it's helpful for people to steer entry level
  | programmers away from bad ideas like this one?
 
| lisper wrote:
| > You know what it should have been? It should have been
| normalized.
| 
| No, it should have had a unique index on the first five columns.
| 
| (Oooh, I just hate it so much when someone gets up on their high
| horse and opens a blog post with a long condescending preamble
| about how some blogger got it wrong, and then they get it even
| wronger.)
 
| nobodyandproud wrote:
| If there are millions of entries coming in per hour, I can
| see the value.
| 
| Otherwise, how long is their data retention? Wouldn't there be
| more value in cleaning out entries older than, say, two hours?
 
| lifeisstillgood wrote:
| The one thing I would add here is that the devlead (or equivalent
| - there always is someone) should have added at least one test.
| Not a unit test or a 99% coverage test but a "does it do what we
| want" test - a golden path test.
| 
| I came across something like this with an apprentice recently. I
| let the lack of a repeatable test pass because speed / time /
| crunch etc., and the obvious "how can that possibly fail" code of
| course failed. The most basic test would have caught it, in the
| same way some basic performance testing would have helped Rachel
| however many years back.
| 
| It's hard this stuff :-)
 
| dilyevsky wrote:
| The real takeaway here is that DB performance optimization (imo
| performance in general) is freaking hard, and even people who'd
| successfully optimized something like OP often totally
| misunderstand why it actually worked. Pay for expert advice.
 
| rvr_ wrote:
| short answer: composite indexes
 
| jeffrallen wrote:
| I did this (bad schema, too much scanning) too.
| 
| I also ran newfs on a running Oracle database. (That did vastly
| less damage than you would have thought, actually.)
 
| ChrisMarshallNY wrote:
| I'd say that pretty much _everyone_ is a better DB programmer
| than I am, so I don't really go out of my way to design anything
| especially robust. I try to design stuff I'm not good at, in a
| way that makes it easy for someone that knows more than I do, to
| come in and toss out the work I did, to be replaced with good
| work.
| 
| There's _lots_ of stuff that I'm bad at. I'm also good at a fair
| bit of stuff. I got that way, by being bad at it, making
| mistakes, asking "dumb" questions, and seeing how others did it.
 
| Grimm1 wrote:
| Heh, at the end I think I may have been called out for my harsh
| response here on HN to the post being referenced at the start,
| since it seemed to raise some hackles. Borrowing from other
| social media terminology, I've been "subtweeted", but by a blog
| post; not sure how I feel about that.
| 
| Honestly it almost makes me feel compelled to start writing more
| for the industry but to be honest, my opinion is most ideas are
| not great, or not novel, up to and including my own. So writing
| about them seems likely not to be beneficial. I guess you could
| argue I could let the industry be the judge of that though.
 
| fortran77 wrote:
| The "NoSQL" movement -- which I wisely rejected -- is also
| responsible for this.
 
  | jugg1es wrote:
  | There are problems that nosql databases solve easily that
  | require a lot of fancy tricks in a RDBMS.
 
  | harel wrote:
  | All "movements" have some merit, or they wouldn't exist in the
  | first place. Nobody starts a "movement" just because. It exists
  | because it solves something. "NoSQL" was a solution to
  | something. Today you can benefit from best of all worlds, and
  | still NoSQL has a place and a use case, just like anything else
  | that exists.
 
    | tester756 wrote:
    | I will play devil's advocate
    | 
    | >and still NoSQL has a place and a use case, just like
    | anything else that exists.
    | 
    | What actually makes you believe that it's NoSQL that has
    | "some use cases" and relational databases are the "default
    | ones", instead of NoSQL/non-relational by default?
 
| junon wrote:
| Err. Wat. The original schema was (mostly) fine. It had no
| indexes. The second schema looks like it tried to work around
| lack of database optimization features. It's in no way "better"
| from a data design standpoint.
| 
| A database with good string index support isn't doing string
| comparisons to find selection candidates - at least, not
| initially.
| 
| What a bizarrely confident article.
 
  | pgeorgi wrote:
  | > lack of database optimization features
  | 
  | That's a perfect description of MySQL as of 2002.
 
    | junon wrote:
    | MySQL, at least as far back as 2000, had indexes.
    | 
    | http://web.archive.org/web/20020610031610/http://www.mysql.c.
    | ..
 
      | pgeorgi wrote:
      | I wouldn't bet on these indexes optimizing anything back
      | then. MySQL was legendary in that, while implementing the
      | necessary standards (to some degree), it was neither
      | reliable nor efficient.
 
        | eklitzke wrote:
        | Indexes in MySQL worked fine in 2002, at least for the
        | use case described in this article.
 
        | junon wrote:
        | That's pretty beside the point though, the concept of
        | indexing had existed for decades even prior to that.
 
        | pgeorgi wrote:
        | The quote was "database optimization features" and the
        | scope was "MySQL as of 2002".
        | 
        | Of course, even my old DBase II handbook talks about
        | indexes - all that's old hat. MySQL had them, too.
        | 
        | MySQL also used to have a long-earned reputation as a toy
        | database though, and MySQL in 2002 was right within the
        | timeframe where it established that reputation. So yeah,
        | you could add indexes, but did they speed things up (as
        | in them being an "optimization feature")? Public opinion
        | was rather torn on that.
 
| oconnor663 wrote:
| I think there's an easy "best of both worlds" take here:
| 
| 1. The majority of the performance problem could've and probably
| should've been summarized as "you need to use an index". (Maybe
| there were MySQL limitations that got in the way of indexing back
| then? But these days there aren't.)
| 
| 2. Everyone makes mistakes! New programmers make mistakes like
| not knowing about indexes. Experienced programmers make mistakes
| like knowing a ton about everything and then teaching things in a
| weird order. All of these things are ok and normal and part of
| growing, and it's important that we treat each other kindly in
| the meantime.
 
  | mst wrote:
  | Based on my own war stories, if it was an early '00s mysql then
  | I expect "just throw an index at the varchar columns" would've
  | helped but not nearly as much as what she actually did.
 
| gadrev wrote:
| What I got from this post is a headache after reviewing the
| wikipedia entries on 3NF (and all of its dependencies: 1NF, 2NF,
| also 3.5NF). Man the language is dense.
| 
| j/k, always nice to review some of this "basic" stuff (that you
| barely use in everyday work but it's good to have it in the back
| of your head when changing something in the db schema)
 
| bizzarojerry wrote:
| What a bizarre post from an influential blogger so confidently
| making assertions that are wildly contested even among HN users.
| 
| If anything this is the problem with the industry: articles like
| this that make vast generalizations from a specific instance, make
| unsubstantiated finger-pointing gestures, and passive-aggressively
| blame the industry as a whole.
 
| mmaunder wrote:
| Oh well played!! I was setting up to give the author a hard time
| about being judgemental, particularly because storing IPs or
| UUIDs as strings is a mistake I've seen some pretty darn good
| devs make. Some folks just aren't super strong on schema design
| and performance but are great at other things.
| 
| Plus MySQL kind of rocks. Fight me. There are some interesting
| optimizations that Percona has written about that may improve the
| performance of OP's schema. The learning never ends.
| 
| TBH I haven't seen a shortage of mentors. In fact most of our dev
| team is 35+ and are very approachable and collegial and our
| internal mentoring programs have been extremely successful -
| literally creating world experts in their specific field. I don't
| think we're unique in that respect.
| 
| I think it's rather unfortunate that the top comment here is a
| spoiler. Actually reading the post and going on the emotional
| rollercoaster that the author intended to take you on is the
| point. Not the punch line.
 
  | jjice wrote:
  | > Plus MySQL kind of rocks.
  | 
  | Agreed. I prefer Postgres for personal projects, but MySQL is a
  | fine database. Honestly, even a relational DB I wouldn't want
  | to use again (DB2...) is still pretty solid to me. The
  | relational model is pretty damn neat and SQL is a pretty solid
  | query language.
  | 
  | I wonder how many people disagree with that last part in
  | particular...
 
    | mmaunder wrote:
    | It's actually the core of all dev that involves any data
    | storage, today. HN is both aspirational in tech and has a
    | younger demographic so you won't get any points here for
    | mentioning anything SQL. Hence the downvotes I'm getting. But
    | hey, it's all good!!
 
| tomsmeding wrote:
| Why exactly would it be so bad to just put a suitable index on
| the table containing strings? The time complexity of the
| resulting search would be the same, so I assume there will be
| some constant factor slowdowns. Is it that indices over string
| fields are stored inefficiently on disk? (If so, can that not be
| fixed in the db engine directly?) Or is this fine today but
| wasn't fine 15 years ago?
 
  | dan-robertson wrote:
  | I think the article describes the solution that was used and
  | not necessarily the best solution. If they made the "obvious"
  | mistake in the first case they mightn't arrive at the best
  | solution in the second. Adding the single index and using the
  | four tables have the same worst case complexity for reads,
  | O(log n), with different constants and massively beat scanning
  | at O(n).
  | 
  | It may be that the second solution wasn't the best, or that it
  | was better for write performance or memory usage--this was 2002
  | after all. It may also be the case that the cardinality of
  | some columns was low in such a way that the four table solution
  | was better, but maybe getting the columns in the right order
  | for a multi column index could do that too.
  | 
  | Either way, I don't think it really matters that much and
  | doesn't affect the point of the article.
 
  | HelloNurse wrote:
  | Normalization is important for deduplication, not only to index
  | and compare a few short numbers instead of a few long strings:
  | those host names and email addresses are long and often
  | repeated.
 
    | remram wrote:
    | That's not what normalization is. You're thinking of
    | deduplication or compression. This table is normalized as per
    | usual database terminology.
 
      | ghayes wrote:
      | But if it's guaranteed to be 1:1, why? The two
      | implementations (normalized and denormalized) in that case
      | should be completely isomorphic, quirks of the DB engine
      | aside.
 
  | mst wrote:
  | It was fine on -some- databases 20 years ago. mysql of that
  | era, less often.
 
  | thehappypm wrote:
  | At the worst you could concatenate the strings and have a
  | single primary key.. or hash the string concatenation.
 
  | jerjerjer wrote:
  | Yeah, while this is not optimal (ip should be converted into
  | integer, time should be ts), the table would be small (as old
  | entries could be safely deleted). The only real issue is the
  | lack of indices.
  | 
  | Also is helo field even needed?
 
    | mjevans wrote:
    | Yes. A proper mail implementation won't randomize it, so it
    | should be consistent. Depending on the design of the
    | validation stages it might only need a case-normalized lookup
    | to the string index table and that id as part of the final
    | multi-column unique index, and in that last case only to
    | convey to the database the expectations to optimize around
    | (and require the data to adhere to).
 
  | junon wrote:
  | Indexing existed 15 years ago. The article never mentions why
  | indexing didn't solve this problem. Super weird take on the
  | author's part...
 
    | quickthrower2 wrote:
    | Surely more than 15 years! 50 maybe?
 
    | ericbarrett wrote:
    | Early InnoDB* had pretty strict limits on varchar indexes and
    | was not the most efficient. I don't remember the details but
    | it's entirely possible the single-table format Rachel
    | described ran head on into those limitations.
    | 
    | Also remember indexes take space, and if you index all your
    | text columns you'll balloon your DB size; and this was 2002,
    | when that mattered a lot more even for text. Indexes also add
    | write and compute burden for inserts/updates as now the DB
    | engine has to compute and insert new index entries in
    | addition to the row itself.
    | 
    | Finally, normalizing your data while using the file-per-table
    | setting (also not the default back then) can additionally
    | provide better write & mixed throughput due to the writes
    | being spread across multiple file descriptors and not
    | fighting over as many shared locks. (The locking semantics
    | for InnoDB have also improved massively in the last 19
    | years.)
    | 
    | * Assuming the author used InnoDB and not MyISAM. The latter
    | is a garbage engine that doesn't even provide basic
    | crash/reboot durability, and was the default for years; using
    | MyISAM was considered the #1 newbie MySQL mistake back then,
    | and it happened all the time.
 
      | noduerme wrote:
      | To me it seems that spreading it over four tables would
      | lead to a lot more potential read locks while the big
      | combined table is waiting for a join on each of the others,
      | and some process is trying to insert on the others and link
      | them to the main. This is assuming they were using foreign
      | keys and the main table 4-column index was unique.
 
      | cogman10 wrote:
      | > Also remember indexes take space
      | 
      | Indexes take space, except for the clustered index.
      | 
      | An important distinction.
      | 
      | If the point of this table is always selecting based on IP,
      | m_from, m_to, then clustering on those columns in that
      | order would make sense.
      | 
      | Of course, if it's expected that a lot of other query
      | patterns will exist then it might make sense to only
      | cluster on one of those columns and build indexes on the
      | others.
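      | 
      | e.g. for InnoDB, where the table is clustered on the primary
      | key (a sketch; column names from the article, table name and
      | sizes invented):
      | 
      |     CREATE TABLE quads (
      |       ip     INT UNSIGNED NOT NULL,
      |       m_from VARCHAR(190) NOT NULL,
      |       m_to   VARCHAR(190) NOT NULL,
      |       helo   VARCHAR(190) NOT NULL,
      |       time   DATETIME     NOT NULL,
      |       -- rows physically ordered by the common lookup columns
      |       PRIMARY KEY (ip, m_from, m_to, helo)
      |     ) ENGINE=InnoDB;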
 
| kofejnik wrote:
| The initial design was fine and btw it did not violate 3NF,
| actually (if you add auto increment pk). A single composite index
| would've most likely solved all problems
 
| krisgenre wrote:
| My piano teacher once told me that it's common to make a mistake,
| but it's disastrous if you keep practicing without realizing that
| you have made a mistake. What I look for in a developer is their
| ability to realize a mistake and find ways to fix it.
 
| marginalia_nu wrote:
| https://news.ycombinator.com/item?id=29137141
 
| cuu508 wrote:
| > The observation was that we could probably store the IP
| address, HELO string, FROM address and TO address in a table, and
| send back a 4xx "temporary failure" error the first time we saw
| that particular tuple (or "quad"). A real mail server which did
| SMTP properly would retry at some point, typically 15 minutes to
| an hour later. If it did retry and enough time had elapsed, we
| would allow it through.
| 
| I've run into this form of greylisting, it's quite annoying. My
| service sends one-time login links and authorization codes that
| expire in 15 minutes. If the email gets delayed, the user can
| just try again, right? Except I'm using AWS SES, so the next
| email may very well come from a different address and will get
| delayed again.
 
  | loeg wrote:
  | You could have your codes expire after an hour or day instead?
 
    | cuu508 wrote:
    | That's _a_ solution, but with tradeoffs:
    | 
    | * security, an attacker has more time to intercept and use
    | the links and codes.
    | 
    | * UX, making the user wait 15+ minutes to do certain actions
    | is quite terrible.
    | 
    | I've had a couple support requests about this. The common
    | theme seems to be the customer is using Mimecast, and the fix
    | is to add my sender address in a whitelist somewhere in their
    | Mimecast configuration.
 
| alksjoflkajsom wrote:
| People in the comments are getting (rightfully) outraged about
| the poor understanding of indexing, but I'm a little surprised
| that everyone here doesn't seem to understand normalization
| either. The original schema is perfectly normalized and is
| already in 3NF: none of the columns shown has a dependence on any
| of the other columns outside of the primary key (in other words,
| if you knew eg the values of the ip, helo, and from columns,
| you'd still have no information about the "to" column).
| 
| Normalization does not mean "the same string can only appear
| once". Mapping the string "address1@foo.bar" to a new value "id1"
| has no effect on the relations in your table. Now instead of
| "address1@foo.bar" in 20 different locations, you have "id1" in
| 20 different locations. There's been no actual "deduplication",
| but that's again not the point. Creating the extra tables has no
| impact on the normalization of the data.
 
  | [deleted]
 
  | slim wrote:
  | Plus depending on the amount of emails they get, that
  | optimisation could be unnecessary. That database schema was
  | perfectly fine in some cases.
 
  | colanderman wrote:
  | Came here to say this. Whatever may have been true of InnoDB 20
  | years ago, don't follow this article's advice for any modern
  | relational database.
  | 
  | To add: Not sure about MySQL, but `varchar`/`text` in
  | PostgreSQL for short strings like those in the article is very
  | efficient. It basically just takes up space equaling the length
  | of the string on disk, plus one byte [1].
  | 
  | [1] https://www.postgresql.org/docs/current/datatype-
  | character.h...
 
  | mixmastamyk wrote:
  | Thank you. Was thinking that I was the clueless one...
  | (probably a little) because that is not my understanding of
  | what normalization is.
  | 
  | I thought perhaps swapping the strings to integers might make
  | it easier to index, or perhaps it did indeed help with
  | deduplication in that the implementation didn't "compress"
  | identical strings in a column--saving space and perhaps helping
  | performance. But both issues appeared to be implementation
  | issues with an unsophisticated Mysql circa 2000, rather than a
  | fundamentally wrong schema.
  | 
  | I agreed with her comment at the end about not valuing
  | experience, but proper databasing should be taught to every
  | developer at the undergraduate level, and somehow to the self-
  | taught. Looking at comptia... they don't seem to have a db/sql
  | test, only an "IT-Fundamentals" which touches on it.
 
  | nzealand wrote:
  | > Normalization does not mean "the same string can only appear
  | once".
  | 
  | It can, if you want to easily update an email address, or
  | easily remove all references to an email address because it is
  | PII.
 
    | alksjoflkajsom wrote:
    | That's not what normalization normally means, no. See eg
    | wikipedia
    | https://en.wikipedia.org/wiki/Database_normalization
 
  | noduerme wrote:
  | >>> Mapping the string "address1@foo.bar" to a new value "id1"
  | has no effect on the relations in your table.
  | 
  | How do you mean? If id1 is unique on table A, and table B has a
  | foreign key dependency on A.id, then yeah you still have id1 in
  | twenty locations but it's normalized in that altering the
  | referenced table once will alter the joined value in all twenty
  | cases.
  | 
  | This might not be important in the spam-graylisting use case,
  | and very narrowly it might be 3NF as originally written, but it
  | certainly wouldn't be if there were any other data attached to
  | each email address, such as a weighting value.
 
  | twic wrote:
  | This is very slightly not quite true, because the HELO string
  | and the remote address should go hand-in-hand.
 
    | alksjoflkajsom wrote:
    | You might be right (I don't actually know what a HELO string
    | is, I don't know anything about SMTP :). I was just going off
    | how the author presented the data, as a tuple of four
    | completely independent things.
    | 
    | Of course the main point still stands, that the two schemas
    | are exactly as normalized as each other.
    | 
    | Edit: rereading the original post, the author mentions that
    | "they forged...the HELO"--so perhaps there was indeed no
    | relationship between HELO and IP here. But again, I don't
    | know anything about SMTP, so this could be wrong.
 
      | Brian_K_White wrote:
      | I do know about SMTP, and you were right regardless, because
      | the author was talking about 4 database fields, not SMTP.
      | The details of SMTP are irrelevant.
 
      | GauntletWizard wrote:
      | It's (somewhat) because the HELO is forged that there's no
      | relationship between HELO and IP. The very first thing the
      | client sends in SMTP is "HELO <hostname>", where the
      | hostname can either be a unique identifier
      | (server1.company.com, etc.) or a system-level identifier
      | (mta.company.com for all of your company's outbound mail
      | agents, or in the case of bulk mailers they might use
      | bulk.client1.com when sending as client1, bulk.client2.com,
      | etc.). But there is/was no authentication on what you send
      | as HELO (now you can verify it against the TLS certificate,
      | though many implementations don't do that at all, or don't
      | do it well), so correlating based on the hostname in HELO
      | was questionable at best. Thus, the combination of
      | (HELO, IP) was effectively a single value, taken as a tuple.
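      | 
      | A rough illustration of how little the server can check
      | (hostnames made up):
      | 
      |   $ telnet mx.example.net 25
      |   220 mx.example.net ESMTP
      |   HELO bulk.client1.com
      |   250 mx.example.net
      | 
      | The server doesn't verify that the connecting host really is
      | bulk.client1.com; it just records whatever it was told.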
 
        | Izkata wrote:
        | > But there is/was no authentication on what you send as
        | HELO
        | 
        | Yep, and that explains the "foobar" rows - those should
        | have resolved to the same IP, except that, because
        | there's no authentication blocking it, you could put
        | gibberish there and the SMTP server would accept it.
        | 
        | > so correlating based on the hostname in HELO was
        | questionable at best
        | 
        | Eh, spambots from two different IPs could have both
        | hardcoded "foobar" because of the lack of authentication,
        | so I could see this working to filter
        | legitimate/illegitimate emails from a compromised IP.
 
      | [deleted]
 
| [deleted]
 
| dorianmariefr wrote:
| id | ip_address | from | to | content | updated_at | created_at
| 
| And indexes where you need them. Normalize data like IP
| addresses, from and to. And you are good to go.
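| 
| Roughly like this (a sketch; "from"/"to" renamed since they're
| reserved words, and the index is a guess at the lookup pattern):
| 
|   CREATE TABLE mail_log (
|     id         BIGINT AUTO_INCREMENT PRIMARY KEY,
|     ip_address VARCHAR(45)  NOT NULL,
|     mail_from  VARCHAR(255) NOT NULL,
|     mail_to    VARCHAR(255) NOT NULL,
|     content    TEXT,
|     updated_at DATETIME NOT NULL,
|     created_at DATETIME NOT NULL,
|     KEY idx_lookup (ip_address, mail_from, mail_to)
|   );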
 
| fullstackchris wrote:
| I guess I've come away with a totally different takeaway than
| most. This post is rather strong on the blame game, which could
| be fixed by one thing...
| 
| RTFD! (Read The F**in Docs!) - I only skimmed the post, but it's
| definitely something that would have been avoided had some SQL
| documentation or introduction been read.
| 
| I'd argue one of the huge things that differentiates "senior"
| developers from all the "other" levels - we're not smarter or
| more clever than anyone else - is that we read up on the tools we
| use, see how they work, and read how others have used them
| before. I understand this was from 2002, but MySQL came out in
| 1995 - there was certainly at least a handful of books on the
| topic.
| 
| Perhaps when just starting off as an intern, you may be able to
| argue that you are 'operating in a vacuum', but any number of
| introductory SQL books or documentation could quickly reveal
| solutions to the problem encountered in the post. (Some of which
| are suggested in these comments).
| 
| Of course we all make mistakes in software - I definitely could
| see myself creating such a schema and forgetting to add any sort
| of indexing - but when running into performance issues later, the
| only way you'll be able to know what to do next to fix it is by
| having read literature about details of the tools you are using.
| 
| Operating in a vacuum? Then break out of it and inform yourself.
| Many people have spent many hours creating good documentation and
| tutorials on many, many software tools - use them.
 
  | fwip wrote:
  | The problem here is "unknown unknowns," coupled with the real
  | business pressure to get solutions out the door. The MySQL
  | manual is (currently) over 6,000 pages. A developer new to SQL
  | or relational DBs doesn't know how many of those are necessary
  | to read and comprehend to deploy a working solution.
  | 
  | And in this case, she designed and deployed the system, and it
  | worked and met the business needs for several months. When
  | performance became an issue, she optimized. I'm sure she had
  | plenty of other unrelated work to do in the meantime,
  | especially as a lead/solo dev in the early 2000's.
  | 
  | Sounds like a productive developer to me.
 
| GuB-42 wrote:
| The problem, I think, is that most people, me included, don't
| really know what databases actually do. There is a whole lot
| written about optimizing procedural code: a variety of tools, the
| dangers of premature optimization, the tradeoff with readability.
| Anyone with a computer science degree has heard about algorithmic
| complexity, caches, etc...
| 
| But databases are just magic. You try things out, usually
| involving a CREATE INDEX at some point, and sometimes it gets
| faster, so you keep it.
| 
| Rachel, in her blog post, is a good example of that thought
| process. She used a "best practice", added an index (there is
| always an index) and it made her queries faster, cool. I don't
| blame her, it works, and it is a good reminder of the 3NF
| principle. But work like that on procedural code and I'm sure we
| would get plenty of reactions like "why no profiler?".
| 
| Many, many programmers write SQL, but very few seem to know about
| query plans and the way the underlying data structures work. It
| almost looks like secret knowledge of the DBA caste or something.
| I know it is all public knowledge of course, but it is rarely
| taught, and the little I know about it is all from personal
| curiosity.
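| 
| Even just putting EXPLAIN in front of a slow query goes a long
| way (MySQL syntax here, and the table name is invented):
| 
|   EXPLAIN SELECT * FROM greylist
|    WHERE ip = '10.1.2.3' AND helo = 'foobar';
| 
| If the "type" column says ALL and "rows" is huge, you're doing a
| full table scan; ref/range with a small row estimate means an
| index is actually being used.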
 
  | musingsole wrote:
  | > But databases are just magic.
  | 
  | 100%!!!
  | 
  | I hate how developers talk about a "database" as a monolithic
  | concept. It's an abstract concept with countless
  | implementations built on competing philosophies of that
  | abstraction. SQL is only slightly more concrete, but there are
  | as many variants and special quirks of SQL dialects out there
  | as there are databases.
 
| matsemann wrote:
| > _The first time you encounter something, you 're probably going
| to make some mistakes. There's a post going around tonight about
| how someone forgot to put an index on some database thing and
| wound up doing full table scans (or something like that)._
| 
| Which post is referenced here?
| 
| I also believe this to be a tooling issue. It's often opaque what
| ends up being run after I've done something in some framework
| (Java JPA, Django queries, whatever): how many queries run (are
| n+1 issues kept at bay?), how the queries will behave, etc.
| Locally with little data everything is fine, until it blows up in
| production. But you may not even notice it blowing up in
| production, because that relies on someone having instrumented
| the db and pushed logs+alarms somewhere. So it's easy to remain
| clueless.
 
  | claytonjy wrote:
  | Pretty sure it's this one
  | https://news.ycombinator.com/item?id=29132572
 
| webstrand wrote:
| While the normalized version is more compact and doesn't store
| redundant data, it _also_ needs an index on the four columns or
| it'll have to check every row in the table. A similar index added
| to the original denormalized table would have given comparable
| query performance.
| 
| The table schema isn't terrible, it's just not great. It's a fine
| first pass, to be optimized once it's discovered to be overly
| large.
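| 
| Something like this on the original wide table (column names
| approximated, since the post doesn't show the exact schema)
| would have served the same lookup:
| 
|   CREATE UNIQUE INDEX idx_tuple
|       ON greylist (ip, helo, mail_from, rcpt_to);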
 
  | [deleted]
 
  | Closi wrote:
  | > The table schema isn't terrible, it's just not great.
  | 
  | It depends - if this was the full use-case then maybe a single
  | table is actually a pretty good solution.
 
  | [deleted]
 
| ehershey wrote:
| Proud of myself for recognizing the conceit (pun intended) here
| pretty quickly. Good point in the end too!
 
| Donckele wrote:
| Sorry Rachel, but the best solution was a simple fix, as
| mentioned by most people here - columns with the right types and
| indexes.
| 
| When I was still wet behind the ears I was doing what you are now
| proposing - but it's overkill.
 
| jugg1es wrote:
| One of the most useful lessons I ever learned about designing
| database schemas was the utter uselessness of indexing and
| searching on datetime fields. Since virtually every value is
| going to be different, indexing and searching on that field is
| (almost) no better than having no index on the datetime field at
| all.
| 
| It was a revelation to me when I decided to experiment with
| having an indexed date-ONLY field and using that to search
| instead. It improved performance by almost two orders of
| magnitude. In hindsight, this should have been totally obvious if
| I had stopped to think about how indexing works. But, as is
| stated in this article, we have to keep relearning the same
| lessons. Maybe I should write an article about how useless it is
| to index a datetime field...
| 
| EDIT: This was something I ran into about 10 years ago. It is
| possible there was something else going on at the time that I
| didn't know about that caused the issue I was seeing. This is an
| anecdote from my past self. I have not had to use this technique
| since then, and we were using an on-prem server. It's possible
| that the rest of the table was not designed well and the index I
| was trying to use was already inefficient for the resources that
| we had at the time.
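| 
| The workaround, roughly (reconstructed from memory, so the names
| are approximate):
| 
|   ALTER TABLE events ADD COLUMN event_date DATE;
|   UPDATE events SET event_date = DATE(event_time);
|   CREATE INDEX idx_event_date ON events (event_date);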
 
  | dilyevsky wrote:
  | You're maybe thinking of compression? The primary id is unique
  | too, but an indexed search is still O(log n), compared to a
  | full table scan's O(n).
 
  | blibble wrote:
  | there are more index types than just hashes (which do indeed
  | only support equality)
  | 
  | e.g. btrees allow greater than/less than comparisons
  | 
  | postgres supports these and many more:
  | https://www.postgresql.org/docs/9.5/indexes-types.html
 
    | jugg1es wrote:
    | I was using a BTREE index and we were doing greater/less than
    | queries. It was not a hash index.
 
  | ComodoHacker wrote:
  | Did you also learn that certain indexes allow for range
  | queries, and modern databases are quite efficient at those?
  | 
  | And yes, please write an article, it would be quite
  | interesting. With data and scripts to reproduce, of course.
 
  | pushrax wrote:
  | This entirely depends on what kind of index is used. A sorted
  | index, such as a B or B+ tree (used in many SQL databases),
  | will allow for fast point/range lookups in a continuous value
  | space. A typical inverted index or hash based index only allows
  | point lookups of specific values in a discrete value space.
 
    | jugg1es wrote:
    | It was a sorted BTREE index in MySQL 5.x. I agree that it's
    | supposed to be fast, but it just wasn't for some reason.
 
      | pushrax wrote:
      | Are you sure it was actually using the index you expected?
      | There are subtleties in index field order that can prevent
      | the query optimizer from using an index that you might
      | think it should be using.
      | 
      | One common misstep is having a table with columns like (id,
      | user, date), with an index on (user, id) and on (date, id),
      | then issuing a query like "... WHERE user = 1 and date >
      | ...". There is no optimal index for that query, so the
      | optimizer will have to guess which one is better, or try to
      | intersect them. In this example, it might use only the
      | (user, id) index, and scan the date for all records with
      | that user. A better index for this query would be (user,
      | date, id).
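      | 
      | To make that concrete (table name invented, columns as
      | above):
      | 
      |   -- either index alone leaves part of the predicate to scan
      |   CREATE INDEX idx_user_id ON events (`user`, id);
      |   CREATE INDEX idx_date_id ON events (`date`, id);
      | 
      |   -- covers WHERE `user` = ? AND `date` > ? directly
      |   CREATE INDEX idx_user_date_id ON events (`user`, `date`, id);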
 
      | Izkata wrote:
      | One of the more bizarre things we'd found in MySQL
      | 5.something was that accidentally creating two identical
      | indexes significantly slowed down queries that used them.
      | 
      | I wouldn't be surprised if you hit some sort of similar
      | strange bug.
 
    | whoknowswhat11 wrote:
    | I was just going to make this comment.
    | 
    | Postgres as far as I know uses B-tree by default.
    | 
    | You can switch sort order I think for this as well, so "most
    | recent" becomes more efficient.
    | 
    | Multi-column indexes also work; if you are just searching on
    | the first column, Postgres can still use a multi-column index.
 
  | dreyfan wrote:
  | B-Tree indexes perform just fine on datetime fields. Were you
  | using hash or inverted indexes maybe?
 
  | Diggsey wrote:
  | > Since virtually every value is going to be different,
  | indexing and searching on that field is (almost) no better than
  | having no index on the datetime field at all.
  | 
  | Do you mean exact equality is rarely what you want because most
  | of the values are different?
  | 
  | Or are you talking about the negative effect on index size of
  | having so many distinct values?
  | 
  | I think the latter point could be quite database-dependent, eg.
  | BTree de-duplication support was only added in Postgres 13.
  | However, you could shave off quite a bit just from the fact
  | that storing a date requires less space in the index than a
  | datetime.
 
  | drchickensalad wrote:
  | I have no idea how you came to this conclusion, but indices on
  | datetime fields are completely required to bring down the seek
  | time from O(n) to O(log n), plus the length of the range.
  | Massive parts of the businesses I've worked at would be simply
  | impossible without this.
  | 
  | The cardinality of the index has other ramifications that
  | aren't generally important here.
 
  | hn_throwaway_99 wrote:
  | This is a very strange, and incorrect, conclusion to have come
  | to. It doesn't matter that datetime values are unique, as most
  | of the time you are not searching for a particular datetime but
  | a range, i.e. "show me all of the rows created between date X
  | and Y". In that case, the ordering of the index makes it
  | efficient to do that query.
  | 
  | Furthermore, oftentimes your query will have an "ORDER BY
  | dateTimeCol" clause on it (or, more commonly, ORDER BY
  | dateTimeCol DESC). If your index is created correctly, rows can
  | be returned with a quick index scan instead of needing to
  | implement a sort as part of the execution plan.
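  | 
  | For example (names invented), a query like
  | 
  |   SELECT * FROM events
  |    WHERE created_at >= '2021-11-01' AND created_at < '2021-11-08'
  |    ORDER BY created_at DESC
  |    LIMIT 100;
  | 
  | can be answered with a single backwards range scan of a B-tree
  | index on created_at, with no separate sort step.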
 
  | mceachen wrote:
  | Your result is surprising: I suspect your wide table with an
  | index wasn't in cache, and your timestamp-only table was.
  | 
  | The whole point is that indexes (can) prevent full table scans,
  | which are expensive. This is true even if the column(s) you're
  | indexing have high cardinality: but it relies on performant
  | table or index joins (which, if your rdbms is configured with
  | sufficient memory, should be the case).
 
| nlitened wrote:
| > The rub is that instead of just being slow, it also cost a fair
| amount of money because this crazy vendor system charged by the
| row or somesuch. So, by scanning the whole table, they touched
| all of those rows, and oh hey, massive amounts of money just set
| ablaze!
| 
| Why _the hell_ is nobody mentioning that using a database that
| charges per row touched is absolute insanity? When did it become
| so normal that nobody even mentions it?
 
  | lern_too_spel wrote:
  | If the amount charged weren't proportional to the work done,
  | the cloud provider would quickly go out of business.
 
    | nlitened wrote:
    | My VPS provider gives me unlimited traffic, unlimited disk
    | reads, and unlimited DB rows touched for a fixed 30 dollars a
    | month, regardless of how much CPU and RAM I keep loaded.
    | 
    | This makes me think that these things are at least 100x
    | cheaper than AWS might want to make me believe.
 
  | ricardobeat wrote:
  | You'll probably enjoy reading this article that was on the
  | front page yesterday:
  | https://briananglin.me/posts/spending-5k-to-learn-how-databa...
 
| [deleted]
 
| inopinatus wrote:
| The stench of *splaining in these remarks is nigh overwhelming,
| and mostly missing the point to boot.
 
___________________________________________________________________
(page generated 2021-11-07 23:00 UTC)