[HN Gopher] Jaccard Index
___________________________________________________________________
 
Jaccard Index
 
Author : dedalus
Score  : 91 points
Date   : 2023-03-19 19:33 UTC (3 hours ago)
 
web link (en.wikipedia.org)
w3m dump (en.wikipedia.org)
 
| SubiculumCode wrote:
| Jaccard my Dice please.
 
  | SubiculumCode wrote:
  | Jests aside, I've mostly used the closely related Dice
  | coefficient when measuring segmentation reliability.
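A minimal sketch of the relationship (set contents are illustrative, e.g. tissue labels in a segmentation): the Dice coefficient and the Jaccard index are monotonically related by D = 2J / (1 + J), so they rank segmentations identically.

```python
def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

a, b = {"gm", "wm", "csf"}, {"gm", "wm", "lesion"}
j, d = jaccard(a, b), dice(a, b)  # j = 2/4 = 0.5, d = 2*2/6 = 2/3
assert abs(d - 2 * j / (1 + j)) < 1e-12
```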
 
| psyklic wrote:
| This is one of my favorite distance metrics* to show people!
| 
| For example, perhaps one person likes Reddit and HN, while
| someone else likes HN and SO.
| 
| Then their Jaccard Index would be 1/3, since they have one thing
| in common out of three.
| 
| * Technically it computes "similarity" (larger number == more
| similar), but `1 - Jaccard Index` is a distance (smaller number
| == more similar).
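The example above, as a minimal sketch (the site sets are from the comment; the function name is illustrative):

```python
def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

alice = {"reddit", "hn"}
bob = {"hn", "so"}

similarity = jaccard(alice, bob)  # 1 shared site out of 3 total -> 1/3
distance = 1 - similarity         # Jaccard distance -> 2/3
```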
 
| startup_eng wrote:
| I just used this at work the other day to calculate similarities
| between different data models that had overlapping child
| models. One of our teams was going to go through manually to
| check these overlaps and consolidate, but by using this
| clustering algo based on Jaccard distance we were able to give
| them clusters to consolidate up front. Super cool stuff!
 
  | sonofaragorn wrote:
  | What is a child model? I'm curious, but I can't really follow
  | what you wrote; can you add a bit more context?
 
| coeneedell wrote:
| I recently used Jaccard similarity as a measurement of distance
| between two sets of online articles. It's amazing how versatile
| it is for all sorts of weird tasks.
 
  | paulgb wrote:
  | I used to use Jaccard similarity combined with w-shingling at
  | the character level to detect clusters of fraud sites. It was
  | surprisingly effective, because it was able to pick up common
  | patterns in the code even if they used completely different
  | styles and text.
  | 
  | https://en.m.wikipedia.org/wiki/W-shingling
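A rough sketch of the idea, assuming character-level shingling (the page snippets and shingle width are made up for illustration):

```python
def shingles(text, w=8):
    """All character w-grams (w-shingles) of the text, as a set."""
    return {text[i:i + w] for i in range(max(1, len(text) - w + 1))}

def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

# Two pages with different visible text but near-identical markup
# still share most of their shingles.
page_a = "<div class=hero><h1>You won a prize!</h1><p>Claim now</p>"
page_b = "<div class=hero><h1>You won an iPad!</h1><p>Claim now</p>"
similarity = jaccard(shingles(page_a), shingles(page_b))
```

Pages generated from the same fraud kit score high even when the visible wording differs, which is the effect described above.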
 
    | jethkl wrote:
    | Interesting - I also used Jaccard similarity to classify
    | clusters of malicious ad traffic schemes. The idea worked
    | well. It was unclear if the similarity was due to mimicry or
    | authorship, but that did not matter for our use.
 
| unethical_ban wrote:
| What are the predicted bounding box and the ground truth
| bounding box, as illustrated with the stop sign? I have no idea
| what's happening there.
 
  | montroser wrote:
  | The "predicted" box there would be a best guess from a
  | statistical model powered by AI or computer vision, answering,
  | "where is the stop sign in this image?". The "ground truth"
  | would be an annotation by a human answering the same question.
  | The Jaccard similarity metric would say that these bounding
  | boxes are highly similar, and so the prediction could be
  | evaluated as high quality.
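In that setting the Jaccard index of the two pixel sets is usually called intersection-over-union (IoU). A minimal sketch for axis-aligned boxes (coordinates are made up):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

predicted = (48, 50, 152, 148)     # model's guess at the stop sign
ground_truth = (50, 50, 150, 150)  # human annotation
quality = iou(predicted, ground_truth)  # close to 1.0 -> good prediction
```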
 
| pncnmnp wrote:
| I recently wrote a fun blog post
| (https://pncnmnp.github.io/blogs/odd-sketches.html) about how to
| estimate Jaccard Similarity using min hashing, what b-bit min
| hashing is, and how to improve upon its limitations using a 2014
| data structure called odd sketches.
| 
| Jaccard Similarity's history is also quite interesting. From my
| blog:
| 
| > In the late 19th century, the United States and several
| European nations were focused on developing strategies for
| weather forecasting, particularly for storm warnings. In 1884,
| Sergeant John Finley of the U.S. Army Signal Corps conducted
| experiments aimed at creating a tornado forecasting program for
| 18 regions in the United States east of the Rockies. To the
| surprise of many, Finley claimed his programs were 95.6% to 98.6%
| accurate, with some areas even achieving a 100% accuracy rate.
| Upon publishing his findings, Finley's methods were criticized by
| contemporaries who pointed out flaws in his verification
| strategies and proposed their solutions. This sparked a renewed
| interest in weather prediction, which is now referred to as the
| "Finley Affair."
| 
| > One of these contemporaries was Grove Karl Gilbert. Just two
| months after Finley's publication, Gilbert pointed out that,
| based on Finley's strategy, a 98.2% accuracy rate could be
| achieved simply by forecasting no tornado warning. Gilbert then
| introduced an alternative strategy, which is now known as Jaccard
| Similarity.
| 
| > So why is it named Jaccard Similarity? As it turns out, nearly
| three decades after Sergeant John Finley's tornado forecasting
| program in the 1880s, Paul Jaccard independently developed the
| same concept while studying the distribution of alpine flora.
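The MinHash estimate mentioned at the top of the comment can be sketched like this (the hash construction and parameters here are illustrative, not taken from the blog post):

```python
import hashlib

def minhash_signature(items, num_hashes=256):
    """One salted hash function per signature slot; keep the minimum
    hash value seen across the set."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for x in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The probability that two signatures agree in a slot equals the
    Jaccard similarity of the underlying sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {f"user{i}" for i in range(100)}
b = {f"user{i}" for i in range(50, 150)}
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
# est should be close to the true Jaccard similarity, 50/150 = 1/3
```

b-bit min hashing and odd sketches, as described in the post, reduce the space per slot further at some cost in estimator variance.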
 
| stygiansonic wrote:
| The name may be an example of this:
| https://en.m.wikipedia.org/wiki/Stigler%27s_law_of_eponymy
| 
|  _It was developed by Grove Karl Gilbert in 1884 as his ratio of
| verification (v)[1] and now is frequently referred to as the
| Critical Success Index in meteorology.[2] It was later developed
| independently by Paul Jaccard..._
 
  | jszymborski wrote:
  | Another example of this sort of thing (that is vaguely related
  | in that it's commonly used as a metric) is (what I call) the
  | Matthew's Correlation Coefficient
  | https://en.wikipedia.org/wiki/Phi_coefficient
  | 
  | > In machine learning, it is known as the Matthews correlation
  | coefficient (MCC) ... introduced by biochemist Brian W.
  | Matthews in 1975.[1] Introduced by Karl Pearson,[2] and also
  | known as the Yule phi coefficient from its introduction by Udny
  | Yule in 1912
 
  | dalke wrote:
  | In my field, cheminformatics, we refer to it as "Tanimoto
  | similarity" because it was (quoting Wikipedia) "independently
  | formulated again by T. Tanimoto."
  | 
  | It's an odd set of linkages to get there. First, "Dr. David J.
  | Rogers of the New York Botanical Gardens" proposed a problem to
  | Tanimoto, who published the writeup in an internal IBM report
  | in 1958. (I understand there was a lot of mathematical research
  | in taxonomy at the time.) In 1960 Rogers and Tanimoto published
  | an updated version in Science.
  | 
  | In 1973 Adamson and Bush at Sheffield University developed a
  | method for the automatic classification of chemical structures.
  | They tried Dice, phi, and Sneath as their comparison methods
  | but not Tanimoto. In their updated 1975 publication they write
  | "Several coefficients have been proposed based on this
  | criterion", with a list of citations, including the Rogers and
  | Tanimoto paper as citation 14.
  | 
  | In 1986, Peter Willett at Sheffield revisits this work and
  | finds that Tanimoto gives overall better results when applied
  | to what are now called cheminformatics "fingerprints". He uses
  | "Tanimoto", with no direct citation for the source of that
  | definition.
  | 
  | This similarity method is easy to implement, and many
  | organizations already have pre-computed fingerprints (they are
  | used as pre-filters for graph queries), so the concept and
  | nomenclature take off almost immediately, with "Tanimoto" as
  | the preferred name.
  | 
  | It's not until 1991 that I can find a paper in my field
  | referring to the earlier work by Jaccard (the paper uses
  | "Tanimoto (Jaccard)").
  | 
  | I have found some papers in related fields (e.g., in IR and mass
  | spectra analysis) which reference Tanimoto similarity, but
  | nothing to the extent that my field uses it.
 
| ketralnis wrote:
| Quoting myself from a while ago[0]
| 
| At reddit many moons ago before machine learning was a buzzword
| one early iteration of recommendations was based on Jaccard
| distance using the number of co-voters between subreddits. But
| with one twist: divide by the size of the smaller subreddit.
| relatedness a b =
|     numerator   = | voters on(a) ∩ voters on(b) |
|     denominator = | voters on(a) ∪ voters on(b) |
|     weight      = min(|voters on(a)|, |voters on(b)|)
|     numerator / (weight * denominator)
| 
| That gives you a directional relatedness, that is
| programming->python but not necessarily python->programming. Used
| this way you account for the giant subreddit problem[1]
| automatically but now the results are less "amitheasshole is
| related to askreddit" and more like "linguisticshumor is a more
| niche version of linguistics".
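The quoted formula, as a minimal sketch (the voter-id sets below are hypothetical):

```python
def relatedness(voters_a, voters_b):
    """Jaccard similarity down-weighted by the size of the smaller
    voter set -- the "twist" described above."""
    numerator = len(voters_a & voters_b)
    denominator = len(voters_a | voters_b)
    weight = min(len(voters_a), len(voters_b))
    return numerator / (weight * denominator)

programming = set(range(1000))  # voter ids, made up
python = set(range(500, 1200))  # overlaps heavily with programming
score = relatedness(programming, python)
```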
| 
| The great thing is that it's actually more actionable as far as
| recommendations go! Everybody has already heard of the bigger
| version of this subreddit, but they probably haven't heard of the
| smaller versions. And it's self-correcting: as a subreddit gets
| bigger we are less likely to recommend it, which is great because
| it needs our help less.
| 
| It's also easy to compute because it lends itself to one giant
| SQL query that postgres or even sqlite[2] optimises reasonably
| well. It has some discontinuities around very tiny subreddits,
| so there was also a hacky heuristic to just exclude them. It
| does get fairly static, though: once we've picked 3 subreddits
| to recommend on subreddit A, if you don't like them we'll just
| keep showing them anyway. I had a hack in mind for that (use the
| computed values as random weights so we'll still occasionally
| show lower-scoring ones) but by this time people much smarter
| than I took over recommendations with more holistic solutions to
| the problem we were trying to solve in the first place. Still,
| as a first pass it worked great, and based on my experience I'd
| recommend simple approaches like this before you break out the
| linear algebra.
| 
| Side note, I tried co-commenters in addition to co-voters. The
| results tended to be more accurate in my spot tests, but the
| difference fell away in more proper cross-validation testing and
| I didn't look into where the qualitative difference was. But
| since there are more votes than comments on small subreddits,
| the number of recommendable subreddits was higher with votes. I
| reasoned that co-submitters (of posts) should be even more
| accurate, but it was thrown off by a small number of spammers
| and I didn't want to mess with combining those tasks at the
| time.
| 
| [0]: https://news.ycombinator.com/item?id=22178517
| 
| [1]: that votes are distributed according to a power law, meaning
| that everybody has voted on the largest subreddits so most
| clustering approaches recommend askreddit to everybody. That's
| okay for product recommendations where "you should buy the most
| popular CPU, it's most popular for a reason" but for subreddits
| you already know that so we want a way to bias to the most
| "surprising" of your votes.
| 
| [2]: I prototyped it on sqlite on my laptop and even with close
| to the production amount of data it ran reasonably well. Not
| fast, but fine. This was on considerably less traffic than
| today, mind.
 
| seydor wrote:
| didn't know there was a name for it
 
___________________________________________________________________
(page generated 2023-03-19 23:00 UTC)