[HN Gopher] Jaccard Index
___________________________________________________________________
 
Jaccard Index
 
Author : dedalus
Score  : 91 points
Date   : 2023-03-19 19:33 UTC (3 hours ago)
 
web link (en.wikipedia.org)
w3m dump (en.wikipedia.org)
 
| SubiculumCode wrote:
| Jaccard my Dice please.
 
  | SubiculumCode wrote:
  | Jests aside, I've mostly used the closely related Dice
  | coefficient when measuring segmentation reliability.
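A minimal sketch of the relationship (set contents are illustrative, e.g. tissue labels in a segmentation): the Dice coefficient and the Jaccard index are monotonically related by D = 2J / (1 + J), so they rank segmentations identically.

```python
def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

a, b = {"gm", "wm", "csf"}, {"gm", "wm", "lesion"}
j, d = jaccard(a, b), dice(a, b)  # j = 2/4 = 0.5, d = 2*2/6 = 2/3
assert abs(d - 2 * j / (1 + j)) < 1e-12
```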
 
| psyklic wrote:
| This is one of my favorite distance metrics* to show people!
| 
| For example, perhaps one person likes Reddit and HN, while
| someone else likes HN and SO.
| 
| Then their Jaccard Index would be 1/3, since they have one thing
| in common out of three.
| 
| * Technically it computes "similarity" (larger number == more
| similar), but `1 - Jaccard Index` is a distance (smaller number
| == more similar).
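The example above, as a minimal sketch (the site sets are from the comment; the function name is illustrative):

```python
def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

alice = {"reddit", "hn"}
bob = {"hn", "so"}

similarity = jaccard(alice, bob)  # 1 shared site out of 3 total -> 1/3
distance = 1 - similarity         # Jaccard distance -> 2/3
```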
 
| startup_eng wrote:
| I just used this at work the other day to calculate similarities
| between different data models that had overlapping child
| models. One of our teams was going to go through manually to
| check these overlaps and consolidate, but by using this
| clustering algo based on Jaccard distance we were able to give
| them clusters to consolidate up front. Super cool stuff!
 
  | sonofaragorn wrote:
  | What is a child model? I'm curious, but I can't really follow
  | what you wrote; can you add a bit more context?
 
| coeneedell wrote:
| I recently used Jaccard similarity as a measurement of distance
| between two sets of online articles. It's amazing how versatile
| it is for all sorts of weird tasks.
 
  | paulgb wrote:
  | I used to use Jaccard similarity combined with w-shingling at
  | the character level to detect clusters of fraud sites. It was
  | surprisingly effective, because it was able to pick up common
  | patterns in the code even if they used completely different
  | styles and text.
  | 
  | https://en.m.wikipedia.org/wiki/W-shingling
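A rough sketch of the idea, assuming character-level shingling (the page snippets and shingle width are made up for illustration):

```python
def shingles(text, w=8):
    """All character w-grams (w-shingles) of the text, as a set."""
    return {text[i:i + w] for i in range(max(1, len(text) - w + 1))}

def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

# Two pages with different visible text but near-identical markup
# still share most of their shingles.
page_a = "<div class=hero><h1>You won a prize!</h1><p>Claim now</p>"
page_b = "<div class=hero><h1>You won an iPad!</h1><p>Claim now</p>"
similarity = jaccard(shingles(page_a), shingles(page_b))
```

Pages generated from the same fraud kit score high even when the visible wording differs, which is the effect described above.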
 
    | jethkl wrote:
    | Interesting - I also used Jaccard similarity to classify
    | clusters of malicious ad traffic schemes. The idea worked
    | well. It was unclear if the similarity was due to mimicry or
    | authorship, but that did not matter for our use.
 
| unethical_ban wrote:
| What are the predicted bounding box and the ground truth
| bounding box, as illustrated with the stop sign? I have no idea
| what's happening there.
 
  | montroser wrote:
  | The "predicted" box there would be a best guess from a
  | statistical model powered by AI or computer vision, answering,
  | "where is the stop sign in this image?". The "ground truth"
  | would be an annotation by a human answering the same question.
  | The Jaccard similarity metric would say that these bounding
  | boxes are highly similar, and so the prediction could be
  | evaluated as high quality.
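In that setting the Jaccard index of the two pixel sets is usually called intersection-over-union (IoU). A minimal sketch for axis-aligned boxes (coordinates are made up):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

predicted = (48, 50, 152, 148)     # model's guess at the stop sign
ground_truth = (50, 50, 150, 150)  # human annotation
quality = iou(predicted, ground_truth)  # close to 1.0 -> good prediction
```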
 
| pncnmnp wrote:
| I recently wrote a fun blog post
| (https://pncnmnp.github.io/blogs/odd-sketches.html) about how to
| estimate Jaccard Similarity using min hashing, what b-bit min
| hashing is, and how to improve upon its limitations using a 2014
| data structure called odd sketches.
| 
| Jaccard Similarity's history is also quite interesting. From my
| blog:
| 
| > In the late 19th century, the United States and several
| European nations were focused on developing strategies for
| weather forecasting, particularly for storm warnings. In 1884,
| Sergeant John Finley of the U.S. Army Signal Corps conducted
| experiments aimed at creating a tornado forecasting program for
| 18 regions in the United States east of the Rockies. To the
| surprise of many, Finley claimed his programs were 95.6% to 98.6%
| accurate, with some areas even achieving a 100% accuracy rate.
| Upon publishing his findings, Finley's methods were criticized by
| contemporaries who pointed out flaws in his verification
| strategies and proposed their solutions. This sparked a renewed
| interest in weather prediction, which is now referred to as the
| "Finley Affair."
| 
| > One of these contemporaries was Grove Karl Gilbert. Just two
| months after Finley's publication, Gilbert pointed out that,
| based on Finley's strategy, a 98.2% accuracy rate could be
| achieved simply by forecasting no tornado warning. Gilbert then
| introduced an alternative strategy, which is now known as Jaccard
| Similarity.
| 
| > So why is it named Jaccard Similarity? As it turns out, nearly
| three decades after Sergeant John Finley's tornado forecasting
| program in the 1880s, Paul Jaccard independently developed the
| same concept while studying the distribution of alpine flora.
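The MinHash estimate mentioned at the top of the comment can be sketched like this (the hash construction and parameters here are illustrative, not taken from the blog post):

```python
import hashlib

def minhash_signature(items, num_hashes=256):
    """One salted hash function per signature slot; keep the minimum
    hash value seen across the set."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for x in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The probability that two signatures agree in a slot equals the
    Jaccard similarity of the underlying sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {f"user{i}" for i in range(100)}
b = {f"user{i}" for i in range(50, 150)}
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
# est should be close to the true Jaccard similarity, 50/150 = 1/3
```

b-bit min hashing and odd sketches, as described in the post, reduce the space per slot further at some cost in estimator variance.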
 
| stygiansonic wrote:
| The name may be an example of this:
| https://en.m.wikipedia.org/wiki/Stigler%27s_law_of_eponymy
| 
|  _It was developed by Grove Karl Gilbert in 1884 as his ratio of
| verification (v)[1] and now is frequently referred to as the
| Critical Success Index in meteorology.[2] It was later developed
| independently by Paul Jaccard..._
 
  | jszymborski wrote:
  | Another example of this sort of thing (that is vaguely related
  | in that it's commonly used as a metric) is (what I call) the
  | Matthew's Correlation Coefficient
  | https://en.wikipedia.org/wiki/Phi_coefficient
  | 
  | > In machine learning, it is known as the Matthews correlation
  | coefficient (MCC) ... introduced by biochemist Brian W.
  | Matthews in 1975.[1] Introduced by Karl Pearson,[2] and also
  | known as the Yule phi coefficient from its introduction by Udny
  | Yule in 1912
 
  | dalke wrote:
  | In my field, cheminformatics, we refer to it as "Tanimoto
  | similarity" because it was (quoting Wikipedia) "independently
  | formulated again by T. Tanimoto."
  | 
  | It's an odd set of linkages to get there. First, "Dr. David J.
  | Rogers of the New York Botanical Gardens" proposed a problem to
  | Tanimoto, who published the writeup in an internal IBM report
  | in 1958. (I understand there was a lot of mathematical research
  | in taxonomy at the time.) In 1960 Rogers and Tanimoto published
  | an updated version in Science.
  | 
  | In 1973 Adamson and Bush at Sheffield University developed a
  | method for the automatic classification of chemical structures.
  | They tried Dice, phi, and Sneath as their comparison methods
  | but not Tanimoto. In their updated 1975 publication they write
  | "Several coefficients have been proposed based on this
  | criterion", with a list of citations, including the Rogers and
  | Tanimoto paper as citation 14.
  | 
  | In 1986, Peter Willett at Sheffield revisits this work and
  | finds that Tanimoto gives overall better results when applied
  | to what are now called cheminformatics "fingerprints". He uses
  | "Tanimoto", with no direct citation for the source of that
  | definition.
  | 
  | This similarity method is easy to implement, and many
  | organizations already have pre-computed fingerprints (they are
  | used as pre-filters for graph queries), so the concept and
  | nomenclature take off almost immediately, with "Tanimoto" as
  | the preferred name.
  | 
  | It's not until 1991 that I can find a paper in my field
  | referring to the earlier work by Jaccard (the paper uses
  | "Tanimoto (Jaccard)").
  | 
  | I have found some papers in related fields (e.g., in IR and mass
  | spectra analysis) which reference Tanimoto similarity, but
  | nothing to the extent that my field uses it.
 
| ketralnis wrote:
| Quoting myself from a while ago[0]
| 
| At reddit many moons ago before machine learning was a buzzword
| one early iteration of recommendations was based on Jaccard
| distance using the number of co-voters between subreddits. But
| with one twist: divide by the size of the smaller subreddit.
| relatedness a b =
|     numerator   = | voters on(a) ∩ voters on(b) |
|     denominator = | voters on(a) ∪ voters on(b) |
|     weight      = min(|voters on(a)|, |voters on(b)|)
|     numerator / (weight * denominator)
| 
| That gives you a directional relatedness, that is
| programming->python but not necessarily python->programming. Used
| this way you account for the giant subreddit problem[1]
| automatically but now the results are less "amitheasshole is
| related to askreddit" and more like "linguisticshumor is a more
| niche version of linguistics".
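The quoted formula, as a minimal sketch (the voter-id sets below are hypothetical):

```python
def relatedness(voters_a, voters_b):
    """Jaccard similarity down-weighted by the size of the smaller
    voter set -- the "twist" described above."""
    numerator = len(voters_a & voters_b)
    denominator = len(voters_a | voters_b)
    weight = min(len(voters_a), len(voters_b))
    return numerator / (weight * denominator)

programming = set(range(1000))  # voter ids, made up
python = set(range(500, 1200))  # overlaps heavily with programming
score = relatedness(programming, python)
```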
| 
| The great thing is that it's actually more actionable as far as
| recommendations go! Everybody has already heard of the bigger
| version of this subreddit, but they probably haven't heard of the
| smaller versions. And it's self-correcting: as a subreddit gets
| bigger we are less likely to recommend it, which is great because
| it needs our help less.
| 
| It's also easy to compute because it lends itself to one giant
| SQL query that postgres or even sqlite[2] optimises reasonably
| well. It has some discontinuities around very tiny subreddits,
| so there was also a hacky heuristic to just exclude them. It
| does get fairly static, though: once we've picked 3 subreddits
| to recommend on subreddit A, if you don't like them we'll just
| keep showing them anyway. I had a hack in mind for that (use the
| computed values as random weights so we'll still occasionally
| show lower-scoring ones) but by this time people much smarter
| than I took over recommendations with more holistic solutions to
| the problem we were trying to solve in the first place. Still,
| as a first pass it worked great, and based on my experience I'd
| recommend simple approaches like this before you break out the
| linear algebra.
| 
| Side note, I tried co-commenters in addition to co-voters. The
| results tended to be more accurate in my spot tests, but the
| difference fell away in more proper cross-validation testing and
| I didn't look into where the qualitative difference was. But
| since there are more votes than comments on small subreddits,
| the number of recommendable subreddits was higher with votes. I
| reasoned that co-submitters (of posts) should be even more
| accurate, but it was thrown off by a small number of spammers
| and I didn't want to mess with combining those tasks at the
| time.
| 
| [0]: https://news.ycombinator.com/item?id=22178517
| 
| [1]: that votes are distributed according to a power law, meaning
| that everybody has voted on the largest subreddits so most
| clustering approaches recommend askreddit to everybody. That's
| okay for product recommendations where "you should buy the most
| popular CPU, it's most popular for a reason" but for subreddits
| you already know that so we want a way to bias to the most
| "surprising" of your votes.
| 
| [2]: I prototyped it on sqlite on my laptop and even with close
| to the production amount of data it ran reasonably well. Not
| fast, but fine. This was on considerably less traffic than
| today, mind.
 
| seydor wrote:
| didn't know there was a name for it
 
___________________________________________________________________
(page generated 2023-03-19 23:00 UTC)