|
| SubiculumCode wrote:
| Jaccard my Dice please.
| SubiculumCode wrote:
| Jests aside, I've mostly used the closely related Dice
| coefficient when measuring segmentation reliability
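|
| For reference, the two are monotonically related, so they always
| rank a set of segmentations in the same order. Writing A and B for
| the two masks:
|
|     Dice(A, B)    = 2|A ∩ B| / (|A| + |B|)
|     Jaccard(A, B) = |A ∩ B| / |A ∪ B|
|     Dice = 2*Jaccard / (1 + Jaccard),  Jaccard = Dice / (2 - Dice)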
| psyklic wrote:
| This is one of my favorite distance metrics* to show people!
|
| For example, perhaps one person likes Reddit and HN, while
| someone else likes HN and SO.
|
| Then their Jaccard Index would be 1/3, since they have one thing
| in common out of three.
|
| * Technically it computes "similarity" (larger number == more
| similar), but `1 - Jaccard Index` is a distance (smaller number
| == more similar).
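|
| A minimal sketch in Python of both forms (the set contents are just
| the example above):
|
|     def jaccard(a: set, b: set) -> float:
|         """Jaccard similarity: |intersection| / |union|."""
|         if not (a | b):
|             return 1.0  # convention: two empty sets are identical
|         return len(a & b) / len(a | b)
|
|     alice = {"reddit", "hn"}
|     bob = {"hn", "so"}
|     print(jaccard(alice, bob))      # 0.333..., one shared site of three
|     print(1 - jaccard(alice, bob))  # the distance form: 0.666...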
| startup_eng wrote:
| I just used this at work the other day to calculate similarities
| between different data models that had overlapping children
| models. One of our teams was going to go through manually to
| check these overlaps and consolidate, but by using this
| clustering algo based on Jaccard distance we were able to give
| them clusters to consolidate up front. Super cool stuff!
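|
| A rough sketch of that kind of pipeline (the model names, the child
| sets, and the 0.5 distance cutoff are all invented for illustration,
| not the production code):
|
|     import numpy as np
|     from scipy.spatial.distance import pdist
|     from scipy.cluster.hierarchy import linkage, fcluster
|
|     models = {
|         "order":   {"customer", "address", "line_item"},
|         "invoice": {"customer", "address", "line_item", "payment"},
|         "user":    {"customer", "session"},
|     }
|     children = sorted(set().union(*models.values()))
|     # One boolean row per data model: which child models does it contain?
|     X = np.array([[c in kids for c in children] for kids in models.values()])
|
|     dist = pdist(X, metric="jaccard")  # pairwise Jaccard distances
|     labels = fcluster(linkage(dist, "average"), t=0.5, criterion="distance")
|     print(dict(zip(models, labels)))   # e.g. {'order': 1, 'invoice': 1, 'user': 2}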
| sonofaragorn wrote:
| what is a children model? I'm curious but can't really follow
| what you wrote, can you add a bit more context?
| coeneedell wrote:
| I recently used Jaccard similarity as a measurement of distance
| between two sets of online articles. It's amazing how versatile
| it is for all sorts of weird tasks.
| paulgb wrote:
| I used to use Jaccard similarity combined with w-shingling at
| the character level to detect clusters of fraud sites. It was
| surprisingly effective, because it was able to pick up common
| patterns in the code even if they used completely different
| styles and text.
|
| https://en.m.wikipedia.org/wiki/W-shingling
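|
| A toy version of the idea (the shingle width and the example markup
| are arbitrary choices, not necessarily what I used):
|
|     def shingles(text: str, w: int = 4) -> set:
|         """All overlapping character w-grams of the text."""
|         return {text[i:i + w] for i in range(len(text) - w + 1)}
|
|     def jaccard(a: set, b: set) -> float:
|         return len(a & b) / len(a | b) if a | b else 1.0
|
|     page1 = "<div class='buy-now'>Limited offer!</div>"
|     page2 = "<div class='buy-now'>Act fast!</div>"
|     print(jaccard(shingles(page1), shingles(page2)))  # the shared markup
|     # drives the score up even though the visible text differs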
| jethkl wrote:
| Interesting - I also used Jaccard similarity to classify
| clusters of malicious ad traffic schemes. The idea worked
| well. It was unclear if the similarity was due to mimicry or
| authorship, but that did not matter for our use.
| unethical_ban wrote:
| What are the predicted bounding box and the ground truth bounding
| box, as related to a stop sign? I have no idea what's happening
| there.
| montroser wrote:
| The "predicted" box there would be a best guess from a
| statistical model powered by AI or computer vision, answering,
| "where is the stop sign in this image?". The "ground truth"
| would be an annotation by a human answering the same question.
| The Jaccard similarity metric would say that these bounding
| boxes are highly similar, and so the prediction could be
| evaluated as high quality.
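|
| Concretely, for bounding boxes this is usually called
| intersection-over-union (IoU). A sketch with invented coordinates,
| boxes given as (x1, y1, x2, y2):
|
|     def iou(a, b):
|         ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
|         iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
|         inter = ix * iy
|         area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
|         union = area(a) + area(b) - inter
|         return inter / union if union else 0.0
|
|     predicted    = (48, 50, 152, 148)  # the model's guess
|     ground_truth = (50, 50, 150, 150)  # the human annotation
|     print(iou(predicted, ground_truth))  # ~0.94, a high-quality prediction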
| pncnmnp wrote:
| I recently wrote a fun blog post
| (https://pncnmnp.github.io/blogs/odd-sketches.html) about how to
| estimate Jaccard Similarity using min hashing, what b-bit min
| hashing is, and how to improve upon its limitations using a 2014
| data structure called odd sketches.
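|
| The core MinHash trick fits in a few lines. A bare-bones sketch
| (salting Python's built-in hash stands in for the proper hash
| families the post discusses):
|
|     def signature(items: set, k: int = 128) -> list:
|         return [min(hash((seed, x)) for x in items) for seed in range(k)]
|
|     def estimate(sig_a: list, sig_b: list) -> float:
|         # P(the min-hashes agree) equals the sets' Jaccard similarity
|         return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
|
|     a, b = set(range(100)), set(range(50, 150))  # true Jaccard = 1/3
|     print(estimate(signature(a), signature(b)))  # ~0.33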
|
| Jaccard Similarity's history is also quite interesting. From my
| blog:
|
| > In the late 19th century, the United States and several
| European nations were focused on developing strategies for
| weather forecasting, particularly for storm warnings. In 1884,
| Sergeant John Finley of the U.S. Army Signal Corps conducted
| experiments aimed at creating a tornado forecasting program for
| 18 regions in the United States east of the Rockies. To the
| surprise of many, Finley claimed his programs were 95.6% to 98.6%
| accurate, with some areas even achieving a 100% accuracy rate.
| Upon publishing his findings, Finley's methods were criticized by
| contemporaries who pointed out flaws in his verification
| strategies and proposed their solutions. This sparked a renewed
| interest in weather prediction, which is now referred to as the
| "Finley Affair."
|
| > One of these contemporaries was Grove Karl Gilbert. Just two
| months after Finley's publication, Gilbert pointed out that,
| based on Finley's strategy, a 98.2% accuracy rate could be
| achieved simply by forecasting no tornado warning. Gilbert then
| introduced an alternative strategy, which is now known as Jaccard
| Similarity.
|
| > So why is it named Jaccard Similarity? As it turns out, nearly
| three decades after Sergeant John Finley's tornado forecasting
| program in the 1880s, Paul Jaccard independently developed the
| same concept while studying the distribution of alpine flora.
| stygiansonic wrote:
| The name may be an example of this:
| https://en.m.wikipedia.org/wiki/Stigler%27s_law_of_eponymy
|
| _It was developed by Grove Karl Gilbert in 1884 as his ratio of
| verification (v)[1] and now is frequently referred to as the
| Critical Success Index in meteorology.[2] It was later developed
| independently by Paul Jaccard..._
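|
| Spelled out, Gilbert's ratio and Jaccard are the same number:
| treating "tornado predicted" and "tornado observed" as sets of
| cases,
|
|     CSI = hits / (hits + misses + false alarms)
|         = |predicted ∩ observed| / |predicted ∪ observed|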
| jszymborski wrote:
| Another example of this sort of thing (that is vaguely related
| in that it's commonly used as a metric) is (what I call) the
| Matthews Correlation Coefficient
| https://en.wikipedia.org/wiki/Phi_coefficient
|
| > In machine learning, it is known as the Matthews correlation
| coefficient (MCC) ... introduced by biochemist Brian W.
| Matthews in 1975.[1] Introduced by Karl Pearson,[2] and also
| known as the Yule phi coefficient from its introduction by Udny
| Yule in 1912
| dalke wrote:
| In my field, cheminformatics, we refer to it as "Tanimoto
| similarity" because it was (quoting Wikipedia) "independently
| formulated again by T. Tanimoto."
|
| It's an odd set of linkages to get there. First, "Dr. David J.
| Rogers of the New York Botanical Gardens" proposed a problem to
| Tanimoto, who published the writeup in an internal IBM report
| in 1958. (I understand there was a lot of mathematical research
| in taxonomy at the time.) In 1960 Rogers and Tanimoto published
| an updated version in Science.
|
| In 1973 Adamson and Bush at Sheffield University developed a
| method for the automatic classification of chemical structures.
| They tried Dice, phi, and Sneath as their comparison methods
| but not Tanimoto. In their updated 1975 publication they write
| "Several coefficients have been proposed based on this
| criterion", with a list of citations, including the Rogers and
| Tanimoto paper as citation 14.
|
| In 1986, Peter Willett at Sheffield revisits this work and
| finds that Tanimoto gives overall better results when applied
| to what are now called cheminformatics "fingerprints". He uses
| "Tanimoto", with no direct citation for the source of that
| definition.
|
| This similarity method is easy to implement, and many
| organizations already have pre-computed fingerprints (they are
| used as pre-filters for graph queries), so the concept and
| nomenclature take off almost immediately, with "Tanimoto" as
| the preferred name.
|
| It's not until 1991 that I can find a paper in my field referring
| to the earlier work by Jaccard (the paper uses "Tanimoto
| (Jaccard)").
|
| I have found some papers in related fields (e.g., in IR and mass
| spectra analysis) which reference Tanimoto similarity, but
| nothing to the extent that my field uses it.
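|
| For bitstring fingerprints the computation is just popcounts. A
| sketch using Python ints as bitvectors (the fingerprints are made
| up, not real chemistry; int.bit_count needs Python 3.10+):
|
|     def tanimoto(fp1: int, fp2: int) -> float:
|         common = (fp1 & fp2).bit_count()  # bits set in both
|         either = (fp1 | fp2).bit_count()  # bits set in either
|         return common / either if either else 1.0
|
|     mol_a = 0b1011_0100_1101
|     mol_b = 0b1011_0100_0110
|     print(tanimoto(mol_a, mol_b))  # 5/8 = 0.625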
| ketralnis wrote:
| Quoting myself from a while ago[0]
|
| At reddit many moons ago, before machine learning was a buzzword,
| one early iteration of recommendations was based on Jaccard
| distance using the number of co-voters between subreddits. But
| with one twist: divide by the size of the smaller subreddit.
|     relatedness a b =
|         numerator   = |voters_on(a) ∩ voters_on(b)|
|         denominator = |voters_on(a) ∪ voters_on(b)|
|         weight      = min(|voters_on(a)|, |voters_on(b)|)
|         numerator / (weight * denominator)
|
| That gives you a directional relatedness, that is
| programming->python but not necessarily python->programming. Used
| this way you account for the giant subreddit problem[1]
| automatically but now the results are less "amitheasshole is
| related to askreddit" and more like "linguisticshumor is a more
| niche version of linguistics".
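|
| A minimal Python version of that score (the empty-set guard is my
| addition):
|
|     def relatedness(a_voters: set, b_voters: set) -> float:
|         numerator   = len(a_voters & b_voters)           # co-voters
|         denominator = len(a_voters | b_voters)           # voters on either
|         weight      = min(len(a_voters), len(b_voters))  # smaller sub's size
|         return numerator / (weight * denominator) if weight else 0.0
|
| As I read it, the directionality comes out in the ranking: for a
| given subreddit you score every candidate with this, and the min()
| weighting pushes smaller candidates up that list.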
|
| The great thing is that it's actually more actionable as far as
| recommendations go! Everybody has already heard of the bigger
| version of this subreddit, but they probably haven't heard of the
| smaller versions. And it's self-correcting: as a subreddit gets
| bigger we are less likely to recommend it, which is great because
| it needs our help less.
|
| It's also easy to compute this because it lends itself to one
| giant SQL query that postgres or even sqlite[2] optimises
| reasonably well. It has some discontinuities around very tiny
| subreddits, so there was also a hack to just exclude them with a
| crude heuristic. It does get fairly static: once we've picked 3
| subreddits to recommend if you're on subreddit A, if you don't
| like them we'll just keep showing them anyway. I had a hack in
| mind for that (use the computed values as random weights so we'll
| still occasionally show lower-scoring ones) but by this time
| people much smarter than I took over recommendations with more
| holistic solutions to the problem we were trying to solve in the
| first place. Still, as a first pass it worked great and based on
| my experience I'd recommend simple approaches like this before
| you break out the linear algebra.
|
| Side note, I tried co-commenters in addition to co-voters. The
| results tended to be more accurate in my spot tests but the difference
| fell away in more proper cross-validation testing and I didn't
| look into where the qualitative difference was. But since there
| are more votes than comments on small subreddits the number of
| recommendable subreddits was higher with votes. I reasoned that
| co-submitters (of posts) should be even more accurate but it was
| thrown off by a small number of spammers and I didn't want to
| mess with combining those tasks at the time.
|
| [0]: https://news.ycombinator.com/item?id=22178517
|
| [1]: that votes are distributed according to a power law, meaning
| that everybody has voted on the largest subreddits so most
| clustering approaches recommend askreddit to everybody. That's
| okay for product recommendations where "you should buy the most
| popular CPU, it's most popular for a reason" but for subreddits
| you already know that so we want a way to bias to the most
| "surprising" of your votes.
|
| [2]: I prototyped it on sqlite on my laptop and even with close
| to the production amount of data it ran reasonable well. Not
| fast, but fine. This was on considerably less traffic to today,
| mind.
| seydor wrote:
| didn't know there was a name for it