[HN Gopher] PyWhat: Identify Anything
___________________________________________________________________
 
PyWhat: Identify Anything
 
Author : trueduke
Score  : 183 points
Date   : 2021-06-16 07:54 UTC (15 hours ago)
 
web link (github.com)
w3m dump (github.com)
 
| cosmic_quanta wrote:
| In the same vague theme of "I don't know what I'm dealing with" :
| https://github.com/ajalt/fuckitpy
 
  | kilnr wrote:
  | Another one sort of related is hachoir, and specifically the
  | hachoir-metadata script: https://github.com/vstinner/hachoir
 
  | 0-_-0 wrote:
  | I like the Versioning section:
  | 
  |  _The web devs tell me that fuckit's versioning scheme is
  | confusing, and that I should use "Semitic Versioning" instead.
  | So starting with fuckit version h.g., package versions will use
  | Hebrew Numerals._
 
  | antongribok wrote:
  | I can't decide what I'm more impressed with:
  | 
  | The 110% code coverage, the downloads per month, or the
  | license.
 
    | bee_rider wrote:
    | I'm not sure if it was intentional or not, but I love that
    | the Hebrew characters that they found look visually similar
    | to Nan.
 
| dec0dedab0de wrote:
| At first I thought this was going to be like Google Lens. It's
| instead a way to probabilistically identify things in strings. I
| have wished for this to exist, and made my own dumbed down
| version of it before. This could be very useful for less fragile
| screen scraping.
 
| acidbaseextract wrote:
| Some more great probabilistic python libraries:
| 
| https://github.com/datamade/usaddress - "usaddress is a Python
| library for parsing unstructured address strings into address
| components, using advanced NLP methods."
| 
| https://github.com/datamade/probablepeople - "probablepeople is a
| python library for parsing unstructured romanized name or company
| strings into components, using advanced NLP methods."
 
  | nerdponx wrote:
  | I have used and benefited tremendously from both of these
  | libraries. While the methods are sound, the training data they
  | used is not that comprehensive. You will probably want to apply
  | some heuristic clean up before and after processing. Or if your
  | organization has a lot of time and money, add additional
  | training data.
 
  | cge wrote:
  | A note on the usaddress library, since I was surprised when it
  | failed spectacularly as I played with it: the 'us' in the
  | name appears to refer to the US, not 'unstructured'. There's no
  | note of this in the readme, though there is a small US flag
  | emoji in the GitHub about string.
 
  | ssivark wrote:
  | Nice! In the same spirit, here's an interesting talk on using
  | Gen.jl (a probabilistic programming library/framework) for
  | cleaning messy data in tables: https://youtu.be/vUxrtqY84AM
 
  | ok123456 wrote:
  | https://github.com/chardet/chardet - Detects the most likely
  | encoding of a raw byte string.
 
| lapp0 wrote:
| Why would I need this when I already have a full Tome of Identify
| with 50 charges?
 
  | nknealk wrote:
  | Tome of identify only holds 20 charges
 
    | AbraKdabra wrote:
    | I'm pretty sure he's playing the Project Diablo II mod.
 
  | saas_sam wrote:
  | PyWhat only uses one inventory slot vs. 2 for Tome. That's one
  | extra SoJ!
 
| lettergram wrote:
| We built a similar tool, utilizing a CNN. It works on structured
| (and unstructured) data and provides additional info.
| 
| https://github.com/capitalone/DataProfiler
| 
| The cool part is that you can "extend" the internal named-entity
| recognition model by refitting it with new data.
| 
| Out of the box, the DataProfiler detects something like 18
| entities, including most of the PII data.
 
  | [deleted]
 
| gigatexal wrote:
| There really is a Python module for everything.
 
| cecilpl2 wrote:
| Cool, but it seems like 80% of the results in your example demos
| are Youtube video IDs.
 
  | Mogzol wrote:
  | I find it kind of funny that they would choose to show those as
  | demos when it's obvious that most of them really aren't Youtube
  | video IDs. Like "Accept-Lang" is pretty obviously not actually
  | a video ID, even if it matches the [A-Za-z0-9_-]{11} pattern
  | and technically could be a valid ID.
  | 
  | On the other hand, I don't know how you would actually verify
  | whether an 11-character string is or isn't a Youtube ID (short
  | of querying Youtube itself), so I suppose it's nice that
  | potential IDs are shown, just seems they have a very high
  | chance of being false positives.
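
To make the false-positive point above concrete, here is a minimal sketch using only the pattern quoted in the comment, `[A-Za-z0-9_-]{11}`. The function names are hypothetical and this is not PyWhat's actual implementation; it only shows that a shape check alone cannot tell a header fragment from a real ID.

```python
import re
from typing import List

# Pattern quoted in the comment above: 11 characters from the
# URL-safe base64 alphabet. Matching it proves shape, nothing more.
VIDEO_ID_SHAPE = re.compile(r"[A-Za-z0-9_-]{11}")


def looks_like_video_id(candidate: str) -> bool:
    """True if the whole string has the shape of a video ID."""
    return VIDEO_ID_SHAPE.fullmatch(candidate) is not None


def candidate_ids(text: str) -> List[str]:
    """Every non-overlapping 11-character run in text that fits the shape."""
    return VIDEO_ID_SHAPE.findall(text)


if __name__ == "__main__":
    # An ID taken from a youtu.be link elsewhere in this thread.
    print(looks_like_video_id("vUxrtqY84AM"))
    # A truncated HTTP header name has exactly the same shape.
    print(looks_like_video_id("Accept-Lang"))
    print(candidate_ids("Accept-Language: en-US"))
```

Both calls to `looks_like_video_id` succeed, which is why scanning raw HTTP traffic with this pattern reports header fragments as potential IDs.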
 
    | meowface wrote:
    | You can reduce false positives by trying to identify
    | base64-seeming strings that are 11 characters long. Above a
    | certain amount of entropy and uppercase/lowercase/digit
    | distribution, etc. You might risk false negatives, but
    | different flags for different levels of sensitivity could
    | help with that.
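
A rough sketch of the heuristic described above, assuming Shannon entropy over characters plus a count of the character classes present (lowercase, uppercase, digit). The thresholds and the names `shannon_entropy` and `plausible_random_id` are illustrative assumptions, not tuned values or anything taken from PyWhat.

```python
import math
import re
from collections import Counter

ID_SHAPE = re.compile(r"[A-Za-z0-9_-]{11}")


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def plausible_random_id(candidate: str,
                        min_entropy: float = 3.0,
                        min_classes: int = 3) -> bool:
    """Shape check plus two randomness heuristics.

    min_classes is the sensitivity knob: 3 requires lowercase, uppercase
    and a digit (stricter, more false negatives); 2 is looser.
    """
    if ID_SHAPE.fullmatch(candidate) is None:
        return False
    classes = sum([
        any(ch.islower() for ch in candidate),
        any(ch.isupper() for ch in candidate),
        any(ch.isdigit() for ch in candidate),
    ])
    return shannon_entropy(candidate) >= min_entropy and classes >= min_classes


if __name__ == "__main__":
    print(plausible_random_id("vUxrtqY84AM"))                 # mixed case and digits
    print(plausible_random_id("Accept-Lang"))                 # no digit, rejected by default
    print(plausible_random_id("Accept-Lang", min_classes=2))  # looser setting accepts it
```

Entropy on its own separates poorly at this length: an 11-character string can reach at most log2(11), about 3.46 bits, and "Accept-Lang" already scores about 3.28. In this sketch the character-class requirement does most of the work, at the cost of rejecting genuine IDs that happen to contain no digit.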
 
| vitus wrote:
| I'm admittedly not impressed by the pcap processing.
| 
| It identifies a bunch of fragments of HTTP headers as "YouTube
| Video ID".
| 
| Meanwhile, I can get the same info and more by running
| 
|     $ strings FollowTheLeader.pcap
|     *]?>
|     GET / HTTP/1.1
|     Host: 10.0.2.5
|     User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
|     Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
|     Accept-Language: en-US,en;q=0.5
|     Accept-Encoding: gzip, deflate
|     Connection: keep-alive
|     Upgrade-Insecure-Requests: 1
|     Pragma: no-cache
|     Cache-Control: no-cache
|     HTTP/1.0 200 OK
|     Server: SimpleHTTP/0.6 Python/3.7.3rc1
|     Date: Sun, 14 Jul 2019 02:42:13 GMT
|     Content-type: text/html
|     Content-Length: 105
|     Last-Modified: Sun, 14 Jul 2019 02:41:10 GMT
|     My Flag Web Page
|     Hi there! Have a flag!
|     Here is your flag: ctfa{terrific_traffic}
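
For readers without the `strings` utility, the invocation in the comment above can be approximated in a few lines of Python. This is a sketch that assumes a minimum run length of 4 printable ASCII characters; `printable_strings` is a hypothetical helper, not part of PyWhat or any pcap library.

```python
import re
from typing import List


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII bytes (tab included)."""
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


if __name__ == "__main__":
    # Synthetic bytes standing in for a capture file; a real pcap
    # would be read with open(path, "rb").read().
    blob = (b"\x00\x01GET / HTTP/1.1\r\nHost: 10.0.2.5\r\n"
            b"\xff\xfeab\x00ctfa{terrific_traffic}\x00")
    for line in printable_strings(blob):
        print(line)
```

Like `strings`, this knows nothing about packet structure. It only surfaces readable bytes, which is enough for unencrypted HTTP like the capture quoted above.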

___________________________________________________________________
(page generated 2021-06-16 23:00 UTC)