
	[HN Gopher] PyWhat: Identify Anything
	___________________________________________________________________
	PyWhat: Identify Anything
	Author : trueduke
	Score  : 183 points
	Date   : 2021-06-16 07:54 UTC (15 hours ago)
	web link (github.com)
	w3m dump (github.com)
| cosmic_quanta wrote:
| In the same vague theme of "I don't know what I'm dealing with" :
| https://github.com/ajalt/fuckitpy
| kilnr wrote:
| Another one sort of related is hachoir, and specifically the
| hachoir-metadata script: https://github.com/vstinner/hachoir
| 0-_-0 wrote:
| I like the Versioning section:
|
| _The web devs tell me that fuckit's versioning scheme is
| confusing, and that I should use "Semitic Versioning" instead.
| So starting with fuckit version h.g., package versions will use
| Hebrew Numerals._
| antongribok wrote:
| I can't decide what I'm more impressed with:
|
| The 110% code coverage, the downloads per month, or the
| license.
| bee_rider wrote:
| I'm not sure if it was intentional or not, but I love that
| the Hebrew characters that they found look visually similar
| to Nan.
| dec0dedab0de wrote:
| At first I thought this was going to be like Google Lens. It's
| instead a way to probabilistically identify things in strings. I
| have wished for this to exist, and made my own dumbed down
| version of it before. This could be very useful for less fragile
| screen scraping.
| acidbaseextract wrote:
| Some more great probabilistic Python libraries:
|
| https://github.com/datamade/usaddress - "usaddress is a Python
| library for parsing unstructured address strings into address
| components, using advanced NLP methods."
|
| https://github.com/datamade/probablepeople - "probablepeople is a
| python library for parsing unstructured romanized name or company
| strings into components, using advanced NLP methods."
| nerdponx wrote:
| I have used and benefited tremendously from both of these
| libraries. While the methods are sound, the training data they
| used is not that comprehensive. You will probably want to apply
| some heuristic cleanup before and after processing. Or, if your
| organization has a lot of time and money, add additional
| training data.
| cge wrote:
| A note on the usaddress library, since I was surprised that it
| failed spectacularly when I played with it: the 'us' in the
| name appears to refer to the US, not 'unstructured'. There's no
| note of this in the readme, though there is a small US flag
| emoji in the GitHub about string.
| ssivark wrote:
| Nice! In the same spirit, here's an interesting talk on using
| Gen.jl (a probabilistic programming library/framework) for
| cleaning messy data in tables: https://youtu.be/vUxrtqY84AM
| ok123456 wrote:
| https://github.com/chardet/chardet - Detects the most likely
| encoding of a raw byte string.
| lapp0 wrote:
| Why would I need this when I already have a full Tome of Identify
| with 50 charges?
| nknealk wrote:
| Tome of Identify only holds 20 charges
| AbraKdabra wrote:
| I'm pretty sure he's playing the Project Diablo II mod.
| saas_sam wrote:
| PyWhat only uses one inventory slot vs. 2 for Tome. That's one
| extra SoJ!
| lettergram wrote:
| We built a similar tool, utilizing a CNN. It works on structured
| (and unstructured) data and provides additional info.
|
| https://github.com/capitalone/DataProfiler
|
| The cool part is that you can "extend" the internal named-entity
| recognition model by refitting it with new data.
|
| Out of the box, the DataProfiler does something like 18 entities,
| including most of the PII data.
| [deleted]
| gigatexal wrote:
| There really is a Python module for everything.
| cecilpl2 wrote:
| Cool, but it seems like 80% of the results in your example demos
| are YouTube video IDs.
| Mogzol wrote:
| I find it kind of funny that they would choose to show those as
| demos when it's obvious that most of them really aren't YouTube
| video IDs. Like "Accept-Lang" is pretty obviously not actually
| a video ID, even if it matches the [A-Za-z0-9_-]{11} pattern
| and technically could be a valid ID.
|
| On the other hand, I don't know how you would actually verify
| whether an 11-character string is or isn't a YouTube ID (short
| of querying YouTube itself), so I suppose it's nice that
| potential IDs are shown; it just seems they have a very high
| chance of being false positives.
| meowface wrote:
| You can reduce false positives by trying to identify
| base64-seeming strings that are 11 characters long: above a
| certain amount of entropy, with a plausible
| uppercase/lowercase/digit distribution, etc. You might risk
| false negatives, but different flags for different levels of
| sensitivity could help with that.
| vitus wrote:
| I'm admittedly not impressed by the pcap processing.
|
| It identifies a bunch of fragments of HTTP headers as "YouTube
| Video ID".
|
| Meanwhile, I can get the same info and more by running
|     $ strings FollowTheLeader.pcap
|     ]?>
|     GET / HTTP/1.1
|     Host: 10.0.2.5
|     User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
|     Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
|     Accept-Language: en-US,en;q=0.5
|     Accept-Encoding: gzip, deflate
|     Connection: keep-alive
|     Upgrade-Insecure-Requests: 1
|     Pragma: no-cache
|     Cache-Control: no-cache
|     HTTP/1.0 200 OK
|     Server: SimpleHTTP/0.6 Python/3.7.3rc1
|     Date: Sun, 14 Jul 2019 02:42:13 GMT
|     Content-type: text/html
|     Content-Length: 105
|     Last-Modified: Sun, 14 Jul 2019 02:41:10 GMT
|     My Flag Web Page
|     Hi there! Have a flag!
|     Here is your flag: ctfa{terrific_traffic}
___________________________________________________________________
(page generated 2021-06-16 23:00 UTC)
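
The filtering idea meowface describes in the thread can be sketched roughly as follows. This is not PyWhat's code: the function names, the thresholds, and the `strict` flag are all hypothetical, chosen only for illustration. One caveat shows up quickly when trying it: for 11-character strings, Shannon entropy alone barely separates anything ("Accept-Lang" scores about 3.28 bits per character, close to the maximum of log2(11), about 3.46), so this sketch leans mostly on the mix of uppercase, lowercase, and digit characters, with `strict` acting as the sensitivity flag.

```python
import math
import re
from collections import Counter

# Pattern quoted in the thread for candidate YouTube video IDs.
CANDIDATE = re.compile(r"[A-Za-z0-9_-]{11}")


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def plausible_video_id(s: str, strict: bool = True, min_entropy: float = 2.5) -> bool:
    """Hypothetical heuristic: pattern match + entropy floor + character-class mix.

    strict=True demands lowercase, uppercase, and digits (fewer false
    positives, more false negatives); strict=False accepts any two classes.
    The thresholds are illustrative assumptions, not tuned values.
    """
    if not CANDIDATE.fullmatch(s):
        return False
    classes = sum([
        any(ch.islower() for ch in s),
        any(ch.isupper() for ch in s),
        any(ch.isdigit() for ch in s),
    ])
    required = 3 if strict else 2
    return classes >= required and shannon_entropy(s) >= min_entropy


if __name__ == "__main__":
    # "dQw4w9WgXcQ" is used here only as an ID-shaped example string.
    for text in ["Accept-Lang", "dQw4w9WgXcQ", "aaaaaaaaaaa"]:
        print(text, plausible_video_id(text), plausible_video_id(text, strict=False))
```

In strict mode the header fragment "Accept-Lang" is rejected because it contains no digit, while the looser mode still lets it through. That is the false-positive versus false-negative trade-off the comment mentions: a random base64-style ID with no digits would be missed in strict mode.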
