|
| dfederschmidt wrote:
| This looks very useful, big fan of all the ^[a-z]+q$ utilities
| out there. But as a user, I would probably want to use XPath[0]
| notation here. Maybe that is just me. A quick search revealed
| xidel[1] which seems to be similar, but supports XPath.
|
| [0]https://en.wikipedia.org/wiki/XPath
| [1]https://github.com/benibela/xidel
| chriswarbo wrote:
| My web scraping tends to start with xidel. If I need a little
| bit more power I'll use xmlstarlet. If neither of those is
| enough, I'll use Python's beautifulsoup package :)
| waynenilsen wrote:
| part of the problem with this is that HTML is mostly not valid
| XML
| akie wrote:
| I'd like to state my support for the author's choice of CSS
| selectors in this particular use case. I think it's a natural
| fit for this domain and already very well known, perhaps even
| known better than XPath.
| mirekrusin wrote:
| The Playwright people had to solve this for themselves: you can
| mix CSS and XPath selectors since they're distinct, and they
| added a few small custom modifications to help with selectors.
| Playwright-compatible selectors would be nice.
| berkes wrote:
| I'd like to add my support here too, but with a note.
|
| When scraping and parsing (or writing an integration-test DSL),
| I always start out with CSS selectors. But I always hit cases
| where they fall short or require hoop-jumping, and then fall
| back on XPath. I then have a codebase with both CSS selectors
| and XPath, which is arguably worse than having only one method.
|
| I suspect that here, one uses this tool until CSS selector
| limitations get in the way, after which one switches to
| another tool(chain).
| alpha_squared wrote:
| Do you mind giving an example? I'm having trouble following
| where CSS is limited for selection.
| berkes wrote:
| Like the other commenter says: parent/child. But also
| selecting by content (e.g. "click the button with the
| delete-icon" or "find the link with '@harrypotter'") or
| selecting by attributes (e.g. click the pager-item that
| goes to the next page) or selecting items outside of body
| (e.g. og-tags, title etc). All are doable with CSS3
| selectors, but everything shouts that they are not meant
| for this, whereas XPath does this far more naturally.
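| For instance, a rough sketch in Python with lxml (the markup
| and names here are made up, just to illustrate the contrast):
|
|     # select by text content, by attribute, and climb to a parent
|     from lxml import html
|
|     doc = html.fromstring("""
|     <ul>
|       <li><a href="/u/harrypotter">@harrypotter</a></li>
|       <li><a href="/u/ronweasley">@ronweasley</a></li>
|     </ul>
|     <a rel="next" href="/page/2">Next</a>
|     """)
|
|     by_text = doc.xpath("//a[contains(text(), '@harrypotter')]")
|     next_href = doc.xpath("//a[@rel='next']/@href")
|     parent_li = doc.xpath("//a[contains(text(), '@harrypotter')]/..")
|     print(by_text[0].text, next_href[0], parent_li[0].tag)
|     # @harrypotter /page/2 li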
| unspecified wrote:
| Searching text content is my main remaining use of XPath.
| benibela wrote:
| XPath does general data processing not just selection
|
| E.g. when you have a list of numbers on the website,
| XPath can calculate the sum or the maximum of the numbers
|
| Or you have a list of names "Last name, First name", then
| you can remove the last name and sort the first names
| alphabetically. Or count how often each name occurs and
| return the most popular name.
|
| Then it goes back to selection, e.g. select all numbers
| that are smaller than the average. Or calculate the most
| popular name, then select all elements containing that
| name
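| A rough sketch of the numeric part (xidel itself speaks XPath
| 2/3; this uses Python's lxml, which is only XPath 1.0, and the
| markup is made up):
|
|     from lxml import html
|
|     doc = html.fromstring("""
|     <ul>
|       <li class="price">3</li>
|       <li class="price">10</li>
|       <li class="price">5</li>
|     </ul>
|     """)
|
|     # aggregate over the selected nodes, then select against the average
|     total = doc.xpath("sum(//li[@class='price'])")  # 18.0
|     avg = doc.xpath("sum(//li[@class='price']) div count(//li[@class='price'])")  # 6.0
|     below_avg = doc.xpath(
|         "//li[@class='price']"
|         "[number(.) < sum(//li[@class='price']) div count(//li[@class='price'])]"
|     )
|     print(total, avg, [li.text for li in below_avg])  # 18.0 6.0 ['3', '5']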
| vlunkr wrote:
| Well, the big one is selecting a parent from the child.
| androceium wrote:
| You could do this with the :has() CSS pseudo-class[0],
| though inverted (select a parent that _has_ a child
| matching a selector).
|
| Looks like that pseudo-class has not been implemented in
| the kuchiki library that htmlq uses, though.
|
| [0]: https://developer.mozilla.org/en-
| US/docs/Web/CSS/:has
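| For what it's worth, a rough sketch of :has() via Python's
| BeautifulSoup, whose soupsieve backend does implement it (the
| markup is made up):
|
|     from bs4 import BeautifulSoup
|
|     soup = BeautifulSoup("""
|     <div class="card"><span class="delete-icon"></span></div>
|     <div class="card"><span class="edit-icon"></span></div>
|     """, "html.parser")
|
|     # the parent div whose child matches the inner selector
|     cards = soup.select('div.card:has(> span.delete-icon)')
|     print(len(cards))  # 1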
| Jenk wrote:
| I've not had much friction using either, they are "close
| enough" that the time to (re)write a query from one to the
| other is not very significant.
| lilyball wrote:
| This looks really neat! It supports a bunch of different query
| types, and can even do things like follow links to get info
| about the linked-to pages!
|
| It's also in nixpkgs, though for some reason the nixpkgs
| derivation is marked as linux-only (i.e. not Darwin). (Edit:
| probably because the fpc dependency is also Linux-only, with a
| linux-specific patch and a comment suggesting that supporting
| other platforms would require adding per-platform patches)
| exyi wrote:
| Thanks, this looks more powerful. It supports CSS, XPath and
| XQuery. Maybe I could learn a bit of XQuery when I have a use
| case for it :)
| dmit wrote:
| Well, here's your first lesson then: if you prepend (: to
| your comment it will become a valid XQuery document!
|
| (: XQuery comments are marked by mirrored smilie faces, like
| this. :)
| bdcravens wrote:
| Nice - I've been writing XQuery for years and I had no clue
| firefoxd wrote:
| Super useful. You've created a fantastic tool here. Thank you.
| d--b wrote:
| Just being that guy: is there a reason you didn't call it hq?
| zamadatix wrote:
| Not the author, but neither is the poster: jq got away with it
| because it's one of the few two-letter combinations that wasn't
| absolutely overloaded and "jquery" was already taken. OTOH,
| nobody shortens HTML to "H", and HQ is an extremely common
| acronym, if not one of the most popular two-letter acronyms you
| could pick.
| OJFord wrote:
| jq didn't get away with it! Have you never tried searching
| for anything to do with it? How I _wish_ it were called
| `jsonq`!
| mgdm wrote:
| I just wanted to be slightly more descriptive and less likely
| to collide with other tools.
| notRobot wrote:
| Hahah, I love how this is your second comment in 10 years on
| HN.
| mgdm wrote:
| Hah. Yeah. I had another account for a little while but
| then HN started to let me reset the password for this one
| quite recently, so here I am.
| Snd_ wrote:
| This is great! Thanks
| who-shot-jr wrote:
| Good work!
| harperlee wrote:
| Nice!
|
| This is the kind of obvious tool that, once it exists, you can't
| really grok the fact that it didn't exist earlier, and that it
| took until now to appear.
| dmos62 wrote:
| > grok
|
| A good opportunity to introduce `gron` to those unfamiliar!
| > gron "https://api.github.com/repos/tomnomnom/gron/commits?per
| _page=1" | fgrep "commit.author" json[0].commit.author
| = {}; json[0].commit.author.date =
| "2016-07-02T10:51:21Z"; json[0].commit.author.email =
| "mail@tomnomnom.com"; json[0].commit.author.name = "Tom
| Hudson";
|
| https://github.com/tomnomnom/gron
| rsync wrote:
| "A good opportunity to introduce `gron` to those unfamiliar!"
|
| Thank you - appreciated.
|
| I haven't done much work with json but have had reasons
| recently to do so - and I immediately saw how difficult it
| was to pipeline to grep ...
|
| But what I still don't understand is that some json outputs I
| see have multiple values _with the exact same name_ (!) and
| that still seems "un-grep-able" to me ...
|
| What am I missing ?
| dmos62 wrote:
| You might be missing a change in index: `obj[0].prop` vs
| `obj[1].prop`. Or, your JSON might have the same property
| defined multiple times: `{a:1, a:2}` (though I'm not sure
| how gron handles that situation).
| lvncelot wrote:
| > (though I'm not sure how gron handles that situation).
|
| It seems both gron and jq only use the value that has
| been defined last:
|
|     ~ echo '{"a":1,"a":2}' | gron
|     json = {};
|     json.a = 2;
|     ~ echo '{"a":1,"a":2}' | jq
|     {
|       "a": 2
|     }
| croon wrote:
| The json output likely contains multiple objects. Can you
| request more specifically the object(s) you need and grep
| on that?
| dotancohen wrote:
| > But what I still don't understand is that some json
| > outputs I see have multiple values with the exact same
| name
|
| This is neither explicitly allowed nor explicitly forbidden
| by the JSON spec. It is implementation-dependent how to
| handle it: does one value override the other? Should they
| be treated as an array?
|
| In practice, this situation is usually carefully avoided by
| services that produce JSON. If you are interfacing with a
| service that does produce duplicate values, I'd be
| interested in seeing it for curiosity's sake. If you are
| writing a service and this is the output, then I implore
| you to reconsider!
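| If you do run into it, most parsers quietly keep the last
| value. A rough Python sketch (the duplicate-key check is just
| an illustration, not a standard feature):
|
|     import json
|
|     print(json.loads('{"a": 1, "a": 2}'))  # {'a': 2} -- last value wins
|
|     def reject_duplicates(pairs):
|         # pairs is the raw list of (key, value) tuples, duplicates included
|         keys = [k for k, _ in pairs]
|         dupes = {k for k in keys if keys.count(k) > 1}
|         if dupes:
|             raise ValueError(f"duplicate keys: {dupes}")
|         return dict(pairs)
|
|     json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicates)  # raises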
| ptwt wrote:
| I did write one a few years ago.
|
| https://github.com/plainas/tq
| matsemann wrote:
| There are already tools for xpath, but using css selectors is
| much more aligned with what I write every day, so that's nice.
| harperlee wrote:
| Yes, and awk and others. I meant something semantically
| closer to the need, with css selectors.
| natrys wrote:
| It's not novel obviously. I have been using pup[1] for years.
| And xidel[2] is probably older.
|
| [1] https://github.com/ericchiang/pup
|
| [2] https://github.com/benibela/xidel
| ducktective wrote:
| Looks nice! Any comparisons with pup?
|
| https://github.com/ericchiang/pup
| notorandit wrote:
| Next is xmlq: https://github.com/dscape/xmlq
| Ronak123 wrote:
| https://techflashes.com/top-upcoming-futuristic-technologies...
| unityByFreedom wrote:
| Why not just jquery?
| purplecats wrote:
| brilliant. does this spin up a heavy DOM implementation in the
| background or do something lighter such as regexp?
| mdzn wrote:
| You can't parse HTML with regexps. It's not a regular language.
| underdeserver wrote:
| https://stackoverflow.com/questions/1732348/regex-match-
| open...
| carnitine wrote:
| What language implements regexps that actually correspond to
| regular languages though?
| Deukhoofd wrote:
| Looks like it uses servo's html5ever (through kuchiki), so no
| DOM representation.
| chrismorgan wrote:
| Kuchiki materialises what they call a "DOM-like tree". I'd
| consider it a DOM tree, myself, despite the differences in
| precise API.
|
| But it's not using a full browser to back it, which I suspect
| is what's really being asked.
| mrweasel wrote:
| It looks to be using html5ever to parse the HTML, similar to
| something like BeautifulSoup in Python.
| delusional wrote:
| The source is right there. You can read it. It uses html5ever
| (part of the servo project).
| [deleted]
| gostsamo wrote:
| You can't parse html with regular expressions :)
|
| https://stackoverflow.com/questions/1732348/regex-match-open...
| bmn__ wrote:
| "Oh Yes You Can Use Regexes to Parse HTML!"
|
| https://stackoverflow.com/a/4234491
| anon4242 wrote:
| Yeah, if you allow yourself some Perl to help you with
| those parts that regexes can't handle...
| akie wrote:
| Technically correct, but did you see the regex he uses? It
| spans 82 lines...
| andybak wrote:
| And the obligatory caveat from the comments:
|
| > While arbitrary HTML with only a regex is impossible, it's
| sometimes appropriate to use them for parsing a limited,
| known set of HTML.
| hnbad wrote:
| The emphasis here is on "known". The tool is general
| purpose (i.e. handling _unknown_ HTML) so using regexes
| would be ill-advised.
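| For the "known" case, something like this rough Python sketch
| is about the limit of what's sane (the snippet is made up):
|
|     import re
|
|     # fine for one fixed, known snippet; hopeless for arbitrary HTML
|     known = '<meta property="og:title" content="Hello, world">'
|     m = re.search(r'property="og:title" content="([^"]*)"', known)
|     print(m.group(1))  # Hello, world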
| dredmorbius wrote:
| See also the html-xml-utils from w3c.
|
| hxextract and hxselect perform similar extract functions.
|
| hxclean and hxnormalize (combined) will pretty-print HTML.
|
| https://www.w3.org/Tools/HTML-XML-utils/
| mozey wrote:
| Funny, couple of years ago I thought someone should create
| something for JSON similar to what
| [XSLT](https://en.wikipedia.org/wiki/XSLT) is for XML. See
| example here https://www.w3schools.com/xml/xsl_intro.asp
|
| Then I found out about jq because awscli was using it in
| example docs.
|
| I guess `htmlq` makes sense if it has the exact same syntax as
| `jq`, and the user is already familiar with the latter?
| desktopninja wrote:
| Very nice tool. I've long spoiled myself with Powershell's
| Invoke-WebRequest, eg.
|
|     # what is the latest release of apache-tomcat?
|     $LINKS=$(Invoke-WebRequest -Uri 'https://tomcat.apache.org/download-80.cgi' | Select-Object -ExpandProperty Links)
|     $LATEST=$($Links | Where-Object -Property href -Match '#8.5.[0-9]+').href.substring(1)
|     $FETCH=$($Links | Where-Object -Property href -match "apache-tomcat-${LATEST}.zip$").href
| Tepix wrote:
| Should it be $LINKS instead of $Links (2x)?
| desktopninja wrote:
| "$links" works too because PWSH is not case sensitive. But I
| should have used $LINKS like you said for cleaner write-up.
| systemvoltage wrote:
| This is nifty! Python + bs4 takes some googling to remember how
| to parse a webpage. This is just straightforward, thanks so
| much.
| jillesvangurp wrote:
| If you make the HTML well formed, XPath also works great. Great
| stuff if you ever need to pick HTML apart. Used this quite a
| bit, together with jtidy, back when microformats were still a
| thing.
|
| Jq is very loosely inspired by that, I guess. Might come full
| circle here and use some XSL transformations ...
| qw wrote:
| You can usually find an HTML parser for your language that you
| can use XPath/XSL on. It will just make the same assumptions
| that the browser does, e.g. adding missing closing tags.
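| To illustrate that repair step, a rough sketch in Python with
| lxml.html (the broken markup is made up):
|
|     from lxml import html
|
|     broken = "<ul><li>one<li>two<li>three"   # no closing tags at all
|     doc = html.fromstring(broken)            # the parser repairs it, browser-style
|     print(doc.xpath("//li/text()"))          # ['one', 'two', 'three']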
|
| I made a tool that extracted parts of web pages 10-15 years
| ago, and it worked well. There are of course cases where the
| html is so unstructured that the results were unpredictable,
| but it worked well in general.
| ludovicianul wrote:
| And a Java version with pre-compiled binaries:
| https://github.com/ludovicianul/hq
| srg0 wrote:
| "htmlq: like jq, but for HTML"
|
| "jq is like sed for JSON data"
|
| sed: "While in some ways similar to an editor which permits
| scripted edits (_such as ed_), sed works by making only one pass
| over the input(s)"
|
| ed: "ed is a line-oriented text editor".
|
| Defining software through a reference to other software is
| somewhat confusing. Potential users come from different
| backgrounds (I had no idea what jq is), and it is not clear what
| the defining features of each project are. Is jq line-oriented?
| Does htmlq operate in a single pass?
| digitalsushi wrote:
| "htmlq is like jq but for html" is a very specific 'dog
| whistle' for people who use jq. I agree that people who don't
| know what jq is will get no value and pay no attention. But for
| people who use jq, the claim is, like a dog whistle, clear,
| concise, and means exactly what it says. In two seconds,
| everyone using both jq and html will instantly know what is
| available and log it away.
|
| So for general purposes, it's a terrible marketing pitch. And
| yet I think it's a very, very valuable demonstration of knowing
| some of their 'customers'.
| acomar wrote:
| this isn't what a dogwhistle is. it's just explanation by
| analogy to a model presumed to be shared by the intended
| audience. a dogwhistle offers a surface meaning to the
| uninitiated that's anodyne but communicates a hidden, coded
| message to those who possess some undisclosed, shared
| knowledge with the author. this kind of analogy entirely
| lacks the surface meaning and the message shared via jargon
| also communicates something about how you might learn enough
| to understand the analogy.
| philipswood wrote:
| I can't speak for people who don't know jq, but knowing jq,
| this is a great tagline: it gives me an immediate
| understanding of what it does, how I could expect to use it
| and what value and ease of use I can expect.
|
| I'll be trying it out next time I'm on a PC.
| rendall wrote:
| > _I can 't speak for people who don't know jq,_
|
| I can, and it's not illuminating at all.
| throwaway2016a wrote:
| I agree, however if you do know how to use jq then "like jq,
| but for html" is extremely effective. I use jq all the time and
| that title hooked me; I immediately wanted to try it.
|
| But if you haven't used jq then I can see how that title is
| less than helpful.
| ducktective wrote:
| The first three are not proper definitions per se but kind of
| an advertisement, trying to familiarize by self-comparison with
| a _tried & true_ tool that has proven its worth.
|
| _You know Jimmy the famous mechanic? I'm Timmy, his brother,
| but an electrician._
|
| IMO, at least `jq` has proven itself as _the_ indispensable
| tool for json-data manipulation.
| corporealshift wrote:
| I mean...if you read the github readme it literally describes
| what it does in the next line: "Uses CSS selectors to extract
| bits content from HTML files".
| kbenson wrote:
| > Software definition through a reference to another software
| is somewhat confusing.
|
| Possibly, depending on background as you note, but not all
| promotion is intended for the same audience. When submitting to
| HN, "like jq, but for X" is short and conveys what it is to
| most of the people that would care, I think. jq has been
| submitted and talked about here _many_ times with lively
| discussion over the years.[1] At this point I think most of
| those that are interested in what that is and what this is will
| understand fairly quickly from the title. Those that don't
| might be missed, or they might look it up like you, or they
| might see it through some other submission some other time with
| a different title which isn't based on a chain of references.
|
| 1:
| https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
| zamadatix wrote:
| 1st sentence - Explaining the tool for those the tool was made
| for without beating around the bush.
|
| 2nd sentence - Explaining the tool to folks in the general web
| domain what it can do for them.
|
| 3rd sentence - Explaining where to learn how to use the tool if
| you've stumbled across it but web is not your area of
| expertise.
|
| All that info fits in nearly 25 words, then it lists the
| options for the tool and jumps straight into multiple examples
| (with outputs!). If the only explanation had been "htmlq: like
| jq, but for HTML" I'd agree, but having the comparison to
| explain what it does isn't a bad thing; it's _only_ having the
| comparison that would be bad.
|
| Personally I think this is a model example of an opening for a
| Github readme.
| zerocount wrote:
| I disagree. The 2nd sentence contains, "extract bits
| content." What is that?
|
| If you're going to write a minimal introduction, at least
| make sure it's not confusing.
|
| I get the feeling the author felt compelled to write an
| introduction and did so with as little effort as possible.
| cyberge99 wrote:
| I believe he tailored it to his target audience. If you
| find it confusing, you are likely not it.
| ritchiea wrote:
| As web developer for over a decade "bits content" doesn't
| mean anything to me. But I understand what the tool does
| from the rest of the description. Try running a google
| search for "bits content," [0] it's not a commonly used
| phrase in web development or anything. It's a poor choice
| of words.
|
| 0. https://www.google.com/search?hl=en&q=%22bits%20conten
| t%22
| chownie wrote:
| It's supposed to be "bits of content", it's not jargon.
| The author's just accidentally a word, we all do it.
| ritchiea wrote:
| It's more than fair to say that, in technical documentation you
| intend others to use, having a grammatical error or a missing
| word is confusing and a problem. It's the writing
| equivalent of having a bug in your code. And it's
| definitely not "writing to a target audience" as the
| parent comment suggested. We all make mistakes but don't
| try to call a mistake effective documentation.
| RobertKerans wrote:
| Of course it is, but neither parent nor anyone else is
| saying anything close to the mistake being effective
| documentation. There's a single missing word which needs
| to be added in, but the overall text is clearly writing
| to a target audience. You are aware of this, and of how
| small the mistake is, and you understand what the
| sentence should read as, so I'm not sure what your point
| is?
| theIV wrote:
| My hunch is that this is a typo and it should read "extract
| bits OF content."
| mgdm wrote:
| Exactly this! I'll fix it after work.
| rendall wrote:
| Maybe have the line about "jq" be 2nd. Have the first
| line be a brief description of what it actually does.
| ritchiea wrote:
| I agree and having a missing word in your text often
| leads to confusion :)
|
| Honestly you could drop the "bits" which is a bit
| redundant and use the phrase "Uses CSS selectors to
| extract content from HTML files."
| oauea wrote:
| What's this thing called a "computer" that people keep going on
| about, anyway?
| da_chicken wrote:
| It's a person who does mathematical calculations all day. For
| example, creating range tables for artillery, calculating
| averages or totals of a large range of values, or solving
| complex integrals or differential equations, and so on.
| They're commonly used in industry or government, especially
| in astronomy, aerospace and civil engineering for both
| simulation and analysis. Perhaps the most well-known
| computers were the Harvard Computers, which operated in the
| late 19th and early 20th centuries.
|
| As a job, computers were largely automated out of existence
| by solid-state transistor based automated computers and
| integrated circuit transistor automated computers in the 60s,
| 70s and 80s, which replaced the enormously expensive and
| often largely experimental electro-mechanical automated
| computers while radically reducing cost and improving
| performance both by several orders of magnitude.
| samstave wrote:
| Here - this explains it really succinctly:
|
| https://www.youtube.com/watch?v=lE1bS-Mn2Mk
| theandrewbailey wrote:
| It's like a programmable loom, but for logical and
| mathematical operations.
| samhw wrote:
| You may be interested in the symbol grounding problem
| (https://en.wikipedia.org/wiki/Symbol_grounding_problem#Groun...).
| It's like the binding problem, but for symbols.
| [deleted]
| sundarurfriend wrote:
| Sort of related: [Expecting Short Inferential
| Distances](https://www.readthesequences.com/Expecting-Short-
| Inferential...)
| nextaccountic wrote:
| jq isn't line-oriented, it's JSON-oriented. It operates on a
| stream of JSON values from stdin, so its query is applied to
| each one in sequence.
|
| I would expect htmlq to run the query a single time for a
| single HTML document, just like jQuery's $('#something') or
| document.querySelector('#something').
| zatkin wrote:
| Why not incorporate this into jq itself, like perhaps adding some
| command line arguments to switch to HTML mode?
| Deukhoofd wrote:
| What would the benefits of fitting an HTML parser into a JSON
| parser tool be?
| lmm wrote:
| JQ is not just a parser but a tool for doing operations, many
| of which are (or should be) generic across any tree-like data
| format. Reusing that part across different input formats
| makes a lot of sense.
| mjburgess wrote:
| Well once there's an HTML parser, then a pdf viewer, and then
| everything needed for PDFs (ie., programming, emailer, video
| support, etc.) we'll finally have that ideal operating system
| we've been waiting for.
| mro_name wrote:
| sounds a lot more like blockchain.
| e12e wrote:
| Would probably be more useful to implement html2json, and pipe
| in html?
|
| Ed: eg: https://github.com/Jxck/html2json
| downWidOutaFite wrote:
| Why? I find xpath's syntax much simpler and regular than jq's.
| [deleted]
| pabs3 wrote:
| I tend to reach for XPath selectors before CSS ones when querying
| HTML.
| necovek wrote:
| Nice, I expected something based on XPath (like xpd), but web
| developers dealing with HTML are infinitely more familiar with
| CSS selectors, so a great choice!
| busterarm wrote:
| I want the option to use both, like Nokogiri gives you.
| necovek wrote:
| Sure, that sounds nice, but having two simple tools each
| doing the job well in its own space is perfectly fine for me
| -- do you imagine needing to combine Xpath and CSS queries in
| a single run?
| busterarm wrote:
| I've had to do it when dealing with some poorly-designed
| XML apis in the past. Nokogiri was a godsend.
| rendall wrote:
| What is jq?
| mro_name wrote:
| it's statically linkable rust, isn't it? Awesome. I'm looking for
| a successor to
|
| $ xmllint --html --xpath ...
|
| that doesn't choke on inline svg.
| gigatexal wrote:
| This is very cool. This will make scraping the web even easier!
| elif wrote:
| When I saw the title I thought this was some machine learning-
| specific rmq/0mq message passing tech called HT. Very excited to
| zero.
| m4r35n357 wrote:
| Should be HQ . . .
| pkrumins wrote:
| Call it "hq".
| jhatemyjob wrote:
| Crazy how a 300-line codebase manages to amass 2000 stars on
| Github and 700 upvotes on HN. Amazing ROI.
| gizdan wrote:
| Once upon a time I was using pup[0] for this sort of thing;
| later I changed to cascadia[1], which seemed much more
| advanced.
|
| Comparing the two repos, it seems pup is dead, but cascadia may
| not be.
|
| These tools, including htmlq, seem to sell themselves as "jq
| for html", which is far from the truth. jq is closer to awk,
| where you can do just about everything with JSON. Cascadia,
| htmlq, and pup seem closer to grep for HTML. They can
| essentially only select data from an HTML source.
|
| [0] https://github.com/EricChiang/pup [1]
| https://github.com/suntong/cascadia
| heavyset_go wrote:
| I've used pup for a few projects, but was unaware of cascadia.
| Thanks for pointing it out.
| croon wrote:
| Well, jq is grep _as well_ as sed and awk, but yeah, htmlq
| seems to be just grep, for sake of comparison.
|
| But I don't think html has any need for a sed/awk tool, or at
| least not as much. Json output could very well be piped forward
| to the next CLI tool after you've changed it slightly with jq.
| I don't see this scenario as likely with html.
| gizdan wrote:
| > Well, jq is grep as well as sed and awk, but yeah, htmlq
| seems to be just grep, for sake of comparison.
|
| Exactly, and that is what I mean. If you want to compare,
| compare it with grep, not jq.
|
| Someone else posted xidel[0] in this thread, which I've not
| used, but it seems to be the "jq but for html".
|
| [0] https://github.com/benibela/xidel
| bamdadd wrote:
| is there a brew install command ?
| mcovalt wrote:
| I'd like to see a tool using lol-html [0] and their CSS selector
| API as a streaming HTML editor.
|
| [0] https://github.com/cloudflare/lol-html
| hyperpallium2 wrote:
| From examples, this is only like jq in the sense that the q
| stands for the same thing. Even the way it does that is
| different.
|
| An xmlq that was really like jq would be fun, about 20 years ago.
| cerved wrote:
| I would still like xmlq, there are (regrettably) still a lot of
| applications that store data and configuration in xml
| dotancohen wrote:
| There is `xq` today, which parses XML like `jq`. I think that
| it is relatively unknown because it is part of the `yq` package
| for parsing YAML. So just install `yq` via pip and you'll get
| `xq` as well.
|
| There is also `xmlstarlet` for parsing XML in a similar
| fashion.
| hyperpallium2 wrote:
| xmlstarlet is really nothing like jq, as a language. But yes,
| I use it because it is the best commandline xml processor I'd
| found. That's the only similarity to jq.
|
| Is this the yq? https://kislyuk.github.io/yq/ It does contain
| an 'xq', as a literal wrapper for jq, piping output into it
| after transcoding XML to JSON using xmltodict
| https://github.com/martinblech/xmltodict (which explodes xml
| into separate JSON data structures).
|
| This is a bash one-liner! But TBF it really is a 'jq for
| xml'. I think it would be horrible for some things, but you
| could also do a lot of useful things painlessly.
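| If you're curious what that xmltodict step looks like on its
| own, a rough Python sketch (the XML is made up):
|
|     import json
|     import xmltodict
|
|     # attributes come out prefixed with "@" by default
|     doc = xmltodict.parse("<config><db host='x' port='5432'/></config>")
|     print(json.dumps(doc, indent=2))
|     # {"config": {"db": {"@host": "x", "@port": "5432"}}}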
| dotancohen wrote:
| Thank you for the comments. I've only recently discovered
| both tools, and literally used them once each. Of the two
| `xq` was easier for my particular use case (parsing a
| Magento config) but I keep both tools in my virtual
| toolbox.
|
| If you have any other suggestions for parsing XML for
| exploratory purposes I'm very happy to hear them.
| hyperpallium2 wrote:
| Thanks! Not actually a recommendation, but I have used
| xsltproc (command line xslt), but it is horrible to use
| because xslt syntax is horrible (though xslt's concepts
| are pretty cool). One thing is it enables you to use
| XPath in all its glory.
|
| Just installed xq. It's nice just seeing the pretty-
| printed json output, so thanks for the pointer. Probably
| better than xmlstarlet for my usage, which just queries
| and outputs text, not xml. hmmm, that's probably true for
| most commandline uses...
| jle17 wrote:
| Just looked into this and I think it's worth mentioning that
| there are two different projects called `yq`. The first one
| that came up (written in go instead of python) is not the
| right one and doesn't have the `xq` tool.
| abledon wrote:
| is anyone else using the https://github.com/json-path/JsonPath
| over the jq route?
|
| I hope we standardize on some jq query language, like we have
| with a base set of SQL syntax
| andybak wrote:
| > like jq
|
| "jq is a lightweight and flexible command-line JSON processor"
| chefandy wrote:
| If anyone is looking for a good library to do this in Python,
| PyQuery works well:
|
| https://pythonhosted.org/pyquery/
| teitoklien wrote:
| Maybe call it hq ?
| Simplicitas wrote:
| My thoughts EXACTLY... but anyway, great new utility indeed!
| teitoklien wrote:
| Haha, Indeed its a very good utility :D
| oauea wrote:
| https://jsoup.org/ has been around for a long time and seems a
| bit more mature & maintained than this two-code-files 2-year-old
| repo. Highly recommend.
| avereveard wrote:
| what's wrong with using html tidy + xmllint ?
| mro_name wrote:
| nothing wrong. Searching unmodified html though is sometimes
| preferable.
| soheil wrote:
| I'd use something like this script that you can put together
| yourself:
|
|     #!/usr/bin/env ruby
|     require 'nokogiri'; p Nokogiri::HTML(STDIN.read).css(ARGV[0]).text
|
| Just save it to a file at _/usr/local/bin/hq_ and do _chmod +x !$_
|
| Then you can do:
|
|     curl -s "https://news.ycombinator.com/news" | hq "tr:first-child .storylink"
|
| It uses Nokogiri[0], which is much more battle tested and works
| with CSS and XPath selectors.
|
| [0]
| https://nokogiri.org/tutorials/parsing_an_html_xml_document....
| triska wrote:
| This is very nice!
|
| For reasoning about tree-based data such as HTML, I also highly
| recommend the declarative programming language Prolog. HTML
| documents map naturally to Prolog terms and can be readily
| reasoned about with built-in language mechanisms. For instance,
| here is the sample query from the htmlq README, fetching all
| elements with id _get-help_ from https://www.rust-lang.org, using
| Scryer Prolog and its SGML and HTTP libraries in combination with
| the XPath-inspired query language from library(xpath):
| ?- http_open("https://www.rust-lang.org", Stream, []),
| load_html(stream(Stream), DOM, []), xpath(DOM,
| //(*(@id="get-help")), E).
|
| Yielding:
|
|     E = element(div,[class="flex flex-colum ...",id="get-help"],["\n ",
|         element(h4,[],["Get help!"]),"\n ",element(ul,[],["\n ...",
|         element(li,[],[element(a,[... = ...],[...])]),"\n ...",
|         element(li,[],[...]),...|...]),"\n ...",
|         element(div,[class="la ..."],["\n ...",element(label,[...],[...]),...|...]),
|         "\n ..."]) ;
|     false.
|
| The selector //(*(@id="get-help")) is used to obtain all HTML
| elements whose _id_ attribute is get-help. On backtracking, all
| solutions are reported.
|
| The other example from the README, extracting all _links_ from
| the page, can be obtained with Scryer Prolog like this:
| ?- http_open("https://www.rust-lang.org", Stream, []),
| load_html(stream(Stream), DOM, []), xpath(DOM,
| //a(@href), Link), portray_clause(Link),
| false.
|
| This query uses forced backtracking to write all links on
| standard output, yielding:
|
|     "/".
|     "/tools/install".
|     "/learn".
|     "https://play.rust-lang.org/".
|     "/tools".
|     "/governance".
|     "/community".
|     "https://blog.rust-lang.org/".
|     "/learn/get-started".
|     etc.
| chriswarbo wrote:
| Thanks, that's a rare example of something which is (a) simple
| enough to understand for a Prolog newbie like me, and (b) more
| practical than the ubiquitous family-tree example.
|
| I'm always looking for opportunities to dip my toes into
| Prolog; in hindsight it's clearly a good fit for tree-
| structured data structures.
| samhw wrote:
| Interestingly, the only other context in which I've come
| across Prolog is from friends who studied at Cambridge, here
| in the UK. For some reason, the CS 'tripos' (course) there is
| really heavily focussed on Prolog, and everyone I know from
| there ended up a huge fan of the language. I'm not sure why
| that's the case, though, given that almost all other
| universities seem to use more common languages (Java, C++,
| etc).
| zimpenfish wrote:
| cs.man.ac.uk, at least back in 1992, had a compulsory
| Prolog module in the first year. Don't know anyone from
| then who didn't hate that module with a burning passion.
|
| (There was no Java, C++, etc. either. It was SML, Pascal,
| 68000, and Oracle Pascal-Embedded-SQL.)
| ramses0 wrote:
| "Prolog as a library" => Given "functional" constraints =>
| $CONSTRAINTS.prolog( "query..." ) => results
|
| ...many languages (similar to regex / state-machine) can
| benefit greatly from offloading a portion to something
| prolog-ish, but it's unfortunate that prolog knowledge
| isn't as widely distributed.
| WickyNilliams wrote:
| I studied CS at a different university in UK and we used
| Prolog for one module on AI or perhaps machine vision. I
| really enjoyed working with it. This was 15 years ago.
| Looking through their current curriculum I can't see prolog
| being mentioned anymore. Shame!
| pandatigox wrote:
| I tried to run this on my computer now, but as a complete
| Prolog noob, I'm having errors running the script? How do you
| load the http_open module/library in the first place? I tried
| following some Prolog tutorials in the past but I always get
| stuck trying to run something in the REPL. I'm using scryer-
| prolog. Thanks in advance!
| triska wrote:
| The libraries I mentioned can be loaded by invoking the
| use_module/1 predicate on the toplevel, here is the complete
| transcript that loads the SGML, HTTP and XPath libraries in
| Scryer Prolog:
|
|     ?- use_module(library(sgml)).
|        true.
|     ?- use_module(library(http/http_open)).
|        true.
|     ?- use_module(library(xpath)).
|        true.
|
| The second query also uses portray_clause/1 from
| library(format), which you can load with:
|
|     ?- use_module(library(format)).
|        true.
|
| After all these libraries are loaded, you can post the sample
| queries from above, and it should work.
|
| There are also other ways to load these libraries: A very
| common way to load a library is to use the use_module/1
| _directive_ in Prolog source files. In that case, you would
| put for example the following 4 directives in a Prolog source
| file, say sample.pl:
|
|     :- use_module(library(sgml)).
|     :- use_module(library(http/http_open)).
|     :- use_module(library(xpath)).
|     :- use_module(library(format)).
|
| And then run sample.pl with:
|
|     $ scryer-prolog sample.pl
|
| You can then again post the goals from above on the toplevel,
| and it will work too.
|
| Another way is to put these directives in your ~/.scryerrc
| configuration file, which is automatically consulted when
| Scryer Prolog starts. I recommend to do this for libraries
| you frequently need. Common candidates for this are for
| example library(dcgs), library(lists) and library(reif).
|
| Personally, I start Scryer Prolog from within Emacs, and I
| have set up Emacs so that I can consult a buffer with Prolog
| code, and also post queries and interact with the Prolog
| toplevel from within Emacs.
| pandatigox wrote:
| Wow that works fantastically! Thank you for that. It almost
| seems like magic.
| okasaki wrote:
| It's pretty easy in Python too, eg.:
|
|     >>> soup = BeautifulSoup(requests.get("https://www.rust-lang.org").text)
|     >>> [x["href"] for x in soup.find_all("a")]
|     ['/', '/tools/install', '/learn', 'https://play.rust-lang.org/',
|      '/tools', '/governance', '/community', 'https://blog.rust-lang.org/',...
| triska wrote:
| In a certain sense (for example, when measuring brevity), it
| is indeed easy to write this example in Python. However, the
| Python version also illustrates that many different language
| constructs are needed to express the intended functionality.
| In comparison to Prolog, Python is a quite complex language
| with many different language constructs, including loops,
| objects, methods, assignment, dictionaries etc. all of which
| are used in this example.
|
| As I see it, a key attraction of Prolog is its simplicity:
| With a single language construct (Horn clauses), you are able
| to express all known computations, and the example queries I
| posted show that only a single language element, namely again
| Horn clauses to express a query, is needed to run the code.
| The Prolog query, and also every Prolog clause, is itself a
| Prolog term and can be inspected with built-in mechanisms.
|
| As a consequence, an immediate benefit of using Prolog for
| such use cases is that you can easily reason about user-
| specified queries in your applications, and for example
| easily allow only a safe subset of code to be run by users,
| or execute a user-specified query with different execution
| strategies etc. In comparison, Python code is much harder to
| analyze and restrict to a particular subset due to the
| language's comparatively high syntactic complexity.
| the_jeremy wrote:
| The benefit of Python is that developers already know about
| these language constructs, and that more developers know
| Python than Prolog.
| lostcolony wrote:
| I don't think the op's point was "how easy it would be to
| hire developers", or even "taking all the considerations
| a business is under, I feel Prolog makes sense". He was
| just touting how easy Prolog's built in pattern matching
| and declarative style makes implementing and using
| selectors at a language level.
|
| Honestly, if we didn't talk about the benefits of a
| language irrespective of how easy it is to hire for it,
| we'd never have introduced anything beyond FORTRAN, if we
| even made it that far. Bringing "X is easier to hire for"
| into a conversation about the language is, at best, a
| non-sequitur.
| notriddle wrote:
| We might have been better off that way. FORTRAN does have
| its downsides, but language churn itself has downsides
| that almost always outweigh the assumed upsides of a
| better language.
|
| If we had just stuck with FORTRAN forever, how many
| problems would have been completely avoided!? There'd be
| better, and more, IDEs, since even if the language is
| hard to parse, it's still just one parser that needs all
| the effort. So many unfortunate problems in education
| caused by language and ecosystem churn would have been
| avoided (the infamous "by the time you graduate, it's
| always outdated" problem).
|
| The only problem is that FORTRAN is too new. Should've
| stuck with the Hollerith tabulator.
| jfmc wrote:
| AFAIK, this was first proposed and implemented in Ciao Prolog
| back in late 90s (modern versions here: https://ciao-
| lang.org/ciao/build/doc/ciao.html/html.html). It was way before
| Python was popular and JavaScript ever existed.
| parhamn wrote:
| I've been looking for a library that can find the best set of
| selectors to most consistently find the element you're looking
| for in a page.
|
| Any pointers to something that exists? Interestingly, I've also
| found very little for DOM extraction in the open-source ML
| space.
___________________________________________________________________
(page generated 2021-09-07 23:01 UTC) |