[HN Gopher] Htmlq: like jq, but for html
___________________________________________________________________
 
Htmlq: like jq, but for html
 
Author : jabo
Score  : 849 points
Date   : 2021-09-07 07:12 UTC (15 hours ago)
 
web link (github.com)
w3m dump (github.com)
 
| dfederschmidt wrote:
| This looks very useful, big fan of all the ^[a-z]+q$ utilities
| out there. But as a user, I would probably want to use XPath[0]
| notation here. Maybe that is just me. A quick search revealed
| xidel[1] which seems to be similar, but supports XPath.
| 
| [0]https://en.wikipedia.org/wiki/XPath
| [1]https://github.com/benibela/xidel
 
  | chriswarbo wrote:
  | My web scraping tends to start with xidel. If I need a little
  | bit more power I'll use xmlstarlet. If neither of those is
  | enough, I'll use Python's beautifulsoup package :)
 
  | waynenilsen wrote:
  | part of the problem with this is that HTML is mostly not valid
  | XML
 
  | akie wrote:
  | I'd like to state my support for the author's choice of CSS
  | selectors in this particular use case. I think it's a natural
  | fit for this domain and already very well known, perhaps even
  | known better than XPath.
 
    | mirekrusin wrote:
    | Playwright ppl had to solve this for themselves, you can mix
    | them as they are distinct, have few small custom
    | modifications to help with selectors. Playwright compatible
    | selectors would be nice.
 
    | berkes wrote:
    | I'd like to add my support here too, but with a note.
    | 
    | When scraping and parsing (or writing integration test DSL),
    | I always start out with CSS selectors. But always hit cases
    | where they lack or require hoop-jumping and then fall back on
    | Xpath. I then have a codebase with both CSS-Sel and Xpath,
    | which is arguably worse than having only one method.
    | 
    | I suspect here, one uses this tool until CSS selector
    | limitations are getting in the way, after which one switches
    | to another tool(chain)
 
      | alpha_squared wrote:
      | Do you mind giving an example? I'm having trouble following
      | where CSS is limited for selection.
 
        | berkes wrote:
        | Like the other commenter says: parent/child. But also
        | selecting by content (e.g. "click the button with the
        | delete-icon" or "find the link with '@harrypotter'") or
        | selecting by attributes (e.g. click the pager-item that
        | goes to next page) or selecting items outside of body
        | (e.g. og-tags, title etc). All are doable in CSS3
        | selectors, but everything shouts that they are not meant
        | for this; whereas xpath does this far more naturally.
 
        | unspecified wrote:
        | Searching text content is my main remaining use of XPath.
 
        | benibela wrote:
        | XPath does general data processing not just selection
        | 
        | E.g. when you have a list of numbers on the website,
        | XPath can calculate the sum or the maximum of the numbers
        | 
        | Or you have a list of names "Last name, First name", then
        | you can remove the last name and sort the first names
        | alphabetically. Or count how often each name occurs and
        | return the most popular name.
        | 
        | Then it goes back to selection, e.g. select all numbers
        | that are smaller than the average. Or calculate the most
        | popular name, then select all elements containing that
        | name
 
        | vlunkr wrote:
        | Well, the big one is selecting a parent from the child.
 
        | androceium wrote:
        | You could do this with the :has() CSS pseudo-class[0],
        | though inverted (select a parent that _has_ the child
        | matching a selector).
        | 
        | Looks like that pseudo-class has not been implemented in
        | the kuchiki library that htmlq uses though.
        | 
        | [0]: https://developer.mozilla.org/en-US/docs/Web/CSS/:has
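
A minimal sketch of the parent-from-child selection discussed above, using the limited XPath subset in Python's standard xml.etree.ElementTree (not htmlq, kuchiki, or a CSS engine). The sample markup and the "next" class name are made up for illustration, and the input has to be well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (an assumption for this sketch).
SAMPLE = """
<ul>
  <li><a class="next" href="/page/2">Next</a></li>
  <li>no link here</li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# Walk from the matched child up to its parent with the ".." step.
parents_via_step = root.findall(".//a[@class='next']/..")

# The inverted form: select the parent that has such a child,
# similar in spirit to what :has() expresses in CSS.
parents_via_predicate = root.findall(".//li[a]")

for li in parents_via_step:
    print(li.tag, [child.get("href") for child in li])
```

Both queries end up on the same `li` element; the first reads "child, then up", the second "parent that contains".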
 
      | Jenk wrote:
      | I've not had much friction using either, they are "close
      | enough" that the time to (re)write a query from one to the
      | other is not very significant.
 
  | lilyball wrote:
  | This looks really neat! It supports a bunch of different query
  | types, and can even do things like follow links to get info
  | about the linked-to pages!
  | 
  | It's also in nixpkgs, though for some reason the nixpkgs
  | derivation is marked as linux-only (i.e. not Darwin). (Edit:
  | probably because the fpc dependency is also Linux-only, with a
  | linux-specific patch and a comment suggesting that supporting
  | other platforms would require adding per-platform patches)
 
  | exyi wrote:
  | Thanks, this looks more powerful. Supports CSS, XPath and
  | XQuery. Maybe I could learn a bit of XQuery when I have a use
  | case for it :)
 
    | dmit wrote:
    | Well, here's your first lesson then: if you prepend (: to
    | your comment it will become a valid XQuery document!
    | 
    | (: XQuery comments are marked by mirrored smilie faces, like
    | this. :)
 
      | bdcravens wrote:
      | Nice - I've been writing XQuery for years and I had no clue
 
| firefoxd wrote:
| Super useful. You've created a fantastic tool here. Thank you.
 
| d--b wrote:
| Just being that guy: is there a reason you didn't call it hq?
 
  | zamadatix wrote:
  | Not author but neither is the poster: Jq got away with it
  | because it's one of the few 2 letter combinations that wasn't
  | absolutely overloaded and "jquery" was already taken. OTOH
  | nobody shortens HTML to H and HQ is an extremely common
  | acronym, if not one of the most popular 2 letter acronyms you
  | could pick.
 
    | OJFord wrote:
    | jq didn't get away with it! Have you never tried searching
    | for anything to do with it? How I _wish_ it were called
    | `jsonq`!
 
  | mgdm wrote:
  | I just wanted to be slightly more descriptive and less likely
  | to collide with other tools.
 
    | notRobot wrote:
    | Hahah, I love how this is your second comment in 10 years on
    | HN.
 
      | mgdm wrote:
      | Hah. Yeah. I had another account for a little while but
      | then HN started to let me reset the password for this one
      | quite recently, so here I am.
 
| Snd_ wrote:
| This is great! Thanks
 
| who-shot-jr wrote:
| Good work!
 
| harperlee wrote:
| Nice!
| 
| This is the kind of obvious tool that once it exists, you can't
| really grok the fact it did not earlier, and that it took until
| now to exist.
 
  | dmos62 wrote:
  | > grok
  | 
  | A good opportunity to introduce `gron` to those unfamiliar!
  | 
  |     > gron "https://api.github.com/repos/tomnomnom/gron/commits?per_page=1" | fgrep "commit.author"
  |     json[0].commit.author = {};
  |     json[0].commit.author.date = "2016-07-02T10:51:21Z";
  |     json[0].commit.author.email = "mail@tomnomnom.com";
  |     json[0].commit.author.name = "Tom Hudson";
  | 
  | https://github.com/tomnomnom/gron
 
    | rsync wrote:
    | "A good opportunity to introduce `gron` to those unfamiliar!"
    | 
    | Thank you - appreciated.
    | 
    | I haven't done much work with json but have had reasons
    | recently to do so - and I immediately saw how difficult it
    | was to pipeline to grep ...
    | 
    | But what I still don't understand is that some json outputs I
    | see have multiple values _with the exact same name_ (!) and
    | that still seems  "un-grep-able" to me ...
    | 
    | What am I missing ?
 
      | dmos62 wrote:
      | You might be missing a change in index: `obj[0].prop` vs
      | `obj[1].prop`. Or, your JSON might have the same property
      | defined multiple times: `{a:1, a:2}` (though I'm not sure
      | how gron handles that situation).
 
        | lvncelot wrote:
        | > (though I'm not sure how gron handles that situation).
        | 
        | It seems both gron and jq only use the value that has
        | been defined last:
        | 
        |     ~  echo '{"a":1,"a":2}' | gron
        |     json = {};
        |     json.a = 2;
        |     ~  echo '{"a":1,"a":2}' | jq
        |     {
        |       "a": 2
        |     }
 
      | croon wrote:
      | The json output likely contains multiple objects. Can you
      | request more specifically the object(s) you need and grep
      | on that?
 
      | dotancohen wrote:
      | > But what I still don't understand is that some json
      | > outputs I see have multiple values with the exact same
      | > name
      | 
      | This is neither explicitly allowed nor explicitly forbidden
      | by the JSON spec. It is implementation dependent upon how
      | to handle - does one value override the other? Should they
      | be treated as an array?
      | 
      | In practice, this situation is usually carefully avoided by
      | services that produce JSON. If you are interfacing with a
      | service that does produce duplicate values, I'd be
      | interested in seeing it for curiosity's sake. If you are
      | writing a service and this is the output, then I implore
      | you to reconsider!
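
A small sketch of the two behaviours mentioned above (last value overrides, or duplicates treated as an array), using Python's standard json module rather than gron or jq. The `collect_all` hook name is hypothetical; `object_pairs_hook` is the real parameter that makes the second behaviour possible.

```python
import json

DOC = '{"a": 1, "a": 2}'

# Default behaviour of Python's json module: the last occurrence
# of a repeated key silently wins.
last_wins = json.loads(DOC)

def collect_all(pairs):
    """Hypothetical hook: keep every value of every key in a list."""
    merged = {}
    for key, value in pairs:
        merged.setdefault(key, []).append(value)
    return merged

# object_pairs_hook receives the key/value pairs in document order,
# so duplicates are still visible at this point.
all_values = json.loads(DOC, object_pairs_hook=collect_all)

print(last_wins)
print(all_values)
```

The same hook is a convenient place to raise an error instead, if duplicate names should be rejected outright.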
 
  | ptwt wrote:
  | It did write it a few years ago.
  | 
  | https://github.com/plainas/tq
 
  | matsemann wrote:
  | There are already tools for xpath, but using css selectors is
  | much more aligned with what I write every day, so that's nice.
 
    | harperlee wrote:
    | Yes, and awk and others. I meant something semantically
    | closer to the need, with css selectors.
 
  | natrys wrote:
  | It's not novel obviously. I have been using pup[1] for years.
  | And xidel[2] is probably older.
  | 
  | [1] https://github.com/ericchiang/pup
  | 
  | [2] https://github.com/benibela/xidel
 
| ducktective wrote:
| Looks nice! Any comparisons with pup?
| 
| https://github.com/ericchiang/pup
 
| notorandit wrote:
| Next is xmlq: https://github.com/dscape/xmlq
 
| Ronak123 wrote:
| https://techflashes.com/top-upcoming-futuristic-technologies...
 
| unityByFreedom wrote:
| Why not just jquery?
 
| purplecats wrote:
| brilliant. does this spin up a heavy DOM implementation in the
| background or do something lighter such as regexp?
 
  | mdzn wrote:
  | You can't parse HTML with regexps. It's not a regular language.
 
    | underdeserver wrote:
    | https://stackoverflow.com/questions/1732348/regex-match-open...
 
    | carnitine wrote:
    | What language implements regexps that actually correspond to
    | regular languages though?
 
  | Deukhoofd wrote:
  | Looks like it uses Servo's html5ever (through kuchiki), so no
  | DOM representation.
 
    | chrismorgan wrote:
    | Kuchiki materialises what they call a "DOM-like tree". I'd
    | consider it a DOM tree, myself, despite the differences in
    | precise API.
    | 
    | But it's not using a full browser to back it, which I suspect
    | is what's really being asked.
 
  | mrweasel wrote:
  | It looks to be using html5ever to parse the HTML, similar to
  | something like BeautifulSoup in Python.
 
  | delusional wrote:
  | The source is right there. You can read it. It uses html5ever
  | (part of the servo project).
 
  | [deleted]
 
  | gostsamo wrote:
  | You can't parse html with regular expressions :)
  | 
  | https://stackoverflow.com/questions/1732348/regex-match-open...
 
    | bmn__ wrote:
    | "Oh Yes You Can Use Regexes to Parse HTML!"
    | 
    | https://stackoverflow.com/a/4234491
 
      | anon4242 wrote:
      | Yeah, if you allow yourself some Perl to help you with
      | those parts that regexes can't handle...
 
      | akie wrote:
      | Technically correct, but did you see the regex he uses? It
      | spans 82 lines...
 
    | andybak wrote:
    | And the obligatory caveat from the comments:
    | 
    | > While parsing arbitrary HTML with only a regex is impossible,
    | > it's sometimes appropriate to use them for parsing a limited,
    | > known set of HTML.
 
      | hnbad wrote:
      | The emphasis here is on "known". The tool is general
      | purpose (i.e. handling _unknown_ HTML) so using regexes
      | would be ill-advised.
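
To illustrate the known-versus-unknown point, here is a rough sketch comparing a naive regex with a tolerant parser. It uses Python's standard html.parser as a stand-in (htmlq itself uses html5ever, per the comments above); the sample markup, the regex, and the `LinkCollector` class are all assumptions made up for this example.

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, however they are written."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)

# Hypothetical input: attribute order, quoting style and letter case
# all differ from what the naive regex below expects.
SAMPLE = """<a title="a > b" href='/x'>one</a><A HREF=/y>two</A>"""

# Works only for the one exact shape <a href="...">.
naive = re.findall(r'<a href="([^"]*)"', SAMPLE)

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()

print(naive)
print(collector.hrefs)
```

The regex could be patched for each of these variations, but only once they are known in advance, which is the caveat being quoted.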
 
| dredmorbius wrote:
| See also the html-xml-utils from w3c.
| 
| hxextract and hxselect perform similar extract functions.
| 
| hxclean and hxnormalize (combined) will pretty-print HTML.
| 
| https://www.w3.org/Tools/HTML-XML-utils/
 
  | mozey wrote:
  | Funny, couple of years ago I thought someone should create
  | something for JSON similar to what
  | [XSLT](https://en.wikipedia.org/wiki/XSLT) is for XML. See
  | example here https://www.w3schools.com/xml/xsl_intro.asp
  | 
  | Then I found out about jq because awscli was using it in
  | example docs.
  | 
  | I guess `htmlq` makes sense if it has the exact same syntax as
  | `jq`, and the user is already familiar with the latter?
 
| desktopninja wrote:
| Very nice tool. I've long spoiled myself with PowerShell's
| Invoke-WebRequest, e.g.:
| 
|     # what is the latest release of apache-tomcat?
|     $LINKS=$(Invoke-WebRequest -Uri 'https://tomcat.apache.org/download-80.cgi' | Select-Object -ExpandProperty Links)
|     $LATEST=$($Links | Where-Object -Property href -Match '#8.5.[0-9]+').href.substring(1)
|     $FETCH=$($Links | Where-Object -Property href -match "apache-tomcat-${LATEST}.zip$").href
 
  | Tepix wrote:
  | Should it be $LINKS instead of $Links (2x)?
 
    | desktopninja wrote:
    | "$links" works too because PWSH is not case sensitive. But I
    | should have used $LINKS like you said for cleaner write-up.
 
| systemvoltage wrote:
| This is nifty! Python + bs4 takes some googling to remember how
| to parse a webpage. This is just straight forward, thanks so
| much.
 
| jillesvangurp wrote:
| If you make the html well formed, xpath also works great. Great
| stuff if you ever need to pick html apart. Used this quite a bit
| when microformats were still a thing together with jtidy.
| 
| Jq is very loosely inspired by that, I guess. Might come full
| circle here and use some XSL transformations ...
 
  | qw wrote:
  | You can usually find a html parser for your language, that you
  | can use xpath/xsl on. It will just make the same assumptions
  | that the browser does, by adding missing closing tags etc.
  | 
  | I made a tool that extracted parts of web pages 10-15 years
  | ago, and it worked well. There are of course cases where the
  | html is so unstructured that the results were unpredictable,
  | but it worked well in general.
 
| ludovicianul wrote:
| And a Java version with pre-compiled binaries:
| https://github.com/ludovicianul/hq
 
| srg0 wrote:
| "htmlq: like jq, but for HTML"
| 
| "jq is like sed for JSON data"
| 
| sed: "While in some ways similar to an editor which permits
| scripted edits (_such as ed_), sed works by making only one pass
| over the input(s)"
| 
| ed: "ed is a line-oriented text editor".
| 
| Software definition through a reference to another software is
| somewhat confusing. Potential users come from different
| backgrounds (I had no idea what is jq), and it is not clear what
| are the defining features of each project. Is jq line oriented?
| Is htmlq operating in a single pass?
 
  | digitalsushi wrote:
  | "htmlq is like jq but for html" is a very specific 'dog
  | whistle' for people who use jq. I agree that people who don't
  | know what jq is will get no value and pay no attention. But for
  | people who use jq, the claim is, like a dog whistle, clear,
  | concise, and means exactly what it says. In two seconds,
  | everyone using both jq and html will instantly know what is
  | available and log it away.
  | 
  | So for general purposes, it's a terrible marketing pitch. And
  | yet I think it's a very, very valuable demonstration of knowing
  | some of their 'customers'.
 
    | acomar wrote:
    | this isn't what a dogwhistle is. it's just explanation by
    | analogy to a model presumed to be shared by the intended
    | audience. a dogwhistle offers a surface meaning to the
    | uninitiated that's anodyne but communicates a hidden, coded
    | message to those who possess some undisclosed, shared
    | knowledge with the author. this kind of analogy entirely
    | lacks the surface meaning and the message shared via jargon
    | also communicates something about how you might learn enough
    | to understand the analogy.
 
    | philipswood wrote:
    | I can't speak for people who don't know jq, but knowing jq,
    | this is a great tagline: it gives me an immediate
    | understanding of what it does, how I could expect to use it
    | and what value and ease of use I can expect.
    | 
    | I'll be trying it out next time I'm on a PC.
 
      | rendall wrote:
      | > _I can't speak for people who don't know jq,_
      | 
      | I can, and it's not illuminating at all.
 
  | throwaway2016a wrote:
  | I agree, however if you do know how to use jq then "like jq,
  | but for html" is extremely effective. I use jq all the time and
  | that title hooked me, I immediately wanted to try it.
  | 
  | But if you haven't used jq then I can see how that title is
  | less than helpful.
 
  | ducktective wrote:
  | The first three are not proper definitions per-se but kind of
  | an advertisement, trying to familiarize by self-comparison with
  | a _tried & true_ tool that has proven its worth.
  | 
  |  _You know Jimmy the famous mechanic? I'm Timmy, _his brother_
  | but an electrician._
  | 
  | IMO, at least `jq` has proven itself as _the_ indispensable
  | tool for json-data manipulation.
 
  | corporealshift wrote:
  | I mean...if you read the github readme it literally describes
  | what it does in the next line: "Uses CSS selectors to extract
  | bits content from HTML files".
 
  | kbenson wrote:
  | > Software definition through a reference to another software
  | is somewhat confusing.
  | 
  | Possibly, depending on background as you note, but not all
  | promotion is aimed at the same audience. When submitting to
  | HN, "like jq, but for X" is short and conveys what it is to
  | most of the people that would care, I think. jq has been
  | submitted and talked about here _many_ times with lively
  | discussion over the years.[1] At this point I think most of
  | those that are interested in what that is and what this is will
  | understand fairly quickly from the title. Those that don't might be
  | missed, or they might look it up like you, or they might see it
  | through some other submission some other time with a different
  | title which isn't based on a chain of references.
  | 
  | 1:
  | https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
 
  | zamadatix wrote:
  | 1st sentence - Explaining the tool for those the tool was made
  | for without beating around the bush.
  | 
  | 2nd sentence - Explaining the tool to folks in the general web
  | domain what it can do for them.
  | 
  | 3rd sentence - Explaining where to learn how to use the tool if
  | you've stumbled across it but web is not your area of
  | expertise.
  | 
  | All that info fits in nearly 25 words then it lists the options
  | for the tool and jumps straight into multiple examples (with
  | outputs!). If the only explanation had been "htmlq: like jq,
  | but for HTML" I'd agree but having the comparison to explain
  | what it does isn't a bad thing it's _only_ having the
  | comparison that would be bad.
  | 
  | Personally I think this is a model example of an opening for a
  | Github readme.
 
    | zerocount wrote:
    | I disagree. The 2nd sentence contains, "extract bits
    | content." What is that?
    | 
    | If you're going to write a minimal introduction, at least
    | make sure it's not confusing.
    | 
    | I get the feeling the author felt compelled to write an
    | introduction and did so with as little effort as possible.
 
      | cyberge99 wrote:
      | I believe he tailored it to his target audience. If you
      | find it confusing, you are likely not it.
 
        | ritchiea wrote:
        | As a web developer for over a decade, "bits content" doesn't
        | mean anything to me. But I understand what the tool does
        | from the rest of the description. Try running a google
        | search for "bits content," [0] it's not a commonly used
        | phrase in web development or anything. It's a poor choice
        | of words.
        | 
        | 0. https://www.google.com/search?hl=en&q=%22bits%20content%22
 
        | chownie wrote:
        | It's supposed to be "bits of content", it's not jargon.
        | The author's just accidentally a word, we all do it.
 
        | ritchiea wrote:
        | It's more than fair to say in technical documentation you
        | intend others to use having a grammatical error or
        | missing word is confusing and a problem. It's the writing
        | equivalent of having a bug in your code. And it's
        | definitely not "writing to a target audience" as the
        | parent comment suggested. We all make mistakes but don't
        | try to call a mistake effective documentation.
 
        | RobertKerans wrote:
        | Of course it is, but neither parent nor anyone else is
        | saying anything close to the mistake being effective
        | documentation. There's a single missing word which needs
        | to be added in, but the overall text is clearly writing
        | to a target audience. You are aware of this, and of how
        | small the mistake is, and you understand what the
        | sentence should read as, so I'm not sure what your point
        | is?
 
      | theIV wrote:
      | My hunch is that this is a typo and it should read "extract
      | bits OF content."
 
        | mgdm wrote:
        | Exactly this! I'll fix it after work.
 
        | rendall wrote:
        | Maybe have the line about "jq" be 2nd. Have the first
        | line be a brief description of what it actually does.
 
        | ritchiea wrote:
        | I agree and having a missing word in your text often
        | leads to confusion :)
        | 
        | Honestly you could drop the "bits" which is a bit
        | redundant and use the phrase "Uses CSS selectors to
        | extract content from HTML files."
 
  | oauea wrote:
  | What's this thing called a "computer" that people keep going on
  | about, anyway?
 
    | da_chicken wrote:
    | It's a person who does mathematical calculations all day. For
    | example, creating range tables for artillery, calculating
    | averages or totals of a large range of values, or solving
    | complex integrals or differential equations, and so on.
    | They're commonly used in industry or government, especially
    | in astronomy, aerospace and civil engineering for both
    | simulation and analysis. Perhaps the most well-known
    | computers were the Harvard Computers, which operated in the
    | late 19th and early 20th centuries.
    | 
    | As a job, computers were largely automated out of existence
    | by solid-state transistor based automated computers and
    | integrated circuit transistor automated computers in the 60s,
    | 70s and 80s, which replaced the enormously expensive and
    | often largely experimental electro-mechanical automated
    | computers while radically reducing cost and improving
    | performance both by several orders of magnitude.
 
    | samstave wrote:
    | Here - this explains it really succinctly:
    | 
    | https://www.youtube.com/watch?v=lE1bS-Mn2Mk
 
    | theandrewbailey wrote:
    | It's like a programmable loom, but for logical and
    | mathematical operations.
 
      | samhw wrote:
      | You may be interested in the symbol grounding problem
      | (https://en.wikipedia.org/wiki/Symbol_grounding_problem#Groun...).
      | It's like the binding problem, but for symbols.
 
        | [deleted]
 
    | sundarurfriend wrote:
    | Sort of related: [Expecting Short Inferential
    | Distances](https://www.readthesequences.com/Expecting-Short-Inferential...)
 
  | nextaccountic wrote:
  | jq isn't line-oriented, it's JSON-oriented. It operates on a
  | stream of JSON values from stdin, so its query is applied to
  | each value in sequence.
  | 
  | I would expect htmlq to run the query a single time for a
  | single html; just like jquery $('#something') or
  | document.querySelector('#something')
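
A rough sketch of the "stream of JSON values, query applied to each in sequence" model described above, written with Python's standard json module. This is not how jq is implemented; the `iter_json_values` helper and the sample stream are assumptions for illustration.

```python
import json

def iter_json_values(text):
    """Yield each top-level JSON value from concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value

# Hypothetical input: three documents, separated by a space and a
# newline, so the stream is clearly not line-oriented.
STREAM = '{"name": "a"} {"name": "b"}\n{"name": "c"}'

# The "query" (here: pick the name field) runs once per document.
names = [doc["name"] for doc in iter_json_values(STREAM)]
print(names)
```

An HTML tool, by contrast, would typically parse the whole input as one document and evaluate the selector against that single tree.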
 
| zatkin wrote:
| Why not incorporate this into jq itself, like perhaps adding some
| command line arguments to switch to HTML mode?
 
  | Deukhoofd wrote:
  | What would the benefits of fitting an HTML parser into a JSON
  | parser tool be?
 
    | lmm wrote:
    | JQ is not just a parser but a tool for doing operations, many
    | of which are (or should be) generic across any tree-like data
    | format. Reusing that part across different input formats
    | makes a lot of sense.
 
    | mjburgess wrote:
    | Well once there's an HTML parser, then a pdf viewer, and then
    | everything needed for PDFs (ie., programming, emailer, video
    | support, etc.) we'll finally have that ideal operating system
    | we've been waiting for.
 
    | mro_name wrote:
    | sounds a lot more like blockchain.
 
  | e12e wrote:
  | Would probably be more useful to implement html2json, and pipe
  | in html?
  | 
  | Ed: eg: https://github.com/Jxck/html2json
 
| downWidOutaFite wrote:
| Why? I find xpath's syntax much simpler and regular than jq's.
 
| [deleted]
 
| pabs3 wrote:
| I tend to reach for XPath selectors before CSS ones when querying
| HTML.
 
| necovek wrote:
| Nice, I expected something based on XPath (like xpd), but web
| developers dealing with HTML are infinitely more familiar with
| CSS selectors, so a great choice!
 
  | busterarm wrote:
  | I want the option to use both, like Nokogiri gives you.
 
    | necovek wrote:
    | Sure, that sounds nice, but having two simple tools each
    | doing the job well in its own space is perfectly fine for me
    | -- do you imagine needing to combine Xpath and CSS queries in
    | a single run?
 
      | busterarm wrote:
      | I've had to do it when dealing with some poorly-designed
      | XML apis in the past. Nokogiri was a godsend.
 
| rendall wrote:
| What is jq?
 
| mro_name wrote:
| it's statically linkable rust, isn't it? Awesome. I'm looking for
| a successor to
| 
| $ xmllint --html --xpath ...
| 
| that doesn't choke on inline svg.
 
| gigatexal wrote:
| This is very cool. This will make scraping the web even easier!
 
| elif wrote:
| When I saw the title I thought this was some machine learning-
| specific rmq/0mq message passing tech called HT. Very excited to
| zero.
 
| m4r35n357 wrote:
| Should be HQ . . .
 
| pkrumins wrote:
| Call it "hq".
 
| jhatemyjob wrote:
| Crazy how a 300-line codebase manages to amass 2000 stars on
| Github and 700 upvotes on HN. Amazing ROI.
 
| gizdan wrote:
| Once upon a time I was using pup[0] for this sort of thing; later
| I changed to cascadia[1], which seemed much more advanced.
| 
| Comparing the two repos, it seems pup is dead, but cascadia may
| not be.
| 
| These tools, including htmlq, seem to sell themselves as "jq for
| html", which is far from the truth. jq is closer to awk, in that
| you can do just about everything with JSON. Cascadia, htmlq, and
| pup seem closer to grep for HTML. They can essentially only
| select data from an HTML source.
| 
| [0] https://github.com/EricChiang/pup
| 
| [1] https://github.com/suntong/cascadia
 
  | heavyset_go wrote:
  | I've used pup for a few projects, but was unaware of cascadia.
  | Thanks for pointing it out.
 
  | croon wrote:
  | Well, jq is grep _as well_ as sed and awk, but yeah, htmlq
  | seems to be just grep, for sake of comparison.
  | 
  | But I don't think html has any need for a sed/awk tool, or at
  | least not as much. Json output could very well be piped forward
  | to the next CLI tool after you've changed it slightly with jq.
  | I don't see this scenario as likely with html.
 
    | gizdan wrote:
    | > Well, jq is grep as well as sed and awk, but yeah, htmlq
    | seems to be just grep, for sake of comparison.
    | 
    | Exactly, and that is what I mean. If you want to compare,
    | compare it with grep, not jq.
    | 
    | Someone else posted xidel[0] in this thread, which I've not
    | used, but it seems to be the "jq but for html".
    | 
    | [0] https://github.com/benibela/xidel
 
| bamdadd wrote:
| is there a brew install command ?
 
| mcovalt wrote:
| I'd like to see a tool using lol-html [0] and their CSS selector
| API as a streaming HTML editor.
| 
| [0] https://github.com/cloudflare/lol-html
 
| hyperpallium2 wrote:
| From examples, this is only like jq in the sense that the q
| stands for the same thing. Even the way it does that is
| different.
| 
| An xmlq that was really like jq would be fun, about 20 years ago.
 
  | cerved wrote:
  | I would still like xmlq, there are (regrettably) still a lot of
  | applications that store data and configuration in xml
 
  | dotancohen wrote:
  | There is `xq` today, which parses XML like `jq`. I think that
  | it is relatively unknown because it is part of the `yq` package
  | for parsing YAML. So just install `yq` via pip and you'll get
  | `xq` as well.
  | 
  | There is also `xmlstarlet` for parsing XML in a similar
  | fashion.
 
    | hyperpallium2 wrote:
    | xmlstarlet is really nothing like jq, as a language. But yes,
    | I use it because it is the best commandline xml processor I'd
    | found. That's the only similarity to jq.
    | 
    | Is this the yq? https://kislyuk.github.io/yq/ It does contain
    | an 'xq', as a literal wrapper for jq, piping output into it
    | after transcoding XML to JSON using xmltodict
    | https://github.com/martinblech/xmltodict (which explodes xml
    | into separate JSON data structures).
    | 
    | This is a bash one-liner! But TBF it really is a 'jq for
    | xml'. I think it would be horrible for some things, but you
    | could also do a lot of useful things painlessly.
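
The "transcode XML to JSON, then query it" idea can be sketched with the Python standard library alone. This is a deliberately simplified converter, not xmltodict's actual algorithm or output format; the `element_to_data` helper, the "@" prefix for attributes, and the sample config are assumptions for illustration, and mixed text content is ignored.

```python
import json
import xml.etree.ElementTree as ET

def element_to_data(elem):
    """Simplified XML-to-dict conversion (hypothetical helper)."""
    children = list(elem)
    if not children and not elem.attrib:
        return elem.text  # leaf element: just its text
    # Attributes become "@name" keys in this sketch.
    data = {f"@{key}": value for key, value in elem.attrib.items()}
    for child in children:
        value = element_to_data(child)
        if child.tag in data:
            # Repeated tags are gathered into a list.
            if not isinstance(data[child.tag], list):
                data[child.tag] = [data[child.tag]]
            data[child.tag].append(value)
        else:
            data[child.tag] = value
    return data

# Hypothetical config file contents.
SAMPLE = """
<config>
  <store id="1"><name>Main</name></store>
  <store id="2"><name>Outlet</name></store>
</config>
"""

root = ET.fromstring(SAMPLE)
doc = {root.tag: element_to_data(root)}

# Once the data is plain dicts and lists, a jq-like query is ordinary indexing.
store_names = [store["name"] for store in doc["config"]["store"]]
print(json.dumps(doc, indent=2))
print(store_names)
```

One awkward spot this makes visible: a tag that occurs once becomes a plain value and a tag that occurs twice becomes a list, so queries have to cope with both shapes.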
 
      | dotancohen wrote:
      | Thank you for the comments. I've only recently discovered
      | both tools, and literally used them once each. Of the two
      | `xq` was easier for my particular work case (parsing a
      | Magento config) but I keep both tools in my virtual
      | toolbox.
      | 
      | If you have any other suggestions for parsing XML for
      | exploratory purposes I'm very happy to hear them.
 
        | hyperpallium2 wrote:
        | Thanks! Not actually a recommendation, but I have used
        | xsltproc (command line xslt), but it is horrible to use
        | because xslt syntax is horrible (though xslt's concepts
        | are pretty cool). One thing is it enables you to use
        | XPath in all its glory.
        | 
        | Just installed xq. It's nice just seeing the pretty-
        | printed json output, so thanks for the pointer. Probably
        | better than xmlstarlet for my usage, which just queries
        | and outputs text, not xml. hmmm, that's probably true for
        | most commandline uses...
 
    | jle17 wrote:
    | Just looked into this and I think it's worth mentioning that
    | there are two different projects called `yq`. The first one
    | that came up (written in go instead of python) is not the
    | right one and doesn't have the `xq` tool.
 
| abledon wrote:
| is anyone else using the https://github.com/json-path/JsonPath
| over the jq route?
| 
| I hope we standardize on some jq query language, like we have
| with a base set of SQL syntax
 
| andybak wrote:
| > like jq
| 
| "jq is a lightweight and flexible command-line JSON processor"
 
| chefandy wrote:
| If anyone is looking for a good library to do this in Python,
| PyQuery works well:
| 
| https://pythonhosted.org/pyquery/
 
| teitoklien wrote:
| Maybe call it hq ?
 
  | Simplicitas wrote:
  | My thoughts EXACTLY... but anyway, great new utility indeed!
 
    | teitoklien wrote:
    | Haha, indeed it's a very good utility :D
 
| oauea wrote:
| https://jsoup.org/ has been around for a long time and seems a
| bit more mature & maintained than this two-code-files 2-year-old
| repo. Highly recommend.
 
| avereveard wrote:
| what's wrong with using html tidy + xmllint ?
 
  | mro_name wrote:
  | nothing wrong. Searching unmodified html though is sometimes
  | preferable.
 
| soheil wrote:
| I'd use something like this script that you can put together
| yourself:
| 
|     #!/usr/bin/env ruby
|     require 'nokogiri'; p Nokogiri::HTML(STDIN.read).css(ARGV[0]).text
| 
| Just save it as _/usr/local/bin/hq_ and do _chmod +x !$_
| 
| Then you can do:
| 
|     curl -s "https://news.ycombinator.com/news" | hq "tr:first-child .storylink"
| 
| It uses Nokogiri[0], which is much more battle tested and works
| with CSS and XPath selectors.
| 
| [0]
| https://nokogiri.org/tutorials/parsing_an_html_xml_document....
 
| triska wrote:
| This is very nice!
| 
| For reasoning about tree-based data such as HTML, I also highly
| recommend the declarative programming language Prolog. HTML
| documents map naturally to Prolog terms and can be readily
| reasoned about with built-in language mechanisms. For instance,
| here is the sample query from the htmlq README, fetching all
| elements with id _get-help_ from https://www.rust-lang.org, using
| Scryer Prolog and its SGML and HTTP libraries in combination with
| the XPath-inspired query language from library(xpath):
| 
|     ?- http_open("https://www.rust-lang.org", Stream, []),
|        load_html(stream(Stream), DOM, []),
|        xpath(DOM, //(*(@id="get-help")), E).
| 
| Yielding:
| 
|     E = element(div,[class="flex flex-colum ...",id="get-help"],["\n        ",
|         element(h4,[],["Get help!"]),"\n        ",element(ul,[],["\n ...",
|         element(li,[],[element(a,[... = ...],[...])]),"\n ...",
|         element(li,[],[...]),...|...]),"\n ...",element(div,[class="la ..."],
|         ["\n ...",element(label,[...],[...]),...|...]),"\n    ..."])
|     ;   false.
| 
| The selector //(*(@id="get-help")) is used to obtain all HTML
| elements whose _id_ attribute is get-help. On backtracking, all
| solutions are reported.
| 
| The other example from the README, extracting all _links_ from
| the page, can be obtained with Scryer Prolog like this:
| 
|     ?- http_open("https://www.rust-lang.org", Stream, []),
|        load_html(stream(Stream), DOM, []),
|        xpath(DOM, //a(@href), Link),
|        portray_clause(Link),
|        false.
| 
| This query uses forced backtracking to write all links on
| standard output, yielding:
| 
|     "/".
|     "/tools/install".
|     "/learn".
|     "https://play.rust-lang.org/".
|     "/tools".
|     "/governance".
|     "/community".
|     "https://blog.rust-lang.org/".
|     "/learn/get-started".
|     etc.
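
For readers without a Prolog system at hand, the id lookup described in this comment can be approximated with Python's standard library alone. This is a rough sketch, not a port of the Prolog query: the class name `IdFinder`, the helper `find_by_id`, and the inline sample page are hypothetical, and it only records each matching start tag and its attributes rather than the full subtree.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Collect start tags whose id attribute equals a target value."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.matches = []  # (tag, attribute-dict) pairs

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if attr_map.get("id") == self.target_id:
            self.matches.append((tag, attr_map))


def find_by_id(html, target_id):
    finder = IdFinder(target_id)
    finder.feed(html)
    return finder.matches


# Hypothetical stand-in for a fetched page, so no network access is needed.
SAMPLE_HTML = """
<html><body>
  <div id="get-help" class="flex"><h4>Get help!</h4></div>
  <div id="other"><a href="/learn">Learn</a></div>
</body></html>
"""

print(find_by_id(SAMPLE_HTML, "get-help"))
# prints [('div', {'id': 'get-help', 'class': 'flex'})]
```

Unlike the Prolog version, this event-based parser never builds a tree, so returning the matched element's children would need extra bookkeeping.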
 
  | chriswarbo wrote:
  | Thanks, that's a rare example of something which is (a) simple
  | enough to understand for a Prolog-newbie like me, and (b) more
  | practical than the ubiquitous family-tree example.
  | 
  | I'm always looking for opportunities to dip my toes into
  | Prolog; in hindsight it's clearly a good fit for tree-
  | structured data structures.
 
    | samhw wrote:
    | Interestingly, the only other context in which I've come
    | across Prolog is from friends who studied at Cambridge, here
    | in the UK. For some reason, the CS 'tripos' (course) there is
    | really heavily focussed on Prolog, and everyone I know from
    | there ended up a huge fan of the language. I'm not sure why
    | that's the case, though, given that almost all other
    | universities seem to use more common languages (Java, C++,
    | etc).
 
      | zimpenfish wrote:
      | cs.man.ac.uk, at least back in 1992, had a compulsory
      | Prolog module in the first year. Don't know anyone from
      | then who didn't hate that module with a burning passion.
      | 
      | (There was no Java, C++, etc. either. It was SML, Pascal,
      | 68000, and Oracle Pascal-Embedded-SQL.)
 
      | ramses0 wrote:
      | "Prolog as a library" => Given "functional" constraints =>
      | $CONSTRAINTS.prolog( "query..." ) => results
      | 
      | ...many languages (similar to regex / state-machine) can
      | benefit greatly from offloading a portion to something
      | prolog-ish, but it's unfortunate that prolog knowledge
      | isn't as widely distributed.
 
      | WickyNilliams wrote:
      | I studied CS at a different university in UK and we used
      | Prolog for one module on AI or perhaps machine vision. I
      | really enjoyed working with it. This was 15 years ago.
      | Looking through their current curriculum I can't see prolog
      | being mentioned anymore. Shame!
 
  | pandatigox wrote:
  | I tried to run this on my computer now, but as a complete
  | Prolog noob, I'm having errors running the script? How do you
  | load the http_open module/library in the first place? I tried
  | following some Prolog tutorials in the past but I always get
  | stuck trying to run something in the REPL. I'm using scryer-
  | prolog. Thanks in advance!
 
    | triska wrote:
    | The libraries I mentioned can be loaded by invoking the
    | use_module/1 predicate on the toplevel, here is the complete
    | transcript that loads the SGML, HTTP and XPath libraries in
    | Scryer Prolog:
    | 
    |     ?- use_module(library(sgml)).
    |        true.
    |     ?- use_module(library(http/http_open)).
    |        true.
    |     ?- use_module(library(xpath)).
    |        true.
    | 
    | The second query also uses portray_clause/1 from
    | library(format), which you can load with:
    | 
    |     ?- use_module(library(format)).
    |        true.
    | 
    | After all these libraries are loaded, you can post the sample
    | queries from above, and it should work.
    | 
    | There are also other ways to load these libraries: A very
    | common way to load a library is to use the use_module/1
    | _directive_ in Prolog source files. In that case, you would
    | put for example the following 4 directives in a Prolog source
    | file, say sample.pl:
    | 
    |     :- use_module(library(sgml)).
    |     :- use_module(library(http/http_open)).
    |     :- use_module(library(xpath)).
    |     :- use_module(library(format)).
    | 
    | And then run sample.pl with:
    | 
    |     $ scryer-prolog sample.pl
    | 
    | You can then again post the goals from above on the toplevel,
    | and it will work too.
    | 
    | Another way is to put these directives in your ~/.scryerrc
    | configuration file, which is automatically consulted when
    | Scryer Prolog starts. I recommend doing this for libraries
    | you frequently need. Common candidates for this are for
    | example library(dcgs), library(lists) and library(reif).
    | 
    | Personally, I start Scryer Prolog from within Emacs, and I
    | have set up Emacs so that I can consult a buffer with Prolog
    | code, and also post queries and interact with the Prolog
    | toplevel from within Emacs.
 
      | pandatigox wrote:
      | Wow that works fantastically! Thank you for that. It almost
      | seems like magic.
 
  | okasaki wrote:
  | It's pretty easy in Python too, e.g.:
  | 
  |     >>> import requests
  |     >>> from bs4 import BeautifulSoup
  |     >>> soup = BeautifulSoup(requests.get("https://www.rust-lang.org").text)
  |     >>> [x["href"] for x in soup.find_all("a")]
  |     ['/', '/tools/install', '/learn', 'https://play.rust-lang.org/',
  |      '/tools', '/governance', '/community', 'https://blog.rust-lang.org/',...
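
A variant of the same link extraction that needs no third-party packages might look like the sketch below. It assumes the HTML is already available as a string (the sample markup is made up), and it skips anchors that have no `href` value, a case where indexing with `x["href"]` would raise `KeyError`. The names `LinkCollector` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Gather href values from <a> start tags, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is not None:  # ignore <a name="..."> and bare <a href>
            self.links.append(href)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Made-up markup standing in for a downloaded page.
SAMPLE_HTML = '<a href="/">Home</a><a name="top">anchor</a><a href="/learn">Learn</a>'

print(extract_links(SAMPLE_HTML))
# prints ['/', '/learn']
```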
 
    | triska wrote:
    | In a certain sense (for example, when measuring brevity), it
    | is indeed easy to write this example in Python. However, the
    | Python version also illustrates that many different language
    | constructs are needed to express the intended functionality.
    | In comparison to Prolog, Python is a quite complex language
    | with many different language constructs, including loops,
    | objects, methods, assignment, dictionaries etc. all of which
    | are used in this example.
    | 
    | As I see it, a key attraction of Prolog is its simplicity:
    | With a single language construct (Horn clauses), you are able
    | to express all known computations, and the example queries I
    | posted show that only a single language element, namely again
    | Horn clauses to express a query, is needed to run the code.
    | The Prolog query, and also every Prolog clause, is itself a
    | Prolog term and can be inspected with built-in mechanisms.
    | 
    | As a consequence, an immediate benefit of using Prolog for
    | such use cases is that you can easily reason about user-
    | specified queries in your applications, and for example
    | easily allow only a safe subset of code to be run by users,
    | or execute a user-specified query with different execution
    | strategies etc. In comparison, Python code is much harder to
    | analyze and restrict to a particular subset due to the
    | language's comparatively high syntactic complexity.
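
To give a feel for what "restricting code to a safe subset" involves on the Python side, here is a minimal sketch using the standard `ast` module. It accepts only arithmetic on literal constants; the allow-list and the function name `is_safe_expression` are assumptions chosen for illustration, and a real sandbox would need far more care than this.

```python
import ast

# Node types accepted by this toy checker (an assumed, deliberately tiny subset).
ALLOWED_NODES = (
    ast.Expression,
    ast.BinOp,
    ast.UnaryOp,
    ast.Constant,
    ast.Add,
    ast.Sub,
    ast.Mult,
    ast.Div,
    ast.USub,
)


def is_safe_expression(source):
    """Return True if source parses as an expression using only allowed nodes."""
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        return False
    # ast.walk visits every node, including operator nodes such as ast.Add.
    return all(isinstance(node, ALLOWED_NODES) for node in ast.walk(tree))


print(is_safe_expression("1 + 2 * 3"))         # prints True
print(is_safe_expression("__import__('os')"))  # prints False
```

Even this small checker has to enumerate node classes explicitly, and every further construct (names, calls, comprehensions, attribute access) would need its own entry and its own safety argument.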
 
      | the_jeremy wrote:
      | The benefit of Python is that developers already know about
      | these language constructs, and that more developers know
      | Python than Prolog.
 
        | lostcolony wrote:
        | I don't think the op's point was "how easy it would be to
        | hire developers", or even "taking all the considerations
        | a business is under, I feel Prolog makes sense". He was
        | just touting how easy Prolog's built-in pattern matching
        | and declarative style make implementing and using
        | selectors at a language level.
        | 
        | Honestly, if we didn't talk about the benefits of a
        | language irrespective of how easy it is to hire for it,
        | we'd never have introduced anything beyond FORTRAN, if we
        | even made it that far. Bringing "X is easier to hire for"
        | into a conversation about the language is, at best, a
        | non-sequitur.
 
        | notriddle wrote:
        | We might have been better off that way. FORTRAN does have
        | its downsides, but language churn itself has downsides
        | that almost always outweigh the assumed upsides of a
        | better language.
        | 
        | If we had just stuck with FORTRAN forever, how many
        | problems would have been completely avoided!? There'd be
        | better, and more, IDEs, since even if the language is
        | hard to parse, it's still just one parser that needs all
        | the effort. So many unfortunate problems in education
        | caused by language and ecosystem churn would have been
        | avoided (the infamous "by the time you graduate, it's
        | always outdated" problem).
        | 
        | The only problem is that FORTRAN is too new. Should've
        | stuck with the Hollerith tabulator.
 
  | jfmc wrote:
  | AFAIK, this was first proposed and implemented in Ciao Prolog
  | back in the late 90s (modern versions here:
  | https://ciao-lang.org/ciao/build/doc/ciao.html/html.html). It was
  | way before Python was popular and JavaScript ever existed.
 
| parhamn wrote:
| I've been looking for a library that can find the best set of
| selectors to most consistently find the element you're looking for
| in a page.
| 
| Any pointers to something that exists? Interestingly I've also
| found very little for dom extraction in the OS ML space.
 
___________________________________________________________________
(page generated 2021-09-07 23:01 UTC)