|
| dfederschmidt wrote:
| This looks very useful, big fan of all the ^[a-z]+q$ utilities
| out there. But as a user, I would probably want to use XPath[0]
| notation here. Maybe that is just me. A quick search revealed
| xidel[1] which seems to be similar, but supports XPath.
|
| [0]https://en.wikipedia.org/wiki/XPath
| [1]https://github.com/benibela/xidel
| chriswarbo wrote:
| My web scraping tends to start with xidel. If I need a little
| bit more power I'll use xmlstarlet. If neither of those is
| enough, I'll use Python's beautifulsoup package :)
| waynenilsen wrote:
| part of the problem with this is that HTML is mostly not valid
| XML
| akie wrote:
| I'd like to state my support for the author's choice of CSS
| selectors in this particular use case. I think it's a natural
| fit for this domain and already very well known, perhaps even
| known better than XPath.
| mirekrusin wrote:
| The Playwright people had to solve this for themselves: you can
| mix CSS and XPath selectors since they're distinct, and they
| added a few small custom modifications to help with selectors.
| Playwright-compatible selectors would be nice.
| berkes wrote:
| I'd like to add my support here too, but with a note.
|
| When scraping and parsing (or writing an integration-test DSL),
| I always start out with CSS selectors. But I always hit cases
| where they fall short or require hoop-jumping, and then fall
| back on XPath. I then have a codebase with both CSS selectors
| and XPath, which is arguably worse than having only one method.
|
| I suspect that here, one uses this tool until CSS selector
| limitations get in the way, after which one switches to
| another tool(chain).
| alpha_squared wrote:
| Do you mind giving an example? I'm having trouble following
| where CSS is limited for selection.
| berkes wrote:
| Like the other commenter says: parent/child. But also
| selecting by content (e.g. "click the button with the
| delete-icon" or "find the link with '@harrypotter'") or
| selecting by attributes (e.g. click the pager-item that
| goes to the next page) or selecting items outside of body
| (e.g. og-tags, title etc). All are doable with CSS3
| selectors, but everything shouts that they are not meant
| for this, whereas XPath does this far more naturally.
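| For instance, a rough sketch in Python with lxml (the markup
| and names here are made up, just to illustrate the contrast):
|
|     # select by text content, by attribute, and climb to a parent
|     from lxml import html
|
|     doc = html.fromstring("""
|     <ul>
|       <li><a href="/u/harrypotter">@harrypotter</a></li>
|       <li><a href="/u/ronweasley">@ronweasley</a></li>
|     </ul>
|     <a rel="next" href="/page/2">Next</a>
|     """)
|
|     by_text = doc.xpath("//a[contains(text(), '@harrypotter')]")
|     next_href = doc.xpath("//a[@rel='next']/@href")
|     parent_li = doc.xpath("//a[contains(text(), '@harrypotter')]/..")
|     print(by_text[0].text, next_href[0], parent_li[0].tag)
|     # @harrypotter /page/2 li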
| unspecified wrote:
| Searching text content is my main remaining use of XPath.
| benibela wrote:
| XPath does general data processing not just selection
|
| E.g. when you have a list of numbers on the website,
| XPath can calculate the sum or the maximum of the numbers
|
| Or you have a list of names "Last name, First name", then
| you can remove the last name and sort the first names
| alphabetically. Or count how often each name occurs and
| return the most popular name.
|
| Then it goes back to selection, e.g. select all numbers
| that are smaller than the average. Or calculate the most
| popular name, then select all elements containing that
| name
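| A rough sketch of the numeric part (xidel itself speaks XPath
| 2/3; this uses Python's lxml, which is only XPath 1.0, and the
| markup is made up):
|
|     from lxml import html
|
|     doc = html.fromstring("""
|     <ul>
|       <li class="price">3</li>
|       <li class="price">10</li>
|       <li class="price">5</li>
|     </ul>
|     """)
|
|     # aggregate over the selected nodes, then select against the average
|     total = doc.xpath("sum(//li[@class='price'])")  # 18.0
|     avg = doc.xpath("sum(//li[@class='price']) div count(//li[@class='price'])")  # 6.0
|     below_avg = doc.xpath(
|         "//li[@class='price']"
|         "[number(.) < sum(//li[@class='price']) div count(//li[@class='price'])]"
|     )
|     print(total, avg, [li.text for li in below_avg])  # 18.0 6.0 ['3', '5']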
| vlunkr wrote:
| Well, the big one is selecting a parent from the child.
| androceium wrote:
| You could do this with the :has() CSS pseudo-class[0],
| though inverted (select a parent that _has_ a child
| matching a selector).
|
| Looks like that pseudo-class has not been implemented in
| the kuchiki library that htmlq uses, though.
|
| [0]: https://developer.mozilla.org/en-
| US/docs/Web/CSS/:has
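| For what it's worth, a rough sketch of :has() via Python's
| BeautifulSoup, whose soupsieve backend does implement it (the
| markup is made up):
|
|     from bs4 import BeautifulSoup
|
|     soup = BeautifulSoup("""
|     <div class="card"><span class="delete-icon"></span></div>
|     <div class="card"><span class="edit-icon"></span></div>
|     """, "html.parser")
|
|     # the parent div whose child matches the inner selector
|     cards = soup.select('div.card:has(> span.delete-icon)')
|     print(len(cards))  # 1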
| Jenk wrote:
| I've not had much friction using either, they are "close
| enough" that the time to (re)write a query from one to the
| other is not very significant.
| lilyball wrote:
| This looks really neat! It supports a bunch of different query
| types, and can even do things like follow links to get info
| about the linked-to pages!
|
| It's also in nixpkgs, though for some reason the nixpkgs
| derivation is marked as linux-only (i.e. not Darwin). (Edit:
| probably because the fpc dependency is also Linux-only, with a
| linux-specific patch and a comment suggesting that supporting
| other platforms would require adding per-platform patches)
| exyi wrote:
| Thanks, this looks more powerful. It supports CSS, XPath and
| XQuery. Maybe I could learn a bit of XQuery when I have a use
| case for it :)
| dmit wrote:
| Well, here's your first lesson then: if you prepend (: to
| your comment it will become a valid XQuery document!
|
| (: XQuery comments are marked by mirrored smilie faces, like
| this. :)
| bdcravens wrote:
| Nice - I've been writing XQuery for years and I had no clue
| firefoxd wrote:
| Super useful. You've created a fantastic tool here. Thank you.
| d--b wrote:
| Just being that guy: is there a reason you didn't call it hq?
| zamadatix wrote:
| Not the author, but neither is the poster: jq got away with it
| because it's one of the few two-letter combinations that wasn't
| absolutely overloaded and "jquery" was already taken. OTOH,
| nobody shortens HTML to "H", and HQ is an extremely common
| acronym, if not one of the most popular two-letter acronyms you
| could pick.
| OJFord wrote:
| jq didn't get away with it! Have you never tried searching
| for anything to do with it? How I _wish_ it were called
| `jsonq`!
| mgdm wrote:
| I just wanted to be slightly more descriptive and less likely
| to collide with other tools.
| notRobot wrote:
| Hahah, I love how this is your second comment in 10 years on
| HN.
| mgdm wrote:
| Hah. Yeah. I had another account for a little while but
| then HN started to let me reset the password for this one
| quite recently, so here I am.
| Snd_ wrote:
| This is great! Thanks
| who-shot-jr wrote:
| Good work!
| harperlee wrote:
| Nice!
|
| This is the kind of obvious tool that, once it exists, you can't
| really grok the fact that it didn't exist earlier, and that it
| took until now to appear.
| dmos62 wrote:
| > grok
|
| A good opportunity to introduce `gron` to those unfamiliar!
| > gron "https://api.github.com/repos/tomnomnom/gron/commits?per
| _page=1" | fgrep "commit.author" json[0].commit.author
| = {}; json[0].commit.author.date =
| "2016-07-02T10:51:21Z"; json[0].commit.author.email =
| "mail@tomnomnom.com"; json[0].commit.author.name = "Tom
| Hudson";
|
| https://github.com/tomnomnom/gron
| rsync wrote:
| "A good opportunity to introduce `gron` to those unfamiliar!"
|
| Thank you - appreciated.
|
| I haven't done much work with json but have had reasons
| recently to do so - and I immediately saw how difficult it
| was to pipeline to grep ...
|
| But what I still don't understand is that some json outputs I
| see have multiple values _with the exact same name_ (!) and
| that still seems "un-grep-able" to me ...
|
| What am I missing ?
| dmos62 wrote:
| You might be missing a change in index: `obj[0].prop` vs
| `obj[1].prop`. Or, your JSON might have the same property
| defined multiple times: `{a:1, a:2}` (though I'm not sure
| how gron handles that situation).
| lvncelot wrote:
| > (though I'm not sure how gron handles that situation).
|
| It seems both gron and jq only use the value that has
| been defined last:
|
|     ~ echo '{"a":1,"a":2}' | gron
|     json = {};
|     json.a = 2;
|     ~ echo '{"a":1,"a":2}' | jq
|     {
|       "a": 2
|     }
| croon wrote:
| The json output likely contains multiple objects. Can you
| request more specifically the object(s) you need and grep
| on that?
| dotancohen wrote:
| > But what I still don't understand is that some json
| > outputs I see have multiple values with the exact same
| name
|
| This is neither explicitly allowed nor explicitly forbidden
| by the JSON spec. It is implementation-dependent how to
| handle it: does one value override the other? Should they
| be treated as an array?
|
| In practice, this situation is usually carefully avoided by
| services that produce JSON. If you are interfacing with a
| service that does produce duplicate values, I'd be
| interested in seeing it for curiosity's sake. If you are
| writing a service and this is the output, then I implore
| you to reconsider!
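| If you do run into it, most parsers quietly keep the last
| value. A rough Python sketch (the duplicate-key check is just
| an illustration, not a standard feature):
|
|     import json
|
|     print(json.loads('{"a": 1, "a": 2}'))  # {'a': 2} -- last value wins
|
|     def reject_duplicates(pairs):
|         # pairs is the raw list of (key, value) tuples, duplicates included
|         keys = [k for k, _ in pairs]
|         dupes = {k for k in keys if keys.count(k) > 1}
|         if dupes:
|             raise ValueError(f"duplicate keys: {dupes}")
|         return dict(pairs)
|
|     json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicates)  # raises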
| ptwt wrote:
| I did write one a few years ago.
|
| https://github.com/plainas/tq
| matsemann wrote:
| There are already tools for xpath, but using css selectors is
| much more aligned with what I write every day, so that's nice.
| harperlee wrote:
| Yes, and awk and others. I meant something semantically
| closer to the need, with css selectors.
| natrys wrote:
| It's not novel obviously. I have been using pup[1] for years.
| And xidel[2] is probably older.
|
| [1] https://github.com/ericchiang/pup
|
| [2] https://github.com/benibela/xidel
| ducktective wrote:
| Looks nice! Any comparisons with pup?
|
| https://github.com/ericchiang/pup
| notorandit wrote:
| Next is xmlq: https://github.com/dscape/xmlq
| Ronak123 wrote:
| https://techflashes.com/top-upcoming-futuristic-technologies...
| unityByFreedom wrote:
| Why not just jquery?
| purplecats wrote:
| brilliant. does this spin up a heavy DOM implementation in the
| background or do something lighter such as regexp?
| mdzn wrote:
| You can't parse HTML with regexps. It's not a regular language.
| underdeserver wrote:
| https://stackoverflow.com/questions/1732348/regex-match-
| open...
| carnitine wrote:
| What language implements regexps that actually correspond to
| regular languages though?
| Deukhoofd wrote:
| Looks like it uses servo's html5ever (through kuchiki), so no
| DOM representation.
| chrismorgan wrote:
| Kuchiki materialises what they call a "DOM-like tree". I'd
| consider it a DOM tree, myself, despite the differences in
| precise API.
|
| But it's not using a full browser to back it, which I suspect
| is what's really being asked.
| mrweasel wrote:
| It looks to be using html5ever to parse the HTML, similar to
| something like BeautifulSoup in Python.
| delusional wrote:
| The source is right there. You can read it. It uses html5ever
| (part of the servo project).
| [deleted]
| gostsamo wrote:
| You can't parse html with regular expressions :)
|
| https://stackoverflow.com/questions/1732348/regex-match-open...
| bmn__ wrote:
| "Oh Yes You Can Use Regexes to Parse HTML!"
|
| https://stackoverflow.com/a/4234491
| anon4242 wrote:
| Yeah, if you allow yourself some Perl to help you with
| those parts that regexes can't handle...
| akie wrote:
| Technically correct, but did you see the regex he uses? It
| spans 82 lines...
| andybak wrote:
| And the obligatory caveat from the comments:
|
| > While arbitrary HTML with only a regex is impossible, it's
| sometimes appropriate to use them for parsing a limited,
| known set of HTML.
| hnbad wrote:
| The emphasis here is on "known". The tool is general
| purpose (i.e. handling _unknown_ HTML) so using regexes
| would be ill-advised.
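| For the "known" case, something like this rough Python sketch
| is about the limit of what's sane (the snippet is made up):
|
|     import re
|
|     # fine for one fixed, known snippet; hopeless for arbitrary HTML
|     known = '<meta property="og:title" content="Hello, world">'
|     m = re.search(r'property="og:title" content="([^"]*)"', known)
|     print(m.group(1))  # Hello, world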
| dredmorbius wrote:
| See also the html-xml-utils from w3c.
|
| hxextract and hxselect perform similar extract functions.
|
| hxclean and hxnormalize (combined) will pretty-print HTML.
|
| https://www.w3.org/Tools/HTML-XML-utils/
| mozey wrote:
| Funny, couple of years ago I thought someone should create
| something for JSON similar to what
| [XSLT](https://en.wikipedia.org/wiki/XSLT) is for XML. See
| example here https://www.w3schools.com/xml/xsl_intro.asp
|
| Then I found out about jq because awscli was using it in
| example docs.
|
| I guess `htmlq` makes sense if it has the exact same syntax as
| `jq`, and the user is already familiar with the latter?
| desktopninja wrote:
| Very nice tool. I've long spoiled myself with Powershell's
| Invoke-WebRequest, eg.
|
|     # what is the latest release of apache-tomcat?
|     $LINKS=$(Invoke-WebRequest -Uri 'https://tomcat.apache.org/download-80.cgi' | Select-Object -ExpandProperty Links)
|     $LATEST=$($Links | Where-Object -Property href -Match '#8.5.[0-9]+').href.substring(1)
|     $FETCH=$($Links | Where-Object -Property href -match "apache-tomcat-${LATEST}.zip$").href
| Tepix wrote:
| Should it be $LINKS instead of $Links (2x)?
| desktopninja wrote:
| "$links" works too because PWSH is not case sensitive. But I
| should have used $LINKS like you said for cleaner write-up.
| systemvoltage wrote:
| This is nifty! Python + bs4 takes some googling to remember how
| to parse a webpage. This is just straightforward, thanks so
| much.
| jillesvangurp wrote:
| If you make the HTML well formed, XPath also works great. Great
| stuff if you ever need to pick HTML apart. Used this quite a
| bit, together with jtidy, back when microformats were still a
| thing.
|
| Jq is very loosely inspired by that, I guess. Might come full
| circle here and use some XSL transformations ...
| qw wrote:
| You can usually find an HTML parser for your language that you
| can use XPath/XSL on. It will just make the same assumptions
| that the browser does, e.g. adding missing closing tags.
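| To illustrate that repair step, a rough sketch in Python with
| lxml.html (the broken markup is made up):
|
|     from lxml import html
|
|     broken = "<ul><li>one<li>two<li>three"   # no closing tags at all
|     doc = html.fromstring(broken)            # the parser repairs it, browser-style
|     print(doc.xpath("//li/text()"))          # ['one', 'two', 'three']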
|
| I made a tool that extracted parts of web pages 10-15 years
| ago, and it worked well. There are of course cases where the
| html is so unstructured that the results were unpredictable,
| but it worked well in general.
| ludovicianul wrote:
| And a Java version with pre-compiled binaries:
| https://github.com/ludovicianul/hq
| srg0 wrote:
| "htmlq: like jq, but for HTML"
|
| "jq is like sed for JSON data"
|
| sed: "While in some ways similar to an editor which permits
| scripted edits (_such as ed_), sed works by making only one pass
| over the input(s)"
|
| ed: "ed is a line-oriented text editor".
|
| Defining software through a reference to other software is
| somewhat confusing. Potential users come from different
| backgrounds (I had no idea what jq is), and it is not clear what
| the defining features of each project are. Is jq line-oriented?
| Does htmlq operate in a single pass?
| digitalsushi wrote:
| "htmlq is like jq but for html" is a very specific 'dog
| whistle' for people who use jq. I agree that people who don't
| know what jq is will get no value and pay no attention. But for
| people who use jq, the claim is, like a dog whistle, clear,
| concise, and means exactly what it says. In two seconds,
| everyone using both jq and html will instantly know what is
| available and log it away.
|
| So for general purposes, it's a terrible marketing pitch. And
| yet I think it's a very, very valuable demonstration of knowing
| some of their 'customers'.
| acomar wrote:
| this isn't what a dogwhistle is. it's just explanation by
| analogy to a model presumed to be shared by the intended
| audience. a dogwhistle offers a surface meaning to the
| uninitiated that's anodyne but communicates a hidden, coded
| message to those who possess some undisclosed, shared
| knowledge with the author. this kind of analogy entirely
| lacks the surface meaning and the message shared via jargon
| also communicates something about how you might learn enough
| to understand the analogy.
| philipswood wrote:
| I can't speak for people who don't know jq, but knowing jq,
| this is a great tagline: it gives me an immediate
| understanding of what it does, how I could expect to use it
| and what value and ease of use I can expect.
|
| I'll be trying it out next time I'm on a PC.
| rendall wrote:
| > _I can 't speak for people who don't know jq,_
|
| I can, and it's not illuminating at all.
| throwaway2016a wrote:
| I agree, however if you do know how to use jq then "like jq,
| but for html" is extremely effective. I use jq all the time and
| that title hooked me; I immediately wanted to try it.
|
| But if you haven't used jq then I can see how that title is
| less than helpful.
| ducktective wrote:
| The first three are not proper definitions per se but kind of
| an advertisement, trying to familiarize by self-comparison with
| a _tried & true_ tool that has proven its worth.
|
| _You know Jimmy the famous mechanic? I'm Timmy, his brother,
| but an electrician._
|
| IMO, at least `jq` has proven itself as _the_ indispensable
| tool for json-data manipulation.
| corporealshift wrote:
| I mean...if you read the github readme it literally describes
| what it does in the next line: "Uses CSS selectors to extract
| bits content from HTML files".
| kbenson wrote:
| > Software definition through a reference to another software
| is somewhat confusing.
|
| Possibly, depending on background as you note, but not all
| promotion is intended for the same audience. When submitting to
| HN, "like jq, but for X" is short and conveys what it is to
| most of the people that would care, I think. jq has been
| submitted and talked about here _many_ times with lively
| discussion over the years.[1] At this point I think most of
| those that are interested in what that is and what this is will
| understand fairly quickly from the title. Those that don't
| might be missed, or they might look it up like you, or they
| might see it through some other submission some other time with
| a different title which isn't based on a chain of references.
|
| 1:
| https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
| zamadatix wrote:
| 1st sentence - Explaining the tool for those the tool was made
| for without beating around the bush.
|
| 2nd sentence - Explaining the tool to folks in the general web
| domain what it can do for them.
|
| 3rd sentence - Explaining where to learn how to use the tool if
| you've stumbled across it but web is not your area of
| expertise.
|
| All that info fits in nearly 25 words, then it lists the
| options for the tool and jumps straight into multiple examples
| (with outputs!). If the only explanation had been "htmlq: like
| jq, but for HTML" I'd agree, but having the comparison to
| explain what it does isn't a bad thing; it's _only_ having the
| comparison that would be bad.
|
| Personally I think this is a model example of an opening for a
| Github readme.
| zerocount wrote:
| I disagree. The 2nd sentence contains, "extract bits
| content." What is that?
|
| If you're going to write a minimal introduction, at least
| make sure it's not confusing.
|
| I get the feeling the author felt compelled to write an
| introduction and did so with as little effort as possible.
| cyberge99 wrote:
| I believe he tailored it to his target audience. If you
| find it confusing, you are likely not it.
| ritchiea wrote:
| As web developer for over a decade "bits content" doesn't
| mean anything to me. But I understand what the tool does
| from the rest of the description. Try running a google
| search for "bits content," [0] it's not a commonly used
| phrase in web development or anything. It's a poor choice
| of words.
|
| 0. https://www.google.com/search?hl=en&q=%22bits%20conten
| t%22
| chownie wrote:
| It's supposed to be "bits of content", it's not jargon.
| The author's just accidentally a word, we all do it.
| ritchiea wrote:
| It's more than fair to say that, in technical documentation you
| intend others to use, having a grammatical error or a missing
| word is confusing and a problem. It's the writing
| equivalent of having a bug in your code. And it's
| definitely not "writing to a target audience" as the
| parent comment suggested. We all make mistakes but don't
| try to call a mistake effective documentation.
| RobertKerans wrote:
| Of course it is, but neither parent nor anyone else is
| saying anything close to the mistake being effective
| documentation. There's a single missing word which needs
| to be added in, but the overall text is clearly writing
| to a target audience. You are aware of this, and of how
| small the mistake is, and you understand what the
| sentence should read as, so I'm not sure what your point
| is?
| theIV wrote:
| My hunch is that this is a typo and it should read "extract
| bits OF content."
| mgdm wrote:
| Exactly this! I'll fix it after work.
| rendall wrote:
| Maybe have the line about "jq" be 2nd. Have the first
| line be a brief description of what it actually does.
| ritchiea wrote:
| I agree and having a missing word in your text often
| leads to confusion :)
|
| Honestly you could drop the "bits" which is a bit
| redundant and use the phrase "Uses CSS selectors to
| extract content from HTML files."
| oauea wrote:
| What's this thing called a "computer" that people keep going on
| about, anyway?
| da_chicken wrote:
| It's a person who does mathematical calculations all day. For
| example, creating range tables for artillery, calculating
| averages or totals of a large range of values, or solving
| complex integrals or differential equations, and so on.
| They're commonly used in industry or government, especially
| in astronomy, aerospace and civil engineering for both
| simulation and analysis. Perhaps the most well-known
| computers were the Harvard Computers, which operated in the
| late 19th and early 20th centuries.
|
| As a job, computers were largely automated out of existence
| by solid-state transistor based automated computers and
| integrated circuit transistor automated computers in the 60s,
| 70s and 80s, which replaced the enormously expensive and
| often largely experimental electro-mechanical automated
| computers while radically reducing cost and improving
| performance both by several orders of magnitude.
| samstave wrote:
| Here - this explains it really succinctly:
|
| https://www.youtube.com/watch?v=lE1bS-Mn2Mk
| theandrewbailey wrote:
| It's like a programmable loom, but for logical and
| mathematical operations.
| samhw wrote:
| You may be interested in the symbol grounding problem
| (https://en.wikipedia.org/wiki/Symbol_grounding_problem#Groun...).
| It's like the binding problem, but for symbols.
| [deleted]
| sundarurfriend wrote:
| Sort of related: [Expecting Short Inferential
| Distances](https://www.readthesequences.com/Expecting-Short-
| Inferential...)
| nextaccountic wrote:
| jq isn't line-oriented, it's JSON-oriented. It operates on a
| stream of JSON values from stdin, so its query is applied to
| each one in sequence.
|
| I would expect htmlq to run the query a single time for a
| single HTML document, just like jQuery's $('#something') or
| document.querySelector('#something').
| zatkin wrote:
| Why not incorporate this into jq itself, like perhaps adding some
| command line arguments to switch to HTML mode?
| Deukhoofd wrote:
| What would the benefits of fitting an HTML parser into a JSON
| parser tool be?
| lmm wrote:
| JQ is not just a parser but a tool for doing operations, many
| of which are (or should be) generic across any tree-like data
| format. Reusing that part across different input formats
| makes a lot of sense.
| mjburgess wrote:
| Well once there's an HTML parser, then a pdf viewer, and then
| everything needed for PDFs (ie., programming, emailer, video
| support, etc.) we'll finally have that ideal operating system
| we've been waiting for.
| mro_name wrote:
| sounds a lot more like blockchain.
| e12e wrote:
| Would probably be more useful to implement html2json, and pipe
| in html?
|
| Ed: eg: https://github.com/Jxck/html2json
| downWidOutaFite wrote:
| Why? I find xpath's syntax much simpler and regular than jq's.
| [deleted]
| pabs3 wrote:
| I tend to reach for XPath selectors before CSS ones when querying
| HTML.
| necovek wrote:
| Nice, I expected something based on XPath (like xpd), but web
| developers dealing with HTML are infinitely more familiar with
| CSS selectors, so a great choice!
| busterarm wrote:
| I want the option to use both, like Nokogiri gives you.
| necovek wrote:
| Sure, that sounds nice, but having two simple tools each
| doing the job well in its own space is perfectly fine for me
| -- do you imagine needing to combine Xpath and CSS queries in
| a single run?
| busterarm wrote:
| I've had to do it when dealing with some poorly-designed
| XML apis in the past. Nokogiri was a godsend.
| rendall wrote:
| What is jq?
| mro_name wrote:
| it's statically linkable rust, isn't it? Awesome. I'm looking for
| a successor to
|
| $ xmllint --html --xpath ...
|
| that doesn't choke on inline svg.
| gigatexal wrote:
| This is very cool. This will make scraping the web even easier!
| elif wrote:
| When I saw the title I thought this was some machine learning-
| specific rmq/0mq message passing tech called HT. Very excited to
| zero.
| m4r35n357 wrote:
| Should be HQ . . .
| pkrumins wrote:
| Call it "hq".
| jhatemyjob wrote:
| Crazy how a 300-line codebase manages to amass 2000 stars on
| Github and 700 upvotes on HN. Amazing ROI.
| gizdan wrote:
| Once upon a time I was using pup[0] for this sort of thing;
| later I changed to cascadia[1], which seemed much more
| advanced.
|
| Comparing the two repos, it seems pup is dead, but cascadia may
| not be.
|
| These tools, including htmlq, seem to sell themselves as "jq
| for html", which is far from the truth. jq is closer to awk,
| where you can do just about everything with JSON. Cascadia,
| htmlq, and pup seem closer to grep for HTML. They can
| essentially only select data from an HTML source.
|
| [0] https://github.com/EricChiang/pup [1]
| https://github.com/suntong/cascadia
| heavyset_go wrote:
| I've used pup for a few projects, but was unaware of cascadia.
| Thanks for pointing it out.
| croon wrote:
| Well, jq is grep _as well_ as sed and awk, but yeah, htmlq
| seems to be just grep, for sake of comparison.
|
| But I don't think html has any need for a sed/awk tool, or at
| least not as much. Json output could very well be piped forward
| to the next CLI tool after you've changed it slightly with jq.
| I don't see this scenario as likely with html.
| gizdan wrote:
| > Well, jq is grep as well as sed and awk, but yeah, htmlq
| seems to be just grep, for sake of comparison.
|
| Exactly, and that is what I mean. If you want to compare,
| compare it with grep, not jq.
|
| Someone else posted xidel[0] in this thread, which I've not
| used, but it seems to be the "jq but for html".
|
| [0] https://github.com/benibela/xidel
| bamdadd wrote:
| is there a brew install command ?
| mcovalt wrote:
| I'd like to see a tool using lol-html [0] and their CSS selector
| API as a streaming HTML editor.
|
| [0] https://github.com/cloudflare/lol-html
| hyperpallium2 wrote:
| From examples, this is only like jq in the sense that the q
| stands for the same thing. Even the way it does that is
| different.
|
| An xmlq that was really like jq would be fun, about 20 years ago.
| cerved wrote:
| I would still like xmlq, there are (regrettably) still a lot of
| applications that store data and configuration in xml
| dotancohen wrote:
| There is `xq` today, which parses XML like `jq`. I think that
| it is relatively unknown because it is part of the `yq` package
| for parsing YAML. So just install `yq` via pip and you'll get
| `xq` as well.
|
| There is also `xmlstarlet` for parsing XML in a similar
| fashion.
| hyperpallium2 wrote:
| xmlstarlet is really nothing like jq, as a language. But yes,
| I use it because it is the best commandline xml processor I'd
| found. That's the only similarity to jq.
|
| Is this the yq? https://kislyuk.github.io/yq/ It does contain
| an 'xq', as a literal wrapper for jq, piping output into it
| after transcoding XML to JSON using xmltodict
| https://github.com/martinblech/xmltodict (which explodes xml
| into separate JSON data structures).
|
| This is a bash one-liner! But TBF it really is a 'jq for
| xml'. I think it would be horrible for some things, but you
| could also do a lot of useful things painlessly.
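| If you're curious what that xmltodict step looks like on its
| own, a rough Python sketch (the XML is made up):
|
|     import json
|     import xmltodict
|
|     # attributes come out prefixed with "@" by default
|     doc = xmltodict.parse("<config><db host='x' port='5432'/></config>")
|     print(json.dumps(doc, indent=2))
|     # {"config": {"db": {"@host": "x", "@port": "5432"}}}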
| dotancohen wrote:
| Thank you for the comments. I've only recently discovered
| both tools, and literally used them once each. Of the two
| `xq` was easier for my particular use case (parsing a
| Magento config) but I keep both tools in my virtual
| toolbox.
|
| If you have any other suggestions for parsing XML for
| exploratory purposes I'm very happy to hear them.
| hyperpallium2 wrote:
| Thanks! Not actually a recommendation, but I have used
| xsltproc (command line xslt), but it is horrible to use
| because xslt syntax is horrible (though xslt's concepts
| are pretty cool). One thing is it enables you to use
| XPath in all its glory.
|
| Just installed xq. It's nice just seeing the pretty-
| printed json output, so thanks for the pointer. Probably
| better than xmlstarlet for my usage, which just queries
| and outputs text, not xml. hmmm, that's probably true for
| most commandline uses...
| jle17 wrote:
| Just looked into this and I think it's worth mentioning that
| there are two different projects called `yq`. The first one
| that came up (written in go instead of python) is not the
| right one and doesn't have the `xq` tool.
| abledon wrote:
| is anyone else using the https://github.com/json-path/JsonPath
| over the jq route?
|
| I hope we standardize on some jq query language, like we have
| with a base set of SQL syntax
| andybak wrote:
| > like jq
|
| "jq is a lightweight and flexible command-line JSON processor"
| chefandy wrote:
| If anyone is looking for a good library to do this in Python,
| PyQuery works well:
|
| https://pythonhosted.org/pyquery/
| teitoklien wrote:
| Maybe call it hq ?
| Simplicitas wrote:
| My thoughts EXACTLY... but anyway, great new utility indeed!
| teitoklien wrote:
| Haha, Indeed its a very good utility :D
| oauea wrote:
| https://jsoup.org/ has been around for a long time and seems a
| bit more mature & maintained than this two-code-files 2-year-old
| repo. Highly recommend.
| avereveard wrote:
| what's wrong with using html tidy + xmllint ?
| mro_name wrote:
| nothing wrong. Searching unmodified html though is sometimes
| preferable.
| soheil wrote:
| I'd use something like this script that you can put together
| yourself:
|
|     #!/usr/bin/env ruby
|     require 'nokogiri'; p Nokogiri::HTML(STDIN.read).css(ARGV[0]).text
|
| Just save it to a file at _/usr/local/bin/hq_ and do _chmod +x !$_
|
| Then you can do:
|
|     curl -s "https://news.ycombinator.com/news" | hq "tr:first-child .storylink"
|
| It uses Nokogiri[0], which is much more battle tested and works
| with CSS and XPath selectors.
|
| [0]
| https://nokogiri.org/tutorials/parsing_an_html_xml_document....
| triska wrote:
| This is very nice!
|
| For reasoning about tree-based data such as HTML, I also highly
| recommend the declarative programming language Prolog. HTML
| documents map naturally to Prolog terms and can be readily
| reasoned about with built-in language mechanisms. For instance,
| here is the sample query from the htmlq README, fetching all
| elements with id _get-help_ from https://www.rust-lang.org, using
| Scryer Prolog and its SGML and HTTP libraries in combination with
| the XPath-inspired query language from library(xpath):
| ?- http_open("https://www.rust-lang.org", Stream, []),
| load_html(stream(Stream), DOM, []), xpath(DOM,
| //(*(@id="get-help")), E).
|
| Yielding:
|
|     E = element(div,[class="flex flex-colum ...",id="get-help"],["\n ",
|         element(h4,[],["Get help!"]),"\n ",element(ul,[],["\n ...",
|         element(li,[],[element(a,[... = ...],[...])]),"\n ...",
|         element(li,[],[...]),...|...]),"\n ...",
|         element(div,[class="la ..."],["\n ...",element(label,[...],[...]),...|...]),
|         "\n ..."]) ;
|     false.
|
| The selector //(*(@id="get-help")) is used to obtain all HTML
| elements whose _id_ attribute is get-help. On backtracking, all
| solutions are reported.
|
| The other example from the README, extracting all _links_ from
| the page, can be obtained with Scryer Prolog like this:
| ?- http_open("https://www.rust-lang.org", Stream, []),
| load_html(stream(Stream), DOM, []), xpath(DOM,
| //a(@href), Link), portray_clause(Link),
| false.
|
| This query uses forced backtracking to write all links on
| standard output, yielding:
|
|     "/".
|     "/tools/install".
|     "/learn".
|     "https://play.rust-lang.org/".
|     "/tools".
|     "/governance".
|     "/community".
|     "https://blog.rust-lang.org/".
|     "/learn/get-started".
|     etc.
| chriswarbo wrote:
| Thanks, that's a rare example of something which is (a) simple
| enough to understand for a Prolog newbie like me, and (b) more
| practical than the ubiquitous family-tree example.
|
| I'm always looking for opportunities to dip my toes into
| Prolog; in hindsight it's clearly a good fit for tree-
| structured data structures.
| samhw wrote:
| Interestingly, the only other context in which I've come
| across Prolog is from friends who studied at Cambridge, here
| in the UK. For some reason, the CS 'tripos' (course) there is
| really heavily focussed on Prolog, and everyone I know from
| there ended up a huge fan of the language. I'm not sure why
| that's the case, though, given that almost all other
| universities seem to use more common languages (Java, C++,
| etc).
| zimpenfish wrote:
| cs.man.ac.uk, at least back in 1992, had a compulsory
| Prolog module in the first year. Don't know anyone from
| then who didn't hate that module with a burning passion.
|
| (There was no Java, C++, etc. either. It was SML, Pascal,
| 68000, and Oracle Pascal-Embedded-SQL.)
| ramses0 wrote:
| "Prolog as a library" => Given "functional" constraints =>
| $CONSTRAINTS.prolog( "query..." ) => results
|
| ...many languages (similar to regex / state-machine) can
| benefit greatly from offloading a portion to something
| prolog-ish, but it's unfortunate that prolog knowledge
| isn't as widely distributed.
| WickyNilliams wrote:
| I studied CS at a different university in UK and we used
| Prolog for one module on AI or perhaps machine vision. I
| really enjoyed working with it. This was 15 years ago.
| Looking through their current curriculum I can't see prolog
| being mentioned anymore. Shame!
| pandatigox wrote:
| I tried to run this on my computer now, but as a complete
| Prolog noob, I'm having errors running the script? How do you
| load the http_open module/library in the first place? I tried
| following some Prolog tutorials in the past but I always get
| stuck trying to run something in the REPL. I'm using scryer-
| prolog. Thanks in advance!
| triska wrote:
| The libraries I mentioned can be loaded by invoking the
| use_module/1 predicate on the toplevel, here is the complete
| transcript that loads the SGML, HTTP and XPath libraries in
| Scryer Prolog:
|
|     ?- use_module(library(sgml)).
|        true.
|     ?- use_module(library(http/http_open)).
|        true.
|     ?- use_module(library(xpath)).
|        true.
|
| The second query also uses portray_clause/1 from
| library(format), which you can load with:
|
|     ?- use_module(library(format)).
|        true.
|
| After all these libraries are loaded, you can post the sample
| queries from above, and it should work.
|
| There are also other ways to load these libraries: A very
| common way to load a library is to use the use_module/1
| _directive_ in Prolog source files. In that case, you would
| put for example the following 4 directives in a Prolog source
| file, say sample.pl:
|
|     :- use_module(library(sgml)).
|     :- use_module(library(http/http_open)).
|     :- use_module(library(xpath)).
|     :- use_module(library(format)).
|
| And then run sample.pl with:
|
|     $ scryer-prolog sample.pl
|
| You can then again post the goals from above on the toplevel,
| and it will work too.
|
| Another way is to put these directives in your ~/.scryerrc
| configuration file, which is automatically consulted when
| Scryer Prolog starts. I recommend to do this for libraries
| you frequently need. Common candidates for this are for
| example library(dcgs), library(lists) and library(reif).
|
| Personally, I start Scryer Prolog from within Emacs, and I
| have set up Emacs so that I can consult a buffer with Prolog
| code, and also post queries and interact with the Prolog
| toplevel from within Emacs.
| pandatigox wrote:
| Wow that works fantastically! Thank you for that. It almost
| seems like magic.
| okasaki wrote:
| It's pretty easy in Python too, eg.:
|
|     >>> soup = BeautifulSoup(requests.get("https://www.rust-lang.org").text)
|     >>> [x["href"] for x in soup.find_all("a")]
|     ['/', '/tools/install', '/learn', 'https://play.rust-lang.org/',
|      '/tools', '/governance', '/community', 'https://blog.rust-lang.org/',...
| triska wrote:
| In a certain sense (for example, when measuring brevity), it
| is indeed easy to write this example in Python. However, the
| Python version also illustrates that many different language
| constructs are needed to express the intended functionality.
| In comparison to Prolog, Python is a quite complex language
| with many different language constructs, including loops,
| objects, methods, assignment, dictionaries etc. all of which
| are used in this example.
|
| As I see it, a key attraction of Prolog is its simplicity:
| With a single language construct (Horn clauses), you are able
| to express all known computations, and the example queries I
| posted show that only a single language element, namely again
| Horn clauses to express a query, is needed to run the code.
| The Prolog query, and also every Prolog clause, is itself a
| Prolog term and can be inspected with built-in mechanisms.
|
| As a consequence, an immediate benefit of using Prolog for
| such use cases is that you can easily reason about user-
| specified queries in your applications, and for example
| easily allow only a safe subset of code to be run by users,
| or execute a user-specified query with different execution
| strategies etc. In comparison, Python code is much harder to
| analyze and restrict to a particular subset due to the
| language's comparatively high syntactic complexity.
| the_jeremy wrote:
| The benefit of Python is that developers already know about
| these language constructs, and that more developers know
| Python than Prolog.
| lostcolony wrote:
| I don't think the op's point was "how easy it would be to
| hire developers", or even "taking all the considerations
| a business is under, I feel Prolog makes sense". He was
| just touting how easy Prolog's built in pattern matching
| and declarative style makes implementing and using
| selectors at a language level.
|
| Honestly, if we didn't talk about the benefits of a
| language irrespective of how easy it is to hire for it,
| we'd never have introduced anything beyond FORTRAN, if we
| even made it that far. Bringing "X is easier to hire for"
| into a conversation about the language is, at best, a
| non-sequitur.
| notriddle wrote:
| We might have been better off that way. FORTRAN does have
| its downsides, but language churn itself has downsides
| that almost always outweigh the assumed upsides of a
| better language.
|
| If we had just stuck with FORTRAN forever, how many
| problems would have been completely avoided!? There'd be
| better, and more, IDEs, since even if the language is
| hard to parse, it's still just one parser that needs all
| the effort. So many unfortunate problems in education
| caused by language and ecosystem churn would have been
| avoided (the infamous "by the time you graduate, it's
| always outdated" problem).
|
| The only problem is that FORTRAN is too new. Should've
| stuck with the Hollerith tabulator.
| jfmc wrote:
| AFAIK, this was first proposed and implemented in Ciao Prolog
| back in late 90s (modern versions here: https://ciao-
| lang.org/ciao/build/doc/ciao.html/html.html). It was way before
| Python was popular and JavaScript ever existed.
| parhamn wrote:
| I've been looking for a library that can find the best set of
| selectors to most consistently find the element you're looking
| for in a page.
|
| Any pointers to something that exists? Interestingly, I've also
| found very little for DOM extraction in the open-source ML
| space.
___________________________________________________________________
(page generated 2021-09-07 23:01 UTC) |