
[HN Gopher] Show HN: Paper to HTML Converter
___________________________________________________________________
Show HN: Paper to HTML Converter
Author : codeviking
Score  : 48 points
Date   : 2021-09-15 19:01 UTC (3 hours ago)
	web link (papertohtml.org)

| p4bl0 wrote:
| I tried that a few days ago with one of my papers (a PDF
| generated using pdflatex) and it didn't work that well: the text
| was fine but some section titles were off, and all of the math
| and code parts were broken.
|
| But clearly it is a nice idea and I can't wait for such tools
| to work better!

| codeviking wrote:
| > all of the math and code parts were broken.
|
| Yup, this is a known issue that we're working towards fixing.
|
| > But clearly it is a nice idea and I can't wait for such
| tools to work better!
|
| Glad to hear it!

| jimmySixDOF wrote:
| I am so amazed at the work you guys are doing at AI2 & the
| Semantic Scholar project. You guys are really fixing a broken
| system of research and discovery which suffers from organization
| design principles based on university library index card filing
| cabinets as magnified by the exponential content growth.
|
| Can't wait to see what people do with this . . . .

| codeviking wrote:
| Thanks!
|
| There's a lot of amazing people here, doing really great work.
| It's a really inspiring place to be. I feel really lucky to
| work with such great people on interesting, important problems.
|
| Also, I should mention...we're hiring!
|
| https://allenai.org/careers#current-openings

| codeviking wrote:
| Hi all,
|
| I'm one of the engineers at AI2 that helped make this happen.
| We're excited about this for several reasons, which I'll explain
| below.
|
| Most academic papers are currently inaccessible. This means, for
| instance, that researchers who are vision impaired can't access
| that research. Not only is this unfair, but it probably prevents
| breakthroughs from happening by limiting opportunities for
| collaboration.
|
| We think this is partly due to the fact that the PDF format isn't
| easy to work with, and is therefore hard to make accessible. HTML,
| on the other hand, has benefited from years of open contributions.
| There's a lot of accessibility affordances, and they're well
| documented and easy to add. In fact, our hope long-term is to use
| ML to make papers more accessible without (much) effort on the
| author's part.
|
| We're also excited about distributing papers in their HTML form
| as we think it'll allow us to greatly improve the UX of reading
| papers. We think papers should be easy to read regardless of the
| device you're on, and want to provide interactive, ML-provided
| enhancements to the reading experience like those provided via
| the Semantic Reader.
|
| We're eager to hear what you think, and happy to answer
| questions.

| isaacimagine wrote:
| Looks great! Have you considered linking this up to something
| like arxiv or other preprint sites?

| codeviking wrote:
| Yup, we're definitely thinking about this.
|
| Our focus right now is on providing a tool folks can run
| on whatever papers they have access to. For instance, some
| researchers might have access to documents that aren't
| available to the public. We want them to be able to run this
| against those.
|
| That said, as we expand the effort I imagine we'll eventually
| pre-convert things that are publicly available, like those on
| ArXiv, etc.

| politelemon wrote:
| I've never actually questioned the why, so maybe you could
| shine some light... why are they usually published as PDFs?

| kartoshechka wrote:
| Unfortunately for my mental health my thesis was exactly
| about converting arxiv papers to modern looking html, and
| there are so many more broken, unjust and ugly things in
| academia than using pdfs...
|
| Regarding your question, I'd say that it is a natural
| continuation of the centuries-long tradition of writing on
| actual paper. The invention of TeX actually made it easier to
| produce more papers, then came PDF, and you could produce
| virtual papers.
| Also, science journals pretty much have a
| monopoly on scientific knowledge distribution, and they are
| mostly paper too.

| DoreenMichele wrote:
| I have no idea at all but as a wild guess, I would assume
| it's because you can't edit PDFs. So you know it says the
| same thing forever and no one went and changed it in response
| to reading criticism of their paper or something.

| codeviking wrote:
| Y'know, that's a good question. I'm not sure I know the
| answer.
|
| My guess is it's largely for historical reasons. At the time
| most venues were organized, PDF was probably the best (or
| only) mechanism for sharing documents for print distribution.
|
| But we think it's time to change that :).

| ephbit wrote:
| I always assumed the main reason for using PDFs is that an
| author/distributor can be pretty sure that they're rendered
| almost exactly the same (fonts, layout) no matter which
| viewer they're viewed with.
|
| This probably evokes some kind of sense of authenticity. Like
| some physical paper document it has exactly one appearance.

| temp8964 wrote:
| What alternative do you have? Word file?
|
| PDF is the only widely supported format that can guarantee
| accurate reprints.

| miohtama wrote:
| Are papers printed anymore?
|
| HTML for text.
|
| SVGs for diagrams.
|
| Equations can be exported as images if needed.

| kahon65 wrote:
| Do you remove the pdf files we send to your servers?
|
| Edit: https://allenai.org/terms point 5, you own all the
| uploads! So if by mistake we send a medical PDF for example or
| something else that is under GDPR, we can't ask you to delete
| it???? Wtfffff

| nanis wrote:
| This seems to be pdf2tohtml combined with GROBID[1].
|
| It seems to me the masheen learningz technikz boil down to a
| generalization of my lightbulb moment here[2].
|
| [1]: https://grobid.readthedocs.io/en/latest/
|
| [2]: https://www.nu42.com/2014/09/scraping-pdf-documents-without-...

| codeviking wrote:
| Yup, right now we use GROBID, do some post processing and
| combine the output with other extraction techniques. For
| instance, we use a model to extract document figures[1], so
| that we can render them in the resulting HTML document.
|
| Also, we're working hard on a new extraction mechanism that
| should allow us to replace GROBID [2].
|
| There's a lot of really smart people at AI2 working on this,
| I'm excited to see the resulting improvements and the cool
| things (like this) that we build with the results!
|
| [1]: https://api.semanticscholar.org/CorpusID:4698432
|
| [2]: https://api.semanticscholar.org/CorpusID:235265639

| kartoshechka wrote:
| Looks exactly like the type of crunch work ML would do, but have
| you considered using brute force converters like latexml or
| pandoc where appropriate?

| chrisMyzel wrote:
| This is amazing! Will make my (offline-only) Kindle finally
| display scientific papers. Took a random link from arxiv and it
| worked like a charm, including TOC. Will this be OS'ed?

| mintplant wrote:
| See also KOReader [0], if jailbreaking is an option for you.
| The built-in column splitter works pretty well for the papers
| I've used it to read.
|
| [0] https://github.com/koreader/koreader

| chrisMyzel wrote:
| (HTML->Mobi is totally possible)

| kartoshechka wrote:
| You may check out https://arxiv-vanity.com as well. It's OS;
| conversion rates are close to 70% on random arxiv papers if
| I'm not mistaken, but it can hardly be called stable.

| codeviking wrote:
| Yay, glad to hear it! If you end up viewing one of these on
| your Kindle, let us know how well (or not) things work.
|
| We're not sure if it's something that we can distribute as OSS
| just yet. It relies on a few internal libraries that would also
| need to be publicly released, so it's not as simple as adjusting
| a single repository's visibility.

| oolonthegreat wrote:
| cool project, though the name was confusing for me: I believe to
| most people "paper" first means actual paper, so I thought this
| was some kind of OCR system converting printed material to html?

| codeviking wrote:
| Thanks for the feedback. There's two hard problems n' all
| that... :)

| gregsadetsky wrote:
| Great site, congrats!
|
| One comment is that the slowest page to load was the Gallery [0]
| as it loads an ungodly amount of PNG files from what appears to
| be a single IP (a GCP Compute instance?)
|
| I see 421 requests and 150 Mb loaded. As it seems to be mostly
| thumbnails, have you considered using jpegs instead of pngs,
| potentially use lazy loading (i.e. not load images outside of the
| viewport) and potentially use GCP's (or another provider) CDN
| offering?
|
| Once I clicked a thumbnail, loading the article itself (for
| example [1]) was quite breezy.
|
| The gallery is a great showcase of what your site does -- I think
| that it'd be worth making it snappier :-)
|
| Cheers and congrats again
|
| P.S. Also, the paper linked below [1] seems to have a few
| conversion problems -- I see "EQUATION (1): Not extracted; please
| refer to original document", and also some (formula? Greek?)
| characters that seem out of place after the words "and the next
| token is generated by sampling"
|
| [0] https://papertohtml.org/gallery
|
| [1]
| https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba...

| codeviking wrote:
| > One comment is that the slowest page to load was the Gallery
| [0] as it loads an ungodly amount of PNG files from what
| appears to be a single IP (a GCP Compute instance?)
|
| Yup. There's no CDN or anything like that right now. We kept
| things simple to get this out the door. But we definitely
| intend to make improvements like this as we improve the tool.
|
| The more adoption we see, the more it motivates these types of
| fixes!
|
| > P.S.
| Also, the paper linked below [1] seems to have a few
| conversion problems -- I see "EQUATION (1): Not extracted;
| please refer to original document", and also some (formula?
| Greek?) characters that seem out of place after the words "and
| the next token is generated by sampling"
|
| Thanks for the catch. As you noted there's still a fair number
| of extraction errors for us to correct!

| mintplant wrote:
| Another sample paper that caused some trouble with figure
| extraction: https://www.cs.utexas.edu/~hovav/dist/vera.pdf
|
| Very cool project, looking forward to seeing how it develops!

| codeviking wrote:
| Thanks, I'll pass this example along!
___________________________________________________________________
(page generated 2021-09-15 23:00 UTC)