[HN Gopher] Show HN: Paper to HTML Converter
___________________________________________________________________
 
Show HN: Paper to HTML Converter
 
Author : codeviking
Score  : 48 points
Date   : 2021-09-15 19:01 UTC (3 hours ago)
 
web link (papertohtml.org)
 
| p4bl0 wrote:
| I tried that a few days ago with one of my papers (a PDF
| generated using pdflatex) and it didn't work that well: the text
| was fine but some section titles were off, and all of the math
| and code parts were broken.
| 
| But clearly it is a nice idea and I can't wait for such tools to
| work better!
 
  | codeviking wrote:
  | > all of the math and code parts were broken.
  | 
  | Yup, this is a known issue that we're working towards fixing.
  | 
  | > But clearly it is a nice idea and I can't wait for such
  | tools to work better!
  | 
  | Glad to hear it!
 
| jimmySixDOF wrote:
| I am so amazed at the work you guys are doing at AI2 & the
| Semantic Scholar project. You guys are really fixing a broken
| system of research and discovery which suffers from organization
| design principles based on university library index card filing
| cabinets as magnified by the exponential content growth.
| 
| Can't wait to see what people do with this...
 
  | codeviking wrote:
  | Thanks!
  | 
  | There are a lot of amazing people here, doing really great
  | work. It's a really inspiring place to be. I feel really lucky
  | to work with such great people on interesting, important
  | problems.
  | 
  | Also, I should mention...we're hiring!
  | 
  | https://allenai.org/careers#current-openings
 
| codeviking wrote:
| Hi all,
| 
| I'm one of the engineers at AI2 who helped make this happen.
| We're excited about this for several reasons, which I'll explain
| below.
| 
| Most academic papers are currently inaccessible to readers who
| rely on assistive technology. This means, for instance, that
| researchers who are vision impaired can't access
| that research. Not only is this unfair, but it probably prevents
| breakthroughs from happening by limiting opportunities for
| collaboration.
| 
| We think this is partly because the PDF format isn't easy to
| work with, and is therefore hard to make accessible. HTML, on
| the other hand, has benefited from years of open contributions.
| There are a lot of accessibility affordances, and they're well
| documented and easy to add. In fact, our hope long-term is to use
| ML to make papers more accessible without (much) effort on the
| author's part.
| 
| We're also excited about distributing papers in their HTML form
| as we think it'll allow us to greatly improve the UX of reading
| papers. We think papers should be easy to read regardless of the
| device you're on, and want to provide interactive, ML-provided
| enhancements to the reading experience like those provided via
| the Semantic Reader.
| 
| We're eager to hear what you think, and happy to answer
| questions.
 
  | isaacimagine wrote:
  | Looks great! Have you considered linking this up to something
  | like arxiv or other preprint sites?
 
    | codeviking wrote:
    | Yup, we're definitely thinking about this.
    | 
    | Our focus right now is on providing a tool folks can run on
    | whatever papers they have access to. For instance, some
    | researchers might have access to documents that aren't
    | available to the public. We want them to be able to run this
    | against those.
    | 
    | That said, as we expand the effort I imagine we'll eventually
    | pre-convert things that are publicly available, like those on
    | arXiv, etc.
 
  | politelemon wrote:
  | I've never actually questioned the why, so maybe you could
  | shine some light... why are they usually published as PDFs?
 
    | kartoshechka wrote:
    | Unfortunately for my mental health, my thesis was exactly
    | about converting arXiv papers to modern-looking HTML, and
    | there are so many more broken, unjust and ugly things in
    | academia than using PDFs...
    | 
    | Regarding your question, I'd say that it is a natural
    | continuation of the centuries-long tradition of writing on
    | actual paper. The invention of TeX actually made it easier to
    | produce more papers, then came PDF, and you could produce
    | virtual papers. Also, science journals pretty much have a
    | monopoly on scientific knowledge distribution, and they are
    | mostly paper too.
 
    | DoreenMichele wrote:
    | I have no idea at all but as a wild guess, I would assume
    | it's because you can't edit PDFs. So you know it says the
    | same thing forever and no one went and changed it in response
    | to reading criticism of their paper or something.
 
    | codeviking wrote:
    | Y'know, that's a good question. I'm not sure I know the
    | answer.
    | 
    | My guess is it's largely for historical reasons. At the time
    | most venues were organized, PDF was probably the best (or
    | only) mechanism for sharing documents for print distribution.
    | 
    | But we think it's time to change that :).
 
    | ephbit wrote:
    | I always assumed the main reason for using PDFs is that an
    | author/distributor can be pretty sure that they're rendered
    | almost exactly the same (fonts, layout) no matter which
    | viewer is used.
    | 
    | This probably evokes some kind of sense of authenticity. Like
    | a physical paper document, it has exactly one appearance.
 
    | temp8964 wrote:
    | What alternative do you have? Word file?
    | 
    | PDF is the only widely supported format that can guarantee
    | an accurate reprint.
 
      | miohtama wrote:
      | Are papers printed anymore?
      | 
      | HTML for text.
      | 
      | SVGs for diagrams.
      | 
      | Equations can be exported as images if needed.
 
  | kahon65 wrote:
  | Do you remove the PDF files we send to your servers?
  | 
  | Edit: per https://allenai.org/terms point 5, you own all the
  | uploads! So if by mistake we send a medical PDF, for example,
  | or something else that is covered by GDPR, we can't ask you to
  | delete it? WTF
 
| nanis wrote:
| This seems to be pdf2tohtml combined with GROBID[1].
| 
| It seems to me the masheen learningz technikz boil down to a
| generalization of my lightbulb moment here[2].
| 
| [1]: https://grobid.readthedocs.io/en/latest/
| 
| [2]: https://www.nu42.com/2014/09/scraping-pdf-documents-without-...
 
  | codeviking wrote:
  | Yup, right now we use GROBID, do some post-processing and
  | combine the output with other extraction techniques. For
  | instance, we use a model to extract document figures[1], so
  | that we can render them in the resulting HTML document.
  | 
  | Also, we're working hard on a new extraction mechanism that
  | should allow us to replace GROBID [2].
  | 
  | There are a lot of really smart people at AI2 working on this.
  | I'm excited to see the resulting improvements and the cool
  | things (like this) that we build with the results!
  | 
  | [1]: https://api.semanticscholar.org/CorpusID:4698432
  | 
  | [2]: https://api.semanticscholar.org/CorpusID:235265639
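
GROBID emits TEI XML, so the post-processing step mentioned above
can be pictured as a walk over that tree. Below is a minimal Python
sketch under stated assumptions: SAMPLE_TEI is a hand-written
fragment standing in for real GROBID output, and tei_to_html is a
hypothetical helper, not the code papertohtml.org actually runs.

```python
import html
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID's XML output.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written TEI fragment (an assumption for illustration).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head><p>PDF is hard to make accessible.</p></div>
    <div><head>Method</head><p>We extract text &amp; figures.</p></div>
  </body></text>
</TEI>"""


def tei_to_html(tei_xml: str) -> str:
    """Turn each TEI body <div> into an <h2> heading plus <p> paragraphs."""
    root = ET.fromstring(tei_xml)
    parts = []
    for div in root.findall(".//tei:body/tei:div", TEI):
        head = div.find("tei:head", TEI)
        if head is not None:
            # itertext() flattens any nested inline markup into plain text.
            parts.append(f"<h2>{html.escape(''.join(head.itertext()))}</h2>")
        for p in div.findall("tei:p", TEI):
            parts.append(f"<p>{html.escape(''.join(p.itertext()))}</p>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(tei_to_html(SAMPLE_TEI))
```

Real output would also need figures, equations and references
handled, which is where the extra extraction models come in.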
 
| kartoshechka wrote:
| Looks like exactly the type of crunch work ML would do, but have
| you considered using brute-force converters like LaTeXML or
| pandoc where appropriate?
 
| chrisMyzel wrote:
| This is amazing! Will make my (offline-only) Kindle finally
| display scientific papers. Took a random link from arXiv and it
| worked like a charm, including TOC. Will this be open-sourced?
 
  | mintplant wrote:
  | See also KOReader [0], if jailbreaking is an option for you.
  | The built-in column splitter works pretty well for the papers
  | I've used it to read.
  | 
  | [0] https://github.com/koreader/koreader
 
  | chrisMyzel wrote:
  | (HTML->Mobi is totally possible)
 
  | kartoshechka wrote:
  | You may check out https://arxiv-vanity.com as well. It's open
  | source; conversion rates are close to 70% on random arXiv
  | papers, if I'm not mistaken, but it can hardly be called stable.
 
  | codeviking wrote:
  | Yay, glad to hear it! If you end up viewing one of these on
  | your Kindle, let us know how well (or not) things work.
  | 
  | We're not sure if it's something that we can distribute as OSS
  | just yet. It relies on a few internal libraries that would also
  | need to be publicly released, so it's not as simple as adjusting a
  | single repository's visibility.
 
| oolonthegreat wrote:
| Cool project, though the name was confusing for me: I believe to
| most people "paper" first means actual paper, so I thought this
| was some kind of OCR system converting printed material to HTML.
 
  | codeviking wrote:
  | Thanks for the feedback. There's two hard problems n' all
  | that... :)
 
| gregsadetsky wrote:
| Great site, congrats!
| 
| One comment is that the slowest page to load was the Gallery [0]
| as it loads an ungodly amount of PNG files from what appears to
| be a single IP (a GCP Compute instance?)
| 
| I see 421 requests and 150 MB loaded. As it seems to be mostly
| thumbnails, have you considered using JPEGs instead of PNGs,
| potentially using lazy loading (i.e. not loading images outside
| of the viewport) and potentially using GCP's (or another
| provider's) CDN offering?
| 
| Once I clicked a thumbnail, loading the article itself (for
| example [1]) was quite breezy.
| 
| The gallery is a great showcase of what your site does -- I think
| that it'd be worth making it snappier :-)
| 
| Cheers and congrats again
| 
| P.S. Also, the paper linked below [1] seems to have a few
| conversion problems -- I see "EQUATION (1): Not extracted; please
| refer to original document", and also some (formula? Greek?)
| characters that seem out of place after the words "and the next
| token is generated by sampling"
| 
| [0] https://papertohtml.org/gallery
| 
| [1]
| https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba...
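
The lazy-loading suggestion above can be sketched with the standard
HTML loading="lazy" attribute, which asks the browser to defer
fetching an image until it nears the viewport. This is a minimal
Python sketch: gallery_html and the thumbnail paths are hypothetical
and say nothing about how the real gallery page is built.

```python
import html


def gallery_html(thumbnails):
    """Build one <img> tag per (url, alt) pair, each marked loading="lazy"."""
    tags = []
    for url, alt in thumbnails:
        tags.append(
            f'<img src="{html.escape(url, quote=True)}" '
            f'alt="{html.escape(alt, quote=True)}" loading="lazy">'
        )
    return "\n".join(tags)


if __name__ == "__main__":
    # Hypothetical thumbnail paths, for illustration only.
    print(gallery_html([
        ("/thumbs/paper-1.jpg", "First page of paper 1"),
        ("/thumbs/paper-2.jpg", "First page of paper 2"),
    ]))
```

With this markup only the thumbnails near the viewport are requested
up front; the rest load as the reader scrolls.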
 
  | codeviking wrote:
  | > One comment is that the slowest page to load was the Gallery
  | [0] as it loads an ungodly amount of PNG files from what
  | appears to be a single IP (a GCP Compute instance?)
  | 
  | Yup. There's no CDN or anything like that right now. We kept
  | things simple to get this out the door. But we definitely
  | intend to make improvements like this as we improve the tool.
  | 
  | The more adoption we see, the more it motivates these types of
  | fixes!
  | 
  | > P.S. Also, the paper linked below [1] seems to have a few
  | conversion problems -- I see "EQUATION (1): Not extracted;
  | please refer to original document", and also some (formula?
  | Greek?) characters that seem out of place after the words "and
  | the next token is generated by sampling"
  | 
  | Thanks for the catch. As you noted, there's still a fair number
  | of extraction errors for us to correct!
 
    | mintplant wrote:
    | Another sample paper that caused some trouble with figure
    | extraction: https://www.cs.utexas.edu/~hovav/dist/vera.pdf
    | 
    | Very cool project, looking forward to seeing how it develops!
 
      | codeviking wrote:
      | Thanks, I'll pass this example along!
 
___________________________________________________________________
(page generated 2021-09-15 23:00 UTC)