|
| p4bl0 wrote:
| I tried that a few days ago with one of my papers (a PDF
| generated using pdflatex) and it didn't work that well: the text
| was fine but some section titles were off, and all of the math
| and code parts were broken.
|
| But clearly it is a nice idea and I can't wait for such tools to
| work better!
| codeviking wrote:
| > all of the math and code parts were broken.
|
| Yup, this is a known issue that we're working towards fixing.
|
| > But clearly it is a nice idea and I can't wait for such
| tools to work better!
|
| Glad to hear it!
| jimmySixDOF wrote:
| I am so amazed at the work you guys are doing at AI2 & the
| Semantic Scholar project. You guys are really fixing a broken
| system of research and discovery which suffers from organization
| design principles based on university library index card filing
| cabinets as magnified by the exponential content growth.
|
| Can't wait to see what people do with this...
| codeviking wrote:
| Thanks!
|
| There are a lot of amazing people here, doing really great work.
| It's a really inspiring place to be. I feel really lucky to
| work with such great people on interesting, important problems.
|
| Also, I should mention...we're hiring!
|
| https://allenai.org/careers#current-openings
| codeviking wrote:
| Hi all,
|
| I'm one of the engineers at AI2 that helped make this happen.
| We're excited about this for several reasons, which I'll explain
| below.
|
| Most academic papers are currently inaccessible. This means, for
| instance, that researchers who are vision impaired can't access
| that research. Not only is this unfair, but it probably prevents
| breakthroughs from happening by limiting opportunities for
| collaboration.
|
| We think this is partly because the PDF format isn't easy to work
| with, and is therefore hard to make accessible. HTML, on the
| other hand, has benefited from years of open contributions.
| There are a lot of accessibility affordances, and they're well
| documented and easy to add. In fact, our hope long-term is to use
| ML to make papers more accessible without (much) effort on the
| author's part.
|
| We're also excited about distributing papers in their HTML form
| as we think it'll allow us to greatly improve the UX of reading
| papers. We think papers should be easy to read regardless of the
| device you're on, and want to provide interactive, ML-provided
| enhancements to the reading experience like those provided via
| the Semantic Reader.
|
| We're eager to hear what you think, and happy to answer
| questions.
| isaacimagine wrote:
| Looks great! Have you considered linking this up to something
| like arXiv or other preprint sites?
| codeviking wrote:
| Yup, we're definitely thinking about this.
|
| Our focus right now is on providing a tool folks can run
| on whatever papers they have access to. For instance, some
| researchers might have access to documents that aren't
| available to the public. We want them to be able to run this
| against those.
|
| That said, as we expand the effort, I imagine we'll eventually
| pre-convert things that are publicly available, like those on
| arXiv, etc.
| politelemon wrote:
| I've never actually questioned the why, so maybe you could
| shine some light... why are they usually published as PDFs?
| kartoshechka wrote:
| Unfortunately for my mental health, my thesis was exactly
| about converting arXiv papers to modern-looking HTML, and
| there are so many more broken, unjust and ugly things in
| academia than using PDFs...
|
| Regarding your question, I'd say that it is a natural
| continuation of the centuries-long tradition of writing on
| actual paper. The invention of TeX made it easier to
| produce more papers, then came PDF, and you could produce
| virtual papers. Also, science journals pretty much have a
| monopoly on scientific knowledge distribution, and they are
| mostly paper too.
| DoreenMichele wrote:
| I have no idea at all but as a wild guess, I would assume
| it's because you can't edit PDFs. So you know it says the
| same thing forever and no one went and changed it in response
| to reading criticism of their paper or something.
| codeviking wrote:
| Y'know, that's a good question. I'm not sure I know the
| answer.
|
| My guess is it's largely for historical reasons. At the time
| most venues were organized, PDF was probably the best (or
| only) mechanism for sharing documents for print distribution.
|
| But we think it's time to change that :).
| ephbit wrote:
| I always assumed the main reason for using PDFs is that an
| author/distributor can be pretty sure that they're rendered
| almost exactly the same (fonts, layout) no matter which
| viewer they're viewed with.
|
| This probably evokes some kind of sense of authenticity. Like
| a physical paper document, it has exactly one appearance.
| temp8964 wrote:
| What alternative do you have? Word file?
|
| PDF is the only widely supported format that can guarantee
| accurate reprints.
| miohtama wrote:
| Are papers printed anymore?
|
| HTML for text.
|
| SVGs for diagrams.
|
| Equations can be exported as images if needed.
| kahon65 wrote:
| Do you remove the pdf files we send to your servers?
|
| Edit: per https://allenai.org/terms point 5, you own all the
| uploads! So if by mistake we send a medical PDF, for example, or
| something else that is under GDPR, we can't ask you to delete
| it? WTF
| nanis wrote:
| This seems to be pdf2tohtml combined with GROBID[1].
|
| It seems to me the masheen learningz technikz boil down to a
| generalization of my lightbulb moment here[2].
|
| [1]: https://grobid.readthedocs.io/en/latest/
|
| [2]: https://www.nu42.com/2014/09/scraping-pdf-documents-without-...
| codeviking wrote:
| Yup, right now we use GROBID, do some post-processing, and
| combine the output with other extraction techniques. For
| instance, we use a model to extract document figures[1], so
| that we can render them in the resulting HTML document.
|
| Also, we're working hard on a new extraction mechanism that
| should allow us to replace GROBID [2].
|
| There are a lot of really smart people at AI2 working on this.
| I'm excited to see the resulting improvements and the cool
| things (like this) that we build with the results!
|
| [1]: https://api.semanticscholar.org/CorpusID:4698432
|
| [2]: https://api.semanticscholar.org/CorpusID:235265639
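| For anyone curious what the "post-processing" step might look
| like: GROBID emits TEI XML, so one small piece is walking that
| XML and emitting HTML. Below is a minimal sketch, not AI2's
| actual pipeline. SAMPLE_TEI is a hand-written fragment (an
| assumption, not real GROBID output) and tei_to_html is a
| hypothetical helper name; only the TEI namespace URI is standard.

```python
import xml.etree.ElementTree as ET
from html import escape

# Standard TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written stand-in for extractor output (assumption for illustration).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head><p>PDFs are hard to read on phones.</p></div>
    <div><head>Method</head><p>We extract text &amp; figures.</p></div>
  </body></text>
</TEI>"""


def tei_to_html(tei_xml):
    """Turn each TEI <div> into an <h2> heading followed by <p> paragraphs."""
    root = ET.fromstring(tei_xml)
    parts = []
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        if head is not None:
            title = "".join(head.itertext()).strip()
            parts.append(f"<h2>{escape(title)}</h2>")
        for para in div.findall("tei:p", TEI_NS):
            text = "".join(para.itertext()).strip()
            parts.append(f"<p>{escape(text)}</p>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(tei_to_html(SAMPLE_TEI))
```

| Figures, equations and references would each need their own
| handling on top of this, which is presumably where the other
| extraction techniques mentioned above come in.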
| kartoshechka wrote:
| Looks like exactly the type of crunch work ML would do, but have
| you considered using brute-force converters like latexml or
| pandoc where appropriate?
| chrisMyzel wrote:
| This is amazing! Will make my (offline-only) Kindle finally
| display scientific papers. Took a random link off arXiv and it
| worked like a charm, including the TOC. Will this be open-sourced?
| mintplant wrote:
| See also KOReader [0], if jailbreaking is an option for you.
| The built-in column splitter works pretty well for the papers
| I've used it to read.
|
| [0] https://github.com/koreader/koreader
| chrisMyzel wrote:
| (HTML->Mobi is totally possible)
| kartoshechka wrote:
| You may check out https://arxiv-vanity.com as well. It's open
| source; conversion rates are close to 70% on random arXiv papers,
| if I'm not mistaken, but it can hardly be called stable.
| codeviking wrote:
| Yay, glad to hear it! If you end up viewing one of these on
| your Kindle, let us know how well (or not) things work.
|
| We're not sure if it's something that we can distribute as OSS
| just yet. It relies on a few internal libraries that would also
| need to be publicly released, so it's not as simple as adjusting a
| single repository's visibility.
| oolonthegreat wrote:
| Cool project, though the name was confusing for me: I believe to
| most people "paper" first means actual paper, so I thought this
| was some kind of OCR system converting printed material to HTML.
| codeviking wrote:
| Thanks for the feedback. There's two hard problems n' all
| that... :)
| gregsadetsky wrote:
| Great site, congrats!
|
| One comment is that the slowest page to load was the Gallery [0]
| as it loads an ungodly amount of PNG files from what appears to
| be a single IP (a GCP Compute instance?)
|
| I see 421 requests and 150 MB loaded. As it seems to be mostly
| thumbnails, have you considered using JPEGs instead of PNGs,
| potentially using lazy loading (i.e. not loading images outside
| of the viewport) and potentially using GCP's (or another
| provider's) CDN offering?
|
| Once I clicked a thumbnail, loading the article itself (for
| example [1]) was quite breezy.
|
| The gallery is a great showcase of what your site does -- I think
| that it'd be worth making it snappier :-)
|
| Cheers and congrats again
|
| P.S. Also, the paper linked below [1] seems to have a few
| conversion problems -- I see "EQUATION (1): Not extracted; please
| refer to original document", and also some (formula? Greek?)
| characters that seem out of place after the words "and the next
| token is generated by sampling"
|
| [0] https://papertohtml.org/gallery
|
| [1] https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba...
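| The lazy-loading suggestion above can be sketched in a few
| lines. This is only an illustration, assuming the gallery is
| rendered server-side from a list of thumbnails; render_gallery
| and eager_count are hypothetical names, and nothing here
| reflects how papertohtml.org is actually built. The loading and
| decoding attributes are standard HTML.

```python
from html import escape


def render_gallery(thumbnails, eager_count=8):
    """Return <img> tags for a thumbnail grid.

    thumbnails: list of (src, alt) pairs.
    The first `eager_count` images load right away (they are likely
    in the viewport); the rest are left to the browser's native
    lazy loading, so they are fetched only when scrolled near.
    """
    tags = []
    for index, (src, alt) in enumerate(thumbnails):
        loading = "eager" if index < eager_count else "lazy"
        tags.append(
            f'<img src="{escape(src, quote=True)}" '
            f'alt="{escape(alt, quote=True)}" '
            f'loading="{loading}" decoding="async">'
        )
    return "\n".join(tags)


if __name__ == "__main__":
    demo = [(f"thumbs/{n}.jpg", f"Paper {n}") for n in range(12)]
    print(render_gallery(demo, eager_count=4))
```

| Serving the thumbnails as JPEGs and putting them behind a CDN
| are separate changes that would stack with this one.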
| codeviking wrote:
| > One comment is that the slowest page to load was the Gallery
| [0] as it loads an ungodly amount of PNG files from what
| appears to be a single IP (a GCP Compute instance?)
|
| Yup. There's no CDN or anything like that right now. We kept
| things simple to get this out the door. But we definitely
| intend to make improvements like this as we improve the tool.
|
| The more adoption we see, the more it motivates these types of
| fixes!
|
| > P.S. Also, the paper linked below [1] seems to have a few
| conversion problems -- I see "EQUATION (1): Not extracted;
| please refer to original document", and also some (formula?
| Greek?) characters that seem out of place after the words "and
| the next token is generated by sampling"
|
| Thanks for the catch. As you noted there's still a fair number
| of extraction errors for us to correct!
| mintplant wrote:
| Another sample paper that caused some trouble with figure
| extraction: https://www.cs.utexas.edu/~hovav/dist/vera.pdf
|
| Very cool project, looking forward to seeing how it develops!
| codeviking wrote:
| Thanks, I'll pass this example along!
___________________________________________________________________
(page generated 2021-09-15 23:00 UTC) |