[HN Gopher] Search PDFs with Transformers and Python Notebook
___________________________________________________________________
 
Search PDFs with Transformers and Python Notebook
 
Author : alexcg1
Score  : 109 points
Date   : 2022-07-25 14:11 UTC (8 hours ago)
 
web link (colab.research.google.com)
w3m dump (colab.research.google.com)
 
| [deleted]
 
| CShorten wrote:
| Congratulations Alex, super cool!
 
  | alexcg1 wrote:
  | Thanks man!
 
    | alexcg1 wrote:
    | Nice to meet another person in the super-obvious-username
    | club
 
| divan wrote:
| Can anyone recommend how to build the following solution?
| 
| - Full-text search on modern-era PDFs (i.e. no need for OCR)
| 
| - Exact word search would suffice (fuzzy/contextual search is
| actually less desirable)
| 
| - Cross-platform frontend part that highlights and jumps to the
| found text within the document. Frontend should be embeddable
| (i.e. not a SaaS or just standalone UI)
| 
| - As lightweight as possible (i.e. no Java, Python or Ruby)
| 
| - Long-term oriented stack (i.e. minimum dependencies, ideally
| promise of compatibility)
| 
| I'm looking at Meilisearch or Bleve for the indexing/backend, and
| the Syncfusion Flutter PDF viewer for the frontend, but it still
| needs a lot of glue code and I would love to explore more options.
| 
| Google Pinpoint is pretty cool, and I use it a lot, but there is
| only a hosted Google version, plus it's too smart (I still can't
| get it to do exact word search).
 
  | [deleted]
 
  | snowstormsun wrote:
  | pdfgrep, with some formatting to add links that open the
  | correct page?
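  | 
  | A minimal sketch of that glue (assuming pdfgrep is installed and
  | a viewer that honors the #page= fragment - browser PDF viewers
  | generally do; the parsing assumes grep-style "file:page: text"
  | output):
  | 
  |     import pathlib
  |     import subprocess
  |     import sys
  | 
  |     def search(term, folder="."):
  |         # -r: recurse, -H: print filename, -n: page number
  |         out = subprocess.run(
  |             ["pdfgrep", "-rHn", term, folder],
  |             capture_output=True, text=True,
  |         ).stdout
  |         for line in out.splitlines():
  |             path, page, snippet = line.split(":", 2)
  |             uri = pathlib.Path(path).resolve().as_uri()
  |             print(f"{uri}#page={page}  {snippet.strip()}")
  | 
  |     if __name__ == "__main__":
  |         search(sys.argv[1])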
 
    | alexcg1 wrote:
    | Getting the URI of the original PDF would be straightforward
    | enough - I could whack that into the code tomorrow with a few
    | lines.
    | 
    | Opening up the correct page? I don't know of any standardized
    | PDF reader that supports that kind of thing. And the format
    | has such a history that even if it were supported
    | (technically by Adobe - don't even get me started on what PDF
    | readers support what formats), there's no guarantee the file
    | itself would even have that cooked in.
 
  | capableweb wrote:
  | > - As lightweight as possible (i.e. no Java, Python or Ruby)
  | 
  | I don't have suggestions for you, but I do have a question
  | regarding this point. Why wouldn't Java be considered
  | lightweight? Java literally runs on your SIM card, which is a
  | very bare-bones environment to run something on, I'd probably
  | consider something like that pretty lightweight.
 
    | divan wrote:
    | Ha, I'm from that generation of developers who have a mental
    | model of what is actually happening at the hardware level
    | when you run a program. That doesn't necessarily mean I over-
    | optimize or think about struct field offsets or cache
    | branching, but I do have this in my mental model and just
    | can't unlearn it.
    | 
    | When I think about how much stuff needs to be moved across
    | the CPU/memory/IO bus just to launch a simple "Hello, World"
    | in Java - I just cannot accept it. I do realize that for
    | large programs that overhead is small, but the JVM concept is
    | still something I want to avoid as much as possible. Plus the
    | sheer scale of the Java SDK and the amount of legacy and
    | complexity behind it exceeds my threshold of "avoiding
    | complexity" by orders of magnitude. And the nail in the
    | coffin of my "no Java" stance is, of course, experience with
    | desktop Java applications. Consistently the worst UX and
    | performance I've seen in 25 years among desktop apps.
 
      | alexcg1 wrote:
      | Don't remind me of desktop Java. What was that toolkit,
      | Swing(?), that was used in all the apps back in the day?
      | PDFs have a special place in Hell, but Java desktop UXen
      | deserve a whole special circle.
 
        | divan wrote:
        | PDF history is pretty amazing, actually. The fact that
        | PDF survived over so many decades is something worth
        | reflecting upon :)
 
  | simonw wrote:
  | If you hadn't ruled out Python I'd be suggesting using
  | Datasette + SQLite FTS - I've been building a whole bunch of
  | different search engines on that (including ones for searching
  | within OCRd PDF files) and the cost to host is trivial, since
  | you just need to run a Python process somewhere with a binary
  | SQLite database file. I usually use Vercel, Cloud Run or Fly
  | for that.
  | 
  | One example of a search engine I've built like this is the one
  | on the Datasette website: https://datasette.io/-/beta?q=fts - I
  | wrote about how that works here:
  | https://simonwillison.net/2020/Dec/19/dogsheep-beta/
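  | 
  | For anyone curious, a minimal sketch of that approach (not my
  | exact setup; it extracts text with pdfminer.six and indexes it
  | with sqlite-utils, and the file paths are illustrative):
  | 
  |     import glob
  | 
  |     import sqlite_utils
  |     from pdfminer.high_level import extract_text
  | 
  |     db = sqlite_utils.Database("pdfs.db")
  |     db["docs"].insert_all(
  |         {"path": path, "text": extract_text(path)}
  |         for path in glob.glob("docs/*.pdf")
  |     )
  |     # build a SQLite full-text search index on the text column
  |     db["docs"].enable_fts(["text"])
  | 
  |     for row in db["docs"].search("vector search"):
  |         print(row["path"])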
 
    | divan wrote:
    | Interesting, thanks! I'll take a look (datasette is amazing).
 
  | alexcg1 wrote:
  | - Modern PDFs - if you wanna extract text and images, then the
  | PDFSegmenter used in my example will work. If you want tables
  | too, it might need some additional jiggery-pokery, but it's
  | definitely doable. I know other people using the same framework
  | (Jina) who've accomplished it.
  | 
  | - Exact word search - pretty simple. I've focused on more
  | advanced stuff because color vs colour is same same but
  | different. Also just because it's pretty easy, since I'm just
  | using pre-defined building blocks, not manually integrating
  | stuff.
  | 
  | - Cross-platform frontend - I've seen a lyrics search frontend
  | [0] and I've built stuff in Streamlit before. Jina offers
  | RESTful/gRPC/WebSockets gateways, so it can't be too tough.
  | 
  | - Lightweight? I mean, how lightweight do you want it? C? Bash?
  | Assembly? I've found Python good for text parsing.
  | 
  | - Long-term: The notebook I wrote has a few dependencies (each
  | of which has its own), but compared to others they're
  | relatively lightweight.
  | 
  | - Glue code: I've been using pre-existing building blocks, and
  | writing new Executors (i.e. building blocks) is relatively
  | straightforward; scaling them up with shards, replicas, etc. is
  | just a parameter away (rough sketch at the end of this comment).
  | 
  | I'm more into the search side than the PDF stuff. The PDF side
  | I've had experience with through bitter suffering and torment.
  | Not a fun format to work with (unless you're into sado-
  | masochism).
  | 
  | [0] https://github.com/jina-ai/examples/tree/master/multires-
  | lyr...
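  | 
  | Rough shape of what such a Flow looks like (the Executor names
  | and jinahub:// strings here are illustrative - check the
  | notebook and Jina Hub for the exact identifiers and versions):
  | 
  |     from docarray import Document, DocumentArray
  |     from jina import Flow
  | 
  |     flow = (
  |         Flow()
  |         .add(uses="jinahub://PDFSegmenter", name="segmenter")
  |         .add(
  |             uses="jinahub://TransformerTorchEncoder",
  |             name="encoder",
  |             replicas=2,  # scaling is just a parameter
  |         )
  |         .add(uses="jinahub://SimpleIndexer", name="indexer")
  |     )
  | 
  |     docs = DocumentArray([Document(uri="paper.pdf")])
  |     with flow:
  |         flow.post("/index", inputs=docs)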
 
    | divan wrote:
    | Thanks for the detailed answer.
    | 
    | Most of my use cases deal with 10-100 small PDF documents,
    | some with 1000-2000, but I don't want the solution to choke
    | on 10 GB of huge PDFs (I was just uploading those to Google
    | Pinpoint). So Go or Rust for the backend should be a good fit.
    | 
    | By cross-platform frontend I meant web/iOS/Android/desktop.
    | It's probably only Flutter, but I'm looking for plugins other
    | than Syncfusion's to try. I know that sounds like overkill to
    | many people (a website with search would suffice), but I
    | already have cross-platform apps that would benefit from this
    | functionality, and web is a fallback there, not the main
    | option.
 
| shubham_saboo wrote:
| Wow, this is a really cool way to build full-fledged search, and
| in a notebook at that!
| 
| Does it work end-to-end with the PDF as a data structure, or do
| we have to use OCR and parse the text first to be able to search
| it? Really curious.
 
  | alexcg1 wrote:
  | The version in the notebook is just for simple text-based PDFs.
  | I wrote some posts on our company blog [1] about the sheer
  | agonies of dealing with PDF as a data format, so I wanted to
  | keep things as simple as possible for now.
  | 
  | That said, I'm planning future notebooks where you can perform
  | text-to-image or image-to-image search, integrate OCR, scale it
  | up, serve it, deploy it, etc.
  | 
  | [1] https://medium.com/jina-ai
 
    | shubham_saboo wrote:
    | Awesome, will be on the lookout for that!
 
      | alexcg1 wrote:
      | We've got quite a few other notebooks for other kinds of
      | search on the blog. Would love to hear your thoughts!
 
  | spaetzleesser wrote:
  | "PDF as a data structure"
  | 
  | Don't. PDF is a terrible format for storing machine-readable
  | data. You lose a ton of information when you create the PDF,
  | which you then painstakingly have to get back later (if that's
  | even possible).
 
    | alexcg1 wrote:
    | I may have misworded it (if I wrote those words - PDF rots
    | the brain and my memory likewise).
    | 
    | Agreed on the rest. PDFs don't store machine-readable data.
    | Often just pixelated scanned hot garbage dumpster fire text.
    | 
    | I hate PDFs but have to work with the satan-forsaken things.
    | Hence the notebook. It's my little way of trying to give my
    | little PDF-bespoke hellscape a tiny little glow-up.
 
  | rahimnathwani wrote:
  | Under the hood, it uses
  | https://github.com/pdfminer/pdfminer.six which expects the text
  | to be stored as text.
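  | 
  | For reference, the high-level pdfminer.six API is just a couple
  | of lines - and it only returns anything useful when the PDF
  | actually has a text layer (the file name is a placeholder):
  | 
  |     from pdfminer.high_level import extract_text
  | 
  |     text = extract_text("paper.pdf")
  |     print(text[:500])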
 
    | alexcg1 wrote:
    | You mean the PDFSegmenter Executor in the notebook?
 
      | rahimnathwani wrote:
      | Yes
 
        | alexcg1 wrote:
        | PDFSegmenter also extracts images, which can then be
        | OCR'ed in the next step of the pipeline
 
  | alexcg1 wrote:
  | Incidentally Jina Hub [0] has a few OCR Executors [1][2] you
  | could integrate into my notebook (though you'd have to do some
  | rewiring to take images into account since it's a text-based
  | notebook)
  | 
  | [0] https://hub.jina.ai/
  | 
  | [1] https://hub.jina.ai/executor/w4p7905v
  | 
  | [2] https://hub.jina.ai/executor/78yp7etm
 
| fzliu wrote:
| I just tried this on all the papers I downloaded over the past
| couple months - cool stuff.
| 
| How well would this work in a production setting, e.g. when
| searching over millions of PDFs on arxiv (soon to be tens of
| millions)? Follow-up: have you tried using a vector database such
| as Milvus as the key piece of underlying infrastructure to avoid
| having to implement deletes, failover, scaling, etc?
| https://zilliz.com/learn/what-is-vector-database
 
  | alexcg1 wrote:
  | In terms of matching embeddings and performing similarity
  | search on text/images - folks are already using the framework
  | (Jina) for that and getting decent results.
  | 
  | In terms of processing the PDFs and extracting that data: idk.
  | That depends on a lot of factors - e.g. do you need to OCR the
  | PDFs, or can you just extract the text directly? Either way, it
  | should be possible to write a module and then easily scale it
  | up (Jina supports shards/replicas). Anyway, lemme know. I'm in
  | talks with folks about this kind of shitshow...uh...use case
  | now.
  | 
  | Jina supports multiple vector database backends, like Weaviate,
  | Qdrant and others. For others (like Milvus), I suggest you ask
  | on the Slack [0] - responses tend to be fast.
  | 
  | [0] https://slack.jina.ai
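  | 
  | Swapping the backend is roughly a one-liner in docarray (which
  | Jina uses under the hood). This is a sketch from memory - the
  | exact config keys vary by backend and version, and it assumes a
  | Qdrant instance is running locally:
  | 
  |     from docarray import DocumentArray
  | 
  |     # vectors and metadata get persisted in Qdrant instead of
  |     # in memory; n_dim must match your embedding size
  |     da = DocumentArray(
  |         storage="qdrant", config={"n_dim": 512}
  |     )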
 
| gapovaj742 wrote:
| Okay, but what if my PDF is non-parseable? Not sure if Python's
| any good for that.
 
  | alexcg1 wrote:
  | In that case I'd use:
  | 
  | 1. PDFSegmenter (in the notebook) - extract the images of the
  | text (yup, it does images too)
  | 
  | 2. An OCR Executor [0][1] from Jina Hub [2] to extract the text
  | from the images
  | 
  | 3. Actually splice the text chunks together to be what you'd
  | expect - that's the tricky part. Even text splitting over pages
  | can be tricky to reassemble properly. PDFs are a pain in the
  | butt, frankly.
  | 
  | (A rough non-Jina sketch of steps 1-2 is below.)
  | 
  | [0] https://hub.jina.ai/executor/78yp7etm
  | 
  | [1] https://hub.jina.ai/executor/w4p7905v
  | 
  | [2] https://hub.jina.ai
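  | 
  | The promised sketch of steps 1-2, swapping in pdf2image and
  | pytesseract instead of the Jina Executors (it assumes poppler
  | and tesseract are installed on the system; step 3 is the hard
  | part and isn't shown):
  | 
  |     import pytesseract
  |     from pdf2image import convert_from_path
  | 
  |     # render each page of the scanned PDF to a PIL image
  |     pages = convert_from_path("scanned.pdf", dpi=300)
  | 
  |     # OCR each page image back into text
  |     texts = [pytesseract.image_to_string(p) for p in pages]
  |     print(texts[0][:500])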
 
  | nicodjimenez wrote:
  | Mathpix PDF search is fully visually powered and does not use
  | underlying PDF metadata, even working on handwriting. It's a
  | great choice for researchers (especially in STEM) who want to
  | build a searchable archive of PDFs.
 
  | simonw wrote:
  | Amazon Textract does a phenomenal job of extracting text from
  | dodgy scanned PDFs - I've been running it against scanned
  | typewritten text and even handwritten journal text from the
  | 1880s with great results.
  | 
  | I built a tool for running OCR against every PDF in an S3
  | bucket (which costs about $1.50/thousand pages) here:
  | https://simonwillison.net/2022/Jun/30/s3-ocr/
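  | 
  | For a rough idea of the shape of that (this isn't the s3-ocr
  | code - bucket and key names are placeholders, and a real run
  | would also page through results with NextToken):
  | 
  |     import time
  | 
  |     import boto3
  | 
  |     textract = boto3.client("textract")
  |     job = textract.start_document_text_detection(
  |         DocumentLocation={"S3Object": {
  |             "Bucket": "my-bucket", "Name": "scan.pdf"}}
  |     )
  |     while True:
  |         result = textract.get_document_text_detection(
  |             JobId=job["JobId"])
  |         if result["JobStatus"] != "IN_PROGRESS":
  |             break
  |         time.sleep(5)
  | 
  |     lines = [b["Text"] for b in result.get("Blocks", [])
  |              if b["BlockType"] == "LINE"]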
 
| alexcg1 wrote:
| Wow, this post really took off! If anyone wants to read some of
| my blog posts on building PDF search engines (and the pain,
| torment and anguish that it causes) read:
| 
| - https://medium.com/jina-ai/building-an-ai-powered-pdf-search...
| 
| - https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...
| 
| - https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...
 
  | [deleted]
 
  | Malp wrote:
  | Great stuff, I went down the rabbit hole of building something
  | similar for synthesizing flash cards + Q/A pairs from textbook
  | PDFs about a year ago, and I would also emphasize that PDF
  | search is a janky nightmare to get within the ballpark of
  | usability :')
 
    | alexcg1 wrote:
    | I feel your pain my brother(?) [0] in suffering. That's why I
    | started simple in the notebook. Even trying to go a little
    | more complex just leads to exponential rabbit holes and
    | footguns.
    | 
    | [0] based on typical HN demographics, no assumptions here
 
| PaulHoule wrote:
| Does it really work better than a simple tf-idf?
| 
| I worked on a neural search engine just when deep networks were
| taking off, and we knew that it worked because we had test data
| saying certain documents were relevant for certain queries, so we
| could compute precision and recall curves. My experience was that
| if the AUC metric is substantially improved, customers really
| notice the difference.
| 
| Very few search vendors do this kind of testing because it is
| expensive and because enterprise customers seem to care more that
| there are connectors to 800+ external systems than if the search
| results are any good.
| 
| The main trouble I see with PDF search is that text extracted
| from PDF files is full of junk punctuation, including stray
| spaces, so if you are trying a bag-of-words-based search the
| words are corrupted. It seems to me you could build a neural
| model that works around the brokenness of PDF, but that isn't
| 'download a model from spaCy and pray' - it would be a big job
| that starts with getting 10 GB+ of PDF text.
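| 
| For what it's worth, the "simple tf-idf" baseline is only a few
| lines with scikit-learn (the documents and query here are
| placeholders):
| 
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.metrics.pairwise import cosine_similarity
| 
|     docs = ["text of pdf one ...", "text of pdf two ..."]
|     vectorizer = TfidfVectorizer(stop_words="english")
|     doc_matrix = vectorizer.fit_transform(docs)
| 
|     # rank documents by cosine similarity to the query
|     query_vec = vectorizer.transform(["exact search terms"])
|     scores = cosine_similarity(query_vec, doc_matrix).ravel()
|     ranking = scores.argsort()[::-1]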
 
  | alexcg1 wrote:
  | I'll agree that there's quite a bit of junk punctuation in the
  | extracted sentences (and sentence fragments), quite often from
  | short footnotes in the Wiki articles. Getting "good" PDFs with
  | open usage rights was a bit tricky, especially in a super
  | simple PDF format. I ended up PDF-printing from Chrome.
  | 
  | Needless to say, working with PDFs makes me want to pull my
  | hair out.
  | 
  | I also ended up writing the SpacySentencizer Executor instead
  | of using a "vanilla" sentencizer. That led to consistent
  | sentence splitting (so "J.R.R. Tolkien turned to pg. 3" would
  | be one sentence, not 5)
  | 
  | For testing, Jina allows you to swap out encoders with just a
  | couple of lines of code, so trying different methods out should
  | work just fine.
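  | 
  | For the curious, plain spaCy sentence splitting (not the
  | SpacySentencizer Executor itself) looks like this - it assumes
  | the small English model has been installed with
  | "python -m spacy download en_core_web_sm":
  | 
  |     import spacy
  | 
  |     nlp = spacy.load("en_core_web_sm")
  |     text = "J.R.R. Tolkien turned to pg. 3. He kept reading."
  |     doc = nlp(text)
  |     print([sent.text for sent in doc.sents])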
 
    | PaulHoule wrote:
    | I dunno, you can download a million or so PDFs from arxiv.org
    | and even more from archive.org. They aren't hard to find.
    | 
    | There is something to be said for round-tripping PDFs from a
    | source you control (you can accurately model the corruption
    | produced by a particular system), but you will certainly see
    | new and different phenomena if you try more.
    | 
    | I'd agree that spacy's sentence segmentation is better than
    | many of the alternatives.
 
      | alexcg1 wrote:
      | If new and different phenomena means new kinds of
      | corruption and downright weird behavior, I'll end up with
      | no hair left!
      | 
      | Even printing the same page to PDF with Chrome and Firefox
      | delivers quite different results. Firefox was often
      | combining "f" and "i" into an fi ligature [0], which
      | totally changed the underlying text of "finished", for
      | example.
      | 
      | Downloading a lot of random PDFs from arxiv would be great
      | for making something battle-hardened and robust (and I'd
      | love to get the chance to do it sometime) but I didn't have
      | the time (or the remaining hair) to do it this time round.
      | 
      | [0] https://www.compart.com/en/unicode/U+FB01
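      | 
      | Handy trick: Unicode compatibility normalization folds the
      | ligature back into plain "f" + "i", which is one way to
      | clean this up before indexing:
      | 
      |     import unicodedata
      | 
      |     s = "\ufb01nished"  # "finished" with an fi ligature
      |     print(unicodedata.normalize("NFKC", s))  # -> finished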
 
      | alexcg1 wrote:
      | And +1 to spaCy. I typically use it over Transformers
      | because it's SO much faster. I just used Transformers in
      | this example for a change. My Stack Overflow search
      | notebook [0] uses spaCy.
      | 
      | [0] https://colab.research.google.com/github/jina-
      | ai/workshops/b...
 
| nicodjimenez wrote:
| Mathpix Snip also supports PDF search, including for handwritten
| content, and including math symbols in equations.
| 
| Disclaimer: I'm the founder.
 
  | ok_computer wrote:
  | Mathpix Snip for PDF to LaTeX is excellent. Thank you for the
  | free tier. It is helpful for transcribing PDF math homework
  | sets to use in the solution document without bugging the
  | instructor for their source.
 
  | alexcg1 wrote:
  | Oh, nifty! This is more a demo of a PDF search engine that you
  | could (in parts 1 thru x of the series) deploy to an intranet
  | (for internal knowledge search) or the internet (for general
  | search), rather than a collaborative tool.
  | 
  | For handwritten/math symbols, I'm sure it wouldn't be too hard
  | to integrate something. The Jina Flow [0] concept makes
  | integrating new Executors [1] pretty easy.
  | 
  | I LOVE the testimonials on the site btw!
  | 
  | [0] https://docs.jina.ai/fundamentals/flow/
  | 
  | [1] https://docs.jina.ai/fundamentals/executor/
 
| Stampo00 wrote:
| Pardon me while I go add Optimus Prime to my corporate
| letterhead.
 
  | [deleted]
 
___________________________________________________________________
(page generated 2022-07-25 23:01 UTC)