[HN Gopher] Search PDFs with Transformers and Python Notebook
___________________________________________________________________
 
Search PDFs with Transformers and Python Notebook
 
Author : alexcg1
Score  : 109 points
Date   : 2022-07-25 14:11 UTC (8 hours ago)
 
web link (colab.research.google.com)
w3m dump (colab.research.google.com)
 
| [deleted]
 
| CShorten wrote:
| Congratulations Alex, super cool!
 
  | alexcg1 wrote:
  | Thanks man!
 
    | alexcg1 wrote:
    | Nice to meet another person in the super-obvious-username
    | club
 
| divan wrote:
| Can anyone recommend how to build the following solution?
| 
| - Full-text search on modern-era PDFs (i.e. no need for OCR)
| 
| - Exact word search would suffice (fuzzy/contextual search is
| actually less desirable)
| 
| - Cross-platform frontend part that highlights and jumps to the
| found text within the document. Frontend should be embeddable
| (i.e. not a SaaS or just standalone UI)
| 
| - As lightweight as possible (i.e. no Java, Python or Ruby)
| 
| - Long-term oriented stack (i.e. minimum dependencies, ideally
| promise of compatibility)
| 
| I'm looking at Meilisearch or Bleve for the indexing/backend, and
| the Syncfusion Flutter PDF viewer for the frontend, but it still
| needs a lot of glue code and I would love to explore more options.
| 
| Google Pinpoint is pretty cool, and I use it a lot, but there is
| only a hosted Google version, plus it's too smart (I still can't
| get it to do exact word search).
 
  | [deleted]
 
  | snowstormsun wrote:
  | pdfgrep, with some formatting to add links that open the
  | correct page?
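  | 
  | A minimal sketch of that glue (assuming pdfgrep is installed and
  | a viewer that honors the #page= fragment - browser PDF viewers
  | generally do; the parsing assumes grep-style "file:page: text"
  | output):
  | 
  |     import pathlib
  |     import subprocess
  |     import sys
  | 
  |     def search(term, folder="."):
  |         # -r: recurse, -H: print filename, -n: page number
  |         out = subprocess.run(
  |             ["pdfgrep", "-rHn", term, folder],
  |             capture_output=True, text=True,
  |         ).stdout
  |         for line in out.splitlines():
  |             path, page, snippet = line.split(":", 2)
  |             uri = pathlib.Path(path).resolve().as_uri()
  |             print(f"{uri}#page={page}  {snippet.strip()}")
  | 
  |     if __name__ == "__main__":
  |         search(sys.argv[1])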
 
    | alexcg1 wrote:
    | Getting the URI of the original PDF would be straightforward
    | enough - I could whack that into the code tomorrow with a few
    | lines.
    | 
    | Opening up the correct page? I don't know of any standardized
    | PDF reader that supports that kind of thing. And the format
    | has such a history that even if it were supported
    | (technically by Adobe - don't even get me started on what PDF
    | readers support what formats), there's no guarantee the file
    | itself would even have that cooked in.
 
  | capableweb wrote:
  | > - As lightweight as possible (i.e. no Java, Python or Ruby)
  | 
  | I don't have suggestions for you, but I do have a question
  | regarding this point. Why wouldn't Java be considered
  | lightweight? Java literally runs on your SIM card, which is a
  | very bare-bones environment to run something on, I'd probably
  | consider something like that pretty lightweight.
 
    | divan wrote:
    | Ha, I'm from that generation of developers who have a mental
    | model of what is actually happening at the hardware level
    | when you run a program. That doesn't necessarily mean I over-
    | optimize or think about struct field offsets or cache
    | branching, but I do have this in my mental model and just
    | can't unlearn it.
    | 
    | When I think about how much stuff needs to be moved across
    | the CPU/memory/IO bus just to launch a simple "Hello, World"
    | in Java - I just cannot accept it. I do realize that for
    | large programs that overhead is small, but the JVM concept is
    | still something I want to avoid as much as possible. Plus the
    | sheer scale of the Java SDK and the amount of legacy and
    | complexity behind it exceeds my threshold of "avoiding
    | complexity" by orders of magnitude. And the nail in the
    | coffin of my "no Java" stance is, of course, experience with
    | desktop Java applications. Consistently the worst UX and
    | performance I've seen in 25 years among desktop apps.
 
      | alexcg1 wrote:
      | Don't remind me of desktop Java. What was that toolkit,
      | Swing(?), that was used in all the apps back in the day?
      | PDFs have a special place in Hell, but Java desktop UXen
      | deserve a whole special circle.
 
        | divan wrote:
        | PDF history is pretty amazing, actually. The fact that
        | PDF survived over so many decades is something worth
        | reflecting upon :)
 
  | simonw wrote:
  | If you hadn't ruled out Python I'd be suggesting using
  | Datasette + SQLite FTS - I've been building a whole bunch of
  | different search engines on that (including ones for searching
  | within OCRd PDF files) and the cost to host is trivial, since
  | you just need to run a Python process somewhere with a binary
  | SQLite database file. I usually use Vercel, Cloud Run or Fly
  | for that.
  | 
  | One example of a search engine I've built like this is the one
  | on the Datasette website: https://datasette.io/-/beta?q=fts - I
  | wrote about how that works here:
  | https://simonwillison.net/2020/Dec/19/dogsheep-beta/
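  | 
  | For anyone curious, a minimal sketch of that approach (not my
  | exact setup; it extracts text with pdfminer.six and indexes it
  | with sqlite-utils, and the file paths are illustrative):
  | 
  |     import glob
  | 
  |     import sqlite_utils
  |     from pdfminer.high_level import extract_text
  | 
  |     db = sqlite_utils.Database("pdfs.db")
  |     db["docs"].insert_all(
  |         {"path": path, "text": extract_text(path)}
  |         for path in glob.glob("docs/*.pdf")
  |     )
  |     # build a SQLite full-text search index on the text column
  |     db["docs"].enable_fts(["text"])
  | 
  |     for row in db["docs"].search("vector search"):
  |         print(row["path"])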
 
    | divan wrote:
    | Interesting, thanks! I'll take a look (datasette is amazing).
 
  | alexcg1 wrote:
  | - Modern PDFs - if you wanna extract text and images, then the
  | PDFSegmenter used in my example will work. If you want tables
  | too, it might need some additional jiggery-pokery, but it's
  | definitely doable. I know other people using the same framework
  | (Jina) who've accomplished it.
  | 
  | - Exact word search - pretty simple. I've focused on more
  | advanced stuff because color vs colour is same same but
  | different. Also just because it's pretty easy, since I'm just
  | using pre-defined building blocks, not manually integrating
  | stuff.
  | 
  | - Cross-platform frontend - I've seen a lyrics search frontend
  | [0] and I've built stuff in Streamlit before. Jina offers
  | RESTful/gRPC/WebSockets gateways, so it can't be too tough.
  | 
  | - Lightweight? I mean, how lightweight do you want it? C? Bash?
  | Assembly? I've found Python good for text parsing.
  | 
  | - Long-term: The notebook I wrote has a few dependencies (each
  | of which has its own), but compared to others they're
  | relatively lightweight.
  | 
  | - Glue code: I've been using pre-existing building blocks, and
  | writing new Executors (i.e. building blocks) is relatively
  | straightforward; scaling them up with shards, replicas, etc. is
  | just a parameter away (rough sketch at the end of this comment).
  | 
  | I'm more into the search side than the PDF stuff. The PDF side
  | I've had experience with through bitter suffering and torment.
  | Not a fun format to work with (unless you're into sado-
  | masochism).
  | 
  | [0] https://github.com/jina-ai/examples/tree/master/multires-
  | lyr...
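  | 
  | Rough shape of what such a Flow looks like (the Executor names
  | and jinahub:// strings here are illustrative - check the
  | notebook and Jina Hub for the exact identifiers and versions):
  | 
  |     from docarray import Document, DocumentArray
  |     from jina import Flow
  | 
  |     flow = (
  |         Flow()
  |         .add(uses="jinahub://PDFSegmenter", name="segmenter")
  |         .add(
  |             uses="jinahub://TransformerTorchEncoder",
  |             name="encoder",
  |             replicas=2,  # scaling is just a parameter
  |         )
  |         .add(uses="jinahub://SimpleIndexer", name="indexer")
  |     )
  | 
  |     docs = DocumentArray([Document(uri="paper.pdf")])
  |     with flow:
  |         flow.post("/index", inputs=docs)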
 
    | divan wrote:
    | Thanks for the detailed answer.
    | 
    | Most of my use cases deal with 10-100 small PDF documents,
    | some with 1000-2000, but I don't want the solution to choke
    | on 10 GB of huge PDFs (I was just uploading those to Google
    | Pinpoint). So Go or Rust for the backend should be a good fit.
    | 
    | By cross-platform frontend I meant web/iOS/Android/desktop.
    | It's probably only Flutter, but I'm looking for plugins other
    | than Syncfusion's to try. I know that sounds like overkill to
    | many people (a website with search would suffice), but I
    | already have cross-platform apps that would benefit from this
    | functionality, and web is a fallback there, not the main
    | option.
 
| shubham_saboo wrote:
| Wow, this is a really cool way to build full-fledged search, and
| in a notebook at that!
| 
| Does it work end-to-end with the PDF as a data structure, or do
| we have to use OCR and parse the text first to be able to search
| it? Really curious.
 
  | alexcg1 wrote:
  | The version in the notebook is just for simple text-based PDFs.
  | I wrote some posts on our company blog [1] about the sheer
  | agonies of dealing with PDF as a data format, so I wanted to
  | keep things as simple as possible for now.
  | 
  | That said, I'm planning future notebooks where you can perform
  | text-to-image or image-to-image search, integrate OCR, scale it
  | up, serve it, deploy it, etc.
  | 
  | [1] https://medium.com/jina-ai
 
    | shubham_saboo wrote:
    | Awesome, will be on the lookout for that!
 
      | alexcg1 wrote:
      | We've got quite a few other notebooks for other kinds of
      | search on the blog. Would love to hear your thoughts!
 
  | spaetzleesser wrote:
  | "PDF as a data structure"
  | 
  | Don't. PDF is a terrible format for storing machine-readable
  | data. You lose a ton of information when you create the PDF,
  | which you then painstakingly have to get back later (if that's
  | even possible).
 
    | alexcg1 wrote:
    | I may have misworded it (if I wrote those words - PDF rots
    | the brain and my memory likewise).
    | 
    | Agreed on the rest. PDFs don't store machine-readable data.
    | Often just pixelated scanned hot garbage dumpster fire text.
    | 
    | I hate PDFs but have to work with the satan-forsaken things.
    | Hence the notebook. It's my little way of trying to give my
    | little PDF-bespoke hellscape a tiny little glow-up.
 
  | rahimnathwani wrote:
  | Under the hood, it uses
  | https://github.com/pdfminer/pdfminer.six which expects the text
  | to be stored as text.
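  | 
  | For reference, the high-level pdfminer.six API is just a couple
  | of lines - and it only returns anything useful when the PDF
  | actually has a text layer (the file name is a placeholder):
  | 
  |     from pdfminer.high_level import extract_text
  | 
  |     text = extract_text("paper.pdf")
  |     print(text[:500])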
 
    | alexcg1 wrote:
    | You mean the PDFSegmenter Executor in the notebook?
 
      | rahimnathwani wrote:
      | Yes
 
        | alexcg1 wrote:
        | PDFSegmenter also extracts images, which can then be
        | OCR'ed in the next step of the pipeline
 
  | alexcg1 wrote:
  | Incidentally Jina Hub [0] has a few OCR Executors [1][2] you
  | could integrate into my notebook (though you'd have to do some
  | rewiring to take images into account since it's a text-based
  | notebook)
  | 
  | [0] https://hub.jina.ai/
  | 
  | [1] https://hub.jina.ai/executor/w4p7905v
  | 
  | [2] https://hub.jina.ai/executor/78yp7etm
 
| fzliu wrote:
| I just tried this on all the papers I downloaded over the past
| couple months - cool stuff.
| 
| How well would this work in a production setting, e.g. when
| searching over millions of PDFs on arxiv (soon to be tens of
| millions)? Follow-up: have you tried using a vector database such
| as Milvus as the key piece of underlying infrastructure to avoid
| having to implement deletes, failover, scaling, etc?
| https://zilliz.com/learn/what-is-vector-database
 
  | alexcg1 wrote:
  | In terms of matching embeddings and performing similarity
  | search on text/images - folks are already using the framework
  | (Jina) for that and getting decent results.
  | 
  | In terms of processing the PDFs and extracting that data: idk.
  | That depends on a lot of factors - e.g. do you need to OCR the
  | PDFs, or can you just extract the text directly? Either way, it
  | should be possible to write a module and then easily scale it
  | up (Jina supports shards/replicas). Anyway, lemme know. I'm in
  | talks with folks about this kind of shitshow...uh...use case
  | now.
  | 
  | Jina supports multiple vector database backends, like Weaviate,
  | Qdrant and others. For others (like Milvus), I suggest you ask
  | on the Slack [0] - responses tend to be fast.
  | 
  | [0] https://slack.jina.ai
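  | 
  | Swapping the backend is roughly a one-liner in docarray (which
  | Jina uses under the hood). This is a sketch from memory - the
  | exact config keys vary by backend and version, and it assumes a
  | Qdrant instance is running locally:
  | 
  |     from docarray import DocumentArray
  | 
  |     # vectors and metadata get persisted in Qdrant instead of
  |     # in memory; n_dim must match your embedding size
  |     da = DocumentArray(
  |         storage="qdrant", config={"n_dim": 512}
  |     )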
 
| gapovaj742 wrote:
| Okay, but what if my PDF is non-parseable? Not sure if Python's
| any good for that.
 
  | alexcg1 wrote:
  | In that case I'd use:
  | 
  | 1. PDFSegmenter (in the notebook) - extract the images of the
  | text (yup, it does images too)
  | 
  | 2. An OCR Executor [0][1] from Jina Hub [2] to extract the text
  | from the images
  | 
  | 3. Actually splice the text chunks together to be what you'd
  | expect - that's the tricky part. Even text splitting over pages
  | can be tricky to reassemble properly. PDFs are a pain in the
  | butt, frankly.
  | 
  | (A rough non-Jina sketch of steps 1-2 is below.)
  | 
  | [0] https://hub.jina.ai/executor/78yp7etm
  | 
  | [1] https://hub.jina.ai/executor/w4p7905v
  | 
  | [2] https://hub.jina.ai
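  | 
  | The promised sketch of steps 1-2, swapping in pdf2image and
  | pytesseract instead of the Jina Executors (it assumes poppler
  | and tesseract are installed on the system; step 3 is the hard
  | part and isn't shown):
  | 
  |     import pytesseract
  |     from pdf2image import convert_from_path
  | 
  |     # render each page of the scanned PDF to a PIL image
  |     pages = convert_from_path("scanned.pdf", dpi=300)
  | 
  |     # OCR each page image back into text
  |     texts = [pytesseract.image_to_string(p) for p in pages]
  |     print(texts[0][:500])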
 
  | nicodjimenez wrote:
  | Mathpix PDF search is fully visually powered and does not use
  | underlying PDF metadata, even working on handwriting. It's a
  | great choice for researchers (especially in STEM) who want to
  | build a searchable archive of PDFs.
 
  | simonw wrote:
  | Amazon Textract does a phenomenal job of extracting text from
  | dodgy scanned PDFs - I've been running it against scanned
  | typewritten text and even handwritten journal text from the
  | 1880s with great results.
  | 
  | I built a tool for running OCR against every PDF in an S3
  | bucket (which costs about $1.50/thousand pages) here:
  | https://simonwillison.net/2022/Jun/30/s3-ocr/
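  | 
  | For a rough idea of the shape of that (this isn't the s3-ocr
  | code - bucket and key names are placeholders, and a real run
  | would also page through results with NextToken):
  | 
  |     import time
  | 
  |     import boto3
  | 
  |     textract = boto3.client("textract")
  |     job = textract.start_document_text_detection(
  |         DocumentLocation={"S3Object": {
  |             "Bucket": "my-bucket", "Name": "scan.pdf"}}
  |     )
  |     while True:
  |         result = textract.get_document_text_detection(
  |             JobId=job["JobId"])
  |         if result["JobStatus"] != "IN_PROGRESS":
  |             break
  |         time.sleep(5)
  | 
  |     lines = [b["Text"] for b in result.get("Blocks", [])
  |              if b["BlockType"] == "LINE"]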
 
| alexcg1 wrote:
| Wow, this post really took off! If anyone wants to read some of
| my blog posts on building PDF search engines (and the pain,
| torment and anguish that it causes) read:
| 
| - https://medium.com/jina-ai/building-an-ai-powered-pdf-search...
| 
| - https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...
| 
| - https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...
 
  | [deleted]
 
  | Malp wrote:
  | Great stuff, I went down the rabbit hole of building something
  | similar for synthesizing flash cards + Q/A pairs from textbook
  | PDFs about a year ago, and I would also emphasize that PDF
  | search is a janky nightmare to get within the ballpark of
  | usability :')
 
    | alexcg1 wrote:
    | I feel your pain my brother(?) [0] in suffering. That's why I
    | started simple in the notebook. Even trying to go a little
    | more complex just leads to exponential rabbit holes and
    | footguns.
    | 
    | [0] based on typical HN demographics, no assumptions here
 
| PaulHoule wrote:
| Does it really work better than a simple tf-idf?
| 
| I worked on a neural search engine just when deep networks were
| taking off, and we knew that it worked because we had test data
| saying certain documents were relevant for certain queries, so we
| could compute precision and recall curves. My experience was that
| if the AUC metric is substantially improved, customers really
| notice the difference.
| 
| Very few search vendors do this kind of testing because it is
| expensive and because enterprise customers seem to care more that
| there are connectors to 800+ external systems than if the search
| results are any good.
| 
| The main trouble I see with PDF search is that text extracted
| from PDF files is full of junk punctuation, including stray
| spaces, so if you are trying a bag-of-words-based search the
| words are corrupted. It seems to me you could build a neural
| model that works around the brokenness of PDF, but that isn't
| 'download a model from spaCy and pray' - it would be a big job
| that starts with getting 10 GB+ of PDF text.
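| 
| For what it's worth, the "simple tf-idf" baseline is only a few
| lines with scikit-learn (the documents and query here are
| placeholders):
| 
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.metrics.pairwise import cosine_similarity
| 
|     docs = ["text of pdf one ...", "text of pdf two ..."]
|     vectorizer = TfidfVectorizer(stop_words="english")
|     doc_matrix = vectorizer.fit_transform(docs)
| 
|     # rank documents by cosine similarity to the query
|     query_vec = vectorizer.transform(["exact search terms"])
|     scores = cosine_similarity(query_vec, doc_matrix).ravel()
|     ranking = scores.argsort()[::-1]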
 
  | alexcg1 wrote:
  | I'll agree that there's quite a bit of junk punctuation in the
  | extracted sentences (and sentence fragments), quite often from
  | short footnotes in the Wiki articles. Getting "good" PDFs with
  | open usage rights was a bit tricky, especially in a super
  | simple PDF format. I ended up PDF-printing from Chrome.
  | 
  | Needless to say, working with PDFs makes me want to pull my
  | hair out.
  | 
  | I also ended up writing the SpacySentencizer Executor instead
  | of using a "vanilla" sentencizer. That led to consistent
  | sentence splitting (so "J.R.R. Tolkien turned to pg. 3" would
  | be one sentence, not 5)
  | 
  | For testing, Jina allows you to swap out encoders with just a
  | couple of lines of code, so trying different methods out should
  | work just fine.
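  | 
  | For the curious, plain spaCy sentence splitting (not the
  | SpacySentencizer Executor itself) looks like this - it assumes
  | the small English model has been installed with
  | "python -m spacy download en_core_web_sm":
  | 
  |     import spacy
  | 
  |     nlp = spacy.load("en_core_web_sm")
  |     text = "J.R.R. Tolkien turned to pg. 3. He kept reading."
  |     doc = nlp(text)
  |     print([sent.text for sent in doc.sents])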
 
    | PaulHoule wrote:
    | I dunno, you can download a million or so PDFs from arxiv.org
    | and even more from archive.org. They aren't hard to find.
    | 
    | There is something to be said for round-tripping PDFs from a
    | source you control (you can accurately model the corruption
    | produced by a particular system), but you will certainly see
    | new and different phenomena if you try more.
    | 
    | I'd agree that spacy's sentence segmentation is better than
    | many of the alternatives.
 
      | alexcg1 wrote:
      | If new and different phenomena means new kinds of
      | corruption and downright weird behavior, I'll end up with
      | no hair left!
      | 
      | Even printing the same page to PDF with Chrome and Firefox
      | delivers quite different results. Firefox was often
      | combining "f" and "i" into an fi ligature [0], which
      | totally changed the underlying text of "finished", for
      | example.
      | 
      | Downloading a lot of random PDFs from arxiv would be great
      | for making something battle-hardened and robust (and I'd
      | love to get the chance to do it sometime) but I didn't have
      | the time (or the remaining hair) to do it this time round.
      | 
      | [0] https://www.compart.com/en/unicode/U+FB01
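      | 
      | Handy trick: Unicode compatibility normalization folds the
      | ligature back into plain "f" + "i", which is one way to
      | clean this up before indexing:
      | 
      |     import unicodedata
      | 
      |     s = "\ufb01nished"  # "finished" with an fi ligature
      |     print(unicodedata.normalize("NFKC", s))  # -> finished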
 
      | alexcg1 wrote:
      | And +1 to spaCy. I typically use it over Transformers
      | because it's SO much faster. I just used Transformers in
      | this example for a change. My Stack Overflow search
      | notebook [0] uses spaCy.
      | 
      | [0] https://colab.research.google.com/github/jina-
      | ai/workshops/b...
 
| nicodjimenez wrote:
| Mathpix Snip also supports PDF search, including for handwritten
| content, and including math symbols in equations.
| 
| Disclaimer: I'm the founder.
 
  | ok_computer wrote:
  | Mathpix Snip for PDF to LaTeX is excellent. Thank you for the
  | free tier. It is helpful for transcribing PDF math homework
  | sets to use in the solution document without bugging the
  | instructor for their source.
 
  | alexcg1 wrote:
  | Oh, nifty! This is more a demo of a PDF search engine that you
  | could (in parts 1 thru x of the series) deploy to an intranet
  | (for internal knowledge search) or the internet (for general
  | search), rather than a collaborative tool.
  | 
  | For handwritten/math symbols, I'm sure it wouldn't be too hard
  | to integrate something. The Jina Flow [0] concept makes
  | integrating new Executors [1] pretty easy.
  | 
  | I LOVE the testimonials on the site btw!
  | 
  | [0] https://docs.jina.ai/fundamentals/flow/
  | 
  | [1] https://docs.jina.ai/fundamentals/executor/
 
| Stampo00 wrote:
| Pardon me while I go add Optimus Prime to my corporate
| letterhead.
 
  | [deleted]
 
___________________________________________________________________
(page generated 2022-07-25 23:01 UTC)