|
| [deleted]
| CShorten wrote:
| Congratulations Alex, super cool!
| alexcg1 wrote:
| Thanks man!
| alexcg1 wrote:
| Nice to meet another person in the super-obvious-username
| club
| divan wrote:
| Can anyone recommend how to build the following solution?
|
| - Full-text search on modern era PDFs (i.e no need for OCR)
|
| - Exact word search would suffice (fuzzy/contextual search
| actually is less desirable)
|
| - Cross-platform frontend part that highlights and jumps to the
| found text within the document. Frontend should be embeddable
| (i.e. not a SaaS or just standalone UI)
|
| - As lightweight as possible (i.e. no Java, Python or Ruby)
|
| - Long-term oriented stack (i.e. minimum dependencies, ideally
| promise of compatibility)
|
| I'm looking at Meilisearch or Bleve for indexing/backend, and
| Syncfusion Flutter PDF viewer for frontend, but it still needs a
| lot of gluing code and I would love to explore more options.
|
| Google Pinpoint is pretty cool, and I use it a lot, but there is
| only hosted Google version, plus it's too smart (still can't get
| it to do exact word search).
| [deleted]
| snowstormsun wrote:
| pdfgrep with some formatting to add links that open the
| correct page?
| alexcg1 wrote:
| Getting the URI of original PDF would be straightforward
| enough - I could whack that into the code tomorrow with a few
| lines.
|
| Opening up the correct page? I don't know of any standardized
| PDF reader that supports that kind of thing. And the format
| has such a history that even if it were supported
| (technically by Adobe - don't even get me started on what PDF
| readers support what formats), there's no guarantee the file
| itself would even have that cooked in.
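A minimal sketch of the "pdfgrep plus links" idea, under assumptions that are not from the thread: pdfgrep is run with its page-number option so each match line looks like `file.pdf:12:matched text`, and the viewer honors a `#page=N` URL fragment. Some browser-based viewers do, but as noted above nothing guarantees it for every reader. The helper name `match_to_link` is hypothetical.

```python
from pathlib import Path


def match_to_link(line):
    """Turn one 'file:page:text' match line into (link, matched_text).

    Assumes the filename itself contains no colon.
    """
    # Split at most twice so colons inside the matched text survive.
    filename, page, text = line.split(":", 2)
    # The '#page=N' fragment is an assumption about the viewer, not a guarantee.
    link = Path(filename).resolve().as_uri() + "#page=" + str(int(page))
    return link, text.strip()
```

A frontend could render the first element as the hyperlink target and the second as the result snippet.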
| capableweb wrote:
| > - As lightweight as possible (i.e. no Java, Python or Ruby)
|
| I don't have suggestions for you, but I do have a question
| regarding this point. Why wouldn't Java be considered
| lightweight? Java literally runs on your SIM card, which is a
| very bare-bones environment to run something on, I'd probably
| consider something like that pretty lightweight.
| divan wrote:
| Ha, I'm from that generation of developers who have the
| mental model of what is actually happening on the hardware
| level when you run the program. Doesn't necessarily mean I
| overoptimize or think about struct fields offsets or cache
| branching, but I do have this in my mental model and just
| can't unlearn it.
|
| When I think about how much stuff needs to be moved in
| cpu/memory/io bus just to launch a simple "Hello, World" in
| Java - I just cannot accept it. I do realize that for large
| programs that overhead is small, but still the JVM concept is
| something I want to avoid as much as possible. Plus the sheer
| scale of Java SDK and amount of legacy and complexity behind
| it exceeds my threshold of "avoiding complexity" by orders of
| magnitude. And the nail in the coffin of the "no java" stance
| is, of course, experience with desktop Java applications.
| Consistently the worst UX and performance I've seen in 25
| years among desktop apps.
| alexcg1 wrote:
| Don't remind me of desktop Java. What was that toolkit,
| swing(?) that was used in all the apps back in the day.
| PDFs have a special place in Hell, but Java desktop UXen
| deserve a whole special circle
| divan wrote:
| PDF history is pretty amazing, actually. The fact that
| PDF survived over so many decades is something worth
| reflecting upon :)
| simonw wrote:
| If you hadn't ruled out Python I'd be suggesting using
| Datasette + SQLite FTS - I've been building a whole bunch of
| different search engines on that (including ones for searching
| within OCRd PDF files) and the cost to host is trivial, since
| you just need to run a Python process somewhere with a binary
| SQLite database file. I usually use Vercel, Cloud Run or Fly
| for that.
|
| One example of a search engine I've built like this is the one
| on the Datasette website: https://datasette.io/-/beta?q=fts - I
| wrote about how that works here:
| https://simonwillison.net/2020/Dec/19/dogsheep-beta/
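A rough sketch of exact-word search with SQLite full-text search, using only Python's built-in `sqlite3` module rather than Datasette. It assumes the SQLite build bundled with Python includes the FTS5 extension (common, but not guaranteed), and that text has already been extracted per page. The table layout and the names `build_index` and `search` are illustrative, not taken from the linked projects.

```python
import sqlite3


def build_index(pages):
    """Index an iterable of (doc_name, page_number, text) tuples in memory."""
    conn = sqlite3.connect(":memory:")
    # doc and page are stored for display only; just body is tokenized.
    conn.execute(
        "CREATE VIRTUAL TABLE pages USING fts5(doc UNINDEXED, page UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO pages (doc, page, body) VALUES (?, ?, ?)", pages
    )
    return conn


def search(conn, word):
    """Return (doc, page) pairs whose text contains the exact word."""
    # Quoting makes the term a literal string in FTS5 query syntax.
    quoted = '"' + word.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT doc, page FROM pages WHERE pages MATCH ? ORDER BY doc, page",
        (quoted,),
    )
    return [(doc, int(page)) for doc, page in rows]
```

With the default tokenizer there is no stemming or fuzzy matching, so "colour" does not match "color" or "colourful", which fits the exact-word requirement earlier in the thread. Matching is case-insensitive.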
| divan wrote:
| Interesting, thanks! I'll take a look (datasette is amazing).
| alexcg1 wrote:
| - Modern PDFs - if you wanna extract text and images, then the
| PDFSegmenter used in my example will work. If tables too, might
| need some additional jiggery-pokery, but definitely doable. I
| know other ppl using the same framework (Jina) who've
| accomplished it.
|
| - Exact word search - pretty simple. I've focused on more
| advanced stuff because color vs colour is same same but
| different. Also just because it's pretty easy since I'm just
| using pre-defined building blocks, not manually integrating
| stuff
|
| - Cross platform frontend - I've seen a lyrics search frontend
| [0] and I've built stuff in Streamlit before. Jina offers
| RESTful/gRPC/WebSockets gateways so it can't be too tough
|
| - Lightweight? I mean how lightweight do you want it? C? Bash?
| Assembly? I've found Python good for text parsing
|
| - Long-term: The notebook I wrote has a few dependencies (each
| of which has its own), but compared to others they're
| relatively lightweight.
|
| - Gluing code: I've been using pre-existing building blocks,
| and writing new Executors (i.e. building blocks) is relatively
| straightforward, and then scaling them up with shards,
| replicas, etc is just a parameter away.
|
| I'm more into the search side than the PDF stuff. The PDF side
| I've had experience with through bitter suffering and torment.
| Not a fun format to work with (unless you're into sado-
| masochism)
|
| [0] https://github.com/jina-ai/examples/tree/master/multires-
| lyr...
| divan wrote:
| Thanks for the elaborate answer.
|
| Most of my use cases have to deal with 10-100 small PDF
| documents, some - 1000-2000, but I don't want the solution to
| choke on 10GB of huge PDFs (I was just uploading those to
| Google Pinpoint). So Go or Rust for backend should be a good
| fit.
|
| By cross-platform frontend I meant web/ios/android/desktop.
| It's probably only Flutter, but I'm looking for other plugins
| than Syncfusion's one to try. I know that sounds like
| overkill for many people (a website with search would suffice), but I
| already have cross-platform apps that would benefit from this
| functionality, and web is a fallback there, not the main
| option.
| shubham_saboo wrote:
| Wow, this is a really cool way to build full-fledged search,
| and in a notebook too!
|
| Does it work end-to-end with PDF as a data structure or do we
| have to use OCR and parse the text first to be able to search it,
| really curious?
| alexcg1 wrote:
| The version in the notebook is just for simple text-based PDFs.
| I wrote some posts on our company blog[1] about the sheer
| agonies of dealing with PDF as a data format, so wanted to
| stick with as simple as possible for now.
|
| That said, I'm planning future notebooks where you can perform
| text-to-image or image-to-image search, integrate OCR, scale it
| up, serve it, deploy it, etc.
|
| [1] https://medium.com/jina-ai
| shubham_saboo wrote:
| Awesome, will be on the lookout for that!
| alexcg1 wrote:
| We've got quite a few other notebooks for other kinds of
| search on the blog. Would love to hear your thoughts!
| spaetzleesser wrote:
| "PDF as a data structure"
|
| Don't. PDF is a terrible format for storing machine readable
| data. You lose a ton of information while you create the PDF
| which you then painstakingly have to get back later (if that's
| even possible)
| alexcg1 wrote:
| I may have misworded it (if I wrote those words - PDF rots
| the brain and my memory likewise).
|
| Agreed on the rest. PDFs don't store machine-readable data.
| Often just pixelated scanned hot garbage dumpster fire text.
|
| I hate PDFs but have to work with the satanforesaken things.
| Hence the notebook. It's my little way of trying to give my
| little PDF-bespoked-hellscape a tiny little glow-up.
| rahimnathwani wrote:
| Under the hood, it uses
| https://github.com/pdfminer/pdfminer.six which expects the text
| to be stored as text.
| alexcg1 wrote:
| You mean the PDFSegmenter Executor in the notebook?
| rahimnathwani wrote:
| Yes
| alexcg1 wrote:
| PDFSegmenter also extracts images, which can then be
| OCR'ed in the next step of the pipeline
| alexcg1 wrote:
| Incidentally Jina Hub [0] has a few OCR Executors [1][2] you
| could integrate into my notebook (though you'd have to do some
| rewiring to take images into account since it's a text-based
| notebook)
|
| [0] https://hub.jina.ai/
|
| [1] https://hub.jina.ai/executor/w4p7905v
|
| [2] https://hub.jina.ai/executor/78yp7etm
| fzliu wrote:
| I just tried this on all the papers I downloaded over the past
| couple months - cool stuff.
|
| How well would this work in a production setting, e.g. when
| searching over millions of PDFs on arxiv (soon to be tens of
| millions)? Follow-up: have you tried using a vector database such
| as Milvus as the key piece of underlying infrastructure to avoid
| having to implement deletes, failover, scaling, etc?
| https://zilliz.com/learn/what-is-vector-database
| alexcg1 wrote:
| In terms of matching embeddings and performing similarity
| search on text/images - folks are already using the framework
| (Jina) for that and getting decent results.
|
| In terms of processing the PDFs and extracting that data. idk.
| That depends on a lot of factors - e.g. do you need to OCR the
| PDFs or can you just extract text directly? Either way, should be
| possible to write a module and then easily scale it up (Jina
| supports shards/replicas). Anyway, lemme know. I'm in talks
| with folks about this kind of shitshow...uh...use case now.
|
| Jina supports multiple vector database backends, like Weaviate,
| Qdrant and others. For others (like Milvus), suggest you ask on
| the Slack [0] - responses tend to be fast.
|
| [0] https://slack.jina.ai
| gapovaj742 wrote:
| okay but what if my PDF is non parseable? Not sure if Python's
| any good for that
| alexcg1 wrote:
| In that case I'd use:
|
| 1. PDFSegmenter (in the notebook) - extract the images of the
| text (yup, it does images too)
|
| 2. An OCR Executor [0][1] from Jina Hub [2] to extract the text
| from the images
|
| 3. Actually splice the text chunks together to be what you'd
| expect - that's the tricky part. Even text splitting over pages
| can be tricky to reassemble properly. PDFs are a pain in the
| butt frankly.
|
| [0] https://hub.jina.ai/executor/78yp7etm
|
| [1] https://hub.jina.ai/executor/w4p7905v
|
| [2] https://hub.jina.ai
| nicodjimenez wrote:
| Mathpix PDF search is fully visually powered and does not use
| underlying PDF metadata, even working on handwriting. It's a
| great choice for researchers (especially in STEM) who want to
| build a searchable archive of PDFs.
| simonw wrote:
| Amazon Textract does a phenomenal job of extracting text from
| dodgy scanned PDFs - I've been running it against scanned
| typewritten text and even handwritten journal text from the
| 1880s with great results.
|
| I built a tool for running OCR against every PDF in an S3
| bucket (which costs about $1.50/thousand pages) here:
| https://simonwillison.net/2022/Jun/30/s3-ocr/
| alexcg1 wrote:
| Wow, this post really took off! If anyone wants to read some of
| my blog posts on building PDF search engines (and the pain,
| torment and anguish that it causes) read:
|
| - https://medium.com/jina-ai/building-an-ai-powered-pdf-search...
|
| - https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...
|
| - https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...
| [deleted]
| Malp wrote:
| Great stuff, I went down the rabbit hole of building something
| similar for synthesizing flash cards + Q/A pairs from textbook
| PDFs about a year ago, and I would also emphasize that PDF
| search is a janky nightmare to get within the ballpark of
| usability :')
| alexcg1 wrote:
| I feel your pain my brother(?) [0] in suffering. That's why I
| started simple in the notebook. Even trying to go a little
| more complex just leads to exponential rabbit holes and
| footguns.
|
| [0] based on typical HN demographics, no assumptions here
| PaulHoule wrote:
| Does it really work better than a simple tfidf?
|
| I worked on a neural search engine just when deep networks were
| taking off and we knew that it worked because we had test data
| that said certain documents were relevant for certain queries so
| we could compute precision and recall curves. My experience was
| that if the AUC metric is substantially improved customers really
| notice the difference.
|
| Very few search vendors do this kind of testing because it is
| expensive and because enterprise customers seem to care more that
| there are connectors to 800+ external systems than if the search
| results are any good.
|
| The main trouble I see with pdf search is that text extracted
| from pdf files is full of junk punctuation including spaces so if
| you are trying a bag of words based search the words are
| corrupted. Seems to me you could build a neural model that works
| around the brokenness of PDF but that isn't 'download a model
| from spacy and pray' but would be a big job that starts with
| getting 10 GB+ of PDF text.
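To make the evaluation idea above concrete, here is a small sketch of precision and recall at a cutoff k, given a ranked result list and a set of documents judged relevant for the query. The function name and the toy document ids are illustrative assumptions. Sweeping k over the ranking yields the points of a precision/recall curve.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Return (precision, recall) for the top k results of one query."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Precision: share of returned results that are relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: share of all relevant documents that were returned.
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
```

Running the same labeled queries against a tf-idf baseline and a neural encoder gives two curves that can be compared directly.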
| alexcg1 wrote:
| I'll agree that there's quite a bit of junk punctuation in the
| extracted sentences (and sentence fragments), quite often from
| short footnotes in the Wiki articles. Getting "good" PDFs with
| open usage rights was a bit tricky, especially in a super
| simple PDF format. I ended up PDF-printing from Chrome.
|
| Needless to say, working with PDFs makes me want to pull my
| hair out.
|
| I also ended up writing the SpacySentencizer Executor instead
| of using a "vanilla" sentencizer. That led to consistent
| sentence splitting (so "J.R.R. Tolkien turned to pg. 3" would
| be one sentence, not 5)
|
| For testing, Jina allows you to swap out encoders with just a
| couple of lines of code, so trying different methods out should
| work just fine.
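The "one sentence, not 5" point can be illustrated without spaCy. The sketch below is not the SpacySentencizer Executor from the notebook. It is a deliberately simple stand-in that contrasts splitting on every period with a splitter that skips initials and a small, hand-picked abbreviation list (an assumption for illustration, far from exhaustive).

```python
import re

# Illustrative only; a real splitter needs a much larger list or a model.
ABBREVIATIONS = {"pg.", "mr.", "dr.", "vs."}


def naive_split(text):
    """Split on every period, which shreds initials and abbreviations."""
    return [part for part in text.split(".") if part]


def split_sentences(text):
    """Split on sentence-final punctuation, skipping initials/abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends = token.endswith((".", "!", "?"))
        is_abbrev = token.lower() in ABBREVIATIONS
        # Matches dotted initials such as "J.R.R." or "e.g."
        is_initials = re.fullmatch(r"(?:[A-Za-z]\.)+", token) is not None
        if ends and not is_abbrev and not is_initials:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

On "J.R.R. Tolkien turned to pg. 3" the naive version returns five fragments, while the second version keeps the sentence whole.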
| PaulHoule wrote:
| I dunno, you can download a million or so PDFs from arxiv.org
| and even more from archive.org. They aren't hard to find.
|
| There is something to say for roundtripping PDFs from source
| you control (you can accurately model the corruption produced
| by a particular system) but you will certainly see new and
| different phenomena if you try more.
|
| I'd agree that spacy's sentence segmentation is better than
| many of the alternatives.
| alexcg1 wrote:
| If new and different phenomena means new kinds of
| corruption and downright weird behavior I'll end up having
| no hair left!
|
| Even printing the same page to PDF with Chrome and Firefox
| delivers quite different results. Firefox was often
| combining "f" and "i" into the "ﬁ" ligature [0] which totally
| changed the meaning of "finished" for example.
|
| Downloading a lot of random PDFs from arxiv would be great
| for making something battle-hardened and robust (and I'd
| love to get the chance to do it sometime) but I didn't have
| the time (or the remaining hair) to do it this time round.
|
| [0] https://www.compart.com/en/unicode/U+FB01
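One possible way to undo the ligature problem described above, as a sketch rather than what the notebook does: Unicode compatibility normalization (NFKC) maps U+FB01 back to the two letters "f" and "i". The function name is hypothetical. Note that NFKC also rewrites other compatibility characters (for example superscript digits and non-breaking spaces), which may or may not be wanted.

```python
import unicodedata


def normalize_extracted_text(text):
    """Expand ligatures and other compatibility characters via NFKC."""
    return unicodedata.normalize("NFKC", text)
```

Applying the same normalization to both the indexed text and the query keeps "finished" searchable regardless of which browser produced the PDF.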
| alexcg1 wrote:
| And +1 to spaCy. I typically use it over Transformers
| because it's SO much faster. I just used Transformers in
| this example for a change. My Stack Overflow search
| notebook [0] uses spaCy.
|
| [0] https://colab.research.google.com/github/jina-
| ai/workshops/b...
| nicodjimenez wrote:
| Mathpix Snip also supports PDF search, including for handwritten
| content, and including math symbols in equations.
|
| Disclaimer: I'm the founder.
| ok_computer wrote:
| Mathpix snip for pdf to Latex is excellent. Thank you for the
| free tier. It is helpful transcribing pdf math homework sets to
| use in the solution document without bugging the instructor for
| their source.
| alexcg1 wrote:
| Oh, nifty! This is more a demo of a PDF search engine that you
| could (in parts 1 thru x of the series) deploy to an intranet
| (for internal knowledge search) or internet (for general
| search), rather than a collaborative tool.
|
| For handwritten/math symbols, I'm sure it wouldn't be too hard
| to integrate something. The Jina Flow [0] concept makes
| integrating new Executors [1] pretty easy.
|
| I LOVE the testimonials on the site btw!
|
| [0] https://docs.jina.ai/fundamentals/flow/
|
| [1] https://docs.jina.ai/fundamentals/executor/
| Stampo00 wrote:
| Pardon me while I go add Optimus Prime to my corporate
| letterhead.
| [deleted]
___________________________________________________________________
(page generated 2022-07-25 23:01 UTC) |