M1 benefits specifically for OCR?

likegadgets · Mar 22, 2021

I have a project that will require OCRing tens of thousands of pages scanned into PDFs. Typically an i9 Mac is relatively fast (using Acrobat Pro as the software). I am wondering if I would benefit from setting up a couple of dedicated M1 Macs (perhaps minis) to OCR, and whether 16 GB RAM vs 8 GB makes a difference for this application. I would welcome suggestions on faster OCR software for PDFs other than Acrobat Pro, even if a PC-based machine/software combo will achieve faster results.

Thanks in advance

Slartibart · Mar 23, 2021

Single page or multi page image PDFs? In what languages? You just want a searchable PDF, a text file, or a different format for each document?
Depending on your answers, I would run a first trial by compiling/installing the free Tesseract on the M1 and then scripting it.
To install Tesseract you can use Homebrew.

darngooddesign · Mar 23, 2021

What we need is someone to volunteer processing a few of Like's files so they have a speed benchmark.

BrianBaughn · Mar 23, 2021

I just found out last week that uploading a PDF to Google Drive makes it searchable. I think it does the scan during the upload.

likegadgets · Mar 23, 2021

Slartibart said:
Single page or multi page image PDFs? In what languages? You just want a searchable PDF, a text file, or a different format for each document?
Depending on your answers, I would run a first trial by compiling/installing the free Tesseract on the M1 and then scripting it.
To install Tesseract you can use Homebrew.

They are large files, ranging from 1,000 to 5,000 pages each. All in English. In a test, a 5,200-page file took 2 hours and 20 minutes. The PC was specced as:

Intel Core i5-9400 CPU @ 2.9GHz
16GB RAM
64-bit architecture

The upload to Google is an interesting concept, but due to confidentiality it is not feasible. We need the result as a searchable PDF. Will look into Tesseract.

thank you ALL for your comments.
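As a rough way to turn the test figure above into a planning number, here is a minimal sketch that converts "5,200 pages in 2 hours 20 minutes" into a throughput and then projects a larger job. The 50,000-page workload and the machine counts are assumptions for illustration only, and the projection assumes the files can be split evenly across identical machines.

```python
def pages_per_minute(pages: int, hours: int, minutes: int) -> float:
    """Throughput of one OCR run, in pages per minute."""
    total_minutes = hours * 60 + minutes
    return pages / total_minutes


def estimated_hours(total_pages: int, rate_ppm: float, machines: int = 1) -> float:
    """Projected wall-clock hours if the work splits evenly across machines."""
    return total_pages / (rate_ppm * machines) / 60


# Figures from the test run described in the thread.
rate = pages_per_minute(5200, hours=2, minutes=20)
print(f"{rate:.1f} pages/min, {60 / rate:.2f} s/page")

# Hypothetical workload: 50,000 pages is an assumed total, not a number from the thread.
for machines in (1, 2, 4):
    hours = estimated_hours(50_000, rate, machines)
    print(f"{machines} machine(s): about {hours:.1f} hours")
```

Measuring the same seconds-per-page figure on each candidate machine and tool gives a like-for-like comparison before buying hardware.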

dragonfly1 · Mar 23, 2021

Some easy to use OCR apps that convert scanned PDFs into searchable ones. I don't know how fast these are compared to other solutions but they are very easy to use and have some other very useful features.

darngooddesign · Mar 23, 2021

likegadgets said:
They are large files, ranging from 1,000 to 5,000 pages each. All in English. In a test, a 5,200-page file took 2 hours and 20 minutes. The PC was specced as:

Intel Core i5-9400 CPU @ 2.9GHz

16GB RAM

64-bit architecture

The upload to Google is an interesting concept, but due to confidentiality it is not feasible. We need the result as a searchable PDF. Will look into Tesseract.

thank you ALL for your comments.

You can always buy an M1 to run benchmarks and then return it.

guzhogi · Mar 23, 2021

likegadgets said:
They are large files, ranging from 1,000 to 5,000 pages each. All in English. In a test, a 5,200-page file took 2 hours and 20 minutes.

Just out of curiosity (sorry for being nosy), what kind of files are they that need to be 1,000 - 5,000 pages each? You had mentioned that you can't use Google Drive due to confidentiality, so I'm guessing something in the medical, legal, or defense fields? I work in education, so this thread might be useful. Much of our documentation is printed out, but I think some of our copiers have (or at least CAN have) OCR scanning functionality.

likegadgets · Mar 23, 2021

guzhogi said:
Just out of curiosity (sorry for being nosy), what kind of files are they that need to be 1,000 - 5,000 pages each? You had mentioned that you can't use Google Drive due to confidentiality, so I'm guessing something in the medical, legal, or defense fields? I work in education, so this thread might be useful. Much of our documentation is printed out, but I think some of our copiers have (or at least CAN have) OCR scanning functionality.

They are medical and legal files. Unsorted. They scan full file boxes that need to be sorted, information extracted, duplicates eliminated, and other stuff. OCR is just the first step, but an important one. Trying to accelerate this step. So testing software and processing power. I ran the same job on an i3 Windows machine with 16 GB RAM and on an 8-core i9 MBP with 32 GB RAM. Results were very close. I suspect the limiting factor is the Adobe Acrobat OCR function rather than the processing power.

jdb8167 · Mar 23, 2021

Slartibart said:
Single page or multi page image PDFs? In what languages? You just want a searchable PDF, a text file, or a different format for each document?
Depending on your answers I would run a first trial compiling/install the free Tesseract on the M1 and then script it.
To install Tesseract you can use Homebrew.

Tesseract doesn't seem to directly convert PDF files. I can't find a convenient way to convert a PDF to multiple PNG files. There is pdf2image on brew but it seems not to work. It errors out without finding Ghostscript even though brew installs it as a dependency.

crevalic · Mar 23, 2021

jdb8167 said:
Tesseract doesn't seem to directly convert PDF files. I can't find a convenient way to convert a PDF to multiple PNG files. There is pdf2image on brew but it seems not to work. It errors out without finding Ghostscript even though brew installs it as a dependency.

I'm assuming OP has images and has been using Adobe Acrobat Pro to convert them into PDFs while simultaneously performing OCR, rather than having PDFs already, running OCR on those and making new searchable PDFs? Tesseract is perfect for the former, the latter would need a preprocessing step, as you discovered.

I'm pretty sure that pdf2image on brew is a project that has been abandoned like 10? years ago and was Windows-native anyway, so I'm not surprised it's not working for you. There's a pdf2image python library that is newer and should work, though, if you can handle a tiny bit of coding.
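For files in the 1,000 to 5,000 page range, rendering every page at once with the Python pdf2image library would likely use a lot of memory, so one option is to rasterize in page batches. The sketch below assumes pdf2image (and the Poppler tools it wraps) is installed; `page_batches`, `rasterize_batches`, the batch size of 50 and the output file naming are illustrative choices, not something from the thread.

```python
from pathlib import Path


def page_batches(total_pages: int, batch_size: int = 50):
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def rasterize_batches(pdf_path: str, out_dir: str, total_pages: int, dpi: int = 300):
    """Write one PNG per page, rendering only one batch of pages at a time."""
    # Imported lazily so page_batches() can be used without pdf2image installed.
    from pdf2image import convert_from_path  # needs Poppler on the system

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for first, last in page_batches(total_pages):
        images = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for offset, image in enumerate(images):
            # Zero-padded names keep the pages in order when sorted.
            image.save(out / f"page-{first + offset:05d}.png")


# Example: a 120-page file is rendered in three passes.
print(list(page_batches(120, 50)))
```

The resulting PNGs can then be fed to Tesseract page by page.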

crevalic · Mar 23, 2021

@OP, why are you looking into M1 Macs for this at all? Is it because you've read about the good CPU performance? That's mostly just benchmarks right now. Very little actual software is optimized or even exists for M1 at this point, and the performance is only really strong in single-threaded tasks. Anything that can get parallelized or needs significant RAM bandwidth (like OCR) will never be strong on M1, firstly because nobody will optimize the code and secondly because 4 high-performance, memory-starved M1 threads will never be able to compete with the up to 128 threads you can get in a Threadripper-equipped workstation right now, where each thread is at least as strong as M1. The GPU is also not good on M1, nor is it supported anywhere. M1 is a nice low-power laptop SoC that can compete with hungrier mobile Intel CPUs in many tasks, but it's not really a competitive workstation processor or anything like that. You'd also need to use an alpha version of Tesseract on M1.

Maybe take a look at this article. I'm assuming you aren't super techy, considering your posts so far, so the Google Cloud Platform Vision API definitely seems like a really nice choice to me. Considering this is the B2B GCP side of Google, not the consumer Drive stuff, the confidentiality should be more than fine and you'll also get access to support.

When you send an image to Vision API, we must store that image for a short period of time in order to perform the analysis and return the results to you. For asynchronous offline batch operations, the stored image is typically deleted right after the processing is done, with a failsafe Time to live (TTL) of a few hours. For online (immediate response) operations, the image data is processed in memory and not persisted to disk.

Pricing also seems surprisingly good: the first 1,000 images per month are free, then it's $1.50 for every following 1,000 images. That'll run you about $7.50 at most for one of your 5,000-page scans, meaning you can do about 100 5,000-page scans for the price of a single M1 mini, but way faster and without needing any setting up or messing with OCR settings.
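The arithmetic behind that estimate can be sketched as follows, taking the quoted prices at face value (first 1,000 images per month free, then $1.50 per 1,000). Whether billing is pro-rated per image, as assumed here, and what the current prices are would need to be checked against Google's pricing page; `vision_cost` is an illustrative helper, not part of any Google library.

```python
FREE_IMAGES_PER_MONTH = 1000   # free tier as quoted in the post
PRICE_PER_1000_IMAGES = 1.50   # USD, as quoted in the post


def vision_cost(pages: int,
                free: int = FREE_IMAGES_PER_MONTH,
                price_per_1000: float = PRICE_PER_1000_IMAGES) -> float:
    """Estimated monthly cost in USD, assuming pro-rata billing per image."""
    billable = max(0, pages - free)
    return billable / 1000 * price_per_1000


print(vision_cost(5000, free=0))   # one 5,000-page scan, ignoring the free tier
print(vision_cost(5000))           # the same scan with the free tier applied
print(vision_cost(100 * 5000))     # one hundred such scans in a single month
```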

Leon1das · Mar 24, 2021

M1 makes sense for the OCR task due to its unbeatable single core power - and last time I checked Adobe Acrobat is still not multi-core optimized for heavy tasks.

On the other hand Acrobat still uses Rosetta 2, and I am unaware of other OCR software that has native AS binary.

If Google really does OCR during upload, I would temporarily buy extra GDrive storage for 1-2 months and do the conversion there...

You shouldn't worry about privacy, unless it's illegally obtained copyrighted material.

P.S. Google does get to know customers via the various services they provide, but that data is used to offer you products (sometimes intrusive ads, I agree); they do not/can't sell it. In 15 years of use I have had 0 security breaches with Google.
The bad ad company here is Facebook, with a similar business model but also much weaker protection of users' data and numerous breaches over the years, despite 2FA used for logins... I wouldn't scan even my cat food barcode if Facebook offered it...

BrianBaughn · Mar 24, 2021

I don't think the Google OCR is really OCR of image-based PDFs.

guzhogi · Mar 24, 2021

likegadgets said:
They are medical and legal files. Unsorted. They scan full file boxes that need to be sorted, information extracted, duplicates eliminated, and other stuff. OCR is just the first step, but an important one. Trying to accelerate this step. So testing software and processing power. I ran the same job on an i3 Windows machine with 16 GB RAM and on an 8-core i9 MBP with 32 GB RAM. Results were very close. I suspect the limiting factor is the Adobe Acrobat OCR function rather than the processing power.

Okay, never had to deal with that much myself, hence why I asked.

At my job, we have Konica Minolta copiers. They offer scanning of documents to various locations (Scan-to-Email, Scan-to-SMB, Scan-to-FTP, Scan-to-Box, Scan-to-USB, Scan-to-WebDAV, Scan-to-DPWS, Network TWAIN scan). I checked the online documentation of it, and while my company didn't get it, you can get an add-on that enables OCR into searchable PDF and DOCX, XLSX. We also have this print server software, Papercut MF, that provides OCR service, too. Just something for you to consider. Should save a bit of work.

Slartibart · Mar 24, 2021

jdb8167 said:
Tesseract doesn't seem to directly convert PDF files. I can't find a convenient way to convert a PDF to multiple PNG files. There is pdf2image on brew but it seems not to work. It errors out without finding Ghostscript even though brew installs it as a dependency.

well...

Step 1:

convert -density 300 in.pdf -depth 1 -strip -background white -alpha off out.tiff

(or, if you would like to convert specific pages:
convert -density 300 in.pdf[3-6] -depth 1 -strip -background white -alpha off out.tiff
Please be aware that the page index in a PDF starts at 0, so [3-6] selects pages 4 through 7.)

Step 2:

tesseract out.tiff out pdf

In the case described by the OP this can be easily scripted: I would get rid of the TIFFs immediately after running the OCR, create a multi-page searchable PDF (or plain text), etc. I suggest a look into the Tesseract manual.

EDIT: I just checked and thanks to homebrew offering a native M1-version for imagemagick and Tesseract this is... well, surprisingly fast. 😃 I do not have access to PDFs which are composed of thousands of pixel pages though...
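A minimal sketch of how the two steps could be scripted, assuming ImageMagick's `convert` and `tesseract` are on the PATH (for example via Homebrew). Tesseract takes an output base name followed by the `pdf` config to write a searchable PDF, so the result lands at `<base>.pdf`. The function names, the `-ocr` suffix and the `-l eng` language choice are illustrative assumptions; the intermediate TIFF is deleted after each run.

```python
import subprocess
from pathlib import Path


def build_commands(pdf: Path, workdir: Path):
    """Return the rasterize and OCR command lines plus the temp and result paths."""
    tiff = workdir / (pdf.stem + ".tiff")
    out_base = workdir / (pdf.stem + "-ocr")
    rasterize = ["convert", "-density", "300", str(pdf), "-depth", "1", "-strip",
                 "-background", "white", "-alpha", "off", str(tiff)]
    # Tesseract appends ".pdf" to the output base when the "pdf" config is given.
    ocr = ["tesseract", str(tiff), str(out_base), "-l", "eng", "pdf"]
    return rasterize, ocr, tiff, Path(str(out_base) + ".pdf")


def ocr_pdf(pdf, workdir, run=subprocess.run):
    """Rasterize one PDF, OCR it, and remove the intermediate TIFF."""
    pdf, workdir = Path(pdf), Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    rasterize, ocr, tiff, result = build_commands(pdf, workdir)
    try:
        run(rasterize, check=True)
        run(ocr, check=True)
    finally:
        tiff.unlink(missing_ok=True)  # clean up even if a step failed
    return result


# Show the commands that would run for a hypothetical file name.
for cmd in build_commands(Path("box1.pdf"), Path("work"))[:2]:
    print(" ".join(cmd))
```

Calling `ocr_pdf(path, "work")` for each file in a folder would process a whole batch; for very large PDFs, converting page ranges instead of the whole file keeps the TIFF size manageable.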

mainemini · Jul 14, 2021

We use Intel Mac minis in our medical office to OCR hundreds of pages of PDFs each week. FineReader for Mac uses multiple cores on the Intel CPU but wasn't compatible with M1 Macs via Rosetta. They just released an update to fix that issue.

Based on this thread, it seems there are no native M1 OCR solutions yet that don't involve Homebrew.

SuperSven · Jul 14, 2021

Don't know about speed but I used ocrmypdf (brew install ocrmypdf) to deskew multipage PDF scans yesterday.
The result was awesome for me. And it can do multiple things in one run like deskew, optimise and OCR.
No GUI so very nice to script.

OCRmyPDF documentation — ocrmypdf 16.2.1.dev5+g5caf654 documentation
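Since it has no GUI, one way to script it is to build the command line in Python and hand it to `subprocess`. This sketch uses the `--deskew`, `--optimize`, `--jobs` and `-l` options of the ocrmypdf command-line tool; the helper name `ocrmypdf_command`, the optimization level 1 and the default of one job per CPU core are assumptions for illustration.

```python
import os
import subprocess


def ocrmypdf_command(src: str, dst: str, jobs=None, language: str = "eng"):
    """Build an ocrmypdf call that deskews, optimizes and OCRs in one run."""
    jobs = jobs or os.cpu_count() or 1
    return ["ocrmypdf", "--deskew", "--optimize", "1",
            "--jobs", str(jobs), "-l", language, src, dst]


def run_ocrmypdf(src: str, dst: str, **options) -> int:
    """Run ocrmypdf and return its exit code (0 means success)."""
    return subprocess.run(ocrmypdf_command(src, dst, **options)).returncode


# Print the command for a hypothetical pair of file names.
print(" ".join(ocrmypdf_command("scan.pdf", "scan-ocr.pdf", jobs=4)))
```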

Gnattu · Jul 15, 2021

mainemini said:
it seems there are no native M1 OCR solutions yet that don't involve Homebrew.

"Yet". macOS 12 adds a new feature to the Vision framework that is designed exactly for recognizing text in document images. It runs on the Neural Engine and is super-fast (it can even do document recognition in real time on a camera/video feed).

vs40 · Jul 15, 2021

crevalic said:
why are you looking into M1 Macs for this at all? Is it because you've read about the good CPU performance? That's mostly just benchmarks right now, very little actual software is optimized or even exists for M1 at this point and the performance is only really strong in single threaded tasks.

A passively cooled M1 MBA, for example, demolishes every gaming laptop with the latest AMD/Intel CPUs in a PDF export task.
It is not just a benchmark; it is a very real task in many office-related scenarios.
But I'm no expert in OCR and don't know if those tasks are also optimized for M1.

[Attachment: Bildschirmfoto 2021-07-15 um 13.10.56.png (screenshot)]

Wolf1701 · Jul 15, 2021

There's a thing that Acrobat Pro does very well: not just simple OCR but the "Editable text and images" function (formerly ClearScan in Acrobat Pro 11).

This function produces small sized pdf with very high quality text. No other software does the same job (at least I have never found one).

xraydoc · Jul 15, 2021

I know the OP's posts were from some time ago, but with Adobe's recent ARM-optimized release of their flagship apps, hopefully an ARM-optimized version of Acrobat is coming soon.

Jayratch · Dec 8, 2021

likegadgets said:
I have a project that will require OCRing tens of thousands of pages scanned into PDFs. Typically an i9 Mac is relatively fast (using Acrobat Pro as the software). I am wondering if I would benefit from setting up a couple of dedicated M1 Macs (perhaps minis) to OCR, and whether 16 GB RAM vs 8 GB makes a difference for this application. I would welcome suggestions on faster OCR software for PDFs other than Acrobat Pro, even if a PC-based machine/software combo will achieve faster results.

Thanks in advance

How did this project work out for you in the end?

I am kind of cringing at the whole conversation in light of your software choice being Acrobat Pro. Last I checked, Adobe still hadn't updated that app to take advantage of multiple cores. I found this out when I upgraded from a dual core to a quad core back in 2018, only to find that my OCR performance in Acrobat Pro didn't change at all.

Soon after, I switched to a command-line tool called OCRmyPDF, which does utilize all cores, and my OCR speed went up by a factor of around three. It's not actually as efficient as Adobe in single-core performance, but the fact that it put the rest of the cores to work meant it ultimately went a good deal faster.

I just upgraded myself from the quad-core i5 to a 10-core M1 Pro, but haven't had the chance to do a head-to-head performance test yet. When I get the chance, I will run the same file through the program on my M1 Pro and on my six-core i5 Mac Mini. I expect the M1 to be a bit over twice as fast, in OCRmyPDF, but in Acrobat, I expect the M1 to be about the same or possibly even slower, if it has to run through Rosetta. But I'm curious what results you found.
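For a command-line OCR tool that only uses one core per run, another way to keep the remaining cores busy is to process several files side by side. The sketch below is a generic pattern, not something Acrobat supports: `run_parallel` is a hypothetical helper, the file names are made up, and each command is assumed to be a single-threaded CLI invocation (here ocrmypdf limited with `--jobs 1`).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_parallel(commands, workers: int, run=subprocess.run):
    """Run independent commands concurrently; return exit codes in input order."""
    def one(cmd):
        # Each thread just waits on its own child process, so threads suffice here.
        return run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, commands))


# Hypothetical batch: one single-threaded OCR job per input file.
files = ["box1.pdf", "box2.pdf", "box3.pdf"]
commands = [["ocrmypdf", "--jobs", "1", name, name.replace(".pdf", "-ocr.pdf")]
            for name in files]
print(len(commands), "jobs prepared")
```

Calling `run_parallel(commands, workers=4)` would then run up to four jobs at a time. Timing the same batch with different worker counts on each machine gives the head-to-head comparison described above.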

likegadgets · Dec 8, 2021

Jayratch said:
How did this project work out for you in the end?

I am kind of cringing at the whole conversation in light of your software choice being Acrobat Pro. Last I checked, Adobe still hadn't updated that app to take advantage of multiple cores. I found this out when I upgraded from a dual core to a quad core back in 2018, only to find that my OCR performance in Acrobat Pro didn't change at all.

Soon after, I switched to a command-line tool called OCRmyPDF, which does utilize all cores, and my OCR speed went up by a factor of around three. It's not actually as efficient as Adobe in single-core performance, but the fact that it put the rest of the cores to work meant it ultimately went a good deal faster.

I just upgraded myself from the quad-core i5 to a 10-core M1 Pro, but haven't had the chance to do a head-to-head performance test yet. When I get the chance, I will run the same file through the program on my M1 Pro and on my six-core i5 Mac Mini. I expect the M1 to be a bit over twice as fast, in OCRmyPDF, but in Acrobat, I expect the M1 to be about the same or possibly even slower, if it has to run through Rosetta. But I'm curious what results you found.

We ran some tests on M1s and there was no perceptible benefit. We resorted to using multiple Windows/Intel machines. Maybe when there is a recompiled version of Acrobat we will try it out on an M1 Max.

Wolf1701 · Dec 9, 2021

likegadgets said:
We ran some tests on M1s and there was no perceptible benefit. We resorted to using multiple Windows/Intel machines. Maybe when there is a recompiled version of Acrobat we will try it out on an M1 Max.
