Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript
Project description
PyPDFOCR - Tesseract-OCR based PDF filing
This program will help manage your scanned PDFs by doing the following:
More links:
Usage:
Single conversion:
If you have a language pack installed, then you can specify it with the -l option:
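If you drive the conversion from a script rather than a shell, the invocation can be sketched roughly as below. This is a minimal sketch, not part of PyPDFOCR itself: the helper names are hypothetical, it assumes the pypdfocr command is on your path and takes the PDF filename as a positional argument, and "spa" is only an example language code.

```python
import subprocess


def build_pypdfocr_command(pdf_path, language=None):
    """Return the argument list for one pypdfocr conversion (hypothetical helper)."""
    command = ["pypdfocr"]
    if language:
        # -l selects an installed Tesseract language pack, e.g. "spa".
        command += ["-l", language]
    # Assumption: the scanned PDF is passed as a positional argument.
    command.append(pdf_path)
    return command


def convert(pdf_path, language=None):
    """Run the conversion; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_pypdfocr_command(pdf_path, language), check=True)
```

Building the argument list separately from running it makes it easy to log or inspect the command before anything is executed.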
Automatic filing:
To automatically move the OCR’ed PDF to a directory based on a keyword, use the -f option and specify a configuration file (described below):
You can also do this in folder monitoring mode:
Filing based on filename match:
If no keywords match the contents of the PDF, you can optionally allow it to fall back to looking for keyword matches in the PDF filename using the -n option. For example, your scanner may always name receipts like receipt_2013_12_2.pdf, and you want to move them to a folder called ‘receipts’. Assuming you have a keyword receipt matching to folder receipts in your configuration file as described below, you can run the following and have this filed even if the content of the PDF does not contain the text ‘receipt’:
Configuration file for automatic PDF filing
The config.yaml file above is a simple folder to keyword matching text file. It determines where your OCR’ed PDFs (and optionally, the original scanned PDF) are placed after processing. An example is given below:
The target_folder is the root of your filing cabinet. Any PDF moving will happen in sub-directories under this directory.
The folders section defines your filing directories and the keywords associated with them. In this example, we have three filing directories (finances, travel, receipts), and some associated keywords for each filing directory. For example, if your OCR’ed PDF contains the phrase “american express” (in any upper/lower case), it will be filed into docs/filed/finances.
The default_folder is where the OCR’ed PDF is moved to if there is no keyword match.
The original_move_folder is optional (you can comment it out with # in front of that line), but if specified, the original scanned PDF is moved into this directory after OCR is done. Otherwise, if this field is not present or commented out, your original PDF will stay where it was found.
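The keyword matching described above can be sketched in a few lines of Python. This is an illustration of the idea only, not PyPDFOCR's actual code: the function names are hypothetical, the configuration is shown as an in-memory dict with two of the example folders, and the default_folder path is made up for the example.

```python
# Hypothetical in-memory version of the filing configuration.
CONFIG = {
    "target_folder": "docs/filed",
    "default_folder": "docs/filed/manual_sort",  # assumed path, for illustration
    "folders": {
        "finances": ["american express"],
        "receipts": ["receipt"],
    },
}


def find_folder(text, folders):
    """Return the first folder with a keyword occurring in text, ignoring case."""
    lowered = text.lower()
    for folder, keywords in folders.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return folder
    return None


def choose_destination(ocr_text, pdf_filename, config, match_filename=False):
    """Pick a filing directory from the OCR text, optionally falling back to the filename."""
    folder = find_folder(ocr_text, config["folders"])
    if folder is None and match_filename:
        # Mirrors the -n behaviour: try the PDF filename when the content has no match.
        folder = find_folder(pdf_filename, config["folders"])
    if folder is None:
        return config["default_folder"]
    return config["target_folder"] + "/" + folder
```

With this sketch, a statement containing "AMERICAN EXPRESS" lands in docs/filed/finances, while a file named receipt_2013_12_2.pdf with no matching content is only filed under receipts when the filename fallback is switched on.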
If there is any naming conflict during filing, the program will add an underscore followed by a number to each filename, in order to avoid overwriting files that may already be present.
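One way such conflict handling can work is sketched below. The helper name is hypothetical, and two details are assumptions rather than documented behaviour: that the number goes before the file extension, and that numbering starts at 1.

```python
import os


def unique_path(path):
    """Return path unchanged if it is free, else path with _1, _2, ... before the extension."""
    if not os.path.exists(path):
        return path
    stem, extension = os.path.splitext(path)
    counter = 1
    # Keep counting until a filename is found that does not exist yet.
    while os.path.exists(f"{stem}_{counter}{extension}"):
        counter += 1
    return f"{stem}_{counter}{extension}"
```

Under these assumptions, filing a second statement.pdf into a folder that already has one would produce statement_1.pdf instead of overwriting the first.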
Evernote upload:
Evernote authentication token
To enable Evernote support, you will need to get a developer token for your Evernote account. You should note that this script will never delete or modify existing notes in your account, and limits itself to creating new Notebooks and Notes. Once you get that token, you copy and paste it into your configuration file as shown below:
Evernote filing usage
To automatically upload the OCR’ed PDF to a folder based on a keyword, use the -e option instead of the -f auto filing option.
Similarly, you can also do this in folder monitoring mode:
Evernote filing configuration file
The config file shown above only needs to change slightly. The folders section is completely unchanged, but note that target_folder is the name of your “Notebook stack” in Evernote, and the default_folder should just be the default Evernote upload notebook name.
Auto email
You can have PyPDFOCR email you every time it converts a file and files it. You need to first specify the following lines in the configuration file and then use the -m option when invoking pypdfocr:
Advanced options
Fine-tuning Tesseract/Ghostscript/others
You can specify Tesseract and Ghostscript executable locations manually, as well as the number of concurrent processes allowed during preprocessing and tesseract. Use the following in your configuration file:
Handling disk time-outs
If you need to increase the time interval (default 3 seconds) between new document scans when pypdfocr is watching a directory, you can specify the following option in the configuration file:
Installation
Using pip
PyPDFOCR is available on PyPI, so you can just run:
Please note that some of the 3rd-party libraries required by PyPDFOCR will require some build tools, especially on a default Ubuntu system. If you run into any issues using pip install, you may want to install the following packages on Ubuntu and try again:
For those on Windows, because it’s such a pain to get all the PIL and PDF dependencies installed, I’ve gone ahead and made an executable called pypdfocr.exe.
You still need to install Tesseract, GhostScript, etc. as detailed below in the external dependencies list.
Manual install
Clone the source directly from github (you need to have git installed):
Then, install the following third-party python libraries:
These can all be installed via pip:
You will also need to install the external dependencies listed below.
External Dependencies
PyPDFOCR relies on the following (free) programs being installed and in the path:
Poppler is only required if you want pypdfocr to figure out the original PDF resolution automatically; just make sure you have pdfimages in your path. Note that the xpdf provided pdfimages does not work for this, because it does not support the -list option to list the table of images in a PDF file.
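A quick way to confirm that the external programs are actually on your path is sketched below. The executable names are assumptions (Ghostscript is commonly installed as gs on Linux and Mac, but has a different name on Windows), and the function names are hypothetical.

```python
import shutil

# Assumed executable names; adjust for your platform.
REQUIRED = ["tesseract", "gs"]
OPTIONAL = ["pdfimages"]  # from Poppler, only needed for automatic resolution detection


def find_executables(names):
    """Map each program name to its full path, or None if it is not on the path."""
    return {name: shutil.which(name) for name in names}


def missing_required():
    """Return the required programs that could not be found on the path."""
    return [name for name, path in find_executables(REQUIRED).items() if path is None]
```

Note that this only checks that a pdfimages binary exists; it does not tell you whether it is the Poppler build or the xpdf one that lacks the -list option.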
On Mac OS X, you can install these using homebrew:
On Windows, please use the installers provided on their download pages.
** Important ** Tesseract version 3.02.02 or newer required (apparently 3.02.01-6 and possibly others do not work due to a hocr output format change that I’m not planning to address). On Ubuntu, you may need to compile and install it manually by following these instructions.
Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees) then you need to find your tessdata directory and do the following:
osd stands for Orientation and Script Detection, so you need to copy the .traineddata for whatever language you want to scan in as osd.traineddata. If you don’t do this step, then any landscape document will produce garbage.
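The copy step described above can be sketched in Python as follows. The function name is hypothetical, "eng" is only an example language, and you need to supply the location of your own tessdata directory.

```python
import os
import shutil


def install_osd_data(tessdata_dir, language="eng"):
    """Copy <language>.traineddata to osd.traineddata inside tessdata_dir."""
    source = os.path.join(tessdata_dir, f"{language}.traineddata")
    target = os.path.join(tessdata_dir, "osd.traineddata")
    # Note: this overwrites any osd.traineddata that is already present.
    shutil.copyfile(source, target)
    return target
```

You may need elevated permissions to write into the tessdata directory if Tesseract was installed system-wide.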
Disclaimer
While test coverage is at 84% right now, Sphinx docs generation is at an early stage. The software is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
Todo list
Release history
0.9.1
0.9.0
0.8.5
0.8.4
0.8.3
0.8.2
0.8.1
0.8.0
0.7.6
0.7.5
0.7.4
0.7.3
0.7.2
0.7.1
0.7.0
0.6.1
0.6.0
0.5.4
0.5.3
0.5.2
0.5.1
0.5
0.4.1
0.4
0.3.1
0.3
0.2.2
0.2.1
0.2