BlipFiles

How to extract text from a scanned PDF (OCR)

Got a scanned doc and can't copy a single word out of it? OCR fixes that. Convert it to a searchable PDF or plain text in seconds.

4 min read · Updated on April 25, 2026

You receive a scanned contract and try to highlight a clause to paste into an email — nothing happens. The cursor passes right over the text without reacting. That's because the "text" in the PDF is actually a photograph of a piece of paper. To any program, you're just selecting a JPG.

OCR (Optical Character Recognition) is the technology that lets a machine "read" those images and convert them into real text. The result: a document you can search with Cmd/Ctrl-F, copy passages from, edit in Word, index for search, or feed to an LLM.

How to tell if a PDF is scanned

3-second test: open the PDF and try to select a word with your mouse. If it highlights cleanly word by word, the PDF has real text and doesn't need OCR. If you can only draw a rectangle on top of it, it's an image — you need OCR.

Another tell: hit Cmd/Ctrl-F and search for a word you KNOW is in the document. If the search comes up empty, the content is almost certainly an image.
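If you have a whole folder of PDFs to triage, the same test can be automated. Here is a minimal sketch, assuming you have already pulled the text of each page out with a PDF library of your choice (pypdf, for example). The function name `needs_ocr` and the 20-character threshold are illustrative assumptions, not part of any tool mentioned in this article.

```python
def needs_ocr(page_texts, min_chars=20):
    """Return the 1-based numbers of pages that look like scanned images.

    page_texts: one string per page, as produced by whatever PDF text
    extractor you use (hypothetical input for this sketch).
    min_chars: assumed threshold; pages with fewer non-space characters
    are treated as image-only.
    """
    scanned = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join((text or "").split())  # drop all whitespace
        if len(visible) < min_chars:
            scanned.append(number)
    return scanned


pages = ["This agreement is made between the parties below.", "", "  \n "]
print(needs_ocr(pages))  # [2, 3]
```

Pages that come back in the list have little or no selectable text, so they are the ones worth sending through OCR.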

When OCR is the right call

  • Old digitized contracts — to extract clauses, dates, amounts
  • Scanned receipts — to populate expense spreadsheets
  • Books and academic papers — to cite passages, translate
  • HR documents — IDs, payslips, certificates for record-keeping
  • Medical history — to digitize old patient records
  • Field research — partially handwritten survey forms

Step-by-step: extract text with OCR

1. Upload the scanned file

Works with PDF, JPG, PNG and TIFF. You can upload multi-page documents — OCR processes everything in one shot and preserves the order.

2. Pick the content language

We support English, Portuguese, and Spanish. OCR uses different models per language — picking the wrong one tanks accuracy. If your document is multilingual (a bilingual report, for instance), process each part separately.
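For a multilingual document, the splitting step can be planned before you upload anything. The sketch below assumes you already know the language of each page and have labeled it yourself; the labels "en" and "pt" and the function name `split_by_language` are illustrative, not identifiers used by the tool.

```python
from itertools import groupby


def split_by_language(page_languages):
    """Group consecutive same-language pages into (language, first, last) runs."""
    runs = []
    numbered = enumerate(page_languages, start=1)
    for lang, group in groupby(numbered, key=lambda item: item[1]):
        pages = [number for number, _ in group]
        runs.append((lang, pages[0], pages[-1]))
    return runs


print(split_by_language(["en", "en", "pt", "pt", "pt", "en"]))
# [('en', 1, 2), ('pt', 3, 5), ('en', 6, 6)]
```

Each run is one page range to extract and process on its own, with the matching language selected.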

3. Choose output format

  • Searchable PDF — keeps the original visual but adds an invisible text layer on top. You can Cmd/Ctrl-F, copy and paste normally.
  • Plain text (.txt) — just the extracted content, no formatting. Great for spreadsheets, importing into systems, feeding to AI.
  • Word (.docx) — converts with basic formatting preserved (paragraphs, alignment). Good for editing.

4. Process and download

OCR is slower than other conversions (each page takes 2-10 seconds depending on resolution). When it finishes, you download the file in your chosen format.
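To get a feel for what that means on a long document, here is a quick back-of-the-envelope sketch. It uses the 2 to 10 seconds per page quoted above and assumes pages are processed one after another, which may not match how the service actually schedules work.

```python
def estimate_ocr_seconds(page_count, low_per_page=2, high_per_page=10):
    """Return (best_case, worst_case) total seconds for a document."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return page_count * low_per_page, page_count * high_per_page


low, high = estimate_ocr_seconds(40)
print(f"40 pages: about {low} to {high} seconds")
# 40 pages: about 80 to 400 seconds
```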

How to improve OCR quality

OCR accuracy depends heavily on the source image quality. Some tips:

  • Scan at 300 DPI minimum — below that, small letters blur
  • Make sure the page is straight — tilted scans and angled shots confuse the recognition
  • Clean stains and folds before scanning — marks become random characters
  • Prefer white background and black ink — high contrast = better reading
  • Avoid screen captures — moiré and pixelization hurt accuracy
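Not sure whether an existing scan meets the 300 DPI mark? You can work it out from the image's pixel size and the paper size. This is a rough sketch that assumes the scan covers the full sheet with no cropping or margins added; the US Letter dimensions and the function name `effective_dpi` are example choices.

```python
def effective_dpi(width_px, height_px, width_in, height_in):
    """Return the lower of the horizontal and vertical pixels-per-inch."""
    return min(width_px / width_in, height_px / height_in)


LETTER = (8.5, 11.0)  # US Letter in inches (example paper size)

for width_px, height_px in [(2550, 3300), (1275, 1650)]:
    dpi = effective_dpi(width_px, height_px, *LETTER)
    verdict = "ok" if dpi >= 300 else "rescan at higher resolution"
    print(f"{width_px}x{height_px}: {dpi:.0f} DPI, {verdict}")
# 2550x3300: 300 DPI, ok
# 1275x1650: 150 DPI, rescan at higher resolution
```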

Bonus: OCR + other tools

OCR unlocks several follow-ups:

  • OCR + Compress: a searchable PDF still carries the page images, so compress it after OCR; a plain-text or Word export is much lighter, since text weighs far less than an image
  • OCR + Word: export to .docx for editing and review
  • OCR + Excel: if the document is tabular, the Excel converter splits the recognized text into columns
  • OCR + ChatGPT: paste the extracted text into an AI assistant to summarize, translate, or analyze
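As an illustration of the tabular case, here is one way to split plain-text OCR output into columns yourself. It is only a sketch: it assumes the extracted text keeps column gaps as tabs or runs of two or more spaces, which OCR output does not always do, and it says nothing about how the Excel converter works internally. The sample receipt lines are made up.

```python
import csv
import io
import re


def text_to_rows(ocr_text):
    """Split each non-empty line into cells at tabs or runs of 2+ spaces."""
    rows = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\t+|\s{2,}", line))
    return rows


sample = (
    "Date        Item          Amount\n"
    "2026-03-02  Taxi          18.50\n"
    "2026-03-03  Hotel night   120.00\n"
)

buffer = io.StringIO()
csv.writer(buffer).writerows(text_to_rows(sample))
print(buffer.getvalue())
```

"Hotel night" stays in one cell because its words are separated by a single space. The resulting CSV opens directly in a spreadsheet.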

Frequently asked questions

How accurate is OCR?

On printed, clean, high-resolution text: 98-99%. On legible handwriting: 70-90%. On scribbles: 40-60%. Always proofread when accuracy matters (contracts, accounting data).
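Those percentages sound high until you turn them into error counts. The sketch below assumes the figures are per-character accuracy, which the numbers above do not specify, and uses 2,000 characters as a stand-in for one dense page.

```python
def expected_errors(char_count, accuracy):
    """Estimate how many characters are wrong at a given accuracy (0.0 to 1.0)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(char_count * (1.0 - accuracy))


cases = [("clean print", 0.98), ("legible handwriting", 0.80), ("scribbles", 0.50)]
for label, accuracy in cases:
    errors = expected_errors(2000, accuracy)
    print(f"{label}: about {errors} wrong characters per 2,000")
# clean print: about 40 wrong characters per 2,000
# legible handwriting: about 400 wrong characters per 2,000
# scribbles: about 1000 wrong characters per 2,000
```

Even at 98%, a single page can hide dozens of wrong characters, and one of them could sit in an amount or a date.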