What Is OCR?
OCR (Optical Character Recognition) is the technology that converts images of text, such as scanned documents, photos of signs, and screenshots, into machine-readable text. Modern OCR engines use deep learning models trained on large collections of document images to recognize characters with high accuracy across fonts, sizes, and orientations.
The leading open-source OCR engine is Tesseract, originally developed at Hewlett-Packard, later sponsored by Google, and now maintained as a community open-source project. It supports over 100 languages and can be used via command line, Python, or JavaScript (Tesseract.js).
OCR Accuracy Factors
- Image resolution: About 300 DPI is the commonly recommended minimum for reliable OCR results. At lower resolutions, characters are too small for the engine to resolve, which causes recognition errors.
- Image quality: Blurry, skewed, or low-contrast images reduce accuracy significantly.
- Font type: Clean, standard fonts (Arial, Times New Roman) are recognized more accurately than decorative typefaces or handwriting.
- Background noise: Watermarks, stamps, and complex backgrounds interfere with text detection.
- Language: OCR engines need to be configured for the correct language to use the right character set and dictionary.
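The resolution guideline in the list above can be checked before running OCR. The following is a minimal sketch, assuming you know the physical width of the scanned page; `estimateDpi`, `upscaleFactor`, and `TARGET_DPI` are hypothetical helper names introduced for illustration and are not part of Tesseract or Tesseract.js.

```javascript
// Hypothetical helpers: estimate scan resolution and the upscale factor
// needed to reach a target DPI. Not part of any OCR library.
const TARGET_DPI = 300; // target taken from the resolution guideline above

function estimateDpi(pixelWidth, physicalWidthInches) {
  if (pixelWidth <= 0 || physicalWidthInches <= 0) {
    throw new RangeError('pixelWidth and physicalWidthInches must be positive');
  }
  return pixelWidth / physicalWidthInches;
}

function upscaleFactor(currentDpi, targetDpi = TARGET_DPI) {
  // Never downscale: a factor of 1 means the image already meets the target.
  return Math.max(1, targetDpi / currentDpi);
}

// A US Letter page (8.5 in wide) scanned at 1275 px across is 150 DPI,
// so it would need to be doubled in size to reach the 300 DPI target.
const dpi = estimateDpi(1275, 8.5);
console.log(dpi, upscaleFactor(dpi)); // 150 2
```

Upscaling adds pixels but not detail, so rescanning at a higher resolution is preferable whenever the original document is available.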
Using Tesseract.js in the Browser
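import Tesseract from 'tesseract.js';

// Run OCR on an image (for example a File from a file input) and return the recognized text.
async function extractText(imageFile) {
  const { data: { text } } = await Tesseract.recognize(
    imageFile,
    'eng', // language code; must match the language of the document
    {
      // Log status and progress (0 to 1) while the engine loads and runs
      logger: m => console.log(m.status, m.progress)
    }
  );
  return text.trim();
}

// Usage with file input
document.getElementById('fileInput').addEventListener('change', async (e) => {
  const file = e.target.files[0];
  if (!file) return; // the user cancelled the file dialog
  const output = document.getElementById('output');
  try {
    output.textContent = await extractText(file);
  } catch (err) {
    output.textContent = `OCR failed: ${err.message}`;
  }
});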
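The logger callback in the example above receives messages with a `status` string and a `progress` value between 0 and 1. As a small sketch, a hypothetical `formatProgress` helper (not part of Tesseract.js) could turn those messages into a line suitable for showing to the user instead of logging to the console:

```javascript
// Hypothetical helper: format a Tesseract.js logger message for display.
function formatProgress(message) {
  // Treat a missing progress value as 0 so the output is always well-formed.
  const percent = Math.round((message.progress ?? 0) * 100);
  return `${message.status}: ${percent}%`;
}

console.log(formatProgress({ status: 'recognizing text', progress: 0.42 }));
// recognizing text: 42%
```

It could then be wired in as `logger: m => { statusElement.textContent = formatProgress(m); }`, where `statusElement` is an assumed element on the page.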
Pre-processing for Better Results
Before running OCR, pre-processing the image can dramatically improve accuracy:
- Deskew: Rotate the image to align text horizontally.
- Binarize: Convert to black and white to improve contrast.
- Denoise: Remove speckles and artifacts.
- Scale up: Upscale small images to at least 300 DPI equivalent.
In Python, the Pillow library and OpenCV are commonly used for image pre-processing before the image is passed to Tesseract.
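In the browser, the binarize step can also be done directly in JavaScript. The sketch below assumes the pixels are in RGBA order, the layout used by the canvas `ImageData.data` array, and applies a fixed threshold of 128. The `binarize` function and the threshold value are illustrative choices, not part of Tesseract.js.

```javascript
// Convert RGBA pixels to pure black and white in place using a fixed threshold.
function binarize(rgba, threshold = 128) {
  for (let i = 0; i < rgba.length; i += 4) {
    // Rec. 601 luma weights approximate perceived brightness.
    const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    const value = luma >= threshold ? 255 : 0;
    rgba[i] = rgba[i + 1] = rgba[i + 2] = value;
    // The alpha channel (rgba[i + 3]) is left unchanged.
  }
  return rgba;
}

// Two pixels: light gray becomes white, dark gray becomes black.
const pixels = new Uint8ClampedArray([200, 200, 200, 255, 50, 50, 50, 255]);
console.log(JSON.stringify(Array.from(binarize(pixels))));
// [255,255,255,255,0,0,0,255]
```

With a canvas, the pixels would typically come from `ctx.getImageData(...)` and be written back with `ctx.putImageData(...)` before the canvas is handed to the OCR call. A fixed threshold is the simplest option; adaptive methods such as Otsu's usually cope better with uneven lighting.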