How PDF Text Extraction Works
PDF files store text as text objects inside each page's content stream. Each object selects a font and size, sets a position, and draws one or more strings. Text extraction reads these objects, maps the drawn glyphs back to characters, and reconstructs the reading order. This works well for digitally created PDFs (Word exports, web-generated PDFs). Scanned PDFs contain only page images, so there is no text to read until OCR (Optical Character Recognition) has been run on them.
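One practical consequence is that a scanned page usually yields almost no text when extracted. The sketch below flags such pages as OCR candidates. It assumes you already have one extracted string per page; `findPagesNeedingOcr` and the `minChars` threshold are hypothetical, chosen only for illustration, and a legitimately blank page would be flagged too.

```javascript
// Hypothetical helper: returns 1-based page numbers whose extracted text is
// shorter than minChars after trimming. The threshold is an assumption.
function findPagesNeedingOcr(pageTexts, minChars = 10) {
  const flagged = [];
  pageTexts.forEach((text, index) => {
    const visibleLength = (text || '').trim().length;
    if (visibleLength < minChars) {
      flagged.push(index + 1); // page numbers are reported 1-based
    }
  });
  return flagged;
}
```

Pages returned by this check could then be rendered to images and passed to an OCR engine instead of the text extractor.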
PDF Text Object Structure
BT % Begin text
/F1 12 Tf % Font F1, size 12
100 700 Td % Move to position (100, 700)
(Hello, World!) Tj % Show text string
ET % End text
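To make the structure above concrete, here is a minimal sketch that pulls the literal strings shown with the Tj operator out of an uncompressed content stream. `extractTjStrings` is a hypothetical helper, not part of any library. It deliberately ignores much of what a real extractor must handle: compressed streams, hex strings, TJ arrays, octal escapes, and font encodings that map glyph codes to Unicode.

```javascript
// Hypothetical helper: collects literal strings "(...)" that are immediately
// followed by the Tj operator. Assumes the stream is already decompressed.
function extractTjStrings(contentStream) {
  const results = [];
  let i = 0;
  while (i < contentStream.length) {
    const ch = contentStream[i];
    if (ch === '%') {
      // Comment: skip to the end of the line.
      while (i < contentStream.length && contentStream[i] !== '\n') i++;
      continue;
    }
    if (ch !== '(') {
      i++;
      continue;
    }
    // Literal string: read to the matching ')', tracking nested parentheses.
    let depth = 1;
    let text = '';
    i++;
    while (i < contentStream.length && depth > 0) {
      const c = contentStream[i];
      if (c === '\\') {
        // Keep the escaped character as-is (octal and \n-style escapes are not decoded).
        text += contentStream[i + 1] || '';
        i += 2;
        continue;
      }
      if (c === '(') depth++;
      if (c === ')') {
        depth--;
        if (depth === 0) {
          i++;
          break;
        }
      }
      text += c;
      i++;
    }
    // Keep the string only when the operator that follows is Tj.
    if (/^\s*Tj\b/.test(contentStream.slice(i))) {
      results.push(text);
    }
  }
  return results;
}
```

Run against the five-line text object above, this sketch returns a single string, `Hello, World!`. The parenthesized coordinates in the `Td` comment are skipped because comments are discarded before strings are scanned.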
Using PDF.js for Browser Extraction
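import * as pdfjsLib from 'pdfjs-dist';

// In the browser, PDF.js also needs pdfjsLib.GlobalWorkerOptions.workerSrc
// set to the URL of the worker script shipped with your pdfjs-dist version.

// pdfFile is a File or Blob, e.g. from an <input type="file"> element.
async function extractText(pdfFile) {
  const arrayBuffer = await pdfFile.arrayBuffer();
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  let fullText = '';

  // PDF.js page numbers are 1-based.
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    // Each item is one run of text; joining with spaces ignores layout.
    const pageText = content.items
      .map(item => item.str)
      .join(' ');
    fullText += `--- Page ${i} ---\n${pageText}\n`;
  }
  return fullText;
}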
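Joining every item with a space discards line breaks. Each text item returned by `getTextContent()` also carries a `transform` array whose last two entries are the x and y position of the run, so lines can be approximated by grouping items with similar y values. The sketch below works on plain objects shaped like those items; `itemsToLines` and the `yTolerance` default are hypothetical, and the approach is a rough heuristic that will still interleave side-by-side columns.

```javascript
// Hypothetical helper: groups text items into lines by vertical position.
// Assumes each item has { str, transform: [a, b, c, d, x, y] }.
function itemsToLines(items, yTolerance = 2) {
  const lines = [];
  for (const item of items) {
    // Skip empty and whitespace-only runs so joins do not double up spaces.
    if (!item.str || item.str.trim() === '') continue;
    const x = item.transform[4];
    const y = item.transform[5];
    // Reuse an existing line if its baseline is within the tolerance.
    let line = lines.find(l => Math.abs(l.y - y) <= yTolerance);
    if (!line) {
      line = { y, parts: [] };
      lines.push(line);
    }
    line.parts.push({ x, str: item.str });
  }
  return lines
    .sort((a, b) => b.y - a.y) // PDF y grows upward, so larger y comes first
    .map(l =>
      l.parts
        .sort((a, b) => a.x - b.x) // left to right within a line
        .map(p => p.str)
        .join(' ')
    );
}
```

Inside the page loop, `itemsToLines(content.items).join('\n')` could replace the single `join(' ')` when line structure matters. Joining runs with a space can still split a word that the PDF draws as two runs.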
Extraction Limitations
- Scanned PDFs: No embedded text — requires OCR (Tesseract, AWS Textract, Google Vision).
- Column layouts: Multi-column PDFs may have text extracted in the wrong reading order.
- Tables: Table cell boundaries are not preserved — text may run together.
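- Ligatures: Some fonts draw letter pairs such as fi and fl as a single ligature glyph, which may extract as one ligature character (or as a missing or wrong character) rather than as separate letters.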
- Encrypted PDFs: PDFs that require a password to open cannot be extracted without that password.
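Of the limitations above, ligatures are the easiest to repair after extraction, because Unicode defines compatibility decompositions for the Latin ligature characters. The sketch below expands only the characters in the range U+FB00 to U+FB06 and leaves the rest of the text untouched; `expandLigatures` is a hypothetical helper. It cannot help when the font maps a ligature glyph to no character or to the wrong one.

```javascript
// Hypothetical helper: replaces Latin ligature characters (U+FB00 to U+FB06)
// with their plain-letter equivalents using Unicode compatibility decomposition.
function expandLigatures(text) {
  return text.replace(/[\uFB00-\uFB06]/g, ch => ch.normalize('NFKD'));
}
```

Applied to extracted text, this turns a string containing U+FB01 followed by "le" into "file", which keeps plain-text search and comparison working.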