
How to Convert a PDF to Plain Text (.txt)

April 1, 2026 · 5 min read

Plain text is the most universal output a PDF can produce. No fonts, no layout, no images — just the words. That's what you want when feeding documents to a search index, an LLM, a script, or just a clean editable starting point. The conversion is easy; the gotchas are in what gets dropped along the way.

When you actually want plain text

Text extraction is the right tool for:

  • Searching across many PDFs — index plain text, not the PDFs themselves.
  • Feeding documents to AI tools — most LLM pipelines work better on clean text than on raw PDFs.
  • Editing in any text editor — once it's .txt, you can use grep, sed, or any IDE.
  • Reading on a basic e-reader that doesn't render PDFs well.
  • Counting words or doing language analysis — measure what's in the document, not the noise.

If your goal is "edit the document and save it back as a PDF," you probably want PDF to Word instead — text loses formatting that you'd want to preserve.

Method 1: Browser-based extraction

For a single file, a browser tool is the simplest path. Drop the PDF, get a .txt file, done. Docento.app extracts text in the browser without uploading the file — useful when the document is sensitive.

For multi-column documents, look for an extractor that handles columns correctly. Some tools read across columns, mixing left and right paragraphs together; others read down each column in order. The right behaviour depends on the document.

Method 2: Copy and paste

Underrated. Open the PDF in any reader, Ctrl-A to select all, Ctrl-C to copy, paste into a text file. Works for most digital PDFs, fails for scanned PDFs (no selectable text), and is faster than any conversion tool for one-off use.

Caveats:

  • Page breaks may or may not be preserved.
  • Hyphenation at line endings comes through as literal hyphens.
  • Tables become tab-separated text, often misaligned.

Method 3: Command line

For batch extraction, the command line is unbeatable:

  • pdftotext (poppler-utils): pdftotext input.pdf output.txt. Add -layout to preserve column layout, -raw to keep content-stream order, or -table for table detection.
  • pdfplumber (Python): more programmable, handles tables better than pdftotext.
  • mutool: mutool draw -F txt input.pdf — simple text extraction.
  • Apache Tika: works for PDFs and many other formats; useful in pipelines that handle multiple file types.

For batch jobs, see our batch processing guide.
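
As a sketch of what a batch run might look like, here's a small Python loop that shells out to pdftotext for every PDF in a folder. The docs/ directory name is just a placeholder, and it assumes poppler-utils is installed so pdftotext is on your PATH:

  import subprocess
  from pathlib import Path

  # Run pdftotext on every PDF in ./docs, writing one .txt next to each input.
  # Assumes poppler-utils is installed and pdftotext is on the PATH.
  for pdf in sorted(Path("docs").glob("*.pdf")):
      txt = pdf.with_suffix(".txt")
      subprocess.run(["pdftotext", "-layout", str(pdf), str(txt)], check=True)
      print(f"extracted {pdf.name} -> {txt.name}")

Swap -layout for -raw or -table depending on what the downstream step expects.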

Method 4: OCR for scanned PDFs

If the PDF is a scanned image, none of the above work directly — there's no text layer to extract. You need OCR first:

  • Tesseract: free, command-line, supports 100+ languages.
  • Browser OCR tools: WebAssembly Tesseract runs in-browser; slower than native but private.
  • Google Drive: upload the scanned PDF, open it with Google Docs, and Drive runs OCR so the document opens with the extracted text.

See PDF OCR explained and making a PDF searchable.
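
If you'd rather script the OCR step, one possible approach in Python uses the pdf2image and pytesseract packages (both wrap tools you install separately: poppler for rasterising, the tesseract binary for recognition). The file names are placeholders:

  import pytesseract
  from pdf2image import convert_from_path

  # Rasterise each page of the scanned PDF, then run Tesseract on the images.
  # 300 dpi is a reasonable starting point for OCR accuracy.
  pages = convert_from_path("scan.pdf", dpi=300)
  text = "\n\n".join(pytesseract.image_to_string(page) for page in pages)

  with open("scan.txt", "w", encoding="utf-8") as f:
      f.write(text)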

Cleanup steps

Raw extracted text usually needs cleanup before it's useful:

  • Re-join hyphenated words at line breaks ("inter-\nesting" → "interesting").
  • Collapse repeated blank lines that came from page breaks or generous spacing.
  • Re-flow paragraphs that were broken into single lines per print row.
  • Strip page numbers, headers, and footers that repeat on every page. A regex like ^Page \d+ of \d+$ catches the most common pattern.
  • Fix encoding issues — characters like en-dashes and curly quotes sometimes come through as Unicode escapes or replacement marks.

A few lines of Python or shell handle all of these. For one-off cleanup, do it in your text editor.
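
As a rough starting point, a cleanup pass in Python might look like this; the patterns mirror the list above and will need tuning for your documents (file names are placeholders):

  import re

  def clean(text: str) -> str:
      # Re-join words hyphenated across line breaks: "inter-\nesting" -> "interesting"
      text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
      # Strip repeated "Page X of Y" lines left by headers and footers
      text = re.sub(r"^Page \d+ of \d+$\n?", "", text, flags=re.MULTILINE)
      # Collapse runs of blank lines that came from page breaks
      text = re.sub(r"\n{3,}", "\n\n", text)
      return text.strip() + "\n"

  with open("output.txt", encoding="utf-8") as f:
      cleaned = clean(f.read())
  with open("output.clean.txt", "w", encoding="utf-8") as f:
      f.write(cleaned)

Re-flowing paragraphs and fixing encoding quirks usually need document-specific rules, so they're left out here.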

Tables and layout

Tables are the hardest case. Plain text can't represent a 2D table well. Options:

  • Use pdftotext -layout to preserve column positions with spaces. Good for human reading, bad for parsing.
  • Use pdftotext -table for tab-separated output. Good for parsing, but column detection sometimes fails.
  • Extract tables separately with a tool like Camelot or pdfplumber, output as CSV. Best for numerical data.
  • Convert to a structured format — for example, PDF to Excel — and extract from there.
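
If you go the pdfplumber route, a minimal table-to-CSV sketch looks something like this (report.pdf is a placeholder, and detection quality varies a lot by document):

  import csv
  import pdfplumber

  # Dump every table pdfplumber detects into a single CSV.
  with pdfplumber.open("report.pdf") as pdf, \
       open("tables.csv", "w", newline="", encoding="utf-8") as out:
      writer = csv.writer(out)
      for page in pdf.pages:
          for table in page.extract_tables():
              for row in table:
                  writer.writerow([cell or "" for cell in row])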

Multi-column documents

Academic papers, magazines, and many reports use multi-column layouts. Naive extraction reads left-to-right across the page, mixing the columns. To fix:

  • Use a tool that detects columns: pdftotext -layout works for many cases.
  • For complex layouts, pdfplumber lets you specify column boundaries explicitly; a sketch follows below.
  • As a last resort, crop each column into a separate PDF and extract individually. See cropping a PDF.
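
Here's what that might look like for a simple two-column page with pdfplumber, cropping each half and extracting the columns in order. The midpoint split and the file name are assumptions; adjust the bounding boxes to your layout:

  import pdfplumber

  # Crop each page into left and right halves, then extract column by column.
  with pdfplumber.open("paper.pdf") as pdf:
      chunks = []
      for page in pdf.pages:
          mid = page.width / 2
          left = page.crop((0, 0, mid, page.height))
          right = page.crop((mid, 0, page.width, page.height))
          chunks.append(left.extract_text() or "")
          chunks.append(right.extract_text() or "")

  print("\n\n".join(chunks))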

Metadata, footnotes, and references

Plain text extraction usually drops:

  • Footnotes and endnotes — sometimes appended to the page text, sometimes lost.
  • Headers and footers — included by default, often need filtering.
  • Embedded metadata (title, author, keywords) — not part of the body text.
  • Hyperlinks — link text comes through, the URL doesn't unless you ask for it specifically.

If hyperlinks matter for your downstream use, consider PDF to HTML instead.
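
Metadata, at least, is easy to recover separately. A small sketch with the pypdf package (file.pdf is a placeholder):

  from pypdf import PdfReader

  # Text extraction drops document metadata, so read it directly if you need it.
  reader = PdfReader("file.pdf")
  info = reader.metadata
  if info:
      print("Title: ", info.title or "")
      print("Author:", info.author or "")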

Encoding gotchas

Modern PDFs are mostly UTF-8 friendly, but:

  • PDFs from older typesetting systems sometimes use custom font encodings — the extracted text looks like gibberish.
  • Embedded fonts with custom glyph mappings (common in academic papers) can extract correctly in one tool and as mojibake in another.
  • Non-Latin scripts (Arabic, Chinese, Japanese, Korean, Hindi) need a tool with proper Unicode support.

If the extraction looks wrong, try a different tool before giving up — some handle these cases far better than others.
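
One cheap way to decide whether to try another tool is a quick character audit of the output. A rough sketch: count replacement characters and private-use glyphs, which usually point at a broken font encoding (output.txt is a placeholder):

  import unicodedata

  with open("output.txt", encoding="utf-8") as f:
      text = f.read()

  # U+FFFD is the Unicode replacement character; category "Co" marks
  # private-use code points, common when a custom glyph mapping leaks through.
  bad = sum(1 for ch in text
            if ch == "\ufffd" or unicodedata.category(ch) == "Co")
  print(f"{bad} suspect characters out of {len(text)}")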

Conclusion

Plain text extraction is fast and free with the right tool. Use a browser tool for one-off files, the command line for batches, and OCR for scans. After extraction, plan for a small cleanup pass (re-flow paragraphs, fix hyphenation, remove repeated headers) and the result is ready for grep, search, or AI pipelines. Docento.app handles browser-based extraction without uploads.
