
How to Convert a PDF to Markdown for Wikis, Static Sites, and LLMs

April 26, 2026·7 min read

Markdown is the format many developers, technical writers, and increasingly knowledge workers reach for first. It is plain text, version-controllable, easy to read, and clean to render. When the source material is a PDF (a research paper, a vendor manual, an old report someone scanned), converting it to Markdown puts it into a format you can actually work with. This guide walks through the practical options for clean PDF-to-Markdown conversion.

Why convert PDF to Markdown

A few common motivations:

  • Static site generation. Tools like Hugo, Jekyll, Astro, and 11ty consume Markdown. Converting old PDF resources to Markdown lets you publish them as web pages.
  • Wiki ingestion. Obsidian and Notion accept Markdown directly; Confluence and MediaWiki can take it through an import step or a converter. Converting PDF documentation to Markdown brings it into your team's knowledge base.
  • LLM context. Modern AI tools work best with clean text, and Markdown is the format of choice for feeding documents to LLMs. See chatting with PDFs explained.
  • Version control. Tracking changes to a PDF over time is painful. Tracking changes to a Markdown file in Git is trivial.
  • Editing comfort. Once content is in Markdown, anyone can edit it in any text editor.

What you can preserve

A clean Markdown conversion preserves:

  • Headings (H1-H6 mapped to #-######)
  • Paragraphs
  • Bulleted and numbered lists
  • Bold and italic emphasis
  • Inline links and footnotes
  • Tables (basic Markdown tables, no merged cells)
  • Code blocks (fenced)
  • Block quotes

What does not survive:

  • Page numbers and headers/footers (you usually do not want them in Markdown anyway)
  • Multi-column layout (Markdown is single-column)
  • Complex tables with merged cells or nested content
  • Floating sidebars, callouts, and pull quotes (unless converted to block quotes or admonitions)
  • Inline equations rendered as images only (these become placeholder text unless your converter can emit MathJax or LaTeX)

Tools that produce decent Markdown

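Pandoc. The most general document converter, with one important caveat: Pandoc can write PDF but cannot read it, so pandoc input.pdf -o output.md will not work. Convert the PDF to HTML or text first (pdftohtml, pdftotext, or an Acrobat export) and let Pandoc take over from there, for example pandoc input.html -o output.md. This works best for native-text PDFs with clear structure. Output is Pandoc Markdown by default; pass -t gfm for GitHub-flavored.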

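marker. Open-source ML-based PDF-to-Markdown converter that emphasizes layout reconstruction. Strong on academic papers, complex layouts, and equations. The command-line entry point has changed between releases (recent versions ship a marker_single command for one file), so check the README for the exact invocation. Slower than a plain text dump but cleaner output on hard PDFs.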

pdftotext + post-processing. pdftotext -layout input.pdf - | python normalize.py > output.md. The simplest baseline. Layout-preserving text dump, then a script converts headings, lists, and tables. See poppler-utils introduction.
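The normalize.py script is not shown in this post, so the following is only a guess at what a first version might look like. It assumes two heuristics that will not hold for every document: lines starting with a bullet glyph are list items, and short lines that begin with a section number such as "2.1" are headings. The function name and both regexes are illustrative.

```python
import re

# Assumed heuristics; tune them against your own PDFs.
BULLET = re.compile(r"^\s*[•◦▪·]\s+")
NUMBERED_HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+([A-Z].{0,78})$")


def normalize(text: str) -> str:
    """Turn a pdftotext dump into rough Markdown."""
    out = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            out.append("")
        elif BULLET.match(raw):
            out.append("- " + BULLET.sub("", raw).strip())
        elif (m := NUMBERED_HEADING.match(line)) and not line.endswith("."):
            # "1 Intro" -> "##", "2.1 Methods" -> "###"; "#" is left for the title.
            depth = m.group(1).count(".") + 1
            out.append("#" * min(depth + 1, 6) + " " + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

To use it in the pipeline above, wrap it so it reads standard input and writes standard output. Tables are left untouched here; they need their own pass.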

unstructured. A Python library that parses PDFs and other document formats into structured elements (Title, NarrativeText, ListItem, Table), which you can render as Markdown. Designed for LLM pipelines.
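Rendering those elements can be sketched without the library itself. The function below assumes you have already reduced each element to a (category, text) pair; how you obtain the pairs, the function name, and the choice of "##" for every Title are all assumptions made for illustration.

```python
def render_elements(elements):
    """Render (category, text) pairs as Markdown.

    Categories follow the names mentioned above. Anything that is not a
    Title or ListItem (including Table) is emitted as a plain paragraph.
    """
    lines = []
    prev = None
    for category, text in elements:
        text = " ".join(text.split())  # collapse whitespace left by extraction
        if not text:
            continue
        # Blank line between blocks, except between consecutive list items.
        if lines and not (category == "ListItem" and prev == "ListItem"):
            lines.append("")
        if category == "Title":
            lines.append("## " + text)
        elif category == "ListItem":
            lines.append("- " + text)
        else:
            lines.append(text)
        prev = category
    return "\n".join(lines) + "\n"
```

A real pipeline would also want proper heading levels and table handling, neither of which this sketch attempts.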

Adobe Acrobat Pro. Export As → HTML, then convert HTML to Markdown with Pandoc (pandoc input.html -o output.md). The two-step route gives Pandoc structured input it can actually read, since it has no PDF reader of its own.

Online converters. Many free services exist. They are convenient for one-off conversions but not appropriate for sensitive material. See are online PDF editors safe.

A practical pipeline for academic papers

Academic PDFs (two-column, references, equations, figures) are notoriously hard. A pipeline that works:

  1. Layout-aware extraction. Run the PDF through nougat or marker, both ML-based tools that understand academic layout.
  2. Pandoc finishing. Pass the Markdown through pandoc -f markdown -t markdown --markdown-headings=atx to normalize headings, fix whitespace, and standardize syntax.
  3. Manual cleanup. Open in any text editor and fix the inevitable problems: misnumbered footnotes, garbled equations, figure references with wrong numbers.
  4. Spell-check. Run a spell-checker over the final Markdown to catch OCR-style errors.

A practical pipeline for vendor manuals

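Product manuals and corporate documentation are easier because the layouts are simpler:

  1. Export the PDF to HTML (Acrobat's Export As, or pdftohtml from poppler-utils), then run pandoc manual.html -o manual.md. Pandoc cannot read the PDF directly.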
  2. Open in a Markdown editor (Obsidian, VS Code with Markdown preview, Typora)
  3. Fix any headings that did not detect correctly
  4. Add front matter (title, date, source) if your static site needs it
  5. Drop into the wiki or repo

For technical manuals with code blocks, watch for monospace text that lost its formatting. Re-wrap code in triple-backtick fences.

Tables: the hardest part

Markdown's table syntax is intentionally minimal. Anything with merged cells, multi-line content, or nested structures does not fit cleanly. Strategies:

  • For simple tables, Pandoc handles them well when it is given HTML or another structured input. Pass -t gfm to get GitHub-flavored pipe tables.
  • For complex tables, convert to HTML tables inside Markdown. Most renderers accept inline HTML.
  • For data tables, extract to CSV instead, then reference the CSV from the Markdown. See how to convert a PDF to CSV.
  • For tables of tables, accept that Markdown is the wrong target. Stay with HTML or PDF.
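The first two strategies can be combined in one helper. This is a minimal sketch, assuming the table has already been extracted as a list of rows of strings with the header first; the function names and the "newline in a cell means complex" rule are illustrative choices, not something any converter prescribes.

```python
import html


def _html_row(cells, tag):
    rendered = []
    for cell in cells:
        text = html.escape(cell).replace("\n", "<br>")
        rendered.append(f"<{tag}>{text}</{tag}>")
    return "<tr>" + "".join(rendered) + "</tr>"


def render_table(rows):
    """Emit a pipe table when every cell is single-line, else inline HTML."""
    header, *body = rows
    if any("\n" in cell for row in rows for cell in row):
        html_rows = [_html_row(header, "th")] + [_html_row(r, "td") for r in body]
        return "<table>\n" + "\n".join(html_rows) + "\n</table>"

    def pipe_row(cells):
        # A literal "|" inside a cell must be escaped in pipe tables.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([pipe_row(header), separator] + [pipe_row(r) for r in body])
```

Merged cells are out of scope here; they would need rowspan and colspan attributes, which is a good sign the table belongs in HTML from the start.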

Images and figures

Most converters extract images to a sidecar directory and reference them in the Markdown with ![alt](images/figure-1.png). Common cleanups after conversion:

  • Move images to a more organized folder structure
  • Rename generic Image001.png to descriptive filenames
  • Add alt text. Markdown's ![alt]() syntax is your accessibility hook
  • Compress large PNGs to save repository size
  • Strip metadata from images using mat2 or exiftool
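Renaming files means the references in the Markdown have to change too. The sketch below rewrites image links from a mapping of old path to new path and fills in missing alt text. The function name and both dictionaries are hypothetical, and the regex deliberately ignores image links that carry a title, such as ![alt](path "title").

```python
import re

# Matches ![alt](path) where the path has no spaces and no title part.
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_images(markdown, renames, alt_text):
    """Apply path renames and add alt text where it is empty."""

    def fix(match):
        alt, path = match.group(1), match.group(2)
        new_alt = alt or alt_text.get(path, "")
        new_path = renames.get(path, path)
        return f"![{new_alt}]({new_path})"

    return IMAGE.sub(fix, markdown)
```

Existing alt text is never overwritten; only empty alt attributes are filled from the mapping.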

Front matter and metadata

Many Markdown consumers (Hugo, Jekyll, Astro) expect YAML front matter at the top of each file:

---
title: 'Quarterly Report Q1 2026'
date: '2026-04-01'
source: 'q1-2026-report.pdf'
---

Some converters can emit front matter from PDF metadata. Otherwise add it as a post-processing step.
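As a post-processing step, prepending front matter takes only a few lines. This sketch avoids a YAML dependency by quoting every value with json.dumps, relying on the fact that a JSON string is also a valid YAML double-quoted scalar. The function name is an assumption, and all values are written as strings.

```python
import json


def add_front_matter(markdown, **fields):
    """Prepend a YAML front matter block built from keyword arguments."""
    lines = ["---"]
    for key, value in fields.items():
        # JSON string quoting doubles as safe YAML quoting.
        lines.append(f"{key}: {json.dumps(str(value), ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + markdown
```

If your site generator needs typed values (real dates, lists of tags), a YAML library is the better tool.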

LLM-oriented conversion

If your goal is to feed PDFs to an LLM, Markdown is the right intermediate format because it preserves structure (headings, lists, tables) in a token-efficient way.

A few tips:

  • Strip page numbers and headers/footers. They add noise without semantic value.
  • Preserve heading levels carefully. Chunking pipelines typically split on the heading hierarchy, and headings give the model context for each chunk.
  • Include the source filename and page number as comments when chunking, so the LLM can cite the page back to you.
  • Test extraction quality on a few sample PDFs before processing thousands. A bad conversion produces garbage answers from the LLM downstream.
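The chunking tip can be sketched as follows, assuming the converter gives you the Markdown one page at a time as (page_number, text) pairs. That input shape, the function name, and the HTML comment format are all assumptions. Each chunk starts at a heading and is tagged with the page where it began.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_pages(pages, source):
    """Split per-page Markdown into heading-delimited chunks with source tags."""
    chunks = []  # list of (start_page, lines)
    for page_no, text in pages:
        for line in text.splitlines():
            # A heading (or the very first line) opens a new chunk.
            if HEADING.match(line) or not chunks:
                chunks.append((page_no, []))
            chunks[-1][1].append(line)
    return [
        f"<!-- source: {source}, page: {page} -->\n" + "\n".join(lines).strip() + "\n"
        for page, lines in chunks
        if any(line.strip() for line in lines)
    ]
```

The sketch does not look inside fenced code blocks, so a "#" comment at the start of a line in a code sample would be treated as a heading.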

For more on this workflow, see AI data extraction from PDFs.

Common gotchas

Page breaks turning into paragraph breaks. PDFs often break paragraphs across pages. Naive conversion splits the paragraph; you need a post-processing step to re-join.
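One possible re-join step, assuming you have the extracted text as a list of per-page strings: if the text so far does not end in sentence punctuation and the next page starts with a lowercase letter, treat it as one paragraph. Both conditions are heuristics chosen for this sketch, and the function name is illustrative.

```python
import re

# Text that ends in sentence punctuation, optionally followed by quotes or brackets.
SENTENCE_END = re.compile(r"[.!?:;]\W*$")


def join_pages(pages):
    """Concatenate page texts, re-joining paragraphs split by a page break."""
    merged = ""
    for page in pages:
        page = page.strip()
        if not page:
            continue
        if not merged:
            merged = page
        elif not SENTENCE_END.search(merged) and page[0].islower():
            merged += " " + page  # same paragraph continues
        else:
            merged += "\n\n" + page  # genuine paragraph break
    return merged
```

Run this after headers and footers have been removed; otherwise "Page 47" sits between the two halves and the check never fires.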

Hyphenated line breaks. Convert "develop- ment" back to "development". Pandoc has no option for this (--strip-comments only removes HTML comments), so plan on a small regex pass over the extracted text whichever tool you use.
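A minimal version of that regex pass, under the assumption that the hyphen sits at the end of a line and the continuation starts with a lowercase letter. The function name is illustrative.

```python
import re

# word + "-" at end of line, continued by a lowercase word on the next line
LINE_HYPHEN = re.compile(r"(\w+)-[ \t]*\n[ \t]*([a-z]\w*)")


def dehyphenate(text):
    """Re-join words that were hyphenated across a line break."""
    return LINE_HYPHEN.sub(r"\1\2", text)
```

The trade-off is that a genuine compound broken at its hyphen ("well-" at the end of one line, "known" on the next) also loses the hyphen. Checking candidates against a word list reduces that, at the cost of more code.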

Multi-column PDFs. Without column detection, columns interleave. Use a converter that handles columns (marker, nougat, ABBYY) or pre-process to a single-column PDF.

Footnotes. Many converters drop footnotes entirely. Pandoc can write Markdown footnote syntax ([^1]) when its input marks notes as such. Verify they end up where expected.

Headers and footers repeated. "Page 47" or "© 2025 Acme Corp" on every page becomes 47 lines of Markdown noise. Strip with regex or a converter setting.
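One way to find those lines without writing a regex per document is to count them. The sketch below assumes per-page text as a list of strings, replaces digit runs with "#" so "Page 46" and "Page 47" count as the same line, and drops any line that appears on at least 60% of pages. The threshold and names are arbitrary choices for illustration.

```python
import re
from collections import Counter


def _key(line):
    # Normalize digits so running page numbers compare equal.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages, min_share=0.6):
    """Remove lines that recur on most pages (headers, footers, page numbers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({_key(line) for line in page.splitlines() if line.strip()})
    limit = max(2, min_share * len(pages))
    noise = {key for key, n in counts.items() if n >= limit}
    return [
        "\n".join(line for line in page.splitlines() if _key(line) not in noise)
        for page in pages
    ]
```

Because digits are normalized, body lines that differ only by a number (such as "Step 1", "Step 2" on consecutive pages) can be caught too, so review what was removed on a sample.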

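Encoding. Curly quotes, em dashes, and special characters can become Unicode escapes or garbage. Pandoc expects UTF-8 on input and writes UTF-8 on output (there is no flag to set), so make sure the upstream extractor emits UTF-8, and normalize the text to NFC.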
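NFC normalization is available in Python's standard library. A small sketch, assuming the extractor's output arrives as UTF-8 bytes (the function name is illustrative):

```python
import unicodedata


def normalize_text(raw: bytes) -> str:
    """Decode UTF-8 and compose characters into NFC form."""
    # errors="replace" makes bad bytes visible as U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)
```

NFC only composes characters (for example "e" plus a combining accent becomes "é"). It leaves curly quotes and dashes alone, so replacing those with ASCII, if you want that, is a separate step.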

Tables of contents. A PDF's TOC often references specific page numbers that no longer exist in Markdown. Either remove the TOC or regenerate it from headings using your static-site generator.
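If your generator does not build a TOC for you, one can be derived from the headings directly. This is a sketch under assumptions: ATX headings only, and a simple slug rule (lowercase, punctuation dropped, spaces to hyphens) that may not match the anchors your renderer actually generates. Both function names are illustrative.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # fenced code block delimiter


def slugify(title):
    slug = re.sub(r"[^\w\s-]", "", title.lower())
    return re.sub(r"\s+", "-", slug.strip())


def build_toc(markdown):
    """Build a nested Markdown list linking to each heading."""
    entries = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
            continue
        match = HEADING.match(line)
        if match and not in_fence:  # skip "#" comments inside code blocks
            level, title = len(match.group(1)), match.group(2)
            entries.append("  " * (level - 1) + f"- [{title}](#{slugify(title)})")
    return "\n".join(entries)
```

Check the generated anchors against one rendered page before relying on them, since slug rules differ between renderers.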

Tools that combine OCR and Markdown

For scanned PDFs, the conversion pipeline is OCR → Markdown:

  • nougat (Meta). Designed for academic papers; produces clean Markdown with equations.
  • marker + Tesseract. Accepts scanned PDFs, runs OCR, and produces Markdown.
  • ABBYY FineReader. Export to "Editable copy", paste into a Markdown editor, and fix the structure.
  • Google Document AI. Produces structured JSON, which you can render as Markdown.

For more on OCR concepts, see PDF OCR explained.

Quick reference

For most PDFs, in 2026:

  • Simple native-text PDF: export to HTML (pdftohtml or Acrobat), then pandoc input.html -t gfm -o output.md
  • Academic paper: marker or nougat (nougat input.pdf)
  • Scanned document: ocrmypdf input.pdf ocr.pdf, then treat ocr.pdf as a native-text PDF
  • Quick LLM ingest: partition the PDF with unstructured, then render the elements as Markdown

Takeaway

Converting PDF to Markdown is one of the most useful transformations in modern document workflows. Pandoc handles the common case once the PDF has been exported to HTML or text; marker and nougat handle the hard cases; OCR followed by the same text pipeline handles scans. Plan for post-processing: every conversion needs a cleanup pass. The result is content you can edit, version-control, search, render to HTML, and feed to AI tools. For the upstream step of trimming a PDF before conversion (say, extracting just the relevant chapter), Docento.app lets you isolate pages in the browser without uploading anywhere.
