PDF to HTML is the conversion that most disappoints first-time users. They expect a clean, semantic web page; they get a tangle of absolutely-positioned divs that looks identical in the browser but breaks as soon as they try to edit it. That's because PDF and HTML have fundamentally different shapes. Understanding that mismatch is what separates a useful conversion from a wasted afternoon.
Why this conversion is hard
HTML is a flow document — paragraphs, headings, lists, semantic structure. PDF is a fixed-page document — every glyph has X/Y coordinates and a font reference. There is no clean transformation between the two.
Converters fall into two camps:
- Visual converters produce HTML that looks identical to the PDF. They achieve this by absolute-positioning every text run and image. The HTML "works" but isn't really web-suitable — no responsive design, no semantic structure, no accessibility.
- Semantic converters try to recover headings, paragraphs, and lists from the PDF's visual structure. They produce HTML that looks roughly like the original but is editable, responsive, and accessible. They occasionally guess wrong about what's a heading vs body text.
Pick based on what you need.
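To make the two camps concrete, here is a minimal Python sketch that renders the same text runs both ways. The run data, the function names, and the 18pt heading threshold are all illustrative assumptions; real converters work from far richer input and use more elaborate rules.

```python
from html import escape

# Hypothetical text runs as a PDF parser might report them:
# (text, x, y, font_size), with coordinates in points from the top-left.
RUNS = [
    ("Annual Report", 72, 80, 24),
    ("Revenue grew in every region.", 72, 120, 11),
    ("Costs were flat.", 72, 136, 11),
]

def render_visual(runs):
    """Pin every run to its page coordinates, the way a visual converter does."""
    parts = []
    for text, x, y, size in runs:
        style = f"position:absolute;left:{x}pt;top:{y}pt;font-size:{size}pt"
        parts.append(f'<div style="{style}">{escape(text)}</div>')
    return "\n".join(parts)

def render_semantic(runs, heading_min_size=18):
    """Guess structure from font size, the way a semantic converter might.

    The size threshold is an assumed heuristic, not a standard value.
    """
    parts = []
    paragraph = []
    for text, _x, _y, size in runs:
        if size >= heading_min_size:
            if paragraph:  # close the paragraph collected so far
                parts.append(f"<p>{escape(' '.join(paragraph))}</p>")
                paragraph = []
            parts.append(f"<h1>{escape(text)}</h1>")
        else:
            paragraph.append(text)
    if paragraph:
        parts.append(f"<p>{escape(' '.join(paragraph))}</p>")
    return "\n".join(parts)

print(render_visual(RUNS))
print(render_semantic(RUNS))
```

The visual output keeps every coordinate and no structure; the semantic output keeps the structure and discards the coordinates. That trade is the whole choice.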
When you want a visual converter
Use a visual converter when:
- You want to embed the PDF in a web page without using <iframe> or a PDF viewer.
- The exact layout matters (a designed brochure, an annual report, a magazine page).
- You're building an offline reader that handles PDFs.
Tools:
- pdf2htmlEX: produces near-pixel-perfect HTML, single-file output that includes embedded fonts and images.
- Browser-based visual converters: many SaaS tools do this, but they upload the file. Choose carefully for sensitive documents.
The output works in any browser but isn't really meant to be edited by hand.
When you want a semantic converter
Use a semantic converter when:
- You're migrating PDF content to a website or CMS and want clean HTML you can edit.
- The text matters more than the visual design.
- You need responsive output that works on phone screens.
- You need accessible output for screen readers.
Tools:
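- Pandoc: does not accept PDF as an input format, so pandoc input.pdf -o output.html will not work. It is surprisingly capable as a second stage once the PDF has been converted to Word, Markdown, or plain text (see Methods 2 and 3).
- Apache Tika: extracts text and applies basic structure.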
- AI-based converters: increasingly common, can recognise headings and lists more reliably than rule-based tools, but check the output — they can also hallucinate.
For documents with consistent structure (a textbook with chapter headings, a report with section titles), semantic converters work well. For ad-hoc designs, results are mixed.
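The heading-versus-body guess can be sketched in a few lines of Python. This is an assumed rule of thumb (a line is a heading if its font size is well above the page median), not how any particular converter works, and the sample page is invented to show where the rule goes wrong.

```python
from statistics import median

def classify_lines(lines, ratio=1.3):
    """Label each (text, font_size) line as 'heading' or 'body'.

    A line counts as a heading when its size is at least `ratio` times
    the median size on the page. Both the rule and the default ratio
    are illustrative assumptions.
    """
    body_size = median(size for _text, size in lines)
    return [
        (text, "heading" if size >= ratio * body_size else "body")
        for text, size in lines
    ]

# Hypothetical page: one real heading, and one pull quote set in large type.
page = [
    ("Chapter 1", 20),
    ("It was a quiet morning.", 11),
    ("Nothing moved on the street.", 11),
    ("A pull quote set large", 16),
    ("The day went on.", 11),
]

labels = classify_lines(page)
for text, label in labels:
    print(f"{label:8} {text}")
```

With consistent typography the rule finds "Chapter 1" without trouble. The pull quote is also labelled a heading, which is exactly the kind of misfire an ad-hoc design produces.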
Method 1: Browser-based conversion
For one-off conversions, a browser tool gives you both visual and semantic options. Docento.app runs the conversion in your browser without uploading the source — useful when the document contains anything you'd rather not send through a third party.
Method 2: Convert via Word as a stepping stone
A common trick: PDF → Word → HTML.
- Convert the PDF to Word. See how to convert a PDF to Word.
- Open the Word file, File → Save As → Web Page (Filtered).
- The result is HTML with paragraphs, headings, and inline images.
This produces clean-ish output, especially when the PDF was made from a Word file in the first place. Don't expect modern semantic HTML — Word's HTML export is dated — but it's a usable starting point.
Method 3: Convert via Markdown
For text-heavy content where you want maximum portability:
- PDF → text (see how to convert PDF to text).
- Hand-clean the text into Markdown — add headings, lists, links.
- Pandoc the Markdown to HTML.
This takes the most labour but produces the cleanest HTML. It is the right approach when migrating a small set of important documents.
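The two automated ends of this pipeline might be scripted as below. This is a sketch under assumptions: it uses pdftotext from Poppler for the text step (the article does not name a tool), the file names are placeholders, and the helper functions are hypothetical. The hand-cleaning step in the middle stays manual.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path, txt_path):
    """Command line for Poppler's pdftotext (assumed to be installed)."""
    return ["pdftotext", pdf_path, txt_path]

def pandoc_command(md_path, html_path):
    """Command line for Pandoc: Markdown in, standalone HTML5 out."""
    return [
        "pandoc", md_path,
        "--from", "markdown",
        "--to", "html5",
        "--standalone",
        "--output", html_path,
    ]

def run_if_available(command):
    """Run the command only when its executable is on PATH.

    Returns False instead of raising when the tool is missing.
    """
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True

if __name__ == "__main__":
    # Step 1: extract raw text (placeholder file names).
    run_if_available(pdftotext_command("input.pdf", "input.txt"))
    # Step 2 is manual: clean input.txt into input.md by hand.
    # Step 3: once input.md exists, convert it with
    # run_if_available(pandoc_command("input.md", "output.html")).
```

Keeping the commands as plain argument lists avoids shell quoting problems with file names that contain spaces.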
Handling images, tables, and links
Each requires attention:
- Images are usually extracted into separate files and referenced from the HTML. Many converters can instead embed them as base64 data URIs if you need a single-file result. Watch image quality, since some converters downsample.
- Tables are the most fragile. Visual converters keep them looking right via positioning. Semantic converters often output <table> markup that's structurally correct but visually different. Inspect every table.
- Hyperlinks survive most conversions if they were real links in the PDF (not text that looks like a URL). Spot-check a few by clicking them after conversion.
- Form fields rarely survive. Rebuild as HTML form elements if you need them.
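For the single-file case, embedding an extracted image as base64 looks roughly like this in Python. The function names and the sample bytes are hypothetical; the data URI format itself is standard.

```python
import base64
import mimetypes
from html import escape

def image_to_data_uri(data, filename):
    """Encode raw image bytes as a data: URI, guessing the type from the name."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # fallback for unknown extensions
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def img_tag(data, filename, alt):
    """Build an <img> element whose source is embedded in the page itself."""
    uri = image_to_data_uri(data, filename)
    return f'<img src="{uri}" alt="{escape(alt, quote=True)}">'

# Placeholder bytes stand in for a real extracted image.
print(img_tag(b"hello", "figure1.png", "Revenue by region"))
```

Base64 makes the image data about a third larger than the original bytes, so separate files remain the better choice for image-heavy documents.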
CSS and responsive design
Visual converters embed massive amounts of CSS to position every text run. The result isn't responsive — the page is a fixed width, just like the PDF. To make it responsive, you'll need to:
- Strip the absolute positioning.
- Replace fixed widths with percentages.
- Re-add semantic structure (<h1>, <p>, <ul>).
This is essentially a manual rewrite. For one document it's fine; for many, you want a semantic converter from the start.
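The first step, stripping the positioning, can be partly automated. The Python sketch below assumes the converter wrote its positioning into inline style attributes with double quotes; output that positions elements through CSS classes needs its stylesheet edited instead. The property list and function name are illustrative choices.

```python
import re

# Properties that pin an element to fixed page coordinates (assumed list).
POSITION_PROPS = {"position", "left", "top", "right", "bottom", "width", "height"}

# Matches a double-quoted inline style attribute, including its leading space.
STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"')

def strip_positioning(html):
    """Drop positioning declarations from inline style attributes.

    Other declarations (font size, colour, and so on) are kept. A style
    attribute left with nothing in it is removed entirely.
    """
    def clean(match):
        kept = []
        for declaration in match.group(1).split(";"):
            name, _, value = declaration.partition(":")
            name = name.strip()
            if name and name.lower() not in POSITION_PROPS:
                kept.append(f"{name}:{value.strip()}")
        if not kept:
            return ""
        joined = ";".join(kept)
        return f' style="{joined}"'

    return STYLE_ATTR.sub(clean, html)

sample = '<div style="position:absolute;left:72pt;top:80pt;font-size:24pt">Title</div>'
print(strip_positioning(sample))
```

This only removes the coordinates. The result is still a pile of divs in whatever order the converter emitted them, so the remaining two steps stay manual.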
Accessibility
HTML from a visual converter is usually opaque to screen readers: the text is there, but every run is just an absolutely-positioned div, with no headings, no lists, and no reliable reading order. To make the HTML accessible:
- Use a semantic converter, even if the output looks less polished.
- Verify heading levels are correct (<h1> once, <h2> for sections, etc.).
- Add alt text for images; the conversion won't supply meaningful descriptions.
- Confirm reading order is sensible by tabbing through.
For more, see accessibility tags in PDF and PDF accessibility guide.
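The heading and alt-text checks are easy to automate as a first pass. The Python sketch below uses only the standard library; the class and function names are hypothetical, and the checks are a rough screen, not a substitute for testing with a real screen reader.

```python
from html.parser import HTMLParser

class AccessibilityAudit(HTMLParser):
    """Collect heading levels and count images without alt text."""

    def __init__(self):
        super().__init__()
        self.heading_levels = []
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.heading_levels.append(int(tag[1]))
        elif tag == "img":
            alt = dict(attrs).get("alt")
            # Flags absent and empty alt alike; purely decorative images
            # may legitimately use alt="", so review these by hand.
            if not alt or not alt.strip():
                self.images_missing_alt += 1

def audit(html):
    """Return a list of human-readable problems found in the HTML."""
    parser = AccessibilityAudit()
    parser.feed(html)
    parser.close()

    problems = []
    levels = parser.heading_levels
    if levels.count(1) != 1:
        problems.append(f"expected one <h1>, found {levels.count(1)}")
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:  # e.g. an h1 followed directly by an h3
            problems.append(f"heading jumps from h{previous} to h{current}")
    if parser.images_missing_alt:
        problems.append(f"{parser.images_missing_alt} image(s) without alt text")
    return problems

print(audit('<h1>Report</h1><h3>Sales</h3><img src="chart.png">'))
```

An empty list means the page passed these particular checks, nothing more. Reading order and the quality of the alt text still need a human.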
When to skip the conversion
If your goal is "let users view a PDF on a web page," consider just embedding the PDF directly with <iframe> or using a JavaScript PDF viewer like PDF.js. The result is interactive (zoom, search, scroll) without the lossy conversion.
If your goal is "publish content as a web page," go back to the source and republish from there. The PDF is a derived format — you almost always have a more web-friendly source somewhere.
Conclusion
Pick visual conversion for fidelity, semantic conversion for editability. Pandoc and pdf2htmlEX cover the common cases. For one-off browser conversion without uploads, Docento.app handles both modes. After conversion, plan for cleanup — especially of tables, images, and headings. For comparison content, see PDF vs HTML.