How to Convert a PDF to XML for Data Pipelines and Document Processing

Most people will never need to convert a PDF to XML. The people who do (engineers building data pipelines, archivists, anyone dealing with structured documents in legal or healthcare workflows) find it surprisingly involved. XML is structured, hierarchical, and machine-readable. PDFs, even at their best, are built on a page-rendering model that does not naturally map to a clean tree. This guide walks through what "PDF to XML" actually means in practice and how to do it well.

Why convert PDF to XML at all

A few real reasons:

Pipeline ingestion. Downstream systems consume XML. Invoices, lab reports, government filings all flow through XML-based exchange formats.
Search and indexing. A structured XML representation indexes far better than a flat blob of text.
Re-rendering. Converting to XML lets you re-template the content: same data, different layout, different output format.
Long-term preservation. XML is plain text, easy to validate, and durable. Combined with PDF/A, it makes a strong archival pair.
Accessibility. Some accessibility workflows extract content from PDF to XML, transform it, and re-render to a more accessible format.

What "convert to XML" can mean

The phrase is ambiguous. There are at least three different conversions:

PDF → flat XML of text + coordinates. Each text run becomes an XML element with x/y/font/size attributes. No structural meaning. Useful for downstream parsing, fragile to layout changes.
PDF → tagged XML following the document structure. Headings, paragraphs, lists, tables, figures. Possible only if the PDF is properly tagged for accessibility. Maps the structure tree to XML elements.
PDF → domain-specific XML. The PDF is an invoice, a lab report, an SEC filing. The XML conforms to a domain schema (UBL for invoices, HL7 CDA for healthcare, EDGAR XBRL for SEC). This involves understanding the content, not just the layout.

Each requires a different tool.

Tools for flat XML extraction

If you just need the text with positions:

pdftohtml -xml file.pdf (poppler-utils) produces an XML file with one element per text run, including font, size, and coordinates. See poppler-utils introduction.
mutool draw -F stext -o out.xml file.pdf (MuPDF) produces structured text XML. See MuPDF introduction.
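Apache PDFBox's ExtractText command-line tool (or the PDFTextStripper class) plus your own wrapper: flexible Java-based extraction, but the tool emits plain text or HTML, so the wrapper has to produce the XML.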
pdfminer.six in Python: pdf2txt.py -t xml -o out.xml file.pdf produces XML output.

These give you a faithful but unstructured representation: every text run, every image position, every page boundary, all as XML. From there it is up to you to extract meaning.
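
As a rough illustration of what "extracting meaning" starts from, the sketch below loads pdftohtml -xml style output into plain Python dictionaries. The sample document is hand-written for this example (the values are made up), and text_runs is a hypothetical helper name; real output carries more attributes than the ones read here.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape pdftohtml -xml emits; values are invented.
SAMPLE = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="18" family="Helvetica" color="#000000"/>
    <fontspec id="1" size="11" family="Times" color="#000000"/>
    <text top="80" left="72" width="300" height="20" font="0"><b>Quarterly Report</b></text>
    <text top="120" left="72" width="410" height="13" font="1">This document summarizes results.</text>
  </page>
</pdf2xml>"""


def text_runs(xml_string):
    """Return one dict per <text> element: page, position, font size, text."""
    root = ET.fromstring(xml_string)
    sizes = {}  # font id -> size; a fontspec may be declared on an earlier page
    runs = []
    for page in root.iter("page"):
        for spec in page.iter("fontspec"):
            sizes[spec.get("id")] = int(spec.get("size"))
        for node in page.iter("text"):
            runs.append({
                "page": int(page.get("number")),
                "top": int(node.get("top")),
                "left": int(node.get("left")),
                "size": sizes.get(node.get("font")),
                # itertext() also collects text nested inside <b> or <i>
                "text": "".join(node.itertext()),
            })
    return runs


runs = text_runs(SAMPLE)
for run in runs:
    print(run["page"], run["top"], run["size"], run["text"])
```

Font size is often the first usable signal: in this sample the 18-point run is a plausible heading, but that is a heuristic you would have to confirm on your own documents.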

Tools that preserve tag structure

When the PDF is tagged for accessibility (you can verify this; see tagged PDF vs untagged PDF), you can extract the structure tree as XML:

Adobe Acrobat Pro: Export As → XML → "XML 1.0". Preserves the tag tree.
callas pdfToolbox has a "structure to XML" action useful for compliance workflows.
CommonLook Validator produces XML reports of the structure for accessibility audits.
Custom scripts using pikepdf can walk the /StructTreeRoot and emit XML matching your schema.

The output is something like:

<Document>
  <H1>Quarterly Report</H1>
  <P>This document summarizes…</P>
  <Table>
    <TR><TH>Quarter</TH><TH>Revenue</TH></TR>
    <TR><TD>Q1</TD><TD>$1.2M</TD></TR>
  </Table>
</Document>

This is dramatically more useful than flat coordinates.
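
To show why, here is a minimal sketch that turns the table from the sample above into records using only the standard library. It assumes the exported XML uses exactly the element names shown (Table, TR, TH, TD) and that the first row is the header; real exports may nest or name things differently. table_records is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The sample output from above, embedded as a string for illustration.
TAGGED = """<Document>
  <H1>Quarterly Report</H1>
  <P>This document summarizes…</P>
  <Table>
    <TR><TH>Quarter</TH><TH>Revenue</TH></TR>
    <TR><TD>Q1</TD><TD>$1.2M</TD></TR>
  </Table>
</Document>"""


def table_records(xml_string):
    """Map each data row of each <Table> to a dict keyed by the header cells."""
    root = ET.fromstring(xml_string)
    records = []
    for table in root.iter("Table"):
        rows = table.findall("TR")
        if not rows:
            continue
        # Assumption: the first row holds the <TH> header cells.
        header = [cell.text for cell in rows[0].findall("TH")]
        for row in rows[1:]:
            cells = [cell.text for cell in row.findall("TD")]
            records.append(dict(zip(header, cells)))
    return records


title = ET.fromstring(TAGGED).findtext("H1")
records = table_records(TAGGED)
print(title, records)
```

No coordinates or font heuristics are involved: the structure tree already says which text is a heading and which cells belong to which row.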

Domain-specific XML extraction

If you need PDF → UBL invoice XML, PDF → HL7 CDA, or PDF → JATS for scientific articles, you are no longer doing PDF conversion; you are doing structured document understanding. Approaches:

Template-driven extraction. If you know the layout (e.g., all invoices from vendor X have the same layout), you can write a template that picks values at specific coordinates and emits domain XML. Tools: Klippa, Rossum, your own scripts.
Machine-learning extraction. Modern OCR plus document AI services (Amazon Textract, Google Document AI, Azure AI Document Intelligence (formerly Form Recognizer), ABBYY) recognize entities like "invoice number", "due date", "line item" from arbitrary layouts and emit JSON or XML. See AI data extraction from PDFs for an overview.
Hybrid pipelines. OCR + post-processing rules + domain validators (e.g., a UBL validator that confirms the XML conforms to schema).
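
The template-driven approach can be sketched in a few lines. Everything specific here is an assumption made for illustration: the bounding boxes, the field names, and the Invoice element are hypothetical and are not UBL. The input is a list of positioned text runs such as a flat extractor produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical template for one vendor layout:
# field name -> (left, top, right, bottom) in page units.
TEMPLATE = {
    "InvoiceNumber": (400, 90, 560, 110),
    "DueDate": (400, 115, 560, 135),
}


def apply_template(runs, template):
    """Collect the text whose origin falls inside each box; emit simple XML."""
    invoice = ET.Element("Invoice")  # illustrative element name, not a real schema
    for field, (x0, y0, x1, y1) in template.items():
        inside = [
            run for run in runs
            if x0 <= run["left"] <= x1 and y0 <= run["top"] <= y1
        ]
        inside.sort(key=lambda run: (run["top"], run["left"]))
        ET.SubElement(invoice, field).text = " ".join(run["text"] for run in inside)
    return invoice


# Made-up positioned runs standing in for extractor output.
runs = [
    {"left": 72, "top": 95, "text": "Invoice"},
    {"left": 410, "top": 95, "text": "INV-0042"},
    {"left": 410, "top": 120, "text": "2024-03-31"},
]
xml_out = ET.tostring(apply_template(runs, TEMPLATE), encoding="unicode")
print(xml_out)
```

The weakness is visible in the template itself: if the vendor moves the invoice number by a few units, the box misses it and the field comes out empty, which is why a schema or non-empty check after extraction is worth having.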

For specific PDF flavors that map to specific XML, e.g., PDF/A-3 with an embedded ZUGFeRD invoice XML, the XML is already inside the PDF. You "convert" by extracting the embedded attachment, not by re-parsing the visible content.

A practical recipe for unstructured PDFs

For most "I have a folder of PDFs, I want XML" jobs, a workable starting pipeline is:

OCR if needed. Run OCRmyPDF or similar to ensure every PDF has a text layer. See PDF OCR explained.
Tag if needed. If you control the PDF source, tag at authoring time. If not, use Acrobat Pro's autotag plus manual cleanup.
Extract. Use pdfminer.six or mutool to emit XML.
Post-process. Apply XSLT or Python rules to map the extracted XML to your target schema.
Validate. Run the output through an XML schema validator (xmllint, an XSD validator) to catch shape errors early.
Spot-check. Compare a sample of XML files visually against the source PDFs.

For batch jobs, wrap this in a script and run over a directory.
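
A minimal sketch of such a wrapper, covering only the OCR and extract steps, might look like this. It assumes ocrmypdf and pdfminer.six's pdf2txt.py are installed and on PATH; the function names and the output file naming are choices made for this example.

```python
import subprocess
from pathlib import Path


def commands_for(pdf, out_dir):
    """Build the OCR and extraction command lines for a single PDF."""
    pdf, out_dir = Path(pdf), Path(out_dir)
    ocr_pdf = out_dir / f"{pdf.stem}.ocr.pdf"
    xml_path = out_dir / f"{pdf.stem}.xml"
    return [
        # --skip-text leaves pages that already have a text layer untouched
        ["ocrmypdf", "--skip-text", str(pdf), str(ocr_pdf)],
        # pdf2txt.py ships with pdfminer.six; -t xml selects XML output
        ["pdf2txt.py", "-t", "xml", "-o", str(xml_path), str(ocr_pdf)],
    ]


def convert_folder(src_dir, out_dir, dry_run=False):
    """Run both steps over every *.pdf in src_dir; return names that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        for cmd in commands_for(pdf, out_dir):
            if dry_run:
                print(" ".join(cmd))
                continue
            if subprocess.run(cmd).returncode != 0:
                failed.append(pdf.name)
                break  # skip extraction when OCR failed
    return failed


example = commands_for("incoming/report.pdf", "out")
```

Calling convert_folder("incoming", "out", dry_run=True) prints the commands without running them, which is a cheap way to check paths before a long batch. Post-processing, validation, and spot-checking would follow as separate steps on the XML files this produces.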

Common gotchas

Reading order. Coordinate-based extractors return text in the order it appears in the PDF's content stream, which is not always reading order. A multi-column document needs reading-order reconstruction or it produces interleaved nonsense.
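
One simple form of reading-order reconstruction, sketched under the assumption that the page has equal-width columns and no elements spanning them: assign each run to a column by its left edge, then sort by column, then top to bottom. The run dictionaries and the reading_order name are illustrative.

```python
def reading_order(runs, page_width, columns=2):
    """Sort positioned runs column by column, top to bottom within a column."""
    col_width = page_width / columns

    def key(run):
        # Clamp so a run starting at the far right edge stays in the last column.
        column = min(int(run["left"] // col_width), columns - 1)
        return (column, run["top"], run["left"])

    return sorted(runs, key=key)


# Made-up runs in content-stream order: the two columns are interleaved.
runs = [
    {"left": 320, "top": 100, "text": "right-1"},
    {"left": 50, "top": 100, "text": "left-1"},
    {"left": 320, "top": 120, "text": "right-2"},
    {"left": 50, "top": 120, "text": "left-2"},
]
ordered = [run["text"] for run in reading_order(runs, page_width=600)]
print(ordered)
```

Full-width headings, figures, and uneven columns all break this heuristic; real documents usually need column detection from the gaps in the text rather than a fixed column count.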

Whitespace. Spaces in PDFs are sometimes encoded as zero-width characters or positioning hints, not literal spaces. Extractors handle this differently. Test on real samples.
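
When an extractor hands back runs without the spaces between them, one common heuristic is to infer a space wherever the horizontal gap between two runs exceeds a fraction of the font size. The sketch below assumes the runs belong to one line, are sorted left to right, and carry left, width, and size; the 0.25 ratio is a guess to be tuned on real samples.

```python
def join_runs(runs, gap_ratio=0.25):
    """Join the runs of one line, inserting a space where the gap is wide enough."""
    pieces = []
    prev_right = None
    for run in runs:
        if prev_right is not None:
            gap = run["left"] - prev_right
            # Only add a space if the previous piece does not already end in one.
            if gap > gap_ratio * run["size"] and not pieces[-1].endswith(" "):
                pieces.append(" ")
        pieces.append(run["text"])
        prev_right = run["left"] + run["width"]
    return "".join(pieces)


# Made-up runs: a 4-unit gap after "Total", no gap before the colon.
line = [
    {"left": 72, "width": 30, "size": 12, "text": "Total"},
    {"left": 106, "width": 40, "size": 12, "text": "amount"},
    {"left": 146, "width": 8, "size": 12, "text": ":"},
]
joined = join_runs(line)
print(joined)
```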

Unicode normalization. PDFs may use ligatures (ﬁ, ﬂ) or non-standard encodings. Normalize to NFC and replace ligatures before consuming the text in XML; NFC alone leaves ligature code points intact, so they need an explicit mapping or NFKC.
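
A small sketch of both steps using the standard library. The ligature table covers only the five Latin f-ligatures and is an assumption about what your PDFs contain; NFKC would also fold them, but it rewrites other compatibility characters (superscripts, full-width forms) that you may want to keep.

```python
import unicodedata

# Explicit mapping for the Latin f-ligatures (U+FB00 to U+FB04).
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def clean(text):
    """Expand ligatures, then compose combining sequences with NFC."""
    return unicodedata.normalize("NFC", text.translate(LIGATURES))


raw = "The \ufb01nal \ufb02ow, cafe\u0301"
cleaned = clean(raw)
print(cleaned)

# For comparison: NFC keeps the ligature code point, NFKC expands it.
nfc_only = unicodedata.normalize("NFC", "\ufb01")
nfkc = unicodedata.normalize("NFKC", "\ufb01")
```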

Hyphenation. Words broken across line breaks with hyphens need to be re-joined before XML output, or every "developer" in a hyphenated source becomes "devel-oper".
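
A minimal re-joining pass can be a single regular expression, sketched here under the assumption that a hyphen directly followed by a newline between two lowercase letters is always a line-break hyphen. That assumption is wrong for genuine compounds such as "well-known" broken at the hyphen, so a dictionary check is a sensible addition for production use.

```python
import re

# A lowercase letter, a hyphen, a newline, then another lowercase letter.
_HYPHEN_BREAK = re.compile(r"([a-z])-\n([a-z])")


def dehyphenate(text):
    """Join words that were split across a line break with a hyphen."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


fixed = dehyphenate("Every devel-\noper on the team")
print(fixed)
```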

Page boundaries. A paragraph that spans pages becomes two <P> elements unless you reconstruct continuity. Many extractors do not.
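
One way to reconstruct continuity is a punctuation heuristic: if the last paragraph on a page does not end with terminal punctuation and the first paragraph on the next page starts with a lowercase letter, treat them as one paragraph. The sketch below assumes paragraphs are already split per page; it is a heuristic and will misjudge lists, captions, and headings.

```python
def merge_across_pages(pages):
    """pages: one list of paragraph strings per page. Returns a flat list."""
    merged = []
    for page in pages:
        for index, para in enumerate(page):
            continues = (
                index == 0                 # only the first paragraph of a page
                and bool(merged)           # and only if a previous page exists
                and not merged[-1].rstrip().endswith((".", "!", "?", ":"))
                and para.lstrip()[:1].islower()
            )
            if continues:
                merged[-1] = merged[-1].rstrip() + " " + para.lstrip()
            else:
                merged.append(para)
    return merged


pages = [
    ["Intro.", "The results show a steady"],
    ["increase over the year.", "Next section."],
]
paragraphs = merge_across_pages(pages)
print(paragraphs)
```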

Embedded fonts using custom encodings. Sometimes a PDF's text is correctly drawn on the page but extracts as garbage characters because the font has no ToUnicode map. OCR is the fallback.
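
Deciding when to fall back to OCR can be automated with a rough check on the extracted text. The sketch below counts replacement characters, private-use and unassigned code points, and the "(cid:N)" placeholders that pdfminer.six prints for glyphs it cannot map. The 10% threshold and the looks_garbled name are assumptions for illustration.

```python
import re
import unicodedata

# pdfminer.six writes unmappable glyphs as "(cid:123)".
_CID = re.compile(r"\(cid:\d+\)")


def looks_garbled(text, threshold=0.1):
    """Return True when too many characters look like failed glyph mappings."""
    # Count each cid placeholder as a single bad character.
    text = _CID.sub("\ufffd", text)
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False  # nothing to judge; an empty page is a separate case
    bad = sum(
        1 for ch in chars
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn", "Cc")
    )
    return bad / len(chars) > threshold


print(looks_garbled("Invoice number 42"))
print(looks_garbled("(cid:12)(cid:7)(cid:33) ok"))
```

A check like this catches obvious failures only. A font with a wrong but well-formed mapping extracts as plausible-looking letters in the wrong order, and that still needs a visual spot-check.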

When to push the conversion upstream

If you control the PDF source, the cheapest "PDF to XML" conversion is to never need it. Have the system that produces the PDF also produce the XML in the same step. This is exactly what some tax filings and electronic invoices do:

A ZUGFeRD invoice is a PDF/A-3 with the structured invoice data embedded as XML.
A Factur-X file is the French/European equivalent.
Many SEC filings combine human-readable PDF with machine-readable XBRL XML.

If you find yourself frequently extracting XML from PDFs that you generate, fix the generator instead of the extractor.

Takeaway

Converting a PDF to XML is straightforward when you only need a positional dump and the PDF has selectable text. It becomes useful when the PDF is tagged and the structure can be extracted. It becomes hard when you need domain-specific XML out of arbitrarily laid-out PDFs. Pick the right tier of tool for the job, validate the output against a schema, and where possible push the conversion upstream so the XML is born alongside the PDF rather than reverse-engineered later. For PDF-side maintenance (splitting, merging, page reorganization), Docento.app handles browser-based edits without disturbing the underlying structure.