Extracting Tables From PDFs With AI

Tables are the single most painful structure in PDF extraction. To a human eye they are obvious: rows, columns, headers, totals. To a parser they are a loose collection of text fragments scattered across coordinates. In 2026 it is possible to get clean tables out of PDFs reliably, but only with the right tool for the right table. This guide walks through the options.

Why PDF tables are hard

PDFs do not store tables as tables. A PDF stores text positioned at coordinates and lines or rectangles drawn separately. The "table" is an emergent property of those positions to a human reader. A parser has to infer cell boundaries from spacing, vertical alignment, or lines, none of which are guaranteed.
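To make the inference problem concrete, here is a minimal sketch of the simplest possible row detector: it groups words whose vertical positions are close together. The `Word` type, its field names, and the 2-point tolerance are assumptions for illustration, not the output format of any particular library.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float    # left edge of the word (assumed unit: PDF points)
    top: float  # distance from the top of the page (assumed unit: PDF points)


def group_into_rows(words, y_tolerance=2.0):
    """Group words into rows by vertical position, then read each row left to right."""
    rows = []
    for word in sorted(words, key=lambda w: (w.top, w.x)):
        # A word joins the current row if it is close to that row's first word.
        if rows and abs(word.top - rows[-1][0].top) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]
```

Even this toy version needs a tolerance, because words on the same visual row rarely share an identical coordinate. Column detection, wrapped cells, and merged cells are all harder than this.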

Common difficulties:

No visible lines. Many tables use whitespace alignment alone.
Merged cells. A single text run spans multiple columns or rows.
Multi-line cells. One logical cell has wrapped onto three rows of text.
Nested tables. A cell contains a sub-table.
Multi-page tables. The header is on page 7, data continues to page 9.
Header detection. Is the top row a header or just bold data?
Scanned tables. No text layer; pure OCR.

A method that works on financial statements may fail completely on scientific tables, and vice versa.

Traditional tools (no AI)

Before the AI era, three tools handled most PDF tables:

Tabula (Java, GUI plus CLI). User selects the table region; Tabula extracts it. Good for occasional manual extraction.
Camelot (Python). Two modes: "lattice" for tables with visible lines, "stream" for whitespace-aligned. Strong with clean tables; struggles with messy ones.
pdfplumber (Python). Lower-level; gives you character, word, and line positions and lets you decide how to group them into cells.

Plus pdf-table-extractor and a handful of others. These remain the right tool for clean, well-structured tables and for very high volume where AI cost is a factor.
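As a rough sketch of how the two Camelot modes can be combined, the helper below tries lattice first and falls back to stream when nothing is found. The helper name is hypothetical, and treating "lattice found zero tables" as the fallback signal is an assumption; the reader function is passed in as an argument so the logic stays independent of the library.

```python
def extract_with_fallback(read_pdf, path, pages="1"):
    """Try the 'lattice' flavor first, then 'stream' if no tables were found.

    read_pdf is expected to behave like camelot.read_pdf: it accepts
    path, pages= and flavor= and returns a sequence of tables.
    """
    for flavor in ("lattice", "stream"):
        tables = read_pdf(path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, []
```

With Camelot installed, usage would look like `flavor, tables = extract_with_fallback(camelot.read_pdf, "report.pdf")`, after which `tables[0].df` holds the first table as a pandas DataFrame. The file name here is a placeholder.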

For converting tables to spreadsheets directly, see how to convert PDF to Excel and how to convert PDF to CSV.

AI table extraction in 2026

The new options use machine learning or large multimodal models. The categories:

Specialized table-extraction models. Microsoft's Table Transformer (TATR), published on Hugging Face as microsoft/table-transformer-detection and microsoft/table-transformer-structure-recognition, and TableNet detect table regions and structure. Often combined with text OCR for the actual content.

Document AI pipelines. AWS Textract, Google Document AI, and Azure Document Intelligence include dedicated table-extraction modes. They handle detection, structure recognition, and text extraction together. Pay-per-page, but high quality on common tables.

Vision-language models. GPT-4o, Claude Sonnet 4 with vision, Gemini 2.5 Pro. Hand them a page image and ask for the table as Markdown or JSON. Costs more per page than dedicated models, but handles edge cases (handwritten headers, weird layouts, multi-language) that break specialized tools.

Hybrid pipelines. Use a specialized model for detection plus an LLM for difficult cells. Best quality at moderate cost.

Which tool for which table

A practical matching:

Clean financial tables, high volume: Camelot in lattice mode, or pdfplumber. Costs zero per page.
Receipts, invoices: Document AI services. Pre-built models exist for these common types.
Scientific tables with complex structure: Table Transformer or a vision-language model.
Scanned tables: OCR first (Tesseract 5 or hosted OCR), then any of the above.
One-off or weird layouts: Vision-language model. Just ask.

A working LLM prompt

For ad-hoc extraction with a vision-language model:

"Below is a page image. There is one table on the page. Extract the table as Markdown. Preserve exact text. Use empty strings for blank cells. If a cell is merged, repeat its value across the merged span. Do not invent or summarize. If a header is multi-row, flatten by concatenating with a slash."

Then send the page rendered as an image. For multi-page tables, send all relevant pages and ask the model to concatenate.

Output as JSON when downstream code consumes the data, Markdown when humans review it.
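When the model returns Markdown and code still needs to consume it, a small parser bridges the gap. The following is a minimal sketch, assuming the reply contains one pipe-delimited table and no escaped pipes inside cells; the function name is hypothetical.

```python
def parse_markdown_table(reply):
    """Convert a pipe-delimited Markdown table into a list of rows (lists of strings)."""
    rows = []
    separator_seen = False
    for line in reply.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip any prose the model wrote around the table
        inner = line[1:]
        if inner.endswith("|"):
            inner = inner[:-1]
        cells = [cell.strip() for cell in inner.split("|")]
        # Only the line right after the header can be the |---|---| separator,
        # so data rows made entirely of dashes further down are preserved.
        if (len(rows) == 1 and not separator_seen
                and all(cell and set(cell) <= {"-", ":"} for cell in cells)):
            separator_seen = True
            continue
        rows.append(cells)
    return rows
```

Blank cells come back as empty strings, which matches the instruction in the prompt above.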

Verification

Even great AI table extraction makes mistakes. A short verification pass catches most:

Row count. If you know the table should have N rows, count.
Column totals. If a "Total" cell exists, verify the column sums to it.
Unit consistency. Currency, percentages, dates all formatted as expected.
Special values. Negative numbers in parentheses, footnoted cells, dashes for missing data, all handled.

For high-stakes tables (financial filings, regulatory reports), spot-check 5 to 10 percent against the source manually.
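The row-count and column-total checks are easy to automate. This is a minimal sketch under a few assumptions: the header row has already been removed, the first column holds the row label, and amounts are plain numeric strings. The function and parameter names are hypothetical.

```python
from decimal import Decimal


def check_table(rows, expected_rows=None, total_label="Total", amount_col=-1):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if expected_rows is not None and len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(rows)}")

    # Find the row labelled as the total, if there is one.
    total_rows = [r for r in rows if r[0].strip().lower() == total_label.lower()]
    if total_rows:
        stated = Decimal(total_rows[0][amount_col])
        body = [r for r in rows if r is not total_rows[0]]
        computed = sum((Decimal(r[amount_col]) for r in body), Decimal(0))
        if computed != stated:
            problems.append(f"column sums to {computed}, but total says {stated}")
    return problems
```

Using `Decimal` rather than `float` keeps currency sums exact, so a mismatch means an extraction error rather than rounding noise.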

Multi-page tables

When a table spans pages:

Detect the header row on the first page.
Detect the table region on subsequent pages by alignment with the header columns.
Skip page-level repeated headers on subsequent pages.
Stop when a non-table region appears below the last data row.

Most modern Document AI services handle this automatically. For custom pipelines, you implement the join. Be careful with running totals or subtotals that appear at page breaks.
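For a custom pipeline, the join itself can be quite small once each page has been extracted into rows. The sketch below assumes every page is a list of rows, that repeated headers are textually identical to the first one, and that running totals are labelled "carried forward" or "brought forward". Those labels are illustrative guesses; real documents will need their own list.

```python
def join_pages(pages, carry_labels=("carried forward", "brought forward")):
    """Concatenate per-page rows, keeping the header once and dropping page-break rows."""
    if not pages or not pages[0]:
        return []
    header = pages[0][0]
    joined = [header]
    for rows in pages:
        # Drop the header at the top of the first page (already kept above)
        # and any identical repeat at the top of a continuation page.
        if rows and rows[0] == header:
            rows = rows[1:]
        for row in rows:
            # Skip running totals that only exist because of the page break.
            if row and row[0].strip().lower() in carry_labels:
                continue
            joined.append(row)
    return joined
```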

Handling scanned tables

For a scanned table, the sequence is:

OCR the page with a quality OCR (Tesseract 5, AWS Textract, Google Document AI).
Detect the table region using a layout model.
Recognize the table structure with a table-extraction model or vision-language model.
Reconcile text and structure into cells.

Quality degrades quickly with scan quality. A 300 DPI scan extracts well; a 150 DPI photo of a printed page is much harder. See how to edit scanned PDF for cleanup tips before extraction.
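The reconcile step is the least glamorous part of that sequence, so here is a minimal sketch of one way to do it: each OCR word goes into the cell whose box contains the word's center. The tuple layouts are assumptions for illustration, and the words are assumed to arrive in reading order, as OCR engines usually emit them.

```python
def fill_cells(words, cells):
    """Assign OCR words to table cells by testing each word's center point.

    words: list of (text, x0, y0, x1, y1) boxes from OCR, in reading order
    cells: list of (row, col, x0, y0, x1, y1) boxes from the structure model
    Returns a dict mapping (row, col) to the cell's joined text.
    """
    grid = {}
    for text, wx0, wy0, wx1, wy1 in words:
        cx, cy = (wx0 + wx1) / 2, (wy0 + wy1) / 2
        for row, col, x0, y0, x1, y1 in cells:
            if x0 <= cx < x1 and y0 <= cy < y1:
                grid.setdefault((row, col), []).append(text)
                break  # a word belongs to at most one cell
    return {key: " ".join(parts) for key, parts in grid.items()}
```

Words that fall outside every cell are dropped silently here. A production pipeline would probably log them, since they often point to a misdetected table boundary.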

Working with vendor PDFs

If you regularly extract tables from a small set of vendor formats (utility bills, recurring invoices, monthly statements), build a template-based extractor. Templates use the layout's stability to extract reliably with zero ML cost per document. Update the template when the vendor changes their format (typically once a year).

For high-volume vendor extraction, hybrid template plus AI catches the changes when the layout drifts.
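One way to detect drift is to check that a few anchor labels still sit where the template expects them, and route the document to the AI path when they do not. This is a sketch under assumptions: the word tuples, the anchor format, and the 5-point tolerance are all hypothetical.

```python
def template_matches(words, anchors, tolerance=5.0):
    """Return True if every anchor label appears near its expected position.

    words:   list of (text, x, top) tuples extracted from the page
    anchors: dict mapping label text to its expected (x, top) position
    """
    for label, (expected_x, expected_top) in anchors.items():
        found = any(
            text == label
            and abs(x - expected_x) <= tolerance
            and abs(top - expected_top) <= tolerance
            for text, x, top in words
        )
        if not found:
            return False
    return True


def choose_route(words, anchors):
    """Use the cheap template when the layout is stable, otherwise fall back to AI."""
    return "template" if template_matches(words, anchors) else "ai"
```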

Common gotchas

Merged cells. Many extractors flatten merges silently. Verify the row count matches the source.

Footnotes attached to numbers. A "12.3*" with a footnote may be parsed as "12.3" with the asterisk dropped, or with the marker garbled into the value (for example "12.3star"). Verify.

Negative numbers in parentheses. Accounting style "(1,234)" means -1234. Many extractors keep the parens; downstream consumers expect a number.

Date formats. "01/02/2026" is January 2 in the US, February 1 in the EU. Disambiguate at extraction time.

Header alignment. Headers and data sometimes shift by one column when the extractor misreads a merged header. Spot-check.

Special characters. Currency symbols, en-dashes for ranges, micro signs (µ), and unit suffixes. OCR and LLM mistakes happen.

Number-only columns parsed as text. Strip commas, currency symbols, and percent signs at extraction or downstream.
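Several of these gotchas can be handled by one normalization pass. The sketch below assumes US-style separators (comma for thousands, period for decimals) and treats a percent sign as decoration, so "45%" becomes 45 rather than 0.45. The function names are hypothetical.

```python
import re
from datetime import datetime
from decimal import Decimal


def parse_amount(raw):
    """Convert strings like "(1,234)", "$12.3*", or "45%" to Decimal.

    Returns None for blank cells and for dashes used to mark missing data.
    """
    text = raw.strip()
    if text in {"", "-", "\u2013", "\u2014"}:  # hyphen, en dash, em dash
        return None
    negative = text.startswith("(") and text.endswith(")")  # accounting style
    # Drop currency symbols, commas, percent signs, and footnote markers.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    if not re.search(r"\d", cleaned):
        raise ValueError(f"no digits in {raw!r}")
    value = Decimal(cleaned)
    return -value if negative else value


def parse_date(raw, day_first):
    """Parse DD/MM/YYYY or MM/DD/YYYY; the caller must state which one applies."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(raw.strip(), fmt).date()
```

Making `day_first` a required argument is deliberate: it forces the decision to be made per source at extraction time instead of being guessed later.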

Privacy and compliance

Tables in financial, medical, and legal PDFs often contain PII. The same warnings as other AI-on-PDF workflows apply; see risks of using AI on confidential PDFs. For high-sensitivity content, prefer on-premise extraction.

Tooling for non-developers

For users who do not want to write code:

Adobe Acrobat "Export to Excel" works on many table types; quality varies.
Smallpdf / iLovePDF offer "PDF to Excel" web tools.
NotebookLM and Claude with image attachments will produce table Markdown on request.
Tabula has a free GUI for selecting regions and exporting CSV.

For converting in the browser, see how to convert PDF to Excel and the broader conversion guides.

Practical recipe

To extract a single difficult table today:

Identify the table page in the PDF.
Try the cheap path first: Tabula or Camelot. If clean, done in seconds.
If messy, try a vision-language model: render the page as image, ask for Markdown. Costs a few cents.
Verify row count and totals.
Export to CSV or paste into a spreadsheet.
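For the last step, Python's standard csv module is usually enough. A minimal sketch, assuming the table is already a list of rows; the function name is hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Letting the csv module do the quoting avoids the classic bug of joining cells with commas by hand and breaking on values like "blue, large".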

For ongoing extraction:

Profile your tables by source and type.
Choose a tool per type: traditional, Document AI, or VLM.
Build verification into the pipeline.
Monitor accuracy with sampled human review.
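Sampling for human review works best when it is deterministic, so the same document is always in or out of the sample regardless of when the pipeline runs. The sketch below hashes a document identifier to decide; the function name, the identifier format, and the 5 percent default are assumptions.

```python
import hashlib


def needs_review(document_id, rate=0.05):
    """Deterministically select roughly `rate` of documents for manual review."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < rate * 2**64
```

Raising `rate` for a newly added source, then lowering it once accuracy is established, is one way to spend review time where it matters.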

Takeaway

AI table extraction in 2026 is finally good enough for production for most table types. The trick is picking the right tool for the table at hand and building verification around it. For routine extraction from clean tables, traditional libraries still win on cost. For everything else, the multimodal models close the gap. For complementary in-browser PDF operations like splitting source documents before extraction, Docento.app keeps the file local to your machine. See also AI data extraction from PDFs, how to convert PDF to Excel, and building a RAG system with PDFs.