
Hidden Data in PDFs: What Lurks Inside and Why It Matters

May 11, 2026·8 min read

A PDF on screen looks like a clean rendering of pages. The actual file is a complex container holding text, images, fonts, structure trees, metadata, annotations, scripts, signatures, attachments, old revisions, and various other objects. Most of these are invisible to a reader but visible to anyone with the right tool. This guide walks through the categories of hidden data that PDFs carry, why they sometimes leak, and how to handle them responsibly.

Why "hidden" data matters

Hidden data has been the cause of many embarrassing and damaging leaks:

  • A corporate PDF revealed internal author names that contradicted the company's official position
  • A legal filing exposed redacted information through copy-paste of underlying text
  • A government document leaked metadata showing it had been edited shortly before release
  • A research paper revealed undisclosed conflicts of interest through previous version names
  • A press release exposed the internal path of the source document, leaking organizational structure

These were not technical attacks. They were data that the originator forgot existed.

Categories of hidden data

A typical PDF can contain:

Document metadata. Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate in the /Info dictionary. Also XMP metadata in an embedded XML packet. See how to edit PDF metadata.

File attachments. PDFs can attach other files (.docx, .xlsx, .zip, etc.) as embedded payloads. The attachments are not visible on the page but ride inside the file.

Form data. Filled form fields carry their values. If you receive a filled form, you receive the values.

Annotations. Comments, sticky notes, highlights, drawing markup, audio notes, video notes. Each is a separate object in the file, often invisible until the reader's comment panel is opened.

Layers (Optional Content Groups). A PDF can have content in layers that are toggled on or off. Hidden layers contain content that does not appear in the default view but exists in the file.

Old revisions (incremental updates). When a PDF is edited and saved with "Save" (rather than "Save As"), many editors append the changes to the end of the file and leave the old content in place. A reader displays the latest revision; tools can extract previous revisions.

JavaScript. Scripts that run on open, on field change, on submit. Can perform actions, validation, or, in rare cases, leak data.

Embedded fonts. Custom corporate fonts may identify the originator.

Color profiles. ICC profiles sometimes carry identifying strings (creator, date, configuration).

Bookmarks (outlines). Internal navigation tree. Bookmark titles can leak section names or identifiers.

Page labels. Logical page numbering separate from physical pages. Can leak structure not shown in the visible layout.

Digital signatures. Cryptographic signatures carry signer identity, signing time, certificate chain.

XMP rights and version history. Modern XMP packets can include version-history entries showing who edited what and when.

Hidden text. Text that is the same color as the background, behind images, or positioned outside the visible page area. Sometimes used for OCR overlays, sometimes for legitimate accessibility, sometimes by accident.

Structure tree (tags). Reading order and tagging, stored separately from the bookmark outline. Can reveal sections that were thought removed.

Print settings and presentation settings. Sometimes contain identifying defaults.
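
To show how exposed the XMP packet mentioned above can be, here is a minimal sketch that reads properties straight from a file's bytes with Python's standard library. It assumes the packet is stored uncompressed and wrapped in an x:xmpmeta element, which is common but not guaranteed; the function name xmp_fields is made up for this example, and nested descriptions may be reported more than once.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the XMP packet is stored uncompressed inside the PDF.
XMP_RE = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)
RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def local_name(tag: str) -> str:
    """Turn '{namespace-uri}name' into 'name'."""
    return tag.rsplit("}", 1)[-1]


def xmp_fields(pdf_bytes: bytes) -> list[tuple[str, str]]:
    """Return (property, value) pairs found in every XMP packet in the file."""
    fields = []
    for match in XMP_RE.finditer(pdf_bytes):
        try:
            root = ET.fromstring(match.group(0))
        except ET.ParseError:
            continue  # damaged packet: skip rather than guess
        for desc in root.iter(RDF_NS + "Description"):
            # Simple properties are often written as attributes.
            for name, value in desc.attrib.items():
                if not name.startswith(RDF_NS):  # skip rdf:about
                    fields.append((local_name(name), value))
            # Structured properties are child elements; join their text nodes.
            for prop in desc:
                text = " ".join(t.strip() for t in prop.itertext() if t.strip())
                if text:
                    fields.append((local_name(prop.tag), text))
    return fields
```

Calling xmp_fields(open("file.pdf", "rb").read()) on a file exported from an office suite will often list the creator tool and author without any PDF library involved.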

How to inspect what is in a PDF

To audit a PDF for hidden data:

  • Adobe Acrobat Pro. File → Properties → Description tab for the basics. Tools → Redact → "Remove Hidden Information" runs a comprehensive check (the menu location varies between Acrobat versions).
  • exiftool file.pdf lists all metadata. See how to edit PDF metadata.
  • pdftk file.pdf dump_data reports document-level data: Info metadata, bookmarks, page labels, and page sizes.
  • qpdf --json file.pdf dumps the entire object structure as JSON. See qpdf introduction.
  • pdfinfo file.pdf (poppler-utils) shows summary info. See poppler-utils introduction.
  • mutool show file.pdf trailer shows the file trailer and catalog. See MuPDF introduction.
  • Hex dump. xxd file.pdf | less lets you search the raw bytes for known strings.

For a high-stakes audit (before publication), use multiple tools; they catch different things.
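
The raw-byte search from the hex dump step can be scripted. The sketch below counts a handful of PDF name tokens that usually signal non-page content, plus the number of %%EOF markers. The marker list and the function name scan_markers are choices made for this example, not a complete catalogue. Names inside compressed object streams will not be seen, so a zero count is not proof of absence; rewriting the file first with qpdf --qdf --object-streams=disable makes more of the structure visible.

```python
import re

# Name tokens that usually indicate a category of hidden data.
# This list is illustrative, not exhaustive.
MARKERS = {
    "javascript": rb"/JavaScript\b",
    "attachments": rb"/EmbeddedFiles?\b",
    "layers": rb"/OCProperties\b",
    "annotations": rb"/Annots\b",
    "form": rb"/AcroForm\b",
    "xmp": rb"/Metadata\b",
    "info": rb"/Info\b",
}


def scan_markers(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of each marker in the raw (uninflated) file bytes."""
    report = {
        label: len(re.findall(pattern, pdf_bytes))
        for label, pattern in MARKERS.items()
    }
    # More than one %%EOF hints at incremental updates, although
    # linearized files normally contain an extra one as well.
    report["eof_markers"] = pdf_bytes.count(b"%%EOF")
    return report
```

A non-zero count tells you which of the dedicated tools above to reach for next; it does not tell you what the content is.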

Removing hidden data

Strategies for each category:

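Metadata. exiftool -all= file.pdf clears both DocInfo and XMP from view, but ExifTool writes PDF changes as an incremental update, so the original values remain recoverable in the file. Follow it with a full rewrite such as qpdf --linearize in.pdf out.pdf, or use Acrobat's Sanitize Document.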

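Attachments. Acrobat's Sanitize Document removes them. On the command line, qpdf --list-attachments file.pdf shows what is embedded and qpdf --remove-attachment=KEY input.pdf output.pdf removes one entry (qpdf 10.2 or later). Note that mutool clean -d only decompresses streams; it does not remove attachments.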

Form data. pdftk input.pdf output flattened.pdf flatten flattens form values into page content (and removes the field objects). See how to flatten a PDF.

Annotations. Use Acrobat's "Remove All Comments". pdftk has no operation that strips annotations (its flatten option handles form fields), so scripted removal usually means a PDF library that deletes each page's /Annots array.

Hidden layers. Toggle off the layers in Acrobat, then flatten or Sanitize.

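Old revisions. Save As (not Save) to rewrite the file as a single revision. A full rewrite with qpdf, for example qpdf --linearize input.pdf output.pdf or qpdf --object-streams=generate input.pdf output.pdf, writes only the objects reachable from the current revision and drops the incremental update history.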

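JavaScript. Acrobat's Sanitize removes it. strip_javascript is not a documented pdftk operation, and mutool clean -d only decompresses streams, so scripted removal needs a PDF library that deletes the /JavaScript name tree along with the /OpenAction and /AA entries that trigger scripts.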

Embedded fonts. Cannot easily remove (the file would not render). Can substitute system fonts before generating the PDF.

Bookmarks. Acrobat: View → Navigation Panels → Bookmarks → delete each, or use qpdf to rewrite the catalog.

Signatures. Removing invalidates the signature; if anonymizing, this is desired. Use Acrobat's "Clear Signature" or rewrite the catalog programmatically.

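Comprehensive sanitization. Acrobat Pro's "Sanitize Document" runs most of these in one pass. For a programmatic approximation, run pdftk flatten, then exiftool -all=, then qpdf --linearize as the final step so that the rewrite discards the incremental update exiftool leaves behind. mutool clean -gggg can serve as the final rewrite instead; its -d flag alone only decompresses streams.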

For an end-to-end workflow, see how to anonymize PDF documents and how to strip metadata from PDF.
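
The ordering matters when chaining the command-line tools from this section, because exiftool records its changes as an incremental update and a later full rewrite is what actually discards the old values. The sketch below only builds the command lines; the function names and the intermediate file name are hypothetical, and mutool clean -d is left out because it only decompresses streams. It covers form flattening and metadata, not attachments or JavaScript.

```python
import subprocess
from pathlib import Path


def sanitize_commands(src: str, dst: str) -> list[list[str]]:
    """Build the command lines in the order they should run (nothing is executed)."""
    # Hypothetical intermediate file next to the destination.
    flat = str(Path(dst).with_suffix(".flat.pdf"))
    return [
        # 1. Merge form field values into the page content.
        ["pdftk", src, "output", flat, "flatten"],
        # 2. Clear DocInfo and XMP (written as an incremental update).
        ["exiftool", "-all=", "-overwrite_original", flat],
        # 3. Full rewrite last, so earlier revisions are not carried over.
        ["qpdf", "--linearize", flat, dst],
    ]


def run_pipeline(src: str, dst: str) -> None:
    """Run each step and stop at the first failure."""
    for command in sanitize_commands(src, dst):
        subprocess.run(command, check=True)
```

Keeping the command list separate from the runner makes the order easy to review and log before anything touches a file. Remember to delete the intermediate file afterwards.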

Authoring-time prevention

The best removal is never having the data in the first place:

  • Use a generic author name in your authoring tool's settings ("Acme Corporation" instead of personal names)
  • Avoid embedding attachments unless intentional
  • Disable revision tracking during final export
  • Use "Save As" for final versions to flatten incremental updates
  • Set a "publication" template in your authoring tool with anonymized defaults
  • Disable JavaScript in PDF export options unless needed
  • Use standard fonts to avoid font-based fingerprinting
  • Strip annotations before final export

Building these into your authoring workflow prevents the leak rather than fixing it later.

When hidden data is intentional

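Not all hidden data is bad. Some is essential: the OCR text layer that makes a scan searchable, the tags and reading order that support accessibility, bookmarks that aid navigation, and digital signatures that prove who signed and when.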

Sanitization should preserve the intentional hidden data and remove the unintentional or sensitive parts.

Specific case studies

The signed-and-edited contract. A PDF is signed and then someone edits it (perhaps just adding annotations). Edits to a signed PDF are normally saved as incremental updates, so the signed revision stays in the file and tools can extract it. For contracts, this means an edited version may still expose the pre-edit original. Flatten and re-sign after every change, or use a signing service that prevents tampering.

The redacted research paper. Redactions implemented as visual covers leave the underlying text in the content stream. Anyone with a PDF tool can extract the redacted content. See PDF redaction failures.

The "anonymous" survey result. A survey PDF compiled from individual responses may have annotations or metadata showing who submitted what. Strip carefully.

The "external" press release. Internal author name leaks through metadata. A 30-second exiftool pass would have caught it.

The recovered manuscript. A PDF with multiple incremental updates can be walked back through revisions, revealing draft content the author thought was removed.
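
Walking back through revisions needs very little machinery. Because incremental updates are appended, every prefix of the file that ends at a %%EOF marker is a candidate earlier version. The sketch below returns those prefixes; revision_prefixes is a name made up for this example. It assumes %%EOF does not occur inside stream data, and for linearized files the first prefix is not a usable revision.

```python
EOF_MARKER = b"%%EOF"


def revision_prefixes(pdf_bytes: bytes) -> list[bytes]:
    """Return one candidate file per %%EOF marker, oldest first.

    Each candidate is the original file truncated just after a marker,
    which is how the file looked before later updates were appended.
    """
    candidates = []
    search_from = 0
    while True:
        position = pdf_bytes.find(EOF_MARKER, search_from)
        if position == -1:
            break
        end = position + len(EOF_MARKER)
        candidates.append(pdf_bytes[:end] + b"\n")
        search_from = end
    return candidates
```

Writing each candidate to disk and opening it in a reader shows what earlier drafts contained. If the function returns more than one candidate for a file you are about to publish, rewrite the file before distributing it.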

Common gotchas

"Save" vs "Save As". Save appends an incremental update; old data remains. Save As flattens to a clean file. Use Save As for any anonymization or finalization.

Reader does not show but tool does. A piece of content might not display in a reader (e.g., text under an image, content in a hidden layer) but be visible to extraction tools. Sanitize, do not just hide.

Sanitize that does not sanitize. Some "sanitize" features only clean what the tool's authors thought of. For high-stakes content, use multiple tools and verify.

Page-level inspection misses file-level data. Looking at each page does not show metadata, attachments, or hidden objects. Use file-level tools too.

Re-saving after sanitization. If you re-save a sanitized PDF through a tool that adds metadata back (e.g., Acrobat with a different author setting), you reintroduce hidden data. Sanitize last.

Backup copies. A sanitized PDF distributed externally is one copy. The original with all its hidden data still exists in your backups, email archives, and document management system. Anonymization rarely solves the broader retention problem.

Practical recipe

For a PDF about to be published externally:

  1. Make a working copy
  2. Run Acrobat's Sanitize Document
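  3. exiftool -all= working.pdf for extra metadata cleanup (ExifTool keeps a working.pdf_original backup and records the change as a reversible incremental update)
  4. qpdf --linearize working.pdf cleaned.pdf to rewrite the file so the earlier revision, including the original metadata, is dropped (mutool clean -gggg working.pdf cleaned.pdf is an alternative; -d alone only decompresses streams)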
  5. Verify: pdftotext cleaned.pdf - | head -100 and exiftool cleaned.pdf
  6. Hex dump search for known identifiers
  7. Open in a different reader for visual verification
  8. Distribute cleaned.pdf; keep original.pdf in internal archive
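
The identifier search in step 6 can be automated. This sketch checks the raw bytes and every stream that inflates as Flate data, looking for each identifier in both UTF-8 and UTF-16BE, since PDF text strings may use either. The function names are hypothetical. It will not find identifiers in hex-encoded strings, in encrypted files, in streams using other filters, or in page text drawn with custom font encodings, so treat a clean result as one signal among several.

```python
import re
import zlib

# Stream bodies sit between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def searchable_chunks(pdf_bytes: bytes) -> list[bytes]:
    """Return the raw file plus every stream body that inflates as zlib data."""
    chunks = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data (image, other filter, etc.)
    return chunks


def find_identifiers(pdf_bytes: bytes, identifiers: list[str]) -> set[str]:
    """Return the identifiers that appear anywhere in the searchable chunks."""
    chunks = searchable_chunks(pdf_bytes)
    found = set()
    for identifier in identifiers:
        needles = (identifier.encode("utf-8"), identifier.encode("utf-16-be"))
        if any(needle in chunk for chunk in chunks for needle in needles):
            found.add(identifier)
    return found
```

A typical identifier list holds usernames, internal hostnames, project code names, and file-share paths. An empty result set for cleaned.pdf is the outcome to aim for.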

Takeaway

Hidden data in PDFs is a real and frequently leaked problem. Metadata, attachments, annotations, old revisions, hidden layers, JavaScript, and embedded objects all carry information beyond what shows on the page. Comprehensive sanitization requires multiple tools and verification. The strongest defense is authoring with sanitization defaults: generic author names, no embedded files, and no incremental updates for finals. For browser-based metadata stripping and annotation removal, Docento.app handles the operations without uploading to third-party services. For specific topics, see how to strip metadata from PDF, PDF redaction failures, and how to anonymize PDF documents.
