What Metadata Does a PDF Contain?

A PDF hides metadata in two places—the Info dictionary and an XMP stream—exposing your name, software, and edit history. Here is what is inside.

Photo by Kampus Production on Pexels

TL;DR: A PDF stores metadata in two parallel systems. The older one is the Document Information Dictionary—an object holding Title, Author, Subject, Keywords, Creator, Producer, CreationDate, and ModDate. The newer one is an embedded XMP packet, an XML metadata stream defined by ISO 16684-1 that mirrors those fields and adds a document identity and edit history through xmpMM:DocumentID, xmpMM:InstanceID, and xmpMM:History. Beyond both, a PDF can carry hidden data its author never sees: earlier revisions left behind by incremental saves, text under "redaction" boxes, embedded files, form data, annotations stamped with names and timestamps, and full EXIF inside embedded photos. Author and Producer fields routinely expose a real name, an organization, and the exact software and operating system that made the file. The only reliable fix is to sanitize the document, not just delete the visible properties.

Open almost any PDF you have made—an invoice, a résumé, a contract, a scanned form—and it is quietly carrying a record of who made it, with what software, and when. None of that shows on the page. It lives in structured fields inside the file, written automatically by whatever program exported the PDF. Most people never look, which is exactly why these fields leak so much. Below we walk through what a PDF actually stores, where it hides, and how to clear it before you send a document out into the world.

What counts as "metadata" in a PDF?

A PDF is not one flat document. It is a collection of objects (pages, fonts, images, and dictionaries) that refer to one another and are indexed by a cross-reference table. Two kinds of object are dedicated to metadata, and a PDF can contain either or both.

The first is the Document Information Dictionary, usually just called the Info dictionary. It is the original metadata mechanism from the PDF specification (now ISO 32000), a simple set of key/value pairs attached to the document. The second is an XMP metadata stream, a block of XML embedded as its own object inside the file. The two were never fully merged, so a single PDF often holds the same author name in both places, plus extra fields in the XMP that the Info dictionary cannot express. Understanding a PDF's privacy footprint means reading both. If you have worked through how photo formats layer their data, this two-system design will feel familiar—it is the same tension we covered in IPTC vs EXIF vs XMP, where multiple standards describe overlapping facts about the same file.

What is in the Document Information Dictionary?

The Info dictionary holds a short, well-defined list of fields, and each one tells a story. Title, Subject, and Keywords are author-supplied and often left as whatever the source application guessed—frequently the first line of the document or an internal project name. Author is the field that causes the most trouble: most software fills it from your operating system account name or your license profile, so a document you think of as anonymous can ship with your full legal name attached.

Then come the two software fingerprints. Creator records the application that produced the original document—"Microsoft Word for Microsoft 365," "Pages," "LaTeX with hyperref." Producer records the engine that actually wrote the PDF—"Adobe PDF Library," "macOS Quartz PDFContext," "Skia/PDF," "Microsoft: Print To PDF." Together they reveal your toolchain and, by extension, your operating system and often its version.

Finally there are the timestamps. CreationDate and ModDate use a specific PDF date string in the form D:YYYYMMDDHHmmSS followed by a UTC offset like -05'00'. That offset is not harmless: it records how far the authoring machine's clock was set from UTC, which narrows the document to a band of time zones. A file claiming to come from London that carries a -08'00' offset has quietly contradicted itself. These fields are the document equivalent of the camera and timestamp data we describe in What Is EXIF Data?: automatic, invisible, and far more specific than people expect.
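
To make the date format concrete, here is a minimal Python sketch that parses a PDF date string and keeps its offset. The function name parse_pdf_date and the sample value are ours, for illustration only. The sketch assumes all six date and time components are present, although the PDF specification also allows shorter forms.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset such as -05'00' or Z.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)

def parse_pdf_date(value: str) -> datetime:
    """Parse a full-precision PDF date string into a datetime.

    If the string carries no offset, the result is naive (tzinfo is None).
    """
    m = _PDF_DATE.match(value.strip())
    if m is None:
        raise ValueError(f"not a full PDF date string: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    zulu, sign, off_h, off_m = m.groups()[6:]
    tz = None
    if zulu:
        tz = timezone.utc
    elif sign:
        delta = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(-delta if sign == "-" else delta)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)

stamp = parse_pdf_date("D:20240312091530-05'00'")
print(stamp.isoformat())  # 2024-03-12T09:15:30-05:00
```

The parsed offset is the part worth checking: if it disagrees with where a document claims to come from, the file has already told on itself.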

A person reviewing a document on a laptop—the same File, Properties panel that displays a PDF's title and author also exposes the software and timestamps written into it.

Photo by Thirdman on Pexels

Why does a PDF carry a second, XML-based metadata system?

The Info dictionary is simple but limited—it cannot represent structured, namespaced, or extensible data. So Adobe introduced the Extensible Metadata Platform (XMP), now standardized as ISO 16684-1, and PDFs adopted it as an embedded metadata stream marked /Type /Metadata /Subtype /XML. Inside that stream is RDF/XML drawing on several namespaces at once: Dublin Core (dc:title, dc:creator, dc:description) for descriptive fields, the pdf: schema for Producer and Keywords, the xmp: schema for CreateDate and ModifyDate, and—most revealing—the XMP Media Management schema, xmpMM:.

That last namespace is where document identity lives. xmpMM:DocumentID and xmpMM:InstanceID assign persistent identifiers to a document and to each saved instance of it, and xmpMM:History can log a sequence of edit events. The practical consequence is that two PDFs you believe are unrelated can be tied together by a shared DocumentID, and a "final" file can carry a breadcrumb trail of every version that came before it. PDF/A, the archival profile, actually requires XMP and insists its values match the Info dictionary—which is why archival documents are often the most thoroughly fingerprinted. This embedded-stream approach is conceptually close to the external sidecar files photographers use, except the data rides inside the document instead of beside it. The XMP standard, ISO 16684-1, documents these fields if you want the schema-level detail.
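
As a rough illustration of how readable this data is, the Python sketch below pulls an XMP packet out of raw PDF bytes and lists its fields using only the standard library. It assumes the packet is stored uncompressed and wrapped in an x:xmpmeta element, which is common but not guaranteed. The function names and the sample bytes are hypothetical, not part of any real tool.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
    "xmpMM": "http://ns.adobe.com/xap/1.0/mm/",
    "stEvt": "http://ns.adobe.com/xap/1.0/sType/ResourceEvent#",
}
URI_TO_PREFIX = {uri: prefix for prefix, uri in NS.items()}
RDF = "{" + NS["rdf"] + "}"

def extract_xmp_packet(pdf_bytes: bytes) -> Optional[bytes]:
    """Return the first x:xmpmeta element found in the raw bytes, or None."""
    match = re.search(rb"<x:xmpmeta\b.*?</x:xmpmeta>", pdf_bytes, re.DOTALL)
    return match.group(0) if match else None

def _qname(tag: str) -> Optional[str]:
    """Turn '{uri}local' into 'prefix:local' for the namespaces listed in NS."""
    if not tag.startswith("{"):
        return None
    uri, local = tag[1:].split("}", 1)
    prefix = URI_TO_PREFIX.get(uri)
    return f"{prefix}:{local}" if prefix else None

def _named(pairs) -> dict:
    """Keep only (name, value) pairs from known, non-RDF namespaces."""
    out = {}
    for raw_name, value in pairs:
        name = _qname(raw_name)
        if name and not name.startswith("rdf:"):
            out[name] = value
    return out

def _li_value(li):
    # Structured items (such as xmpMM:History events) carry namespaced
    # attributes or child elements; plain items just carry text.
    struct = _named(li.attrib.items())
    struct.update(_named((c.tag, (c.text or "").strip()) for c in li))
    return struct or (li.text or "").strip()

def read_xmp_fields(packet: bytes) -> dict:
    """Collect properties written as attributes or as child elements."""
    root = ET.fromstring(packet)
    fields = {}
    for desc in root.iter(RDF + "Description"):
        fields.update(_named(desc.attrib.items()))
        for child in desc:
            name = _qname(child.tag)
            if name is None or name.startswith("rdf:"):
                continue
            items = [_li_value(li) for li in child.iter(RDF + "li")]
            fields[name] = items if items else (child.text or "").strip()
    return fields

# Made-up bytes standing in for a real file.
SAMPLE = b"""%PDF-1.7 ...other objects...
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
    xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
    xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
    pdf:Producer="Example PDF Library"
    xmpMM:DocumentID="uuid:11111111-2222-3333-4444-555555555555">
   <dc:creator><rdf:Seq><rdf:li>Jane Example</rdf:li></rdf:Seq></dc:creator>
   <xmpMM:History><rdf:Seq>
    <rdf:li stEvt:action="saved" stEvt:when="2024-03-12T09:15:30-05:00"/>
   </rdf:Seq></xmpMM:History>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
...more objects..."""

for name, value in read_xmp_fields(extract_xmp_packet(SAMPLE)).items():
    print(name, "=", value)
```

Comparing the xmpMM:DocumentID values from two files is enough to test whether they descend from the same original.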

The hidden data most people never think about

Document properties are only the part with a label. PDFs accumulate other material that is easy to forget and hard to see.

Earlier revisions. PDFs support incremental updates: when you save a change, many tools append the new objects to the end of the file rather than rewriting it, leaving the previous version's objects still present and recoverable. A redacted or "corrected" document can contain its own earlier draft.

Text under redaction boxes. This is the classic, costly mistake. Drawing a black rectangle over a name in most viewers adds a graphic on top of the page; the characters underneath stay in the content stream and can be selected, copied, or pulled out with a script. Real redaction removes the underlying content, and the PDF Association has documented how often organizations get this wrong in published filings.

Embedded files and form data. A PDF can carry attachments, AcroForm field values, JavaScript, and annotations. Comments and review markups are stamped with author names and timestamps, and their text is sometimes more candid than the document itself.

Embedded image EXIF. A photo dropped into a PDF often keeps its own EXIF block, including the camera model and, on phone images, GPS coordinates. The document inherits the privacy exposure of every picture inside it, which is why we always recommend cleaning images first, exactly as in How to Strip EXIF Data From a Photo.
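
A quick way to triage a file for the hidden material described above is to scan its raw bytes for telltale markers. The Python sketch below is a heuristic under stated assumptions, not a parser: it misses anything stored inside compressed object streams, and a linearized ("fast web view") PDF can contain two %%EOF markers without ever having been edited. The signal names, the function name, and the sample bytes are ours.

```python
import re

# Byte patterns that hint at each kind of hidden content.
SIGNALS = {
    "embedded files": rb"/EmbeddedFiles?\b",
    "form fields (AcroForm)": rb"/AcroForm\b",
    "JavaScript": rb"/(?:JavaScript|JS)\b",
    "annotations": rb"/Annots\b",
    "EXIF block in an image stream": rb"Exif\x00\x00",
}

def scan_hidden_signals(pdf_bytes: bytes) -> dict:
    """Count occurrences of each marker in the raw file bytes."""
    report = {
        name: len(re.findall(pattern, pdf_bytes))
        for name, pattern in SIGNALS.items()
    }
    # Each save normally ends with %%EOF, so more than one marker
    # suggests that incremental updates left earlier revisions behind.
    report["end-of-file markers"] = pdf_bytes.count(b"%%EOF")
    return report

# Made-up bytes imitating a file that was saved twice.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /AcroForm 5 0 R >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\n%%EOF\n"
    b"6 0 obj\n<< /Type /Page /Annots [7 0 R] >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R /Prev 9 >>\n%%EOF\n"
)

for name, count in scan_hidden_signals(sample).items():
    print(f"{name}: {count}")
```

A non-zero count is a prompt to look closer, not proof of a leak, and a zero count does not prove the file is clean.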

A wooden stamp resting on a document—PDF software stamps its own marks invisibly, recording the author, the producer application, and a full edit history inside the file.

Photo by Markus Spiske on Pexels

Does a PDF reveal your location?

Usually not in the direct way a phone photo does. A PDF has no native GPS field equivalent to a JPEG's GPS Info IFD, so a document exported from a word processor will not announce where you stood. There are two important exceptions. First, any embedded photo can carry its own geotag, so a PDF full of phone snapshots can leak coordinates through its images—the same mechanism we trace in How GPS Coordinates Get Embedded in Photos. Second, the time-zone offset in the date strings narrows you to a region even when no coordinates exist. Specialized "geospatial PDF" maps can also embed coordinate reference systems, but that is a niche format most people never produce.

Why PDF metadata is a privacy issue

The risk is rarely a single field; it is the accumulation. An Author name plus a Producer string plus a creation timestamp can confirm authorship of a document someone meant to circulate anonymously—a problem for whistleblowers, journalists, and anyone submitting a sensitive form. Software fingerprints reveal the systems an organization runs, which is reconnaissance value for an attacker. Revision history and uncleared redactions have exposed settlement figures, informant names, and unredacted personal data in court and government filings that were technically "redacted." The Electronic Frontier Foundation has long flagged document metadata as an under-appreciated leak, and the pattern is consistent: the data that hurts you is the data you did not know was there. We covered the broader stakes in The Privacy Risks Hiding in Your Photo Metadata, and documents raise the same questions with higher-stakes content.

How to see what is inside your PDF

Before you remove anything, look. In Adobe Acrobat, File → Properties (Ctrl/Cmd+D) shows the Info dictionary fields, and the "Additional Metadata" button exposes the XMP. On macOS, Preview's Tools → Show Inspector reveals the core properties under the information tab. On the command line, utilities such as Poppler's pdfinfo (add the -meta flag to print the raw XMP packet) and ExifTool will dump both the Info dictionary and the XMP so you can read exactly what an automated extractor would see. Looking first tells you which fields are populated and whether earlier revisions or embedded files are present, which you need to know before deciding how aggressively to sanitize.
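
If you would rather script the check, the Info dictionary can often be read straight from the file bytes. The Python sketch below is a simplified illustration, not a full PDF parser. It assumes the trailer and the Info object are stored uncompressed, handles only basic escapes and one level of nested parentheses, and treats non-Unicode strings as Latin-1. The function name read_info_dictionary and the sample bytes are hypothetical.

```python
import re

INFO_KEYS = (b"Title", b"Author", b"Subject", b"Keywords",
             b"Creator", b"Producer", b"CreationDate", b"ModDate")

# A literal string: backslash escapes plus one level of nested parentheses.
_LITERAL = rb"\(((?:\\.|[^\\()]|\([^()]*\))*)\)"
# A hexadecimal string such as <FEFF004A0061>.
_HEX = rb"<([0-9A-Fa-f\s]*)>"

def _decode(raw: bytes) -> str:
    # UTF-16BE strings start with a byte order mark; otherwise fall back
    # to Latin-1 as a rough stand-in for PDFDocEncoding.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    return raw.decode("latin-1")

def read_info_dictionary(pdf_bytes: bytes) -> dict:
    """Return the Info fields of the most recent revision, if visible."""
    refs = re.findall(rb"/Info\s+(\d+)\s+(\d+)\s+R", pdf_bytes)
    if not refs:
        return {}
    num, gen = refs[-1]  # the last trailer describes the newest revision
    bodies = re.findall(
        rb"(?<!\d)" + num + rb"\s+" + gen + rb"\s+obj\b(.*?)endobj",
        pdf_bytes, re.DOTALL,
    )
    if not bodies:
        return {}  # the object may live in a compressed object stream
    info = {}
    for key in INFO_KEYS:
        m = re.search(
            rb"/" + key + rb"\s*(?:" + _LITERAL + rb"|" + _HEX + rb")",
            bodies[-1],
        )
        if m is None:
            continue
        if m.group(1) is not None:
            raw = re.sub(rb"\\([()\\])", rb"\1", m.group(1))
        else:
            digits = re.sub(rb"\s+", b"", m.group(2)).decode("ascii")
            if len(digits) % 2:
                digits += "0"  # a missing final digit is read as zero
            raw = bytes.fromhex(digits)
        info[key.decode("ascii")] = _decode(raw)
    return info

# Made-up bytes standing in for a real file.
sample = (
    b"%PDF-1.4\n"
    b"4 0 obj\n<< /Title (Q3 Invoice) /Author (Jane Example)\n"
    b"   /Creator (Example Writer)\n"
    b"   /Producer (Example PDF Library \\(build 7\\))\n"
    b"   /CreationDate (D:20240312091530-05'00') >>\nendobj\n"
    b"trailer\n<< /Size 5 /Root 1 0 R /Info 4 0 R >>\n%%EOF\n"
)

for field, value in read_info_dictionary(sample).items():
    print(f"{field}: {value}")
```

An empty result from this sketch does not mean the file has no Info dictionary, only that it is not stored in the plain form the sketch can see, so confirm with a dedicated tool before trusting it.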

How to remove metadata from a PDF

Deleting the visible properties is not enough on its own, because the XMP stream and any leftover revisions can survive a field-by-field wipe. The reliable path is to sanitize the document. In Acrobat, "Remove Hidden Information" and "Sanitize Document" clear metadata, embedded content, scripts, and prior revisions in one pass; equivalent "export to a fresh PDF" or "save as optimized" flows in other editors flatten the file and drop most accumulated data. For sensitive text, use a true Redact tool rather than a drawn box. And because a clean document can still be undermined by a dirty photo, strip your images before you place them.

That last step is where we can help directly. Metadata Cleaner removes EXIF, XMP, and other metadata from JPEG, PNG, and WebP images entirely in your browser—nothing uploads to a server—so the pictures you embed in a PDF arrive without camera data or GPS coordinates attached. We are honest about the boundary: our tool cleans the images, video, and audio you share, not the PDF container itself, so pair it with your editor's sanitize feature for the document and you have covered both layers. After sanitizing, always reopen the file and re-check the properties to confirm the fields are genuinely empty.

A PDF is more candid than it looks. It records who made it, with what, and when, keeps copies of what you thought you deleted, and inherits the metadata of everything inside it. Knowing the two systems—the Info dictionary and the XMP stream—is what lets you clear them with confidence instead of trusting that an empty-looking page is actually empty.

Clean the images before they go in the document. Try Metadata Cleaner free and strip EXIF from your photos in seconds, right in the browser.