TL;DR: An ElevenLabs export embeds data across three separate layers, and only one of them is what people mean by "metadata." The container layer is thin: an ID3v2.4 tag area on an MP3, a RIFF LIST/INFO chunk on a WAV, usually holding little beyond encoder strings. The second layer is a C2PA provenance manifest, a signed JUMBF box recording origin and AI authorship, either embedded in the file or soft-bound to it by a content hash stored on a server. The third is a signal-domain watermark written into the audio samples, which the AI Speech Classifier is reported to detect at roughly 99% precision and 80% recall on unmodified clips. A metadata strip clears the container and an embedded manifest. It does not reach the signal watermark or a soft-bound record.
When we wrote our overview of ElevenLabs exports, the short version was: the file looks empty in Properties, but it is not anonymous. This post is the long version — a layer-by-layer accounting of what actually gets embedded, where each piece lives, and which layers survive a metadata strip and which do not.
We will work outward from the audio: the container tags first, then the provenance manifest, then the watermark baked into the waveform. The order matters, because the further in you go, the harder the mark is to remove and the less a metadata cleaner can do about it.
What does an ElevenLabs export actually embed?
ElevenLabs does not hand you an exotic file. Whether you pull audio from the web app or the API, you get a standard container — an MP3, a WAV, an Opus stream, or raw PCM. The API names the exact shape through an output_format parameter written as codec_sample_rate_bitrate, and it defaults to mp3_44100_128: an MP3 at a 44.1 kHz sample rate and 128 kbps.
The available formats are gated by plan. On the free and Starter tiers, MP3 is available at 22.05 kHz/32 kbps and at 44.1 kHz in 32, 64, 96, and 128 kbps; 192 kbps unlocks on Creator and above. PCM is offered at 8, 16, 22.05, and 24 kHz on lower tiers, while 44.1 kHz PCM, along with the WAV most editors expect, requires Pro or higher, per the ElevenLabs audio-format documentation. There are also µ-law and A-law options aimed at telephony pipelines.
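As a rough illustration of the codec_sample_rate_bitrate naming, the sketch below splits an output_format string into its parts. The OutputFormat class and parse_output_format function are hypothetical names for this example, not part of any ElevenLabs SDK, and the two-part form without a bitrate (for PCM-style formats) is an assumption about how those strings are written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OutputFormat:
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]  # None when the string carries no bitrate


def parse_output_format(value: str) -> OutputFormat:
    """Split a codec_sample_rate_bitrate string into typed parts."""
    parts = value.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected output_format: {value!r}")
    codec = parts[0]
    sample_rate_hz = int(parts[1])
    # Assumption: formats without a bitrate (e.g. raw PCM) use only two parts.
    bitrate_kbps = int(parts[2]) if len(parts) == 3 else None
    return OutputFormat(codec, sample_rate_hz, bitrate_kbps)


default = parse_output_format("mp3_44100_128")
print(default.codec, default.sample_rate_hz, default.bitrate_kbps)
```

Parsing the string this way makes it easy to log which shape of file a pipeline requested before inspecting the container itself.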
That ordinariness sets the trap. Because the container is standard, the file-level metadata is standard too — and standard, for programmatically generated audio, means sparse. So the file reads as clean in a Properties panel while still carrying two layers a Properties panel never shows. The container is the layer everyone inspects and the least interesting one to a detector.
What's in the container metadata layer, byte for byte?
Start with the MP3. An ID3v2.4 tag sits at the front of the file, before the first MPEG audio frame. It opens with a 10-byte tag header — the ASCII identifier ID3, a version pair (04 00 for 2.4), a flags byte, and a synchsafe size — followed by a run of frames. Each frame has its own 10-byte header: a four-character frame ID, a size, and flags, then the data.
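The 10-byte tag header described above can be read in a few lines. This is a minimal sketch, not a full ID3 parser: read_id3v2_header and Id3v2Header are names invented for the example, and the sample bytes are hand-built rather than taken from a real export. The one non-obvious step is the synchsafe size, which uses only the low 7 bits of each of its four bytes.

```python
from typing import NamedTuple, Optional


class Id3v2Header(NamedTuple):
    major: int
    revision: int
    flags: int
    tag_size: int  # bytes of tag data that follow the 10-byte header


def decode_synchsafe(raw: bytes) -> int:
    """Decode a 4-byte synchsafe integer (7 useful bits per byte)."""
    if len(raw) != 4 or any(b & 0x80 for b in raw):
        raise ValueError("not a 4-byte synchsafe integer")
    return (raw[0] << 21) | (raw[1] << 14) | (raw[2] << 7) | raw[3]


def read_id3v2_header(data: bytes) -> Optional[Id3v2Header]:
    """Return the parsed tag header, or None when no ID3v2 tag is present."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    return Id3v2Header(data[3], data[4], data[5], decode_synchsafe(data[6:10]))


# Hand-built header: "ID3", version 04 00, flags 00, synchsafe size 00 00 02 01.
header = read_id3v2_header(b"ID3\x04\x00\x00\x00\x00\x02\x01" + bytes(257))
print(header)  # Id3v2Header(major=4, revision=0, flags=0, tag_size=257)
```

The size bytes 00 00 02 01 decode to 257 rather than 513 because each byte contributes 7 bits, not 8.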
The frame IDs are the part worth knowing, because they are what a reader actually sees. TIT2 holds the title, TPE1 the lead performer, TALB an album, TCON a genre, and COMM a free-text comment. The ones that matter for a generated file are the technical frames: TSSE, which names the software or hardware and settings used for encoding, and TENC, which names the person or organization that encoded the file. A generated export rarely fills the descriptive frames, but an encoder string in TSSE can still identify the exact tool and build that produced the file. That makes it a correlation handle that survives renaming and survives clearing the visible title. Many MP3s also carry a second tag: a fixed 128-byte ID3v1 block at the very end of the file, starting with TAG, which is why a field you "cleared" in a player can reappear from a tag you never saw.
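To see which frames a file actually carries, a reader can walk the tag frame by frame and then check the last 128 bytes for an ID3v1 block. The sketch below assumes an ID3v2.4 tag with no extended header and no unsynchronisation; the function names and the "ExampleEncoder 1.0" string are made up for illustration and do not reflect what ElevenLabs writes.

```python
from typing import Iterator, Tuple


def to_synchsafe(n: int) -> bytes:
    return bytes([(n >> 21) & 0x7F, (n >> 14) & 0x7F, (n >> 7) & 0x7F, n & 0x7F])


def from_synchsafe(raw: bytes) -> int:
    return (raw[0] << 21) | (raw[1] << 14) | (raw[2] << 7) | raw[3]


def iter_id3v24_frames(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (frame_id, payload) for each frame in a leading ID3v2.4 tag."""
    if len(data) < 10 or data[:3] != b"ID3" or data[3] != 4:
        return
    if data[5] & 0x40:
        return  # extended header present; not handled in this sketch
    end = min(10 + from_synchsafe(data[6:10]), len(data))
    pos = 10
    while pos + 10 <= end:
        frame_id = data[pos:pos + 4]
        if frame_id[0] == 0:
            break  # zero bytes mark padding after the last frame
        size = from_synchsafe(data[pos + 4:pos + 8])  # v2.4 frame sizes are synchsafe
        yield frame_id.decode("latin-1"), data[pos + 10:pos + 10 + size]
        pos += 10 + size


def has_id3v1(data: bytes) -> bool:
    """True when the file ends with a 128-byte block starting with TAG."""
    return len(data) >= 128 and data[-128:-125] == b"TAG"


def text_frame(frame_id: bytes, text: str) -> bytes:
    body = b"\x03" + text.encode("utf-8")  # 0x03 selects UTF-8 text encoding
    return frame_id + to_synchsafe(len(body)) + b"\x00\x00" + body


# Hand-built sample: one TSSE frame, padding, fake audio bytes, ID3v1 trailer.
frames = text_frame(b"TSSE", "ExampleEncoder 1.0") + bytes(16)
sample = (b"ID3\x04\x00\x00" + to_synchsafe(len(frames)) + frames
          + b"\xff\xfb\x90\x00" + bytes(64)
          + b"TAG" + bytes(125))

for frame_id, payload in iter_id3v24_frames(sample):
    print(frame_id, payload[1:].decode("utf-8"))
print("ID3v1 trailer:", has_id3v1(sample))
```

Checking both ends of the file is the point: a tool that only reads the front tag will miss the trailing block entirely.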
A WAV is a RIFF container, and its metadata lives in an optional LIST chunk carrying an INFO block: INAM for a name, IART for artist, ISFT for the software that wrote the file. Broadcast-WAV files add a bext chunk. As with the MP3, generated audio tends to fill almost none of this — but ISFT is the same kind of quiet software fingerprint as TSSE. None of it is hard to remove; all of it is editable, and that is exactly why the container is the layer a metadata cleaner can fully own. We covered the MP3 case end to end in our walkthrough on removing MP3 metadata.
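The RIFF layout is simple enough to walk by hand: each chunk is a four-byte ID, a little-endian 32-bit size, and a payload padded to an even length. The following sketch lists the INFO fields of a WAV under those rules. The function names and the sample file, including its ISFT value, are invented for the example.

```python
import struct
from typing import Dict, Iterator, Tuple


def iter_chunks(data: bytes, start: int, end: int) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (chunk_id, payload) for each RIFF chunk in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        yield chunk_id, data[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)  # payloads are padded to an even length


def read_wav_info(data: bytes) -> Dict[str, str]:
    """Collect INFO fields (INAM, IART, ISFT, ...) from a WAV's LIST chunk."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    info: Dict[str, str] = {}
    for chunk_id, payload in iter_chunks(data, 12, len(data)):
        if chunk_id == b"LIST" and payload[:4] == b"INFO":
            for field_id, value in iter_chunks(payload, 4, len(payload)):
                text = value.rstrip(b"\x00").decode("utf-8", "replace")
                info[field_id.decode("ascii")] = text
    return info


def chunk(chunk_id: bytes, payload: bytes) -> bytes:
    pad = b"\x00" if len(payload) & 1 else b""
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad


# Hand-built WAV: fmt, four silent 16-bit samples, and a LIST/INFO with ISFT.
fmt = chunk(b"fmt ", struct.pack("<HHIIHH", 1, 1, 44100, 88200, 2, 16))
audio = chunk(b"data", bytes(8))
info = chunk(b"LIST", b"INFO" + chunk(b"ISFT", b"ExampleEncoder 1.0\x00"))
body = b"WAVE" + fmt + audio + info
sample = b"RIFF" + struct.pack("<I", len(body)) + body

print(read_wav_info(sample))  # {'ISFT': 'ExampleEncoder 1.0'}
```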
What does the C2PA provenance manifest embed?
The second layer is provenance, and it is a different kind of object. C2PA — the Coalition for Content Provenance and Authenticity — defines a Content Credential: a structured manifest of assertions about how a file was made. The assertions can record the originating tool, the time, the edits applied afterward, and a statement that the content was AI-generated. The whole manifest is then cryptographically signed, so tampering is detectable rather than silent. The current C2PA specification covers audio containers, including MP3 and WAV.
Mechanically, the manifest is stored as a JUMBF box, the ISO "JPEG Universal Metadata Box Format," reused across media types. There are two ways it attaches to your audio, and the distinction is the entire reason this layer confuses people. An embedded manifest sits inside the file alongside the samples, tied to them by what C2PA calls a hard binding: a cryptographic hash over the asset's bytes. A soft binding stores the manifest elsewhere and links it back to the asset through an identifier derived from the content (described here as a content hash; the C2PA specification frames it as a fingerprint or an invisible watermark), so a file with nothing embedded can still be matched to a server-side record by recomputing that identifier. The Content Authenticity Initiative maintains the open tooling that reads and writes these manifests, so inspecting one is routine rather than specialist work.
Here is the practical consequence. An embedded manifest can be removed, because it is data attached to the file; a cleaner that walks the JUMBF box drops it the same way it drops an ID3 tag. A soft-bound record cannot be removed from your copy, because it was never in your copy — it lives on a server keyed to the file's hash, and the only way to break that link is to change the bytes the hash was computed over. Industry coverage frames the embedded-manifest approach as the transitional, easier-to-strip signal, with the durable signal moving into the waveform itself. Our C2PA primer goes deeper on the manifest format, and the audio-specific case is here.
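A toy model makes the soft-binding point concrete. The sketch below assumes the server keys its record to a SHA-256 of the audio payload only, with tags excluded; that is a simplification chosen for illustration, since real C2PA soft bindings are typically perceptual fingerprints or watermarks rather than exact byte hashes. The registry dict stands in for a server, and every name and value here is hypothetical.

```python
import hashlib
from typing import Dict, Optional


def audio_payload(mp3: bytes) -> bytes:
    """Return the MP3 bytes without a leading ID3v2 tag or trailing ID3v1 block."""
    start, end = 0, len(mp3)
    if len(mp3) >= 10 and mp3[:3] == b"ID3":
        size = (mp3[6] << 21) | (mp3[7] << 14) | (mp3[8] << 7) | mp3[9]
        start = 10 + size + (10 if mp3[5] & 0x10 else 0)  # footer flag adds 10 bytes
    if end - start >= 128 and mp3[end - 128:end - 125] == b"TAG":
        end -= 128
    return mp3[start:end]


def content_hash(mp3: bytes) -> str:
    """Assumed binding: SHA-256 over the audio payload, ignoring tags."""
    return hashlib.sha256(audio_payload(mp3)).hexdigest()


def lookup(registry: Dict[str, str], mp3: bytes) -> Optional[str]:
    return registry.get(content_hash(mp3))


# Hand-built file: 16-byte ID3v2 tag, fake audio bytes, ID3v1 trailer.
frames = b"\xff\xfb\x90\x00" + bytes(range(200))
tagged = b"ID3\x04\x00\x00\x00\x00\x00\x10" + bytes(16) + frames + b"TAG" + bytes(125)

registry = {content_hash(tagged): "example-manifest-001"}  # stand-in for a server

stripped = audio_payload(tagged)                        # what a tag strip leaves
edited = stripped[:-1] + bytes([stripped[-1] ^ 0x01])   # one audio byte changed

print(lookup(registry, stripped))  # example-manifest-001
print(lookup(registry, edited))    # None
```

Under this assumption, removing tags leaves the lookup intact, and only a change to the audio bytes themselves breaks the match. A fingerprint-based binding would be more tolerant still, surviving small edits that defeat an exact hash.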
What does the watermark embed, and why can't a metadata cleaner remove it?
The third layer is not a tag at all. ElevenLabs embeds an inaudible watermark into the audio signal, meaning the actual sample values of the waveform, and runs a free AI Speech Classifier that listens for it and reports whether a clip came from the platform. On unmodified audio straight from an export, that classifier is reported at about 99% precision and 80% recall, and it analyzes roughly the first minute of a clip. One real limitation worth stating: it does not reliably classify audio generated with the Eleven v3 model. Together with the 80% recall, that makes a positive result more meaningful than a negative one. ElevenLabs describes the broader approach on its safety page, and the company is among those moving toward Google's SynthID-style signal watermarking as the durable mark: the part designed to outlast tag removal.
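The two reported figures answer different questions, and a little arithmetic shows why a "yes" carries more weight than a "no." The sketch below turns precision and recall into expected counts for a hypothetical batch of 1,000 platform clips. The helper name is invented, and the false-positive figure only holds for a test mix like the one the precision was measured on, which is an assumption.

```python
from typing import Tuple


def expected_counts(true_clips: int, precision: float, recall: float) -> Tuple[float, float, float]:
    """Expected (true positives, false negatives, false positives)."""
    true_pos = recall * true_clips       # platform clips the classifier flags
    false_neg = true_clips - true_pos    # platform clips it misses
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    false_pos = true_pos * (1 - precision) / precision
    return true_pos, false_neg, false_pos


tp, fn, fp = expected_counts(1000, precision=0.99, recall=0.80)
print(f"flagged correctly: {tp:.0f}")
print(f"missed:            {fn:.0f}")
print(f"flagged wrongly:   {fp:.1f}")
```

At these rates, roughly 200 of 1,000 platform clips pass unflagged, while only about 8 flags in the same batch point at the wrong source.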
This is the single most important line in the post: the metadata lives in the header; the watermark lives in the signal. A metadata cleaner parses the ID3 frames and RIFF chunks and rewrites them. It does not — and a tool that preserves your audio cannot — repaint the samples. So a metadata strip leaves the watermark exactly where it was, and the classifier still recognizes the file. What degrades a signal-domain watermark is signal-domain change: re-encoding at a low bitrate, heavy compression, pitch-shifting, time-stretching, or layering noise. Those edits work because they alter the samples — and that is also why they change how the clip sounds. This is the same split we drew in metadata versus audio watermarks.
If the goal is "make this file unrecognizable to a detector," metadata removal is the wrong instrument, and we would rather say so than imply otherwise. Suno users run into the same wall: the detectable signal is in the waveform, not the tag.
So what does stripping metadata actually reach?
Here is the clean accounting, because this is exactly where overpromising erodes trust.
A metadata strip removes the ID3v2 frames and any trailing ID3v1 block on an MP3, the RIFF INFO fields on a WAV, and an embedded C2PA manifest if the tool walks the JUMBF box. After a full clean, exiftool or a C2PA verifier run against the file finds the tag area emptied and no embedded manifest.
It does not reach four things. The signal-domain watermark stays, because it is in the samples. A soft-bound C2PA record stays, because it lives on a server keyed to your file's hash. An acoustic fingerprint — the statistical signature a classifier derives from the audio itself — stays, because it needs no metadata to begin with. And any server-side record ElevenLabs or a distributor holds about the generation stays, because it was never in the file; distributor pipelines also carry separate DDEX flags that travel outside the audio entirely. So a stripped ElevenLabs file is genuinely cleaner at the container level and genuinely still detectable at the signal level. Both are true at once, and we explained the layer model that this rests on in what AI-generated files contain.
How do you clean the container layer?
For the layer you legitimately control, the workflow is short. Open Metadata Cleaner in any browser, drag in the MP3 or WAV, and clean it. JavaScript in the tab parses the container, drops the ID3v2 and ID3v1 tags or the RIFF INFO chunk, removes an embedded C2PA JUMBF box if one is present, and writes a fresh file with the audio samples untouched, so the clip sounds identical and the bitrate is unchanged. The file never leaves your device — nothing is uploaded and nothing is logged.
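The "samples untouched" claim is easy to check for the WAV case. The tool itself runs as JavaScript in the browser; the Python below is only a sketch of the same idea, not its actual code. It rebuilds a WAV without its LIST/INFO chunk, rewrites the RIFF size, and confirms the data chunk is byte-identical. The function names and the sample file are made up for the example.

```python
import struct


def strip_wav_info(data: bytes) -> bytes:
    """Rebuild a WAV without LIST/INFO chunks; all other chunks are copied verbatim."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    kept = []
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        total = 8 + size + (size & 1)  # header + payload + pad byte if odd
        is_info = chunk_id == b"LIST" and data[pos + 8:pos + 12] == b"INFO"
        if not is_info:
            kept.append(data[pos:pos + total])
        pos += total
    body = b"WAVE" + b"".join(kept)
    return b"RIFF" + struct.pack("<I", len(body)) + body  # corrected RIFF size


def data_chunk(data: bytes) -> bytes:
    """Return the raw sample bytes from the data chunk."""
    pos = 12
    while pos + 8 <= len(data):
        (size,) = struct.unpack_from("<I", data, pos + 4)
        if data[pos:pos + 4] == b"data":
            return data[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)
    raise ValueError("no data chunk")


def chunk(chunk_id: bytes, payload: bytes) -> bytes:
    pad = b"\x00" if len(payload) & 1 else b""
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad


# Hand-built WAV: four 16-bit samples plus a LIST/INFO chunk with ISFT.
samples = struct.pack("<4h", 0, 1000, -1000, 0)
fmt = chunk(b"fmt ", struct.pack("<HHIIHH", 1, 1, 44100, 88200, 2, 16))
info = chunk(b"LIST", b"INFO" + chunk(b"ISFT", b"ExampleEncoder 1.0\x00"))
body = b"WAVE" + fmt + info + chunk(b"data", samples)
original = b"RIFF" + struct.pack("<I", len(body)) + body

cleaned = strip_wav_info(original)
print(b"ISFT" in original, b"ISFT" in cleaned)          # True False
print(data_chunk(cleaned) == data_chunk(original))      # True
```

Because the data chunk is copied byte for byte, anything encoded in the samples, including a signal-domain watermark, comes through the clean unchanged.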
To check the result, run exiftool yourfile.mp3 and confirm the tag block is gone, or drop the file into a C2PA verifier and confirm no embedded manifest remains. Then, if detection matters to your use case, hold the honest limit in mind: the signal watermark is still there, and only an actual audio edit would change that.
FAQ
What does an ElevenLabs file embed that a metadata cleaner can remove?
The container tags — ID3v2 and ID3v1 frames on an MP3, the RIFF INFO chunk on a WAV — and an embedded C2PA manifest if one is present. These are data attached beside the audio, so a cleaner rewrites the file without them and leaves the samples intact.
What does it embed that a cleaner cannot remove?
The inaudible signal-domain watermark, a soft-bound C2PA record stored server-side by content hash, and any acoustic fingerprint or generation record held off-file. None of these live in the metadata, so clearing tags does not touch them.
Does removing metadata stop the AI Speech Classifier from detecting the file?
No. The classifier reads the watermark and acoustic characteristics in the audio signal, not the tags. It is reported at about 99% precision and 80% recall on unmodified audio and does not reliably classify Eleven v3 output. Only editing the audio degrades that signal, and that changes how the clip sounds.
Does an ElevenLabs export contain my account or voice ID in a tag?
From what is observable in exported files, the container metadata is sparse — typically encoder and software strings rather than personal identifiers. Whatever is present is editable. The identifying layer is the signal watermark, not a text frame.
What is the difference between the C2PA manifest and the watermark?
The manifest is a signed JUMBF record of the file's origin; an embedded one can be removed, but a soft-bound one is matched server-side by the file's hash. The watermark is encoded into the waveform itself and is meant to survive tag removal and copying.
Can I clean an ElevenLabs file on my phone?
Yes. The tool is browser-only and runs on mobile Safari, Chrome on Android, and Firefox mobile. Drag-and-drop becomes tap-to-pick, and the cleaned file lands in Files or Downloads.
If you want a clean container before delivering or publishing an ElevenLabs clip, that part is straightforward. Try Metadata Cleaner free — drop the file, clean it, done. Just go in knowing which of the three layers you are clearing, and which two only an audio edit would touch.