How to Extract File Metadata with Python Libraries

Python has more than ten mature libraries for reading metadata from files, each specialized for different formats. This guide compares the leading options, from Pillow for image EXIF data to pypdf for PDF properties, Mutagen for audio tags, and pymediainfo for video streams. You will learn how to install each library, write extraction scripts, and build a pipeline that handles mixed file types at scale.

Fast.io Editorial Team 12 min read

What File Metadata Actually Contains

Every file carries two layers of information: the content you see (text, pixels, waveforms) and the properties you don't. That second layer is metadata. It includes creation timestamps, author names, GPS coordinates, encoding details, color profiles, bit rates, page counts, and dozens of other attributes that vary by format.

Python metadata extraction is the process of programmatically reading these embedded properties so you can catalog, audit, filter, or transform files without opening each one manually. The definition is straightforward: use a library to parse the file's internal structure and return its properties as Python objects.

Why does this matter in practice? A few common scenarios:

  • Digital asset management: Cataloging thousands of images by camera model, resolution, and capture date
  • Data pipeline preprocessing: Filtering training datasets by file format, dimensions, and encoding before ingestion
  • Compliance and forensics: Auditing documents for authorship, revision history, and hidden properties
  • Media workflows: Sorting video files by codec, frame rate, and duration for transcoding queues

The challenge is that no single library handles every format. JPEG EXIF data uses a different internal structure than PDF document properties or MP3 ID3 tags. That is why Python's ecosystem has specialized libraries for each file family, plus a few universal wrappers that delegate to external tools.

Choosing the Right Library for Each File Type

Before diving into code, here is a comparison of the most widely used Python metadata extraction libraries. Each one targets a specific set of file formats, and the right choice depends on what you are processing.

Pillow (PIL Fork)

  • File types: JPEG, PNG, TIFF, WebP, BMP, GIF
  • Metadata scope: EXIF tags, ICC color profiles, image dimensions, format details
  • Install: pip install Pillow
  • Best for: Image-heavy workflows where you need camera settings, GPS data, or orientation tags

pypdf (successor to PyPDF2)

  • File types: PDF
  • Metadata scope: Author, title, subject, creator application, creation/modification dates, page count, encryption status
  • Install: pip install pypdf
  • Best for: Document processing pipelines that need to sort, filter, or audit PDF collections. Note that PyPDF2 is deprecated as of 2023. All active development has moved to pypdf.

Mutagen

  • File types: MP3, FLAC, OGG, AAC, WMA, AIFF, MP4 audio, WavPack, Opus, and more
  • Metadata scope: ID3v1/v2 tags, Vorbis comments, APEv2 tags, MP4 atoms (title, artist, album, track number, bitrate, duration)
  • Install: pip install mutagen
  • Best for: Music libraries, podcast archives, and audio processing pipelines. Zero external dependencies.

pymediainfo

  • File types: MP4, MKV, AVI, MOV, WebM, and most video/audio containers
  • Metadata scope: Video codec, resolution, frame rate, bit rate, audio channels, duration, container format
  • Install: pip install pymediainfo (requires MediaInfo library on the system)
  • Best for: Video transcoding pipelines, quality assurance checks, and media asset management

python-docx

  • File types: .docx (Word documents)
  • Metadata scope: Author, title, subject, keywords, creation/modification dates, revision count (the core document properties; custom properties are not exposed through the library's core_properties API)
  • Install: pip install python-docx
  • Best for: Document governance, legal discovery, and office file auditing

PyExifTool

  • File types: 400+ formats (everything ExifTool supports)
  • Metadata scope: All tags ExifTool can read, including maker notes, XMP, IPTC, and ICC profiles
  • Install: pip install pyexiftool (requires ExifTool CLI on the system)
  • Best for: Universal extraction when you need a single tool for mixed file types. Runs ExifTool in batch mode for efficiency.

Extracting Image Metadata with Pillow

Pillow is the most common starting point because image metadata extraction is the most frequent use case. The library reads EXIF data embedded by cameras and editing software, then exposes it through a clean Python API.

Here is a working example that reads EXIF tags from a JPEG file:

from PIL import Image
from PIL.ExifTags import TAGS

def extract_image_metadata(filepath):
    # The context manager closes the file handle once the data is read
    with Image.open(filepath) as img:
        info = {
            "format": img.format,
            "mode": img.mode,
            "size": img.size,
        }

        # getexif() returns an empty mapping when the image has no EXIF block
        exif_data = img.getexif()
        for tag_id, value in exif_data.items():
            # Fall back to the numeric ID for tags Pillow has no name for
            tag_name = TAGS.get(tag_id, tag_id)
            info[tag_name] = value

    return info

metadata = extract_image_metadata("photo.jpg")
print(metadata.get("Make"))       # Camera manufacturer
print(metadata.get("Model"))      # Camera model
print(metadata.get("DateTime"))   # File change timestamp (capture time is DateTimeOriginal in the Exif sub-IFD)

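The getexif() method returns a dictionary-like object where keys are integer tag IDs. The TAGS mapping converts those integers to human-readable names like "Make", "Model", and "DateTime". Camera settings such as "ExposureTime" and "FNumber" are stored in the nested Exif IFD rather than the top-level directory, so read them with exif_data.get_ifd(0x8769) and map the keys through TAGS in the same way.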

For GPS data, you need an extra step. GPS info is stored in a nested IFD (Image File Directory) tag:

from PIL import Image
from PIL.ExifTags import GPSTAGS

def extract_gps(filepath):
    with Image.open(filepath) as img:
        exif = img.getexif()
        # 0x8825 is the GPSInfo pointer tag; an image without GPS data
        # yields an empty mapping here
        gps_ifd = exif.get_ifd(0x8825)

    gps_data = {}
    for key, val in gps_ifd.items():
        tag_name = GPSTAGS.get(key, key)
        gps_data[tag_name] = val

    return gps_data
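
The values in gps_data are still raw: latitude and longitude arrive as (degrees, minutes, seconds) triples with a separate hemisphere reference. Below is a minimal sketch of converting them to signed decimal degrees. The helper names dms_to_decimal and gps_to_decimal are illustrative and not part of Pillow, and the code assumes each component converts cleanly with float(), which should hold for the rational values Pillow returns.

```python
def dms_to_decimal(dms, ref):
    """Convert a (degrees, minutes, seconds) triple to signed decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # South latitudes and West longitudes are negative by convention
    return -decimal if ref in ("S", "W") else decimal


def gps_to_decimal(gps_data):
    """Return (lat, lon) from a dict keyed by GPSTAGS names, or None if incomplete."""
    required = ("GPSLatitude", "GPSLatitudeRef", "GPSLongitude", "GPSLongitudeRef")
    if any(key not in gps_data for key in required):
        return None
    lat = dms_to_decimal(gps_data["GPSLatitude"], gps_data["GPSLatitudeRef"])
    lon = dms_to_decimal(gps_data["GPSLongitude"], gps_data["GPSLongitudeRef"])
    return lat, lon
```

Returning None for incomplete data keeps the caller from mistaking a missing coordinate for (0.0, 0.0), which is a valid location.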

A few things to watch for when working with Pillow EXIF extraction:

  • Not all images contain EXIF data. Screenshots, web-optimized PNGs, and programmatically generated images typically have none.
  • Some tags contain binary data or custom manufacturer extensions that need format-specific parsing.
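  • Pillow's EXIF editing support is basic. You can change values on the object returned by getexif() and pass it to save() through the exif argument, but for extensive tag editing, PyExifTool or the piexif library is a better fit.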
Fastio features

Extract and query file metadata without writing parsers

Fast.io Metadata Views turn documents into a searchable, filterable database. Describe the fields you need in plain English, and structured data appears automatically. 50 GB free, no credit card required.

Reading PDF, Audio, and Video Metadata

Once you move beyond images, each file family requires its own library. Here is how to handle the three most common non-image formats.

PDF Metadata with pypdf

The pypdf library (the maintained successor to PyPDF2) reads document properties from the PDF's internal metadata dictionary:

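from pypdf import PdfReader

def extract_pdf_metadata(filepath):
    reader = PdfReader(filepath)
    info = {"pages": len(reader.pages)}

    # reader.metadata is None when the PDF has no info dictionary
    meta = reader.metadata
    if meta is None:
        return info

    def iso(value):
        # Dates are parsed into datetime objects, or None when absent
        return value.isoformat() if value is not None else None

    info.update({
        "author": meta.author,
        "creator": meta.creator,
        "producer": meta.producer,
        "subject": meta.subject,
        "title": meta.title,
        "created": iso(meta.creation_date),
        "modified": iso(meta.modification_date),
    })
    return info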

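One common pitfall: some PDFs store metadata in XMP format rather than the standard info dictionary. pypdf exposes XMP data through reader.xmp_metadata, which is None when the file has no XMP packet. If you need more complete XMP access, including editing, consider the pikepdf library.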

Audio Metadata with Mutagen

Mutagen handles the widest range of audio formats with zero external dependencies. It auto-detects the tag format and provides a consistent interface:

from mutagen import File as MutagenFile

def extract_audio_metadata(filepath):
    audio = MutagenFile(filepath, easy=True)
    if audio is None:
        return {"error": "unsupported format"}

    info = {
        "length_seconds": round(audio.info.length, 2),
        "bitrate": getattr(audio.info, "bitrate", None),
        "sample_rate": getattr(audio.info, "sample_rate", None),
        "channels": getattr(audio.info, "channels", None),
    }

    if audio.tags:
        for key, value in audio.tags.items():
            info[key] = value[0] if isinstance(value, list) else value

    return info

The easy=True parameter maps format-specific tag names to common keys like "title", "artist", "album", and "date". Without it, you get raw tag identifiers, such as the ID3 frames "TIT2" (title) and "TALB" (album).

Video Metadata with pymediainfo

pymediainfo wraps the MediaInfo library, which parses container formats and codec details from video files:

from pymediainfo import MediaInfo

def extract_video_metadata(filepath):
    media = MediaInfo.parse(filepath)
    result = {}

    for track in media.tracks:
        if track.track_type == "General":
            result["format"] = track.format
            result["duration_ms"] = track.duration
            result["file_size"] = track.file_size
        elif track.track_type == "Video":
            result["video_codec"] = track.format
            result["width"] = track.width
            result["height"] = track.height
            result["frame_rate"] = track.frame_rate
            result["bit_rate"] = track.bit_rate
        elif track.track_type == "Audio":
            result["audio_codec"] = track.format
            result["audio_channels"] = track.channel_s
            result["audio_sample_rate"] = track.sampling_rate

    return result

pymediainfo requires the MediaInfo shared library to be installed on the system. On macOS, use brew install mediainfo. On Ubuntu, use apt install libmediainfo0v5. On Windows, the pymediainfo wheel bundles MediaInfo automatically.


Building a Multi-Format Extraction Pipeline

Real-world projects rarely deal with a single file type. A data pipeline might receive a folder containing PDFs, JPEGs, MP4s, and Word documents all mixed together. Here is a pattern for routing files to the correct extractor based on their extension:

import os
from pathlib import Path

# The extract_* functions come from the earlier sections.
# extract_docx_metadata must also be defined before this table is built
# (for example, with python-docx).
EXTRACTORS = {
    ".jpg": extract_image_metadata,
    ".jpeg": extract_image_metadata,
    ".png": extract_image_metadata,
    ".tiff": extract_image_metadata,
    ".pdf": extract_pdf_metadata,
    ".mp3": extract_audio_metadata,
    ".flac": extract_audio_metadata,
    ".ogg": extract_audio_metadata,
    ".mp4": extract_video_metadata,
    ".mkv": extract_video_metadata,
    ".mov": extract_video_metadata,
    ".docx": extract_docx_metadata,
}

def extract_metadata(filepath):
    ext = Path(filepath).suffix.lower()
    extractor = EXTRACTORS.get(ext)

    if extractor is None:
        return {"file": filepath, "error": "no extractor for this format"}

    try:
        result = extractor(filepath)
        result["file"] = filepath
        result["extension"] = ext
        return result
    except Exception as e:
        # One corrupt file should not stop the whole run
        return {"file": filepath, "error": str(e)}

def process_directory(directory):
    results = []
    for root, dirs, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            results.append(extract_metadata(filepath))
    return results

This dispatcher pattern has a few advantages. Adding support for a new format means writing one extractor function and adding one entry to the dictionary. Error handling is centralized, so a corrupt file does not crash the entire pipeline. And the output is a flat list of dictionaries that you can serialize to JSON, load into a dataframe, or push to a database.
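
The dispatch table refers to extract_docx_metadata, which the earlier sections do not show. Here is one possible sketch built on python-docx's core_properties object. The set of returned keys and the lazy import are choices made for this illustration, not something the library prescribes.

```python
def extract_docx_metadata(filepath):
    """Read the core document properties of a .docx file via python-docx."""
    # Imported lazily so the rest of the pipeline still loads when
    # python-docx is not installed (an assumption about optional dependencies).
    from docx import Document

    props = Document(filepath).core_properties

    def iso(value):
        # created/modified are datetime objects, or None when unset
        return value.isoformat() if value is not None else None

    return {
        "author": props.author,
        "title": props.title,
        "subject": props.subject,
        "keywords": props.keywords,
        "last_modified_by": props.last_modified_by,
        "revision": props.revision,
        "created": iso(props.created),
        "modified": iso(props.modified),
    }
```

Converting the dates to ISO strings keeps the result directly serializable to JSON, in line with the other extractors.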

For larger volumes, consider these optimizations:

  • Parallel processing: Use concurrent.futures.ProcessPoolExecutor to extract metadata from multiple files simultaneously. Metadata extraction is CPU-bound for media files, so multiple processes outperform threads.
  • Caching: If you process the same files repeatedly, hash the file content and cache results. Metadata does not change unless the file changes.
  • Streaming output: For directories with tens of thousands of files, write results to a JSONL file incrementally instead of building a list in memory.
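
The three optimizations above can be combined in one function. The sketch below is a minimal version under stated assumptions: extract_to_jsonl and file_digest are hypothetical names, the cache is a plain dict keyed by SHA-256 content hash (a persistent store could take its place), and extract is any function that maps a path to a dict, such as the extract_metadata dispatcher.

```python
import hashlib
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def file_digest(filepath, chunk_size=1 << 20):
    """SHA-256 of the file content, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(filepath, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_to_jsonl(directory, output_path, extract, cache=None,
                     executor_cls=ProcessPoolExecutor, max_workers=None):
    """Write one JSON line per file under `directory`; return the file count.

    `extract` must be picklable when a process pool is used.
    `cache` maps content hashes to earlier results; hits skip extraction.
    """
    cache = {} if cache is None else cache
    paths = [str(p) for p in sorted(Path(directory).rglob("*")) if p.is_file()]

    with open(output_path, "w", encoding="utf-8") as out:
        def emit(path, result):
            # default=str stringifies values json cannot encode (bytes, rationals)
            out.write(json.dumps({**result, "file": path}, default=str) + "\n")

        pending = []
        for path in paths:
            digest = file_digest(path)
            if digest in cache:
                emit(path, cache[digest])      # cache hit: no extraction needed
            else:
                pending.append((path, digest))

        if pending:
            with executor_cls(max_workers=max_workers) as pool:
                # map() yields results in input order, so each line is
                # written as soon as its result is available
                results = pool.map(extract, [path for path, _ in pending])
                for (path, digest), result in zip(pending, results):
                    cache[digest] = result
                    emit(path, result)

    return len(paths)
```

Keep the output file outside the scanned directory so it is not picked up as input. When calling this from a script with the default process pool, run it under an if __name__ == "__main__": guard, since worker processes may re-import the main module.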

When PyExifTool Makes More Sense

If your pipeline handles a wide variety of formats and you want a single extraction path, PyExifTool simplifies the routing problem. It delegates to ExifTool, which supports over 400 file formats:

import exiftool

def extract_with_exiftool(filepaths):
    with exiftool.ExifToolHelper() as et:
        return et.get_metadata(filepaths)

The tradeoff is a system dependency. ExifTool must be installed and available on the PATH. For containerized deployments, that means adding it to your Dockerfile. For serverless functions, the dependency is harder to manage, which is where the individual Python-native libraries have an advantage.
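
Because the ExifTool binary is an external dependency, it can help to check for it at startup and fall back to the per-format extractors when it is missing. A small sketch using only the standard library follows. The function names and the two backend labels are illustrative assumptions.

```python
import shutil


def exiftool_available(executable="exiftool"):
    """Return True when an ExifTool binary can be found on the PATH."""
    return shutil.which(executable) is not None


def choose_backend():
    """Pick an extraction strategy name based on what the host provides."""
    return "exiftool" if exiftool_available() else "per-format"
```

A pipeline could call choose_backend() once and route files either to extract_with_exiftool or to the extension-based dispatcher.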

Scaling Metadata Extraction with Workspace Tools

Writing extraction scripts works well for one-off tasks and small pipelines. But when you are managing thousands of files across a team, or when agents need to extract and query metadata programmatically, a workspace-level tool removes the need to maintain custom code.

Local Python scripts have a few limitations at scale. You need to handle file storage, access control, result persistence, and schema changes yourself. If a colleague needs to query the extracted metadata, they either run the script again or you build a shared database. When file formats change or new columns are needed, someone has to update the code and reprocess everything.

Fast.io's Metadata Views take a different approach to structured extraction. Instead of writing extraction logic per format, you describe the fields you want in plain English. The system designs a typed schema (text, integer, decimal, boolean, URL, JSON, date), matches files in the workspace, and populates a sortable, filterable spreadsheet. Adding a new column does not require reprocessing existing files.

This is particularly useful for teams that combine programmatic extraction with human review. You might use Python scripts to pre-process and upload files to a workspace, then use Metadata Views to extract business-specific fields that require AI interpretation, like contract counterparties from legal PDFs or coverage limits from insurance documents.

For agent-driven workflows, Metadata Views are accessible through the Fast.io MCP server, so an AI agent can create views, trigger extraction, and query results without custom extraction code. The combination of Python libraries for technical metadata (codecs, dimensions, bitrates) and workspace tools for semantic metadata (dates, names, dollar amounts) covers both sides of the extraction problem.

Other cloud-based alternatives exist. AWS Textract handles document extraction, Google Document AI processes forms and invoices, and Azure AI Document Intelligence covers similar territory. Fast.io differentiates by combining extraction with workspace features like file versioning, granular permissions, and Intelligence Mode for RAG-powered search, all accessible on a free plan with 50 GB storage and 5,000 monthly credits.


Frequently Asked Questions

What Python library extracts metadata from files?

It depends on the file type. Pillow handles image EXIF data, pypdf reads PDF document properties, Mutagen extracts audio tags from MP3/FLAC/OGG files, pymediainfo parses video codec and stream details, and python-docx reads Word document properties. For a universal solution, PyExifTool wraps the ExifTool CLI to support over 400 formats from a single library.

How do you extract EXIF data with Python?

Use the Pillow library. Open the image with Image.open(), call getexif() on the image object, then iterate over the returned dictionary. Map integer tag IDs to readable names using PIL.ExifTags.TAGS. For GPS coordinates specifically, access the GPS IFD with exif.get_ifd(0x8825) and map those keys with PIL.ExifTags.GPSTAGS.

How do you read PDF metadata in Python?

Use the pypdf library (the maintained successor to PyPDF2). Create a PdfReader object, then access reader.metadata to get properties like author, title, creation date, and page count. For PDFs with XMP metadata, the pikepdf library provides more complete access.

What is the best Python library for video metadata?

pymediainfo is the most comprehensive option. It wraps the MediaInfo library and extracts codec details, resolution, frame rate, bit rate, audio channels, and container format from most video formats including MP4, MKV, AVI, and MOV. It requires the MediaInfo system library to be installed alongside the Python package.

Is PyPDF2 still maintained?

No. PyPDF2 was deprecated in 2023 and merged back into pypdf (lowercase, no version number). All active development happens in the pypdf project. Migrating is straightforward since the API is nearly identical. Replace "from PyPDF2 import" with "from pypdf import" in most cases.

Can Python extract metadata from multiple file types at once?

Yes, by building a dispatcher that routes files to the correct library based on extension. Map extensions like .jpg to Pillow, .pdf to pypdf, .mp3 to Mutagen, and .mp4 to pymediainfo. Alternatively, PyExifTool handles most formats through a single interface, though it requires ExifTool as a system dependency.
