How do I extract metadata from DICOM files?

Use Python with the pydicom library. Call pydicom.dcmread() to load a file, then access tags by keyword (ds.PatientName, ds.Modality) or by hexadecimal tag number (ds[0x0010, 0x0010]). For batch processing, use the stop_before_pixels=True flag to skip pixel data and speed up extraction across thousands of files.

What metadata is stored in DICOM images?

DICOM files contain metadata at four hierarchical levels. Patient-level tags include name, ID, and date of birth. Study-level tags cover the examination date, description, and referring physician. Series-level tags describe the imaging modality, body part, and acquisition parameters like repetition time and slice thickness. Instance-level tags record image dimensions, pixel spacing, and position within the series.

How do I anonymize DICOM metadata?

Start with the DICOM de-identification profile in PS3.15 Annex E, which specifies actions for over 300 PHI-containing tags. For scripted pipelines, use the deid library with plain-text recipe files that define remove, blank, replace, or date-shift (jitter) actions for each tag. Always remove private (vendor-specific) tags, which may contain PHI in non-standard locations. Check for burned-in pixel text in ultrasound and secondary capture images using OCR-based tools.

What tools read DICOM file headers?

pydicom (Python library) is the most common programmatic option. dcmdump from the DCMTK suite works at the command line. Desktop viewers like 3D Slicer, Horos, and RadiAnt include metadata inspection panels. Cloud options include Google Cloud Healthcare API and Microsoft Azure DICOM Service, both of which expose DICOM metadata as JSON through DICOMweb REST APIs.

What is the difference between DICOM metadata and EXIF data?

EXIF data is embedded in consumer image formats (JPEG, TIFF) and records camera settings like exposure, focal length, and GPS coordinates. DICOM metadata is specific to medical imaging and records clinical information including patient demographics, imaging modality, acquisition parameters, and study context. DICOM tags are far more extensive, with over 4,000 standard-defined fields compared to a few hundred EXIF tags.

How many tags does the DICOM standard define?

The DICOM standard defines over 4,000 public data elements in its data dictionary (PS3.6). These are organized by group number, covering patient information (group 0010), study and series identification (groups 0008 and 0020), image pixel parameters (group 0028), and acquisition settings (group 0018). Equipment vendors can also define private tags for proprietary data, which adds thousands more in practice.

Can I extract DICOM metadata without loading pixel data?

Yes. In pydicom, pass stop_before_pixels=True to dcmread(). This reads only the metadata header, which is typically a few kilobytes, and skips the pixel data, which can be hundreds of kilobytes to megabytes per file. This makes batch metadata extraction across large archives practical on standard hardware.

DICOM Metadata Extraction for Medical Imaging in 2026

What DICOM Metadata Is and Why Researchers Need It

DICOM (Digital Imaging and Communications in Medicine) is the universal standard for storing and transmitting medical images. Every CT scan, MRI, X-ray, ultrasound, and PET image produced by modern imaging equipment gets saved as a DICOM file. What makes DICOM different from a regular image format like JPEG or PNG is the metadata header: a structured block of tagged fields that describes everything about the image, the patient, the equipment, and the clinical context.

The DICOM standard defines over 4,000 metadata tags organized into groups. These tags cover patient demographics (name, ID, date of birth), study information (referring physician, study date, accession number), series parameters (modality, body part, imaging protocol), and image-level details (pixel spacing, slice thickness, window center and width). Every tag has a unique identifier, a pair of hexadecimal numbers like (0010,0010) for Patient Name or (0028,0010) for image Rows.
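
To make the tag notation concrete, here is a minimal sketch using only the Python standard library. It relies on two facts about the format: a tag is a 16-bit group number plus a 16-bit element number, and odd group numbers are reserved for private (vendor) elements. The helper names (format_tag, pack_tag, unpack_tag, is_private) are hypothetical and exist only for this illustration.

```python
def format_tag(group: int, element: int) -> str:
    """Render a (group, element) pair in the usual (GGGG,EEEE) notation."""
    return f"({group:04X},{element:04X})"


def pack_tag(group: int, element: int) -> int:
    """Combine the two 16-bit halves into a single 32-bit tag number."""
    return (group << 16) | element


def unpack_tag(tag: int) -> tuple[int, int]:
    """Split a 32-bit tag number back into (group, element)."""
    return tag >> 16, tag & 0xFFFF


def is_private(group: int) -> bool:
    """Odd group numbers are reserved for private (vendor) elements."""
    return group % 2 == 1


print(format_tag(0x0010, 0x0010))     # Patient Name
print(hex(pack_tag(0x0028, 0x0010)))  # Rows as one 32-bit number
```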

For clinical IT staff, DICOM metadata powers PACS (Picture Archiving and Communication System) routing and worklist management. But a growing audience needs DICOM metadata for a different reason: research. Data science teams building medical imaging AI models need to extract acquisition parameters to normalize datasets, pull patient demographics for cohort analysis, verify imaging protocols for quality control, and strip protected health information before sharing data across institutions.

Most DICOM guides target radiologists or PACS administrators. This guide is written for researchers, data engineers, and ML teams who need to programmatically extract, filter, and anonymize DICOM metadata at scale.

The DICOM Tag Hierarchy

DICOM organizes metadata in a four-level hierarchy that mirrors how medical imaging works in practice: a patient visits a facility, undergoes a study (examination), which contains one or more series (image sequences), each made up of individual instances (images).

Patient Level

Tags in group (0010,xxxx) describe the patient:

PatientName (0010,0010): The patient's full name
PatientID (0010,0020): Facility-assigned identifier
PatientBirthDate (0010,0030): Date of birth
PatientSex (0010,0040): Biological sex
PatientAge (0010,1010): Age at time of study

These fields carry protected health information (PHI) and require careful handling in research workflows.

Study Level

Tags in groups (0008,xxxx) and (0020,xxxx) describe the examination:

StudyInstanceUID (0020,000D): Globally unique study identifier
StudyDate (0008,0020): When the examination occurred
StudyDescription (0008,1030): Clinical description of the exam
ReferringPhysicianName (0008,0090): Ordering physician
AccessionNumber (0008,0050): Tracking number from the ordering system

The StudyInstanceUID is the primary key for linking all images from a single examination.

Series Level

Series-level tags describe a specific imaging sequence within a study:

SeriesInstanceUID (0020,000E): Unique series identifier
Modality (0008,0060): Imaging type (CT, MR, US, CR, DX, PT)
BodyPartExamined (0018,0015): Anatomical region
SeriesDescription (0008,103E): Protocol name or sequence description
ProtocolName (0018,1030): Acquisition protocol identifier

For MRI studies, series-level tags also include pulse sequence parameters like RepetitionTime (0018,0080), EchoTime (0018,0081), and FlipAngle (0018,1314). CT studies include KVP (0018,0060), ExposureTime (0018,1150), and ConvolutionKernel (0018,1210).

Instance Level

Instance tags describe individual images:

SOPInstanceUID (0008,0018): Unique image identifier
InstanceNumber (0020,0013): Position in the series
Rows (0028,0010) and Columns (0028,0011): Image dimensions
PixelSpacing (0028,0030): Physical distance between pixel centers
SliceThickness (0018,0050): Depth of each image slice
BitsAllocated (0028,0100): Storage bits per pixel

Understanding this hierarchy matters for extraction pipelines. When you process a folder of DICOM files, you are not working with a flat list of images. You are reconstructing a tree of patients, studies, series, and instances. The UID fields at each level are the keys that let you group files correctly.
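
As a rough sketch of that reconstruction, the function below groups flat per-file records into a nested patient, study, series tree. It assumes each record is a dict with patient_id, study_uid, series_uid, and sop_uid keys; the function name build_hierarchy and the sample UIDs are made up for illustration.

```python
def build_hierarchy(records):
    """Group flat per-file records into patient -> study -> series -> [instances]."""
    tree = {}
    for rec in records:
        if "error" in rec:
            continue  # skip files that could not be parsed
        series = (
            tree.setdefault(rec["patient_id"], {})
            .setdefault(rec["study_uid"], {})
            .setdefault(rec["series_uid"], [])
        )
        series.append(rec["sop_uid"])
    return tree


# Made-up records standing in for real extraction output
SAMPLE_RECORDS = [
    {"patient_id": "P1", "study_uid": "1.2.3", "series_uid": "1.2.3.1", "sop_uid": "1.2.3.1.1"},
    {"patient_id": "P1", "study_uid": "1.2.3", "series_uid": "1.2.3.1", "sop_uid": "1.2.3.1.2"},
    {"patient_id": "P1", "study_uid": "1.2.3", "series_uid": "1.2.3.2", "sop_uid": "1.2.3.2.1"},
    {"patient_id": "P2", "study_uid": "1.2.4", "series_uid": "1.2.4.1", "sop_uid": "1.2.4.1.1"},
]

tree = build_hierarchy(SAMPLE_RECORDS)
for patient_id, studies in tree.items():
    for study_uid, series in studies.items():
        print(patient_id, study_uid, {uid: len(sops) for uid, sops in series.items()})
```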

Extracting DICOM Metadata with Python and pydicom

pydicom is the standard Python library for reading and writing DICOM files. It parses the binary DICOM format into Python objects with attribute-style access to every tag. Install it with pip:

pip install pydicom

Reading a Single File

The basic workflow is straightforward. Load a file with dcmread, then access tags by keyword or by their hexadecimal group/element pair:

from pydicom import dcmread

ds = dcmread("scan_001.dcm")

# Access by keyword
print(ds.PatientName)
print(ds.StudyDate)
print(ds.Modality)

# Access by tag number
print(ds[0x0010, 0x0010].value)  # Patient Name
print(ds[0x0028, 0x0010].value)  # Rows
print(ds[0x0028, 0x0011].value)  # Columns

You can also iterate through every data element in a file:

for element in ds:
    if element.VR != "SQ":  # Skip sequences for readability
        print(f"{element.tag} {element.keyword}: {element.value}")

Batch Extraction Across a Directory

Research datasets typically contain thousands of DICOM files organized in nested folders. Here is a pattern for extracting metadata from an entire directory into a structured format:

import json
from pathlib import Path
from pydicom import dcmread
from pydicom.errors import InvalidDicomError

def extract_dicom_metadata(dicom_dir: Path) -> list[dict]:
    records = []
    for dcm_path in dicom_dir.rglob("*.dcm"):
        try:
            ds = dcmread(dcm_path, stop_before_pixels=True)
            record = {
                "file_path": str(dcm_path),
                "patient_id": str(getattr(ds, "PatientID", "")),
                "study_uid": str(getattr(ds, "StudyInstanceUID", "")),
                "series_uid": str(getattr(ds, "SeriesInstanceUID", "")),
                "sop_uid": str(getattr(ds, "SOPInstanceUID", "")),
                "modality": str(getattr(ds, "Modality", "")),
                "study_date": str(getattr(ds, "StudyDate", "")),
                "series_desc": str(getattr(ds, "SeriesDescription", "")),
                "rows": int(getattr(ds, "Rows", 0)),
                "columns": int(getattr(ds, "Columns", 0)),
                "slice_thickness": float(getattr(ds, "SliceThickness", 0)),
                "pixel_spacing": str(getattr(ds, "PixelSpacing", "")),
            }
            records.append(record)
        # int()/float() raise TypeError or ValueError on empty or malformed values
        except (InvalidDicomError, AttributeError, TypeError, ValueError) as e:
            records.append({
                "file_path": str(dcm_path),
                "error": str(e),
            })
    return records

The stop_before_pixels=True flag is critical for performance. It tells pydicom to read only the metadata header without loading pixel data into memory. For a 512x512 CT slice, the metadata header is a few kilobytes while pixel data is half a megabyte. When you are processing tens of thousands of files, skipping pixel data cuts memory usage and processing time dramatically.
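
The half-megabyte figure follows from simple arithmetic. This sketch assumes uncompressed pixel data stored at 16 bits per pixel, which is common for CT but not universal; pixel_data_bytes is a hypothetical helper name.

```python
def pixel_data_bytes(rows, columns, bits_allocated=16, samples_per_pixel=1, frames=1):
    """Size of uncompressed pixel data in bytes (compressed files are smaller)."""
    return rows * columns * samples_per_pixel * frames * bits_allocated // 8


size = pixel_data_bytes(512, 512)
print(size)            # bytes for one 512x512 slice at 16 bits per pixel
print(size / 2**20)    # the same size in MiB
```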

Exporting to Tabular Formats

Once you have extracted records, convert them to a DataFrame for analysis:

import pandas as pd

records = extract_dicom_metadata(Path("/data/ct_scans"))
df = pd.DataFrame(records)

# Filter to only successful extractions. The "error" column exists only
# if at least one file failed, so check for it before filtering.
df_valid = df[df["error"].isna()] if "error" in df.columns else df

# Group by study to see how many series each contains
study_summary = df_valid.groupby("study_uid").agg(
    series_count=("series_uid", "nunique"),
    image_count=("sop_uid", "count"),
    modalities=("modality", lambda x: list(x.unique())),
)

# Export for downstream use
df_valid.to_parquet("dicom_metadata.parquet", index=False)

Parquet is a good export format for DICOM metadata because it handles mixed types efficiently, supports columnar queries through tools like DuckDB, and compresses well for datasets with millions of rows.

Anonymizing DICOM Metadata for Research

Medical imaging research almost always requires de-identification. Patient names, IDs, dates of birth, and other PHI fields must be removed or replaced before data leaves the originating institution. DICOM anonymization is not as simple as deleting a few tags. PHI can appear in dozens of standard fields, in private vendor tags, and even burned into pixel data as overlays or annotations.

The DICOM De-identification Standard

The DICOM standard itself defines a de-identification profile in PS3.15 Annex E. It assigns each tag an action code, such as X (remove), Z (replace with a zero-length value), D (replace with a dummy value), C (clean), U (replace with a consistent new UID), or K (keep). The profile covers over 300 tags that may contain PHI. NEMA, the organization that publishes the standard, provides a confidentiality profile table that maps every relevant tag to its recommended action.
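
To illustrate how a table of per-tag actions can drive de-identification, here is a minimal sketch over a plain dict of keyword-to-value pairs. It uses a subset of the PS3.15 action codes (X remove, Z zero-length, D dummy, U replace UID, K keep). The tag-to-action mapping below is illustrative only and is not copied from the profile table, and removing unlisted keywords by default is an assumption made for this example.

```python
import uuid

# Illustrative mapping only; consult the PS3.15 table for the real assignments
ILLUSTRATIVE_ACTIONS = {
    "PatientName": "D",
    "PatientID": "D",
    "PatientBirthDate": "X",
    "AccessionNumber": "Z",
    "StudyInstanceUID": "U",
    "Modality": "K",
}


def apply_profile(record, actions, uid_map):
    """Apply per-keyword action codes; unlisted keywords are removed (assumption)."""
    out = {}
    for keyword, value in record.items():
        action = actions.get(keyword, "X")
        if action == "K":
            out[keyword] = value
        elif action == "Z":
            out[keyword] = ""
        elif action == "D":
            out[keyword] = "ANONYMIZED"
        elif action == "U":
            # Same input UID always maps to the same replacement, so links survive
            if value not in uid_map:
                uid_map[value] = f"2.25.{uuid.uuid4().int}"
            out[keyword] = uid_map[value]
        elif action != "X":
            raise ValueError(f"unsupported action code: {action}")
    return out


uid_map = {}
original = {
    "PatientName": "Doe^Jane",
    "PatientID": "123",
    "PatientBirthDate": "19800101",
    "AccessionNumber": "A1",
    "StudyInstanceUID": "1.2.3",
    "Modality": "CT",
    "InstitutionName": "Example Hospital",
}
print(apply_profile(original, ILLUSTRATIVE_ACTIONS, uid_map))
```

Sharing one uid_map across all files in a dataset is what keeps studies and series linked after their UIDs are replaced.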

Programmatic Anonymization with pydicom

For scripted workflows, pydicom provides direct tag manipulation:

from pydicom import dcmread

ds = dcmread("original.dcm")

# Replace or remove direct identifiers
ds.PatientName = "ANONYMOUS"
ds.PatientID = "ANON_001"

if "PatientBirthDate" in ds:
    del ds.PatientBirthDate

if "ReferringPhysicianName" in ds:
    del ds.ReferringPhysicianName

# Remove all private tags (vendor-specific, may contain PHI)
ds.remove_private_tags()

# Save anonymized copy
ds.save_as("anonymized.dcm")

This approach works for small datasets but gets fragile at scale. You need to handle every PHI-containing tag explicitly, and missing one creates a compliance risk.

Using the deid Library for Reproducible Pipelines

The deid library, built on top of pydicom, uses plain-text recipe files to define anonymization rules. This makes the process auditable and reproducible:

pip install deid

A recipe file specifies actions per tag, such as REMOVE, BLANK, REPLACE, KEEP, and JITTER (for date shifting). REPLACE can take a value produced by a custom function, which is one way to implement hashing. Teams define a recipe once, version-control it, and apply it across every dataset. The recipe serves double duty as documentation for your IRB or ethics board, showing exactly which fields were modified and how.

Burned-in Pixel Data

Some DICOM images contain PHI burned directly into the pixel data, typically patient names or scan dates rendered as text overlays in ultrasound images or secondary captures. Metadata-level anonymization misses these entirely. Detecting burned-in text requires OCR or machine learning classifiers that scan the image corners and borders where overlays typically appear. Tools like the RSNA DICOM Anonymizer and CTP (Clinical Trials Processor) include pixel anonymization modules for this purpose.
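
Before running OCR on everything, metadata can help decide which files to review first. The sketch below is a triage heuristic under stated assumptions, not a detector: it flags records whose BurnedInAnnotation (0028,0301) value is YES, whose SOP Class is Secondary Capture Image Storage, or whose modality is ultrasound. The record keys (burned_in_annotation, sop_class_uid, modality) and the function name are hypothetical.

```python
# SOP Class UID for Secondary Capture Image Storage
SECONDARY_CAPTURE_SOP_CLASS = "1.2.840.10008.5.1.4.1.1.7"
HIGH_RISK_MODALITIES = {"US"}


def needs_pixel_review(record):
    """Return True when a file should be queued for burned-in text inspection."""
    if str(record.get("burned_in_annotation", "")).upper() == "YES":
        return True
    if record.get("sop_class_uid") == SECONDARY_CAPTURE_SOP_CLASS:
        return True
    # A "NO" or missing annotation flag is not trusted for high-risk modalities
    return record.get("modality") in HIGH_RISK_MODALITIES


candidates = [
    {"modality": "CT", "burned_in_annotation": "NO"},
    {"modality": "US", "burned_in_annotation": "NO"},
    {"modality": "OT", "sop_class_uid": SECONDARY_CAPTURE_SOP_CLASS},
    {"modality": "MR", "burned_in_annotation": "yes"},
]
print([needs_pixel_review(rec) for rec in candidates])
```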

Date Shifting

Research protocols often need to preserve temporal relationships between studies without revealing actual dates. Date shifting replaces real dates with consistently offset versions. If a patient had a baseline MRI on January 15, 2024 and a follow-up on April 15, 2024, shifting both dates by the same random offset (say, minus 47 days) preserves the 91-day interval while removing the real calendar dates. The deid library supports this through its JITTER action.
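
The date-shifting idea can be sketched with the standard library alone. The example assumes dates in the DICOM DA format (YYYYMMDD). Deriving the per-patient offset from a keyed hash is one possible design, shown here as an assumption rather than how any particular tool does it; shift_da and offset_for_patient are hypothetical helper names.

```python
import hashlib
from datetime import datetime, timedelta


def shift_da(da, offset_days):
    """Shift a DICOM DA string (YYYYMMDD) by a number of days."""
    shifted = datetime.strptime(da, "%Y%m%d") + timedelta(days=offset_days)
    return shifted.strftime("%Y%m%d")


def offset_for_patient(patient_id, secret, max_days=180):
    """Derive a stable per-patient offset in [-max_days, max_days] from a secret."""
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days


def days_between(da_start, da_end):
    """Number of days from one DA string to another."""
    delta = datetime.strptime(da_end, "%Y%m%d") - datetime.strptime(da_start, "%Y%m%d")
    return delta.days


baseline, follow_up = "20240115", "20240415"
shifted_baseline = shift_da(baseline, -47)
shifted_follow_up = shift_da(follow_up, -47)

# The interval is the same before and after shifting
print(days_between(baseline, follow_up), days_between(shifted_baseline, shifted_follow_up))
```

Because the offset depends only on the patient identifier and the secret, every study for the same patient shifts by the same amount, even when files are processed in separate runs.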

Tools for DICOM Metadata Viewing and Extraction

Beyond pydicom, several tools handle DICOM metadata extraction for different use cases and skill levels.

Command-Line Tools

dcmdump from the DCMTK (DICOM Toolkit) suite is the standard command-line DICOM viewer. It prints every tag in a human-readable format and supports filtering by tag group. DCMTK is written in C++ and handles large files efficiently. It is available through most Linux package managers and Homebrew on macOS.

GDCM (Grassroots DICOM) provides both a command-line interface and Python bindings. Its gdcmdump command works similarly to dcmdump but includes additional features for handling compressed transfer syntaxes and converting between DICOM encoding formats.

Desktop Applications

3D Slicer is an open-source platform for medical image visualization and analysis. Its DICOM browser reads metadata from files, organizes them by the patient/study/series hierarchy, and lets you inspect individual tags through a graphical interface. It is particularly useful for verifying extraction results visually.

Horos (macOS) and RadiAnt (Windows) are DICOM viewers with built-in metadata inspection panels. They are designed for radiologists but work well for researchers who need to spot-check files without writing code.

Cloud and API-Based Solutions

Google Cloud Healthcare API and Microsoft Azure DICOM Service provide cloud-hosted DICOM storage with RESTful APIs for metadata queries. These services implement DICOMweb, a set of REST APIs defined by the DICOM standard for storing, retrieving, and searching medical images over HTTP. They handle the complexity of parsing DICOM binary formats and expose metadata as JSON, making it accessible to teams that work primarily with web APIs rather than specialized medical imaging tools.
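
For orientation, this is roughly what consuming that JSON looks like. In the DICOM JSON model, each attribute is keyed by its eight-digit hexadecimal tag and carries a "vr" plus a "Value" array, with person names nested under "Alphabetic". The payload below is made up for this example and is not output from any specific service; first_value is a hypothetical helper.

```python
import json

# Made-up response body in the DICOM JSON model (one instance)
SAMPLE_RESPONSE = """
[
  {
    "00100010": {"vr": "PN", "Value": [{"Alphabetic": "Doe^Jane"}]},
    "00100020": {"vr": "LO", "Value": ["PAT-001"]},
    "00080060": {"vr": "CS", "Value": ["MR"]},
    "00280010": {"vr": "US", "Value": [256]},
    "00080090": {"vr": "PN"}
  }
]
"""


def first_value(instance, tag, default=None):
    """Return the first value for a tag, unwrapping person-name objects."""
    values = instance.get(tag, {}).get("Value")
    if not values:
        return default  # attribute absent or present but empty
    value = values[0]
    if isinstance(value, dict):
        return value.get("Alphabetic", default)
    return value


instances = json.loads(SAMPLE_RESPONSE)
for instance in instances:
    print(
        first_value(instance, "00100020"),
        first_value(instance, "00100010"),
        first_value(instance, "00080060"),
        first_value(instance, "00280010"),
    )
```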

For teams that need structured extraction beyond raw tag reading, platforms like Fast.io offer a different approach. Fast.io's Metadata Views let you define custom extraction schemas in natural language. Describe the fields you need extracted, such as modality, body part, acquisition date, and pixel dimensions, and the AI builds a typed schema that populates a sortable, filterable spreadsheet across all files in a workspace. This works well for teams that need queryable metadata without building custom pydicom scripts, especially when medical imaging files are part of a larger multi-format dataset that includes reports, annotations, and clinical notes alongside the DICOM images themselves.

Building a Research Pipeline from DICOM Metadata

Extracting metadata is only the first step. The real value comes from using that metadata to build curated, reproducible research datasets. Here is a practical pipeline pattern that data science teams use to go from raw DICOM archives to model-ready datasets.

Step 1: Inventory and Catalog

Start by scanning your entire DICOM archive and building a metadata catalog. Use the batch extraction approach from the pydicom section, with stop_before_pixels=True for speed. Store the catalog in Parquet or a SQLite database. This gives you a queryable index of every file before you move any pixel data.
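
If you choose SQLite for the catalog, a minimal sketch might look like the following. The table layout, index, and function name are assumptions for illustration; the record keys match the field names used in the batch extraction example, and records that carry an error key are skipped.

```python
import sqlite3


def build_catalog(records, db_path=":memory:"):
    """Load extracted records into a SQLite table keyed by SOP Instance UID."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS instances (
            sop_uid    TEXT PRIMARY KEY,
            patient_id TEXT,
            study_uid  TEXT,
            series_uid TEXT,
            modality   TEXT,
            file_path  TEXT
        )
        """
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_series ON instances (study_uid, series_uid)"
    )
    good = [rec for rec in records if "error" not in rec]
    conn.executemany(
        "INSERT OR REPLACE INTO instances VALUES "
        "(:sop_uid, :patient_id, :study_uid, :series_uid, :modality, :file_path)",
        good,
    )
    conn.commit()
    return conn


# Made-up records standing in for real extraction output
sample = [
    {"sop_uid": "1.2.3.1.1", "patient_id": "P1", "study_uid": "1.2.3",
     "series_uid": "1.2.3.1", "modality": "MR", "file_path": "/data/a.dcm"},
    {"sop_uid": "1.2.3.1.2", "patient_id": "P1", "study_uid": "1.2.3",
     "series_uid": "1.2.3.1", "modality": "MR", "file_path": "/data/b.dcm"},
    {"file_path": "/data/broken.dcm", "error": "not a DICOM file"},
]

conn = build_catalog(sample)
print(conn.execute(
    "SELECT series_uid, COUNT(*) FROM instances GROUP BY series_uid"
).fetchall())
```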

Step 2: Cohort Selection

Use metadata queries to identify the subset of studies that match your research criteria:

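import duckdb

conn = duckdb.connect()
cohort = conn.execute("""
    SELECT DISTINCT study_uid, patient_id, modality, study_date
    FROM 'dicom_metadata.parquet'
    WHERE modality = 'MR'
      AND series_desc LIKE '%T1%'
      -- the catalog stores 0 when SliceThickness is missing, so exclude it
      AND slice_thickness > 0
      AND slice_thickness <= 1.5
      -- quoted because these names can clash with SQL keywords
      AND "rows" >= 256
      AND "columns" >= 256
""").fetchdf()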

This query finds all MRI studies with T1-weighted sequences, thin slices, and adequate resolution, common criteria for neuroimaging research. Running it against a metadata catalog takes seconds, even for archives with hundreds of thousands of files.

Step 3: Quality Validation

Before extracting pixel data, validate that the selected files meet your requirements. Check for missing slices by verifying that instance numbers are contiguous within each series. Flag studies where acquisition parameters changed mid-series (inconsistent PixelSpacing or SliceThickness values). Identify series with unexpectedly few or many images, which may indicate incomplete transfers or scout images mixed in with diagnostic series.
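
Two of these checks can be sketched in a few lines of standard-library Python. The example assumes per-instance records that include an instance_number key, which the batch extraction example earlier does not collect, so treat that field and both helper names as hypothetical additions.

```python
def missing_instance_numbers(numbers):
    """Return instance numbers absent from the min..max range of a series."""
    present = set(numbers)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def inconsistent_fields(instances, fields=("pixel_spacing", "slice_thickness")):
    """Return the fields that take more than one value within a series."""
    return [
        field
        for field in fields
        if len({inst.get(field) for inst in instances}) > 1
    ]


# Made-up series: slice 3 is missing and slice thickness changes mid-series
series = [
    {"instance_number": 1, "pixel_spacing": "[0.5, 0.5]", "slice_thickness": 1.0},
    {"instance_number": 2, "pixel_spacing": "[0.5, 0.5]", "slice_thickness": 1.0},
    {"instance_number": 4, "pixel_spacing": "[0.5, 0.5]", "slice_thickness": 1.5},
]

print(missing_instance_numbers([inst["instance_number"] for inst in series]))
print(inconsistent_fields(series))
```

Contiguous instance numbers are a useful signal but not a guarantee, since some scanners number instances with gaps by design; treat a hit as a prompt for manual review.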

Step 4: De-identify and Export

Apply your anonymization recipe to the selected cohort, then export the anonymized files to your research storage. Include the metadata catalog alongside the images so downstream consumers can filter and group without re-reading DICOM headers.

Step 5: Version and Share

Research datasets evolve. New studies get added, quality criteria change, and collaborators at other institutions need access. Version your metadata catalog alongside the image data so every dataset release has a complete record of what it contains and how it was filtered.
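
One lightweight way to tie a release to its exact contents is a fingerprint computed from the catalog. The sketch below is an assumption about how a team might do this, not an established convention: it hashes a canonical, order-independent serialization of the records, so the same set of records always yields the same identifier. The function name catalog_fingerprint is hypothetical.

```python
import hashlib
import json


def catalog_fingerprint(records):
    """Return a SHA-256 hex digest that depends on record content, not order."""
    canonical = sorted(json.dumps(rec, sort_keys=True) for rec in records)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Made-up catalog entries
release_a = [
    {"sop_uid": "1.2.3.1.1", "modality": "MR"},
    {"sop_uid": "1.2.3.1.2", "modality": "MR"},
]
release_b = release_a + [{"sop_uid": "1.2.3.1.3", "modality": "MR"}]

print(catalog_fingerprint(release_a)[:12])
print(catalog_fingerprint(release_b)[:12])
```

Recording the fingerprint in release notes lets collaborators confirm they are working from the same catalog without comparing files row by row.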

For multi-site collaborations, a cloud workspace simplifies sharing. Fast.io workspaces support granular permissions at the folder level, audit trails for every file operation, and branded shares that let you package datasets with download controls and expiration dates. The free tier includes 50 GB of storage, enough for a pilot imaging dataset, and agents can access workspaces programmatically through Fast.io's MCP server for automated pipeline integration.
