How to Extract Metadata from DICOM Medical Imaging Files
DICOM metadata extraction is the process of reading standardized tags from medical imaging files to retrieve patient demographics, imaging parameters, equipment settings, and study context. This guide covers the DICOM tag hierarchy, practical extraction with Python and pydicom, anonymization workflows for research datasets, and tools for managing medical imaging files at scale.
What DICOM Metadata Is and Why Researchers Need It
DICOM (Digital Imaging and Communications in Medicine) is the universal standard for storing and transmitting medical images. Every CT scan, MRI, X-ray, ultrasound, and PET image produced by modern imaging equipment gets saved as a DICOM file. What makes DICOM different from a regular image format like JPEG or PNG is the metadata header: a structured block of tagged fields that describes everything about the image, the patient, the equipment, and the clinical context.
The DICOM standard defines over 4,000 metadata tags organized into groups. These tags cover patient demographics (name, ID, date of birth), study information (referring physician, study date, accession number), series parameters (modality, body part, imaging protocol), and image-level details (pixel spacing, slice thickness, window center and width). Every tag has a unique identifier, a pair of hexadecimal numbers like (0010,0010) for Patient Name or (0028,0010) for image Rows.
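As a small illustration of the identifier format, the sketch below parses the textual form of a tag into its two hexadecimal numbers and packs them into a single 32-bit integer. The helper names (`parse_tag`, `tag_to_int`, `is_private`) are hypothetical and not part of any library; the odd-group test reflects the DICOM convention that private (vendor-defined) elements use odd group numbers.

```python
import re

# Accepts "(0010,0010)" or "0010,0010"; four hex digits per half
TAG_PATTERN = re.compile(r"^\(?([0-9A-Fa-f]{4}),\s*([0-9A-Fa-f]{4})\)?$")


def parse_tag(text: str) -> tuple[int, int]:
    """Parse a textual DICOM tag into (group, element) integers."""
    match = TAG_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a DICOM tag: {text!r}")
    return int(match.group(1), 16), int(match.group(2), 16)


def tag_to_int(group: int, element: int) -> int:
    """Pack group and element into one 32-bit integer."""
    return (group << 16) | element


def is_private(group: int) -> bool:
    """Private data elements use odd group numbers."""
    return group % 2 == 1


group, element = parse_tag("(0028,0010)")
print(f"{group:04X},{element:04X}", hex(tag_to_int(group, element)), is_private(group))
```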
For clinical IT staff, DICOM metadata powers PACS (Picture Archiving and Communication System) routing and worklist management. But a growing audience needs DICOM metadata for a different reason: research. Data science teams building medical imaging AI models need to extract acquisition parameters to normalize datasets, pull patient demographics for cohort analysis, verify imaging protocols for quality control, and strip protected health information before sharing data across institutions.
Most DICOM guides target radiologists or PACS administrators. This guide is written for researchers, data engineers, and ML teams who need to programmatically extract, filter, and anonymize DICOM metadata at scale.
The DICOM Tag Hierarchy
DICOM organizes metadata in a four-level hierarchy that mirrors how medical imaging works in practice: a patient visits a facility, undergoes a study (examination), which contains one or more series (image sequences), each made up of individual instances (images).
Patient Level
Tags in group (0010,xxxx) describe the patient:
- PatientName (0010,0010): The patient's full name
- PatientID (0010,0020): Facility-assigned identifier
- PatientBirthDate (0010,0030): Date of birth
- PatientSex (0010,0040): Biological sex
- PatientAge (0010,1010): Age at time of study
These fields carry protected health information (PHI) and require careful handling in research workflows.
Study Level
Tags in groups (0008,xxxx) and (0020,xxxx) describe the examination:
- StudyInstanceUID (0020,000D): Globally unique study identifier
- StudyDate (0008,0020): When the examination occurred
- StudyDescription (0008,1030): Clinical description of the exam
- ReferringPhysicianName (0008,0090): Ordering physician
- AccessionNumber (0008,0050): Tracking number from the ordering system
The StudyInstanceUID is the primary key for linking all images from a single examination.
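StudyDate and other date tags use the DICOM DA value representation, an eight-character YYYYMMDD string, so they usually need converting before any date arithmetic. The sketch below shows one way to do that with the standard library; `parse_da` is a hypothetical helper, and treating an empty value as missing is an assumption.

```python
from datetime import date, datetime
from typing import Optional


def parse_da(value: str) -> Optional[date]:
    """Convert a DICOM DA string (YYYYMMDD) to a date, or None if empty."""
    value = value.strip()
    if not value:
        return None
    return datetime.strptime(value, "%Y%m%d").date()


print(parse_da("20240115"))
```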
Series Level
Series-level tags describe a specific imaging sequence within a study:
- SeriesInstanceUID (0020,000E): Unique series identifier
- Modality (0008,0060): Imaging type (CT, MR, US, CR, DX, PT)
- BodyPartExamined (0018,0015): Anatomical region
- SeriesDescription (0008,103E): Protocol name or sequence description
- ProtocolName (0018,1030): Acquisition protocol identifier
For MRI studies, series-level tags also include pulse sequence parameters like RepetitionTime (0018,0080), EchoTime (0018,0081), and FlipAngle (0018,1314). CT studies include KVP (0018,0060), ExposureTime (0018,1150), and ConvolutionKernel (0018,1210).
Instance Level
Instance tags describe individual images:
- SOPInstanceUID (0008,0018): Unique image identifier
- InstanceNumber (0020,0013): Position in the series
- Rows (0028,0010) and Columns (0028,0011): Image dimensions
- PixelSpacing (0028,0030): Physical distance between pixel centers
- SliceThickness (0018,0050): Depth of each image slice
- BitsAllocated (0028,0100): Storage bits per pixel
Understanding this hierarchy matters for extraction pipelines. When you process a folder of DICOM files, you are not working with a flat list of images. You are reconstructing a tree of patients, studies, series, and instances. The UID fields at each level are the keys that let you group files correctly.
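The grouping described above can be sketched with nested dictionaries. This is a minimal illustration, assuming each file has already been reduced to a flat record; the dictionary keys (`patient_id`, `study_uid`, `series_uid`, `sop_uid`) and the sample UIDs are hypothetical.

```python
from collections import defaultdict


def build_hierarchy(records):
    """Nest flat per-file records as patient -> study -> series -> [SOP UIDs]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for rec in records:
        series = tree[rec["patient_id"]][rec["study_uid"]][rec["series_uid"]]
        series.append(rec["sop_uid"])
    return tree


# Hypothetical records; real UIDs are much longer dotted numeric strings
records = [
    {"patient_id": "P1", "study_uid": "1.2.1", "series_uid": "1.2.1.1", "sop_uid": "1.2.1.1.1"},
    {"patient_id": "P1", "study_uid": "1.2.1", "series_uid": "1.2.1.1", "sop_uid": "1.2.1.1.2"},
    {"patient_id": "P1", "study_uid": "1.2.1", "series_uid": "1.2.1.2", "sop_uid": "1.2.1.2.1"},
    {"patient_id": "P2", "study_uid": "1.2.2", "series_uid": "1.2.2.1", "sop_uid": "1.2.2.1.1"},
]

tree = build_hierarchy(records)
for patient, studies in tree.items():
    for study_uid, series in studies.items():
        for series_uid, instances in series.items():
            print(patient, study_uid, series_uid, len(instances))
```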
Extracting DICOM Metadata with Python and pydicom
pydicom is the standard Python library for reading and writing DICOM files. It parses the binary DICOM format into Python objects with attribute-style access to every tag. Install it with pip:
pip install pydicom
Reading a Single File
The basic workflow is straightforward. Load a file with dcmread, then access tags by keyword or by their hexadecimal group/element pair:
from pydicom import dcmread

ds = dcmread("scan_001.dcm")

# Access by keyword
print(ds.PatientName)
print(ds.StudyDate)
print(ds.Modality)

# Access by tag number
print(ds[0x0010, 0x0010].value)  # Patient Name
print(ds[0x0028, 0x0010].value)  # Rows
print(ds[0x0028, 0x0011].value)  # Columns
You can also iterate through every data element in a file:
for element in ds:
    if element.VR != "SQ":  # Skip sequences for readability
        print(f"{element.tag} {element.keyword}: {element.value}")
Batch Extraction Across a Directory
Research datasets typically contain thousands of DICOM files organized in nested folders. Here is a pattern for extracting metadata from an entire directory into a structured format:
import json
from pathlib import Path

from pydicom import dcmread
from pydicom.errors import InvalidDicomError


def extract_dicom_metadata(dicom_dir: Path) -> list[dict]:
    """Read header-only metadata from every .dcm file under dicom_dir."""
    records = []
    for dcm_path in dicom_dir.rglob("*.dcm"):
        try:
            # Skip pixel data: only the header is needed for a catalog
            ds = dcmread(dcm_path, stop_before_pixels=True)
            record = {
                "file_path": str(dcm_path),
                "patient_id": str(getattr(ds, "PatientID", "")),
                "study_uid": str(getattr(ds, "StudyInstanceUID", "")),
                "series_uid": str(getattr(ds, "SeriesInstanceUID", "")),
                "sop_uid": str(getattr(ds, "SOPInstanceUID", "")),
                "modality": str(getattr(ds, "Modality", "")),
                "study_date": str(getattr(ds, "StudyDate", "")),
                "series_desc": str(getattr(ds, "SeriesDescription", "")),
                "rows": int(getattr(ds, "Rows", 0)),
                "columns": int(getattr(ds, "Columns", 0)),
                "slice_thickness": float(getattr(ds, "SliceThickness", 0)),
                "pixel_spacing": str(getattr(ds, "PixelSpacing", "")),
            }
            records.append(record)
        except (InvalidDicomError, AttributeError, TypeError, ValueError) as e:
            # TypeError/ValueError cover tags that are present but empty or malformed
            records.append({
                "file_path": str(dcm_path),
                "error": str(e),
            })
    return records
The stop_before_pixels=True flag is critical for performance. It tells pydicom to stop reading once it reaches the pixel data, so only the metadata header is loaded into memory. For an uncompressed 512x512 CT slice stored at 16 bits per pixel, the metadata header is a few kilobytes while the pixel data is half a megabyte. When you are processing tens of thousands of files, skipping pixel data cuts memory usage and processing time dramatically.
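The half-megabyte figure follows from simple arithmetic on the image dimensions. The sketch below reproduces it; `uncompressed_pixel_bytes` is a hypothetical helper, and it assumes uncompressed pixel data with a bit depth that is a multiple of 8.

```python
def uncompressed_pixel_bytes(rows, columns, bits_allocated=16,
                             samples_per_pixel=1, frames=1):
    """Estimate the size of uncompressed pixel data in bytes."""
    return rows * columns * samples_per_pixel * frames * bits_allocated // 8


size = uncompressed_pixel_bytes(512, 512)
print(size, "bytes =", size / 1024**2, "MiB")
```

Compressed transfer syntaxes store fewer bytes than this estimate, so treat it as an upper bound for single-frame grayscale images.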
Exporting to Tabular Formats
Once you have extracted records, convert them to a DataFrame for analysis:
import pandas as pd

# Uses extract_dicom_metadata and Path from the previous example
records = extract_dicom_metadata(Path("/data/ct_scans"))
df = pd.DataFrame(records)

# Filter to only successful extractions
# (the "error" column exists only if at least one file failed)
df_valid = df[df["error"].isna()] if "error" in df.columns else df

# Group by study to see how many series each contains
study_summary = df_valid.groupby("study_uid").agg(
    series_count=("series_uid", "nunique"),
    image_count=("sop_uid", "count"),
    modalities=("modality", lambda x: list(x.unique())),
)

# Export for downstream use (requires pyarrow or fastparquet)
df_valid.to_parquet("dicom_metadata.parquet", index=False)
Parquet is a good export format for DICOM metadata because it handles mixed types efficiently, supports columnar queries through tools like DuckDB, and compresses well for datasets with millions of rows.
Organize and extract metadata from medical imaging datasets
Fast.io workspaces let you upload DICOM and other medical imaging files, define custom metadata extraction schemas, and share curated datasets with granular permissions. 50 GB free storage, no credit card required.
Anonymizing DICOM Metadata for Research
Medical imaging research almost always requires de-identification. Patient names, IDs, dates of birth, and other PHI fields must be removed or replaced before data leaves the originating institution. DICOM anonymization is not as simple as deleting a few tags. PHI can appear in dozens of standard fields, in private vendor tags, and even burned into pixel data as overlays or annotations.
The DICOM De-identification Standard
The DICOM standard itself defines a de-identification profile in PS3.15 Annex E. It assigns an action to each tag, such as remove, replace with a zero-length value, replace with a dummy value, clean, replace a UID with a consistent new one, or keep. The profile covers over 300 tags that may contain PHI. NEMA, which publishes the standard, provides a confidentiality profile table that maps every relevant tag to its recommended action.
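The per-tag action idea can be sketched as a lookup table applied to a record. The rules below are illustrative only and are not a copy of the PS3.15 Annex E table; the action names, the `apply_rules` helper, and the sample values are all assumptions made for this example.

```python
# Illustrative rules only: NOT the actual PS3.15 Annex E assignments
RULES = {
    "PatientName": ("dummy", "ANONYMOUS"),
    "PatientID": ("dummy", "ANON_001"),
    "PatientBirthDate": ("remove", None),
    "ReferringPhysicianName": ("empty", None),
    "Modality": ("keep", None),
}


def apply_rules(record: dict, rules: dict) -> dict:
    """Return a de-identified copy; keywords without a rule are dropped."""
    out = {}
    for keyword, value in record.items():
        action, replacement = rules.get(keyword, ("remove", None))
        if action == "keep":
            out[keyword] = value
        elif action == "dummy":
            out[keyword] = replacement
        elif action == "empty":
            out[keyword] = ""
        # "remove": leave the keyword out of the result
    return out


record = {
    "PatientName": "DOE^JANE",
    "PatientID": "12345",
    "PatientBirthDate": "19800101",
    "ReferringPhysicianName": "SMITH^JOHN",
    "Modality": "MR",
    "InstitutionName": "General Hospital",
}
print(apply_rules(record, RULES))
```

Dropping any keyword that has no rule is a deliberate default-deny choice in this sketch: a tag nobody thought about is removed rather than leaked.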
Programmatic Anonymization with pydicom
For scripted workflows, pydicom provides direct tag manipulation:
from pydicom import dcmread

ds = dcmread("original.dcm")

# Replace or remove direct identifiers
ds.PatientName = "ANONYMOUS"
ds.PatientID = "ANON_001"
if "PatientBirthDate" in ds:
    del ds.PatientBirthDate
if "ReferringPhysicianName" in ds:
    del ds.ReferringPhysicianName

# Remove all private tags (vendor-specific, may contain PHI)
ds.remove_private_tags()

# Save anonymized copy
ds.save_as("anonymized.dcm")
This approach works for small datasets but gets fragile at scale. You need to handle every PHI-containing tag explicitly, and missing one creates a compliance risk.
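One fragile point at scale is the hard-coded replacement ID: every file from the same patient needs the same pseudonym, and different patients need different ones. A possible approach, which the snippet above does not use, is keyed hashing with HMAC-SHA256 from the standard library. The function name, the `ANON_` prefix, the truncation length, and the key below are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes, length: int = 12) -> str:
    """Derive a stable pseudonym from a patient ID using a secret key."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncating keeps IDs short but raises the collision risk slightly
    return "ANON_" + digest[:length].upper()


key = b"hypothetical-secret-key"  # in practice, keep the real key out of source control
print(pseudonymize("12345", key))
print(pseudonymize("12345", key) == pseudonymize("12345", key))
```

Anyone holding the key can recompute a pseudonym from a known patient ID, so whether this counts as sufficient de-identification depends on your institution's policy.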
Using the deid Library for Reproducible Pipelines
The deid library, built on top of pydicom, uses text-based recipe files to define anonymization rules. This makes the process auditable and reproducible:
pip install deid
A recipe file specifies actions per tag: REMOVE, BLANK, REPLACE, JITTER (for date shifting), or HASH. Teams define a recipe once, version-control it, and apply it across every dataset. The recipe serves double duty as documentation for your IRB or ethics board, showing exactly which fields were modified and how.
Burned-in Pixel Data
Some DICOM images contain PHI burned directly into the pixel data, typically patient names or scan dates rendered as text overlays in ultrasound images or secondary captures. Metadata-level anonymization misses these entirely. Detecting burned-in text requires OCR or machine learning classifiers that scan the image corners and borders where overlays typically appear. Tools like the RSNA DICOM Anonymizer and CTP (Clinical Trials Processor) include pixel anonymization modules for this purpose.
Date Shifting
Research protocols often need to preserve temporal relationships between studies without revealing actual dates. Date shifting replaces real dates with consistently offset versions. If a patient had a baseline MRI on January 15, 2024 and a follow-up on April 15, 2024, shifting both dates by the same random offset (say, minus 47 days) preserves the 91-day interval while removing the real calendar dates. The deid library supports this through its JITTER action.
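The idea can be sketched without any DICOM library. In the example below, each patient gets one offset that is reused for all of their dates; deriving it from a keyed hash is an assumption of this sketch (a stored random offset per patient works equally well), and the function names, patient ID, and key are hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset(patient_id: str, secret_key: bytes, max_days: int = 365) -> timedelta:
    """Derive a stable per-patient shift of 1 to max_days days into the past."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    days = int.from_bytes(digest[:4], "big") % max_days + 1
    return timedelta(days=-days)


def shift(original: date, offset: timedelta) -> date:
    """Apply the patient's offset to a single date."""
    return original + offset


offset = patient_offset("PAT-0001", b"hypothetical-secret")
baseline = shift(date(2024, 1, 15), offset)
follow_up = shift(date(2024, 4, 15), offset)
print((follow_up - baseline).days)
```

Because both dates move by the same amount, the gap between the two example dates stays at 91 days (2024 is a leap year) whatever offset is chosen.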
Tools for DICOM Metadata Viewing and Extraction
Beyond pydicom, several tools handle DICOM metadata extraction for different use cases and skill levels.
Command-Line Tools
dcmdump from the DCMTK (DICOM Toolkit) suite is the standard command-line DICOM viewer. It prints every tag in a human-readable format and supports filtering by tag group. DCMTK is written in C++ and handles large files efficiently. It is available through most Linux package managers and Homebrew on macOS.
GDCM (Grassroots DICOM) provides both a command-line interface and Python bindings. Its gdcmdump command works similarly to dcmdump but includes additional features for handling compressed transfer syntaxes and converting between DICOM encoding formats.
Desktop Applications
3D Slicer is an open-source platform for medical image visualization and analysis. Its DICOM browser reads metadata from files, organizes them by the patient/study/series hierarchy, and lets you inspect individual tags through a graphical interface. It is particularly useful for verifying extraction results visually.
Horos (macOS) and RadiAnt (Windows) are DICOM viewers with built-in metadata inspection panels. They are designed for radiologists but work well for researchers who need to spot-check files without writing code.
Cloud and API-Based Solutions
Google Cloud Healthcare API and Microsoft Azure DICOM Service provide cloud-hosted DICOM storage with RESTful APIs for metadata queries. These services implement DICOMweb, a set of REST APIs defined by the DICOM standard for storing, retrieving, and searching medical images over HTTP. They handle the complexity of parsing DICOM binary formats and expose metadata as JSON, making it accessible to teams that work primarily with web APIs rather than specialized medical imaging tools.
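DICOMweb search responses use the DICOM JSON model, in which each element is keyed by its eight-digit hexadecimal tag and carries a `vr` plus an optional `Value` array. The sketch below reads values out of such a response; the sample payload is hand-written for illustration and `first_value` is a hypothetical helper, not part of any vendor SDK.

```python
import json

# Hand-written sample shaped like one study-level search result
SAMPLE = """
[
  {
    "00080060": {"vr": "CS", "Value": ["MR"]},
    "00100010": {"vr": "PN", "Value": [{"Alphabetic": "DOE^JANE"}]},
    "0020000D": {"vr": "UI", "Value": ["1.2.840.99999.1"]},
    "00081030": {"vr": "LO"}
  }
]
"""


def first_value(dataset: dict, tag: str, default=None):
    """Return the first value of a DICOM JSON element, or default if absent or empty."""
    element = dataset.get(tag, {})
    values = element.get("Value") or []
    if not values:
        return default
    value = values[0]
    # Person names are objects with component groups such as "Alphabetic"
    if element.get("vr") == "PN" and isinstance(value, dict):
        return value.get("Alphabetic", default)
    return value


study = json.loads(SAMPLE)[0]
print(first_value(study, "00080060"), first_value(study, "00100010"))
```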
For teams that need structured extraction beyond raw tag reading, platforms like Fast.io offer a different approach. Fast.io's Metadata Views let you define custom extraction schemas in natural language. Describe the fields you need extracted, such as modality, body part, acquisition date, and pixel dimensions, and the AI builds a typed schema that populates a sortable, filterable spreadsheet across all files in a workspace. This works well for teams that need queryable metadata without building custom pydicom scripts, especially when medical imaging files are part of a larger multi-format dataset that includes reports, annotations, and clinical notes alongside the DICOM images themselves.
Building a Research Pipeline from DICOM Metadata
Extracting metadata is only the first step. The real value comes from using that metadata to build curated, reproducible research datasets. Here is a practical pipeline pattern that data science teams use to go from raw DICOM archives to model-ready datasets.
Step 1: Inventory and Catalog
Start by scanning your entire DICOM archive and building a metadata catalog. Use the batch extraction approach from the pydicom section, with stop_before_pixels=True for speed. Store the catalog in Parquet or a SQLite database. This gives you a queryable index of every file before you move any pixel data.
Step 2: Cohort Selection
Use metadata queries to identify the subset of studies that match your research criteria:
import duckdb

conn = duckdb.connect()
cohort = conn.execute("""
    -- "rows" and "columns" are quoted because they are also SQL keywords
    SELECT DISTINCT study_uid, patient_id, modality, study_date
    FROM 'dicom_metadata.parquet'
    WHERE modality = 'MR'
      AND series_desc LIKE '%T1%'
      AND slice_thickness <= 1.5
      AND "rows" >= 256
      AND "columns" >= 256
""").fetchdf()
This query finds all MRI studies with T1-weighted sequences, thin slices, and adequate resolution, common criteria for neuroimaging research. Running it against a metadata catalog takes seconds, even for archives with hundreds of thousands of files.
Step 3: Quality Validation
Before extracting pixel data, validate that the selected files meet your requirements. Check for missing slices by verifying that instance numbers are contiguous within each series. Flag studies where acquisition parameters changed mid-series (inconsistent PixelSpacing or SliceThickness values). Identify series with unexpectedly few or many images, which may indicate incomplete transfers or scout images mixed in with diagnostic series.
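These checks can be sketched over flat catalog records. The example below is a minimal illustration: the record keys, the `min_images` threshold, and the sample data are assumptions, and `instance_number` would need to be added to the catalog because the batch extraction function shown earlier does not collect it.

```python
from collections import defaultdict


def validate_series(records, min_images=10):
    """Return {series_uid: [issue, ...]} for series that fail basic checks."""
    by_series = defaultdict(list)
    for rec in records:
        by_series[rec["series_uid"]].append(rec)

    issues = {}
    for series_uid, recs in by_series.items():
        found = []
        # Gaps or duplicates show up as a mismatch with a consecutive run
        numbers = sorted(r["instance_number"] for r in recs)
        expected = list(range(numbers[0], numbers[0] + len(numbers)))
        if numbers != expected:
            found.append("instance numbers not contiguous")
        # Values are compared as stored (strings or numbers, not lists)
        for field in ("pixel_spacing", "slice_thickness"):
            if len({r[field] for r in recs}) > 1:
                found.append(f"inconsistent {field}")
        if len(recs) < min_images:
            found.append(f"only {len(recs)} images")
        if found:
            issues[series_uid] = found
    return issues


def make(series_uid, number, thickness=1.0):
    """Build one hypothetical catalog record."""
    return {"series_uid": series_uid, "instance_number": number,
            "pixel_spacing": "[0.5, 0.5]", "slice_thickness": thickness}


sample_records = (
    [make("S1", n) for n in (1, 2, 4)]  # slice 3 is missing
    + [make("S2", 1), make("S2", 2), make("S2", 3, thickness=3.0)]
    + [make("S3", n) for n in (1, 2, 3)]  # clean series
)
print(validate_series(sample_records, min_images=3))
```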
Step 4: De-identify and Export
Apply your anonymization recipe to the selected cohort, then export the anonymized files to your research storage. Include the metadata catalog alongside the images so downstream consumers can filter and group without re-reading DICOM headers.
Step 5: Version and Share
Research datasets evolve. New studies get added, quality criteria change, and collaborators at other institutions need access. Version your metadata catalog alongside the image data so every dataset release has a complete record of what it contains and how it was filtered.
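One lightweight way to record what a release contains is a manifest with a fingerprint of its file list. The sketch below is an assumption about how that could look, not a standard format; `build_manifest`, its field names, and the sample values are hypothetical.

```python
import hashlib
import json


def build_manifest(records, version, criteria):
    """Summarize a dataset release with a hash over its sorted SOP Instance UIDs."""
    sop_uids = sorted(r["sop_uid"] for r in records)
    digest = hashlib.sha256("\n".join(sop_uids).encode("utf-8")).hexdigest()
    return {
        "version": version,
        "criteria": criteria,
        "file_count": len(sop_uids),
        "sop_uid_sha256": digest,
    }


release = [{"sop_uid": "1.2.3.2"}, {"sop_uid": "1.2.3.1"}]
manifest = build_manifest(release, "v1", "MR, T1, slice_thickness <= 1.5")
print(json.dumps(manifest, indent=2))
```

Sorting the UIDs before hashing makes the fingerprint independent of file order, so two releases with the same contents produce the same hash.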
For multi-site collaborations, a cloud workspace simplifies sharing. Fast.io workspaces support granular permissions at the folder level, audit trails for every file operation, and branded shares that let you package datasets with download controls and expiration dates. The free tier includes 50 GB of storage, enough for a pilot imaging dataset, and agents can access workspaces programmatically through Fast.io's MCP server for automated pipeline integration.
Frequently Asked Questions
How do I extract metadata from DICOM files?
Use Python with the pydicom library. Call pydicom.dcmread() to load a file, then access tags by keyword (ds.PatientName, ds.Modality) or by hexadecimal tag number (ds[0x0010, 0x0010]). For batch processing, use the stop_before_pixels=True flag to skip pixel data and speed up extraction across thousands of files.
What metadata is stored in DICOM images?
DICOM files contain metadata at four hierarchical levels. Patient-level tags include name, ID, and date of birth. Study-level tags cover the examination date, description, and referring physician. Series-level tags describe the imaging modality, body part, and acquisition parameters like repetition time and echo time. Instance-level tags record image dimensions, pixel spacing, slice thickness, and position within the series.
How do I anonymize DICOM metadata?
Start with the DICOM de-identification profile in PS3.15 Annex E, which specifies actions for over 300 PHI-containing tags. For scripted pipelines, use the deid library with text-based recipe files that define remove, replace, hash, or date-shift actions for each tag. Always remove private (vendor-specific) tags, which may contain PHI in non-standard locations. Check for burned-in pixel text in ultrasound and secondary capture images using OCR-based tools.
What tools read DICOM file headers?
pydicom (Python library) is the most common programmatic option. dcmdump from the DCMTK suite works at the command line. Desktop viewers like 3D Slicer, Horos, and RadiAnt include metadata inspection panels. Cloud options include Google Cloud Healthcare API and Microsoft Azure DICOM Service, both of which expose DICOM metadata as JSON through DICOMweb REST APIs.
What is the difference between DICOM metadata and EXIF data?
EXIF data is embedded in consumer image formats (JPEG, TIFF) and records camera settings like exposure, focal length, and GPS coordinates. DICOM metadata is specific to medical imaging and records clinical information including patient demographics, imaging modality, acquisition parameters, and study context. DICOM tags are far more extensive, with over 4,000 standard-defined fields compared to a few hundred EXIF tags.
How many tags does the DICOM standard define?
The DICOM standard defines over 4,000 public data elements in its data dictionary (PS3.6). These are organized by group number, covering patient information (group 0010), study and series identification (groups 0008 and 0020), image pixel parameters (group 0028), and acquisition settings (group 0018). Equipment vendors can also define private tags for proprietary data, which adds thousands more in practice.
Can I extract DICOM metadata without loading pixel data?
Yes. In pydicom, pass stop_before_pixels=True to dcmread(). This reads only the metadata header, which is typically a few kilobytes, and skips the pixel data, which can be hundreds of kilobytes to megabytes per file. This makes batch metadata extraction across large archives practical on standard hardware.