How to Extract File Metadata with PHP Libraries
PHP ships with built-in EXIF and IPTC functions that most other languages lack, and its ecosystem includes mature libraries like getID3 and smalot/pdfparser for audio, video, and document metadata. This guide walks through each option with working code, covers format-specific gotchas, and shows how to build a metadata extraction pipeline that handles mixed file types in production.
What PHP Metadata Extraction Covers
Every file carries hidden properties alongside its visible content. Images embed camera settings, GPS coordinates, and color profiles. Audio files store artist names, album art, and bit rates. PDFs record authorship, creation dates, and page counts. Metadata extraction is the process of reading these embedded properties programmatically so you can catalog, filter, audit, or transform files without inspecting each one by hand.
PHP is unusually well-equipped for this work. It powers roughly 76% of websites with a known server-side language according to W3Techs, and it includes native EXIF and IPTC parsing functions that ship with the standard distribution. Most other server-side languages require third-party packages for even basic image metadata. PHP gives you exif_read_data() and iptcparse() out of the box.
That said, built-in functions only cover image formats. For audio, video, and document metadata, you need libraries. The PHP ecosystem has several strong options:
- exif_read_data() and iptcparse(): Built-in functions for JPEG and TIFF image metadata
- getID3: The go-to library for audio and video files (MP3, FLAC, AVI, MP4, and dozens more)
- smalot/pdfparser: Standalone PDF metadata extraction without external dependencies
- php-image-metadata-parser: Combined EXIF, IPTC, and XMP parsing in one package
- PEL (PHP Exif Library): Read and write EXIF headers in JPEG and TIFF files
This guide covers each option with working code, starting with PHP's built-in functions and progressing to third-party libraries for more complex formats. For teams that want extracted metadata to live alongside the files in a shared, queryable system, Fast.io combines workspaces, collaboration, and AI-powered extraction in one place.
Reading Image Metadata with Built-in PHP Functions
PHP's exif extension provides exif_read_data(), which reads EXIF headers from JPEG and TIFF files. The extension is bundled with PHP but may need to be enabled in your php.ini configuration. Check with phpinfo() or run php -m | grep exif from the command line.
Extracting EXIF Data
Here is a working example that pulls camera settings, timestamps, and dimensions from a JPEG:
$metadata = exif_read_data('photo.jpg', 'ANY_TAG', true);
if ($metadata !== false) {
    echo $metadata['IFD0']['Make']; // Camera manufacturer
    echo $metadata['IFD0']['Model']; // Camera model
    echo $metadata['EXIF']['DateTimeOriginal']; // Capture timestamp
    echo $metadata['COMPUTED']['Width']; // Image width
    echo $metadata['COMPUTED']['Height']; // Image height
}
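The second parameter is a comma-separated list of sections that must be present for the function to return a result; if none of the requested sections is found, it returns false. Pass 'ANY_TAG' to accept any file that carries at least one tag, or name sections such as 'IFD0', 'EXIF', 'GPS', or 'THUMBNAIL' to require them. The third parameter, when set to true, organizes results into section-based arrays instead of a flat list.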
For GPS coordinates, look in the GPS section:
$exif = exif_read_data('geotagged-photo.jpg', 'GPS', true);
if (isset($exif['GPS'])) {
    $lat = $exif['GPS']['GPSLatitude'];
    $lon = $exif['GPS']['GPSLongitude'];
    $latRef = $exif['GPS']['GPSLatitudeRef']; // N or S
    $lonRef = $exif['GPS']['GPSLongitudeRef']; // E or W
}
GPS values come back as arrays of rational numbers (fractions), so you will need a conversion function to get decimal degrees. Each element represents degrees, minutes, and seconds as a fraction string like "40/1".
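A minimal sketch of such a conversion follows. The function names exifRationalToFloat() and gpsToDecimal() are illustrative and not part of PHP; the sketch assumes each coordinate arrives as a three-element array of "numerator/denominator" strings, as described above.

```php
<?php
// Convert one EXIF rational string such as "40/1" or "2614/100" to a float.
function exifRationalToFloat(string $rational): float
{
    $parts = explode('/', $rational);
    $numerator = (float) $parts[0];
    $denominator = isset($parts[1]) ? (float) $parts[1] : 1.0;
    // Guard against malformed tags with a zero denominator.
    return $denominator == 0.0 ? 0.0 : $numerator / $denominator;
}

// Convert a [degrees, minutes, seconds] array plus a hemisphere reference
// (N, S, E, or W) to signed decimal degrees.
function gpsToDecimal(array $dms, string $ref): float
{
    $degrees = exifRationalToFloat($dms[0] ?? '0/1');
    $minutes = exifRationalToFloat($dms[1] ?? '0/1');
    $seconds = exifRationalToFloat($dms[2] ?? '0/1');
    $decimal = $degrees + $minutes / 60 + $seconds / 3600;
    // South and West are negative by convention.
    return in_array(strtoupper($ref), ['S', 'W'], true) ? -$decimal : $decimal;
}
```

With the variables from the GPS snippet, gpsToDecimal($lat, $latRef) and gpsToDecimal($lon, $lonRef) would give the latitude and longitude as signed floats.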
Extracting IPTC Data
IPTC metadata stores editorial information: captions, keywords, credits, and copyright notices. PHP reads it through a two-step process using getimagesize() and iptcparse():
$size = getimagesize('photo.jpg', $info);
if (isset($info['APP13'])) {
    $iptc = iptcparse($info['APP13']);
    $caption = $iptc['2#120'][0] ?? ''; // Caption/description
    $keywords = $iptc['2#025'] ?? []; // Keywords array
    $credit = $iptc['2#110'][0] ?? ''; // Credit line
    $headline = $iptc['2#105'][0] ?? ''; // Headline
}
IPTC tags use a numeric code system. The most useful ones are 2#120 (caption), 2#025 (keywords), 2#080 (byline/author), 2#105 (headline), 2#116 (copyright), and 2#110 (credit). Keywords return as an array since images can have multiple keyword tags.
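One way to make those codes easier to work with is a small lookup table. The sketch below is an illustration only: IPTC_LABELS and labelIptc() are hypothetical names, and the input is assumed to have the shape that iptcparse() returns (code => list of values).

```php
<?php
// Hypothetical label map built from the codes listed above.
const IPTC_LABELS = [
    '2#120' => 'caption',
    '2#025' => 'keywords',
    '2#080' => 'byline',
    '2#105' => 'headline',
    '2#116' => 'copyright',
    '2#110' => 'credit',
];

// Translate a parsed IPTC array into readable keys.
function labelIptc(array $iptc): array
{
    $labeled = [];
    foreach (IPTC_LABELS as $code => $label) {
        if (!isset($iptc[$code])) {
            continue;
        }
        // Keywords stay an array; other fields collapse to their first entry.
        $labeled[$label] = $label === 'keywords'
            ? $iptc[$code]
            : ($iptc[$code][0] ?? '');
    }
    return $labeled;
}
```

Codes that are missing from the file are simply left out of the result, so callers can still apply ?? defaults.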
Limitations of Built-in Functions
PHP's built-in metadata functions have real constraints to plan around:
- JPEG and TIFF only: exif_read_data() does not handle PNG, WebP, HEIC, or RAW formats
- Read-only: You cannot write or modify EXIF data with built-in functions
- No XMP support: The increasingly common XMP metadata standard requires a third-party library
- Binary data handling: Some manufacturer-specific tags return raw binary that needs custom parsing
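Given the JPEG and TIFF restriction, it can help to check the image type before calling exif_read_data(). The following is a sketch under that assumption; supportsExif() and readExifIfSupported() are hypothetical helper names, and the type check relies on the IMAGETYPE_* constant that getimagesize() reports at index 2.

```php
<?php
// Image types that exif_read_data() can parse: JPEG and both TIFF byte orders.
function supportsExif(int $imageType): bool
{
    return in_array(
        $imageType,
        [IMAGETYPE_JPEG, IMAGETYPE_TIFF_II, IMAGETYPE_TIFF_MM],
        true
    );
}

// Hypothetical wrapper: returns null for unsupported or unreadable files
// instead of letting exif_read_data() emit warnings.
function readExifIfSupported(string $path): ?array
{
    if (!function_exists('exif_read_data')) {
        return null; // exif extension is not enabled
    }
    $size = @getimagesize($path);
    if ($size === false || !supportsExif($size[2])) {
        return null;
    }
    $exif = @exif_read_data($path, 'ANY_TAG', true);
    return $exif === false ? null : $exif;
}
```

A null result then signals "no EXIF available here", which is a reasonable cue to fall back to a third-party library for PNG, WebP, or HEIC files.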
Audio and Video Metadata with getID3
getID3 is the dominant PHP library for multimedia metadata. It reads properties from over 30 audio and video formats without requiring any external system dependencies. Everything runs in pure PHP.
Install it with Composer:
composer require james-heinrich/getid3
Reading MP3 Metadata
Here is how to extract ID3 tags and audio properties from an MP3 file:
require_once 'vendor/autoload.php';
$getID3 = new getID3();
$fileInfo = $getID3->analyze('/path/to/song.mp3');
// Audio properties
echo $fileInfo['playtime_string']; // "3:45"
echo $fileInfo['audio']['bitrate']; // 320000
echo $fileInfo['audio']['channels']; // 2
echo $fileInfo['audio']['sample_rate']; // 44100
// ID3 tags
$tags = $fileInfo['tags']['id3v2'] ?? $fileInfo['tags']['id3v1'] ?? [];
echo $tags['title'][0] ?? 'Unknown';
echo $tags['artist'][0] ?? 'Unknown';
echo $tags['album'][0] ?? 'Unknown';
echo $tags['year'][0] ?? '';
The analyze() method returns a deeply nested array. Audio properties sit at the top level under audio, while ID3 tag data is organized under tags with separate entries for id3v1 and id3v2 tag versions. Most modern files use ID3v2, but it is good practice to fall back to v1.
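The fallback can also be applied per field rather than per tag set, which helps when an ID3v2 block exists but lacks a particular field. This is a sketch, not getID3's own API: firstTagValue() and analysisErrors() are hypothetical helpers, and the format keys beyond id3v2 and id3v1 (vorbiscomment, quicktime) as well as the top-level error key are assumptions about the array that analyze() returns.

```php
<?php
// Return the first non-empty value of $field across the given tag formats.
// The default format keys are assumptions about getID3's 'tags' array.
function firstTagValue(
    array $fileInfo,
    string $field,
    array $formats = ['id3v2', 'id3v1', 'vorbiscomment', 'quicktime']
): ?string {
    foreach ($formats as $format) {
        $value = $fileInfo['tags'][$format][$field][0] ?? null;
        if ($value !== null && $value !== '') {
            return (string) $value;
        }
    }
    return null;
}

// Hypothetical guard: list any error messages attached to the analysis result.
function analysisErrors(array $fileInfo): array
{
    return $fileInfo['error'] ?? [];
}
```

For example, firstTagValue($fileInfo, 'artist') would return the ID3v1 artist when the ID3v2 block has a title but no artist.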
Video File Analysis
getID3 handles video containers just as well. It reads codec details, resolution, frame rates, and embedded metadata from MP4, AVI, MKV, MOV, and WebM files:
$getID3 = new getID3();
$videoInfo = $getID3->analyze('/path/to/video.mp4');
echo $videoInfo['video']['resolution_x']; // 1920
echo $videoInfo['video']['resolution_y']; // 1080
echo $videoInfo['video']['frame_rate']; // 29.97
echo $videoInfo['video']['codec']; // h264
echo $videoInfo['playtime_string']; // "1:23:45"
echo $videoInfo['filesize']; // bytes
Supported Formats
getID3 covers a wide range of formats out of the box:
- Audio: MP3, FLAC, OGG Vorbis, WAV, AAC, WMA, AIFF, Monkey's Audio, Musepack, Opus
- Video: MP4, AVI, MKV, MOV, WebM, ASF/WMV, MPEG, QuickTime
- Images: JPEG, PNG, GIF, BMP, TIFF (basic properties, not full EXIF)
- Archives: ZIP (file listing and metadata)
For Laravel projects, the plutuss/getid3-laravel package wraps getID3 with a service provider and facade, making it easier to use with dependency injection. Install it alongside the core library to use syntax like GetId3::fromFile($path)->extractInfo().
Extract and organize file metadata without custom code
Fast.io Metadata Views turns uploaded documents into searchable, structured data. Describe the fields you need and let AI handle the extraction across PDFs, images, and office files. Start free with 50GB storage and 5,000 credits.
PDF and Document Metadata Extraction
PDF files store metadata in an internal info dictionary and sometimes in an embedded XMP metadata stream. The smalot/pdfparser library handles both cases in pure PHP with no external dependencies.
Install via Composer:
composer require smalot/pdfparser
Reading PDF Properties
require_once 'vendor/autoload.php';
use Smalot\PdfParser\Parser;
$parser = new Parser();
$pdf = $parser->parseFile('/path/to/document.pdf');
$details = $pdf->getDetails();
echo $details['Author'] ?? 'Unknown';
echo $details['Creator'] ?? '';
echo $details['CreationDate'] ?? '';
echo $details['ModDate'] ?? '';
echo $details['Title'] ?? '';
echo $details['Pages'] ?? '';
The getDetails() method returns an associative array with standard PDF properties. Some PDFs also include XMP metadata with namespace prefixes, which pdfparser surfaces as additional array keys. These values can be nested arrays when the XMP data contains structured fields.
Handling Edge Cases
A few things to watch for when parsing PDF metadata:
- Encrypted PDFs: pdfparser cannot read metadata from password-protected documents. You will need to decrypt them first with a tool like qpdf.
- Missing metadata: Many PDFs, especially scanned documents, have empty or missing metadata fields. Always use null coalescing (??) or isset() checks.
- Date formats: PDF dates use a non-standard format like D:20260425120000+00'00'. Parse them with a helper function or the DateTime class.
- Large files: For PDFs over 100MB, consider using parseFile() with memory limits configured in php.ini to avoid exhaustion errors.
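A date helper along those lines might look like the sketch below. parsePdfDate() is a hypothetical name, the sketch only accepts the full 14-digit form shown above, and treating a missing offset as UTC is an assumption made here for simplicity.

```php
<?php
// Parse a raw PDF date such as "D:20260425120000+00'00'".
// Returns null when the value does not match the expected pattern.
function parsePdfDate(string $raw): ?DateTimeImmutable
{
    $pattern = '/^D:(\d{14})(?:Z|([+-])(\d{2})\'?(\d{2})?\'?)?$/';
    if (!preg_match($pattern, trim($raw), $m)) {
        return null;
    }
    // Assumption: a missing or "Z" offset is treated as UTC.
    $offset = '+00:00';
    if (!empty($m[2])) {
        $minutes = ($m[4] ?? '') !== '' ? $m[4] : '00';
        $offset = $m[2] . $m[3] . ':' . $minutes;
    }
    $date = DateTimeImmutable::createFromFormat('YmdHis P', $m[1] . ' ' . $offset);
    return $date ?: null;
}
```

Calling parsePdfDate($details['CreationDate'] ?? '') returns a DateTimeImmutable for raw PDF dates and null for anything else, including values that a library has already normalized to another format.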
Office Documents
For Word documents (.docx), Excel spreadsheets (.xlsx), and PowerPoint files (.pptx), the PHPOffice family of libraries reads document properties from the Open XML format:
require_once 'vendor/autoload.php';
use PhpOffice\PhpSpreadsheet\IOFactory;
$spreadsheet = IOFactory::load('report.xlsx');
$props = $spreadsheet->getProperties();
echo $props->getCreator();
echo $props->getTitle();
echo $props->getDescription();
echo $props->getLastModifiedBy();
// getCreated() returns a Unix timestamp, not a DateTime object
echo date('Y-m-d', (int) $props->getCreated());
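PhpOffice packages include phpoffice/phpspreadsheet for Excel, phpoffice/phpword for Word, and phpoffice/phppresentation for PowerPoint. All three expose document metadata through a similar properties object, although the accessor name differs between packages: getProperties() in PhpSpreadsheet, getDocInfo() in PhpWord, and getDocumentProperties() in PhpPresentation.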
Building a Multi-Format Extraction Pipeline
Real projects rarely deal with a single file type. A media library ingests JPEGs alongside MP4s. A document management system handles PDFs, Word files, and scanned images. You need a pipeline that detects the file type and routes it to the right extraction method.
Here is a practical approach using a class that combines all the libraries covered above:
require_once 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

class MetadataExtractor
{
    private getID3 $getID3;
    private Parser $pdfParser;

    public function __construct()
    {
        $this->getID3 = new getID3();
        $this->pdfParser = new Parser();
    }

    public function extract(string $filepath): array
    {
        // mime_content_type() returns false on failure, so fall back to a generic type
        $mime = mime_content_type($filepath) ?: 'application/octet-stream';
        $base = [
            'filename' => basename($filepath),
            'filesize' => filesize($filepath),
            'mime_type' => $mime,
            'modified' => date('c', filemtime($filepath)),
        ];
        // Route to the extractor that matches the detected MIME type
        return match (true) {
            str_starts_with($mime, 'image/') => array_merge(
                $base, $this->extractImage($filepath)
            ),
            str_starts_with($mime, 'audio/') => array_merge(
                $base, $this->extractAudio($filepath)
            ),
            str_starts_with($mime, 'video/') => array_merge(
                $base, $this->extractVideo($filepath)
            ),
            $mime === 'application/pdf' => array_merge(
                $base, $this->extractPdf($filepath)
            ),
            default => $base,
        };
    }

    private function extractImage(string $path): array
    {
        $data = [];
        // EXIF: camera, capture time, and dimensions (JPEG and TIFF only)
        $exif = @exif_read_data($path, 'ANY_TAG', true);
        if ($exif !== false) {
            $data['camera'] = $exif['IFD0']['Model'] ?? null;
            $data['taken'] = $exif['EXIF']['DateTimeOriginal'] ?? null;
            $data['width'] = $exif['COMPUTED']['Width'] ?? null;
            $data['height'] = $exif['COMPUTED']['Height'] ?? null;
        }
        // IPTC: caption and keywords from the APP13 segment
        $size = getimagesize($path, $info);
        if (isset($info['APP13'])) {
            $iptc = iptcparse($info['APP13']);
            $data['caption'] = $iptc['2#120'][0] ?? null;
            $data['keywords'] = $iptc['2#025'] ?? [];
        }
        return $data;
    }

    private function extractAudio(string $path): array
    {
        $info = $this->getID3->analyze($path);
        $tags = $info['tags']['id3v2']
            ?? $info['tags']['id3v1'] ?? [];
        return [
            'duration' => $info['playtime_string'] ?? null,
            'bitrate' => $info['audio']['bitrate'] ?? null,
            'title' => $tags['title'][0] ?? null,
            'artist' => $tags['artist'][0] ?? null,
            'album' => $tags['album'][0] ?? null,
        ];
    }

    private function extractVideo(string $path): array
    {
        $info = $this->getID3->analyze($path);
        return [
            'duration' => $info['playtime_string'] ?? null,
            'width' => $info['video']['resolution_x'] ?? null,
            'height' => $info['video']['resolution_y'] ?? null,
            'codec' => $info['video']['codec'] ?? null,
            'frame_rate' => $info['video']['frame_rate'] ?? null,
        ];
    }

    private function extractPdf(string $path): array
    {
        $pdf = $this->pdfParser->parseFile($path);
        $details = $pdf->getDetails();
        return [
            'author' => $details['Author'] ?? null,
            'title' => $details['Title'] ?? null,
            'pages' => $details['Pages'] ?? null,
            'created' => $details['CreationDate'] ?? null,
        ];
    }
}
Use it like this:
$extractor = new MetadataExtractor();
$metadata = $extractor->extract('/uploads/photo.jpg');
$metadata = $extractor->extract('/uploads/podcast.mp3');
$metadata = $extractor->extract('/uploads/contract.pdf');
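In production, a single corrupt or unsupported file should not stop the whole run, and some parsers may throw on malformed input. One option is a thin wrapper that records the failure and moves on. This is a sketch only; safeExtract() is a hypothetical helper and the ok/metadata/error keys are an illustrative convention.

```php
<?php
// Hypothetical wrapper: run any extractor callable and capture failures
// per file instead of letting one exception abort the caller.
function safeExtract(callable $extract, string $path): array
{
    try {
        return ['ok' => true, 'metadata' => $extract($path)];
    } catch (Throwable $e) {
        return ['ok' => false, 'error' => $e->getMessage()];
    }
}
```

With the class above, safeExtract([$extractor, 'extract'], '/uploads/contract.pdf') would return either the metadata or the error message, so the caller can log failures and continue.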
Batch Processing
For directories with hundreds or thousands of files, wrap the extractor in a generator to keep memory usage flat:
function extractDirectory(string $dir): Generator
{
    $extractor = new MetadataExtractor();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir)
    );
    foreach ($iterator as $file) {
        if ($file->isFile()) {
            yield $file->getPathname() => $extractor->extract(
                $file->getPathname()
            );
        }
    }
}
foreach (extractDirectory('/uploads') as $path => $meta) {
    // Store in database, send to API, etc.
    echo "{$path}: {$meta['mime_type']}\n";
}
This pattern works well for building search indexes, generating reports, or feeding metadata into a downstream system.
Storing and Querying Extracted Metadata at Scale
Extracting metadata is only half the problem. Once you have structured properties from thousands of files, you need somewhere to store them and a way to search across them.
Database Storage
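For most PHP applications, storing metadata in a relational database makes sense. Here is a simple schema in MySQL syntax (for PostgreSQL, swap AUTO_INCREMENT for an identity column and move the indexes into separate CREATE INDEX statements):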
CREATE TABLE file_metadata (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
filepath VARCHAR(1024) NOT NULL,
mime_type VARCHAR(100),
filesize BIGINT,
metadata JSON,
extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_mime (mime_type),
INDEX idx_filepath (filepath(255))
);
Using a JSON column for the metadata itself keeps the schema flexible. Different file types produce different fields, and a JSON column handles that without schema migrations every time you add a new extractor.
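To show how extractor output might map onto that schema, here is a minimal sketch that splits the fixed columns from the flexible JSON payload. buildMetadataRow() is a hypothetical helper; the column names follow the schema above, and the list of fixed keys assumes the base fields produced by the extractor class.

```php
<?php
// Split extractor output into fixed columns plus a JSON document.
// The function name and the list of fixed keys are illustrative.
function buildMetadataRow(string $filepath, array $meta): array
{
    $fixed = ['filename', 'filesize', 'mime_type', 'modified'];
    // Everything that is not a fixed column goes into the JSON payload.
    $extra = array_diff_key($meta, array_flip($fixed));
    return [
        'filepath'  => $filepath,
        'mime_type' => $meta['mime_type'] ?? null,
        'filesize'  => $meta['filesize'] ?? null,
        // Cast to object so an empty payload encodes as {} rather than [].
        'metadata'  => json_encode(
            (object) $extra,
            JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE
        ),
    ];
}
```

The resulting array can be bound to a prepared INSERT statement with named placeholders, which keeps file paths and tag values out of the SQL string itself.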
Searching Metadata
With JSON columns, you can query specific metadata fields directly:
-- Find all photos taken with a specific camera
SELECT filepath FROM file_metadata
WHERE JSON_EXTRACT(metadata, '$.camera') = 'Canon EOS R5';
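-- Find videos longer than 10 minutes
-- Compares a numeric seconds field (assumes getID3's playtime_seconds is stored
-- as duration_seconds); string durations compare as text, so '3:45' > '10:00'
SELECT filepath FROM file_metadata
WHERE mime_type LIKE 'video/%'
AND JSON_EXTRACT(metadata, '$.duration_seconds') > 600;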
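Duration strings such as "3:45" do not sort numerically, so range filters are more dependable on a seconds value. getID3 reports playtime_seconds alongside playtime_string; where only the string was stored, a conversion like the sketch below could backfill a numeric field. playtimeToSeconds() is a hypothetical helper that assumes "m:ss" or "h:mm:ss" input.

```php
<?php
// Convert "m:ss" or "h:mm:ss" to whole seconds; returns null for other input.
function playtimeToSeconds(string $playtime): ?int
{
    if (!preg_match('/^\d+(:[0-5]?\d){1,2}$/', $playtime)) {
        return null;
    }
    $seconds = 0;
    // Each colon-separated part shifts the running total by a factor of 60.
    foreach (explode(':', $playtime) as $part) {
        $seconds = $seconds * 60 + (int) $part;
    }
    return $seconds;
}
```

Storing the result next to the display string keeps the readable value for output and gives queries a number to compare against.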
Using Fast.io Metadata Views
For teams that need metadata extraction without building custom infrastructure, Fast.io's Metadata Views takes a different approach. Instead of writing code for each file format, you describe the fields you want in plain language and the platform's AI designs a typed schema and populates a searchable spreadsheet from your uploaded files.
Metadata Views works across PDFs, images, Word documents, spreadsheets, presentations, and scanned pages. You can add new columns at any time without reprocessing existing files, and agents can create Views, trigger extraction, and query results through Fast.io's MCP server.
This is useful when you need structured extraction from document-heavy workflows (legal contracts, insurance policies, invoices) without maintaining format-specific parsing code. Your PHP pipeline handles the programmatic cases where you need fine-grained control over specific fields, while Metadata Views covers the cases where document formats are unpredictable or change frequently.
For file-heavy PHP applications, a hybrid approach works well: use PHP libraries for real-time extraction in your upload pipeline, and use a platform like Fast.io for the document analysis and search layer where AI can handle format variations you did not anticipate in code.
Frequently Asked Questions
How do I read EXIF data in PHP?
Use PHP's built-in exif_read_data() function. Call it with the image path as the first argument, 'ANY_TAG' as the second so that any file carrying at least one tag returns a result, and true as the third to organize results by section. The function works with JPEG and TIFF files and requires the exif extension to be enabled in php.ini. Check that the extension is loaded with phpinfo() or php -m before using it.
What PHP library extracts audio metadata?
getID3 is the standard PHP library for audio metadata. Install it with composer require james-heinrich/getid3, then call the analyze() method on a file path. It reads ID3v1 and ID3v2 tags from MP3 files, Vorbis comments from FLAC and OGG, and metadata from WAV, AAC, WMA, AIFF, and Opus files. It returns duration, bitrate, sample rate, channels, and all tag fields in a single nested array.
How do I extract metadata from a PDF in PHP?
The smalot/pdfparser library reads PDF metadata in pure PHP. Install it with composer require smalot/pdfparser, parse the file with the Parser class, and call getDetails() on the result. It returns author, title, creator application, creation date, modification date, and page count. For encrypted PDFs, you will need to decrypt them first with a tool like qpdf.
Does PHP have built-in EXIF support?
Yes. PHP includes the exif extension as a bundled module. It provides exif_read_data() for reading EXIF headers, exif_thumbnail() for extracting embedded thumbnails, and exif_imagetype() for detecting image format by header bytes. The extension ships with PHP but may be disabled by default on some installations, so check your php.ini to confirm it is enabled.
How do I read IPTC metadata from images in PHP?
Use getimagesize() with a second parameter to capture the raw info array, then check for the APP13 key and pass it to iptcparse(). IPTC tags use numeric codes like 2#120 for caption, 2#025 for keywords, and 2#080 for author. Keywords return as an array since images can carry multiple keyword entries. This approach works without the GD or exif extensions.
Can getID3 read video file metadata?
Yes. getID3 reads metadata from MP4, AVI, MKV, MOV, WebM, ASF/WMV, and MPEG files. It returns video resolution, frame rate, codec, duration, and file size. The library runs in pure PHP with no external dependencies, so it works on shared hosting environments where you cannot install system-level tools like FFmpeg or MediaInfo.