Metadata Extraction with Ruby Libraries: A Developer Guide

Ruby has a solid collection of gems for reading metadata from images, PDFs, audio files, and more. This guide compares the most useful options, from mini_exiftool's ExifTool wrapper to exifr's pure-Ruby EXIF parsing, pdf-reader for document properties, and taglib-ruby for audio ID3 tags. You will find installation steps, working code examples, and advice on building a multi-format extraction pipeline.

Fast.io Editorial Team 10 min read
Structured data extracted from uploaded files displayed in a dashboard

What Ruby Metadata Extraction Looks Like in Practice

Ruby metadata extraction libraries provide native or wrapper interfaces for reading and writing file metadata. Gems like mini_exiftool wrap the ExifTool command-line application, while exifr provides pure-Ruby EXIF and TIFF parsing with zero external dependencies.

The practical use cases are straightforward:

  • Asset cataloging: Reading camera model, resolution, and capture date from thousands of uploaded photos
  • Document processing: Pulling author, creation date, page count, and security flags from PDFs
  • Audio library management: Extracting ID3 and Vorbis comment tags (artist, album, track number, genre) from MP3 and FLAC collections
  • Privacy compliance: Stripping GPS coordinates and author names before publishing files
  • Server-side processing: Handling metadata in Rails controllers or background jobs when users upload files

The challenge is the same in every language: no single library handles every file format. JPEG EXIF data lives in a different structure than PDF document info dictionaries or MP3 ID3 frames. Ruby's gem ecosystem addresses this with specialized libraries for each file family, plus wrapper gems that delegate to ExifTool for universal coverage.

This guide compares five practical gems and their installation requirements, then walks through working code for images, PDFs, and audio. By the end, you will have a reusable pattern for routing any file type to the right extractor.

Comparing the Top Ruby Metadata Gems

Before writing code, it helps to understand which gem fits which job. Here is a breakdown of the primary options.

mini_exiftool (ExifTool wrapper)

  • File types: JPEG, PNG, TIFF, RAW, PDF, MP3, MP4, MOV, and hundreds more (anything ExifTool supports)
  • Metadata scope: EXIF, IPTC, XMP, ICC profiles, maker notes, video properties, document info
  • Dependency: Requires the ExifTool command-line binary installed on the system
  • Install: gem install mini_exiftool
  • Downloads: Over 5.4 million on RubyGems
  • Best for: Projects that need to read (and write) metadata across many file formats with a single API

exifr (pure Ruby)

  • File types: JPEG, TIFF
  • Metadata scope: EXIF tags, GPS data, thumbnail extraction
  • Dependency: None (pure Ruby)
  • Install: gem install exifr
  • Downloads: Over 14.6 million on RubyGems
  • Best for: Image-only workflows where you want zero system dependencies and fast parsing

pdf-reader

  • File types: PDF
  • Metadata scope: Document info (title, author, creator, producer, creation date), PDF version, page count, encryption status
  • Dependency: None (pure Ruby)
  • Install: gem install pdf-reader
  • Downloads: Over 103 million on RubyGems
  • Best for: Extracting document properties from PDF files in processing pipelines

taglib-ruby (TagLib wrapper)

  • File types: MP3, FLAC, OGG, WMA, AAC, WAV, AIFF
  • Metadata scope: ID3v1, ID3v2 (including v2.4), cover art, audio properties (bitrate, sample rate, duration)
  • Dependency: Requires the TagLib C++ library installed on the system
  • Install: gem install taglib-ruby
  • Downloads: Over 460,000 on RubyGems
  • Best for: Music and podcast metadata extraction with full read/write support

ruby-vips (libvips wrapper)

  • File types: JPEG, PNG, TIFF, WebP, HEIF, SVG, PDF (rasterized)
  • Metadata scope: EXIF data, ICC profiles, image dimensions, resolution, orientation
  • Dependency: Requires the libvips library installed on the system
  • Install: gem install ruby-vips
  • Best for: High-performance image processing pipelines where you also need metadata access

If you only process JPEG and TIFF images, exifr is the fastest path, with no system dependencies. If you deal with mixed file types or need write support, mini_exiftool is the most versatile. For PDF-heavy workflows, pdf-reader is the standard. For audio, taglib-ruby is the clear choice.

Reading Image Metadata with exifr and mini_exiftool

Image metadata extraction is the most common use case. Here are working examples with both pure-Ruby and ExifTool-based approaches.

Pure Ruby with exifr

exifr parses JPEG and TIFF EXIF data without any external dependencies:

require 'exifr/jpeg'

jpeg = EXIFR::JPEG.new('photo.jpg')
puts "Dimensions: #{jpeg.width}x#{jpeg.height}"
puts "Camera: #{jpeg.model}"
puts "Date taken: #{jpeg.date_time_original}"
puts "Exposure: #{jpeg.exposure_time}"
puts "ISO: #{jpeg.iso_speed_ratings}"

if jpeg.gps
  puts "Location: #{jpeg.gps.latitude}, #{jpeg.gps.longitude}"
end

exifr returns typed Ruby objects. Timestamps come back as Time instances, GPS coordinates as floats, and dimensions as integers. That means you can sort and filter without manual type conversion.
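As a rough illustration of why typed values matter, the sketch below sorts and filters a few records without any string parsing. The `Shot` struct and its sample values are hypothetical stand-ins, so the snippet runs without image files; in a real run the fields would be filled from `date_time_original` and `gps` on an `EXIFR::JPEG` object.

```ruby
# Hypothetical stand-in for values read via exifr. In practice these fields
# would be copied from EXIFR::JPEG#date_time_original and EXIFR::JPEG#gps.
Shot = Struct.new(:path, :taken_at, :latitude, :longitude)

shots = [
  Shot.new('b.jpg', Time.new(2024, 5, 2, 14, 0, 0), 48.8584, 2.2945),
  Shot.new('a.jpg', Time.new(2023, 11, 9, 8, 30, 0), nil, nil),
  Shot.new('c.jpg', nil, 40.6892, -74.0445)
]

# Time values compare directly, so chronological sorting needs no parsing.
# Files without a capture date are skipped rather than guessed at.
dated = shots.reject { |s| s.taken_at.nil? }.sort_by(&:taken_at)

# Float coordinates allow plain numeric filters, here "north of 45 degrees".
northern = shots.select { |s| s.latitude && s.latitude > 45.0 }

puts dated.map(&:path).inspect
puts northern.map(&:path).inspect
```

Missing values still need a nil check, since not every photo carries a capture date or GPS block.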

For TIFF files, swap the class:

require 'exifr/tiff'

tiff = EXIFR::TIFF.new('scan.tiff')
puts tiff.artist
puts tiff.copyright

The tradeoff is format coverage. exifr only reads JPEG and TIFF. It cannot parse PNG, WebP, HEIF, or RAW camera files. If your users upload .cr2 or .arw files, you need a different tool.

ExifTool wrapper with mini_exiftool

mini_exiftool delegates to ExifTool, which supports over 400 file formats. Install ExifTool first (brew install exiftool on macOS, apt install libimage-exiftool-perl on Ubuntu), then use the gem:

require 'mini_exiftool'

photo = MiniExiftool.new('photo.jpg')
puts photo.date_time_original
puts photo.model
puts photo.gps_latitude
puts photo.gps_longitude
puts photo.image_width
puts photo.image_height

Tag names in mini_exiftool are case-insensitive. photo.DateTimeOriginal, photo.datetimeoriginal, and photo.date_time_original all return the same value. The library also converts values to Ruby arrays, integers, floats, strings, and Time objects automatically.

One important design choice: mini_exiftool processes one file at a time and prioritizes safety over speed. Write operations work on a copy of the original file so that either all changes succeed or none do:

photo = MiniExiftool.new('photo.jpg')
photo.artist = 'Jane Doe'
photo.copyright = '2026 Jane Doe'

if photo.save
  puts 'Metadata updated'
else
  puts "Errors: #{photo.errors}"
end

If you need batch processing across many files, the exiftool gem (a separate project from mini_exiftool) supports multiget operations that call ExifTool once for an entire directory.
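The single-invocation idea can also be sketched without any gem by calling the ExifTool binary directly and asking for JSON output. This is a minimal sketch, not the exiftool gem's API: the helper names `index_exiftool_json` and `batch_read` are hypothetical, and `batch_read` assumes the `exiftool` binary is on the PATH.

```ruby
require 'json'
require 'open3'

# Hypothetical helper: turns the JSON array printed by `exiftool -json`
# into a hash of { source path => tag hash }.
def index_exiftool_json(json_text)
  JSON.parse(json_text).each_with_object({}) do |entry, out|
    out[entry['SourceFile']] = entry.reject { |key, _| key == 'SourceFile' }
  end
end

# Hypothetical helper: runs ExifTool once for all paths instead of once
# per file. Assumes the exiftool binary is installed and on the PATH.
def batch_read(paths)
  return {} if paths.empty?

  # Arguments are passed as an array, so no shell quoting is involved.
  stdout, stderr, status = Open3.capture3('exiftool', '-json', *paths)
  if stdout.strip.empty?
    raise "exiftool failed: #{stderr.strip}" unless status.success?
    return {}
  end
  index_exiftool_json(stdout)
end
```

Calling `batch_read(Dir.glob('photos/*.jpg'))` would pay the process startup cost once for the whole directory. Very large file lists may exceed the operating system's argument length limit, so chunking the list is a sensible precaution.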

File properties indexed and displayed in a structured view
Fastio features

Extract Structured Data from Any File

Fast.io Metadata Views turn documents, images, and scanned files into queryable data with AI-powered extraction. 50GB free storage, no credit card required.

Extracting PDF and Audio Metadata

Beyond images, Ruby handles PDF document properties and audio tags with dedicated gems.

PDF metadata with pdf-reader

The pdf-reader gem parses PDF files conforming to the Adobe specification. It exposes document info dictionaries, page counts, and encryption details:

require 'pdf-reader'

reader = PDF::Reader.new('report.pdf')
info = reader.info

puts "Title: #{info[:Title]}"
puts "Author: #{info[:Author]}"
puts "Creator: #{info[:Creator]}"
puts "Producer: #{info[:Producer]}"
puts "Created: #{info[:CreationDate]}"
puts "Modified: #{info[:ModDate]}"
puts "Pages: #{reader.page_count}"
puts "PDF version: #{reader.pdf_version}"

The info hash keys are symbols matching the PDF specification field names. Not every PDF includes all fields. Some generators set the Title, others leave it blank. Always check for nil values before using them in your pipeline.
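Date fields deserve extra care. The PDF specification stores them as strings such as D:20240115093000+01'00', and the info hash typically hands that raw string back. The sketch below shows one way to convert it into a Time while tolerating nil and malformed values. The helper name `parse_pdf_date` is hypothetical, and treating a missing offset as UTC is an assumption of this sketch.

```ruby
# Matches PDF date strings: D:YYYYMMDDHHmmSS followed by an optional
# offset such as +01'00' or Z. Everything after the year is optional.
PDF_DATE = /\AD:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?([Zz+\-])?(\d{2})?'?(\d{2})?'?/

# Hypothetical helper: returns a Time, or nil when the value is missing
# or does not look like a PDF date.
def parse_pdf_date(raw)
  return nil if raw.nil?

  m = PDF_DATE.match(raw.to_s.strip)
  return nil unless m

  year, month, day, hour, min, sec = m.captures[0, 6].map { |v| v&.to_i }
  offset =
    case m[7]
    when '+', '-' then format('%s%02d:%02d', m[7], m[8].to_i, m[9].to_i)
    else '+00:00' # "Z" or no offset: treated as UTC in this sketch
    end
  Time.new(year, month || 1, day || 1, hour || 0, min || 0, sec || 0, offset)
rescue ArgumentError
  nil # out-of-range fields such as month 13
end

created = parse_pdf_date("D:20240115093000+01'00'")
puts created.inspect
```

With pdf-reader this would be called as `parse_pdf_date(info[:CreationDate])`, which keeps the nil check in one place.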

For more advanced PDF manipulation including metadata modification, HexaPDF is a pure-Ruby alternative that supports both reading and writing:

require 'hexapdf'

doc = HexaPDF::Document.open('report.pdf')
info = doc.trailer.info

puts "Title: #{info[:Title]}"
puts "Author: #{info[:Author]}"

info[:Title] = 'Updated Report Title'
doc.write('report_updated.pdf')

Audio metadata with taglib-ruby

taglib-ruby wraps the TagLib C++ library for reading and writing audio metadata. Install TagLib first (brew install taglib on macOS, apt install libtag1-dev on Ubuntu):

require 'taglib'

TagLib::FileRef.open('track.mp3') do |file|
  tag = file.tag
  puts "Title: #{tag.title}"
  puts "Artist: #{tag.artist}"
  puts "Album: #{tag.album}"
  puts "Year: #{tag.year}"
  puts "Track: #{tag.track}"
  puts "Genre: #{tag.genre}"

  props = file.audio_properties
  puts "Duration: #{props.length_in_seconds}s"
  puts "Bitrate: #{props.bitrate} kbps"
  puts "Sample rate: #{props.sample_rate} Hz"
  puts "Channels: #{props.channels}"
end

The FileRef class auto-detects the file format and uses the right parser. It works with MP3, FLAC, OGG Vorbis, WMA, AAC, WAV, and AIFF. The block syntax ensures the file handle is closed properly after extraction.

For ID3v2-specific features like cover art or custom frames, use the format-specific API:

TagLib::MPEG::File.open('track.mp3') do |file|
  tag = file.id3v2_tag
  tag.frame_list('APIC').each do |frame|
    puts "Cover art: #{frame.mime_type}, #{frame.picture.size} bytes"
  end
end

Building a Multi-Format Extraction Pipeline

Real projects rarely deal with a single file type. Here is a pattern for dispatching files to the right extractor based on extension, wrapping everything in consistent error handling:

require 'mini_exiftool'
require 'pdf-reader'
require 'taglib'

module MetadataExtractor
  AUDIO_EXTENSIONS = %w[.mp3 .flac .ogg .wma .aac .wav .aiff].freeze
  IMAGE_EXTENSIONS = %w[.jpg .jpeg .png .tiff .tif .heic .webp].freeze
  PDF_EXTENSIONS   = %w[.pdf].freeze

  def self.extract(filepath)
    ext = File.extname(filepath).downcase
    result = { file: filepath, extension: ext }

    begin
      if IMAGE_EXTENSIONS.include?(ext)
        result.merge!(extract_image(filepath))
      elsif PDF_EXTENSIONS.include?(ext)
        result.merge!(extract_pdf(filepath))
      elsif AUDIO_EXTENSIONS.include?(ext)
        result.merge!(extract_audio(filepath))
      else
        result[:error] = "no extractor for #{ext}"
      end
    rescue => e
      result[:error] = e.message
    end

    result
  end

  def self.extract_image(filepath)
    photo = MiniExiftool.new(filepath)
    {
      width: photo.image_width,
      height: photo.image_height,
      camera: photo.model,
      date_taken: photo.date_time_original,
      gps_lat: photo.gps_latitude,
      gps_lon: photo.gps_longitude
    }
  end

  def self.extract_pdf(filepath)
    reader = PDF::Reader.new(filepath)
    info = reader.info
    {
      title: info[:Title],
      author: info[:Author],
      pages: reader.page_count,
      pdf_version: reader.pdf_version
    }
  end

  def self.extract_audio(filepath)
    result = {}
    TagLib::FileRef.open(filepath) do |file|
      tag = file.tag
      props = file.audio_properties
      result = {
        title: tag.title,
        artist: tag.artist,
        album: tag.album,
        duration_seconds: props.length_in_seconds,
        bitrate: props.bitrate
      }
    end
    result
  end
end

Use it to process an entire directory:

require 'json'

results = Dir.glob('/uploads/**/*')
  .select { |f| File.file?(f) }
  .map { |f| MetadataExtractor.extract(f) }

File.write('metadata.json', JSON.pretty_generate(results))

This dispatcher is easy to extend. Adding a new format means writing one method and adding the extension to the right constant array. Errors on individual files are recorded in that file's result and do not crash the entire run.
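Routing by extension trusts the file name. If uploads may be mislabeled, one option is to check the leading bytes of the file before choosing an extractor. The sketch below is a minimal version of that idea, covering only a handful of well-known signatures; the helper name `sniff_family` and its return symbols are hypothetical, and it is not part of any gem discussed here.

```ruby
# Hypothetical helper: guesses :image, :pdf, or :audio from the first
# bytes of a file. Returns nil when no known signature matches.
def sniff_family(path)
  head = File.binread(path, 12) || ''.b

  return :image if head.start_with?("\xFF\xD8\xFF".b)          # JPEG
  return :image if head.start_with?("\x89PNG\r\n\x1A\n".b)     # PNG
  return :image if head.start_with?("II*\x00".b, "MM\x00*".b)  # TIFF
  return :pdf   if head.start_with?('%PDF-'.b)
  # MP3 with an ID3v2 header, FLAC, and Ogg containers
  return :audio if head.start_with?('ID3'.b, 'fLaC'.b, 'OggS'.b)

  # RIFF containers carry their real type at byte offset 8.
  if head.start_with?('RIFF'.b)
    return :image if head[8, 4] == 'WEBP'.b
    return :audio if head[8, 4] == 'WAVE'.b
  end

  nil
end
```

A dispatcher could call `sniff_family(filepath)` first and fall back to the extension when it returns nil, for example for MP3 files that carry no ID3v2 header.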

Performance considerations

mini_exiftool spawns ExifTool as a subprocess for each file, which adds process startup overhead. For batch jobs processing thousands of images, consider these alternatives:

  • Use the exiftool gem instead, which supports passing multiple files in a single ExifTool invocation
  • For image-only jobs, switch to exifr to avoid the subprocess entirely
  • Run extraction in threads using Ruby's Thread or in a job framework like Sidekiq; Ruby releases its global VM lock (GVL) while a thread waits on an ExifTool subprocess, so the calls can overlap

For Rails applications, metadata extraction fits naturally into Active Job. Queue the extraction after file upload, store results in your database, and keep the upload response fast:

class MetadataExtractionJob < ApplicationJob
  queue_as :default

  def perform(attachment_id)
    attachment = Attachment.find(attachment_id)
    metadata = MetadataExtractor.extract(attachment.file.path)
    attachment.update!(metadata: metadata)
  end
end
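Extracted values are often Ruby objects such as Time or Rational, which may not round-trip cleanly through a JSON column. A small normalization step before saving avoids surprises. The sketch below is one way to do it; the helper name `json_safe` is hypothetical, and dropping nil entries is a design choice of this sketch rather than a requirement.

```ruby
require 'json'
require 'time'

# Hypothetical helper: converts a metadata hash into JSON-friendly values.
# Times become ISO 8601 strings in UTC, Rationals become Floats, and nil
# hash entries are dropped.
def json_safe(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, val), out|
      out[key.to_s] = json_safe(val) unless val.nil?
    end
  when Array    then value.map { |val| json_safe(val) }
  when Time     then value.getutc.iso8601
  when Rational then value.to_f
  else value
  end
end

sample = {
  date_taken: Time.utc(2024, 1, 15, 9, 30, 0),
  exposure: Rational(1, 200),
  gps_lat: nil
}
puts JSON.generate(json_safe(sample))
```

Inside the job, this would wrap the extractor call, for example `attachment.update!(metadata: json_safe(metadata))`.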

When Programmatic Extraction Is Not Enough

Ruby gems work well for structured file types where the metadata format is standardized: EXIF in JPEG, info dictionaries in PDF, ID3 in MP3. But some extraction jobs go beyond what any parser can handle.

Consider these scenarios:

  • A scanned contract where the "metadata" you need (counterparty name, signing date, dollar amount) exists only as text in the document body, not in any header field
  • A photo gallery where you want to tag images with subjects ("group photo," "product shot," "landscape") based on visual content rather than EXIF fields
  • A folder of mixed-format invoices where the file structure varies between vendors, and you need to extract line items into a consistent schema

These are structured extraction problems, not metadata parsing problems. They require AI-based understanding of file contents rather than format-level header parsing.

Fast.io's Metadata Views handle this layer. You describe the fields you want extracted in natural language, and the system designs a typed schema (text, integer, decimal, boolean, URL, date/time) and populates a sortable, filterable spreadsheet from your files. It works with PDFs, images, Word docs, spreadsheets, presentations, scanned pages, and handwritten notes. No templates, no OCR rules, no custom code.

The difference from Intelligence Mode (which handles search and summarization) is that Metadata Views are the structured extraction layer. You get back typed columns you can sort, filter, and export, not free-text summaries.

For a typical workflow, you might use Ruby gems for header-level metadata (camera settings, creation dates, page counts) and Metadata Views for content-level extraction (contract terms, invoice totals, photo subjects). The two approaches complement each other: one parses file format headers, the other understands file contents.

Agents can create Views, trigger extraction, and query results through the Fast.io MCP server, so you can build pipelines that combine both programmatic and AI-based extraction in the same workflow.

Frequently Asked Questions

How do I extract EXIF data in Ruby?

The fastest approach is the exifr gem for pure-Ruby parsing. Install with `gem install exifr`, then use `EXIFR::JPEG.new('photo.jpg')` to access properties like width, height, model, date_time_original, and GPS coordinates. For broader format support including RAW and HEIF files, use mini_exiftool, which wraps the ExifTool command-line application.

What Ruby gem is best for reading file metadata?

It depends on the file type. For images, exifr handles JPEG and TIFF with zero dependencies. For multi-format coverage (images, video, audio, documents), mini_exiftool wraps ExifTool and supports over 400 formats. For PDFs specifically, pdf-reader is the standard with over 103 million downloads. For audio files, taglib-ruby provides full ID3 tag support.

How do I use mini_exiftool in Rails?

Install ExifTool on your server (`apt install libimage-exiftool-perl`), add `mini_exiftool` to your Gemfile, and use it in a background job. Call `MiniExiftool.new(file_path)` to read metadata, then store the extracted properties in your database. Run extraction in Active Job or Sidekiq to keep upload responses fast, since mini_exiftool spawns a subprocess for each file.

Can I extract PDF metadata with Ruby?

Yes. The pdf-reader gem reads PDF document info dictionaries including title, author, creator, creation date, page count, and PDF version. For read-write access, HexaPDF is a pure-Ruby library that can both extract and modify PDF metadata. Install either gem with no system dependencies required.

Does exifr work with RAW camera files?

No. exifr only supports JPEG and TIFF formats. For RAW files like CR2, ARW, NEF, or DNG, use mini_exiftool, which delegates to ExifTool and supports RAW formats from all major camera manufacturers.

How do I strip metadata from files in Ruby?

mini_exiftool supports both reading and writing. To strip GPS coordinates, set the relevant tags to nil (for example `photo.gps_latitude = nil` and `photo.gps_longitude = nil`) and call `photo.save`. To remove all metadata, run ExifTool directly with its `-all=` option. For image-specific stripping during processing, ruby-vips can remove EXIF fields while resaving.
