
How to Run Metadata Extraction with Serverless Cloud Functions

Serverless functions let you extract metadata from files on demand without provisioning or managing servers. This guide covers packaging binary tools like ExifTool and FFprobe as AWS Lambda layers, wiring S3 event triggers for automatic processing, and building extraction pipelines that scale to thousands of concurrent files while charging only for compute time used.

Fast.io Editorial Team 10 min read
Cloud infrastructure visualization showing automated file processing workflows

Why Serverless Fits Metadata Extraction

Metadata extraction is bursty by nature. You might process zero files for hours, then receive a batch upload of 5,000 images from a photographer or a dump of 10,000 PDFs from a migration project. Traditional server-based approaches force you to choose between paying for idle capacity or accepting slow processing during spikes.

Serverless functions solve this by running extraction code only when files arrive. AWS Lambda and Google Cloud Functions both spin up instances automatically, process the file, and shut down. You pay per invocation and per millisecond of compute, nothing else.

The economics make sense for metadata work specifically because most extraction jobs are short-lived. FFprobe reads only the header bytes of a media file to pull codec, resolution, and duration data. In benchmarks from the AWS Compute Blog, analyzing a 793 MB video file took between 186 and 816 milliseconds and consumed under 150 MB of memory. ExifTool parses EXIF, IPTC, and XMP tags from images in a similar timeframe. These are exactly the kinds of workloads serverless was designed for: brief, stateless, and parallelizable.

AWS Lambda currently supports up to 10,240 MB (10 GB) of memory and a maximum execution timeout of 15 minutes per invocation. For metadata extraction, you will rarely need more than 512 MB of memory or 30 seconds of runtime. The 15-minute ceiling exists as a safety net for unusually large or complex files, but typical extraction jobs finish in under a second.

The tradeoff is cold starts. The first invocation after a period of inactivity takes longer because Lambda needs to initialize your function and load any layers. For metadata extraction, cold starts typically add 1 to 3 seconds. If that latency matters for your workflow, you can use provisioned concurrency to keep instances warm, though this adds cost.

Architecture of an Event-Driven Metadata Pipeline

The standard serverless metadata extraction pipeline has four components: a storage trigger, the extraction function, a results store, and an optional downstream consumer.

Storage trigger. When a file lands in an S3 bucket (or a Google Cloud Storage bucket), the storage service emits an event notification. On AWS, you configure S3 Event Notifications to fire on s3:ObjectCreated:* events, which sends a payload to your Lambda function containing the bucket name, object key, file size, and content type. On Google Cloud, Eventarc routes google.cloud.storage.object.v1.finalized events to your Cloud Function.
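On AWS, wiring the trigger is a single CLI call. This is a sketch: the bucket name and function ARN below are placeholders, and the function must separately grant S3 permission to invoke it (via aws lambda add-permission) before the notification takes effect:

```shell
aws s3api put-bucket-notification-configuration \
  --bucket my-uploads-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:extract-metadata",
      "Events": ["s3:ObjectCreated:*"]
    }]
  }'
```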

Extraction function. Your Lambda (or Cloud Function) receives the event, downloads the file to temporary storage (/tmp on Lambda, 512 MB by default and configurable up to 10 GB of ephemeral storage), runs the extraction tool, and outputs structured metadata as JSON. The function itself is stateless: each invocation handles one file and exits.

Results store. Extracted metadata needs to land somewhere queryable. Common choices include DynamoDB for key-value lookups, PostgreSQL for relational queries, Elasticsearch for full-text search, or writing a JSON sidecar file back to S3 alongside the original.

Downstream consumers. Once metadata is stored, other systems can react to it. An SNS topic can notify a search indexer. A Step Functions workflow can route files based on their metadata (for example, sending all images with GPS coordinates to a geolocation service). Or a webhook can push results to an external platform.

Here is a minimal architecture using S3 and Lambda:

S3 Bucket (uploads/)
    │
    ├── s3:ObjectCreated:* event
    │
    ▼
Lambda Function (extract-metadata)
    │
    ├── Downloads file from S3
    ├── Runs ExifTool / FFprobe / custom parser
    ├── Structures output as JSON
    │
    ▼
DynamoDB Table (file-metadata)
    │
    ├── SNS notification (optional)
    │
    ▼
Downstream services

This pattern scales horizontally by default. If 1,000 files arrive simultaneously, Lambda spins up 1,000 concurrent instances (subject to your account's concurrency limit, which defaults to 1,000 and can be increased). Each instance processes one file independently, with no shared state or coordination required.
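The extraction function's first job is pulling bucket and key out of the event payload. A minimal sketch (the helper name is ours, not part of the AWS SDK); note that S3 delivers object keys URL-encoded, so keys with spaces or special characters must be decoded before calling the S3 API:

```python
import urllib.parse

def parse_s3_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload.

    S3 URL-encodes object keys in the event (a space arrives as '+'),
    so decode each key before using it in an API call.
    """
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs
```

Inside the handler, each pair would then be downloaded to /tmp and handed to the extraction tool.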

Workflow pipeline showing sequential task processing stages

Packaging ExifTool as an AWS Lambda Layer

ExifTool is a Perl-based command-line tool that reads and writes metadata for virtually every image, audio, and video format. Running it in Lambda requires packaging both the Perl runtime and ExifTool as a Lambda layer.

Lambda layers are ZIP archives that get extracted to /opt in the execution environment. Binaries placed in a bin/ directory within the ZIP are automatically added to the PATH, and libraries in lib/ are added to LD_LIBRARY_PATH. This means your function code can call exiftool as a subprocess without knowing its absolute path.

Building the ExifTool Layer

You need to build the layer on Amazon Linux 2 (the Lambda runtime OS) to ensure binary compatibility. The simplest approach uses Docker:

FROM public.ecr.aws/lambda/provided:al2

# Install the system perl that ExifTool runs on
RUN yum install -y perl perl-libs
RUN curl -L -o exiftool.tar.gz \
    https://exiftool.org/Image-ExifTool-12.87.tar.gz
RUN tar xzf exiftool.tar.gz

# Layer layout: executables in /opt/bin, Perl modules in /opt/lib
RUN mkdir -p /opt/bin /opt/lib
RUN cp Image-ExifTool-12.87/exiftool /opt/bin/
RUN cp -r Image-ExifTool-12.87/lib/* /opt/lib/
RUN cp $(which perl) /opt/bin/

After building, copy /opt/bin and /opt/lib out of the container (for example with docker cp), zip them, and publish the layer. Because the ExifTool modules end up in /opt/lib rather than in a lib/ directory beside the exiftool script (where the script looks first), also set PERL5LIB=/opt/lib in your function's environment variables so Perl can locate them:

docker create --name layer-build exiftool-layer   # assumes you tagged the image exiftool-layer
docker cp layer-build:/opt/bin bin
docker cp layer-build:/opt/lib lib

zip -r exiftool-layer.zip bin/ lib/
aws lambda publish-layer-version \
  --layer-name exiftool \
  --zip-file fileb://exiftool-layer.zip \
  --compatible-runtimes python3.12 nodejs20.x

Calling ExifTool from Your Function

With the layer attached, call ExifTool from Python using subprocess:

import subprocess
import json

def extract_metadata(file_path):
    result = subprocess.run(
        ["exiftool", "-json", "-n", file_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if exiftool exits non-zero
    )
    return json.loads(result.stdout)[0]

The -json flag returns structured JSON output. The -n flag returns raw numeric values instead of formatted strings, which is better for programmatic consumption. ExifTool can extract hundreds of metadata fields depending on the file type: EXIF data from photos (camera model, exposure, GPS coordinates), IPTC data (captions, keywords, copyright), XMP sidecars, and document properties from PDFs and Office files.
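Since the full dump can run to hundreds of keys, a common follow-up step is reducing it to the fields your index actually stores. A small sketch (the tag names are standard ExifTool names, but which ones appear depends on the file; the helper itself is illustrative):

```python
def summarize_exif(tags):
    """Keep a handful of commonly indexed ExifTool tags; absent tags map to None."""
    wanted = ["Model", "CreateDate", "GPSLatitude",
              "GPSLongitude", "ImageWidth", "ImageHeight"]
    return {name: tags.get(name) for name in wanted}
```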

Layer Size Considerations

The ExifTool layer with Perl typically comes in around 15 to 25 MB compressed. Lambda layers have a 50 MB compressed limit per layer and a 250 MB total unzipped limit across all layers attached to a function. ExifTool fits comfortably within these constraints, leaving room for additional layers if needed.


Extract Metadata from Files Without Building Infrastructure

Fast.io Metadata Views turns documents into structured, queryable data. Describe the fields you need, upload your files, and get a sortable spreadsheet back. 50 GB free, no credit card required.

Packaging FFprobe for Video and Audio Metadata

FFprobe is part of the FFmpeg project and extracts detailed technical metadata from media files: codecs, bitrates, frame rates, resolution, duration, and stream information. Unlike ExifTool, FFprobe understands container formats and can probe individual streams within a file.

Building the FFprobe Layer

FFprobe is available as a static binary, which simplifies layer creation. You do not need to compile it from source.

mkdir -p ffprobe-layer/bin
cd ffprobe-layer

curl -L -o ffprobe-release.tar.xz \
  https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xf ffprobe-release.tar.xz
cp ffmpeg-*-amd64-static/ffprobe bin/

zip -r ffprobe-layer.zip bin/
aws lambda publish-layer-version \
  --layer-name ffprobe \
  --zip-file fileb://ffprobe-layer.zip \
  --compatible-runtimes python3.12 nodejs20.x

The static build avoids missing library issues that plague dynamically linked binaries in the Lambda environment.

Using FFprobe with Pre-Signed URLs

A useful optimization for video files: instead of downloading the entire file to /tmp, generate a pre-signed S3 URL and pass it directly to FFprobe. FFprobe only reads the file headers and metadata atoms, so it downloads a tiny fraction of the actual file.

import subprocess
import json
import boto3

s3 = boto3.client("s3")

def extract_video_metadata(bucket, key):
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=300
    )
    result = subprocess.run(
        ["/opt/bin/ffprobe", "-loglevel", "error",
         "-show_streams", "-show_format",
         "-print_format", "json", url],
        capture_output=True,
        text=True
    )
    return json.loads(result.stdout)

In testing documented by AWS, this approach processed a 793 MB video file in under a second using only 146 MB of memory. The function never downloads the full file, making it both faster and cheaper than a download-then-analyze approach.

Combining Both Layers

You can attach both the ExifTool and FFprobe layers to a single Lambda function and route extraction based on the file's content type. Images go to ExifTool, video and audio files go to FFprobe, and documents can use a Python library like python-pptx or PyPDF2 bundled in the deployment package.
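That routing can be sketched as a small dispatch on the MIME type reported in the event (the return values are just labels for whichever code path invokes the corresponding tool):

```python
def choose_extractor(content_type):
    """Pick an extraction path from a file's MIME type."""
    if content_type.startswith("image/"):
        return "exiftool"
    if content_type.startswith(("video/", "audio/")):
        return "ffprobe"
    # PDFs, Office files, and everything else fall through
    # to a library bundled in the deployment package.
    return "document"
```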

Google Cloud Functions as an Alternative

If you are already on Google Cloud, Cloud Functions (2nd gen) offers a comparable setup. The event model differs slightly: instead of S3 Event Notifications, you use Eventarc to route Cloud Storage events to your function.

import functions_framework
from google.cloud import storage

@functions_framework.cloud_event
def process_upload(cloud_event):
    data = cloud_event.data
    bucket_name = data["bucket"]
    file_name = data["name"]

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(file_name)

    # Note: object names can contain slashes; flatten or create
    # the directories under /tmp before downloading in production.
    blob.download_to_filename(f"/tmp/{file_name}")
    metadata = extract_metadata(f"/tmp/{file_name}")
    # Store metadata in Firestore, BigQuery, etc.

Cloud Functions 2nd gen runs on Cloud Run under the hood, which gives you up to 32 GB of memory, 60-minute timeouts for HTTP-triggered functions (event-driven functions have a lower cap), and concurrency within a single instance (multiple requests handled by one instance). For metadata extraction, the extra memory headroom can help with unusually large files, though most jobs will not need it.

One important difference: Google Cloud Storage event delivery is at-least-once, meaning your function may receive duplicate events. Build your metadata pipeline to be idempotent. Check whether metadata already exists for a file before processing, or use a deduplication key based on the object's generation number.
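One way to sketch that check, using the generation number as the deduplication key (the module-level set stands in for whatever durable store, Firestore or Redis, you would use in production):

```python
_seen = set()  # stands in for a durable store in this sketch

def dedup_key(event_data):
    """Build an idempotency key from a GCS event payload.

    The (bucket, name, generation) triple identifies one version of one
    object, so duplicate deliveries of the same event produce the same key.
    """
    return f"{event_data['bucket']}/{event_data['name']}#{event_data['generation']}"

def should_process(event_data):
    """Return True the first time an event is seen, False on redelivery."""
    key = dedup_key(event_data)
    if key in _seen:
        return False
    _seen.add(key)
    return True
```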

For binary dependencies like ExifTool and FFprobe, you can either build a custom container image (Cloud Functions 2nd gen supports this) or include static binaries in your deployment package. The container approach is cleaner for complex dependency chains.

When a Managed Platform Makes More Sense

Building your own serverless extraction pipeline gives you full control over which tools run, how metadata is structured, and where results are stored. But it also means you own the layer packaging, the function code, the error handling, the retry logic, and the ongoing maintenance of binary dependencies.

For teams that need structured metadata from documents without building infrastructure, managed platforms handle the extraction layer entirely. You upload files and get structured data back.

Fast.io's Metadata Views takes this approach. Instead of writing code to parse specific file formats, you describe the fields you want extracted in natural language. The platform designs a typed schema (text, integer, decimal, boolean, URL, JSON, date and time), matches files in your workspace, and populates a sortable, filterable spreadsheet. It works across PDFs, images, Word docs, spreadsheets, presentations, scanned pages, and handwritten notes.

The difference from a Lambda pipeline is the extraction logic. With Lambda, you choose your tools (ExifTool for technical EXIF data, FFprobe for media streams, custom parsers for domain-specific formats) and write the code to run them. With Metadata Views, the AI handles format detection and field extraction based on your schema description. Adding a new column does not require reprocessing existing files.

For agent-driven workflows, Fast.io exposes these capabilities through its MCP server. An agent can create a workspace, upload files, define a metadata schema, trigger extraction, and query the results programmatically. The free plan includes 50 GB of storage and 5,000 credits per month with no credit card required.

Both approaches have their place. If you need raw technical metadata (EXIF tags, codec parameters, GPS coordinates), a Lambda pipeline with ExifTool and FFprobe gives you precise control. If you need business-level metadata (contract dates, invoice totals, policy numbers) from diverse document types, a managed extraction platform saves significant development time.

Frequently Asked Questions

How do I extract metadata using AWS Lambda?

Create a Lambda function triggered by S3 upload events. The function receives the bucket name and object key, downloads the file to /tmp, runs an extraction tool (ExifTool for images, FFprobe for video, or a language-specific library for documents), and writes the structured metadata to a datastore like DynamoDB. Package binary tools as Lambda layers so they are available in the execution environment.

Can I run ExifTool in a Lambda function?

Yes. ExifTool requires Perl, so you need to package both the Perl runtime and ExifTool as a Lambda layer. Build the layer on Amazon Linux 2 (the Lambda runtime OS) using Docker to ensure binary compatibility. Once the layer is attached to your function, call ExifTool via subprocess with the -json flag to get structured output.

How do I process file metadata with cloud functions?

Configure an event trigger on your storage bucket (S3 Event Notifications on AWS, Eventarc on Google Cloud) to invoke your function when files are created. The function downloads the file, runs your chosen metadata extraction tool, and stores the results. Both AWS Lambda and Google Cloud Functions support binary dependencies through layers or custom container images.

What is the best serverless approach for metadata extraction?

For most use cases, an event-driven approach works best. Configure your storage bucket to trigger a function on file upload, extract metadata in the function, and store results in a database. Use Lambda layers for binary tools like ExifTool and FFprobe. For high-throughput scenarios, add an SQS queue between S3 and Lambda to buffer events and control concurrency.

What are the size limits for AWS Lambda layers?

Each Lambda layer can be up to 50 MB compressed (ZIP file). The total unzipped size of all layers attached to a function cannot exceed 250 MB. For metadata extraction, an ExifTool layer typically uses 15 to 25 MB and an FFprobe layer uses about 30 MB, fitting well within these limits.

How do I handle large files in Lambda metadata extraction?

For video and audio files, use pre-signed S3 URLs with FFprobe instead of downloading the entire file. FFprobe reads only header metadata, so it processes even multi-gigabyte files in under a second. For other large files, Lambda supports up to 10 GB of ephemeral storage in /tmp, configurable in your function settings.

Can serverless functions process thousands of files concurrently?

Yes. AWS Lambda scales automatically up to your account's concurrency limit (default 1,000, can be increased to tens of thousands). Each file upload triggers an independent function instance. For batch processing, you can use S3 Batch Operations or Step Functions to orchestrate extraction across large file sets with built-in retry logic.
