
How to Extract Metadata from ZIP and Archive Files

Archive metadata includes container-level properties like compression method, encryption flags, and internal timestamps, plus the embedded metadata of each file inside. This guide covers how to extract that metadata from ZIP, RAR, 7z, and tar archives using ExifTool, Python, and 7-Zip on the command line.

Fast.io Editorial Team 13 min read

What Metadata Do Archive Files Actually Store?

Most people think of archives as compressed folders. They are, but they also carry structured metadata at two levels: the archive container itself, and each individual file entry inside it.

At the container level, an archive records the format version, compression algorithm, total entry count, and any archive-wide comments. Some formats store checksums (CRC-32, SHA-256) for integrity verification.

At the entry level, each file inside the archive gets its own metadata record. A ZIP file stores a separate timestamp, compressed size, uncompressed size, compression method, and CRC-32 for every entry. The ZIP format uses DOS-style timestamps with only 2-second resolution by default, though extended fields can add Unix timestamps (1-second resolution) or NTFS timestamps (100-nanosecond resolution). The DOS timestamp carries no time zone: it records the creator's local time, so it is only unambiguous if you know where the archive was created.
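To make the 2-second resolution concrete, here is a minimal sketch of how the two 16-bit DOS date and time fields pack and unpack. The function names are illustrative and not part of any library; Python's zipfile does this decoding for you and exposes the result as date_time.

```python
def decode_dos_datetime(dos_date: int, dos_time: int) -> tuple:
    """Unpack the 16-bit MS-DOS date and time fields used in ZIP headers."""
    year = ((dos_date >> 9) & 0x7F) + 1980   # 7 bits: years since 1980
    month = (dos_date >> 5) & 0x0F           # 4 bits
    day = dos_date & 0x1F                    # 5 bits
    hour = (dos_time >> 11) & 0x1F           # 5 bits
    minute = (dos_time >> 5) & 0x3F          # 6 bits
    second = (dos_time & 0x1F) * 2           # 5 bits, counted in 2-second units
    return (year, month, day, hour, minute, second)


def encode_dos_datetime(year, month, day, hour, minute, second) -> tuple:
    """Pack a timestamp into DOS fields; odd seconds are rounded down."""
    dos_date = ((year - 1980) << 9) | (month << 5) | day
    dos_time = (hour << 11) | (minute << 5) | (second // 2)
    return dos_date, dos_time


# Only 5 bits are available for seconds, so 45 cannot be represented
# and comes back as 44 after a round trip.
print(decode_dos_datetime(*encode_dos_datetime(2024, 3, 15, 10, 30, 45)))
```

Because only five bits hold the seconds value, any odd second is lost when the entry is written.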

7z archives go further. Each entry can include the filename, modification time, creation time, last access time, file attributes (read-only, hidden, system), CRC-32, compression method, and whether the entry is encrypted. The 7z format supports over 15 distinct metadata fields per entry when extended attributes are present.

RAR v5 archives store similar per-entry metadata plus the operating system that created the archive, an archive comment field, and recovery record data for error correction. TAR archives preserve Unix-specific metadata that other formats ignore: user ID, group ID, file permissions (the full chmod octal), and symbolic link targets.

Here is what each format stores:

| Metadata Field | ZIP | 7z | RAR v5 | tar |
| --- | --- | --- | --- | --- |
| Filename | Yes | Yes | Yes | Yes |
| Modification timestamp | Yes (2s resolution) | Yes (high precision) | Yes | Yes |
| Creation timestamp | Extended field only | Yes | Yes | No |
| File permissions | Extended field only | Yes | Yes | Yes (full POSIX) |
| Owner/group | No | No | No | Yes (UID/GID) |
| Compression method | Yes (per entry) | Yes (per entry) | Yes | N/A (tar is uncompressed) |
| CRC/checksum | CRC-32 | CRC-32 | CRC-32 | No (added by gzip/xz wrapper) |
| Encryption flag | Yes | Yes (AES-256) | Yes (AES-256) | No |
| OS of creation | Yes (creator system) | No | Yes | No |
| Archive comment | Yes | No | Yes | No |
| Symlink targets | No | No | No | Yes |

TAR itself is an uncompressed container. Compression comes from wrapping it with gzip (.tar.gz), bzip2 (.tar.bz2), or xz (.tar.xz). The compression wrapper adds its own metadata layer, like gzip's original filename and modification time fields.
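The gzip wrapper's own metadata can be read straight from the first bytes of a .tar.gz file. The sketch below parses the fixed 10-byte gzip header and the optional original-filename field. The function name read_gzip_header is illustrative, and the parser deliberately ignores the optional comment and header-CRC fields.

```python
import struct


def read_gzip_header(data: bytes) -> dict:
    """Parse the fixed gzip header and the optional original filename."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    # Bytes 2-9: method, flags, mtime (Unix seconds), extra flags, OS code
    method, flags, mtime, _xfl, os_code = struct.unpack("<BBIBB", data[2:10])
    pos = 10
    if flags & 0x04:  # FEXTRA: skip the length-prefixed extra field
        (xlen,) = struct.unpack("<H", data[pos:pos + 2])
        pos += 2 + xlen
    name = None
    if flags & 0x08:  # FNAME: zero-terminated Latin-1 string
        end = data.index(b"\x00", pos)
        name = data[pos:end].decode("latin-1")
    return {
        "method": method,         # 8 means deflate
        "mtime": mtime,           # 0 means "no timestamp recorded"
        "os": os_code,
        "original_name": name,
    }
```

Reading the first kilobyte of the file is normally enough input, for example read_gzip_header(open(path, 'rb').read(1024)), where path is whatever archive you are inspecting.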

Extracting Archive Metadata with ExifTool

ExifTool, created by Phil Harvey, reads metadata from over 400 file formats, including ZIP, RAR, 7z, and GZIP archives. It is a fast way to get a structured metadata dump from any archive without extracting the contents.

Installing ExifTool

macOS (Homebrew):

brew install exiftool

Ubuntu/Debian:

sudo apt install libimage-exiftool-perl

Windows: Download the standalone executable from exiftool.org and add it to your PATH.

Reading ZIP Metadata

Run ExifTool against a ZIP file to see container-level metadata:

exiftool archive.zip

This outputs the file type and MIME type, followed by the per-entry ZIP tags: version needed to extract, bit flag, compression method, modification date, CRC-32, compressed size, uncompressed size, and filename. In a multi-file archive every entry reuses the same tag names, so if you only see one set of values, add -a -G3 to print all entries grouped by document number.

To get machine-readable output for scripting:

exiftool -json archive.zip > metadata.json

Reading RAR and 7z Metadata

ExifTool handles RAR v5 and 7z files the same way:

exiftool backup.rar
exiftool project.7z

For RAR files, ExifTool extracts the file version, compressed size, uncompressed size, modification date, operating system, and archived filename for each entry. For 7z files, you get similar fields plus the compression method used.

Filtering Specific Fields

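Extract only the fields you care about. ExifTool names ZIP entry tags with a Zip prefix, while RAR entries use unprefixed names such as CompressedSize and ArchivedFileName:

exiftool -a -G3 -ZipFileName -ZipModifyDate -ZipCompressedSize -ZipUncompressedSize archive.zip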

To process an entire directory of archives:

exiftool -r -json /path/to/archives/ > all_metadata.json

The -r flag recurses into subdirectories. This is useful for auditing large collections of archived deliverables or backups.

Fastio features

Index your extracted archive contents with built-in AI search

Upload files from your archive extraction pipeline to Fast.io workspaces. Intelligence Mode auto-indexes everything for semantic search and RAG, no separate vector database needed. 50 GB free, no credit card required. Built for archive metadata extraction workflows.

Using Python for Programmatic Extraction

Python's standard library includes zipfile and tarfile modules that extract metadata without any third-party dependencies. For RAR and 7z, you need additional packages.

ZIP Metadata with zipfile

The zipfile module exposes a ZipInfo object for each entry with these attributes: filename, date_time (a 6-tuple of year, month, day, hour, minute, second), compress_type, file_size, compress_size, CRC, external_attr, create_system, create_version, extract_version, flag_bits, comment, and extra (raw bytes for extended fields).

import zipfile
import json

def extract_zip_metadata(path):
    metadata = []
    with zipfile.ZipFile(path, 'r') as zf:
        for info in zf.infolist():
            metadata.append({
                'filename': info.filename,
                'date_time': info.date_time,
                'compress_type': info.compress_type,
                'file_size': info.file_size,
                'compress_size': info.compress_size,
                'crc': info.CRC,
                'create_system': info.create_system,
                'flag_bits': info.flag_bits,
                'is_encrypted': bool(info.flag_bits & 0x1),
                'external_attr': info.external_attr,
            })
    return metadata

results = extract_zip_metadata('project.zip')
print(json.dumps(results, indent=2, default=str))

The create_system field tells you which operating system created the archive: 0 means MS-DOS/Windows, 3 means Unix. The external_attr field contains OS-specific file attributes. On Unix-created archives, shifting external_attr >> 16 gives you the Unix file permission bits.
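As a small illustration of that shift, the sketch below turns external_attr into an octal permission value and an ls-style string using the standard stat module. The helper name unix_mode is an assumption for this example, and it returns None when the entry was not created on Unix or carries no mode bits.

```python
import stat
import zipfile


def unix_mode(info: zipfile.ZipInfo):
    """Return (octal permissions, ls-style string), or None for non-Unix entries."""
    if info.create_system != 3:       # 3 = Unix; other values store other attributes
        return None
    mode = info.external_attr >> 16   # high 16 bits hold the Unix st_mode value
    if mode == 0:
        return None                   # the creating tool did not record a mode
    return oct(stat.S_IMODE(mode)), stat.filemode(mode)


# Build a ZipInfo by hand to show the decoding without needing a file on disk.
example = zipfile.ZipInfo("deploy.sh")
example.create_system = 3
example.external_attr = 0o100755 << 16   # regular file, rwxr-xr-x
print(unix_mode(example))
```

The low 16 bits of external_attr hold MS-DOS attribute flags, which is why the Unix mode has to be shifted out of the upper half.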

TAR Metadata with tarfile

The tarfile module preserves Unix metadata that ZIP typically drops:

import tarfile

def extract_tar_metadata(path):
    metadata = []
    with tarfile.open(path, 'r:*') as tf:
        for member in tf.getmembers():
            metadata.append({
                'name': member.name,
                'size': member.size,
                'mtime': member.mtime,
                'mode': oct(member.mode),
                'uid': member.uid,
                'gid': member.gid,
                'uname': member.uname,
                'gname': member.gname,
                'type': member.type,
                'linkname': member.linkname,
                'is_dir': member.isdir(),
                'is_symlink': member.issym(),
            })
    return metadata

The r:* mode auto-detects the compression wrapper (gzip, bzip2, xz), so the same code works for .tar.gz, .tar.bz2, and .tar.xz files.

7z and RAR with py7zr and rarfile

For 7z archives, install py7zr:

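pip install py7zr

import py7zr

with py7zr.SevenZipFile('backup.7z', 'r') as z:
    # list() returns one FileInfo record per entry without extracting anything
    for entry in z.list():
        print(entry.filename, entry.uncompressed, entry.creationtime,
              entry.crc32, entry.is_directory)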

For RAR archives, install rarfile. Listing metadata works on its own; reading compressed file contents additionally requires an extraction tool such as the unrar binary on the system:

pip install rarfile

import rarfile

with rarfile.RarFile('backup.rar') as rf:
    for info in rf.infolist():
        print(info.filename, info.file_size, info.date_time,
              info.compress_type, info.host_os)

The host_os field in RAR metadata identifies the operating system that created the archive, which is useful for forensic analysis and understanding why certain file attributes might be missing.


Command-Line Extraction with 7-Zip and unrar

If you prefer staying in the terminal without Python, 7-Zip and unrar provide detailed metadata listings.

7-Zip (7z)

7-Zip's l (list) command with the -slt flag outputs technical metadata for every entry:

7z l -slt archive.zip

This prints the path, size, compressed size, modification time, creation time, access time, attributes, CRC, method, encrypted flag, and block number for each file. The same command works on .7z, .rar, .tar.gz, .tar.xz, and dozens of other formats.

For scripting, pipe the output through grep or awk:

7z l -slt archive.7z | grep -E "^(Path|Size|Modified|Method|Encrypted)"
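If grep becomes limiting, the same listing can be turned into structured records in Python. This is a minimal sketch that assumes the usual -slt layout: "Key = Value" lines, a blank line between entries, and a line of ten dashes separating the archive-level properties from the entry records. Check that assumption against the output of your 7-Zip version; parse_slt is an illustrative name.

```python
def parse_slt(text: str) -> list:
    """Split `7z l -slt` output into one dict per entry."""
    text = text.replace("\r\n", "\n")
    # Everything before the dashed separator describes the archive itself.
    _, _, body = text.partition("\n----------\n")
    entries = []
    for block in body.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            key, sep, value = line.partition(" =")
            if sep:
                record[key.strip()] = value.strip()
        if record:
            entries.append(record)
    return entries
```

To feed it real data, capture the listing with subprocess.run(['7z', 'l', '-slt', path], capture_output=True, text=True).stdout and pass the string to parse_slt. If the separator line is missing, the function returns an empty list instead of guessing.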

unrar

For RAR-specific metadata, unrar provides verbose listings:

unrar lt archive.rar

The lt command shows technical details: filename, size, packed size, ratio, modification date, attributes, CRC-32, host OS, compression method, and version. The host OS field is particularly valuable for cross-platform forensics.

tar

The tar command shows Unix metadata directly:

tar tvf archive.tar.gz

This displays permissions, owner/group, file size, modification date, and filename for each entry. The permission string (like drwxr-xr-x) shows the exact POSIX permissions stored in the archive, which no other archive format preserves as faithfully.

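For more detail on every entry, raise the verbosity:

tar --list --verbose --verbose -f archive.tar.gz

Repeating --verbose increases the verbosity level; exactly which extra fields appear depends on your tar implementation. With GNU tar, add --full-time to print timestamps at full resolution, including seconds.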

Practical Workflows for Archive Metadata

Raw metadata extraction becomes useful when you build it into a repeatable workflow. Here are the patterns that come up most often.

Forensic Timeline Reconstruction

Archives often contain files from different time periods. Extracting timestamps from every entry lets you build a timeline of when files were originally created or modified, independent of when the archive itself was made. ZIP's per-entry timestamps are especially useful here because they record the original file modification time, not the archiving time.

A Python script that builds a sorted timeline from a ZIP:

import zipfile
from datetime import datetime

with zipfile.ZipFile('evidence.zip', 'r') as zf:
    entries = []
    for info in zf.infolist():
        dt = datetime(*info.date_time)
        entries.append((dt, info.filename, info.file_size))

for dt, name, size in sorted(entries):
    print(f"{dt.isoformat()}  {size:>10}  {name}")

Detecting Encryption and Compression Methods

Security audits often need to verify whether archived data is encrypted. The encryption flag in ZIP and 7z metadata tells you immediately, without trying to extract anything:

import zipfile

with zipfile.ZipFile('delivery.zip', 'r') as zf:
    for info in zf.infolist():
        encrypted = bool(info.flag_bits & 0x1)
        print(f"{info.filename}: encrypted={encrypted}, "
              f"method={info.compress_type}")

This is faster than attempting extraction and catching password errors, and it works even when you do not have the decryption password.
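The compress_type value printed above is a numeric method ID. A small lookup makes audit output easier to read. The sketch below maps the methods Python's zipfile knows about, plus method 99, which WinZip-style AES archives use as a marker, and bit 6 of the flag word, which the ZIP specification reserves for strong encryption. The names METHOD_NAMES and describe_entry are illustrative.

```python
import zipfile

# Method IDs from the ZIP specification; 99 marks WinZip-style AES entries.
METHOD_NAMES = {
    zipfile.ZIP_STORED: "stored",
    zipfile.ZIP_DEFLATED: "deflate",
    zipfile.ZIP_BZIP2: "bzip2",
    zipfile.ZIP_LZMA: "lzma",
    99: "aes (WinZip marker)",
}


def describe_entry(info: zipfile.ZipInfo) -> dict:
    """Summarize encryption and compression for one ZIP entry."""
    return {
        "filename": info.filename,
        "encrypted": bool(info.flag_bits & 0x1),           # bit 0: entry is encrypted
        "strong_encryption": bool(info.flag_bits & 0x40),  # bit 6: strong encryption
        "method": METHOD_NAMES.get(
            info.compress_type, f"unknown ({info.compress_type})"
        ),
    }
```

Calling describe_entry on each item of zf.infolist() gives a list that can be dumped to JSON or filtered for unencrypted entries.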

Bulk Archive Auditing

When you receive archives from multiple sources, you often need to catalog what is inside without extracting everything. Combine ExifTool's recursive mode with JSON output to build a searchable index:

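exiftool -r -json -a -G3 -FileName -FileType -ModifyDate \
  -CompressedSize -UncompressedSize -ArchivedFileName \
  -ZipFileName -ZipModifyDate -ZipCompressedSize -ZipUncompressedSize \
  /archives/ > index.json

This creates a single JSON file with one object per archive in the directory tree. The Zip-prefixed tags cover ZIP entries, the unprefixed ones cover RAR entries, and -a -G3 keeps the repeated per-entry tags instead of collapsing them into one value.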

Feeding Archive Metadata into Search Pipelines

Once you have structured metadata from your archives, the next step is making it searchable. Platforms with built-in document indexing can ingest the extracted files and their metadata together. Fast.io's Intelligence Mode, for example, auto-indexes uploaded files for semantic search and RAG queries. Upload the extracted archive contents to a workspace, enable Intelligence, and the files become searchable by meaning, not just filename. This is particularly useful for large archive collections where manual browsing is impractical.

For automated pipelines, you can script the extraction-to-upload flow: extract metadata and files from archives, filter by criteria (date range, file type, encryption status), then upload the qualifying files to a shared workspace using the Fast.io API or MCP server. Once the extracted documents land in a workspace, Metadata Views can pull structured data from the files themselves (contract dates, invoice amounts, image properties, author names) into a queryable grid, no additional scripts required. Webhooks can notify downstream systems when new files land.
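The filtering step of such a pipeline might look like the sketch below, which selects ZIP entries by modification date, file extension, and encryption status using only metadata. The function name select_entries and its parameters are assumptions for illustration. The upload step is left out because it depends on the API you target.

```python
import zipfile
from datetime import datetime
from pathlib import PurePosixPath


def select_entries(zf: zipfile.ZipFile, since: datetime, suffixes: set,
                   allow_encrypted: bool = False) -> list:
    """Return names of entries that pass the date, type, and encryption filters."""
    selected = []
    for info in zf.infolist():
        if info.is_dir():
            continue
        if not allow_encrypted and info.flag_bits & 0x1:
            continue  # skip entries that cannot be read without a password
        if PurePosixPath(info.filename).suffix.lower() not in suffixes:
            continue
        # date_time is the creator's local time with no time zone attached
        if datetime(*info.date_time) < since:
            continue
        selected.append(info.filename)
    return selected
```

The returned names can be passed to zf.extract() or zf.read() one at a time, so only the qualifying files are ever decompressed.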

Handling Edge Cases and Limitations

Archive metadata extraction is not always straightforward. Several edge cases trip up automated pipelines.

Password-Protected Archives

You can read metadata from encrypted ZIP files without the password. The filename, timestamps, compressed size, and CRC are stored in the unencrypted central directory. However, 7z and RAR archives can optionally encrypt the file listing itself (header encryption), which blocks all metadata access without the password.

To check whether a 7z archive has encrypted headers:

7z l archive.7z

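If 7z asks for a password before it prints any filenames, the headers are encrypted. If the listing succeeds and 7z l -slt reports "Encrypted = +" for individual entries, only the file contents are encrypted.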

Nested Archives

Archives inside archives (a .tar.gz inside a .zip) require recursive extraction. The outer archive's metadata tells you an entry exists, but you need to extract it to read the inner archive's metadata.

A simple approach in Python that inspects one level of nesting, reading the inner archive in memory:

import zipfile
import tarfile
import io

def inspect_nested(zf, entry):
    data = zf.read(entry.filename)
    if entry.filename.endswith('.tar.gz'):
        with tarfile.open(fileobj=io.BytesIO(data), mode='r:gz') as tf:
            for member in tf.getmembers():
                print(f"  nested: {member.name} "
                      f"({member.size} bytes, "
                      f"mode={oct(member.mode)})")
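For archives nested more than one level deep, the same idea can be applied recursively. The following sketch is one possible generalization, not a complete solution: it handles only ZIP and tar (with gzip, bzip2, or xz wrappers), reads each inner archive fully into memory, skips encrypted ZIP entries, and stops at a hypothetical max_depth to guard against archive bombs. The name walk_archive is illustrative.

```python
import io
import tarfile
import zipfile


def walk_archive(data: bytes, label: str, depth: int = 0, max_depth: int = 3):
    """Yield (depth, path, size) for each entry, descending into nested archives."""
    # Try tar first: its header checksum makes false positives unlikely.
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tf:
            for member in tf.getmembers():
                path = f"{label}/{member.name}"
                yield depth, path, member.size
                if member.isfile() and depth < max_depth:
                    inner = tf.extractfile(member).read()
                    yield from walk_archive(inner, path, depth + 1, max_depth)
        return
    except tarfile.ReadError:
        pass  # not a tar stream; fall through to the ZIP check

    stream = io.BytesIO(data)
    if not zipfile.is_zipfile(stream):
        return  # an ordinary file, nothing to descend into
    with zipfile.ZipFile(stream) as zf:
        for info in zf.infolist():
            path = f"{label}/{info.filename}"
            yield depth, path, info.file_size
            readable = not info.is_dir() and not info.flag_bits & 0x1
            if readable and depth < max_depth:
                yield from walk_archive(zf.read(info), path, depth + 1, max_depth)
```

A typical call reads the outer file once and iterates the generator, for example: for depth, path, size in walk_archive(open('outer.zip', 'rb').read(), 'outer.zip'). Because every candidate entry is decompressed to test whether it is an archive, this is considerably slower than a plain metadata listing.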

Timestamp Precision Differences

ZIP's default 2-second timestamp resolution means two files modified 1 second apart will show the same timestamp. If your workflow depends on sub-second ordering, check for extended timestamp fields in the extra attribute, or use 7z or tar which store higher-precision timestamps natively.
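Checking the extra attribute means walking a sequence of (header ID, length, data) records. The sketch below looks for two well-known records: 0x5455, the Unix extended timestamp stored as UTC seconds, and 0x000A, the NTFS record holding 100-nanosecond FILETIME values. It is a simplified reading of those layouts: it returns only the modification time and only inspects the first attribute block of an NTFS record. The name extended_mtime is illustrative.

```python
import struct
from datetime import datetime, timedelta, timezone


def extended_mtime(extra: bytes):
    """Return a UTC modification time from a ZIP extra field, or None."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4: pos + 4 + size]
        pos += 4 + size
        # 0x5455: flags byte, then a signed 32-bit mtime if bit 0 is set
        if header_id == 0x5455 and len(body) >= 5 and body[0] & 0x1:
            (seconds,) = struct.unpack_from("<i", body, 1)
            return datetime.fromtimestamp(seconds, tz=timezone.utc)
        # 0x000A: 4 reserved bytes, then tag 0x0001 with mtime/atime/ctime
        if header_id == 0x000A and len(body) >= 32:
            tag, tag_size = struct.unpack_from("<HH", body, 4)
            if tag == 0x0001 and tag_size >= 24:
                (filetime,) = struct.unpack_from("<Q", body, 8)
                epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
                return epoch + timedelta(microseconds=filetime // 10)
    return None
```

Pass it info.extra for each entry from infolist(), and fall back to info.date_time when it returns None.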

Character Encoding

ZIP filenames were originally encoded in CP437 (IBM PC encoding). Modern ZIP files use UTF-8 when the language encoding flag (bit 11 of the general-purpose bit flag) is set. Archives created on systems with different encodings can produce garbled filenames. Python's zipfile module handles this automatically in most cases, but edge cases with Japanese, Chinese, or Korean filenames in older archives may require manual decoding.
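Manual decoding usually means reversing the CP437 decoding that zipfile applied and decoding the raw bytes again with the encoding you believe the creator used. The sketch below assumes you already know or can guess that encoding (Shift JIS in the usage example); recover_filename is an illustrative name.

```python
import zipfile


def recover_filename(info: zipfile.ZipInfo, legacy_encoding: str) -> str:
    """Re-decode a name that zipfile read as CP437 because bit 11 was not set."""
    if info.flag_bits & 0x800:
        return info.filename  # UTF-8 flag set: the name is already correct
    try:
        raw = info.filename.encode("cp437")   # recover the original bytes
        return raw.decode(legacy_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return info.filename  # wrong guess: keep the name zipfile produced


# Simulate an old archive entry whose name was written as Shift JIS bytes.
raw_name = "\u65e5\u672c\u8a9e.txt".encode("shift_jis")
entry = zipfile.ZipInfo(raw_name.decode("cp437"))
print(recover_filename(entry, "shift_jis"))
```

On Python 3.11 and later, zipfile.ZipFile also accepts a metadata_encoding argument when reading, which applies a legacy encoding to every entry whose UTF-8 flag is not set.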

Corrupted Archives

Partially corrupted archives may have readable metadata for some entries but not others. ExifTool and Python's zipfile will raise errors on corrupted entries. Wrap your extraction code in try/except blocks and log which entries failed rather than stopping the entire process.
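One way to structure that error handling for ZIP files is sketched below. It records metadata for every entry the central directory lists and, when verify is enabled, reads each entry to the end so that zipfile's CRC check runs. The function name safe_zip_metadata and the shape of the returned tuples are assumptions for this example.

```python
import zipfile
import zlib


def safe_zip_metadata(source, verify: bool = True):
    """Return (entries, problems) without aborting on the first bad entry."""
    entries, problems = [], []
    try:
        zf = zipfile.ZipFile(source)
    except (zipfile.BadZipFile, OSError) as exc:
        return entries, [("<archive>", f"cannot open: {exc}")]
    with zf:
        for info in zf.infolist():
            entries.append({
                "filename": info.filename,
                "file_size": info.file_size,
                "crc": info.CRC,
            })
            if not verify or info.is_dir():
                continue
            try:
                with zf.open(info) as fh:
                    while fh.read(65536):   # reading to EOF triggers the CRC check
                        pass
            except (zipfile.BadZipFile, zlib.error, RuntimeError,
                    NotImplementedError, OSError, EOFError) as exc:
                # RuntimeError also covers encrypted entries opened without a password
                problems.append((info.filename, str(exc)))
    return entries, problems
```

Set verify=False when you only need the listing. Verification decompresses every entry, so it costs roughly as much as a full extraction.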

Frequently Asked Questions

What metadata is stored in a ZIP file?

A ZIP file stores per-entry metadata including the filename, last modification timestamp (2-second resolution by default), compression method, compressed and uncompressed size, CRC-32 checksum, creator operating system, PKZIP version, flag bits (including encryption status), external file attributes, an optional comment, and an extra field that can contain NTFS timestamps, Unix timestamps, or other extended data. The archive itself also has an optional global comment.

How do you view ZIP file properties without extracting?

Use ExifTool (exiftool archive.zip), 7-Zip (7z l -slt archive.zip), or Python's zipfile module with infolist(). All three read the central directory of the ZIP file, which contains metadata for every entry, without decompressing or extracting any file contents.

Can you extract metadata from encrypted archives?

For ZIP files, yes. The central directory with filenames, timestamps, and sizes is unencrypted even when file contents are password-protected. For RAR and 7z, it depends: both formats support optional header encryption, which encrypts the file listing itself. Without header encryption, you can read entry metadata. With header encryption enabled, you need the password to see anything.

Which archive format preserves the most file metadata?

TAR preserves the most Unix-specific metadata, including user/group IDs, full POSIX file permissions, and symbolic link targets. 7z stores the most general-purpose metadata fields per entry (over 15 when extended attributes are present). ZIP stores the least by default but can extend its metadata through extra fields. The best choice depends on whether you need Unix permission fidelity (tar) or cross-platform compatibility (ZIP or 7z).

How do you extract metadata from nested archives?

You need to extract the inner archive from the outer one first, then read its metadata separately. In Python, you can do this in memory without writing to disk: read the inner archive's bytes from the outer ZipFile, wrap them in a BytesIO object, and open them with the appropriate module (tarfile, zipfile, or py7zr). ExifTool does not recursively inspect nested archives.

What is the difference between archive metadata and file metadata?

Archive metadata describes the container and its entries, including compression method, encryption flags, each entry's stored modification timestamp, and checksums. File metadata is the information embedded within individual files, like EXIF data in photos, author fields in PDFs, or revision history in Word documents. Extracting a file from an archive gives you the file metadata, but you need to inspect the archive itself to get the archive-level metadata.
