AI & Agents

How to Extract Metadata from Excel Spreadsheets

Excel spreadsheets store far more than visible cell data. Author names, revision history, external data connections, and hidden sheets all live inside the file as metadata. This guide covers how to extract that metadata using Excel's built-in inspector, Python's openpyxl library, and PowerShell, plus how to handle bulk extraction across hundreds of files.

Fast.io Editorial Team 11 min read
AI-powered document analysis extracting structured metadata from files

What Metadata Hides Inside Excel Files

Every .xlsx file is actually a ZIP archive containing XML files that store your spreadsheet data, formatting, and metadata. When you save a workbook, Excel writes document properties into these XML files automatically, whether you intended to share that information or not.

The metadata stored in Excel files falls into several categories:

  • Core properties: author, title, subject, keywords, category, description, and language
  • Extended properties: company name, application version, total editing time, and document security settings
  • Custom properties: any key-value pairs added by macros, add-ins, or enterprise document management systems
  • Hidden structural data: hidden sheets, named ranges, external data connections, print areas, and defined names

A single Excel file can expose 20 or more document properties without the author realizing it. Enterprise spreadsheets often contain 3 to 5 external data connections that reveal internal server URLs, database names, or API endpoints. If you share a workbook externally without stripping this metadata first, you risk leaking internal infrastructure details, employee names, and file system paths.

Understanding what metadata exists is the first step. The rest of this guide shows you how to extract it programmatically so you can audit, catalog, or clean it at scale.

Helpful references: Fast.io Workspaces, Fast.io Collaboration, Fast.io AI, and Document Data Extraction.

What to check before scaling extract metadata from excel spreadsheets

The fast way to inspect a single file's metadata is through Excel itself. This works for spot checks, but it does not scale to bulk workflows.

View Document Properties

  1. Open the workbook in Excel
  2. Click File then Info
  3. The right panel displays the document properties: title, tags, categories, related dates, and related people
  4. Click Show All Properties at the bottom to see the full list including company, manager, and custom fields

Use the Document Inspector

Excel's Document Inspector finds hidden data that the Info panel does not show. To run it:

  1. Go to File then Info
  2. Click Check for Issues then Inspect Document
  3. Select the categories you want to scan: comments, document properties, hidden rows and columns, hidden sheets, external data connections, and custom XML
  4. Click Inspect

The inspector reports what it finds in each category. You can then choose to remove specific categories. This is useful for cleaning files before sharing, but it only works one file at a time.

Limitations

The built-in approach breaks down when you need to extract metadata from dozens or hundreds of files. You cannot export the properties to a structured format, compare metadata across files, or automate the inspection process. For those workflows, you need a programmatic approach.

Extract Metadata with Python and openpyxl

Python's openpyxl library gives you direct access to .xlsx document properties without opening Excel. This is the most flexible approach for custom extraction workflows.

Install openpyxl

pip install openpyxl

Read Core Properties

from openpyxl import load_workbook

wb = load_workbook('report.xlsx')
props = wb.properties

print(f"Title: {props.title}")
print(f"Author: {props.creator}")
print(f"Created: {props.created}")
print(f"Modified: {props.modified}")
print(f"Last Modified By: {props.lastModifiedBy}")
print(f"Company: {props.company}")
print(f"Category: {props.category}")
print(f"Keywords: {props.keywords}")
print(f"Description: {props.description}")
print(f"Subject: {props.subject}")
print(f"Revision: {props.revision}")

The wb.properties object exposes all core and extended Dublin Core properties stored in the file's docProps/core.xml and docProps/app.xml files.

Detect Hidden Sheets

for sheet in wb.sheetnames:
    ws = wb[sheet]
    visibility = ws.sheet_state
    print(f"{sheet}: {visibility}")

Sheets have three possible states: visible, hidden, and veryHidden. The veryHidden state means the sheet was hidden using VBA and cannot be unhidden through the Excel UI alone. This is a common place for sensitive data to persist unnoticed.

Extract Named Ranges and External Connections

for name in wb.defined_names.definedName:
    print(f"Name: {name.name}, Value: {name.attr_text}")

Named ranges sometimes reference external workbooks or data sources. Check for values that contain file paths or URLs, as these can leak information about your network structure.

Bulk Extraction Script

To process an entire directory of Excel files and export the metadata to CSV:

import os
import csv
from openpyxl import load_workbook

folder = './spreadsheets'
output = []

for filename in os.listdir(folder):
    if not filename.endswith('.xlsx'):
        continue
    path = os.path.join(folder, filename)
    try:
        wb = load_workbook(path, read_only=True)
        props = wb.properties
        output.append({
            'file': filename,
            'author': props.creator,
            'company': props.company,
            'created': str(props.created),
            'modified': str(props.modified),
            'last_modified_by': props.lastModifiedBy,
            'title': props.title,
            'hidden_sheets': sum(
                1 for s in wb.sheetnames
                if wb[s].sheet_state != 'visible'
            ),
        })
        wb.close()
    except Exception as e:
        print(f"Error reading {filename}: {e}")

with open('metadata_report.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=output[0].keys())
    writer.writeheader()
    writer.writerows(output)

Using read_only=True avoids loading cell data into memory, which makes this faster when you only need metadata. A directory of 500 spreadsheets typically processes in under a minute.

Structured audit log showing extracted document metadata and properties
Fastio features

Centralize Your Spreadsheet Metadata Workflows

Fast.io Metadata Views extract structured data from spreadsheets into a queryable grid. Describe the fields you need, and AI pulls author names, revision history, and custom properties across every file. Webhooks, audit trails, and version control included. Free for teams, no credit card required.

Extract Metadata with PowerShell

On Windows systems, PowerShell can extract Excel metadata without installing Python or any additional libraries. This approach uses COM automation through the Shell.Application object.

Basic Property Extraction

$shell = New-Object -ComObject Shell.Application
$folder = $shell.Namespace("C:\Reports")
$file = $folder.ParseName("quarterly-report.xlsx")

### Property indices: 0=Name, 9=Type, 20=Author, 21=Title
Write-Output "Author: $($folder.GetDetailsOf($file, 20))"
Write-Output "Title: $($folder.GetDetailsOf($file, 21))"
Write-Output "Subject: $($folder.GetDetailsOf($file, 22))"

The Shell.Application COM object exposes over 260 property indices. The exact indices vary slightly across Windows versions, so you may need to enumerate them to find what you need.

Enumerate All Properties

$shell = New-Object -ComObject Shell.Application
$folder = $shell.Namespace("C:\Reports")
$file = $folder.ParseName("quarterly-report.xlsx")

for ($i = 0; $i -lt 267; $i++) {
    $name = $folder.GetDetailsOf($null, $i)
    $value = $folder.GetDetailsOf($file, $i)
    if ($value) {
        Write-Output "${i}: ${name} = ${value}"
    }
}

This prints every non-empty metadata field with its index number and human-readable name. Run it once to identify which indices matter for your use case, then hard-code those indices in your extraction script.

Bulk Extraction to CSV

$shell = New-Object -ComObject Shell.Application
$folderPath = "C:\Reports"
$folder = $shell.Namespace($folderPath)
$results = @()

Get-ChildItem $folderPath -Filter *.xlsx | ForEach-Object {
    $file = $folder.ParseName($_.Name)
    $results += [PSCustomObject]@{
        FileName   = $_.Name
        Author     = $folder.GetDetailsOf($file, 20)
        Title      = $folder.GetDetailsOf($file, 21)
        Subject    = $folder.GetDetailsOf($file, 22)
        Created    = $folder.GetDetailsOf($file, 4)
        Modified   = $folder.GetDetailsOf($file, 3)
        Size       = $folder.GetDetailsOf($file, 1)
    }
}

$results | Export-Csv -Path "excel_metadata.csv" -NoTypeInformation

This script scans a directory, extracts the key properties from each .xlsx file, and writes the results to a CSV file. For large directories, this approach is fast because it uses the Windows shell rather than opening each file in Excel.

When to Use

PowerShell vs. Python PowerShell works well for quick Windows-only extractions where you need file-level properties (author, dates, size). Python with openpyxl is better when you need to inspect internal workbook structure like hidden sheets, named ranges, defined names, or custom XML properties. The COM object does not expose those internal details.

Security Implications of Excel Metadata

Metadata in Excel files creates real security and compliance risks. Before sharing any spreadsheet externally, you should audit what metadata it contains.

Common Data Leaks

Author and company information is written automatically based on the user's Office profile. If an employee creates a spreadsheet on a personal machine, the author field might contain their personal email or a previous employer's name. Internal server names often appear in external data connections, revealing infrastructure details to anyone who inspects the file.

Revision history and comments can expose the editing process. Draft comments, tracked changes, and previous authors all persist in the file metadata. A financial model shared with investors might contain internal notes like "pad these numbers" or "CEO wants this higher." Microsoft's Document Inspector was specifically designed to catch these issues.

Hidden sheets and named ranges are the most overlooked risk. A workbook might contain a hidden sheet with raw salary data or customer PII that the author forgot to remove. The veryHidden sheet state makes this worse, because users cannot find these sheets through the normal Excel UI.

Compliance Considerations

Organizations handling sensitive data need metadata extraction as part of their document review process. Financial services firms, law firms, and healthcare organizations regularly audit outbound documents for metadata leaks. Automated extraction with Python or PowerShell, as described in the sections above, lets you build this into your document workflow rather than relying on manual spot checks.

When you need to store and share spreadsheets that have been audited for metadata, a workspace with version control and audit trails helps maintain the chain of custody. Fast.io workspaces track file versions and access events, so you can verify that a cleaned file is the one that was actually shared. The platform's Intelligence feature indexes spreadsheet contents for search and AI chat, and Metadata Views can surface document properties like author, company, and revision history as filterable columns across your entire collection, giving compliance teams a queryable audit layer without writing extraction scripts.

Automating Metadata Extraction at Scale

Single-file extraction is straightforward. The challenge is building a repeatable process for hundreds or thousands of files across teams and projects.

Build a Metadata Catalog

A metadata catalog is a centralized record of document properties across your organization. The Python bulk extraction script from earlier is a starting point, but a production workflow needs a few additions:

  • Scheduling: Run extraction on a cron job or triggered by file upload events
  • Deduplication: Track files by hash to avoid re-processing unchanged documents
  • Alerting: Flag files that contain unexpected metadata like external data connections to unknown servers, or author fields that don't match your organization's domain
  • Storage: Write results to a database or structured file format that supports querying

Webhook-Driven Extraction

Rather than scanning directories on a schedule, you can trigger metadata extraction when files arrive. If your team uses a cloud workspace like Fast.io, you can configure webhooks to fire when new files are uploaded. Your extraction script receives the notification, downloads the file, extracts the metadata, and logs the results. This eliminates the delay between upload and audit.

Fast.io's webhook system notifies your endpoint when files change in a workspace. Combined with the download API, you can build an extraction pipeline that processes files within seconds of upload. For many spreadsheet workflows, you can skip the custom pipeline entirely: Fast.io's Metadata Views let you describe the fields you want extracted in natural language (author, company, hidden sheet count, external connection URLs) and the AI builds a typed schema and populates a sortable, filterable grid across every spreadsheet in the workspace. No scripts, no templates. Add new extraction columns later without reprocessing existing files.

Comparing Metadata Across Files

Once you have metadata from multiple files, you can identify patterns and anomalies. Common analyses include:

  • Files where the author does not match the expected team member
  • Workbooks with external data connections pointing to deprecated or unauthorized servers
  • Documents with hidden sheets that were not flagged during review
  • Files where the company property contains a competitor's name, suggesting the file was originally created elsewhere

These checks are straightforward to implement once you have the metadata in a structured format like CSV or a database table.

Storing Cleaned Files

After extracting and reviewing metadata, you often need to store the cleaned versions in a location where access is controlled and auditable. Cloud workspaces with granular permissions, like Fast.io, let you set access at the workspace, folder, or individual file level. Every download and view is logged, so you maintain visibility into who accessed the cleaned files and when. For teams that need to share cleaned spreadsheets with external parties, branded shares provide a controlled delivery mechanism without email attachments.

Frequently Asked Questions

How do I view metadata in an Excel file?

Open the file in Excel, click File then Info. The properties panel shows author, dates, and tags. Click Show All Properties for the full list. For deeper inspection, use File, Info, Check for Issues, Inspect Document to find hidden sheets, comments, and external connections.

What hidden data is stored in Excel files?

Excel files can contain author names, company information, revision history, comments, hidden and -hidden sheets, named ranges, external data connections with server URLs, custom XML data, print settings, and previous author information. The Document Inspector in Excel scans for all these categories.

How to extract metadata from multiple Excel files at once?

Use Python with openpyxl or PowerShell to loop through a directory of .xlsx files and extract properties programmatically. Both approaches can output results to CSV for analysis. Python is more flexible for inspecting internal workbook structure, while PowerShell works without additional dependencies on Windows.

Can Excel metadata reveal sensitive information?

Yes. Author fields expose employee names and email addresses. External data connections reveal internal server URLs and database names. Hidden sheets may contain raw data like salary figures or customer information. Revision history and comments can expose internal discussions that were never meant for external audiences.

What is the difference between hidden and veryHidden sheets in Excel?

Hidden sheets can be revealed by right-clicking any sheet tab and selecting Unhide. VeryHidden sheets are hidden through VBA code and do not appear in the Unhide dialog. You need the VBA editor or a programmatic tool like openpyxl to detect and access veryHidden sheets.

Does openpyxl work with .xls files?

No. openpyxl only reads .xlsx files (the Office Open XML format introduced in Excel 2007). For older .xls files, use the xlrd library in Python. The metadata structure differs between the two formats, so your extraction code will need separate handling for each.

Related Resources

Fastio features

Centralize Your Spreadsheet Metadata Workflows

Fast.io Metadata Views extract structured data from spreadsheets into a queryable grid. Describe the fields you need, and AI pulls author names, revision history, and custom properties across every file. Webhooks, audit trails, and version control included. Free for teams, no credit card required.