How to Extract Metadata from SharePoint Documents
SharePoint document libraries hold structured metadata across custom columns, content types, and managed metadata term stores. This guide covers three extraction methods: the Microsoft Graph API for programmatic access, PnP PowerShell for bulk scripting, and Power Automate for low-code workflows.
Why Extract Metadata from SharePoint Documents
SharePoint reached 200 million monthly active users in 2020, driven by Microsoft Teams integration and the shift to remote work. That number has continued growing as Microsoft 365 becomes the default enterprise document management layer. Most SharePoint deployments go well beyond file storage. Document libraries carry structured metadata through custom columns, content types, managed metadata terms, and version history.
SharePoint metadata extraction pulls document properties, managed metadata columns, content type fields, and version history from SharePoint document libraries for use in external systems. The challenge is that this metadata is locked inside SharePoint's APIs and database layer. When you need document properties in a reporting dashboard, a compliance system, a migration tool, or an external search index, you have to pull it out programmatically.
SharePoint's built-in export options do not cover most extraction scenarios. The "Export to Excel" button on a library view caps around 30,000 items, drops version metadata, and flattens managed metadata terms into plain text. CSV exports strip column types entirely, turning dates, lookups, and taxonomy terms into unstructured strings.
Common scenarios where programmatic extraction is worth the setup:
- Migrating document libraries to another platform while preserving all column values and content type assignments
- Building a cross-library search index that spans multiple site collections
- Feeding document properties into a compliance or records management system
- Syncing metadata to a data warehouse or BI tool for reporting
- Auditing which documents carry specific classification tags across an entire tenant
The typical enterprise document library contains 15 or more custom metadata columns beyond the defaults. These columns often encode business-critical information: contract expiration dates, project codes, approval status, client identifiers, or compliance classifications. When that data stays trapped in SharePoint, teams end up manually copying values into spreadsheets or duplicating entries across systems.
This guide covers three methods for extracting SharePoint metadata: the Microsoft Graph API for modern programmatic access, PnP PowerShell for bulk scripting, and Power Automate for low-code workflows.
What to check before scaling extract metadata from sharepoint documents
Before writing extraction code, you need to understand where metadata actually lives in SharePoint. The platform stores document metadata across three distinct layers, and each requires a different extraction approach.
Library Columns
Every
SharePoint document library is backed by a list, and each list has columns (also called fields). Some columns ship with the default Document content type: Name, Modified, Modified By, Created, Created By. Site administrators or content type designers add custom columns on top of these.
Library columns span several types: single line of text, choice, number, currency, date and time, lookup (referencing another list), person or group, hyperlink, and managed metadata. When you query a library through the API, these column values come back as list item fields.
Content Types and Inheritance
Content types are reusable collections of metadata columns that can be assigned to document libraries. A parent content type like "Project Document" might define columns for Project Name and Project Manager. Child content types such as "Project Plan" or "Status Report" inherit those columns automatically and can add their own.
Changes to a parent content type cascade down to children. If someone adds a "Department" column to the parent, every child content type picks it up. This inheritance means that extracting metadata from a single library might require understanding the content type hierarchy to know which columns originated where.
A single document library can have multiple content types enabled. One library might hold contracts, invoices, and proposals, each with a different metadata schema sharing the same storage location. Your extraction logic needs to handle items with different sets of populated fields.
Managed Metadata (Term Store)
Managed metadata columns link to a centralized Term Store, a hierarchical taxonomy shared across the entire SharePoint tenant. Unlike simple choice columns where options are defined per-list, managed metadata terms live in term groups and term sets that can be reused across hundreds of libraries.
Extracting managed metadata values is trickier than reading regular columns. The raw API response for a taxonomy field returns an internal label and a GUID pointing to the term in the Term Store. To get the full term path (for example, "Legal > Agreements > Contracts"), you need a second query against the Term Store API or the hidden TaxCatchAll column, which stores a denormalized copy of the label.
File Properties vs. Library Columns
SharePoint also stores metadata at the file level through document properties. For Office documents, this includes Author, Title, Subject, and Keywords from the Open XML specification. These file-level properties are separate from library column values. A document might have "John Smith" as the file Author property but "Jane Doe" in the library's "Document Owner" column.
Most business workflows care about library columns and managed metadata. File-level properties matter mainly for migration scenarios where you need to preserve embedded document metadata alongside the column values.
Extracting Metadata with the Microsoft Graph API
The Microsoft Graph API is the modern, supported way to interact with SharePoint Online programmatically. It provides REST endpoints for reading site information, list items, drive items, and their associated metadata.
Authentication Setup
Before making API calls, register an application in Microsoft Entra ID (formerly Azure Active Directory):
- Go to the Azure portal and navigate to App registrations
- Create a new registration with a descriptive name like "SharePoint Metadata Extractor"
- Under API permissions, add Microsoft Graph with
Sites.Read.Allfor broad read access, orSites.Selectedfor access to specific sites only - Generate a client secret or configure certificate authentication
- Note your tenant ID, client ID, and secret for use in API calls
For automated extraction pipelines, use application permissions with a client credential flow to avoid interactive sign-in.
Querying Document Library Items
SharePoint document libraries are accessible as lists through the Graph API. To retrieve all items with their metadata from a specific library:
GET https://graph.microsoft.com/v1.0/sites/{site-id}/lists/{list-id}/items?expand=fields
The expand=fields parameter is critical. Without it, the response returns only system properties like ID and web URL. With it, you get every custom column value for each document.
For large libraries, the API returns pages of 200 items by default. Follow the @odata.nextLink URL in each response to retrieve subsequent pages:
import requests
def get_all_items(site_id, list_id, token):
url = (
f"https://graph.microsoft.com/v1.0"
f"/sites/{site_id}/lists/{list_id}/items"
f"?expand=fields&$top=200"
)
headers = {"Authorization": f"Bearer {token}"}
all_items = []
while url:
response = requests.get(url, headers=headers)
data = response.json()
all_items.extend(data.get("value", []))
url = data.get("@odata.nextLink")
return all_items
Each item in the response includes a fields object with key-value pairs for every library column. Standard columns like Title, Created, and Modified appear alongside custom columns you have defined.
Handling Managed Metadata Fields
Managed metadata columns return a different structure than simple text or choice fields. The Graph API represents taxonomy values as objects with a Label and TermGuid:
{
"fields": {
"DocumentType": {
"Label": "Contract",
"TermGuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}
}
}
If you only need the display label, read it directly from the response. If you need the full term path, query the Term Store separately using the taxonomy REST API at /_api/v2.1/termStore/sets/{setId}/terms/{termId}.
You can also include the hidden TaxCatchAll column in your field selection. This column stores a flattened list of all taxonomy values applied to the item, which simplifies extraction when you have multiple managed metadata columns on a single library.
Selecting Specific Fields
For libraries with many columns, reduce response size by selecting only the fields you need:
GET /sites/{site-id}/lists/{list-id}/items
?expand=fields($select=Title,DocumentType,ExpirationDate,Status)
&$top=200
This matters at scale. Pulling all fields for 100,000 documents generates more data than pulling just the five columns your downstream system actually needs. Selective field expansion also reduces the chance of hitting Graph API throttling thresholds.
Stop Writing Metadata Extraction Scripts
Fast.io Metadata Views extract structured data from documents using AI. Describe the fields you want in plain English, and the platform populates a queryable database from your files. Free tier includes 50 GB storage and 5,000 monthly credits.
Bulk Export with PnP PowerShell
PnP PowerShell is the most popular tool for SharePoint administration scripts, and it handles bulk metadata export well. It wraps the SharePoint REST and CSOM APIs behind PowerShell cmdlets that manage authentication, pagination, and throttling automatically.
Installation and Connection Install the PnP
PowerShell module and connect to your SharePoint site:
Install-Module PnP.PowerShell -Scope CurrentUser
Connect-PnPOnline -Url "https://contoso.sharepoint.com/sites/projects" -Interactive
The -Interactive flag opens a browser for authentication. For automated scripts, use -ClientId and -ClientSecret with an app registration, or -ManagedIdentity when running in Azure.
Exporting Library Metadata to CSV
A basic export script retrieves all items from a document library and writes selected fields to a CSV file:
$items = Get-PnPListItem -List "Documents" -PageSize 500 `
-Fields "Title","FileRef","DocumentType","ExpirationDate","Modified"
$results = foreach ($item in $items) {
[PSCustomObject]@{
FileName = $item.FieldValues["FileLeafRef"]
Path = $item.FieldValues["FileRef"]
DocumentType = $item.FieldValues["DocumentType"].Label
ExpirationDate = $item.FieldValues["ExpirationDate"]
Modified = $item.FieldValues["Modified"]
}
}
$results | Export-Csv -Path "metadata-export.csv" -NoTypeInformation
The -PageSize 500 parameter requests 500 items per batch, reducing API round trips. For managed metadata fields, access the .Label property to get the human-readable term name instead of the internal GUID.
Handling Large Libraries
SharePoint
Online throttles requests that consume too many resources. For libraries with more than 5,000 items, the list view threshold can block unindexed queries.
PnP PowerShell handles pagination automatically through Get-PnPListItem, but you can improve performance for large libraries by adding a CAML query that filters on indexed columns:
$caml = @"
<View Scope='RecursiveAll'>
<Query>
<Where>
<Geq>
<FieldRef Name='Modified'/>
<Value Type='DateTime'>2025-01-01T00:00:00Z</Value>
</Geq>
</Where>
</Query>
</View>
"@
$items = Get-PnPListItem -List "Documents" -Query $caml -PageSize 500
For libraries exceeding 100,000 items, consider the SharePoint Asynchronous Metadata Read (AMR) API. It is designed for large-scale reads and avoids the same throttling limits, though setup is more involved and the output format requires additional parsing.
Exporting Version History
If you need metadata from previous document versions, loop through each item's version history:
foreach ($item in $items) {
$versions = Get-PnPListItemVersion -List "Documents" -Identity $item.Id
foreach ($version in $versions) {
### Access $version.FieldValues for historical metadata snapshots
}
}
Version history export is slow. Each version requires a separate API call, so a library with 10,000 documents averaging five versions each means 50,000 additional requests. Plan for overnight runs or batch by date range to stay within throttling limits.
Low-Code Extraction with Power Automate
Not every metadata extraction project requires code. Power Automate includes built-in SharePoint connectors that can extract document properties without writing scripts.
The "Get file metadata" action retrieves standard and custom column values for a single document. The "Get files (properties only)" action returns metadata for all files in a library, though it pages at 256 items by default and requires loop logic for larger collections.
A typical extraction flow follows this pattern:
- Trigger on a schedule (daily, weekly) or a manual button press
- Run the "Get files (properties only)" action against the target document library
- Create a CSV table from the results array using the "Create CSV table" action
- Save the CSV to OneDrive, email it, or push rows to an external system via an HTTP connector
Power Automate works well for recurring exports from small to medium libraries, roughly under 5,000 items. For larger collections, the connector's pagination and throttling limits make it slower than PnP PowerShell, and complex managed metadata fields sometimes return internal IDs rather than readable labels.
The main advantage is accessibility. Business users who manage document libraries can build and maintain their own extraction flows without involving IT or writing code. For one-off exports or small recurring reports, it is often the fast path from idea to working solution.
Limitations to Consider
Power
Automate's SharePoint connector does not expose the Term Store API directly. Extracting the full taxonomy hierarchy for managed metadata columns requires additional HTTP actions calling the SharePoint REST API. At that point, you are writing API calls inside a visual tool, which often ends up more complex than an equivalent PowerShell script. If your extraction requirements include managed metadata hierarchies or cross-site collection queries, scripted approaches give you more control.
Storing and Querying Extracted Metadata
Once metadata is out of SharePoint, the next question is where to put it. The right destination depends on what you plan to do with the data.
CSV and Excel
The simplest option. CSV works for one-time exports, audits, and small-scale analysis. The downside is that CSV files are static snapshots that go stale as soon as someone updates a document in SharePoint, and they offer no querying capability beyond Excel's filter and sort.
SQL Database
For recurring extraction, write metadata to a SQL database like Azure SQL, PostgreSQL, or SQLite for local use. This gives you proper indexing, joins across multiple library exports, and the ability to track changes over time by comparing current and previous extraction runs. Most enterprise BI tools connect natively to SQL, making it a good fit for reporting workflows.
S3 or Azure Blob Storage
For data lake architectures, export metadata as JSON or Parquet files to cloud object storage. This works well when the extracted metadata feeds into ETL pipelines, machine learning workflows, or analytics platforms that already read from a data lake.
Cloud Storage with AI-Powered Extraction
If your goal is to extract metadata from documents that have already left SharePoint, or if you want richer extraction beyond what SharePoint columns provide, consider a platform with built-in document intelligence.
Fast.io's Metadata Views take a different approach to the extraction problem. Instead of writing scripts to read predefined columns, you describe the fields you want in plain English. The AI designs a typed schema with field types like Text, Integer, Date, Boolean, and URL, then scans your documents and populates a sortable, filterable grid.
This is particularly useful when migrating documents out of SharePoint and you want to capture information that was never stored in library columns. Contracts might have expiration dates buried in the document body that no one added to a metadata column. Invoices might contain line items that exist only in the PDF, not in SharePoint's fields. Metadata Views extract from the document content itself, not from a predefined column schema.
The extraction works across PDFs, Word documents, spreadsheets, presentations, scanned pages, and images. You can add new extraction columns at any time without reprocessing existing files. For teams that need structured data from their document collections, whether those documents came from SharePoint or elsewhere, it replaces the write-code-to-read-columns pattern with a describe-what-you-need workflow.
Fast.io also exposes Metadata Views through its MCP server, so agents can create schemas, trigger extraction, and query results programmatically. The free tier includes 50 GB storage and 5,000 credits per month with no credit card required, which provides enough room to test extraction workflows against a real document set before committing to a migration.
Frequently Asked Questions
How do I extract metadata from SharePoint documents?
You have three main options. The Microsoft Graph API gives you programmatic REST access to document library columns, managed metadata, and version history. PnP PowerShell wraps those APIs into cmdlets that handle pagination and throttling for bulk export to CSV. Power Automate offers a no-code approach using built-in SharePoint connectors for smaller libraries. Choose based on your library size, technical skills, and whether the extraction is a one-time job or a recurring pipeline.
Can you bulk export SharePoint document properties?
Yes. PnP PowerShell is the most common tool for bulk export. Use the Get-PnPListItem cmdlet with a -PageSize parameter to retrieve items in batches, then pipe the results to Export-Csv. For libraries over 100,000 items, the SharePoint Asynchronous Metadata Read (AMR) API is better suited for large-scale reads. Power Automate can also export in bulk, but its pagination limits make it slower for large collections.
What API does SharePoint use for metadata?
SharePoint Online supports three API layers. The Microsoft Graph API is the modern, recommended choice for new projects. It exposes document libraries as list resources with full field metadata. The SharePoint REST API provides endpoints at /_api/ for operations the Graph API does not cover yet, including direct Term Store access. The older Client-Side Object Model (CSOM) is still supported but Microsoft encourages migration to Graph for new development.
How do you extract SharePoint managed metadata programmatically?
Through the Graph API, managed metadata columns return an object with a Label (the display name) and a TermGuid (the term identifier). For the full taxonomy path, query the Term Store REST API at /_api/v2.1/termStore/sets/{setId}/terms/{termId}. Alternatively, include the hidden TaxCatchAll column in your field selection, which stores a denormalized copy of all taxonomy values for each list item. PnP PowerShell exposes the same data through the .Label property on managed metadata field values.
What is the difference between library columns and file properties in SharePoint?
Library columns are metadata fields defined on the SharePoint list that backs a document library. They include custom columns, content type fields, and managed metadata terms. File properties are metadata embedded inside the document file itself, like Author, Title, and Subject in Office documents. These two layers are independent. A file's embedded Author property can differ from the library's Document Owner column. Most business extraction workflows focus on library columns because they carry the structured data that organizations define and manage.
Related Resources
Stop Writing Metadata Extraction Scripts
Fast.io Metadata Views extract structured data from documents using AI. Describe the fields you want in plain English, and the platform populates a queryable database from your files. Free tier includes 50 GB storage and 5,000 monthly credits.