How to Extract Metadata for Content Migration Projects
Content migration metadata extraction is the process of pulling structured properties like titles, tags, categories, authors, dates, and permissions from a source system and mapping them to the target platform's schema so that content retains its organization and discoverability after migration. This guide covers auditing source metadata, building field mapping documents, handling schema mismatches, and validating completeness after migration.
Why Metadata Gets Lost During Content Migrations
Content migration projects fail at a startling rate. According to Gartner research, 83% of data migration projects either fail outright or exceed their budgets and timelines. The most common reason is not that files go missing. It is that the metadata surrounding those files (titles, categories, author attributions, publication dates, SEO properties, and permission settings) gets silently dropped or corrupted during the transfer.
Every CMS stores metadata differently. WordPress uses a key-value wp_postmeta table. SharePoint uses managed metadata term stores. Drupal uses entity fields. When you move content between these systems, there is rarely a 1:1 correspondence between metadata schemas. A "category" in one system might map to a "taxonomy term" in another, while custom fields often have no equivalent at all.
The result: content technically arrives in the new system, but the organizational layer that makes it findable, filterable, and useful is either missing or wrong. Search breaks. Internal links rot. SEO rankings drop because meta titles and descriptions vanished. Permissions reset to defaults, exposing sensitive documents.
The fix is treating metadata extraction as the first step in any migration, not an afterthought. Before you move a single file, you need a complete inventory of what metadata exists, where it lives, and how it maps to the target system. Fast.io's workspaces give migration teams a staging environment with collaboration and AI-powered metadata extraction built in, so the audit, mapping, and validation steps share one source of truth.
Step 1: Audit Your Source Metadata
Start by documenting every metadata field in your source system. This is more involved than it sounds because metadata lives in several layers.
System metadata includes fields the platform manages automatically: creation date, modification date, file size, MIME type, version history, and access permissions. These fields exist for every piece of content regardless of type.
Content-type metadata covers the fields defined by your content model: title, slug, author, publish date, status (draft, published, archived), category, tags, and featured image. In a CMS like WordPress, these are post fields. In SharePoint, they are column values in document libraries.
Custom metadata is anything your team added beyond the platform defaults. Think custom fields for regulatory classification, project codes, client names, review status, or workflow stage. These are the fields most likely to get lost because they have no standard equivalent in other systems.
SEO metadata includes meta titles, meta descriptions, canonical URLs, Open Graph tags, and structured data markup. Losing these during migration directly impacts search rankings.
Relationship metadata tracks how content connects: parent-child relationships, related content links, menu positions, redirect rules, and cross-references between documents.
To build your audit, export a sample of content from your source system and examine every field. Most CMS platforms provide database export tools or API endpoints that expose the full schema. For WordPress, use wp db export or the REST API's /wp/v2/posts?_fields= parameter. For SharePoint, the Asynchronous Metadata Read (AMR) API exports metadata in bulk. For Drupal, the Migrate module's source plugins enumerate every field on every entity type.
Document each field in a spreadsheet with columns for: field name, field type, where it lives in the source system, whether it contains data (many custom fields are defined but empty), and how critical it is for the migration.
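If your source is WordPress, a short script can produce most of this spreadsheet for you. The following is a minimal sketch that samples posts from the public REST API and reports how often each field is actually populated; the site URL is a placeholder, and the empty-value rules are a suggested convention rather than a standard.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Sample posts from a WordPress source and report which fields exist and
# how often they actually hold data.
SOURCE = "https://example.com/wp-json/wp/v2/posts?per_page=100&_embed"  # placeholder URL

with urlopen(SOURCE) as resp:
    posts = json.load(resp)

populated = Counter()
for post in posts:
    for field, value in post.items():
        # Treat None, empty strings, and empty containers as "defined but empty".
        if value not in (None, "", [], {}):
            populated[field] += 1

total = max(len(posts), 1)
print(f"Sampled {len(posts)} posts")
for field, count in sorted(populated.items()):
    print(f"{field:35s} populated in {count}/{len(posts)} ({count / total:.0%})")
```

Fields with low population rates in the sample are early candidates for the "migrate or drop" decision that comes up later in the checklist.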
Step 2: Build a Field Mapping Document
Once you know what metadata exists in the source, map each field to its equivalent in the target system. This is the most important artifact in the entire migration project.
A field mapping document is a table that shows, for every source field, the corresponding target field, any transformation needed, and what happens when no equivalent exists. Here is a simplified example from a WordPress-to-headless-CMS migration:
| Source field | Target field | Transformation |
|---|---|---|
| post_title | title | None (direct 1:1 match) |
| post_date | publishedAt | ISO 8601 format conversion and timezone adjustment |
| post_category | category (reference) | Map term IDs to new UUIDs via a lookup table |
| _yoast_wpseo_title | seo.metaTitle | Strip template variables like %title% and %sep% |
| post_author | author (reference) | Map user IDs to new author entries, which must be created first |
| custom_field_project_code | projectCode (custom field) | None, but the field must be created in the target system first |
| featured_image | heroImage (media reference) | Re-upload the asset, map the new ID, and preserve alt text |
Three scenarios require special attention:
Schema divergence. The source and target use fundamentally different structures for the same concept. WordPress categories are hierarchical taxonomies. A flat-tag system in the target loses that hierarchy. You need to decide: flatten the hierarchy, encode it in the tag name (e.g., "Parent > Child"), or create a custom field to preserve it.
Lossy transformations. Rich text fields in one CMS often contain platform-specific markup. WordPress shortcodes, Drupal render arrays, and SharePoint web parts all need to be stripped or converted to standard HTML or Markdown. Any embedded metadata in that markup (like image captions or gallery configurations) needs separate extraction before the transformation.
Missing target fields. When the target system has no equivalent for a source field, you have three options: create a custom field in the target to hold it, merge it into an existing field (e.g., appending a project code to the description), or document it as intentionally dropped. Never silently drop metadata without recording the decision.
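One way to keep these decisions enforceable is to express the mapping document as data, with an explicit rule for every source field. The sketch below is illustrative only: the field names echo the WordPress example above, and using a target of None to mark an intentionally dropped field is a hypothetical convention, not part of any particular migration framework.

```python
from datetime import datetime, timezone

def to_iso8601(value: str) -> str:
    # MySQL datetime ("2023-04-01 09:30:00") to ISO 8601 in UTC.
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc).isoformat()

def flatten_hierarchy(terms: list) -> str:
    # Encode a hierarchical category path in a single tag name, e.g. "Parent > Child".
    return " > ".join(terms)

# Each entry: source field -> (target field, transform). A target of None means
# "no equivalent exists"; the field is recorded as dropped rather than silently lost.
FIELD_MAP = {
    "post_title": ("title", lambda v: v),
    "post_date": ("publishedAt", to_iso8601),
    "post_category": ("category", flatten_hierarchy),
    "custom_field_project_code": ("projectCode", lambda v: v),
    "legacy_review_notes": (None, None),  # intentionally dropped, decision documented here
}

def map_record(source: dict) -> tuple:
    target, dropped = {}, []
    for field, value in source.items():
        target_field, transform = FIELD_MAP.get(field, (None, None))
        if target_field is None:
            dropped.append(field)
            continue
        target[target_field] = transform(value)
    return target, dropped
```

Unmapped fields end up in the dropped list rather than disappearing, which turns the "never silently drop" rule into something you can assert on.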
Simplify Metadata Extraction for Your Next Migration
Fast.io Metadata Views extract structured fields from documents using AI. Upload files, describe the metadata you need, and get a queryable data grid. Free to start with 50 GB storage and 5,000 monthly credits.
Step 3: Extract and Transform Metadata
With the mapping document complete, build your extraction pipeline. The approach depends on your source system and the volume of content.
API-based extraction is the cleanest option when your source CMS has a well-documented API. Pull content in paginated batches, extract all fields, and write them to an intermediate format like JSON or CSV. This preserves data types and relationships better than database dumps.
For WordPress, the REST API at /wp-json/wp/v2/ exposes posts, pages, media, categories, tags, and users. Include the _embed parameter to pull featured images and author data in a single request. For custom fields, plugins like Advanced Custom Fields expose their data through the API automatically.
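A minimal extraction loop under those assumptions might look like the sketch below. It uses the requests library against a publicly readable site (authenticated setups need credentials added), and relies on WordPress returning a 400 once you page past the last page, plus the X-WP-TotalPages header, as stop conditions.

```python
import json
import requests

BASE = "https://example.com/wp-json/wp/v2/posts"  # placeholder source site
records, page = [], 1

while True:
    resp = requests.get(BASE, params={"per_page": 100, "page": page, "_embed": "1"}, timeout=30)
    if resp.status_code == 400:  # WordPress rejects page numbers past the last page
        break
    resp.raise_for_status()
    batch = resp.json()
    if not batch:
        break
    records.extend(batch)
    total_pages = int(resp.headers.get("X-WP-TotalPages", page))
    if page >= total_pages:
        break
    page += 1

# Write the raw extraction to an intermediate JSON file before any transformation.
with open("wordpress_export.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

print(f"Extracted {len(records)} posts")
```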
For SharePoint, use the Microsoft Graph API or the SharePoint REST API to query document libraries. The $select and $expand parameters let you pull specific metadata columns and related items. For large migrations, the AMR API handles bulk export without throttling issues.
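If Microsoft Graph is your extraction path, the pattern is similar. The sketch below assumes you have already acquired an OAuth access token (for example via MSAL); the site ID, list ID, and column names are placeholders.

```python
import requests

TOKEN = "..."  # placeholder bearer token acquired out of band
URL = (
    "https://graph.microsoft.com/v1.0/sites/{site-id}/lists/{list-id}/items"
    "?expand=fields(select=Title,ProjectCode,ReviewStatus)"
)

items, url = [], URL
while url:
    resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    items.extend(item["fields"] for item in payload["value"])
    url = payload.get("@odata.nextLink")  # Graph paginates large libraries automatically

print(f"Pulled metadata for {len(items)} library items")
```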
For Drupal, the JSON:API module exposes every entity type with full field data. Use include parameters to pull referenced entities (like taxonomy terms and media) in the same request.
Database-level extraction works when API access is limited or when you need fields the API does not expose. Export the database, then query it directly. WordPress stores custom metadata in wp_postmeta as key-value pairs. Drupal uses dedicated field tables like node__field_project_code. SharePoint metadata lives in SQL Server content databases, though direct access is discouraged in cloud environments.
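For a WordPress source, one query against wp_postmeta shows every custom key and how often it actually holds data, which answers the "defined but empty" question from the audit step. The sketch below assumes database credentials and the PyMySQL client; substitute whatever MySQL driver you already use.

```python
import pymysql  # assumes the PyMySQL package is installed and credentials are known

QUERY = """
    SELECT meta_key,
           COUNT(*) AS rows_total,
           SUM(meta_value IS NOT NULL AND meta_value <> '') AS rows_populated
    FROM wp_postmeta
    GROUP BY meta_key
    ORDER BY rows_total DESC
"""

conn = pymysql.connect(host="localhost", user="wp", password="secret", database="wordpress")
try:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for meta_key, total, populated in cur.fetchall():
            print(f"{meta_key:40s} {populated}/{total} populated")
finally:
    conn.close()
```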
File-level metadata extraction handles the properties embedded in the files themselves: EXIF data in images, XMP sidecar data, ID3 tags in audio, and document properties in Office files. These are separate from CMS metadata and need their own extraction step. Tools like ExifTool, Apache Tika, or Python's python-docx library handle this programmatically.
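A thin wrapper around ExifTool covers most file types in one pass. The sketch below assumes exiftool is installed and on the PATH; the directory name and the tags pulled out at the end are placeholders.

```python
import json
import subprocess
from pathlib import Path

def file_metadata(path: Path) -> dict:
    # ExifTool's -json flag returns a JSON array with one object per input file.
    out = subprocess.run(
        ["exiftool", "-json", str(path)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)[0]

for image in Path("source_assets").glob("*.jpg"):
    meta = file_metadata(image)
    # Keep only the embedded properties the mapping document cares about.
    print(image.name, meta.get("CreateDate"), meta.get("Artist"), meta.get("ImageDescription"))
```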
Transformation rules convert extracted values into the format the target system expects. Common transformations include:
- Date format conversion (MySQL datetime to ISO 8601)
- ID remapping (source user IDs to target user UUIDs)
- Taxonomy term resolution (source term IDs to target term slugs)
- HTML cleanup (stripping shortcodes, converting platform-specific markup to standard HTML)
- URL rewriting (updating internal links to reflect new URL structures)
- Character encoding normalization (Latin-1 to UTF-8)
Write transformation scripts that log every change. When a field value gets modified during transformation, record both the original and transformed values. This audit trail is essential for debugging post-migration issues.
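A lightweight way to get that audit trail is to route every change through a single logging wrapper, as in the sketch below. The CSV log format and the encoding-normalization rule are suggested conventions, not requirements.

```python
import csv
import html
import unicodedata

audit_file = open("transform_audit.csv", "w", newline="", encoding="utf-8")
audit = csv.writer(audit_file)
audit.writerow(["record_id", "field", "original", "transformed"])

def transform(record_id, field, value, fn):
    new_value = fn(value)
    if new_value != value:
        # Record both sides of every change so post-migration issues can be traced back.
        audit.writerow([record_id, field, value, new_value])
    return new_value

def normalize_text(value):
    # Normalize to NFC Unicode and unescape stray HTML entities left behind by the source CMS.
    return unicodedata.normalize("NFC", html.unescape(value))

cleaned = transform("post-1842", "post_title", "Caf&eacute; spring menu", normalize_text)
audit_file.close()
```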
Step 4: Validate Metadata Completeness
Migration without validation is just copying files and hoping. Build validation into every stage of the pipeline, not just at the end.
Pre-migration validation checks the source data before extraction begins. Look for null values in required fields, orphaned references (a post referencing a category that no longer exists), duplicate slugs, and encoding issues. Fix these in the source system before migrating them to the target.
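These source checks are straightforward to script against a sample export. The sketch below looks for duplicate slugs and orphaned category references in WordPress data; the file names are placeholders, and the field names follow the REST API's post format.

```python
import json
from collections import Counter

with open("wordpress_export.json", encoding="utf-8") as f:
    posts = json.load(f)
with open("categories_export.json", encoding="utf-8") as f:  # placeholder taxonomy export
    valid_ids = {c["id"] for c in json.load(f)}

# Duplicate slugs collide in most target systems; surface them before migrating.
slugs = Counter(p["slug"] for p in posts)
print("Duplicate slugs:", {s: n for s, n in slugs.items() if n > 1} or "none")

# Orphaned references: posts pointing at category IDs that no longer exist.
orphans = [p["id"] for p in posts if any(cid not in valid_ids for cid in p.get("categories", []))]
print(f"{len(orphans)} posts reference missing categories")
```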
Transform validation checks the intermediate data after extraction and transformation but before loading into the target. Verify that every source record produced a target record. Compare field counts: if WordPress has 2,847 posts with a project_code custom field, the intermediate data should have exactly 2,847 entries with that field populated.
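A count comparison takes only a few lines once the intermediate files exist. The sketch below assumes the raw export and the transformed records have been flattened into simple JSON lists using the field names from the mapping document; the file names and field pairs are placeholders.

```python
import json

with open("wordpress_export.json", encoding="utf-8") as f:
    source = json.load(f)
with open("transformed_records.json", encoding="utf-8") as f:  # placeholder intermediate file
    transformed = json.load(f)

assert len(source) == len(transformed), (
    f"Record count mismatch: {len(source)} source vs {len(transformed)} transformed"
)

def populated(records, field):
    return sum(1 for r in records if r.get(field) not in (None, "", [], {}))

# Field pairs come straight from the mapping document: (source field, target field).
for src_field, dst_field in [("post_title", "title"), ("custom_field_project_code", "projectCode")]:
    src_count, dst_count = populated(source, src_field), populated(transformed, dst_field)
    status = "OK" if src_count == dst_count else "MISMATCH"
    print(f"{src_field} -> {dst_field}: {src_count} vs {dst_count} [{status}]")
```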
Post-migration validation compares the target system against the source to confirm nothing was lost. Run these checks:
- Record counts: Total content items in source vs. target, broken down by content type
- Field completeness: For each metadata field, count populated vs. empty values in both systems
- Relationship integrity: Verify that parent-child relationships, category assignments, and cross-references survived the migration
- SEO preservation: Compare meta titles, descriptions, and canonical URLs between source and target for a random sample
- Permission accuracy: Confirm that access controls transferred correctly, especially for sensitive content
- Search functionality: Run the same search queries in both systems and compare results
Automated validation scripts save significant time on large migrations. Write a script that pulls content from both systems via API, compares field-by-field, and generates a discrepancy report. Focus the report on fields that matter most: anything with a "critical" designation in your field mapping document.
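Here is a minimal version of that script, assuming both systems have been snapshotted to JSON and keyed on a shared slug; the file names and the CRITICAL_FIELDS list are placeholders you would take from your own mapping document.

```python
import csv
import json

CRITICAL_FIELDS = ["title", "publishedAt", "category", "seo.metaTitle", "projectCode"]

def load(path):
    with open(path, encoding="utf-8") as f:
        return {record["slug"]: record for record in json.load(f)}

source, target = load("source_snapshot.json"), load("target_snapshot.json")

with open("discrepancy_report.csv", "w", newline="", encoding="utf-8") as f:
    report = csv.writer(f)
    report.writerow(["slug", "field", "source_value", "target_value"])
    for slug, src in source.items():
        dst = target.get(slug)
        if dst is None:
            report.writerow([slug, "<missing record>", "", ""])
            continue
        for field in CRITICAL_FIELDS:
            if src.get(field) != dst.get(field):
                report.writerow([slug, field, src.get(field), dst.get(field)])
```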
For ongoing migrations (phased rollouts or continuous sync), set up monitoring that alerts when metadata fidelity drops below a threshold. Industry benchmarks for successful migrations target 95% or higher metadata fidelity within the first quarter after cutover.
Using AI-Powered Extraction for Complex Migrations
Traditional extraction scripts work well when source metadata is structured and well-documented. But many real-world migrations involve messy data: inconsistent field usage, metadata buried in unstructured text, scanned documents with no machine-readable properties, or legacy systems where the schema documentation no longer exists.
AI-powered extraction tools can handle scenarios that rule-based scripts cannot. Instead of writing a regex for every possible date format in a legacy field, you describe what you want extracted in plain language and let the model figure out the variations.
Fast.io's Metadata Views takes this approach for document-level extraction. You describe the fields you need, like "effective date," "counterparty name," or "document classification," and AI designs a typed schema with field types (Text, Integer, Decimal, Boolean, URL, JSON, Date & Time). It then scans files in a workspace and populates a sortable, filterable data grid with the extracted values.
This is particularly useful for migration projects that involve unstructured or semi-structured source content. If you are migrating a document library where metadata was never consistently applied, Metadata Views can extract structured properties from the document contents themselves rather than relying on fields that may be empty or inaccurate. PDFs, images, Word documents, spreadsheets, and even scanned pages are all supported.
The extraction results feed into your field mapping pipeline. Export the structured data from a Metadata View, transform it to match your target schema, and import it alongside the files. This fills gaps that would otherwise require manual review of hundreds or thousands of documents.
For teams already using automation in their migration pipeline, Fast.io's MCP server lets agents create Views, trigger extraction, and query results programmatically. An agent can upload a batch of source documents to a workspace, define a Metadata View for the fields needed in the target system, run extraction, validate the results, and export the mapped data, all without manual intervention.
Other options in this space include Apache Tika for file-level metadata parsing, custom scripts using OpenAI or Anthropic APIs for unstructured text extraction, and enterprise tools like Xillio or Informatica for large-scale CMS-to-CMS migrations. The right tool depends on your source system, the volume of content, and how much of your metadata lives in structured fields vs. document contents.
Building a Metadata Migration Checklist
Distilling the process into a reusable checklist keeps teams aligned across migration phases. Adapt this to your specific systems, but the structure applies to most CMS-to-CMS or platform-to-platform migrations.
Before extraction:
- Inventory all metadata fields in the source system (system, content-type, custom, SEO, relationship)
- Identify fields with low population rates (less than 10% filled) and decide whether to migrate or drop them
- Document data types and constraints for each field
- Build the field mapping document with transformation rules
- Set up the intermediate storage format (JSON, CSV, or a staging database)
- Create lookup tables for ID remapping (users, categories, tags, media)
During extraction:
- Extract in batches with checkpointing so you can resume after failures
- Log every transformation with before/after values
- Validate record counts at each stage
- Handle encoding issues immediately rather than deferring them
- Extract file-level metadata (EXIF, document properties) separately from CMS metadata
After loading:
- Run automated field-by-field comparison between source and target
- Verify relationship integrity (parent-child, cross-references, menu positions)
- Test search with representative queries
- Spot-check SEO metadata for a random sample of pages
- Confirm permissions and access controls
- Generate and review the discrepancy report
- Sign off on metadata fidelity before decommissioning the source system
Post-migration monitoring:
- Track search ranking changes for key pages over the following 4-6 weeks
- Monitor 404 error rates for redirected URLs
- Watch for user reports of missing content or broken navigation
- Schedule a 30-day retrospective to document lessons learned
The average enterprise manages content across 3-5 different platforms simultaneously. Each migration between these platforms is an opportunity to lose organizational context, or to improve it. A thorough metadata extraction process preserves the value embedded in your content structure, not just the content itself.
Frequently Asked Questions
How do you preserve metadata during content migration?
Start by auditing every metadata field in the source system, including system fields, custom fields, SEO properties, and relationship data. Build a field mapping document that shows how each source field corresponds to a target field, with transformation rules for schema differences. Extract metadata via API or database export, transform it to the target format, and validate completeness after loading. The key is treating metadata as a first-class migration artifact rather than an afterthought.
What metadata should be extracted before migrating files?
Extract five layers: system metadata (creation date, file size, version history, permissions), content-type metadata (title, author, publish date, status, category), custom metadata (project codes, workflow stages, client names), SEO metadata (meta titles, descriptions, canonical URLs, Open Graph tags), and relationship metadata (parent-child links, cross-references, menu positions). Also extract file-level metadata like EXIF data from images and document properties from Office files.
How do you map metadata fields between different CMS platforms?
Create a field mapping table that lists every source field alongside its target equivalent, the transformation needed, and notes on edge cases. Export sample content from both systems and compare schemas side by side. Pay special attention to schema divergence (hierarchical categories vs. flat tags), lossy transformations (platform-specific markup), and missing target fields. Document every decision, especially fields you intentionally drop.
What tools help with content migration metadata extraction?
CMS APIs are the primary tool for structured metadata extraction: the WordPress REST API, SharePoint Graph API and AMR API, and Drupal JSON:API. For file-level metadata, ExifTool and Apache Tika handle most formats. Enterprise migration platforms like Xillio, Informatica, and Fivetran provide end-to-end extraction pipelines. For unstructured or inconsistently tagged content, AI-powered tools like Fast.io Metadata Views can extract structured fields from document contents without relying on existing metadata.
How long does a content migration metadata extraction take?
Timeline depends on content volume and metadata complexity. A straightforward WordPress-to-headless migration of 5,000 posts with standard fields might take 2-3 weeks for the full audit, mapping, extraction, and validation cycle. Enterprise migrations spanning 100,000+ documents across multiple source systems with custom metadata schemas can take 3-6 months. The audit and mapping phases typically account for 40-50% of the total timeline.
What happens if metadata is lost during migration?
Lost metadata degrades content findability, breaks internal navigation, and can damage SEO rankings. Missing meta titles and descriptions mean Google indexes default or empty values. Lost category assignments make content unfilterable. Permission metadata loss can expose sensitive documents or lock users out of content they need. Rebuilding metadata manually after migration is possible but expensive, often costing more than the original migration budget.