How to Extract Metadata from Google Workspace Files via API
Google Workspace files live entirely in the cloud, so there is no local file to parse with traditional metadata tools. This guide shows you how to use Google's Drive, Docs, Sheets, and Slides APIs to retrieve file properties, revision history, permissions, custom metadata, and document-specific structures programmatically.
Why Google Workspace Metadata Requires an API Approach
Traditional metadata extraction tools like ExifTool, Apache Tika, or python-docx work by reading binary file headers and embedded XML from local files. Google Docs, Sheets, and Slides break that model entirely. These documents exist as cloud-native objects with no persistent local file format. When you export a Google Doc as .docx, you get a snapshot, but the export strips revision history, permission data, comment threads, and custom properties that only exist on Google's servers.
The Google Drive API v3 treats the entire files resource as metadata. A single files.get call can return over 40 fields, including ownership, sharing state, storage quota usage, content hashes, and thumbnail links. Beyond the Drive layer, each Workspace app exposes its own API with document-specific metadata: the Docs API gives you heading structure and named ranges, the Sheets API exposes developer metadata and cell-level properties, and the Slides API returns page layouts, speaker notes, and revision IDs.
This layered architecture means a complete metadata extraction pipeline for Google Workspace needs to call at least two APIs per file: Drive for universal file metadata, and the app-specific API for structural metadata.
Google Workspace serves over 3 billion monthly active users, with Google Drive alone accounting for roughly 2 billion of them. If your organization runs on Workspace, metadata extraction through these APIs is the only reliable path to audit trails, compliance reporting, and automated classification.
Set Up API Access and Credentials
Before you can pull any metadata, you need a Google Cloud project with the right APIs enabled and credentials configured.
Create a Google Cloud Project
Go to the Google Cloud Console, create a new project (or select an existing one), and enable these four APIs from the API Library:
- Google Drive API
- Google Docs API
- Google Sheets API
- Google Slides API
Choose an Authentication Method
Google offers two main credential types for API access:
OAuth 2.0 client credentials work when your application acts on behalf of a user. The user grants consent through a browser flow, and your app receives an access token scoped to the APIs and permissions they approved. Use this when you need to access files the user owns or files that have been shared with them.
Service account credentials work for server-to-server automation. A service account is a bot identity with its own email address. If your organization uses Google Workspace with domain-wide delegation, a service account can impersonate any user in the domain and access their files without individual consent flows. This is the right choice for bulk metadata extraction across an entire organization.
Install the Client Library
For Python, install the Google API client:
pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib
For Node.js:
npm install googleapis
Initialize the API Clients
Here is a Python example using a service account with domain-wide delegation:
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = [
    "https://www.googleapis.com/auth/drive.readonly",
    "https://www.googleapis.com/auth/documents.readonly",
    "https://www.googleapis.com/auth/spreadsheets.readonly",
    "https://www.googleapis.com/auth/presentations.readonly",
]
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)
# Impersonate a domain user if needed
delegated = credentials.with_subject("user@yourdomain.com")
drive = build("drive", "v3", credentials=delegated)
docs = build("docs", "v1", credentials=delegated)
sheets = build("sheets", "v4", credentials=delegated)
slides = build("slides", "v1", credentials=delegated)
The drive.readonly scope is sufficient for reading metadata. If you also need to write custom properties (covered later), swap it for drive or drive.file.
Extracting File Metadata with the Drive API
The Drive API is the universal metadata layer for every file in Google Workspace. Regardless of whether a file is a Doc, Sheet, Slide, PDF, or image, the Drive API returns a consistent set of properties.
Core File Metadata
By default, files.get returns only id, name, mimeType, and kind. To get the full picture, you need to specify which fields you want, or use fields=* to request everything.
file_metadata = drive.files().get(
    fileId="YOUR_FILE_ID",
    fields="id,name,mimeType,createdTime,modifiedTime,owners,lastModifyingUser,"
           "size,md5Checksum,sha256Checksum,version,permissions,shared,"
           "sharingUser,viewedByMeTime,trashed,starred,parents,webViewLink"
).execute()
print(f"Name: {file_metadata['name']}")
print(f"Created: {file_metadata['createdTime']}")
print(f"Last modified by: {file_metadata['lastModifyingUser']['displayName']}")
print(f"Shared: {file_metadata['shared']}")
Key fields worth extracting for audit and compliance purposes:
- owners and lastModifyingUser: Identity of who owns the file and who last modified it
- permissions: Full list of who has access, at what level, and whether links are shared publicly
- createdTime, modifiedTime, viewedByMeTime: Timestamps for lifecycle tracking
- md5Checksum and sha256Checksum: Content hashes (only available for binary files, not native Workspace documents)
- version: Monotonically increasing version number
- trashed and explicitlyTrashed: Whether the file is in the trash and how it got there
Listing Files with Metadata
For bulk extraction, files.list returns metadata for many files at once. Combine it with query filters to scope the extraction:
results = drive.files().list(
    q="mimeType='application/vnd.google-apps.document' and modifiedTime > '2026-01-01T00:00:00'",
    fields="nextPageToken, files(id,name,createdTime,modifiedTime,owners,permissions,shared)",
    pageSize=100,
).execute()
for f in results.get("files", []):
    print(f"{f['name']} - Modified: {f['modifiedTime']} - Shared: {f['shared']}")
The q parameter supports filtering by MIME type, date ranges, ownership, shared status, and custom properties. Pagination through nextPageToken is required for large result sets since the API caps each response at 1,000 files.
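The single files.list call above returns only the first page. A minimal sketch of following nextPageToken until the results run out is shown here; iter_files is a hypothetical helper name, and drive is assumed to be the client built earlier.

```python
from typing import Any, Dict, Iterator, Optional


def iter_files(drive: Any, query: str, fields: str, page_size: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield every file matching `query`, one page at a time."""
    page_token: Optional[str] = None
    while True:
        response = drive.files().list(
            q=query,
            fields=f"nextPageToken, files({fields})",
            pageSize=page_size,
            pageToken=page_token,  # None on the first request
        ).execute()
        yield from response.get("files", [])
        page_token = response.get("nextPageToken")
        if not page_token:
            break  # no further pages
```

Because the helper is a generator, a caller can stop early without fetching the remaining pages, for example `for f in iter_files(drive, "trashed=false", "id,name"): ...`.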
Revision History
The Revisions API returns the edit history for any file:
revisions = drive.revisions().list(
    fileId="YOUR_FILE_ID",
    fields="revisions(id,modifiedTime,lastModifyingUser,size,keepForever)"
).execute()
for rev in revisions.get("revisions", []):
    user = rev.get("lastModifyingUser", {}).get("displayName", "Unknown")
    print(f"Revision {rev['id']} by {user} at {rev['modifiedTime']}")
One important caveat: Google merges consecutive edits by the same user into single revisions to save storage. If three people edit a Doc simultaneously, the API may attribute the revision to only the first editor who was active in that session. Revision data is useful for audit trails, but it is not a complete edit-by-edit changelog.
Custom File Properties
The Drive API supports two types of custom key-value metadata:
- properties: Visible to any app or user with file access. Limited to 30 per file.
- appProperties: Private to the application that created them. Other apps cannot read them. Limited to 30 per file for each app.
A file can hold at most 100 custom properties across both types, and each property is limited to 124 bytes for the key and value combined (UTF-8 encoded).
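These limits can be checked client-side before calling files.update. The sketch below is illustrative only: property_problems is a hypothetical helper, and it assumes the 124-byte limit applies to the key and value combined in UTF-8, with at most 30 public properties per file.

```python
from typing import Dict, List

MAX_PROPERTY_BYTES = 124     # assumed: key + value combined, UTF-8 encoded
MAX_PUBLIC_PROPERTIES = 30   # assumed: public properties per file


def property_problems(properties: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the dict looks safe to write."""
    problems = []
    if len(properties) > MAX_PUBLIC_PROPERTIES:
        problems.append(
            f"{len(properties)} properties exceeds the limit of {MAX_PUBLIC_PROPERTIES}"
        )
    for key, value in properties.items():
        size = len(key.encode("utf-8")) + len(value.encode("utf-8"))
        if size > MAX_PROPERTY_BYTES:
            problems.append(f"'{key}' uses {size} bytes (limit {MAX_PROPERTY_BYTES})")
    return problems
```

Measuring encoded bytes rather than characters matters for non-ASCII values, where one character can take several bytes.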
# Write custom properties
drive.files().update(
    fileId="YOUR_FILE_ID",
    body={"properties": {"department": "legal", "classification": "confidential"}},
    fields="id,properties"
).execute()
# Read custom properties
meta = drive.files().get(
    fileId="YOUR_FILE_ID",
    fields="properties,appProperties"
).execute()
print(meta.get("properties"))
Custom properties are searchable with files.list queries, making them useful for building your own tagging and classification system on top of Google Drive.
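As a sketch of what such a search can look like, the helper below builds a q string using the Drive query syntax for property lookups. The function name property_query is hypothetical, and the escaping rules (backslash before single quotes and backslashes) are an assumption worth checking against the current Drive search documentation.

```python
def property_query(key: str, value: str, private: bool = False) -> str:
    """Build a files.list `q` clause that matches one custom property."""
    field = "appProperties" if private else "properties"

    def esc(text: str) -> str:
        # Escape backslashes first, then single quotes, for use inside '...'
        return text.replace("\\", "\\\\").replace("'", "\\'")

    return f"{field} has {{ key='{esc(key)}' and value='{esc(value)}' }}"
```

The result can be passed straight to files.list, for example `drive.files().list(q=property_query("department", "legal"), fields="files(id,name)").execute()`.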
Centralize Metadata Extraction Across All Your File Sources
Fast.io pulls files from Google Drive, OneDrive, Box, and Dropbox, then extracts structured metadata with AI. No API credentials to manage, no extraction code to maintain. 50 GB free, no credit card required.
Document-Specific Metadata from the Docs, Sheets, and Slides APIs
The Drive API gives you file-level metadata. The app-specific APIs give you structural metadata that only makes sense in the context of each document type.
Google Docs API
A documents.get call returns the full document structure as JSON, including:
- title: The document title (normally the same as the Drive file name)
- body.content: Every paragraph, table, and list as a structured element tree
- namedRanges: Labeled ranges of content that an application can look up by name for later reads or edits
- revisionId: A revision identifier for optimistic concurrency in update requests
- headers, footers, footnotes: Each returned as a map of content objects
doc = docs.documents().get(documentId="YOUR_DOC_ID").execute()
print(f"Title: {doc['title']}")
print(f"Revision ID: {doc['revisionId']}")
print(f"Named ranges: {list(doc.get('namedRanges', {}).keys())}")
print(f"Total content elements: {len(doc['body']['content'])}")
The structural content is especially valuable for metadata extraction pipelines. You can walk the element tree to count headings, extract all hyperlinks, identify embedded images, or calculate reading time from paragraph word counts.
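A minimal sketch of such a walk is shown below. It operates on the dictionary returned by documents.get and only looks at top-level paragraphs, so text inside tables is ignored. The summarize_doc name and the 200 words-per-minute reading speed are assumptions made for illustration.

```python
from typing import Any, Dict


def summarize_doc(doc: Dict[str, Any], words_per_minute: int = 200) -> Dict[str, Any]:
    """Collect headings, link URLs, and a word count from a Docs API document."""
    headings, links, words = [], [], 0
    for element in doc.get("body", {}).get("content", []):
        paragraph = element.get("paragraph")
        if not paragraph:
            continue  # section breaks, tables, etc. are skipped in this sketch
        text = ""
        for part in paragraph.get("elements", []):
            run = part.get("textRun", {})
            text += run.get("content", "")
            url = run.get("textStyle", {}).get("link", {}).get("url")
            if url:
                links.append(url)
        words += len(text.split())
        style = paragraph.get("paragraphStyle", {}).get("namedStyleType", "")
        if style.startswith("HEADING_"):
            headings.append((style, text.strip()))
    return {
        "headings": headings,
        "links": links,
        "word_count": words,
        "reading_minutes": round(words / words_per_minute, 1),
    }
```

Calling `summarize_doc(doc)` on the response from the earlier documents.get example returns a small dictionary that can be stored alongside the Drive metadata for the same file.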
Google Sheets API
The Sheets API exposes two metadata layers:
Spreadsheet properties include the title, locale, time zone, default format, and a list of all sheets (tabs) with their properties like title, index, row count, column count, and grid properties.
spreadsheet = sheets.spreadsheets().get(
    spreadsheetId="YOUR_SHEET_ID",
    fields="properties,sheets.properties"
).execute()
print(f"Title: {spreadsheet['properties']['title']}")
print(f"Locale: {spreadsheet['properties']['locale']}")
for sheet in spreadsheet['sheets']:
    props = sheet['properties']
    print(f"  Tab: {props['title']} ({props['gridProperties']['rowCount']} rows)")
Developer metadata is a separate system that lets applications attach key-value pairs to specific cells, rows, columns, or sheets. Unlike Drive custom properties (which apply to the file as a whole), developer metadata follows its target as the spreadsheet is edited. If you attach metadata to row 5 and someone inserts a row above it, your metadata moves with it to row 6.
# Search for developer metadata
result = sheets.spreadsheets().developerMetadata().search(
    spreadsheetId="YOUR_SHEET_ID",
    body={"dataFilters": [{"developerMetadataLookup": {"metadataKey": "source_system"}}]}
).execute()
for match in result.get("matchedDeveloperMetadata", []):
    md = match["developerMetadata"]
    print(f"Key: {md['metadataKey']}, Value: {md['metadataValue']}")
Each sheet can hold 30,000 characters of developer metadata, and the spreadsheet itself gets an additional 30,000 characters, so a three-tab spreadsheet supports up to 120,000 characters total.
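For completeness, here is a sketch of how metadata could be attached to a single row so that it follows the row as described earlier. It only builds the request body for spreadsheets.batchUpdate; row_metadata_request is a hypothetical helper, and sending the request needs the spreadsheets scope rather than spreadsheets.readonly.

```python
from typing import Any, Dict


def row_metadata_request(sheet_id: int, row_index: int, key: str, value: str) -> Dict[str, Any]:
    """Build a batchUpdate body that attaches one key-value pair to one row."""
    return {
        "requests": [{
            "createDeveloperMetadata": {
                "developerMetadata": {
                    "metadataKey": key,
                    "metadataValue": value,
                    "visibility": "DOCUMENT",  # readable by any app with access to the file
                    "location": {
                        "dimensionRange": {
                            "sheetId": sheet_id,
                            "dimension": "ROWS",
                            "startIndex": row_index,     # zero-based, inclusive
                            "endIndex": row_index + 1,   # exclusive
                        }
                    },
                }
            }
        }]
    }
```

A possible call, assuming a client built with write access, is `sheets.spreadsheets().batchUpdate(spreadsheetId="YOUR_SHEET_ID", body=row_metadata_request(0, 4, "source_system", "crm")).execute()`, which targets row 5 of the sheet whose ID is 0.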
Google Slides API
The Slides API returns presentation-level metadata through presentations.get:
- title: The presentation title
- pageSize: Slide dimensions
- locale: Language/region setting
- revisionId: For concurrency control
- slides: Array of slide objects, each with page elements, layouts, and notes
presentation = slides.presentations().get(
    presentationId="YOUR_SLIDE_ID"
).execute()
print(f"Title: {presentation['title']}")
print(f"Slide count: {len(presentation.get('slides', []))}")
print(f"Page size: {presentation['pageSize']}")
for i, slide in enumerate(presentation.get("slides", [])):
    notes = slide.get("slideProperties", {}).get("notesPage", {})
    elements = notes.get("pageElements", [])
    for el in elements:
        shape = el.get("shape", {})
        text = shape.get("text", {})
        if text:
            for content in text.get("textElements", []):
                run = content.get("textRun", {})
                if run.get("content", "").strip():
                    print(f"  Slide {i+1} note: {run['content'].strip()}")
Speaker notes are nested deep in the response object, but they are often the most valuable metadata for search indexing and content auditing.
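The loop above prints text from every shape on the notes page. A narrower sketch, assuming the notes page carries a notesProperties.speakerNotesObjectId that identifies the shape holding the notes text, is shown below; speaker_notes is a hypothetical helper name.

```python
from typing import Any, Dict


def speaker_notes(slide: Dict[str, Any]) -> str:
    """Return the speaker notes text for one slide, or an empty string if none."""
    notes_page = slide.get("slideProperties", {}).get("notesPage", {})
    target_id = notes_page.get("notesProperties", {}).get("speakerNotesObjectId")
    if not target_id:
        return ""
    for element in notes_page.get("pageElements", []):
        if element.get("objectId") != target_id:
            continue  # skip other shapes on the notes page
        runs = element.get("shape", {}).get("text", {}).get("textElements", [])
        return "".join(r.get("textRun", {}).get("content", "") for r in runs).strip()
    return ""
```

Returning one string per slide makes the notes easy to index, for example `{i + 1: speaker_notes(s) for i, s in enumerate(presentation.get("slides", []))}`.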
Building a Metadata Extraction Pipeline
Individual API calls work for spot checks. For production use, you need a pipeline that handles pagination, rate limits, error recovery, and structured output.
Pipeline Architecture
A practical extraction pipeline for Google Workspace looks like this:
- Enumerate files with files.list, filtered by MIME type, folder, or modification date
- Fetch Drive metadata for each file with files.get, including permissions and properties
- Fetch revision history with revisions.list for audit trail requirements
- Fetch app-specific metadata by routing each file to the right API based on its MIME type
- Store results in a structured format (JSON, database, or spreadsheet) for querying
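The routing step in that list can be sketched as a lookup table keyed by MIME type. This is one possible shape, not a prescribed design: fetch_app_metadata and the services dictionary (holding the docs, sheets, and slides clients built earlier) are illustrative names.

```python
from typing import Any, Callable, Dict, Optional

# Maps a Workspace MIME type to a function that builds the matching API request.
REQUEST_BUILDERS: Dict[str, Callable[[Dict[str, Any], str], Any]] = {
    "application/vnd.google-apps.document":
        lambda svc, fid: svc["docs"].documents().get(documentId=fid),
    "application/vnd.google-apps.spreadsheet":
        lambda svc, fid: svc["sheets"].spreadsheets().get(
            spreadsheetId=fid, fields="properties,sheets.properties"),
    "application/vnd.google-apps.presentation":
        lambda svc, fid: svc["slides"].presentations().get(presentationId=fid),
}


def fetch_app_metadata(services: Dict[str, Any], file: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Fetch structural metadata for a Drive file, or None for non-Workspace types."""
    builder = REQUEST_BUILDERS.get(file.get("mimeType", ""))
    if builder is None:
        return None  # PDFs, images, etc. only have Drive-level metadata
    return builder(services, file["id"]).execute()
```

Keeping the mapping in a dictionary means a new file type can be supported by adding one entry rather than another branch.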
Handling Rate Limits
Google enforces per-user and per-project rate limits. The Drive API defaults to 12,000 queries per minute per project. If you are impersonating multiple users with a service account, each delegated user counts separately.
Implement exponential backoff for 429 (rate limit) and 500/503 (server error) responses:
import time
from googleapiclient.errors import HttpError

def safe_get(service_call, retries=5):
    """Execute a request, retrying with exponential backoff on transient errors."""
    for attempt in range(retries):
        try:
            return service_call.execute()
        except HttpError as e:
            if e.resp.status in (429, 500, 503):
                # Wait 1, 2, 4, 8, ... seconds before the next attempt
                wait = 2 ** attempt
                time.sleep(wait)
            else:
                raise
    raise Exception("Max retries exceeded")
Batch Requests
The Drive API supports batch requests that pack up to 100 individual calls into a single HTTP request. This reduces overhead when you need metadata for many files:
results = {}

def callback(request_id, response, exception):
    # Called once per sub-request: record either the response or the error
    if exception:
        results[request_id] = {"error": str(exception)}
    else:
        results[request_id] = response

# file_ids is assumed to be a list of Drive file IDs collected earlier, for example from files.list
batch = drive.new_batch_http_request(callback=callback)
for file_id in file_ids[:100]:  # a single batch holds at most 100 calls
    batch.add(
        drive.files().get(fileId=file_id, fields="id,name,modifiedTime,owners,permissions"),
        request_id=file_id
    )
batch.execute()
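The example above handles only the first 100 IDs. A small sketch for splitting a longer list into batch-sized groups follows; chunked is a hypothetical helper, not part of the client library.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Each group can then be sent as its own batch, for example by looping `for group in chunked(file_ids):` and building one batch request per group.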
Storing and Querying Extracted Metadata
For small-scale extraction, JSON files work fine. For ongoing pipelines, push metadata into a database or a purpose-built extraction tool.
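As one illustration of the database option, the sketch below writes Drive file records into SQLite using only the Python standard library. The table layout and the store_metadata name are assumptions chosen for the example; a few commonly queried fields get their own columns and the full record is kept as JSON.

```python
import json
import sqlite3
from typing import Any, Dict, Iterable


def store_metadata(conn: sqlite3.Connection, records: Iterable[Dict[str, Any]]) -> None:
    """Upsert Drive file records into a file_metadata table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_metadata ("
        "id TEXT PRIMARY KEY, name TEXT, mime_type TEXT, "
        "modified_time TEXT, shared INTEGER, raw_json TEXT)"
    )
    rows = [
        (r["id"], r.get("name"), r.get("mimeType"), r.get("modifiedTime"),
         int(bool(r.get("shared"))), json.dumps(r))
        for r in records
    ]
    # INSERT OR REPLACE keeps one row per file id across repeated runs
    conn.executemany("INSERT OR REPLACE INTO file_metadata VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
```

After a run, a query such as `SELECT name FROM file_metadata WHERE shared = 1` lists every shared file without another API call.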
Google Drive's custom properties support basic tagging, but they are limited to 124 bytes per key-value pair and 100 entries per file. When you need richer metadata (extracted summaries, classification labels, structured fields from document content), you need a separate system.
Fast.io's Metadata Views take a different approach to this problem. Instead of writing extraction code for each document type, you describe the fields you want in plain English, and AI designs a typed schema with support for Text, Integer, Decimal, Boolean, URL, JSON, and Date/Time fields. The extraction runs across your workspace files automatically, producing a sortable, filterable spreadsheet. You can add new columns without reprocessing existing files, and agents can create views, trigger extraction, and query results through the Fast.io MCP server.
This is especially useful when your pipeline extends beyond Google Workspace. If your team stores contracts, invoices, or media files across Google Drive, local storage, and other cloud services, centralizing them in a Fast.io workspace and using Metadata Views to extract structured data gives you a single queryable interface across all document types and sources.
Common Patterns and Troubleshooting
Exported Files Lose Metadata
When you export a Google Doc as .docx or PDF, the exported file contains a subset of the original metadata. Revision history, permissions, comments, and custom properties are all stripped during export. Always extract metadata through the API before exporting if you need complete records.
Permission Metadata for Compliance Audits
The permissions field from the Drive API is critical for compliance auditing, but it requires a Drive scope that covers sharing data, such as drive.readonly or drive. Each permission entry includes the role (owner, writer, commenter, reader) and the type (user, group, domain, anyone). For items in shared drives, the permissionDetails list also records whether a grant was inherited from a parent folder.
perms = drive.permissions().list(
    fileId="YOUR_FILE_ID",
    fields="permissions(id,role,type,emailAddress,displayName,permissionDetails)",
    supportsAllDrives=True
).execute()
for p in perms.get("permissions", []):
    # permissionDetails is usually absent for My Drive files, so those report "direct"
    details = p.get("permissionDetails", [])
    inherited = "inherited" if any(d.get("inherited") for d in details) else "direct"
    print(f"{p.get('emailAddress', 'anyone')} - {p['role']} ({inherited})")
Watch for files with type: anyone permissions. These are publicly accessible and should be flagged in any security audit.
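A minimal sketch of that check, working on the permissions list returned by the Drive API, could look like this. The public_permissions name and the option to flag additional types are illustrative choices.

```python
from typing import Any, Dict, List, Sequence


def public_permissions(
    permissions: List[Dict[str, Any]],
    flag_types: Sequence[str] = ("anyone",),
) -> List[Dict[str, Any]]:
    """Return the permission entries whose type is in `flag_types`."""
    return [p for p in permissions if p.get("type") in flag_types]
```

Passing `flag_types=("anyone", "domain")` widens the audit to files shared with an entire domain, which some organizations also want to review.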
Handling Deleted and Trashed Files
By default, files.list excludes trashed files. If your extraction pipeline needs to audit deleted content (for legal hold or eDiscovery), add trashed=true to your query filter. Files in the trash retain all their metadata until they are permanently purged (typically 30 days after deletion).
MIME Type Reference for Google Workspace
When filtering by file type, use these MIME types:
- Google Docs: application/vnd.google-apps.document
- Google Sheets: application/vnd.google-apps.spreadsheet
- Google Slides: application/vnd.google-apps.presentation
- Google Forms: application/vnd.google-apps.form
- Google Drawings: application/vnd.google-apps.drawing
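These MIME types can be combined with the date and trash filters discussed earlier. The helper below is a sketch that assembles a q string for files.list; build_query and the short keys in MIME_TYPES are hypothetical names.

```python
from typing import Optional

MIME_TYPES = {
    "docs": "application/vnd.google-apps.document",
    "sheets": "application/vnd.google-apps.spreadsheet",
    "slides": "application/vnd.google-apps.presentation",
    "forms": "application/vnd.google-apps.form",
    "drawings": "application/vnd.google-apps.drawing",
}


def build_query(kind: str, modified_after: Optional[str] = None, trashed: bool = False) -> str:
    """Build a files.list query for one Workspace file type."""
    clauses = [
        f"mimeType='{MIME_TYPES[kind]}'",
        f"trashed={'true' if trashed else 'false'}",
    ]
    if modified_after:
        # Timestamp in RFC 3339 form, as in the earlier files.list example
        clauses.append(f"modifiedTime > '{modified_after}'")
    return " and ".join(clauses)
```

For a trash audit of spreadsheets, `build_query("sheets", trashed=True)` produces a query that returns only trashed Sheets files.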
Content Hashes Are Not Available for Native Workspace Files
The md5Checksum and sha256Checksum fields are only populated for binary files uploaded to Drive (PDFs, images, .docx files). Native Google Workspace documents do not have content hashes because they are not stored as static files. If you need to detect content changes, compare the version field or track modifiedTime instead.
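A sketch of version-based change detection between two pipeline runs is shown below. It assumes the version field arrives as a numeric string and that the previous run's versions were saved by file ID; changed_files is a hypothetical helper. Since the version number can also advance on changes other than content edits, treat a match as "possibly changed" rather than proof of a content edit.

```python
from typing import Any, Dict, List


def changed_files(previous: Dict[str, str], current: List[Dict[str, Any]]) -> List[str]:
    """Return IDs of files that are new or whose version increased since the last run."""
    changed = []
    for f in current:
        old_version = previous.get(f["id"])
        if old_version is None or int(f["version"]) > int(old_version):
            changed.append(f["id"])
    return changed
```

Only the files returned here need their app-specific metadata refetched, which can cut API usage on repeat runs.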
Alternatives for Non-API Workflows
If setting up API credentials is more overhead than your use case warrants, Google Apps Script offers a lighter-weight option. Apps Script runs inside the Workspace environment with built-in authentication and can access Drive, Docs, Sheets, and Slides data through simplified wrapper objects. The tradeoff is that Apps Script caps how long a single execution can run (Google's published quotas list 6 minutes per execution) and gives you less control over error handling.
For organizations that need metadata extraction across file types beyond Google Workspace, platforms like Fast.io consolidate files from multiple cloud sources and apply AI-powered extraction in one place. Fast.io's URL Import feature can pull files directly from Google Drive, OneDrive, Box, and Dropbox without downloading them locally, and Metadata Views extract structured data from those files regardless of format.
Frequently Asked Questions
How do I get metadata from a Google Doc?
Use the Google Drive API's files.get method with a fields parameter specifying the metadata you need (name, createdTime, modifiedTime, owners, permissions, etc.). For document-specific metadata like heading structure, named ranges, and revision IDs, also call the Google Docs API's documents.get method with the document ID.
What metadata does Google Drive store about files?
Google Drive stores over 40 metadata fields per file, including name, MIME type, creation and modification timestamps, file size, content hashes (for binary files), owner and last modifier identity, sharing permissions, parent folder references, starred/trashed status, version number, thumbnail links, and custom key-value properties. Native Workspace files also have revision history accessible through the Revisions API.
Can you extract revision history from Google Sheets via API?
Yes. Use the Drive API's revisions.list method with the spreadsheet's file ID. Each revision includes the modifier's identity, timestamp, and file size. Note that Google merges consecutive edits by the same user into single revisions, so the history may not reflect every individual edit. For Sheets-specific developer metadata, use the Sheets API's developerMetadata.search endpoint.
How do I get file properties from the Google Drive API?
Call files.get with the fields parameter set to include "properties" and "appProperties." Properties are public key-value pairs visible to any app with file access. AppProperties are private to the app that created them. You can also set and update these properties with the files.update method, and search for files by property values using the q parameter in files.list.
What is the difference between Drive API metadata and Docs API metadata?
The Drive API returns universal file metadata that applies to any file type, like ownership, permissions, timestamps, and sharing state. The Docs API (and Sheets/Slides APIs) returns structural metadata specific to the document type, like heading hierarchy, named ranges, sheet tabs, cell properties, slide layouts, and speaker notes. A complete metadata extraction pipeline uses both layers.
Can I extract metadata from Google Workspace files without API credentials?
Google Apps Script provides a lighter alternative that runs inside the Workspace environment with built-in authentication. It can access Drive, Docs, Sheets, and Slides data through simplified wrapper objects, but it caps how long a single execution can run (Google's published quotas list 6 minutes per execution). For anything beyond basic scripts, the REST APIs give you more control.