How to Extract Metadata from Jupyter Notebooks (ipynb Files)
Jupyter notebooks store structured metadata at the file, cell, and output levels inside their JSON-based .ipynb format. This guide walks through extracting kernel specs, language info, execution timestamps, cell tags, and custom metadata fields using raw JSON parsing and the nbformat Python library.
What Metadata Lives Inside a Jupyter Notebook
Every .ipynb file is a JSON document. Open one in a text editor and you will see a dictionary with four top-level keys: nbformat, nbformat_minor, metadata, and cells. The notebook format specification (nbformat v4) defines metadata slots at three distinct levels, and each level captures different information about how the notebook was created, configured, and executed.
Notebook-level metadata sits in the top-level metadata dictionary. The most important fields here are:
- kernelspec: the kernel's display name, internal name, and language
- language_info: programming language name, version, file extension, MIME type, and codemirror mode
- authors: a list of author dictionaries (each with a name field)
- Custom keys added by tools like Colab, Papermill, or nbproject
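Cell-level metadata appears inside each cell's metadata dictionary. Standard fields include tags (a list of string labels), name (a unique cell identifier), editable, deletable, and execution timing data under the execution namespace. Code cells also carry an execution_count field, stored on the cell itself rather than in its metadata. It holds the kernel's execution counter value from the last time the cell ran, or null if the cell has never been run, so it reflects execution order rather than how many times that particular cell was executed.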
Output-level metadata lives inside the outputs array of code cells. Each output object can have its own metadata dictionary, with fields like isolated (whether to render in an iframe) and format-specific keys for image dimensions or display priorities.
All metadata fields are optional according to the nbformat spec. No field is required to exist, and applications are free to ignore any metadata they do not recognize. This flexibility means real-world notebooks vary wildly in what metadata they contain. A notebook saved from JupyterLab might include detailed execution timestamps, while one exported from Google Colab might carry Colab-specific accelerator settings and runtime configurations.
The Jupyter project maintains the nbformat JSON schema on GitHub, which defines the complete structure and validation rules for each format version. Notebooks using nbformat v4 (the current standard) support over 15 documented metadata fields across the three levels, plus arbitrary custom keys in any metadata namespace.
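To make the three levels concrete, here is a minimal sketch of a notebook written out as a plain Python dictionary. Every value in it (kernel name, version string, cell id, tag) is invented for illustration and not taken from a real file; the point is only to show where each metadata dictionary sits.

```python
import json

# A hand-written, minimal nbformat v4 document. All values are illustrative.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {  # notebook-level metadata
        "kernelspec": {
            "display_name": "Python 3",
            "name": "python3",
            "language": "python",
        },
        "language_info": {"name": "python", "version": "3.11.4"},
    },
    "cells": [
        {
            "cell_type": "code",
            "id": "a1b2c3d4",
            "execution_count": 1,
            "metadata": {"tags": ["data-load"]},  # cell-level metadata
            "source": "display(fig)",
            "outputs": [
                {
                    "output_type": "display_data",
                    "data": {"text/plain": "<Figure>"},
                    "metadata": {"isolated": True},  # output-level metadata
                }
            ],
        }
    ],
}

# Collect the metadata keys found at each of the three levels.
levels = {
    "notebook": sorted(notebook["metadata"]),
    "cell": sorted(notebook["cells"][0]["metadata"]),
    "output": sorted(notebook["cells"][0]["outputs"][0]["metadata"]),
}
print(json.dumps(levels, indent=2))
```

Note that execution_count sits next to the cell's metadata dictionary, not inside it.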
Extracting Metadata with Raw JSON Parsing
Because .ipynb files are plain JSON, the simplest extraction approach uses Python's built-in json module. No third-party libraries needed.
import json

with open("analysis.ipynb", "r", encoding="utf-8") as f:
    notebook = json.load(f)

# Notebook-level metadata
meta = notebook.get("metadata", {})
print("Format:", notebook["nbformat"], notebook["nbformat_minor"])

# Kernel specification
kernel = meta.get("kernelspec", {})
print("Kernel:", kernel.get("display_name"))
print("Language:", kernel.get("language"))

# Language details
lang = meta.get("language_info", {})
print("Language version:", lang.get("version"))
print("File extension:", lang.get("file_extension"))
print("MIME type:", lang.get("mimetype"))
This gives you the notebook's kernel configuration and language version in a few lines. For cell-level metadata, iterate through the cells array:
for i, cell in enumerate(notebook["cells"]):
    cell_meta = cell.get("metadata", {})
    cell_type = cell["cell_type"]
    tags = cell_meta.get("tags", [])
    print(f"Cell {i}: type={cell_type}, tags={tags}")

    if cell_type == "code":
        exec_count = cell.get("execution_count")
        print(f"  Execution count: {exec_count}")

        # Check for execution timestamps
        execution = cell_meta.get("execution", {})
        if execution:
            start = execution.get("iopub.execute_input")
            end = execution.get("shell.execute_reply")
            print(f"  Started: {start}")
            print(f"  Completed: {end}")
The raw JSON approach works well for quick scripts and environments where you want to avoid installing dependencies. It is also the right choice when you need to extract custom metadata keys that nbformat's API might not expose directly.
One thing to watch for: notebook files can be large. A notebook with inline images or dataframe outputs might be tens of megabytes. If you are processing many notebooks, load them one at a time rather than reading everything into memory at once.
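One way to keep memory flat is a generator that parses a single notebook, keeps only a small record, and moves on. The sketch below assumes a hypothetical helper name, iter_notebook_metadata, and an arbitrary choice of fields; adapt both to your own needs.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_notebook_metadata(root: Path) -> Iterator[dict]:
    """Yield one small metadata record per notebook found under root."""
    for path in sorted(root.rglob("*.ipynb")):
        with open(path, "r", encoding="utf-8") as f:
            notebook = json.load(f)  # the full document is held only here
        yield {
            "path": str(path),
            "kernel": notebook.get("metadata", {}).get("kernelspec", {}).get("name"),
            "cells": len(notebook.get("cells", [])),
        }
        # `notebook` is rebound on the next iteration, so the large parsed
        # document (including any inline images) can be garbage-collected.
```

Iterating with for record in iter_notebook_metadata(Path("./projects")) holds at most one parsed notebook at a time, whereas collecting every parsed document into a list first would keep all of their outputs in memory.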
Using nbformat for Validated Extraction
The nbformat library is Jupyter's official Python package for reading, writing, and validating notebook files. It handles format version differences and provides a cleaner API than raw JSON parsing.
Install it with pip:
pip install nbformat
Then load and extract metadata:
import nbformat

nb = nbformat.read("analysis.ipynb", as_version=4)

# Notebook-level metadata is a dict-like object
print("Kernel:", nb.metadata.get("kernelspec", {}).get("display_name"))
print("Language:", nb.metadata.get("language_info", {}).get("name"))
print("Language version:", nb.metadata.get("language_info", {}).get("version"))

# Authors (if present)
authors = nb.metadata.get("authors", [])
for author in authors:
    print("Author:", author.get("name"))
The as_version=4 parameter tells nbformat to upgrade older notebook formats to v4 before returning the object. This means you can read notebooks saved in format v3 or earlier without writing separate parsing logic.
Validating Notebook Structure
Before extracting metadata from notebooks you did not create, validate them first:
from nbformat import validate, ValidationError

try:
    validate(nb)
    print("Notebook is valid")
except ValidationError as e:
    print(f"Validation failed: {e.message}")
Validation catches structural problems like missing required fields, wrong data types, or malformed cell structures. This is particularly useful when processing notebooks collected from different sources, where format consistency is not guaranteed.
Extracting a Complete Metadata Summary
Here is a function that builds a structured summary of all metadata in a notebook:
import nbformat


def extract_notebook_metadata(path):
    nb = nbformat.read(path, as_version=4)

    summary = {
        "format": f"{nb.nbformat}.{nb.nbformat_minor}",
        "kernel": nb.metadata.get("kernelspec", {}),
        "language": nb.metadata.get("language_info", {}),
        "authors": nb.metadata.get("authors", []),
        "cell_count": len(nb.cells),
        "code_cells": 0,
        "markdown_cells": 0,
        "max_execution_count": 0,
        "all_tags": set(),
        "custom_metadata_keys": [],
    }

    # Anything outside the standard notebook-level keys counts as custom
    known_keys = {"kernelspec", "language_info", "authors"}
    custom = set(nb.metadata.keys()) - known_keys
    summary["custom_metadata_keys"] = list(custom)

    for cell in nb.cells:
        if cell.cell_type == "code":
            summary["code_cells"] += 1
            # execution_count is None for cells that were never run
            ec = cell.get("execution_count") or 0
            summary["max_execution_count"] = max(
                summary["max_execution_count"], ec
            )
        elif cell.cell_type == "markdown":
            summary["markdown_cells"] += 1

        tags = cell.metadata.get("tags", [])
        summary["all_tags"].update(tags)

    # Convert the set to a list so the summary is JSON-serializable
    summary["all_tags"] = list(summary["all_tags"])
    return summary
This function captures the kernel spec, language info, cell counts, execution progress, all cell tags, and any custom metadata keys that tools like Papermill, Colab, or nbproject may have added.
Organize and query your notebook metadata in one place
Upload Jupyter notebooks to a Fast.io workspace, extract structured metadata with AI-powered Views, and search across your entire collection. 50 GB free storage, no credit card required.
Batch Processing Notebooks Across a Repository
Single-file extraction is useful for inspection, but most real-world scenarios involve processing entire directories or repositories of notebooks. Data science teams accumulate hundreds of notebooks, and understanding what kernels, language versions, and dependencies are in use requires scanning them all.
from pathlib import Path

import nbformat


def scan_notebooks(root_dir):
    root = Path(root_dir)
    results = []

    for nb_path in root.rglob("*.ipynb"):
        # Skip checkpoint files
        if ".ipynb_checkpoints" in nb_path.parts:
            continue

        try:
            nb = nbformat.read(str(nb_path), as_version=4)
            kernel = nb.metadata.get("kernelspec", {})
            lang = nb.metadata.get("language_info", {})

            results.append({
                "path": str(nb_path.relative_to(root)),
                "kernel": kernel.get("display_name", "unknown"),
                "language": lang.get("name", "unknown"),
                "version": lang.get("version", "unknown"),
                "cells": len(nb.cells),
                "code_cells": sum(
                    1 for c in nb.cells if c.cell_type == "code"
                ),
                "executed": any(
                    c.get("execution_count") is not None
                    for c in nb.cells
                    if c.cell_type == "code"
                ),
            })
        except Exception as e:
            # Record the failure and keep scanning
            results.append({
                "path": str(nb_path.relative_to(root)),
                "error": str(e),
            })

    return results


# Scan and summarize
notebooks = scan_notebooks("./projects")
kernels = {}
for nb in notebooks:
    if "error" not in nb:
        k = nb["kernel"]
        kernels[k] = kernels.get(k, 0) + 1

print("Kernel distribution:")
for kernel, count in sorted(kernels.items(), key=lambda x: -x[1]):
    print(f"  {kernel}: {count} notebooks")
This script skips Jupyter's checkpoint files (which are duplicates stored in .ipynb_checkpoints), catches malformed notebooks without crashing, and produces a kernel usage distribution across the entire repository.
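If the scan results need to go into a spreadsheet, the standard csv module can flatten them. The sketch below uses two hypothetical rows in the same shape that scan_notebooks() returns, one successful and one failed, and writes to an in-memory buffer so it runs without touching disk. The column order is an arbitrary choice.

```python
import csv
import io

# Hypothetical rows in the shape scan_notebooks() returns.
results = [
    {"path": "eda.ipynb", "kernel": "Python 3", "language": "python",
     "version": "3.11.4", "cells": 12, "code_cells": 9, "executed": True},
    {"path": "broken.ipynb", "error": "Notebook does not appear to be JSON"},
]

# One column per key that can appear in either kind of row.
fieldnames = ["path", "kernel", "language", "version",
              "cells", "code_cells", "executed", "error"]

# Swap the buffer for open("notebooks.csv", "w", newline="") to write a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(results)  # restval fills the columns a row does not have

csv_text = buffer.getvalue()
print(csv_text)
```

Keeping error rows in the same file, with the other columns left blank, makes it easy to filter for notebooks that failed to parse.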
Practical Use Cases for Batch Extraction
Dependency auditing: scan all notebooks to find which Python versions and kernels are in use before a migration. If you are upgrading from Python 3.9 to 3.12, you need to know which notebooks will be affected.
Compliance documentation: regulatory environments often require documenting what code was executed, when, and by whom. Extracting execution counts and author metadata from notebooks creates an audit trail without manual record-keeping.
Stale notebook detection: notebooks with no execution counts or old language versions are candidates for archival. A batch scan flags these automatically.
Tag-based organization: teams that tag cells with labels like data-load, visualization, or model-training can use extracted tags to build a searchable index of notebook content across a repository.
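As a rough sketch of the tag-based index, the snippet below maps each tag to the notebooks and cell positions where it appears. The notebooks dictionary is hypothetical stand-in data; in practice each value would come from json.load() on a real file.

```python
from collections import defaultdict

# Hypothetical parsed notebooks keyed by path (stand-ins for json.load output).
notebooks = {
    "etl.ipynb": {"cells": [
        {"cell_type": "code", "metadata": {"tags": ["data-load"]}},
        {"cell_type": "code", "metadata": {}},
    ]},
    "report.ipynb": {"cells": [
        {"cell_type": "code", "metadata": {"tags": ["data-load", "visualization"]}},
    ]},
}

# tag -> list of (notebook path, cell position) pairs
tag_index = defaultdict(list)
for path, notebook in notebooks.items():
    for position, cell in enumerate(notebook.get("cells", [])):
        for tag in cell.get("metadata", {}).get("tags", []):
            tag_index[tag].append((path, position))

for tag in sorted(tag_index):
    print(tag, tag_index[tag])
```

Looking up tag_index["data-load"] then lists every cell across the collection that carries that label.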
Extracting Output and Execution Metadata
Code cell outputs carry their own metadata that is useful for understanding what a notebook produced and how it was executed. Each output object has a type (execute_result, display_data, stream, or error) and may include MIME-typed data bundles and metadata.
import nbformat

nb = nbformat.read("analysis.ipynb", as_version=4)

for i, cell in enumerate(nb.cells):
    # Only code cells have outputs
    if cell.cell_type != "code":
        continue
    if not cell.get("outputs"):
        continue

    print(f"\nCell {i} (execution_count={cell.execution_count}):")
    for j, output in enumerate(cell.outputs):
        print(f"  Output {j}: type={output.output_type}")

        # Check for rich output data types
        if hasattr(output, "data"):
            mime_types = list(output.data.keys())
            print(f"    MIME types: {mime_types}")

        # Output-level metadata
        if hasattr(output, "metadata") and output.metadata:
            print(f"    Metadata: {dict(output.metadata)}")
Image outputs often include metadata about dimensions. For example, a plot rendered as PNG may carry metadata like {"image/png": {"width": 640, "height": 480}}, which tells you the intended display size without decoding the image data.
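A small helper can pull those dimensions out of the raw JSON. This is a sketch under the assumption that dimensions, when present, are stored under a MIME-type key in the output's metadata dictionary as in the example above. The function name image_dimensions and the sample cell are hypothetical.

```python
def image_dimensions(cell):
    """Return (mime_type, width, height) tuples found in a code cell's outputs."""
    found = []
    for output in cell.get("outputs", []):
        for mime, meta in output.get("metadata", {}).items():
            # Only image/* entries that are dicts can carry width/height.
            if mime.startswith("image/") and isinstance(meta, dict):
                found.append((mime, meta.get("width"), meta.get("height")))
    return found


# Hypothetical code cell in raw JSON form; the image payload is a placeholder.
cell = {
    "cell_type": "code",
    "outputs": [
        {"output_type": "stream", "name": "stdout", "text": "done\n"},
        {
            "output_type": "display_data",
            "data": {"image/png": "<base64 data>"},
            "metadata": {"image/png": {"width": 640, "height": 480}},
        },
    ],
}

print(image_dimensions(cell))
```

Outputs without a metadata dictionary, such as the stream output here, are skipped, and a missing width or height comes back as None rather than raising.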
Execution Timing
JupyterLab and some extensions record execution timestamps in cell metadata under the execution namespace. These ISO 8601 timestamps tell you exactly when each cell started and finished:
- iopub.execute_input: when the kernel received the execution request
- iopub.status.busy: when the kernel began working
- shell.execute_reply: when execution completed
- iopub.status.idle: when the kernel was ready for the next request
The difference between iopub.execute_input and shell.execute_reply gives you the wall-clock execution time for each cell. Aggregating these across a notebook tells you the total compute time, which is valuable for resource planning and billing in shared compute environments.
Not every notebook will have these timestamps. They are typically present in notebooks executed through JupyterLab with timing extensions enabled, but missing from notebooks run in classic Jupyter Notebook or exported from Colab.
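The duration calculation can be sketched as follows. The helper names and the sample timestamps are hypothetical, and the code assumes the timestamps are ISO 8601 strings that may end in "Z" for UTC. Cells without timing data return None instead of raising.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def cell_duration_seconds(cell):
    """Return wall-clock seconds for a cell, or None if timing is missing."""
    execution = cell.get("metadata", {}).get("execution", {})
    start = execution.get("iopub.execute_input")
    end = execution.get("shell.execute_reply")
    if not start or not end:
        return None
    return (parse_timestamp(end) - parse_timestamp(start)).total_seconds()


# Hypothetical cells: one with timing data, one without.
cells = [
    {"metadata": {"execution": {
        "iopub.execute_input": "2024-01-15T10:00:00.000000Z",
        "shell.execute_reply": "2024-01-15T10:00:02.500000Z",
    }}},
    {"metadata": {}},
]

durations = [cell_duration_seconds(c) for c in cells]
total = sum(d for d in durations if d is not None)
print(durations, total)
```

The "Z" replacement is there because datetime.fromisoformat only accepts that suffix directly from Python 3.11 onward. Timestamps with unusual fractional-second precision may still need a more lenient parser on older versions.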
Managing Notebook Metadata at Scale with Fast.io
Extracting metadata from individual notebooks is straightforward. The harder problem is managing that metadata across teams, projects, and time. When you have hundreds of notebooks spread across local machines, shared drives, and Git repositories, keeping track of what each notebook contains becomes a data management challenge.
Local file systems do not index notebook metadata. Git tracks file changes but does not let you query across repositories by kernel version or execution status. Cloud storage services like Google Drive or S3 treat .ipynb files as opaque blobs with no awareness of their internal structure.
Fast.io's Metadata Views take a different approach. Instead of writing custom scripts to extract and store notebook metadata separately, you can upload notebooks to a Fast.io workspace and define extraction schemas in natural language. Describe the fields you want, such as "kernel name", "Python version", "number of code cells", or "last execution date", and Fast.io's AI designs a typed schema and extracts those fields into a sortable, filterable spreadsheet.
This works because Metadata Views support structured extraction from any document type, including JSON-based formats like .ipynb. The extracted data stays linked to the source files, so you can click through from a metadata row to the original notebook. When notebooks are updated, you can re-extract individual fields without reprocessing the entire collection.
For teams using AI agents in their data science workflows, Fast.io workspaces provide a shared layer where both humans and agents can access the same notebooks. Enable Intelligence Mode on a workspace and uploaded notebooks are automatically indexed for semantic search and AI-powered Q&A. An agent can use Fast.io's MCP server to upload notebooks, trigger metadata extraction, and query the results programmatically, while a human reviews the same data through the web interface.
The free agent plan includes 50 GB of storage and 5,000 monthly credits with no credit card required, which covers a substantial notebook library. For compliance-focused teams, Fast.io's audit trail logs every file access and metadata operation, creating the provenance record that notebook metadata extraction alone cannot provide.
Frequently Asked Questions
What metadata is stored in a Jupyter notebook?
Jupyter notebooks store metadata at three levels. Notebook-level metadata includes the kernel specification (name, display name, language), language info (name, version, file extension, MIME type), and an optional authors list. Cell-level metadata includes tags, name, editability flags, and execution timestamps. Code cells also store an execution_count integer. Output-level metadata can include rendering flags like isolated mode and format-specific properties like image dimensions.
How do I read ipynb file metadata with Python?
The simplest approach is to load the .ipynb file with Python's built-in json module, since notebooks are JSON files. Call json.load() on the file, then access notebook["metadata"] for file-level metadata and iterate notebook["cells"] for cell-level data. For validated reading with format version handling, use the nbformat library with nbformat.read("file.ipynb", as_version=4).
What is the nbformat schema for Jupyter notebooks?
The nbformat schema is a JSON Schema definition maintained by the Jupyter project that specifies the structure of .ipynb files. The current version (v4) defines four top-level keys, notebook-level metadata fields (kernelspec, language_info, authors), cell types (code, markdown, raw) with their metadata fields, and output types (execute_result, display_data, stream, error). The schema is published in the jupyter/nbformat GitHub repository and is used by nbformat's validate() function.
Can you extract kernel information from a notebook file?
Yes. Kernel information is stored in the notebook's top-level metadata under two keys. The kernelspec dictionary contains the kernel's display_name, name, and language. The language_info dictionary contains the programming language name, version, file extension, MIME type, and codemirror mode. Both can be accessed by parsing the .ipynb JSON or by using nbformat.read() and accessing nb.metadata["kernelspec"] and nb.metadata["language_info"].
How do I extract metadata from multiple notebooks at once?
Use Python's pathlib to recursively find all .ipynb files in a directory with Path.rglob("*.ipynb"), then process each file with nbformat.read(). Filter out files in .ipynb_checkpoints directories to avoid duplicates. Wrap each read in a try/except block to handle malformed files gracefully. Collect the results into a list of dictionaries for analysis, filtering, or export to CSV or JSON.
What custom metadata do tools like Google Colab add to notebooks?
Google Colab adds a "colab" key to notebook-level metadata containing accelerator type (GPU/TPU settings), provenance information, and Colab-specific display settings. Papermill adds a "papermill" key with parameter definitions and execution status. nbproject adds dependency tracking and integrity hashes. Any tool can add custom keys to any metadata namespace, since the nbformat spec treats metadata as an open dictionary.