
How to Extract Metadata from Web Pages (Open Graph, Schema.org, and Twitter Cards)

Web pages carry structured metadata in Open Graph tags, Schema.org JSON-LD, and Twitter Card elements. This guide walks through extracting all three protocols with Python and JavaScript, handling JavaScript-rendered pages, and building a pipeline that pulls consistent data from any URL.

Fast.io Editorial Team · 9 min read

How Web Page Metadata Works

Every web page carries hidden metadata that controls how it appears when shared on social media, how search engines understand its content, and what information crawlers can extract programmatically. Three protocols dominate this space, and understanding how they differ is the first step toward extracting them reliably.

Open Graph (OG) tags were introduced by Facebook in 2010 to control how URLs render in social feeds. They live in the HTML <head> as <meta> tags with a property attribute prefixed by og:. According to W3Techs, 70.5% of all websites now include Open Graph tags, making them the most widely adopted structured metadata format on the web.

<meta property="og:title" content="Project Management Guide" />
<meta property="og:description" content="A practical guide to managing projects." />
<meta property="og:image" content="https://example.com/guide-cover.jpg" />
<meta property="og:url" content="https://example.com/guide" />
<meta property="og:type" content="article" />

Schema.org markup serves a different purpose. Created jointly by Google, Microsoft, Yahoo, and Yandex, Schema.org provides a vocabulary for describing entities like articles, products, events, and organizations. The most common implementation is JSON-LD (JavaScript Object Notation for Linked Data), embedded in a <script> tag. Over 45 million web domains include some form of Schema.org markup.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Project Management Guide",
  "author": { "@type": "Person", "name": "Jane Smith" },
  "datePublished": "2026-03-15"
}
</script>

Twitter Cards use <meta> tags with a name attribute prefixed by twitter:. They control how links appear in posts on X (formerly Twitter) and follow similar conventions to Open Graph, though they support card-specific features like twitter:card types (summary, summary_large_image, player).

<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:title" content="Project Management Guide" />
<meta name="twitter:image" content="https://example.com/guide-cover.jpg" />

Most well-maintained websites include all three. When building an extraction pipeline, you want to pull from each protocol and merge the results, since different platforms populate different fields.



Extracting Open Graph Tags

Open Graph tags are the easiest metadata to extract because they follow a consistent pattern: <meta property="og:*" content="..." />. Here are working examples in both Python and JavaScript.

Python with BeautifulSoup

import requests
from bs4 import BeautifulSoup

def extract_og_tags(url):
    response = requests.get(url, headers={"User-Agent": "MetadataBot/1.0"})
    soup = BeautifulSoup(response.text, "html.parser")

    og_tags = {}
    for tag in soup.find_all("meta", attrs={"property": True}):
        prop = tag.get("property", "")
        if prop.startswith("og:"):
            og_tags[prop] = tag.get("content", "")

    return og_tags

# Example usage
metadata = extract_og_tags("https://example.com/article")
print(metadata)
# {'og:title': 'Project Management Guide',
#  'og:description': 'A practical guide...',
#  'og:image': 'https://example.com/cover.jpg'}

A few practical notes on this approach. Always send a User-Agent header, because many sites return different content (or block requests entirely) when the user agent is missing. Use html.parser for speed, but switch to lxml if you need to handle malformed HTML more gracefully. Check response.status_code before parsing, since feeding a 403 or 404 error page to the parser wastes time and yields no useful metadata.
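Folded into the fetch step, those notes look roughly like this. This is a sketch, and fetch_and_parse is a name chosen here for illustration, not a library function:

import requests
from bs4 import BeautifulSoup

def fetch_and_parse(url, parser="html.parser"):
    # Send a User-Agent and check the status code before parsing,
    # so error pages never reach the metadata extractors.
    response = requests.get(
        url, headers={"User-Agent": "MetadataBot/1.0"}, timeout=10
    )
    if response.status_code != 200:
        return None
    # Pass parser="lxml" when you expect badly malformed HTML.
    return BeautifulSoup(response.text, parser)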

Node.js with open-graph-scraper

The open-graph-scraper npm package handles fetching, parsing, and normalizing OG tags in a single call:

import ogs from "open-graph-scraper";

async function extractOG(url) {
  const { result } = await ogs({ url });
  return {
    title: result.ogTitle,
    description: result.ogDescription,
    image: result.ogImage?.[0]?.url,
    type: result.ogType,
    siteName: result.ogSiteName,
  };
}

const metadata = await extractOG("https://example.com/article");
console.log(metadata);

The open-graph-scraper library also extracts Twitter Card data and standard meta tags as fallbacks, which saves you from writing separate parsers for each protocol. For batch extraction, the metascraper package from Microlink is another strong option. It uses a plugin architecture where you compose rule sets for different metadata sources.

Parsing Schema.org JSON-LD from HTML

JSON-LD extraction follows a different pattern from OG tags. Instead of scanning meta elements, you look for <script> tags with type="application/ld+json" and parse their contents as JSON.

import json
import requests
from bs4 import BeautifulSoup

def extract_jsonld(url):
    response = requests.get(url, headers={"User-Agent": "MetadataBot/1.0"})
    soup = BeautifulSoup(response.text, "html.parser")

    schemas = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string)
            schemas.append(data)
        except (json.JSONDecodeError, TypeError):
            continue

    return schemas

A single page can contain multiple JSON-LD blocks. An e-commerce product page might have one block for the Product schema (with price, availability, and reviews) and another for BreadcrumbList navigation. Your extraction code should collect all of them rather than stopping at the first match.

Handling Nested and Array Types

Schema.org data is often nested. An Article type might contain an author of type Person, which itself contains a worksFor of type Organization. When you need specific fields, walk the structure:

def get_article_metadata(schemas):
    for schema in schemas:
        items = schema if isinstance(schema, list) else [schema]
        for item in items:
            if item.get("@type") == "Article":
                return {
                    "headline": item.get("headline"),
                    "author": item.get("author", {}).get("name"),
                    "published": item.get("datePublished"),
                    "modified": item.get("dateModified"),
                    "publisher": item.get("publisher", {}).get("name"),
                }
    return None
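One wrinkle the walk above glosses over: author values in the wild appear as a plain string, a single Person object, or a list of either, and calling .get("name") on a string or list raises an error. A small normalizer keeps the lookup from breaking; this is a sketch, and get_person_name is a name chosen here for illustration:

def get_person_name(value):
    # JSON-LD author/publisher values can be a string, an object with a
    # "name" field, or a list of either; reduce all three to one string.
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        return value.get("name", "")
    if isinstance(value, list) and value:
        return get_person_name(value[0])
    return ""

Swapping it into the extractor above, for example "author": get_person_name(item.get("author")), also covers publisher fields with the same helper.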

Some sites embed Schema.org as microdata (using itemscope, itemtype, and itemprop HTML attributes) instead of JSON-LD. Microdata is harder to parse because the data is scattered across the DOM. The Python extruct library handles both formats and normalizes the output, which saves you from writing separate extraction logic for each.

import extruct
import requests

url = "https://example.com/article"
response = requests.get(url, headers={"User-Agent": "MetadataBot/1.0"})
data = extruct.extract(response.text, syntaxes=["json-ld", "microdata"])

For most modern sites, JSON-LD is the preferred format. Google's own documentation recommends JSON-LD over microdata, and adoption has shifted accordingly. But if you need broad coverage across older sites, plan for both.


Turn Documents into Structured Data Without Writing Code

Fast.io Metadata Views let you describe the fields you want extracted in plain language. AI handles the rest, pulling structured data from PDFs, images, spreadsheets, and more into a queryable grid. 50 GB free, no credit card required.

Extracting Metadata from JavaScript-Rendered Pages

Static HTML parsing works for most websites, but single-page applications (SPAs) built with React, Vue, or Angular often inject meta tags after the initial page load through client-side JavaScript. When you fetch these pages with requests or a basic HTTP client, you get the raw HTML shell before any JavaScript runs, and the OG tags or JSON-LD blocks are absent.

The fix is a headless browser. Puppeteer (for Node.js) and Playwright (for Python or Node.js) both launch a real browser engine, execute all JavaScript, and give you the fully rendered DOM.

Puppeteer Example

import puppeteer from "puppeteer";

async function extractFromSPA(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle0" });

  const metadata = await page.evaluate(() => {
    const getMeta = (selector) =>
      document.querySelector(selector)?.getAttribute("content") || "";

    return {
      ogTitle: getMeta('meta[property="og:title"]'),
      ogDescription: getMeta('meta[property="og:description"]'),
      ogImage: getMeta('meta[property="og:image"]'),
      jsonLd: Array.from(
        document.querySelectorAll('script[type="application/ld+json"]')
      ).map((el) => JSON.parse(el.textContent)),
    };
  });

  await browser.close();
  return metadata;
}

When to Use a Headless Browser

Headless browsers are slower and more resource-intensive than static parsing. A simple requests.get() call takes milliseconds; launching a browser instance takes seconds. For most sites, try static parsing first and fall back to a headless browser only when the result is empty or incomplete.

A practical pattern (a Python sketch follows the list):

  1. Fetch the page with a standard HTTP client
  2. Check if og:title or any JSON-LD block exists
  3. If the metadata is missing or suspiciously empty, re-fetch with a headless browser
  4. Cache the result so you do not repeat the expensive browser launch
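Here is a rough sketch of that flow, reusing extract_og_tags and extract_jsonld from earlier and the extract_with_playwright function defined in the next subsection. The extract_metadata name and the cache size are illustrative choices, not fixed APIs:

import functools

# Step 4: the cache keeps results in memory, so repeat lookups never
# trigger a second fetch or browser launch.
@functools.lru_cache(maxsize=1024)
def extract_metadata(url):
    # Steps 1-2: cheap static parsing first (two fetches here for brevity;
    # in practice you would reuse a single response).
    og = extract_og_tags(url)
    schemas = extract_jsonld(url)
    if og.get("og:title") or schemas:
        return {"og": og, "json_ld": schemas}
    # Step 3: metadata is missing or empty, so pay for a full browser render.
    rendered = extract_with_playwright(url)
    return {"og": {"og:title": rendered["og_title"]}, "json_ld": rendered["json_ld"]}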

The waitUntil: "networkidle0" setting in Puppeteer tells the browser to wait until there have been no open network connections for at least 500 ms. For pages that load metadata through delayed API calls, you might need to add a short explicit wait or use waitForSelector to wait for a specific meta tag to appear in the DOM.

Playwright as an Alternative

Playwright offers the same capabilities with a slightly different API and supports Chromium, Firefox, and WebKit from a single installation. Its Python bindings are well maintained, making it a good fit if your pipeline is already Python-based:

from playwright.sync_api import sync_playwright

def extract_with_playwright(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        og_title = page.get_attribute(
            'meta[property="og:title"]', "content"
        ) or ""
        json_ld = page.evaluate("""
            Array.from(
                document.querySelectorAll('script[type="application/ld+json"]')
            ).map(el => JSON.parse(el.textContent))
        """)

        browser.close()
        return {"og_title": og_title, "json_ld": json_ld}

Building a Complete Extraction Pipeline

A production-ready metadata extractor pulls from all three protocols and merges the results into a single normalized record. The merge priority matters because different sources carry different levels of detail.

Priority Chain

When the same field (like "title") appears in multiple protocols, use this fallback order:

  1. Open Graph (most widely adopted, highest consistency)
  2. Twitter Cards (often mirrors OG, sometimes has unique data)
  3. Schema.org JSON-LD (most structured, but field names vary by type)
  4. Standard HTML (<title> tag, <meta name="description">)

Expressed in code, the fallback order looks like this:

def merge_metadata(og, twitter, schema, html):
    return {
        "title": og.get("og:title")
            or twitter.get("twitter:title")
            or schema.get("headline")
            or html.get("title"),
        "description": og.get("og:description")
            or twitter.get("twitter:description")
            or schema.get("description")
            or html.get("description"),
        "image": og.get("og:image")
            or twitter.get("twitter:image")
            or schema.get("image"),
        "author": schema.get("author"),
        "published": schema.get("datePublished"),
        "type": og.get("og:type")
            or schema.get("@type"),
    }
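The og dictionary can come straight from extract_og_tags. The twitter and html inputs are not built anywhere above, but helpers in the same style are short to write. The sketch below is one possible version (the function names are illustrative), assuming soup is a BeautifulSoup object from the fetch step:

def extract_twitter_tags(soup):
    # Twitter Card tags use name="twitter:*" rather than property.
    return {
        tag["name"]: tag.get("content", "")
        for tag in soup.find_all("meta", attrs={"name": True})
        if tag["name"].startswith("twitter:")
    }

def extract_html_fallbacks(soup):
    # Plain <title> and <meta name="description"> as the last resort.
    description = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.string.strip() if soup.title and soup.title.string else "",
        "description": description.get("content", "") if description else "",
    }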

Handling Edge Cases

Real-world extraction runs into problems that tutorials skip. Some sites serve different HTML to different user agents, so you will get different metadata depending on whether your request looks like Googlebot, a browser, or a generic script. Some sites put OG tags inside <noscript> blocks. Others use property and name interchangeably on the same meta tag, even though the OG spec requires property.

Your parser should handle all of these (the first three items are sketched in code after the list):

  • Accept both property="og:title" and name="og:title" selectors
  • Strip whitespace and newlines from content values
  • Resolve relative image URLs against the page's base URL
  • Handle pages that return a redirect chain (follow redirects, but record the final URL)
  • Respect robots meta tags and X-Robots-Tag headers that may restrict scraping
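A hedged sketch of the first three items, using urllib.parse.urljoin to resolve relative image URLs. The helper names are illustrative, and soup is assumed to be a BeautifulSoup object:

from urllib.parse import urljoin

def get_og_value(soup, og_property):
    # Accept both property= and name= attributes, since some sites mix
    # them up, and strip whitespace/newlines from the content value.
    tag = soup.find("meta", attrs={"property": og_property}) or soup.find(
        "meta", attrs={"name": og_property}
    )
    return (tag.get("content") or "").strip() if tag else ""

def resolve_image_url(final_page_url, image_value):
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the final URL recorded after following redirects.
    return urljoin(final_page_url, image_value) if image_value else ""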

Rate Limiting and Respectful Crawling

If you are extracting metadata from hundreds or thousands of pages, add delays between requests and respect robots.txt. A common pattern is to limit yourself to one request per second per domain and to cache results so repeated lookups do not hit the origin server.
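A bare-bones version of that throttle, keyed by domain, might look like the sketch below. The polite_get name and module-level dictionary are illustrative choices; robots.txt checking is left out for brevity:

import time
from urllib.parse import urlparse

import requests

_last_request_at = {}

def polite_get(url, min_interval=1.0):
    # Throttle to one request per min_interval seconds per domain,
    # sleeping only when the previous hit to that host was too recent.
    domain = urlparse(url).netloc
    wait = min_interval - (time.monotonic() - _last_request_at.get(domain, 0.0))
    if wait > 0:
        time.sleep(wait)
    _last_request_at[domain] = time.monotonic()
    return requests.get(url, headers={"User-Agent": "MetadataBot/1.0"}, timeout=10)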

Storing Extracted Metadata

Once extracted, metadata needs a home. For small-scale projects, a JSON file or SQLite database works fine. For larger pipelines, consider a structured data store where you can query across fields.

If your pipeline works with documents rather than web pages, the extraction problem shifts. PDFs, Word files, and spreadsheets carry their own metadata (author, creation date, keywords), but extracting it requires different tools. Fast.io's Metadata Views handle this by letting you describe the fields you want in natural language, then automatically extracting and organizing them into a queryable spreadsheet. You define columns like "author," "publish date," or "document type," and the AI matches files in your workspace and populates the data. It is the document equivalent of what you would build for web pages, but without writing extraction code.

For web metadata specifically, storing results alongside the source URL and extraction timestamp lets you track changes over time and detect when a page updates its OG tags or Schema.org markup.
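For the small-scale SQLite option, a minimal schema that records the source URL and extraction timestamp could look like this sketch (the table and column names are illustrative):

import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS page_metadata (
        url TEXT NOT NULL,
        extracted_at TEXT NOT NULL,
        metadata_json TEXT NOT NULL,
        PRIMARY KEY (url, extracted_at)
    )
""")

def save_metadata(url, metadata):
    # One row per extraction keeps a history, so you can diff a page's
    # OG tags or Schema.org markup over time.
    conn.execute(
        "INSERT INTO page_metadata VALUES (?, ?, ?)",
        (url, datetime.now(timezone.utc).isoformat(), json.dumps(metadata)),
    )
    conn.commit()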

Frequently Asked Questions

How do I extract Open Graph metadata from a URL?

Fetch the page HTML with an HTTP client like Python's requests library, then parse it with BeautifulSoup. Look for all meta tags where the property attribute starts with 'og:' and read their content attributes. In Node.js, the open-graph-scraper package handles fetching and parsing in a single call.

What is the difference between Open Graph and Schema.org?

Open Graph controls how a URL appears when shared on social media platforms like Facebook and LinkedIn. Schema.org provides structured data that search engines use to generate rich results, knowledge panels, and other enhanced search features. Open Graph uses meta tags in the HTML head, while Schema.org is most commonly implemented as JSON-LD in a script tag.

How do I scrape meta tags from a web page?

Use an HTTP library to download the page HTML, then parse the head section for meta elements. In Python, BeautifulSoup's find_all method with attribute filters extracts specific tag types. For JavaScript, open-graph-scraper or metascraper handle multiple tag formats automatically. If tags are injected by client-side JavaScript, you will need a headless browser like Puppeteer or Playwright.

Can I extract Schema.org JSON-LD with Python?

Yes. Find all script tags with type='application/ld+json' using BeautifulSoup, then parse each one with json.loads(). The extruct library is another option that extracts JSON-LD, microdata, and RDFa in a single call and normalizes the output.

Why are my extracted meta tags empty on some pages?

Single-page applications built with React, Vue, or Angular often inject meta tags after the initial page load using client-side JavaScript. A standard HTTP request only gets the raw HTML before JavaScript runs. Use a headless browser like Puppeteer or Playwright to render the page fully before extracting tags.

How do Twitter Cards relate to Open Graph tags?

Twitter Cards use meta tags with a 'twitter:' prefix to control how links appear on X (formerly Twitter). Many properties overlap with Open Graph, and X's crawler falls back to OG tags when Twitter Card tags are missing. If you already have Open Graph tags on your pages, you only need to add twitter:card to specify the card type.
