How do I get metadata from a Parquet file?

Use PyArrow's read_schema() function to get column names, types, and nullability from the file footer without loading data. Use read_metadata() for file-level details like row count, row groups, format version, compression, and column statistics. Both functions read only the binary footer, making them fast even on multi-gigabyte files.

What metadata is in a CSV file?

CSV files have no built-in metadata standard. If the first row contains headers, you get column names. Everything else, including data types, encoding, row counts, and relationships, must be inferred by reading the data or documented in a separate schema file. Tools like Pandas, Snowflake's INFER_SCHEMA, and the csv-schema-inference library can guess column types from sample data, but the results are probabilistic rather than definitive.

How do I extract schema from a SQL dump file?

SQL dumps contain CREATE TABLE statements with column names, data types, constraints, defaults, and index definitions. You can parse these with the simple-ddl-parser Python library, which supports PostgreSQL, MySQL, Oracle, SQL Server, Snowflake, and other dialects. For quick inspection, grep for CREATE TABLE statements in the dump file.

Can you read column types from a Parquet file without loading the data?

Yes. Parquet stores the full schema in a binary footer at the end of the file. PyArrow's read_schema() reads only this footer, returning column names, physical types, logical types, and nullability in milliseconds regardless of file size. You can also access row group statistics (min, max, null count) per column through the metadata object.

Which database export format preserves the most metadata?

Parquet preserves the most machine-readable metadata, including column types, encoding, compression, row group statistics, and custom key-value pairs. SQL dumps preserve the most semantic metadata, including constraints, foreign keys, defaults, and indexes that describe the database's relational structure. CSV preserves the least, with only column headers if present.

How do I catalog metadata from mixed-format database exports?

Build a routing function that detects the file extension and delegates to format-specific extractors. Normalize each extractor's output into a common schema with fields for column definitions, row counts, provenance, and a confidence level that reflects whether types are declared (Parquet, SQL) or inferred (CSV). For large collections, consider cloud data catalog services or Fastio's Metadata Views for AI-assisted extraction.

Extract Metadata from Database Exports (CSV, SQL, Parquet)

Why Database Export Metadata Matters: metadata extraction database export csv sql parquet

When you export data from a database, the resulting file is more than rows and values. Depending on the format, it carries schema definitions, column types, compression details, encoding information, and statistical summaries that describe the data without requiring you to read every record.

This metadata is useful for several practical reasons:

Data cataloging: Building a searchable inventory of exports across teams and projects
Migration validation: Confirming that column types survived the export intact before loading into a new system
Pipeline automation: Routing files through different processing paths based on schema shape, row counts, or compression type
Governance and auditing: Tracking data lineage from source database to export file to downstream consumer

The catch is that each export format handles metadata differently. CSV files carry almost none. SQL dumps embed full DDL. Parquet files store rich schema metadata in a binary footer. Understanding what each format gives you, and what it doesn't, is the first step toward building reliable extraction workflows.

Helpful references: Fastio Workspaces, Fastio Collaboration, and Fastio AI.

What Metadata Each Format Contains

Before writing any extraction code, it helps to know what you're working with. Here is a comparison of the three most common database export formats and the metadata each one provides.

CSV files have no built-in metadata standard. The file is plain text with a delimiter, and everything about the schema, including column names, data types, encoding, and row counts, must be inferred or documented externally. If the first row contains headers, you get column names. Beyond that, you're guessing.

SQL dump files are the most metadata-rich text-based export format. A typical dump from PostgreSQL or MySQL contains CREATE TABLE statements with column names, data types, constraints (PRIMARY KEY, NOT NULL, UNIQUE, FOREIGN KEY), default values, and index definitions. Some dumps also include comments, character set declarations, and database version information.

Apache Parquet files store full schema metadata in a binary footer at the end of the file. This includes column names, physical and logical types, encoding algorithms, compression codecs, row group boundaries, and min/max statistics for each column chunk. Parquet was designed to separate metadata from data, so query engines can read the footer without scanning the file body.

Here is a practical summary:

CSV: Column names (if headers exist). Everything else is inferred.
SQL dump: Column names, data types, constraints, indexes, defaults, character sets, table relationships
Parquet: Column names, physical types, logical types, encoding, compression, row group stats, custom key-value pairs, row counts

The gap between CSV and Parquet is significant. If you're building a data catalog that needs to auto-discover schema information, Parquet files are self-describing while CSV files require external documentation or inference logic.

Extracting Metadata from CSV Files

CSV metadata extraction is schema inference, since the format itself stores almost nothing about the data's structure. Here are the practical approaches.

Reading Headers and Basic Properties

Python's built-in csv module gives you column names from the header row, and you can count rows with a simple loop:

import csv
import os

def csv_metadata(filepath):
    file_size = os.path.getsize(filepath)
    with open(filepath, "r", encoding="utf-8") as f:
        reader = csv.reader(f)
        headers = next(reader)
        row_count = sum(1 for _ in reader)
    return {
        "columns": headers,
        "column_count": len(headers),
        "row_count": row_count,
        "file_size_bytes": file_size,
    }

This tells you the column names, how many rows and columns exist, and the file size. It tells you nothing about data types.

Inferring Column Types with Pandas

Pandas reads the first chunk of data and infers types based on content patterns:

import pandas as pd

df = pd.read_csv("export.csv", nrows=1000)
schema = {col: str(dtype) for col, dtype in df.dtypes.items()}

The nrows parameter limits how many rows Pandas samples for inference. This is fast but imprecise. A column that looks like integers in the first 1,000 rows might contain floats or strings further down. For more reliable inference, sample multiple chunks or read the full file.

Handling Encoding

Detection CSV files don't declare their encoding. If you're processing exports from different systems, use the chardet library to detect encoding before parsing:

import chardet

with open("export.csv", "rb") as f:
    raw = f.read(10000)
    result = chardet.detect(raw)
    encoding = result["encoding"]

Dedicated Inference Tools

For production pipelines, consider specialized tools. Snowflake's INFER_SCHEMA function auto-detects file metadata from staged data files. The csv-schema-inference Python library runs configurable type detection with tunable accuracy thresholds and batch sizes. Both go beyond what Pandas offers by handling edge cases like mixed-type columns and locale-specific number formats.

The fundamental limitation remains: CSV metadata extraction is always probabilistic. You're inferring what the schema should be, not reading what it is.

Spreadsheet view showing extracted metadata fields from uploaded files

Catalog Your Database Exports with AI-Powered Metadata Extraction

Fastio's Metadata Views turn uploaded files into structured, queryable data. Describe the fields you need, and AI extracts them from CSVs, Parquet files, SQL dumps, and more. generous storage, no credit card required.

Start 14-Day Trial

Extracting Metadata from SQL Dump Files

SQL dumps contain DDL statements that serve as the authoritative schema metadata for the original database. The challenge is parsing those statements reliably across different SQL dialects.

What DDL Statements Tell You

A typical CREATE TABLE statement in a PostgreSQL dump looks like this:

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total NUMERIC(10, 2) DEFAULT 0.00,
    status VARCHAR(20) CHECK (status IN ('pending', 'shipped', 'delivered')),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

From this single statement, you can extract: table name, five column names with exact data types, a primary key constraint, a foreign key relationship, a check constraint, default values, and nullability rules. That's far richer than anything CSV or even Parquet provides about the data's intended semantics.

Parsing DDL with simple-ddl-parser

The simple-ddl-parser

Python library handles DDL from PostgreSQL, MySQL, Oracle, SQL Server, Snowflake, Redshift, and BigQuery. It converts CREATE TABLE statements into structured Python dictionaries:

from simple_ddl_parser import DDLParser

with open("dump.sql", "r") as f:
    ddl_text = f.read()

result = DDLParser(ddl_text).run()

for table in result:
    print(f"Table: {table['table_name']}")
    for col in table["columns"]:
        print(f"  {col['name']}: {col['type']} (nullable: {col['nullable']})")

Each parsed table includes column types, defaults, primary keys, foreign keys, unique constraints, and index definitions. The library processes about 9,000 downloads per day on PyPI, so it's actively maintained and widely used.

Extracting Metadata with grep and regex

For quick, one-off inspection of SQL dumps, command-line tools work well:

grep -i "CREATE TABLE" dump.sql

This lists every table in the dump. For more structured extraction, a regex-based approach in Python can pull table and column definitions:

import re

with open("dump.sql", "r") as f:
    content = f.read()

tables = re.findall(
    r"CREATE TABLE\s+(\w+)\s*\((.*?)\);",
    content,
    re.DOTALL | re.IGNORECASE,
)

This works for simple dumps but breaks on complex DDL with nested parentheses, multi-line constraints, or dialect-specific syntax. For anything beyond quick inspection, use a proper parser.

Beyond Table Definitions

SQL dumps often contain metadata beyond CREATE TABLE: index definitions (CREATE INDEX), views, stored procedures, sequences, comments (COMMENT ON), and SET statements that reveal the source database's character set, collation, and version. A thorough metadata extraction pipeline should capture these too, since they document the database's operational configuration alongside its schema.

Extracting Metadata from Parquet Files

Parquet files are self-describing. The format stores all schema metadata in a binary footer, so you can read column definitions, types, encoding, compression, and statistics without loading any data rows. This makes Parquet the most metadata-friendly export format by a wide margin.

Reading Schema with PyArrow

PyArrow's read_schema function reads only the footer, returning the full schema without touching the data:

import pyarrow.parquet as pq

schema = pq.read_schema("export.parquet")

for field in schema:
    print(f"{field.name}: {field.type} (nullable: {field.nullable})")

This returns column names, Arrow data types (which map to Parquet physical and logical types), and nullability. For a 10 GB Parquet file, this operation takes milliseconds because only the footer is read.

Accessing File-Level Metadata

The read_metadata function returns detailed file-level information:

metadata = pq.read_metadata("export.parquet")

print(f"Format version: {metadata.format_version}")
print(f"Row count: {metadata.num_rows}")
print(f"Row groups: {metadata.num_row_groups}")
print(f"Columns: {metadata.num_columns}")
print(f"Created by: {metadata.created_by}")
print(f"File size: {metadata.serialized_size} bytes")

The created_by field typically identifies the library and version that wrote the file (for example, "parquet-mr version 1.12.3" for Java or "parquet-cpp-arrow version 15.0.0" for PyArrow). This provenance information is valuable for debugging format compatibility issues.

Row Group and Column Statistics

Parquet stores min/max statistics at the row group and column chunk level, enabling query engines to skip irrelevant data. You can access these programmatically:

for i in range(metadata.num_row_groups):
    rg = metadata.row_group(i)
    print(f"Row group {i}: {rg.num_rows} rows")
    for j in range(rg.num_columns):
        col = rg.column(j)
        print(f"  Column {j}: {col.physical_type}")
        print(f"    Compression: {col.compression}")
        print(f"    Encodings: {col.encodings}")
        if col.statistics:
            stats = col.statistics
            print(f"    Min: {stats.min}, Max: {stats.max}")
            print(f"    Null count: {stats.null_count}")
            print(f"    Distinct count: {stats.distinct_count}")

These statistics are what make Parquet fast for analytical queries. They're also useful metadata for cataloging: you can determine value ranges, null ratios, and data distribution without reading a single data row.

Custom Key-Value Metadata

Parquet supports arbitrary key-value metadata pairs at the file level. Teams use this for storing pipeline version, data source URLs, export timestamps, or schema registry identifiers:

if schema.metadata:
    for key, value in schema.metadata.items():
        print(f"{key.decode()}: {value.decode()}")

Pandas DataFrames saved to Parquet automatically store the Pandas schema as a JSON blob in the pandas metadata key, which includes the original index structure, column order, and dtype mapping.

Detailed audit log showing file metadata properties and schema details

Building a Cross-Format Metadata Catalog

Once you can extract metadata from each format individually, the next step is unifying it into a consistent catalog. This is where most teams hit friction, because CSV, SQL, and Parquet describe schemas in fundamentally different vocabularies.

Designing a Common Schema A practical metadata record should capture:

Source format: csv, sql, parquet
Table or file name: the logical dataset name
Column definitions: name, inferred or declared type, nullable
Row count: exact for Parquet and SQL (if available), estimated for CSV
File properties: size, encoding, compression, creation date
Provenance: which system or tool created the export
Confidence level: high for Parquet and SQL DDL, low for CSV type inference

The confidence field matters. A Parquet schema that declares a column as INT64 NOT NULL is a contract. A Pandas inference that a CSV column is int64 is a guess that could be wrong in row 50,001.

Automated Cataloging Pipeline

Here is a simplified routing pattern that extracts metadata from any of the three formats:

from pathlib import Path

def catalog_export(filepath):
    ext = Path(filepath).suffix.lower()
    if ext == ".csv":
        return extract_csv_metadata(filepath)
    elif ext in (".sql", ".dump"):
        return extract_sql_metadata(filepath)
    elif ext == ".parquet":
        return extract_parquet_metadata(filepath)
    else:
        return {"error": f"unsupported format: {ext}"}

Each extractor normalizes its output into the common schema defined above. The SQL extractor might return one record per table found in the dump, while CSV and Parquet return one record per file.

Scaling with Cloud Platforms

For teams managing hundreds or thousands of export files, manual extraction scripts don't scale. Cloud data catalogs like AWS Glue Data Catalog, Google Cloud Data Catalog, and Apache Atlas can auto-discover Parquet schemas from object storage. Snowflake's INFER_SCHEMA works on staged CSV and Parquet files alike.

For export files stored in shared workspaces, Fastio's Metadata Views offer a different approach. Instead of writing extraction code, you describe the fields you want extracted in natural language, and AI designs a typed schema, matches files in the workspace, and populates a sortable, filterable spreadsheet. This works well for PDF-based exports, scanned database reports, and spreadsheet dumps where traditional parsing tools struggle. Metadata Views support seven field types (Text, Integer, Decimal, Boolean, URL, JSON, Date & Time) and let you add new columns without reprocessing existing files.

For structured exports like clean Parquet and SQL dumps, programmatic tools like PyArrow and simple-ddl-parser give you more control. For messy, mixed-format collections that include scanned pages and handwritten notes alongside structured exports, Metadata Views fill the gap that code-only approaches leave open.

Practical Tips and Common Pitfalls

A few things that trip up teams working with database export metadata in practice:

CSV Encoding Mismatches

Exports from legacy systems often use Latin-1 or Windows-1252 encoding instead of UTF-8. If your extraction pipeline assumes UTF-8, you'll get garbled column names or outright parse failures. Always detect encoding before inferring schema, especially for exports from older ERP or CRM systems.

SQL Dialect Differences

A MySQL dump and a PostgreSQL dump use different syntax for the same concepts. AUTO_INCREMENT vs. SERIAL, backticks vs. double quotes for identifiers, ENGINE=InnoDB declarations that don't exist in PostgreSQL. If you're parsing dumps from mixed database sources, use a multi-dialect parser like simple-ddl-parser rather than writing custom regex per dialect.

Parquet Version Compatibility

Parquet format version 2.x introduced logical types (also called converted types in earlier specs) that version 1.x readers may not understand. If metadata extraction returns unexpected types like TIMESTAMP_MICROS instead of a simple timestamp, check which format version the file was written with using metadata.format_version.

Row Count Discrepancies

Parquet metadata includes an exact row count in the footer. CSV row counts require reading the entire file. SQL dumps may include INSERT statement counts, but partial dumps or dumps with COPY statements handle this differently. When building a catalog, always note whether the row count is exact or estimated.

Large File Handling

For Parquet files, metadata extraction is always fast because only the footer is read. For large CSV files (10 GB+), even counting rows can take minutes. Consider sampling strategies: read the first 10,000 rows for schema inference, count bytes, and estimate total rows from the average row size.

Schema Evolution

Database exports taken at different points in time may have different schemas. A column added in March won't appear in a January export. Your catalog should track schema versions over time and flag when an export's schema differs from the most recent known version for that dataset.

How to Extract Metadata from Database Exports (CSV, SQL, Parquet)

Why Database Export Metadata Matters: metadata extraction database export csv sql parquet

What Metadata Each Format Contains

Extracting Metadata from CSV Files

Reading Headers and Basic Properties

Inferring Column Types with Pandas

Handling Encoding

Dedicated Inference Tools

Catalog Your Database Exports with AI-Powered Metadata Extraction

Extracting Metadata from SQL Dump Files

What DDL Statements Tell You

Parsing DDL with simple-ddl-parser

Extracting Metadata with grep and regex

Beyond Table Definitions

Extracting Metadata from Parquet Files

Reading Schema with PyArrow

Accessing File-Level Metadata

Row Group and Column Statistics

Custom Key-Value Metadata

Building a Cross-Format Metadata Catalog

Designing a Common Schema A practical metadata record should capture:

Automated Cataloging Pipeline

Scaling with Cloud Platforms

Practical Tips and Common Pitfalls

CSV Encoding Mismatches

SQL Dialect Differences

Parquet Version Compatibility

Row Count Discrepancies

Large File Handling

Schema Evolution

Frequently Asked Questions

Related Resources

Catalog Your Database Exports with AI-Powered Metadata Extraction