How to Extract Metadata from Database Exports (CSV, SQL, Parquet)
Database exports carry schema definitions, column types, row counts, encoding formats, and provenance information that most teams ignore during migration and cataloging work. This guide walks through extracting that metadata from CSV, SQL dump, and Parquet files using practical tools and code examples, then compares what each format actually gives you.
Why Database Export Metadata Matters: metadata extraction database export csv sql parquet
When you export data from a database, the resulting file is more than rows and values. Depending on the format, it carries schema definitions, column types, compression details, encoding information, and statistical summaries that describe the data without requiring you to read every record.
This metadata is useful for several practical reasons:
- Data cataloging: Building a searchable inventory of exports across teams and projects
- Migration validation: Confirming that column types survived the export intact before loading into a new system
- Pipeline automation: Routing files through different processing paths based on schema shape, row counts, or compression type
- Governance and auditing: Tracking data lineage from source database to export file to downstream consumer
The catch is that each export format handles metadata differently. CSV files carry almost none. SQL dumps embed full DDL. Parquet files store rich schema metadata in a binary footer. Understanding what each format gives you, and what it doesn't, is the first step toward building reliable extraction workflows.
Helpful references: Fast.io Workspaces, Fast.io Collaboration, and Fast.io AI.
What Metadata Each Format Contains
Before writing any extraction code, it helps to know what you're working with. Here is a comparison of the three most common database export formats and the metadata each one provides.
CSV files have no built-in metadata standard. The file is plain text with a delimiter, and everything about the schema, including column names, data types, encoding, and row counts, must be inferred or documented externally. If the first row contains headers, you get column names. Beyond that, you're guessing.
SQL dump files are the most metadata-rich text-based export format. A typical dump from PostgreSQL or MySQL contains CREATE TABLE statements with column names, data types, constraints (PRIMARY KEY, NOT NULL, UNIQUE, FOREIGN KEY), default values, and index definitions. Some dumps also include comments, character set declarations, and database version information.
Apache Parquet files store full schema metadata in a binary footer at the end of the file. This includes column names, physical and logical types, encoding algorithms, compression codecs, row group boundaries, and min/max statistics for each column chunk. Parquet was designed to separate metadata from data, so query engines can read the footer without scanning the file body.
Here is a practical summary:
- CSV: Column names (if headers exist). Everything else is inferred.
- SQL dump: Column names, data types, constraints, indexes, defaults, character sets, table relationships
- Parquet: Column names, physical types, logical types, encoding, compression, row group stats, custom key-value pairs, row counts
The gap between CSV and Parquet is significant. If you're building a data catalog that needs to auto-discover schema information, Parquet files are self-describing while CSV files require external documentation or inference logic.
Extracting Metadata from CSV Files
CSV metadata extraction is schema inference, since the format itself stores almost nothing about the data's structure. Here are the practical approaches.
Reading Headers and Basic Properties
Python's built-in csv module gives you column names from the header row, and you can count rows with a simple loop:
import csv
import os
def csv_metadata(filepath):
file_size = os.path.getsize(filepath)
with open(filepath, "r", encoding="utf-8") as f:
reader = csv.reader(f)
headers = next(reader)
row_count = sum(1 for _ in reader)
return {
"columns": headers,
"column_count": len(headers),
"row_count": row_count,
"file_size_bytes": file_size,
}
This tells you the column names, how many rows and columns exist, and the file size. It tells you nothing about data types.
Inferring Column Types with Pandas
Pandas reads the first chunk of data and infers types based on content patterns:
import pandas as pd
df = pd.read_csv("export.csv", nrows=1000)
schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
The nrows parameter limits how many rows Pandas samples for inference. This is fast but imprecise. A column that looks like integers in the first 1,000 rows might contain floats or strings further down. For more reliable inference, sample multiple chunks or read the full file.
Handling Encoding
Detection CSV files don't declare their encoding. If you're processing exports from different systems, use the chardet library to detect encoding before parsing:
import chardet
with open("export.csv", "rb") as f:
raw = f.read(10000)
result = chardet.detect(raw)
encoding = result["encoding"]
Dedicated Inference Tools
For production pipelines, consider specialized tools. Snowflake's INFER_SCHEMA function auto-detects file metadata from staged data files. The csv-schema-inference Python library runs configurable type detection with tunable accuracy thresholds and batch sizes. Both go beyond what Pandas offers by handling edge cases like mixed-type columns and locale-specific number formats.
The fundamental limitation remains: CSV metadata extraction is always probabilistic. You're inferring what the schema should be, not reading what it is.
Catalog Your Database Exports with AI-Powered Metadata Extraction
Fast.io's Metadata Views turn uploaded files into structured, queryable data. Describe the fields you need, and AI extracts them from CSVs, Parquet files, SQL dumps, and more. 50 GB free, no credit card required.
Extracting Metadata from SQL Dump Files
SQL dumps contain DDL statements that serve as the authoritative schema metadata for the original database. The challenge is parsing those statements reliably across different SQL dialects.
What DDL Statements Tell You
A typical CREATE TABLE statement in a PostgreSQL dump looks like this:
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(id),
total NUMERIC(10, 2) DEFAULT 0.00,
status VARCHAR(20) CHECK (status IN ('pending', 'shipped', 'delivered')),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
From this single statement, you can extract: table name, five column names with exact data types, a primary key constraint, a foreign key relationship, a check constraint, default values, and nullability rules. That's far richer than anything CSV or even Parquet provides about the data's intended semantics.
Parsing DDL with simple-ddl-parser
The simple-ddl-parser
Python library handles DDL from PostgreSQL, MySQL, Oracle, SQL Server, Snowflake, Redshift, and BigQuery. It converts CREATE TABLE statements into structured Python dictionaries:
from simple_ddl_parser import DDLParser
with open("dump.sql", "r") as f:
ddl_text = f.read()
result = DDLParser(ddl_text).run()
for table in result:
print(f"Table: {table['table_name']}")
for col in table["columns"]:
print(f" {col['name']}: {col['type']} (nullable: {col['nullable']})")
Each parsed table includes column types, defaults, primary keys, foreign keys, unique constraints, and index definitions. The library processes about 9,000 downloads per day on PyPI, so it's actively maintained and widely used.
Extracting Metadata with grep and regex
For quick, one-off inspection of SQL dumps, command-line tools work well:
grep -i "CREATE TABLE" dump.sql
This lists every table in the dump. For more structured extraction, a regex-based approach in Python can pull table and column definitions:
import re
with open("dump.sql", "r") as f:
content = f.read()
tables = re.findall(
r"CREATE TABLE\s+(\w+)\s*\((.*?)\);",
content,
re.DOTALL | re.IGNORECASE,
)
This works for simple dumps but breaks on complex DDL with nested parentheses, multi-line constraints, or dialect-specific syntax. For anything beyond quick inspection, use a proper parser.
Beyond Table Definitions
SQL dumps often contain metadata beyond CREATE TABLE: index definitions (CREATE INDEX), views, stored procedures, sequences, comments (COMMENT ON), and SET statements that reveal the source database's character set, collation, and version. A thorough metadata extraction pipeline should capture these too, since they document the database's operational configuration alongside its schema.
Extracting Metadata from Parquet Files
Parquet files are self-describing. The format stores all schema metadata in a binary footer, so you can read column definitions, types, encoding, compression, and statistics without loading any data rows. This makes Parquet the most metadata-friendly export format by a wide margin.
Reading Schema with PyArrow
PyArrow's read_schema function reads only the footer, returning the full schema without touching the data:
import pyarrow.parquet as pq
schema = pq.read_schema("export.parquet")
for field in schema:
print(f"{field.name}: {field.type} (nullable: {field.nullable})")
This returns column names, Arrow data types (which map to Parquet physical and logical types), and nullability. For a 10 GB Parquet file, this operation takes milliseconds because only the footer is read.
Accessing File-Level Metadata
The read_metadata function returns detailed file-level information:
metadata = pq.read_metadata("export.parquet")
print(f"Format version: {metadata.format_version}")
print(f"Row count: {metadata.num_rows}")
print(f"Row groups: {metadata.num_row_groups}")
print(f"Columns: {metadata.num_columns}")
print(f"Created by: {metadata.created_by}")
print(f"File size: {metadata.serialized_size} bytes")
The created_by field typically identifies the library and version that wrote the file (for example, "parquet-mr version 1.12.3" for Java or "parquet-cpp-arrow version 15.0.0" for PyArrow). This provenance information is valuable for debugging format compatibility issues.
Row Group and Column Statistics
Parquet stores min/max statistics at the row group and column chunk level, enabling query engines to skip irrelevant data. You can access these programmatically:
for i in range(metadata.num_row_groups):
rg = metadata.row_group(i)
print(f"Row group {i}: {rg.num_rows} rows")
for j in range(rg.num_columns):
col = rg.column(j)
print(f" Column {j}: {col.physical_type}")
print(f" Compression: {col.compression}")
print(f" Encodings: {col.encodings}")
if col.statistics:
stats = col.statistics
print(f" Min: {stats.min}, Max: {stats.max}")
print(f" Null count: {stats.null_count}")
print(f" Distinct count: {stats.distinct_count}")
These statistics are what make Parquet fast for analytical queries. They're also useful metadata for cataloging: you can determine value ranges, null ratios, and data distribution without reading a single data row.
Custom Key-Value Metadata
Parquet supports arbitrary key-value metadata pairs at the file level. Teams use this for storing pipeline version, data source URLs, export timestamps, or schema registry identifiers:
if schema.metadata:
for key, value in schema.metadata.items():
print(f"{key.decode()}: {value.decode()}")
Pandas DataFrames saved to Parquet automatically store the Pandas schema as a JSON blob in the pandas metadata key, which includes the original index structure, column order, and dtype mapping.
Building a Cross-Format Metadata Catalog
Once you can extract metadata from each format individually, the next step is unifying it into a consistent catalog. This is where most teams hit friction, because CSV, SQL, and Parquet describe schemas in fundamentally different vocabularies.
Designing a Common Schema A practical metadata record should capture:
- Source format: csv, sql, parquet
- Table or file name: the logical dataset name
- Column definitions: name, inferred or declared type, nullable
- Row count: exact for Parquet and SQL (if available), estimated for CSV
- File properties: size, encoding, compression, creation date
- Provenance: which system or tool created the export
- Confidence level: high for Parquet and SQL DDL, low for CSV type inference
The confidence field matters. A Parquet schema that declares a column as INT64 NOT NULL is a contract. A Pandas inference that a CSV column is int64 is a guess that could be wrong in row 50,001.
Automated Cataloging Pipeline
Here is a simplified routing pattern that extracts metadata from any of the three formats:
from pathlib import Path
def catalog_export(filepath):
ext = Path(filepath).suffix.lower()
if ext == ".csv":
return extract_csv_metadata(filepath)
elif ext in (".sql", ".dump"):
return extract_sql_metadata(filepath)
elif ext == ".parquet":
return extract_parquet_metadata(filepath)
else:
return {"error": f"unsupported format: {ext}"}
Each extractor normalizes its output into the common schema defined above. The SQL extractor might return one record per table found in the dump, while CSV and Parquet return one record per file.
Scaling with Cloud Platforms
For teams managing hundreds or thousands of export files, manual extraction scripts don't scale. Cloud data catalogs like AWS Glue Data Catalog, Google Cloud Data Catalog, and Apache Atlas can auto-discover Parquet schemas from object storage. Snowflake's INFER_SCHEMA works on staged CSV and Parquet files alike.
For export files stored in shared workspaces, Fast.io's Metadata Views offer a different approach. Instead of writing extraction code, you describe the fields you want extracted in natural language, and AI designs a typed schema, matches files in the workspace, and populates a sortable, filterable spreadsheet. This works well for PDF-based exports, scanned database reports, and spreadsheet dumps where traditional parsing tools struggle. Metadata Views support seven field types (Text, Integer, Decimal, Boolean, URL, JSON, Date & Time) and let you add new columns without reprocessing existing files.
For structured exports like clean Parquet and SQL dumps, programmatic tools like PyArrow and simple-ddl-parser give you more control. For messy, mixed-format collections that include scanned pages and handwritten notes alongside structured exports, Metadata Views fill the gap that code-only approaches leave open.
Practical Tips and Common Pitfalls
A few things that trip up teams working with database export metadata in practice:
CSV Encoding Mismatches
Exports from legacy systems often use Latin-1 or Windows-1252 encoding instead of UTF-8. If your extraction pipeline assumes UTF-8, you'll get garbled column names or outright parse failures. Always detect encoding before inferring schema, especially for exports from older ERP or CRM systems.
SQL Dialect Differences
A MySQL dump and a PostgreSQL dump use different syntax for the same concepts. AUTO_INCREMENT vs. SERIAL, backticks vs. double quotes for identifiers, ENGINE=InnoDB declarations that don't exist in PostgreSQL. If you're parsing dumps from mixed database sources, use a multi-dialect parser like simple-ddl-parser rather than writing custom regex per dialect.
Parquet Version Compatibility
Parquet format version 2.x introduced logical types (also called converted types in earlier specs) that version 1.x readers may not understand. If metadata extraction returns unexpected types like TIMESTAMP_MICROS instead of a simple timestamp, check which format version the file was written with using metadata.format_version.
Row Count Discrepancies
Parquet metadata includes an exact row count in the footer. CSV row counts require reading the entire file. SQL dumps may include INSERT statement counts, but partial dumps or dumps with COPY statements handle this differently. When building a catalog, always note whether the row count is exact or estimated.
Large File Handling
For Parquet files, metadata extraction is always fast because only the footer is read. For large CSV files (10 GB+), even counting rows can take minutes. Consider sampling strategies: read the first 10,000 rows for schema inference, count bytes, and estimate total rows from the average row size.
Schema Evolution
Database exports taken at different points in time may have different schemas. A column added in March won't appear in a January export. Your catalog should track schema versions over time and flag when an export's schema differs from the most recent known version for that dataset.
Frequently Asked Questions
How do I get metadata from a Parquet file?
Use PyArrow's read_schema() function to get column names, types, and nullability from the file footer without loading data. Use read_metadata() for file-level details like row count, row groups, format version, compression, and column statistics. Both functions read only the binary footer, making them fast even on multi-gigabyte files.
What metadata is in a CSV file?
CSV files have no built-in metadata standard. If the first row contains headers, you get column names. Everything else, including data types, encoding, row counts, and relationships, must be inferred by reading the data or documented in a separate schema file. Tools like Pandas, Snowflake's INFER_SCHEMA, and the csv-schema-inference library can guess column types from sample data, but the results are probabilistic rather than definitive.
How do I extract schema from a SQL dump file?
SQL dumps contain CREATE TABLE statements with column names, data types, constraints, defaults, and index definitions. You can parse these with the simple-ddl-parser Python library, which supports PostgreSQL, MySQL, Oracle, SQL Server, Snowflake, and other dialects. For quick inspection, grep for CREATE TABLE statements in the dump file.
Can you read column types from a Parquet file without loading the data?
Yes. Parquet stores the full schema in a binary footer at the end of the file. PyArrow's read_schema() reads only this footer, returning column names, physical types, logical types, and nullability in milliseconds regardless of file size. You can also access row group statistics (min, max, null count) per column through the metadata object.
Which database export format preserves the most metadata?
Parquet preserves the most machine-readable metadata, including column types, encoding, compression, row group statistics, and custom key-value pairs. SQL dumps preserve the most semantic metadata, including constraints, foreign keys, defaults, and indexes that describe the database's relational structure. CSV preserves the least, with only column headers if present.
How do I catalog metadata from mixed-format database exports?
Build a routing function that detects the file extension and delegates to format-specific extractors. Normalize each extractor's output into a common schema with fields for column definitions, row counts, provenance, and a confidence level that reflects whether types are declared (Parquet, SQL) or inferred (CSV). For large collections, consider cloud data catalog services or Fast.io's Metadata Views for AI-assisted extraction.
Related Resources
Catalog Your Database Exports with AI-Powered Metadata Extraction
Fast.io's Metadata Views turn uploaded files into structured, queryable data. Describe the fields you need, and AI extracts them from CSVs, Parquet files, SQL dumps, and more. 50 GB free, no credit card required.