How to Extract Metadata from Git Repositories
Git repositories hold far more than source code. Every commit records author details and timestamps, and Git can report diff stats, branch and tag references, and GPG signature status for it. That data is valuable for analytics, compliance audits, and migration planning. This guide covers practical methods for pulling it out and putting it to work.
What Metadata Lives Inside a Git Repository
Every git commit carries a structured set of fields beyond the code diff itself. An active repository with 500 or more commits per year builds up a rich dataset you can query, filter, and export.
Here's what you can extract from each commit:
- Commit hash (%H for full SHA-1, %h for abbreviated). The unique identifier for every snapshot in the repo's history.
- Author name and email (%an, %ae). The person who wrote the change. This is distinct from the committer, which matters for rebased or cherry-picked commits.
- Committer name and email (%cn, %ce). The person who applied the change. In most workflows these match the author, but they diverge during rebases, patches, and integrations.
- Author date and committer date (%ad, %cd). Two separate timestamps per commit. Author date records when the change was originally written. Committer date records when it was applied to the current branch.
- Subject and body (%s, %b). The first line of the commit message and everything after the blank line.
- Diff stats (--stat, --shortstat, --numstat). Lines added, lines removed, and files changed per commit. This is where you get velocity and churn data.
- Branch and tag refs (%D). Which branches and tags point at the commit.
- GPG signature status (%G?, %GS, %GK). Whether the commit is signed, who signed it, and with which key.
- Tree hash (%T). The hash of the directory tree at that commit, useful for detecting identical snapshots across branches.
- Parent hashes (%P). References to parent commits, which reveal merge topology.
That's 15+ extractable fields per commit before you factor in file-level change data. The --numstat flag alone gives you per-file additions and deletions for every commit in the history.
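To illustrate the author/committer distinction from the list above, here is a minimal sketch that keeps only the commits whose author and committer emails differ. The function name diverging_commits and the tab-separated input layout are assumptions made for this example, not something git defines.

```shell
# Hypothetical helper: reads "hash<TAB>author_email<TAB>committer_email" lines
# on stdin and prints only the commits where the two emails differ.
diverging_commits() {
  awk -F '\t' '$2 != $3 { print $1 "\t" $2 "\t" $3 }'
}

# Example usage inside a repository (%x09 emits a tab character):
#   git log --pretty=format:'%h%x09%ae%x09%ce' | diverging_commits
```

Commits that show up here were typically rebased, cherry-picked, or applied from a patch by someone other than the original author.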
Extracting Metadata with git log --pretty=format
The git log --pretty=format command is the most direct way to pull structured data from a repository. You define a template using placeholder tokens, and git outputs one record per commit in your chosen format. Each record fits on a single line as long as the template avoids multi-line fields such as %b.
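A template can be assembled from a list of placeholder names rather than typed by hand. The sketch below is one possible way to do that; the helper name build_format and the choice of the ASCII unit separator (%x1f) as delimiter are assumptions for illustration.

```shell
# Hypothetical helper: joins placeholder names into a --pretty=format template,
# separating fields with the ASCII unit separator (%x1f), which is unlikely to
# appear in names or commit subjects.
build_format() {
  fmt=''
  for token in "$@"; do
    # Add a separator before every field except the first.
    if [ -n "$fmt" ]; then
      fmt="${fmt}%x1f"
    fi
    fmt="${fmt}%${token}"
  done
  printf '%s\n' "$fmt"
}

# Example usage inside a repository:
#   git log --pretty=format:"$(build_format H an ae aI s)"
```

A delimiter that rarely appears in commit data makes the output easier to split than a comma-separated template.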
Exporting to CSV
A basic CSV export captures the essentials:
git log --pretty=format:'%H,%an,%ae,%ad,%s' --date=iso > commits.csv
This gives you the full hash, author name, email, ISO-formatted date, and subject line for every commit. Add --numstat to append per-file change counts, though the output format becomes multi-line and needs post-processing.
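The post-processing for --numstat output can be fairly small. The sketch below folds the multi-line output into one line of totals per commit. It assumes each commit line is prefixed with a marker character ("@") via the format string; both that marker and the function name numstat_totals are choices made for this example.

```shell
# Hypothetical helper: reads `git log --pretty=format:'@%H' --numstat` output
# on stdin and prints "hash<TAB>added<TAB>removed<TAB>files" per commit.
numstat_totals() {
  awk -F '\t' '
    # A line starting with "@" opens a new commit; flush the previous one.
    /^@/ {
      if (hash != "") print hash "\t" added "\t" removed "\t" files
      hash = substr($0, 2); added = 0; removed = 0; files = 0
      next
    }
    # numstat lines look like "added<TAB>removed<TAB>path".
    # Binary files report "-" for both counts, so they only bump the file count.
    NF >= 3 {
      if ($1 != "-") added += $1
      if ($2 != "-") removed += $2
      files++
    }
    END { if (hash != "") print hash "\t" added "\t" removed "\t" files }
  '
}

# Example usage inside a repository:
#   git log --pretty=format:'@%H' --numstat | numstat_totals
```

Blank lines between commits are skipped because they contain fewer than three fields.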
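To keep commas inside subjects from splitting columns, wrap each field in double quotes. This does not escape double quotes that appear inside a field, so those still need post-processing: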
git log --pretty=format:'"%H","%an","%ae","%ad","%s"' \
--date=iso-strict > commits.csv
The --date=iso-strict flag outputs strict ISO 8601 timestamps (compatible with RFC 3339, in the form 2024-03-01T14:05:09+01:00) that parse cleanly in spreadsheets and databases.
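Embedded double quotes in a subject or name still need to be doubled before the file is valid CSV. A minimal sketch of that step follows, assuming git emits fields separated by the ASCII unit separator (%x1f) and that no field contains a newline, so %b is out of scope here. The function name to_csv is hypothetical.

```shell
# Hypothetical helper: reads one record per line on stdin, with fields
# separated by the ASCII unit separator (0x1f), and writes CSV in which every
# field is quoted and embedded double quotes are doubled.
to_csv() {
  awk -F "$(printf '\037')" '{
    line = ""
    for (i = 1; i <= NF; i++) {
      field = $i
      gsub(/"/, "\"\"", field)   # RFC 4180: escape a quote by doubling it
      line = line (i > 1 ? "," : "") "\"" field "\""
    }
    print line
  }'
}

# Example usage inside a repository:
#   git log --pretty=format:'%H%x1f%an%x1f%ae%x1f%aI%x1f%s' | to_csv > commits.csv
```

Doing the quoting after extraction keeps the format string simple and leaves the escaping rules in one place.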
Exporting to JSON
Git doesn't output native JSON, but the format string gets you close. Simon Willison documented a clean approach using jq:
git log --pretty=format:'%H%x00%an%x00%ae%x00%aI%x00%s' |
jq -R -s 'split("\n") | map(select(length > 0)) |
map(split("