db-diff

db-diff is a Python CLI tool and library for comparing CSV, TSV, and JSON database dumps. It produces human-readable or machine-readable (JSON) diffs, supports custom key columns and field selection, and offers a streaming mode for files too large to load into memory.

Features

Compare CSV, TSV, and JSON files for differences
Human-readable or machine-readable (JSON) output, printed to the terminal or written to a file
JSON file output with configurable filename and directory
Detects added, removed, and changed rows and columns
Supports custom key columns for row identity
Field inclusion/exclusion for focused diffs
Streaming mode for very large CSV/TSV files (memory efficient; inputs must be sorted by key)
Automatic delimiter and encoding detection
Python 3.6+ compatible
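
The README does not say how delimiter detection is implemented. As a rough illustration of the idea only, the standard library's csv.Sniffer can guess a delimiter from a sample of the file. The helper name sniff_delimiter and the candidate delimiter set are assumptions made for this sketch, not part of db-diff.

```python
import csv


def sniff_delimiter(sample, candidates=",\t;|"):
    """Guess the field delimiter of a text sample with csv.Sniffer.

    Hypothetical helper for illustration; db-diff may detect
    delimiters differently.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


csv_sample = "Id,name,age\n1,Cleo,4\n2,Pancakes,2\n"
tsv_sample = "Id\tname\tage\n1\tCleo\t4\n2\tPancakes\t2\n"

print(repr(sniff_delimiter(csv_sample)))
print(repr(sniff_delimiter(tsv_sample)))
```

Sniffing works on a sample, so it can misjudge unusual files; passing the format explicitly is the safer choice when the input is known.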

Installation

Install the latest version directly from GitHub:

pip install git+https://github.com/datsom1/db-diff.git

To upgrade to the latest version:

pip install --upgrade --force-reinstall git+https://github.com/datsom1/db-diff.git

Quick Start

Suppose you have two CSV files:

one.csv

Id,name,age
1,Cleo,4
2,Pancakes,2

two.csv

Id,name,age
1,Cleo,5
3,Bailey,1

Compare them using:

db-diff one.csv two.csv --key=Id

Sample output:

1 rows changed, 1 rows added, 1 rows removed

1 rows changed

  Id: 1
    age: "4" => "5"

1 rows added

  Id: 3
  name: Bailey
  age: 1

1 rows removed

  Id: 2
  name: Pancakes
  age: 2
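
To make the output above concrete, here is a minimal sketch of a key-based row diff written with the standard library only. It is not db-diff's implementation; load_keyed and keyed_diff are hypothetical names used for illustration, and the sample data is the pair of files from this section.

```python
import csv
import io

ONE = "Id,name,age\n1,Cleo,4\n2,Pancakes,2\n"
TWO = "Id,name,age\n1,Cleo,5\n3,Bailey,1\n"


def load_keyed(text, key):
    """Read CSV text into a dict mapping each row's key value to the row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def keyed_diff(previous, current):
    """Classify rows as added, removed, or changed by comparing keys."""
    added = [current[k] for k in current if k not in previous]
    removed = [previous[k] for k in previous if k not in current]
    changed = {}
    for k in previous.keys() & current.keys():
        # Record (old, new) for every field whose value differs.
        fields = {
            name: (previous[k][name], current[k].get(name))
            for name in previous[k]
            if previous[k][name] != current[k].get(name)
        }
        if fields:
            changed[k] = fields
    return added, removed, changed


added, removed, changed = keyed_diff(load_keyed(ONE, "Id"), load_keyed(TWO, "Id"))
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Row 1 exists in both files with a different age, so it is reported as changed; row 2 exists only in the first file (removed) and row 3 only in the second (added).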

Usage

Command Line

The db-diff command compares two data files (CSV, TSV, or JSON). It reports added, removed, and changed rows and columns, in either human-readable or machine-readable form.

Basic usage:

db-diff [OPTIONS] PREVIOUS CURRENT

PREVIOUS and CURRENT are the paths to the two files to compare.
The input format is detected from the file extension; use --format to set it explicitly.
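
Extension-based detection with an explicit override can be pictured roughly as follows. This is a sketch under assumptions: the function guess_format and the extension mapping are hypothetical, and the extensions db-diff actually recognizes may differ.

```python
from pathlib import Path

# Hypothetical mapping for illustration; db-diff's own table may differ.
EXTENSION_FORMATS = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def guess_format(path, override=None):
    """Return the explicit override if given, else infer from the extension."""
    if override is not None:
        return override
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(
            f"cannot infer format from {path!r}; pass an explicit format"
        )


print(guess_format("one.csv"))
print(guess_format("dump.txt", override="tsv"))
```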

Key features:

Custom Key Column: Use --key to specify which column uniquely identifies rows.
Output Formats: Use --output to choose human-readable text (readable), JSON printed to the terminal (json), or JSON saved to a file (jsonfile).
Field Selection: Use --fields to compare only specific columns, or --ignorefields to exclude columns.
Streaming Mode: For very large files, use --streaming (CSV/TSV only, files must be sorted by key).
Encoding: Specify file encoding with --encoding.
Show Unchanged: Use --showunchanged to display unchanged fields for changed rows.
List Fields: Use --listfields to print available columns and exit.
Timing: Use --time to display how long the diff operation took.

See all options:

db-diff --help

Python Library

You can use db-diff as a Python library for advanced or automated workflows. The library provides functions to load data, compare datasets, and render results.

Loading Data

from db_diff import load_csv, load_json

# Load CSV file, using a specific column as the key
with open("one.csv", encoding="utf-8") as f:
    prev = load_csv(f, key="Id")

# Load JSON file, using a specific key
with open("two.json", encoding="utf-8") as f:
    curr = load_json(f, key="Id")

Comparing Data

from db_diff import compare

# Compare two datasets (dictionaries keyed by your chosen column)
diff = compare(prev, curr, show_unchanged=False)

show_unchanged: If True, includes unchanged fields for changed rows.
fields: Pass a set of field names to only compare those fields.
ignorefields: Pass a set of field names to ignore during comparison.
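
The effect of fields and ignorefields can be thought of as a column filter applied to each row before comparison. The sketch below illustrates that idea with a hypothetical helper, select_fields. It always keeps the key column so rows stay identifiable, which is an assumption of this sketch rather than documented db-diff behavior.

```python
def select_fields(row, key, fields=None, ignorefields=None):
    """Return a copy of row limited to the columns that should be compared.

    Hypothetical helper for illustration only.
    """
    ignored = set(ignorefields or ())
    kept = {}
    for name, value in row.items():
        if name == key:
            kept[name] = value  # assumption: the key column is always kept
        elif fields is not None and name not in fields:
            continue  # not in the inclusion list
        elif name in ignored:
            continue  # explicitly excluded
        else:
            kept[name] = value
    return kept


row = {"Id": "1", "name": "Cleo", "age": "4", "LastModifiedDate": "2024-01-01"}
print(select_fields(row, "Id", fields={"name"}))
print(select_fields(row, "Id", ignorefields={"LastModifiedDate"}))
```

Ignoring a volatile column such as a timestamp keeps rows from being reported as changed when only that column differs.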

Streaming Comparison (for large CSV/TSV files)

from db_diff import streaming_compare_csv

diff = streaming_compare_csv(
    "one.csv",
    "two.csv",
    key="Id",
    compare_columns={"Id", "name", "age"},
    encoding="utf-8",
    dialect="excel"
)
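
The requirement that both files be sorted by key is typical of a merge-style comparison, where two readers advance in step and only one row per file is held in memory. The following is a minimal sketch of that technique, not db-diff's actual implementation. The name stream_diff and its tuple output are hypothetical, and keys are compared as strings, which is an assumption about the sort order.

```python
import csv
import io


def stream_diff(prev_lines, curr_lines, key):
    """Yield (kind, key_value, detail) from two key-sorted CSV streams."""
    prev_rows = csv.DictReader(prev_lines)
    curr_rows = csv.DictReader(curr_lines)
    prev = next(prev_rows, None)
    curr = next(curr_rows, None)
    while prev is not None or curr is not None:
        if curr is None or (prev is not None and prev[key] < curr[key]):
            # Key exists only in the previous file.
            yield ("removed", prev[key], prev)
            prev = next(prev_rows, None)
        elif prev is None or curr[key] < prev[key]:
            # Key exists only in the current file.
            yield ("added", curr[key], curr)
            curr = next(curr_rows, None)
        else:
            # Same key in both files: compare field by field.
            changes = {
                name: (prev[name], curr.get(name))
                for name in prev
                if prev[name] != curr.get(name)
            }
            if changes:
                yield ("changed", prev[key], changes)
            prev = next(prev_rows, None)
            curr = next(curr_rows, None)


ONE = "Id,name,age\n1,Cleo,4\n2,Pancakes,2\n"
TWO = "Id,name,age\n1,Cleo,5\n3,Bailey,1\n"

events = list(stream_diff(io.StringIO(ONE), io.StringIO(TWO), key="Id"))
for event in events:
    print(event)
```

If either input is not sorted consistently with the comparison used here, a merge like this reports spurious additions and removals, which is why the sort requirement matters.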

Diff Result Structure

The result of compare or streaming_compare_csv is a dictionary:

{
    "added": [ ... ],            # List of added rows (dicts)
    "removed": [ ... ],          # List of removed rows (dicts)
    "changed": [                 # List of changed rows
        {
            "key": "row_id",
            "changes": {
                "field1": ["old", "new"],
                ...
            },
            "unchanged": { ... } # (optional) if show_unchanged=True
        },
        ...
    ],
    "columns_added": [ ... ],    # List of columns added
    "columns_removed": [ ... ]   # List of columns removed
}
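
A short sketch of consuming a result with this shape follows. The diff dict here is hand-built to mirror the Quick Start data rather than produced by db-diff, and the string-typed values are an assumption; count_changes and changed_fields are hypothetical helper names.

```python
import json

# Hand-built example following the structure above (not real db-diff output).
diff = {
    "added": [{"Id": "3", "name": "Bailey", "age": "1"}],
    "removed": [{"Id": "2", "name": "Pancakes", "age": "2"}],
    "changed": [{"key": "1", "changes": {"age": ["4", "5"]}}],
    "columns_added": [],
    "columns_removed": [],
}

CATEGORIES = ("added", "removed", "changed", "columns_added", "columns_removed")


def count_changes(diff):
    """Count the entries in each category of the diff."""
    return {name: len(diff.get(name, [])) for name in CATEGORIES}


def changed_fields(diff):
    """Yield (key, field, old, new) for every changed field."""
    for row in diff["changed"]:
        for field, (old, new) in row["changes"].items():
            yield row["key"], field, old, new


print(count_changes(diff))
for key, field, old, new in changed_fields(diff):
    print(f"{key}: {field} {old!r} => {new!r}")

# The structure holds only dicts, lists, and strings, so it serializes directly.
print(json.dumps(diff, indent=2))
```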

Rendering Human-Readable Output

from db_diff import human_text

print(human_text(diff, key="Id", current=curr))

Example: Full Workflow

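from db_diff import load_csv, compare, human_text

# Open both files with an explicit encoding to avoid platform-dependent defaults
with open("one.csv", encoding="utf-8") as f1, open("two.csv", encoding="utf-8") as f2:
    # Load each file into a dataset keyed by the "Id" column
    prev = load_csv(f1, key="Id")
    curr = load_csv(f2, key="Id")
    # Compare, keeping unchanged fields for context, then print the report
    diff = compare(prev, curr, show_unchanged=True)
    print(human_text(diff, key="Id", current=curr))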

Options

See all available options with:

db-diff --help

A summary of key options:

| Option | Description |
| --- | --- |
| `--key TEXT` | Column to use as a unique ID for each row (default: first column header) |
| `--output TEXT` | Output format: `readable`, `json`, or `jsonfile` (default: readable) |
| `--outfilename FILE` | File to write JSON output to (used with `--output=jsonfile`) |
| `--outfilepath DIR` | Directory to save the output file (used with `--output=jsonfile`) |
| `--fields TEXT` | Comma-separated list of fields to compare (all others ignored) |
| `--ignorefields TEXT` | Comma-separated list of fields to ignore during comparison |
| `--showunchanged` | Show all fields for changed records, not just changed fields |
| `--time` | Measure and display elapsed time for the diff operation |
| `--format TEXT` | Explicitly specify input format: `csv`, `tsv`, or `json` (default: auto-detect) |
| `--encoding TEXT` | Input file encoding (default: utf-8) |
| `--streaming` | Use streaming mode for very large CSV/TSV files (requires files to be sorted by key) |
| `--listfields` | List available fields/columns in the input files and exit |
| `--version` | Show the version and exit |
| `-h, --help` | Show help message and exit |

Examples

Show unchanged fields for changed rows:

db-diff one.csv two.csv --key=Id --showunchanged

Output as JSON:

db-diff one.csv two.csv --key=Id --output=json

Save JSON output to a file:

db-diff one.csv two.csv --key=Id --output=jsonfile --outfilename=diffs.json

Compare only specific fields:

db-diff one.csv two.csv --key=Id --fields=Id,name

Ignore specific fields:

db-diff one.csv two.csv --key=Id --ignorefields=LastModifiedDate

Streaming mode for large files:

db-diff large1.csv large2.csv --key=Id --streaming

Performance

For very large CSV/TSV files, use --streaming. Both files must be sorted by the key column.
Streaming mode is memory efficient, which makes it suitable for comparing files with millions of rows.

Development

Clone the repository:

git clone https://github.com/datsom1/db-diff.git
cd db-diff

Install the package and its dependencies in editable mode:

pip install -e .

License

This project is licensed under the Apache License 2.0.

Author: Thomas Coyle
Repository: https://github.com/datsom1/db-diff
