Skip to content
forked from simonw/csv-diff

Python CLI tool and library for comparing CSV database dumps and finding differences.

License

Notifications You must be signed in to change notification settings

datsom1/db-diff

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

db-diff

Changelog License

db-diff is the most advanced and user-friendly Python CLI tool and library for comparing CSV, TSV, and JSON database dumps. Designed specifically for database workflows, it delivers precise, human-readable or machine-readable diffs, supports custom key columns, handles massive files efficiently with streaming mode, and offers flexible field selection.


Table of Contents


Features

  • Compare CSV, TSV, and JSON files for differences
  • Human-readable and machine-readable (JSON) output
  • JSON output to file with filename and path support
  • Detects added, removed, and changed rows and columns
  • Supports custom key columns for row identity
  • Field inclusion/exclusion for focused diffs
  • Streaming mode for very large files (memory efficient)
  • Can output to terminal or JSON file
  • Automatic delimiter and encoding detection
  • Python 3.6+ compatible

Installation

Install the latest version directly from GitHub:

pip install git+https://github.com/datsom1/db-diff.git

To upgrade to the latest version:

pip install --upgrade --force-reinstall git+https://github.com/datsom1/db-diff.git

Quick Start

Suppose you have two CSV files:

one.csv

Id,name,age
1,Cleo,4
2,Pancakes,2

two.csv

Id,name,age
1,Cleo,5
3,Bailey,1

Compare them using:

db-diff one.csv two.csv --key=Id

Sample output:

1 rows changed, 1 rows added, 1 rows removed

1 rows changed

  Id: 1
    age: "4" => "5"

1 rows added

  Id: 3
  name: Bailey
  age: 1

1 rows removed

  Id: 2
  name: Pancakes
  age: 2

Usage

Command Line

db-diff is a flexible CLI tool for comparing two data files (CSV, TSV, or JSON). It detects added, removed, and changed rows and columns, and can output results in a human-readable or machine-readable format.

Basic usage:

db-diff [OPTIONS] PREVIOUS CURRENT
  • PREVIOUS and CURRENT are the file paths to the two files you want to compare.
  • The tool auto-detects file format by extension, or you can specify with --format.

Key features:

  • Custom Key Column: Use --key to specify which column uniquely identifies rows.
  • Output Formats: Choose between human-readable (readable), JSON (json), or save JSON to a file (jsonfile).
  • Field Selection: Use --fields to compare only specific columns, or --ignorefields to exclude columns.
  • Streaming Mode: For very large files, use --streaming (CSV/TSV only, files must be sorted by key).
  • Encoding: Specify file encoding with --encoding.
  • Show Unchanged: Use --showunchanged to display unchanged fields for changed rows.
  • List Fields: Use --listfields to print available columns and exit.
  • Timing: Use --time to display how long the diff operation took.

See all options:

db-diff --help

Python Library

You can use db-diff as a Python library for advanced or automated workflows. The library provides functions to load data, compare datasets, and render results.

Loading Data

from db_diff import load_csv, load_json

# Load CSV file, using a specific column as the key
with open("one.csv", encoding="utf-8") as f:
    prev = load_csv(f, key="Id")

# Load JSON file, using a specific key
with open("two.json", encoding="utf-8") as f:
    curr = load_json(f, key="Id")

Comparing Data

from db_diff import compare

# Compare two datasets (dictionaries keyed by your chosen column)
diff = compare(prev, curr, show_unchanged=False)
  • show_unchanged: If True, includes unchanged fields for changed rows.
  • fields: Pass a set of field names to only compare those fields.
  • ignorefields: Pass a set of field names to ignore during comparison.

Streaming Comparison (for large CSV/TSV files)

from db_diff import streaming_compare_csv

diff = streaming_compare_csv(
    "one.csv",
    "two.csv",
    key="Id",
    compare_columns={"Id", "name", "age"},
    encoding="utf-8",
    dialect="excel"
)

Diff Result Structure

The result of compare or streaming_compare_csv is a dictionary:

{
    "added": [ ... ],            # List of added rows (dicts)
    "removed": [ ... ],          # List of removed rows (dicts)
    "changed": [                 # List of changed rows
        {
            "key": "row_id",
            "changes": {
                "field1": ["old", "new"],
                ...
            },
            "unchanged": { ... } # (optional) if show_unchanged=True
        },
        ...
    ],
    "columns_added": [ ... ],    # List of columns added
    "columns_removed": [ ... ]   # List of columns removed
}

Rendering Human-Readable Output

from db_diff import human_text

print(human_text(diff, key="Id", current=curr))

Example: Full Workflow

from db_diff import load_csv, compare, human_text

with open("one.csv") as f1, open("two.csv") as f2:
    prev = load_csv(f1, key="Id")
    curr = load_csv(f2, key="Id")
    diff = compare(prev, curr, show_unchanged=True)
    print(human_text(diff, key="Id", current=curr))

Options

See all available options with:

db-diff --help

A summary of key options:

Option Description
--key TEXT Column to use as a unique ID for each row (default: first column header)
--output TEXT Output format: readable, json, or jsonfile (default: readable)
--outfilename FILE File to write JSON output to (used with --output=jsonfile)
--outfilepath DIR Directory to save the output file (used with --output=jsonfile)
--fields TEXT Comma-separated list of fields to compare (all others ignored)
--ignorefields TEXT Comma-separated list of fields to ignore during comparison
--showunchanged Show all fields for changed records, not just changed fields
--time Measure and display elapsed time for the diff operation
--format TEXT Explicitly specify input format: csv, tsv, or json (default: auto-detect)
--encoding TEXT Input file encoding (default: utf-8)
--streaming Use streaming mode for very large CSV/TSV files (requires files to be sorted by key)
--listfields List available fields/columns in the input files and exit
--version Show the version and exit
-h, --help Show help message and exit

Examples

Show unchanged fields for changed rows:

db-diff one.csv two.csv --key=Id --showunchanged

Output as JSON:

db-diff one.csv two.csv --key=Id --output=json

Save JSON output to a file:

db-diff one.csv two.csv --key=Id --output=jsonfile --outfilename=diffs.json

Compare only specific fields:

db-diff one.csv two.csv --key=Id --fields=Id,name

Ignore specific fields:

db-diff one.csv two.csv --key=Id --ignorefields=LastModifiedDate

Streaming mode for large files:

db-diff large1.csv large2.csv --key=Id --streaming

Performance

  • For very large files, use --streaming mode (CSV/TSV only, files must be sorted by key).
  • Supports efficient memory usage and fast comparison for millions of rows.

Development

Clone the repository:

git clone https://github.com/datsom1/db-diff.git
cd db-diff

Install dependencies:

pip install -e .

License

This project is licensed under the Apache License 2.0.


Author: Thomas Coyle
Repository: https://github.com/datsom1/db-diff

About

Python CLI tool and library for comparing CSV database dumps and finding differences.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 100.0%