jointer

Command line utility to aid in joining/merging tabular data, extending the functionality of the join command and 'two-file' awk idioms.

Example use cases

Simple: Mapping country codes to country names

Maybe you have a file with two columns, (1) a 3-letter country code, and (2) the number of people from that country in millions (I don't know if these numbers are accurate):

Example file country_populations.tsv:

country_code	population
BRA	209
CHN	1439
IND	1380
USA	328

then you have another file with two columns, (1) a 3-letter country code, and (2) the name of the country:

Example file country_codes.tsv:

country_code	country_name
BRA	Brazil
CHN	China
IND	India
USA	United-States

Your goal is to merge these two files to get a table with three columns: (1) the country code, (2) the country name, and (3) the population. You could achieve this using the join command:

join country_populations.tsv country_codes.tsv

This would give you the following merged table:

country_code	country_name	population
BRA	Brazil	209
CHN	China	1439
IND	India	1380
USA	United-States	328

You could also use jointer to achieve the same result:

jointer country_populations.tsv country_codes.tsv

By default join and jointer will join on the first field of each file. If the fields to join on are not the first field, you can specify the fields to join on using the -1 and -2 options. For example, the country code could be the second field in the country_codes_col_reorder.tsv file

country_name	country_code
Brazil	BRA
China	CHN
India	IND
United-States	USA

In which case you would use the following command to perform the join:

join -1 1 -2 2 country_populations.tsv country_codes_col_reorder.tsv

or

jointer -1 1 -2 2 country_populations.tsv country_codes_col_reorder.tsv

If the join column is the 2nd for both files, you can use the -j option instead, even though this is marked as a "compatibility" option that is not recommended for use in new scripts:

join -j 2 country_populations_col_reorder.tsv country_codes_col_reorder.tsv

jointer -j 2 country_populations_col_reorder.tsv country_codes_col_reorder.tsv

!TODO add in examples from join man page TODO! !TODO add in doctest TODO!

Multi-column merging and aggregating of two sorted tables

Let's say we have kept track of the hours logged by users using two different applications and we're interested in knowing the total number of hours logged by each user on each day, but only for the days that the user logged hours in both applications.

Example date sorted table application_A.tsv:

date	user	hours_logged
2020-01-01	user1	3.5
2020-01-02	user1	4.5
2020-01-02	user2	3

Example date sorted table application_B.tsv:

date	user	hours_logged
2020-01-01	user1	3.0
2020-01-01	user2	2.0
2020-01-02	user1	4.0
2020-01-02	user2	3.5

The desired merged table would be:

date	user	hours_logged
2020-01-01	user1	6.5
2020-01-02	user1	8.5
2020-01-02	user2	6.5

We need to join on two fields, but the join coreutils command only supports joining on a single field. Instead we can first use awk to create a single key to join on, and then use join to merge the files.

join \
    <(awk '{print $1"-"$2,$1,$2,$3}' < application_A.tsv) \
    <(awk '{print $1"-"$2,$3}' < application_B.tsv) \
    | awk 'BEGIN{OFS="\t"} {print $2,$3,$4+$5}'

Alternatively, we can use idiomatic awk two-file processing (good explanation here)

awk 'NR==FNR {a[$1$2]=$3; next} $1$2 in a {print $1,$2,$3+a[$1$2]}' \
    application_A.tsv application_B.tsv

Or we can use jointer:

jointer \
    -j 1,eq 2,eq \
    -m all \
    -c 1,2,sum(3) \
    application_A.tsv application_B.tsv

Merging BED files sorted by chrom, start, end

This functionality is already covered by the bedtools suite of software. However, it is just an abstraction of the general merging that is possible with the jointer tool.

Example sorted BED file source_a.bed:

#chrom	start	end	count
chr1	100	150	3
chr1	120	150	5
chr2	570	600	1
chrX	1430	1450	8

Example sorted BED file source_b.bed:

#chrom	start	end	count
chr1	80	110	2
chr2	370	600	4
chrX	1420	1490	9

There are multiple examples of different ways that we might want to merge these files. For example, we might want to merge the files by taking the maximum count for each overlapping region, and we might want to report the extended regions from the intersection rather than the original start/end coordinates from each file. We might also want to only report regions with overlaps from all input files. The merged file would then be:

#chrom	start	end	count
chr1	80	150	3
chr2	370	600	4
chrX	1420	1490	9

To do this using bedtools and awk we could use the following command:

bedtools intersect -wa -wb -a source_a.bed -b source_b.bed | \
    awk 'BEGIN{OFS="\t"}{print $1,($2<$6)?$2:$6,($3>$7)?$3:$7,($4>$8)?$4:$8}' \
    > merged_max_outer_bedtools_awk.bed

or using jointer:

jointer \
    -j 1,eq 2-3,ov \
    -c 1,min(2),max(3),max(4) \
    source_a.bed source_b.bed > merged_max_outer_jointer.bed

Or similarly we might want to merge the files by taking the sum of the counts, but only reporting the shared parts of the regions that overlap

#chrom	start	end	count
chr1	100	110	5
chr2	570	600	5
chrX	1430	1450	17

bedtools intersect -wa -wb -a source_a.bed -b source_b.bed | \
    awk 'BEGIN{OFS="\t"}{print $1,($2>$6)?$2:$6,($3<$7)?$3:$7,$4+$8}' \
    > merged_sum_inner_bedtools_awk.bed

jointer \
    -j 1,eq 2<>3 \
    -c 1,max(2),min(3),sum(4) \
    source_a.bed source_b.bed > merged.bed

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data/readme_examples		data/readme_examples
python/jointer		python/jointer
src		src
test		test
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

jointer

Example use cases

Simple: Mapping country codes to country names

Multi-column merging and aggregating of two sorted tables

Merging BED files sorted by chrom, start, end

About

Releases

Packages

Languages

biermanr/jointer

Folders and files

Latest commit

History

Repository files navigation

jointer

Example use cases

Simple: Mapping country codes to country names

Multi-column merging and aggregating of two sorted tables

Merging BED files sorted by chrom, start, end

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages