query -H format improvement #1856

Closed
janxkoci opened this issue Feb 5, 2023 · 2 comments
janxkoci commented Feb 5, 2023

The problem

I would like to propose a small tweak to bcftools query with the -H flag, which prints header names as the first line of output. Currently, the header line begins with "# " (a hash sign followed by a space):

$ bcftools query -H -f '%CHROM' input.vcf | head -2
# CHROM
chr1

This can confuse many downstream tools trying to parse data into columns. The first line will appear to have one more column in the eyes of many standard tools, such as awk, cut, datamash, R, and others, including spreadsheet apps.

Consider the following example:

$ bcftools query -H -f '%CHROM %POS %REF %ALT\n' input.vcf | awk 'NR < 4 {print "ncol:", NF, "(col $1: "$1")"}'
ncol: 5 (col $1: #)
ncol: 4 (col $1: chr1)
ncol: 4 (col $1: chr1)

In this toy example, the first line has one more column than the data lines, and the name of the first column is "#" rather than "CHROM", as requested in the query format. This makes the header less useful by default and requires additional processing. For example, I often end up piping through sed:

bcftools query -H -f '%CHROM %POS %REF %ALT\n' input.vcf | sed '1s/# //' | awk ...
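A slightly more general form of that workaround can be sketched as follows. It strips a leading hash sign and any spaces after it from the first line only, so it works on both "# CHROM" and "#CHROM" headers. The helper name strip_header_hash is hypothetical, and the printf line is a simulated stand-in for the bcftools output shown above, not real output.

```shell
# strip_header_hash: hypothetical helper, not part of bcftools.
# Removes a leading "#" and any spaces that follow it, on line 1 only.
strip_header_hash() {
    sed '1s/^# *//'
}

# Simulated query output (assumed shape, mirroring the example above).
printf '# CHROM POS REF ALT\nchr1 100 A G\n' |
    strip_header_hash |
    awk '{print "ncol:", NF, "(col $1: " $1 ")"}'
# prints:
# ncol: 4 (col $1: CHROM)
# ncol: 4 (col $1: chr1)
```

Restricting the substitution to line 1 leaves any later line that happens to start with "#" untouched.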

The proposed change

Depending on your preference regarding the hash sign # in the header, I propose removing either the space following the hash sign, or both the hash sign and the space. This would result in a name such as #CHROM or just CHROM, respectively.

pd3 closed this as completed in 02a3961 on Feb 6, 2023
Member

pd3 commented Feb 6, 2023

Fair point. Note that the problem is caused by the choice of space as a delimiter and would not occur in tab-delimited output, such as %CHROM\t%POS. In general, using space as a delimiter is problematic because spaces within values are, sadly, permitted by the VCF specification.
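The delimiter point can be illustrated with a small sketch that needs no VCF file. The two helper functions are hypothetical and emit simulated lines in the shape shown earlier in this thread; real bcftools output may differ in detail.

```shell
# Simulated header + record lines (assumed shape, not real bcftools output).
space_lines() { printf '# CHROM POS\nchr1 100\n'; }
tab_lines()   { printf '# CHROM\tPOS\nchr1\t100\n'; }

# Space-delimited, default awk splitting: the header has one extra field.
space_lines | awk '{print NF}'
# prints 3, then 2

# Tab-delimited, split on tabs only: "# CHROM" is one field, so counts agree.
tab_lines | awk -F'\t' '{print NF}'
# prints 2, then 2
```

Even with tab splitting, the first column name is still "# CHROM" rather than "CHROM", so only the column count is fixed by the choice of delimiter.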

As for solutions, the leading hash character # is a common way to separate header and comment lines from data lines, so it will stay, but we can remove the space that follows it. This has been done just now in 02a3961.

Thanks for reporting the issue.

Author

janxkoci commented Feb 6, 2023

I should have expected the argument about space as a delimiter, but note that to awk (my primary work language now) spaces and tabs are the same by default. I usually use tabs as delimiters. I was (somewhat deliberately) lazy in my example and went with spaces, but I also chose awk to show a tool where the difference doesn't matter. I could have been more explicit, though.
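A minimal sketch of that behaviour, using simulated tab-delimited lines (hypothetical helpers, not real bcftools output): with awk's default field splitting, the old "# CHROM" header still yields an extra field, while the "#CHROM" form matches the data lines.

```shell
# Simulated tab-delimited output with the old and the new header style.
old_header() { printf '# CHROM\tPOS\nchr1\t100\n'; }
new_header() { printf '#CHROM\tPOS\nchr1\t100\n'; }

# Default FS treats runs of spaces and tabs alike.
old_header | awk '{print NF}'
# prints 3, then 2

new_header | awk '{print NF}'
# prints 2, then 2
```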

And thanks for the fix ☺️
