
normalize-by-median.py --force default should be 'true' #745

Closed
macmanes opened this issue Jan 28, 2015 · 7 comments · Fixed by #843

Comments

@macmanes

In messing around with normalize-by-median.py, I realized that if I killed a run while it was running, then issued a new command using the same output filename, the product of the new command was appended to the results of the first command. I don't think this is ever what the user wants - a set of reads will be duplicated in the normalized output file. I think the output should always be overwritten (--force true), with or without a warning that the file is being overwritten.

grep HWI-ST1097:199:D22V1ACXX:1:1101:1154:2129 khmer.normalized.C50.fastq
>HWI-ST1097:199:D22V1ACXX:1:1101:1154:2129/1
>HWI-ST1097:199:D22V1ACXX:1:1101:1154:2129/2
>HWI-ST1097:199:D22V1ACXX:1:1101:1154:2129/1
>HWI-ST1097:199:D22V1ACXX:1:1101:1154:2129/2

My 2 cents.

@ctb
Member

ctb commented Jan 28, 2015

On Wed, Jan 28, 2015 at 12:11:08PM -0800, Matt MacManes wrote:

[report quoted above, trimmed]

Definitely a bug. thx!

--titus

@mr-c mr-c added the Python label Jan 31, 2015
@mr-c mr-c added this to the 1.3+ milestone Jan 31, 2015
@ctb
Member

ctb commented Feb 8, 2015

I think the fix is to move this open out of the for loop. It's a little sketchy that it's in the for loop, frankly :)

https://github.com/ged-lab/khmer/blob/master/scripts/normalize-by-median.py#L243
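As a rough illustration of that refactor (a sketch only, not the actual normalize-by-median.py code; the function and variable names here are made up), the output handle would be opened once, in write mode, before the loop over input files, so a rerun with the same output name truncates any earlier, partial run instead of appending to it:

# Sketch only, not khmer's code: open the output once, in 'w' mode,
# outside the per-file loop, so reruns overwrite rather than append.
def concatenate_filtered(input_filenames, output_filename, keep_line):
    with open(output_filename, 'w') as outfp:
        for name in input_filenames:
            with open(name) as infp:
                for line in infp:
                    if keep_line(line):
                        outfp.write(line)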

@drtamermansour
Member

I will be working on this (#745).
I agree we have to move the open of the output file outside the loop. However, that alone is not going to solve the problem. The basic idea here is that, as we go through the input files, we append the reads to a single output file.
The default of --force being false allows the user to keep adding normalized reads to the same output file from more input files. I think we should change the default --force to True.
We could also add a check when we write the first read to the output file and throw a warning if the read already exists in the output.
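A rough sketch of the behaviour proposed here (illustrative only; the helper name and messages are not taken from the real script): check whether the output file already exists before opening it, warn the user, and truncate rather than append, bailing out only if overwriting is refused:

# Sketch of the proposed behaviour, not the actual normalize-by-median.py code.
import os
import sys

def open_output(output_filename, force=True):
    if os.path.exists(output_filename):
        sys.stderr.write('warning: %s exists and will be overwritten\n'
                         % output_filename)
        if not force:
            sys.exit('refusing to overwrite %s (use --force)' % output_filename)
    return open(output_filename, 'w')   # 'w' truncates any earlier, partial run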

@ctb
Member

ctb commented Feb 20, 2015 via email

@mr-c
Contributor

mr-c commented Feb 20, 2015

I would call this a bug, and the fix would not break semantic versioning. I will prominently advertise the change in behaviour upon release.

@drtamermansour
Member

This function does not make sense to me:
https://github.com/ged-lab/khmer/blob/master/scripts/normalize-by-median.py#L221
I tracked its implementation here:
https://github.com/drtamermansour/khmer/blob/fix/%23745/khmer/kfile.py#L43
It uses the size of the input files to make a decision about the available disk space!
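For context, the rough shape of such a check is: sum the sizes of the input files, compare that total against the free space on the target filesystem, and warn or abort if the projected output would not fit. A minimal sketch under that assumption (not the kfile.py implementation itself):

# Sketch of a disk-space check of this shape; not khmer's kfile.py code.
import os
import sys

def check_space(input_filenames, force=False):
    needed = sum(os.path.getsize(name) for name in input_filenames)
    st = os.statvfs('.')                  # free space on the current filesystem
    free = st.f_frsize * st.f_bavail
    # Heuristic: assume the output needs roughly as much space as the inputs.
    if needed > free:
        sys.stderr.write('not enough free disk space: %d bytes needed, %d available\n'
                         % (needed, free))
        if not force:
            sys.exit(1)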

@ctb
Member

ctb commented Feb 24, 2015

Yes, that all seems correct to me - what do you disagree with?
