You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Create an input.bam file with ~1.5M @RG header lines. it can contain a single read, it does not really matter
Create a filter.txt file with a single line containing one of the @RG values
Run samtools view -R filter.txt input.bam
Expected
command should complete in <1min
Actual
command takes ~6 hours before it starts writing to output file
Notes
I dug through the code, the problem seems to be that sam_hdr_remove_lines calls sam_hrecs_remove_line for every header line that needs to be removed. sam_hrecs_remove_line is pretty quick, but it calls sam_hrecs_remove_hash_entry which removes the entry from the hrecs->rg array. Since removing an element from an array is O(n) operation, the process of removing (almost) all the lines becomes O(n^2).
If there was a way to "rebuild" the hrecs->rg array in a single pass once all the lines are removed (instead of updating it for every removed line), the whole process could become much faster.
The text was updated successfully, but these errors were encountered:
Yes, this will be slow if you try to remove a huge number of records. It should be possible to make it defer the array rebuild to the end, although we'll have to take care not to access any broken pointers on the way.
Steps to reproduce
@RG
header lines. it can contain a single read, it does not really matter@RG
valuessamtools view -R filter.txt input.bam
Expected
command should complete in <1min
Actual
command takes ~6 hours before it starts writing to output file
Notes
I dug through the code, the problem seems to be that
sam_hdr_remove_lines
callssam_hrecs_remove_line
for every header line that needs to be removed.sam_hrecs_remove_line
is pretty quick, but it callssam_hrecs_remove_hash_entry
which removes the entry from thehrecs->rg
array. Since removing an element from an array isO(n)
operation, the process of removing (almost) all the lines becomesO(n^2)
.If there was a way to "rebuild" the
hrecs->rg
array in a single pass once all the lines are removed (instead of updating it for every removed line), the whole process could become much faster.The text was updated successfully, but these errors were encountered: