Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sam_hdr_remove_lines is inefficient if original header contains a lot of lines #1460

Closed
astaric opened this issue Jun 23, 2022 · 1 comment · Fixed by #1662
Closed

sam_hdr_remove_lines is inefficient if original header contains a lot of lines #1460

astaric opened this issue Jun 23, 2022 · 1 comment · Fixed by #1662
Assignees

Comments

@astaric
Copy link

astaric commented Jun 23, 2022

Steps to reproduce

  1. Create an input.bam file with ~1.5M @RG header lines. it can contain a single read, it does not really matter
  2. Create a filter.txt file with a single line containing one of the @RG values
  3. Run samtools view -R filter.txt input.bam

Expected
command should complete in <1min

Actual
command takes ~6 hours before it starts writing to output file

Notes
I dug through the code, the problem seems to be that sam_hdr_remove_lines calls sam_hrecs_remove_line for every header line that needs to be removed. sam_hrecs_remove_line is pretty quick, but it calls sam_hrecs_remove_hash_entry which removes the entry from the hrecs->rg array. Since removing an element from an array is O(n) operation, the process of removing (almost) all the lines becomes O(n^2).

If there was a way to "rebuild" the hrecs->rg array in a single pass once all the lines are removed (instead of updating it for every removed line), the whole process could become much faster.

@daviesrob
Copy link
Member

Yes, this will be slow if you try to remove a huge number of records. It should be possible to make it defer the array rebuild to the end, although we'll have to take care not to access any broken pointers on the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants