Positional arguments (especially seqkit_stats_nosecondary) in duplex_tools assess_split_on_adapter #40

Open
rocpv1977 opened this issue Apr 3, 2023 · 1 comment

Comments

@rocpv1977

Hi!

I am trying to assess how well duplex_tools split_on_adapter is doing its job, and duplex_tools assess_split_on_adapter asks for the following positional arguments:
seqkit_stats_nosecondary
edited_reads
unedited_reads
split_multiple_times

I imagine the last three are the .pkl files that are created in the folder for split files, but I am not sure what "seqkit_stats_nosecondary" is. I have tried passing the output of

seqkit stats path/to/file --all

and

seqkit stats path/to/file --all

but I get this error:

/media/seq-ur/65225E7076CF2AF3/basecalling_bacterias/K_oxytoca/K_oxytoca_29_03_2023/pass/split/seqkit_stats contains 1 reads
Traceback (most recent call last):
File "/home/seq-ur/venv/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3652, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'read'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/seq-ur/venv/bin/duplex_tools", line 33, in <module>
sys.exit(load_entry_point('duplex-tools==0.3.2', 'console_scripts', 'duplex_tools')())
File "/home/seq-ur/venv/lib/python3.9/site-packages/duplex_tools/__init__.py", line 39, in main
args.func(args)
File "/home/seq-ur/venv/lib/python3.9/site-packages/duplex_tools/assess_split_on_adapter.py", line 129, in main
assess(
File "/home/seq-ur/venv/lib/python3.9/site-packages/duplex_tools/assess_split_on_adapter.py", line 32, in assess
txt = txt[txt['read'].isin(expected_read_ids)]
File "/home/seq-ur/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 3760, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/seq-ur/venv/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3654, in get_loc
raise KeyError(key) from err
KeyError: 'read'

Could you help me understand what "seqkit_stats_nosecondary" is?

Thanks!

@ollenordesjo
Contributor

Hi @rocpv1977!

Thanks for the question. You're definitely on the right track. The seqkit_stats_nosecondary argument expects the output of seqkit bam (not seqkit stats), run on a BAM file that contains no secondary alignments. If your alignment was done in a way that includes secondary alignments, you would need to filter them out first, for example with samtools view:

samtools view -b -F 256 input.bam > nosecondary.bam
seqkit bam nosecondary.bam 2> nosecondary.txt
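
The KeyError: 'read' in the traceback comes from the lookup txt['read'], so a quick check of the header row can tell you whether a file has the expected column before you run assess_split_on_adapter. The following is only a minimal sketch: the helper name has_read_column is hypothetical, the header lines are made-up illustrations rather than real tool output, and it assumes a tab-separated file with a header row. It also matches the column name case-insensitively, which may or may not reflect how duplex_tools itself parses the file.

```python
import csv
import io


def has_read_column(handle, delimiter="\t"):
    """Return True if the header row has a column named 'read' (case-insensitive).

    Hypothetical helper, not part of duplex_tools.
    """
    # Read only the first row; an empty file yields an empty header.
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return any(name.strip().lower() == "read" for name in header)


# Illustrative header lines only (assumptions, not captured tool output).
seqkit_stats_like = io.StringIO("file\tformat\ttype\tnum_seqs\tsum_len\n")
seqkit_bam_like = io.StringIO("Read\tRef\tPos\tEndPos\tMapQual\n")

print(has_read_column(seqkit_stats_like))  # False
print(has_read_column(seqkit_bam_like))    # True
```

For a real file you could call it as has_read_column(open("nosecondary.txt")). A False result would suggest the file is not the per-read table that assess_split_on_adapter is looking for.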

Excuse the confusing naming and the lack of documentation regarding this. It's worth tidying up.

Best regards
