match reads with genes:gloc = genes_pop(code & (1 << 30) - 1) KeyError: 3603 #188

Open
Dexter5577 opened this issue Nov 13, 2023 · 10 comments

@Dexter5577

Hi:
Thank you for the wonderful tool.
When I tried to match reads with genes, I ran into the following error with my command. I would be extremely grateful if you could provide an answer.
/share/pub1/lijq/lijq/bowie2/py311/bin/woltka classify -i /share/pub1/lijq/lijq/bt2out/R17001923LR01.sam -c /share/pub1/lijq/lijq/refseq/coords.txt -o /share/pub1/lijq/lijq/gene_merge/R17001923LR01.output.biom

The error report is:

File "/share/pub1/lijq/lijq/bowie2/py311/lib/python3.11/site-packages/woltka/ordinal.py", line 108, in flush
for read, gene in match_read_gene(queue):
File "/share/pub1/lijq/lijq/bowie2/py311/lib/python3.11/site-packages/woltka/ordinal.py", line 380, in match_read_gene
gloc = genes_pop(code & (1 << 30) - 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 3603
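
For readers following the traceback: in Python, `-` binds tighter than `&`, so `code & (1 << 30) - 1` keeps the low 30 bits of `code`. The sketch below illustrates that masking and how popping a key that is not in a mapping raises `KeyError`. It is a minimal illustration only: the packing scheme, the `genes` dict and the `low30` helper are hypothetical, and `genes_pop` is merely assumed to behave like a bound `dict.pop`. This is not Woltka's actual implementation.

```python
# Hypothetical illustration; not taken from Woltka's source.
MASK = (1 << 30) - 1  # 0x3FFFFFFF, i.e. the low 30 bits set


def low30(code):
    """Return the low 30 bits of an integer code."""
    # `-` has higher precedence than `&`, so this is code & ((1 << 30) - 1).
    return code & (1 << 30) - 1


# A plain dict stands in for whatever mapping `genes_pop` is bound to (assumption).
genes = {7: 100}
genes_pop = genes.pop

code = (1 << 30) | 7  # some flag bit above bit 29, plus index 7
first = genes_pop(low30(code))  # key 7 is present, so this succeeds

try:
    genes_pop(low30(code))  # key 7 was already removed
except KeyError as err:
    missing = err.args[0]  # the key that could not be found
```

Under these assumptions, the number in `KeyError: 3603` would be the masked index that was not present in the mapping at the time of the lookup.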

@qiyunzhu
Owner

Hello @Dexter5577, thanks for your interest in our program! It could be that the coordinates file and the alignment file don't match. Can you provide parts of both files (e.g., head -n 10) for diagnosis?

@Dexter5577
Author

Hi:
Thank you for your prompt reply.
The coordinates file is:

PVX_087670 9352 9696
PVX_087675 13393 13602
PVX_087675 13092 13214
PVX_087680 23278 25230
PVX_087685 32100 32174
PVX_087685 31876 31981
PVX_087685 31437 31538
PVX_087685 31271 31311
PVX_087690 33831 39329

and the alignment file is:

@HD VN:1.5 SO:unsorted GO:query
@SQ SN:G000002415 LN:27068631
@SQ SN:G000002435 LN:12078866
@SQ SN:G000002445 LN:26075714
@SQ SN:G000002495 LN:40980161
@SQ SN:G000002515 LN:10729567
@SQ SN:G000002525 LN:20551017
@SQ SN:G000002545 LN:12338568
@SQ SN:G000002655 LN:29385098
@SQ SN:G000002715 LN:27862281

@qiyunzhu
Owner

@Dexter5577 Can you please do grep -v ^'@' align.sam | head? Thanks! And sorry for not being explicit enough.

@Dexter5577
Author

Of course. Thanks for your patience. The file is:

ST-E00211:363:HCYTMCCXY:2:1101:16599:68289 77 * 0 0 * * CCCCAGTCAGAATGATGAATATAAGACTGCTGAATTTCATTTGCCCTTTGATGCTTAGTGCTAGGAGGCGAAGACAAAAAGCGCCACTTGGGCTGTGTGAGAATATATACATCGCAGCAGCTATGATTCGTGCTTAGTCCTGGCGTGGT <<AAFJJAJJFJF7AA7-FFFAJJFFJ<<A-AFJJJ-F<JFAFJ-A7FJ-F-<F7F-F77AJAA--7---7<-FJ<JF--<F-7AJ<-<F-F--77<7-<-7-A-7--7FFAJ---7A<--7FA-----7F--7-7--7--7-77AA-7 YT:Z:UP
ST-E00211:363:HCYTMCCXY:2:1101:16599:68289 141 * 0 0 * * GGATGACATGCGAAAGCCACAGCTACCGGCTTTAGAGCTAATTAGGGCTGCGGCGTGTGTGTGATTGTCAATATACACATTTTGAAGTTTGTGAGTGCGACGTGTAGGGATAGAGTGTAGATTTCGGTGGCCGTTGTATAATTAAATAT AAAA-AA-<-77-----<FJ-7-<--A77--7-------77F-<-7F--A7-<77777-7<-77AJ-<<------<7A---7-7<---<7<----7-7F-A<7--FJ----7----77FAF7-7FA--<F---<)7--7--7---A-77 YT:Z:UP
ST-E00211:363:HCYTMCCXY:2:1101:20161:68517 77 * 0 0 * * AGGGAGGCAGCCGATTACTGAAGCGATGTGAGCTCAGTTACTCTGGTAGGATTTTAGAAGGTCTTGTGTCAGGGATGGGGACATAGAGGAGAGAGGAGATGTAGAGGGGATGAATCATCCAGGAACGGCGGAGAAAGGAGCGGAAGAGCA A-AFAFFJJF-7FAFFJJJJJJ-A--<7F77FJFJJJJ-FF-F<77-<FJ--7AFJF-FJFFF7-F-FFA-AA-----7-7FJ---<A-7-7--<-7<--AA--A7<7A7-AF<7-A-7------7-7---7------7-7<7FFF---F YT:Z:UP
ST-E00211:363:HCYTMCCXY:2:1101:20161:68517 141 * 0 0 * * TTCTTCACCTCTTCAGTGTGTTTCATTCCCCAATAATTTTCTGTTTTTGTTGTGGCCCTGTCCCCTGCTCCAATCCCTCTAAAATTCTTCCCAAATGCCGTGACTCAGCCTTTTTCATTAATTCGCCAACTTCATTGAAACGGT A-A7-----<----7---7---<----<FF7---<F<-<-<--<--<---------7-7----7-----7-----7------7<----777---------7-----7-77--7-7--7--777-7--------------7---7 YT:Z:UP
ST-E00211:363:HCYTMCCXY:2:1101:16589:68553 77 * 0 0 * * GGAAAGAGGCACCCACTGCCCACTGATGGTTATGTCGCATTCCAGCACTCCAAGCTCTGGACTCAGGTTATGGCAGAGAAACAATGCTATGATAAGAAGGAGGAAGTATATCGGAAGGGCACACGGCTGAACTTCAGTCACATCACTATC A-AAAJFAFJ<FJ7JF7AAFFAJJ--A7-<--7J<J<-FJ7-F7FAJ--7-7--7F77A-<----F-A-7-<7JF-7AJFA-<FJ-77<7-7-AJAA-<FAJ-7A7--7A77F-------A--J--7<--77A-77-A-7<--7FF--7< YT:Z:UP
ST-E00211:363:HCYTMCCXY:2:1101:16589:68553 141 * 0 0 * * AGTTGGTCTTTCTTCTCATGGAATTGTGTCTCTGCCATTAGCTGATTCCAGAGATGAGTGTGTTTGCATGGGACCTAGCCATGTGTGGGCAGTGGGGGTGGCTAGCCAGATCGAAAAGGAGTCGAGTGGGGATATTGTAGAGAG ----A----<F-F--A-<J-<F7F-FJ--<A--AJ<---<--<FJ--<A-<--<7-----<7---<FA<--7A---7-----7-<A77AAJA<-<---7-------------7-----7-7--7-------7-A-7---777-7 YT:Z:UP
ST-E00211:363:HCYTMCCXY:2:1101:6451:68922 77 * 0 0 * * CTCGCTTCTGTTACTCGGACGCGCCGGAGGTCCTCTTGGAGAATCTTCTTTCTTCTCTCTACTGGCAGTTGCTTCCTCTTTATTCGCACCTGGAATCAGGGAAGGTGCCAGGACAGACGTATTCCCTCCTGGGGTAGCCATGTGGTCAGA AAFFF<AJJJJJJJJF-7AJJ-FJA-FFJ-<A-FJJF-A<<-7<<JF<7---FFFJFFAFJJJ--A--JF-7F-<-F7<FJJJJ--7AFA77-AFJ<-<-AF-AFFJ--7J-A7FF-<----<7AFFAAF-7A77<AJFFFF-A-7AA7F YT:Z:UP
ST-E00211:363:HCYTMCCXY:2:1101:6451:68922 141 * 0 0 * * GACCACATGTCTATCCCAGGAGGGAGTCAGTCTGTCCTGGCACCTTCCCGGATTCCGGGTGTGAATAAAGGGGAAGAAACGGAAAGTAGAGAGGAGAAATAAGGTTCGCTAAGGTGGAGGCCGGCGCGTCAGAGTATCATAAGCGAGGGA AAA--FJJF-<F-F-<FJ-F-7AJJ-7JJ<A--<<---77-A-J7FAJF-AFJ-7---77-FF7F---AF-<F7F<7-<-7-AAA<<A777AJ-FJJ-7------<7-7F---7-7-7--7A7F7FFAFJJ<F7---7---FFF7FJ))7 YT:Z:UP
ST-E00211:363:HCYTMCCXY:2:1101:8420:69062 77 * 0 0 * * CCAGAATAGAGCAATTATTATCAAGGGCCATATTTTCATTAGGGAGGGGGGTGAGAAAATTTTATTCCCAAAGGTCCTTGGTGGTGTGGGTATGAAATAAAAGACATTCGCAGAATACGGTATTCTAGTTTGGAGTTTTTAAACGTATAT <<AFFAJJJ-JAJ<JJ<-F<JF7FJ7JJJ-7<FJFJJ<7F-<FFFFF-7FAFJJAJJ-A<-<--7-<AF-FJJJ---7A77AJJ<-777FF-F-AFJJJJ<---7AJ-<-7FJ-AJF---7F-7FA-777-A---7-7AFA7--7A-A-7 YT:Z:UP
ST-E00211:363:HCYTMCCXY:2:1101:8420:69062 141 * 0 0 * * GATGTTTGGGTGTATTATGGCTAAAAAGGGTGTTATAACCTAGCTACCATTTACGGCATGGTATCTATAGAATAGACGTGTAAAGACTCCAAATTAGATTATAGTATTCTCAGAGTAGAGTTTATCGCATACCCTCACCTCAAAGGTCC A-<-7FJJ7-<F-F<J-FAFJJ---<<-7---7-<<--7F7FJJJ-FA-<JA<-77<J-7--<J7-7--<-FA--<FJF-7F7J-A7<AA-7AJ--7A-F<--7-7AA7<7--<----7-<<J----77-A-<A-)7A)-77--77-7< YT:Z:UP
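
The records above all carry FLAG values 77 or 141 and `*` as the reference name. A short sketch for decoding those flags, using the bit definitions from the SAM specification; the `decode_flag` helper and its label names are illustrative, not part of any tool mentioned in this thread.

```python
# Bit meanings of the SAM FLAG field, per the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def is_unmapped(record):
    """True if a SAM record line has the 0x4 (read unmapped) bit set."""
    fields = record.split()  # FLAG is the second whitespace-separated field
    return bool(int(fields[1]) & 0x4)


flags_77 = decode_flag(77)    # 64 + 8 + 4 + 1
flags_141 = decode_flag(141)  # 128 + 8 + 4 + 1
```

Both values decode to a paired read that is itself unmapped and whose mate is unmapped (first and second in pair, respectively). So the ten lines shown are SAM records for reads that did not align; they say nothing about whether later records in the file are mapped.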

@qiyunzhu
Owner

@Dexter5577 Sorry, I meant the alignment file (SAM), not the sequencing data (FASTQ).

@Dexter5577
Author

I am really sorry, but this is actually my alignment file. The file is attached here. Sorry again for wasting your time.
R17001923LR01.zip

@qiyunzhu
Owner

Okay I see a plausible reason: In the alignment file, the subject IDs are like "G000006565". However in the coordinates file the gene IDs are like "PVX_087670". I suspect that it was created using a different database. You may want to check it.
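
One way to check this kind of mismatch is to compare the subject IDs used by mapped SAM records against the genome IDs in the coordinates file. The sketch below is a rough illustration under stated assumptions: SAM fields are tab-separated, and genome IDs in the coordinates file sit on header lines starting with `>` (this prefix is an assumption and is exposed as the `header_prefix` parameter). All function names and the sample lines are hypothetical.

```python
def sam_subject_ids(lines):
    """Collect reference names (RNAME, column 3) from mapped SAM records."""
    ids = set()
    for line in lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[2] == "*":  # unmapped or malformed record
            continue
        ids.add(fields[2])
    return ids


def coords_genome_ids(lines, header_prefix=">"):
    """Collect genome IDs from header lines (prefix is an assumption)."""
    ids = set()
    for line in lines:
        if line.startswith(header_prefix):
            ids.add(line[len(header_prefix):].strip())
    return ids


def missing_subjects(sam_lines, coord_lines, header_prefix=">"):
    """Subject IDs that appear in the SAM records but not in the coordinates."""
    return sam_subject_ids(sam_lines) - coords_genome_ids(coord_lines, header_prefix)


# Made-up sample data, for illustration only.
sam_sample = [
    "@SQ\tSN:G000000001\tLN:1000\n",
    "read1\t0\tG000000001\t10\t42\t4M\t*\t0\t0\tACGT\tFFFF\n",
    "read2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF\n",
]
coords_sample = [">G000000002\n", "geneA\t1\t100\n"]
unmatched = missing_subjects(sam_sample, coords_sample)
```

With real files, the same functions can be fed open file handles, e.g. `missing_subjects(open("align.sam"), open("coords.txt"))`. An empty result would suggest the two files use the same genome IDs.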

@Dexter5577
Author

Thank you very much! I will examine these two files carefully. Both of them were generated from files in the RefSeq database.

@Dexter5577
Author

When I checked these two files, I couldn't find the error. The coordinates file was generated with the following command:
refseq_build.py -p standard -o .
The coordinates file is 2.65 GB, and when I examined it carefully, it also has IDs like "G000006565"; entries such as PVX_087670 are just the second-level gene coordinates listed under each ID like "G000006565".
[three images attached]

@qiyunzhu
Owner

Thanks for the additional information. The files you showed look correct. Let me look into this issue more carefully. All I can think of now is some potential formatting issue.
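
A rough sketch of how one might scan a coordinates file for formatting irregularities. It assumes each gene line is `gene<TAB>start<TAB>end` and that genome header lines start with `>` or `#`; both are assumptions, and `scan_coords` is a hypothetical helper. It also reports gene IDs repeated under one genome, since the excerpt earlier in this thread lists PVX_087675 twice and PVX_087685 four times. Whether repeated IDs matter to Woltka is not established in this thread.

```python
from collections import Counter


def scan_coords(lines, header_prefixes=(">", "#")):
    """Return (line_number, message) pairs for lines that look irregular."""
    problems = []
    seen = Counter()  # gene IDs seen under the current genome header
    for num, line in enumerate(lines, start=1):
        if line.endswith("\r\n"):
            problems.append((num, "CRLF line ending"))
        text = line.rstrip("\r\n")
        if not text:
            continue
        if text.startswith(header_prefixes):  # assumed genome header line
            seen.clear()
            continue
        fields = text.split("\t")
        if len(fields) != 3:
            problems.append((num, "expected 3 tab-separated fields"))
            continue
        gene, start, end = fields
        if not (start.isdigit() and end.isdigit()):
            problems.append((num, "non-integer coordinate"))
        seen[gene] += 1
        if seen[gene] == 2:  # report each repeated ID once per genome
            problems.append((num, f"gene ID repeated: {gene}"))
    return problems


# Made-up sample data, for illustration only.
sample = [
    ">G000000001\n",
    "geneA\t1\t5\n",
    "geneA\t7\t9\n",
    "geneB 1 5\n",
    "geneC\tx\t9\r\n",
]
report = scan_coords(sample)
```

For a multi-gigabyte file, passing an open file handle keeps memory use low because lines are read one at a time.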
