-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for writing Anopheles SNP data to the plink binary file format #515
Add support for writing Anopheles SNP data to the plink binary file format #515
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Hey Tristan. Nice work! Ill save comments for now but FYI - when you add notebooks to malariagen_data, make sure you have cleared all outputs, otherwise they can become quite hefty in size and then the repo balloons in size over time (all of it is stored in git history). |
…s, avoids using dask
…variant_allele mapping
I've found the source of the AssertionError (also see issue #516) - something to do with how I haven't managed to get to the bottom of it yet but in this PR there's a temporary fix that just applies |
Removed numba decorator for the apply_allele_mapping function, for now
I've (I hope!) made a fix to the above error (issue #516), and described it in more detail there |
I think it should work. Feel free to mark it as "Ready for review" when you think that it is appropriate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tristanpwdennis. Before we merge this PR, could I ask you to do some clean-up?
test-1.ipynb
needs to be removed and you made some changes to .gitignore
and test_snp_data.py
. There are also quite a few print
commands that need to be removed or switched to debug mode.
Could you also add a test that checks (at least) that the file is created?
malariagen_data/anoph/to_plink.py
Outdated
# Filter SNPs for segregating sites only | ||
with self._spinner("Subsetting to segregating sites"): | ||
gt = ds_snps["call_genotype"].data.compute() | ||
print("count alleles") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this print
statement?
malariagen_data/anoph/to_plink.py
Outdated
print("count alleles") | ||
with ProgressBar(): | ||
ac = allel.GenotypeArray(gt).count_alleles(max_allele=3) | ||
print("ascertain segregating sites") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here.
malariagen_data/anoph/to_plink.py
Outdated
& (ac[:, 0] <= max_ref_ac) | ||
& (an_missing <= max_missing_an) | ||
) | ||
print(f"ascertained {np.count_nonzero(loc_sites):,} sites") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here.
malariagen_data/anoph/to_plink.py
Outdated
print(f"ascertained {np.count_nonzero(loc_sites):,} sites") | ||
|
||
# Set up dataset with required vars for plink conversion | ||
print("Set up dataset") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here.
tests/anoph/test_snp_data.py
Outdated
@@ -88,6 +88,7 @@ def test_open_snp_sites(fixture, api: AnophelesSnpData): | |||
assert "variants" in contig_grp | |||
variants = contig_grp["variants"] | |||
assert "POS" in variants | |||
assert False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what this is here for.
Thank you very much @tristanpwdennis. This is great. I added a few comments where there are still some |
Hi Jon, |
Hi @jonbrenas, I had a tidy and removed some redundant code from to_plink.py. I also added a test (test_plink_converter.py) to make sure the files are created. Let me know how everything looks & if this is sufficient, or if I can add any more tests. Hope this works ok! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tristanpwdennis, all checks are passing in the shadow PR so we're looking great here and almost ready to merge!
Just a couple more things here to add documentation for the new function...
Btw have taken the liberty to edit the PR title and description just so this PR will look good in the release notes :) |
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
…anpwdennis/malariagen-data-python into plink-converter-2024-03-26
Addressed comments and feeling the urge to merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tristanpwdennis, noticed a couple more small things. Promise then I'll stop nit picking! If we could make these changes, then get all the checks to pass over in the shadow PR, happy to merge.
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Accepted corrections :) |
Thanks @tristanpwdennis. You probably saw that the tests failed. It looks like the |
EDIT: I see this now on github, let me try to fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good to me, happy to be merged once checks have passed.
Wooohoo! |
Hooray, thanks guys! I wonder if I get a GH badge for being the longest open PR on a given repo... |
This PR adds a new function
biallelic_snps_to_plink()
to theAg3
andAf1
APIs which allows for selection of a set of biallelic SNPs for export to the plink binary file format. This then allows for data to be imported into tools like ADMIXTURE which can read the plink format.Resolves #248.