Add support for writing Anopheles SNP data to the plink binary file format #515

Merged

tristanpwdennis
Contributor

@tristanpwdennis tristanpwdennis commented Mar 26, 2024

This PR adds a new function, biallelic_snps_to_plink(), to the Ag3 and Af1 APIs. It selects a set of biallelic SNPs and exports them to the plink binary file format, so the data can be imported into tools such as ADMIXTURE that read plink files.
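
For readers unfamiliar with the target format, here is a minimal sketch of how biallelic genotype calls are packed into the body of a plink `.bed` file. It is not the implementation in this PR: the helper name `encode_bed` and its input convention (per-sample counts of the A1 allele, `None` for a missing call) are assumptions made for illustration.

```python
# Two-bit codes defined by the plink 1 .bed format, keyed by the number of
# copies of the A1 allele (the first allele listed in the .bim file).
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # magic number + variant-major mode
CODES = {2: 0b00, None: 0b01, 1: 0b10, 0: 0b11}


def encode_bed(a1_counts_by_variant):
    """Pack per-variant lists of A1 allele counts into .bed bytes.

    Hypothetical helper: each inner list holds one value per sample
    (0, 1, 2, or None for a missing call).
    """
    out = bytearray(BED_MAGIC)
    for counts in a1_counts_by_variant:
        # Four samples per byte; the first sample uses the lowest two bits.
        for start in range(0, len(counts), 4):
            byte = 0
            for offset, count in enumerate(counts[start:start + 4]):
                byte |= CODES[count] << (2 * offset)
            out.append(byte)
    return bytes(out)
```

For one variant and five samples, `encode_bed([[2, 1, 0, None, 0]])` yields the three magic bytes followed by `0x78` and `0x03`; unused bits in the last byte of a variant stay zero.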

Resolves #248.

@tristanpwdennis tristanpwdennis marked this pull request as draft March 26, 2024 11:02
@sanjaynagi
Collaborator

Hey Tristan. Nice work!

I'll save comments for now, but FYI: when you add notebooks to malariagen_data, make sure you have cleared all outputs. Otherwise the notebooks can become quite hefty, and the repo balloons over time because every version is stored in git history.
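
As a rough illustration of what clearing outputs means at the file level, the sketch below strips outputs and execution counts from a notebook using only the standard library. The function name `clear_notebook_outputs` is hypothetical, and the sketch assumes the nbformat 4 layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`).

```python
import json
from pathlib import Path


def clear_notebook_outputs(path):
    """Remove outputs and execution counts from every code cell, in place."""
    path = Path(path)
    nb = json.loads(path.read_text(encoding="utf-8"))
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Write the stripped notebook back over the original file.
    path.write_text(json.dumps(nb, indent=1) + "\n", encoding="utf-8")
    return nb
```

In practice `jupyter nbconvert --clear-output --inplace notebook.ipynb` or the "Clear All Outputs" menu item in Jupyter does the same job.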

@tristanpwdennis
Contributor Author

I've found the source of the AssertionError (also see issue #516) - something to do with how dask.array.map_blocks computes variant_allele at line 1629 of snp_data.py.

I haven't got to the bottom of it yet, but this PR includes a temporary fix that applies apply_allele_mapping to an in-memory NumPy array of variant_allele. I've also added biallelic_snp_calls to to_plink.py instead of calling snp_calls and thinning the result manually.

@tristanpwdennis
Contributor Author

I've (I hope!) made a fix for the above error (issue #516), and described it in more detail there.

@jonbrenas
Collaborator

I think it should work. Feel free to mark it as "Ready for review" when you think that it is appropriate.

Collaborator

@jonbrenas jonbrenas left a comment

Hi @tristanpwdennis. Before we merge this PR, could I ask you to do some clean-up?
test-1.ipynb needs to be removed and you made some changes to .gitignore and test_snp_data.py. There are also quite a few print commands that need to be removed or switched to debug mode.
Could you also add a test that checks (at least) that the file is created?
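
One way to switch progress messages to debug mode, sketched with the standard `logging` module. The logger name and the function `count_segregating` are hypothetical and stand in for the code under review; the project may well use its own debug helper instead.

```python
import logging

logger = logging.getLogger("to_plink_example")  # hypothetical logger name


def count_segregating(allele_counts):
    """Count sites where more than one allele is observed."""
    logger.debug("count alleles")
    n_sites = sum(
        1 for ac in allele_counts if sum(1 for c in ac if c > 0) > 1
    )
    # %-style arguments are only formatted if DEBUG logging is enabled.
    logger.debug("ascertained %d sites", n_sites)
    return n_sites
```

Nothing is printed by default; calling `logging.basicConfig(level=logging.DEBUG)` before the function would surface the messages.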

# Filter SNPs for segregating sites only
with self._spinner("Subsetting to segregating sites"):
gt = ds_snps["call_genotype"].data.compute()
print("count alleles")
Collaborator

Can you remove this print statement?

print("count alleles")
with ProgressBar():
ac = allel.GenotypeArray(gt).count_alleles(max_allele=3)
print("ascertain segregating sites")
Collaborator

Same thing here.

& (ac[:, 0] <= max_ref_ac)
& (an_missing <= max_missing_an)
)
print(f"ascertained {np.count_nonzero(loc_sites):,} sites")
Collaborator

Same thing here.

print(f"ascertained {np.count_nonzero(loc_sites):,} sites")

# Set up dataset with required vars for plink conversion
print("Set up dataset")
Collaborator

Same thing here.

@@ -88,6 +88,7 @@ def test_open_snp_sites(fixture, api: AnophelesSnpData):
    assert "variants" in contig_grp
    variants = contig_grp["variants"]
    assert "POS" in variants
    assert False
Collaborator

Not sure what this is here for.

@jonbrenas
Collaborator

Thank you very much @tristanpwdennis. This is great. I added a few comments where there are still some print statements that need to be removed. There is also still a .png file that should be removed. Please feel free to ask if you have a question or if any of the requested changes do not make sense to you.

@tristanpwdennis
Contributor Author

Hi Jon,
Thanks! Will tidy further and add the test. Cheers :)

@tristanpwdennis
Contributor Author

tristanpwdennis commented Sep 19, 2024

Hi @jonbrenas, I had a tidy and removed some redundant code from to_plink.py. I also added a test (test_plink_converter.py) to make sure the files are created. Let me know how everything looks & if this is sufficient, or if I can add any more tests. Hope this works ok!
Thanks
-t
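
A file-creation test of this kind might look roughly like the sketch below. It is not the content of test_plink_converter.py: `write_plink_stub` is a hypothetical stand-in for the converter, used so the example runs without any data access, and the only thing taken as given is that a plink binary fileset consists of `.bed`, `.bim` and `.fam` files.

```python
from pathlib import Path


def write_plink_stub(output_dir, prefix="snps"):
    """Hypothetical stand-in for the converter: writes empty plink files."""
    output_dir = Path(output_dir)
    for ext in (".bed", ".bim", ".fam"):
        (output_dir / f"{prefix}{ext}").touch()
    # Return the common path prefix shared by the three files.
    return str(output_dir / prefix)


def test_plink_files_created(tmp_path):
    """pytest-style test; tmp_path is pytest's temporary directory fixture."""
    prefix = write_plink_stub(tmp_path)
    for ext in (".bed", ".bim", ".fam"):
        assert Path(prefix + ext).exists()
```

Checking existence only confirms that the export ran to completion; comparing the written genotypes against the source calls would be a stronger follow-up test.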

Member

@alimanfoo alimanfoo left a comment

Hi @tristanpwdennis, all checks are passing in the shadow PR so we're looking great here and almost ready to merge!

Just a couple more things here to add documentation for the new function...

malariagen_data/anoph/to_plink.py Outdated
malariagen_data/anoph/to_plink.py Outdated
malariagen_data/anoph/to_plink.py Outdated
malariagen_data/anoph/to_plink.py Outdated
tests/anoph/test_plink_converter.py Outdated
@alimanfoo alimanfoo changed the title Add plink converter function Add support for writing Anopheles SNP data to the plink binary file format Nov 21, 2024
@alimanfoo
Member

Btw have taken the liberty to edit the PR title and description just so this PR will look good in the release notes :)

@tristanpwdennis
Contributor Author

Addressed comments and feeling the urge to merge

Member

@alimanfoo alimanfoo left a comment

Thanks @tristanpwdennis, noticed a couple more small things. Promise I'll stop nitpicking after this! If we could make these changes, then get all the checks to pass over in the shadow PR, happy to merge.

malariagen_data/anoph/plink_params.py Outdated
malariagen_data/anoph/to_plink.py Outdated
malariagen_data/anoph/to_plink.py Outdated
malariagen_data/anoph/to_plink.py Outdated
malariagen_data/anoph/to_plink.py Outdated
tristanpwdennis and others added 5 commits November 27, 2024 08:47
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
@tristanpwdennis
Contributor Author

Accepted corrections :)

@jonbrenas
Collaborator

jonbrenas commented Nov 27, 2024

Thanks @tristanpwdennis. You probably saw that the tests failed. It looks like the tmp_path used during the tests ends up being a pathlib.PosixPath instead of the required str.
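
The mismatch can be reproduced and worked around along these lines. This is a sketch only: `require_str_dir` is a hypothetical stand-in for an API that accepts `str` paths exclusively, and the `tmp_path` value is built by hand to mimic pytest's fixture.

```python
import os
from pathlib import Path


def require_str_dir(output_dir):
    """Hypothetical stand-in for an API that only accepts str paths."""
    if not isinstance(output_dir, str):
        raise TypeError(
            f"output_dir must be str, got {type(output_dir).__name__}"
        )
    return output_dir


# pytest's tmp_path fixture is a pathlib.Path, mimicked here by hand.
tmp_path = Path("some") / "tmp" / "dir"

# Either conversion yields a plain str that the strict API accepts.
as_str = str(tmp_path)
as_fspath = os.fspath(tmp_path)
```

Converting in the test keeps the public signature unchanged; the alternative is to widen the parameter type to accept path-like objects, which is a design decision for the maintainers.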

@tristanpwdennis
Contributor Author

tristanpwdennis commented Nov 27, 2024

EDIT: I see this now on GitHub, let me try to fix it.

Member

@alimanfoo alimanfoo left a comment

All looks good to me, happy to be merged once checks have passed.

@jonbrenas jonbrenas merged commit b7af935 into malariagen:master Dec 2, 2024
@alimanfoo
Member

Wooohoo!

@tristanpwdennis
Contributor Author

Hooray, thanks guys! I wonder if I get a GH badge for being the longest open PR on a given repo...

@alimanfoo alimanfoo added the BMGF-068808 Work supported by BMGF grant INV-068808 (MalariaGEN 2024-2027). label Dec 4, 2024
Development

Successfully merging this pull request may close these issues.

Write SNP calls to plink binary file
5 participants