Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for writing Anopheles SNP data to the plink binary file format (shadow PR) #584

Merged
merged 55 commits into from
Dec 2, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
55 commits
Select commit Hold shift + click to select a range
ea54da7
add plink converyt function, faff around with some tests
Mar 26, 2024
6afd608
temp fix for avoiding error in mapping all-allele -> biallelic allele…
Apr 11, 2024
3268640
updated plink converter to use biallelic_snp_calls with tmp. amended …
Apr 11, 2024
ab6f28c
Merge branch 'master' into plink-converter-2024-03-26
tristanpwdennis Aug 8, 2024
5e502bf
Added chunking of allele_mapping array so that da.map blocks works
Aug 8, 2024
31f6469
numba works fine
Aug 8, 2024
4ef5e9c
clear up notebook
Aug 9, 2024
98767fc
Merge branch 'master' into plink-converter-2024-03-26
jonbrenas Aug 20, 2024
af47433
remove dud notebook, tidy up to_plink.py
Aug 27, 2024
96eb157
add plink converter test
Sep 19, 2024
98700e3
update plink converter test to include test for created files
Sep 19, 2024
6db9db8
tidy, add tests, refactor snp_data call
Sep 19, 2024
03ab293
Tidy tests and remove allelism selection from converter, add tests to…
Oct 7, 2024
e7a35c8
move bedreader to bottom of pyproject list
Oct 7, 2024
201f9d4
Merge branch 'master' into plink-converter-2024-03-26
jonbrenas Oct 9, 2024
7823580
Merge branch 'malariagen:master' into plink-converter-2024-03-26
tristanpwdennis Oct 15, 2024
6335208
Change the default chunks value.
jonbrenas Oct 16, 2024
c84cd1a
checkout changes from upstream to snpdata.py
Oct 18, 2024
51f9174
Update tests/anoph/test_plink_converter.py
tristanpwdennis Nov 13, 2024
ae7d474
Update malariagen_data/anoph/to_plink.py
tristanpwdennis Nov 13, 2024
e33bfcd
Update malariagen_data/anoph/to_plink.py
tristanpwdennis Nov 13, 2024
23ff3df
Update malariagen_data/anoph/to_plink.py
tristanpwdennis Nov 13, 2024
f2ceb46
update plink converter
Nov 13, 2024
df75274
fix to_plink.py
Nov 13, 2024
2c13752
fix to_plink and tests
Nov 13, 2024
df936c6
Merge branch 'plink-converter-2024-03-26' of https://github.com/trist…
Nov 13, 2024
90e72d5
update ac and an params to be same as pca, replace out dir
Nov 13, 2024
51d6c56
added overwrite param
Nov 14, 2024
0e196af
Merge branch 'master' into plink-converter-2024-03-26
alimanfoo Nov 19, 2024
ebb7ae8
Add Tristan Dennis as author
alimanfoo Nov 19, 2024
fc70d08
poetry update
alimanfoo Nov 19, 2024
643534e
fix relative import
alimanfoo Nov 19, 2024
8c6dec0
fix test plink converter notebook and run tests
Nov 20, 2024
7d10281
update notebook to fix output dir and relative output path
Nov 20, 2024
38b9510
Update malariagen_data/anoph/to_plink.py
tristanpwdennis Nov 24, 2024
b48abf4
add bool type definition
Nov 24, 2024
cc5ed2d
Merge branch 'plink-converter-2024-03-26' of https://github.com/trist…
Nov 24, 2024
6730123
add docstring
Nov 25, 2024
92e8340
fix function definitions
Nov 25, 2024
fb08e1c
add overwrite to base params and finish docstring
Nov 25, 2024
dab86ec
fix docstring and base params
Nov 25, 2024
033fc8a
update output dir type def
Nov 25, 2024
c3d7118
make plink_params.py containing overwrite and output dir, fix type de…
Nov 25, 2024
d235b76
Update malariagen_data/anoph/plink_params.py
tristanpwdennis Nov 26, 2024
c62307c
Update malariagen_data/anoph/to_plink.py
tristanpwdennis Nov 26, 2024
4605612
Update malariagen_data/anoph/to_plink.py
tristanpwdennis Nov 26, 2024
23d8c3a
Update malariagen_data/anoph/to_plink.py
tristanpwdennis Nov 26, 2024
c562106
Update malariagen_data/anoph/to_plink.py
tristanpwdennis Nov 26, 2024
b7632cc
Linting
jonbrenas Nov 27, 2024
e07f92c
Merge branch 'master' into plink-converter-2024-03-26-tristanpwdennis…
jonbrenas Nov 27, 2024
de0b36c
coerce tmp_path from posix to str
tristanpwdennis Nov 27, 2024
31abfea
Merge branch 'master' into plink-converter-2024-03-26
jonbrenas Nov 28, 2024
7d3e575
Merge branch 'master' into plink-converter-2024-03-26-tristanpwdennis…
jonbrenas Dec 2, 2024
b7af935
Merge remote-tracking branch 'upstream/plink-converter-2024-03-26' in…
jonbrenas Dec 2, 2024
a481570
Merge branch 'plink-converter-2024-03-26-tristanpwdennis-shadow' of g…
jonbrenas Dec 2, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/source/Af1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,7 @@ SNP data access
is_accessible
biallelic_snp_calls
biallelic_diplotypes
biallelic_snps_to_plink

Haplotype data access
---------------------
Expand Down
1 change: 1 addition & 0 deletions docs/source/Ag3.rst
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,7 @@ SNP data access
is_accessible
biallelic_snp_calls
biallelic_diplotypes
biallelic_snps_to_plink

Haplotype data access
---------------------
Expand Down
18 changes: 18 additions & 0 deletions malariagen_data/anoph/plink_params.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
"""Parameters for Plink converter functions."""

from typing_extensions import Annotated, TypeAlias

overwrite: TypeAlias = Annotated[
bool,
"""
A boolean indicating whether a previously written file with the same name ought
to be overwritten. Default is False.
""",
]

output_dir: TypeAlias = Annotated[
str,
"""
A string indicating the desired output file location.
""",
]
132 changes: 132 additions & 0 deletions malariagen_data/anoph/to_plink.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
from typing import Optional

import allel # type: ignore
import numpy as np
import os
import bed_reader

from ..util import dask_compress_dataset
from .snp_data import AnophelesSnpData
from . import base_params
from . import plink_params
from . import pca_params
from numpydoc_decorator import doc # type: ignore


class PlinkConverter(
AnophelesSnpData,
):
def __init__(
self,
**kwargs,
):
# N.B., this class is designed to work cooperatively, and
# so it's important that any remaining parameters are passed
# to the superclass constructor.
super().__init__(**kwargs)

@doc(
summary="""
Write Anopheles biallelic SNP data to the Plink binary file format.
""",
extended_summary="""
This function writes biallelic SNPs to the Plink binary file format. It enables
subsetting to specific regions (`region`), selecting specific sample sets, or lists of
samples, randomly downsampling sites, and specifying filters based on missing data and
minimum minor allele count (see the docs for `biallelic_snp_calls` for more information).
The `overwrite` parameter, set to true, will enable overwrite of data with the same
SNP selection parameter values.
""",
returns="""
Base path to files containing binary Plink output files. Append .bed,
.bim or .fam to obtain paths for the binary genotype table file, variant
information file and sample information file respectively.
""",
notes="""
This computation may take some time to run, depending on your computing
environment. Unless the `overwrite` parameter is set to `True`, results will be returned
from a previous computation, if available.
""",
)
def biallelic_snps_to_plink(
self,
output_dir: plink_params.output_dir,
region: base_params.regions,
n_snps: base_params.n_snps,
overwrite: plink_params.overwrite = False,
thin_offset: base_params.thin_offset = 0,
sample_sets: Optional[base_params.sample_sets] = None,
sample_query: Optional[base_params.sample_query] = None,
sample_indices: Optional[base_params.sample_indices] = None,
site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
min_minor_ac: Optional[
base_params.min_minor_ac
] = pca_params.min_minor_ac_default,
max_missing_an: Optional[
base_params.max_missing_an
] = pca_params.max_missing_an_default,
random_seed: base_params.random_seed = 42,
inline_array: base_params.inline_array = base_params.inline_array_default,
chunks: base_params.chunks = base_params.native_chunks,
):
# Define output files
plink_file_path = f"{output_dir}/{region}.{n_snps}.{min_minor_ac}.{max_missing_an}.{thin_offset}"

bed_file_path = f"{plink_file_path}.bed"

# Check to see if file exists and if overwrite is set to false, return existing file
if os.path.exists(bed_file_path):
if not overwrite:
return plink_file_path

# Get snps
ds_snps = self.biallelic_snp_calls(
region=region,
sample_sets=sample_sets,
sample_query=sample_query,
sample_indices=sample_indices,
site_mask=site_mask,
min_minor_ac=min_minor_ac,
max_missing_an=max_missing_an,
n_snps=n_snps,
thin_offset=thin_offset,
random_seed=random_seed,
inline_array=inline_array,
chunks=chunks,
)

# Set up dataset with required vars for plink conversion

# Compute gt ref counts
with self._dask_progress("Computing genotype ref counts"):
gt_asc = ds_snps["call_genotype"].data # dask array
gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
gn_ref = gn_ref.compute()

# Ensure genotypes vary
loc_var = np.any(gn_ref != gn_ref[:, 0, np.newaxis], axis=1)

# Load final data
ds_snps_final = dask_compress_dataset(ds_snps, loc_var, dim="variants")

# Init vars for input to bed reader
gn_ref_final = gn_ref[loc_var]
val = gn_ref_final.T
with self._spinner("Prepare output data"):
alleles = ds_snps_final["variant_allele"].values
properties = {
"iid": ds_snps_final["sample_id"].values,
"chromosome": ds_snps_final["variant_contig"].values,
"bp_position": ds_snps_final["variant_position"].values,
"allele_1": alleles[:, 0],
"allele_2": alleles[:, 1],
}

bed_reader.to_bed(
filepath=bed_file_path,
val=val,
properties=properties,
count_A1=True,
)

return plink_file_path
2 changes: 2 additions & 0 deletions malariagen_data/anopheles.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@
from .anoph.distance import AnophelesDistanceAnalysis
from .anoph.sample_metadata import AnophelesSampleMetadata, locate_cohorts
from .anoph.snp_data import AnophelesSnpData
from .anoph.to_plink import PlinkConverter
from .anoph.g123 import AnophelesG123Analysis
from .anoph.fst import AnophelesFstAnalysis
from .anoph.h12 import AnophelesH12Analysis
Expand Down Expand Up @@ -100,6 +101,7 @@ class AnophelesDataResource(
AnophelesSnpFrequencyAnalysis,
AnophelesDistanceAnalysis,
AnophelesPca,
PlinkConverter,
AnophelesIgv,
AnophelesAimData,
AnophelesHapData,
Expand Down
50 changes: 50 additions & 0 deletions notebooks/plink_convert.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import malariagen_data\n",
"import os \n",
"\n",
"ag3 = malariagen_data.Ag3(pre=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ag3.biallelic_snps_to_plink(output_dir=os.getcwd(),\n",
" region='2L:100000-2000000',\n",
" n_snps=2000,\n",
" sample_sets='AG1000G-AO',\n",
" )"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "malariagen_plink",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Loading
Loading