Replace contig with region in GWSS functions #691

leehart · 2024-12-06T18:00:27Z

Resolves #375

review-notebook-app · 2024-12-06T18:00:33Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

jonbrenas

Maybe using a random region instead of a random contig in at least some of the tests would be better. The function random_region_str from tests/anoph/conftest.py exists and there are a few examples of use in test_snp_data.py, for instance.

leehart · 2024-12-13T11:35:10Z

Just to note that there are other places in the code unrelated to this PR where a random contig is used for a region in testing, e.g.

test_biallelic_diplotype_pairwise_distance_with_metric
test_njt_with_metric
test_njt_with_algorithm
test_plot_njt
test_average_fst
test_average_fst_with_min_cohort_size
test_pca_plotting
test_pca_exclude_samples
test_pca_fit_exclude_samples
test_plink_converter

There might be a valid reason for choosing a random contig over a random region string, I'm not sure, but we could deal with those cases in another PR, if needs be.

leehart · 2024-12-13T11:41:12Z

I'm getting IndexError: index 0 is out of bounds for axis 0 with size 0 in a few tests after switching to random region strings, so it seems worthwhile!

leehart · 2025-01-20T12:48:16Z

It looks like the error is coming from cases where, e.g. in plot_g123_gwss_track():

        x, g123 = self.g123_gwss(
            region=region,
            [...]
        )
        # determine X axis range
        x_min = x[0]

But x isn't what's expected (it is an empty list, i.e. []), hence IndexError: index 0 is out of bounds for axis 0 with size 0

leehart · 2025-01-20T14:22:50Z

Noting here that x is meant to be

"An array containing the window centre point genomic positions."

which is calculated via

x = allel.moving_statistic(pos, statistic=np.mean, size=window_size)

in _g123_gwss()
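
To illustrate how x can end up empty, here is a pure-Python stand-in for non-overlapping windowing. It is only a sketch: windowed_means is a hypothetical helper, not allel.moving_statistic itself, and it assumes that an incomplete trailing window is dropped, so fewer positions than window_size yields no windows at all.

```python
from statistics import mean


def windowed_means(pos, window_size):
    """Mean of each complete, non-overlapping window of `pos` (hypothetical stand-in)."""
    n_windows = len(pos) // window_size  # incomplete trailing window is dropped
    return [
        mean(pos[i * window_size : (i + 1) * window_size])
        for i in range(n_windows)
    ]


# Plenty of sites: two complete windows, so two window centres.
x_ok = windowed_means([10, 20, 30, 40, 50], window_size=2)

# A small region with fewer sites than the window size: no windows at all.
x_empty = windowed_means([10, 20], window_size=5)

try:
    x_min = x_empty[0]
except IndexError as e:
    error = e  # indexing the empty result fails, as in the failing tests
```

With a NumPy array in place of the list, the same indexing step produces the "index 0 is out of bounds for axis 0 with size 0" message quoted above.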

leehart · 2025-01-23T12:57:37Z

It looks like the test failures for regions were happening whenever the region was sufficiently small or in the wrong place so that no sites were captured.

It looks like this problem is avoided in other places by setting the random region to a fixed size, usually 5000, but I increased this to 10_000 where phased sites were being used.
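
As a rough illustration of the fixed-size approach, a random region string of a given size could be drawn as below. This is a sketch only: sketch_random_region_str and the contig_sizes mapping are hypothetical, and this is not the random_region_str helper from tests/anoph/conftest.py, whose signature is not shown here. Coordinates are treated as 1-based and inclusive, which is also an assumption.

```python
import random


def sketch_random_region_str(contig_sizes, region_size, rng=random):
    """Return a 'contig:start-end' string spanning exactly `region_size` bases.

    `contig_sizes` maps contig name to contig length (hypothetical input).
    """
    # Only contigs long enough to hold the region are eligible.
    eligible = [c for c, size in contig_sizes.items() if size >= region_size]
    if not eligible:
        raise ValueError(f"No contig can hold a region of size {region_size}")
    contig = rng.choice(sorted(eligible))
    # Inclusive coordinates, so end - start + 1 == region_size.
    start = rng.randint(1, contig_sizes[contig] - region_size + 1)
    end = start + region_size - 1
    return f"{contig}:{start}-{end}"


rng = random.Random(42)  # seeded so that a failing region can be reproduced
region = sketch_random_region_str({"2L": 50_000, "3R": 8_000}, 10_000, rng=rng)
```

Seeding the generator does not make an empty region less likely, but it does make a failing draw repeatable when debugging.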

This does not prevent these IndexError: index 0 is out of bounds for axis 0 with size 0 test failures from ever occurring, since there is still a chance that a random region and sample set are chosen that do not yield anything for x, which breaks assumptive code such as x_min = x[0], which currently occurs in 7 different places in the codebase.

I'm not sure what a more robust solution to this looks like yet, but it seems less related to this particular issue and more to do with the fact that functions such as g123_gwss() cannot currently be guaranteed to return a non-empty x. Perhaps we need to raise another issue to handle those edge cases and their errors, so that the code (and these tests) won't randomly fail as a result.
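
One possible shape for such handling, sketched under the assumption that a descriptive error is preferable to a bare IndexError. window_x_range is a hypothetical helper name, not something in the codebase, and the wording of the message is only a suggestion.

```python
def window_x_range(x, region=None):
    """Return (x_min, x_max) for window centres, failing clearly if there are none."""
    if len(x) == 0:
        raise ValueError(
            f"No windows were computed for region {region!r}; "
            "try a larger region or a smaller window size."
        )
    return x[0], x[-1]


x_min, x_max = window_x_range([15, 35, 55], region="2L:1-10000")

try:
    window_x_range([], region="2L:1-10")
except ValueError as e:
    message = str(e)
```

Because the check uses len(), it would apply equally to a list or a one-dimensional array, and tests could then assert on the ValueError instead of failing at random.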

leehart · 2025-02-03T11:52:23Z

Noting test failure for (3.12, numpy~=2.0)

=========================== short test summary info ============================
FAILED tests/anoph/test_snp_frq.py::test_allele_frequencies_with_str_cohorts_and_sample_query[af1_sim] - ValueError: > No amino acid change SNPs found for the given transcript and site mask.
=========== 1 failed, 523 passed, 154 warnings in 272.09s (0:04:32) ============

There was also another failure for "tests with coverage", of the same kind previously encountered, so this is not yet resolved:

=========================== short test summary info ============================
FAILED tests/anoph/test_fst.py::test_fst_gwss[af1_sim] - IndexError: index 0 is out of bounds for axis 0 with size 0
=========== 1 failed, 523 passed, 174 warnings in 532.16s (0:08:52) ============

leehart · 2025-02-03T12:07:46Z

Plan to add support and show some form of deprecation warning for the contig params (in favour of region) in the following public functions:

fst_gwss
plot_fst_gwss_track
plot_fst_gwss
g123_gwss
g123_calibration
plot_g123_gwss_track
plot_g123_gwss
plot_g123_calibration
h12_calibration
plot_h12_calibration
h12_gwss
plot_h12_gwss_track
plot_h12_gwss
plot_h12_gwss_multi_overlay_track
plot_h12_gwss_multi_overlay
plot_h12_gwss_multi_panel
h1x_gwss
plot_h1x_gwss_track
plot_h1x_gwss
ihs_gwss
plot_ihs_gwss_track
plot_xpehh_gwss
plot_ihs_gwss
xpehh_gwss
plot_xpehh_gwss_track

I suspect a similar approach might be applied in the future if we ever support multiple regions instead of a single region in these functions.

I reckon the current plan is to drop support for the contig param in these functions completely in some future major release, as yet undetermined.

leehart · 2025-02-03T12:10:53Z

Noting more failures for tests (3.12, numpy==1.26.4), which might necessitate increasing the random region_size from 5000 to 10_000, at least in these tests.

=========================== short test summary info ============================
FAILED tests/anoph/test_g123.py::test_g123_gwss_with_default_sites[af1_sim] - IndexError: index 0 is out of bounds for axis 0 with size 0
FAILED tests/anoph/test_h12.py::test_h12_gwss_multi_with_default_analysis[af1_sim] - IndexError: index 0 is out of bounds for axis 0 with size 0
=========== 2 failed, 522 passed, 174 warnings in 288.72s (0:04:48) ============

leehart · 2025-02-03T12:27:53Z

One complication is that many, if not all, of these functions currently use positional arguments, rather than requiring keyword-only arguments.

Conveniently, this might not have the usual impact in this case, because region is a superset of contig, so function calls that don't specify the parameter name should still work as they are, and function calls that do specify the parameter name(s) should also work but will receive a deprecation warning.

leehart · 2025-02-03T12:46:31Z

@ahernank I don't know yet why some tests relating to cohorts and allele frequencies have randomly started failing, e.g. test_allele_frequencies_with_str_cohorts_and_sample_query and test_allele_frequencies_with_min_cohort_size. Could it be something to do with the recent cohorts_20250131? All the other cohorts tests are usually succeeding, though.

=========================== short test summary info ============================
FAILED tests/anoph/test_snp_frq.py::test_allele_frequencies_with_min_cohort_size[ag3_sim-0] - ValueError: No amino acid change SNPs found for the given transcript and site mask.
=========== 1 failed, 523 passed, 171 warnings in 492.38s (0:08:12) ============

ahernank · 2025-02-03T13:38:42Z

Thanks @leehart, I believe these are failures related to the randomness of the region selected for tests rather than any changes in cohorts -- I've re-run the tests on this PR, and they have now passed with a different region.

leehart · 2025-02-03T14:22:14Z

Thanks @ahernank, that would make sense, but I can't see where random regions come into play for test_allele_frequencies_with_str_cohorts_and_sample_query or test_allele_frequencies_with_min_cohort_size?

I can see random site_mask, transcript, cohorts, country for test_allele_frequencies_with_str_cohorts_and_sample_query.
I can see random sample_sets, site_mask and transcript for test_allele_frequencies_with_min_cohort_size.

leehart · 2025-02-03T14:42:15Z

Unfortunately, with regard to providing a deprecation path for the contig parameter, the current requirement for positional arguments and the current lack of defaults in other params are causing a problem.

For example, if we kept support for the contig param as optional, but actually required the region parameter, then when a user tried to use the old contig parameter like this:

fst_gwss = ag3.fst_gwss(
    contig="2L",
    window_size=10_000,
    cohort1_query="cohort_admin2_year == 'ML-2_Kati_colu_2014'",
    cohort2_query="cohort_admin2_year == 'ML-2_Kati_gamb_2014'",
    site_mask="gamb_colu",
    cohort_size=10,
    sample_sets="3.0",
)

...then they wouldn't get the DeprecationWarning, because they would get the TypeError: fst_gwss() missing 1 required positional argument: 'region' first.
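
The ordering problem can be reproduced with a toy function. In this sketch toy_gwss is a hypothetical stand-in for fst_gwss(): because region is a required parameter, Python rejects the call while binding arguments, so the warning in the function body never runs.

```python
import warnings


def toy_gwss(region, window_size, contig=None):
    # Hypothetical stand-in: region is required, contig is optional and deprecated.
    if contig is not None:
        warnings.warn("'contig' is deprecated, use 'region'", DeprecationWarning)
        region = contig
    return region, window_size


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    try:
        toy_gwss(contig="2L", window_size=10_000)
    except TypeError as e:
        error_message = str(e)

# The call failed during argument binding, so nothing was warned about.
n_warnings = len(caught)
```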

We can't solve that by making both the new region parameter and the old contig parameter optional, which would then require defaults (None). To preserve backwards compatibility, we currently need either the contig or the region parameter to appear first in the order of positional arguments, and the second positional argument in this case is window_size, which currently doesn't have a default set. (We can't have positional params with defaults appear before params without defaults.)

Perhaps one way forwards is to fill in the missing defaults between the first param and the next param that has a default, which in this case would mean setting defaults for window_size, cohort1_query and cohort2_query. Perhaps if no values are specified for those parameters then we should just raise a ValueError. 🤔
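
The parameter-ordering constraint can be checked directly, and the workaround sketched alongside it. toy_gwss is a hypothetical stand-in with a shortened parameter list; only the ordering rule itself is standard Python behaviour.

```python
# A parameter without a default may not follow one with a default.
bad_source = "def toy_gwss(region=None, window_size, contig=None): pass"
try:
    compile(bad_source, "<sketch>", "exec")
    ordering_allowed = True
except SyntaxError:
    ordering_allowed = False


def toy_gwss(region=None, window_size=None, contig=None):
    # Workaround: give window_size a None default and validate it by hand.
    if window_size is None:
        raise ValueError("Missing required arguments: ['window_size']")
    return (region if region is not None else contig), window_size


result = toy_gwss(contig="2L", window_size=10_000)
```

The cost of this design is that a missing argument surfaces as a ValueError raised by the function rather than the TypeError Python would normally raise at call time.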

codecov · 2025-02-03T17:11:31Z

Codecov Report

Attention: Patch coverage is 71.87500% with 9 lines in your changes missing coverage. Please review.

Project coverage is 93.94%. Comparing base (79a1d69) to head (eb463e9).
Report is 15 commits behind head on master.

Files with missing lines	Patch %	Lines
malariagen_data/anoph/fst.py	65.38%	9 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #691      +/-   ##
==========================================
- Coverage   94.93%   93.94%   -0.99%     
==========================================
  Files          44       46       +2     
  Lines        4541     4627      +86     
==========================================
+ Hits         4311     4347      +36     
- Misses        230      280      +50

leehart · 2025-02-03T17:20:17Z

@alimanfoo @jonbrenas Before I apply the same to the other 24 public functions here, is this an agreeable approach to the deprecation of params such as contig in favour of region in the example committed to this PR for fst_gwss()?

    def fst_gwss(
        self,
        region: Optional[base_params.region] = None,
        window_size: Optional[fst_params.window_size] = None,
        cohort1_query: Optional[base_params.sample_query] = None,
        cohort2_query: Optional[base_params.sample_query] = None,
        sample_query_options: Optional[base_params.sample_query_options] = None,
        sample_sets: Optional[base_params.sample_sets] = None,
        site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
        cohort_size: Optional[base_params.cohort_size] = fst_params.cohort_size_default,
        min_cohort_size: Optional[
            base_params.min_cohort_size
        ] = fst_params.min_cohort_size_default,
        max_cohort_size: Optional[
            base_params.max_cohort_size
        ] = fst_params.max_cohort_size_default,
        random_seed: base_params.random_seed = 42,
        inline_array: base_params.inline_array = base_params.inline_array_default,
        chunks: base_params.chunks = base_params.native_chunks,
        clip_min: fst_params.clip_min = 0.0,
        contig: Optional[base_params.region] = None,  # Deprecated
    ) -> Tuple[np.ndarray, np.ndarray]:
        # Change this name if you ever change the behaviour of this function, to
        # invalidate any previously cached data.
        name = "fst_gwss_v3"

        # Specify which quasi-positional args are required.
        required_args = ("window_size", "cohort1_query", "cohort2_query")

        # Raise an error for any missing required args.
        missing_args = []
        for required_arg in required_args:
            if locals().get(required_arg) is None:
                missing_args.append(required_arg)
        if missing_args:
            raise ValueError(f"Missing required arguments: {missing_args}")

        # Specify which sets of alternative args are required.
        required_alternative_arg_sets = (("contig", "region"),)

        # Raise an error for any missing required alternative args.
        missing_alt_args = []
        for args_set in required_alternative_arg_sets:
            # Check if all alternative arguments are missing
            args_set_values = []
            for arg in args_set:
                args_set_values.append(locals().get(arg))
            if not any(args_set_values):
                missing_alt_args.append(args_set)
        if missing_alt_args:
            raise ValueError(
                f"Missing required alternative arguments: {missing_alt_args}"
            )

In this case, when contig is provided as a named parameter instead of region, the user should see the warning:

DeprecationWarning: The 'contig' parameter has been deprecated. Please use 'region' instead.

Since we have to enable that type of warning, I have also included code to switch it off again, to avoid unintended side-effects of warnings showing up where we want them switched off.

In the case where the user provides an unnamed value in the first position, they should see no warning.

Because these functions accept positional arguments, I needed to give some of the other parameters a default value, which is checked manually, such that missing arguments would raise a ValueError. This is instead of the TypeError usually seen when one or more positional arguments are missing, e.g. TypeError: fst_gwss() missing 2 required positional arguments: 'region' and 'window_size'.

For example, if cohort2_query is missing, via:

ag3.fst_gwss(
    "2L",
    10_000,
    "cohort_admin2_year == 'ML-2_Kati_colu_2014'",
    site_mask="gamb_colu",
    cohort_size=10,
    sample_sets="3.0",
)

...then the user would see a corresponding ValueError rather than a TypeError:

ValueError: Missing required arguments: ['cohort2_query']

To avoid a cryptic TypeError when neither of the "optional" contig and region params is supplied, e.g.

  str: is not an instance of str
  malariagen_data.util.Region: is not an instance of malariagen_data.util.Region
  Mapping: is not a mapping
  List[Union[str, malariagen_data.util.Region, Mapping]]: is not a list
  Tuple[Union[str, malariagen_data.util.Region, Mapping], ellipsis]: is not a tuple

...which would otherwise be caused by code like this:

ag3.fst_gwss(
    window_size=10_000,
    cohort1_query="cohort_admin2_year == 'ML-2_Kati_colu_2014'",
    cohort2_query="cohort_admin2_year == 'ML-2_Kati_gamb_2014'",
)

...the user would instead see a corresponding ValueError, e.g.

ValueError: Missing required alternative arguments: [('contig', 'region')]

Note: the code here uses locals() but we could accept all **kwargs instead, which would need to be defined as a parameter for the documentation. I've gone with locals() to keep the docs cleaner, so that the contig param gets explicitly mentioned instead of the catch-all and vague kwargs param. However, locals() comes with its own complications, e.g. around function scope, which might lead to subtle bugs, so it might not be the best choice. ~~Happy to switch to **kwargs instead, if needs be,~~[see comments below] or anything else that's preferable.
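
For comparison, the same two checks can be written without locals() by naming the values explicitly. This is a sketch under the assumption that the set of checked parameters is small and fixed; check_required and toy_gwss are hypothetical names, and the error messages simply mirror the ones above.

```python
def check_required(required, alternatives):
    """Validate explicitly passed values instead of inspecting locals().

    `required` maps name -> value; `alternatives` is a list of such mappings,
    each of which needs at least one non-None value.
    """
    missing = [name for name, value in required.items() if value is None]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    missing_alt = [
        tuple(group) for group in alternatives
        if all(value is None for value in group.values())
    ]
    if missing_alt:
        raise ValueError(f"Missing required alternative arguments: {missing_alt}")


def toy_gwss(region=None, window_size=None, contig=None):
    check_required(
        required={"window_size": window_size},
        alternatives=[{"contig": contig, "region": region}],
    )
    return region if region is not None else contig


try:
    toy_gwss(window_size=10_000)
except ValueError as e:
    message = str(e)
```

Note that this version tests for None rather than truthiness, so a falsy but valid value would not be reported as missing.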

leehart · 2025-02-04T10:29:47Z

I've changed my mind! I plan to change this code to use kwargs instead of locals() after all, mainly because it seems more natural and is less risky. I doubt having another kwargs parameter in the docs would be an issue.

leehart · 2025-02-04T11:26:05Z

I've changed my mind again! Using kwargs doesn't look as straightforward as I first imagined for this situation. We need to support situations where some positional arguments are provided, and we need to temporarily support both contig and region params, whether they are given by position or keyword. I guess there might be a way, but it's looking tricky.
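
One way this could be approached, offered only as a sketch, is a decorator that rewrites a contig keyword into region before the call reaches the function, so the wrapped signature can keep its required positional parameters. deprecated_contig and ToyApi are hypothetical names, and the trade-off is that contig no longer appears in the documented signature.

```python
import functools
import warnings


def deprecated_contig(func):
    """Accept a deprecated `contig` keyword on a method whose first parameter is `region`."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if "contig" in kwargs:
            # region may already be given by keyword or as the first positional arg.
            if "region" in kwargs or args:
                raise ValueError(
                    "Found both 'region' and 'contig' parameters, "
                    "please provide 'region' parameter only."
                )
            warnings.warn(
                "The 'contig' parameter has been deprecated. "
                "Please use 'region' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            kwargs["region"] = kwargs.pop("contig")
        return func(self, *args, **kwargs)

    return wrapper


class ToyApi:
    @deprecated_contig
    def toy_gwss(self, region, window_size):
        return region, window_size


api = ToyApi()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    by_position = api.toy_gwss("2L", 10_000)  # no warning
    by_keyword = api.toy_gwss(contig="2L", window_size=10_000)  # warns once
```

Whether this interacts cleanly with the runtime type checking and docstring tooling used in the package is untested here.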

alimanfoo

Hi @leehart, couple of suggestions...

alimanfoo · 2025-02-25T14:11:34Z

malariagen_data/anoph/fst.py

+        window_size: Optional[fst_params.window_size] = None,
+        cohort1_query: Optional[base_params.sample_query] = None,
+        cohort2_query: Optional[base_params.sample_query] = None,


Not sure why the type annotations of these parameters need to change.

leehart · 2025-02-25

It looks like I tried to explain the reason for this in a comment above but I should revisit this to double-check.

alimanfoo · 2025-02-25T14:29:15Z

malariagen_data/anoph/fst.py

+        local_vars = locals().copy()
+
+        # Specify which quasi-positional args are required.
+        required_args = ("window_size", "cohort1_query", "cohort2_query")
+
+        # Raise an error for any missing required args.
+        missing_args = []
+        for required_arg in required_args:
+            if local_vars.get(required_arg) is None:
+                missing_args.append(required_arg)
+        if missing_args:
+            raise ValueError(f"Missing required arguments: {missing_args}")


This shouldn't be necessary I don't think, if the type annotations are left the same.

alimanfoo · 2025-02-25T14:52:58Z

malariagen_data/anoph/fst.py

+        required_alternative_arg_sets = (("contig", "region"),)
+
+        # Raise an error for any missing required alternative args.
+        missing_alt_args = []
+        for args_set in required_alternative_arg_sets:
+            # Check if all alternative arguments are missing
+            args_set_values = []
+            for arg in args_set:
+                args_set_values.append(local_vars.get(arg))
+            if not any(args_set_values):
+                missing_alt_args.append(args_set)
+        if missing_alt_args:
+            raise ValueError(
+                f"Missing required alternative arguments: {missing_alt_args}"
+            )
+
+        if contig is not None:
+            # Get the current warning filters.
+            original_warning_filters = warnings.filters[:]
+
+            # Trigger the warning.
+            warnings.simplefilter("default", DeprecationWarning)
+            warnings.warn(
+                "The 'contig' parameter has been deprecated. Please use 'region' instead.",
+                DeprecationWarning,
+            )
+
+            # Restore the original warning filters.
+            warnings.filters = original_warning_filters
+
+            # If contig and region are both given, then prefer region.
+            region = contig if region is None else region


I would suggest to handle this within a helper function. E.g., replace all of this with:

region = _handle_deprecated_contig_param(region=region, contig=contig)
del contig

The implementation of this helper function could then live in a convenient common location somewhere, and could look something like:

def _handle_deprecated_contig_param(region, contig):
    if contig is None:
        # User is not using the old 'contig' parameter, all good.
        return region
    elif region is None:
        # User is using the old 'contig' parameter, raise a warning.
        warnings.warn(
            "The 'contig' parameter has been deprecated. Please use 'region' instead.",
            DeprecationWarning,
        )
        # A contig is a valid region, so return the contig as the region.
        return contig
    else:
        # User is using both 'region' and 'contig' parameters, raise an error.
        raise ValueError("Found both 'region' and 'contig' parameters, please provide 'region' parameter only.")
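
The three branches of the suggested helper can be exercised as follows. The function body is reproduced from the suggestion above so that the snippet is self-contained; the surrounding checks are an illustrative sketch rather than tests from the PR.

```python
import warnings


def _handle_deprecated_contig_param(region, contig):
    if contig is None:
        return region
    elif region is None:
        warnings.warn(
            "The 'contig' parameter has been deprecated. Please use 'region' instead.",
            DeprecationWarning,
        )
        return contig
    else:
        raise ValueError(
            "Found both 'region' and 'contig' parameters, "
            "please provide 'region' parameter only."
        )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded
    only_region = _handle_deprecated_contig_param(region="2L:1-10000", contig=None)
    only_contig = _handle_deprecated_contig_param(region=None, contig="2L")

try:
    _handle_deprecated_contig_param(region="2L", contig="2L")
except ValueError as e:
    both_error = str(e)
```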

alimanfoo · 2025-02-25T14:59:26Z

"Since we have to enable that type of warning, I have also included code to switch it off again, to avoid unintended side-effects of warnings showing up where we want them switched off."

FWIW I would just raise a warning and not try to override any warning filters.
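
A sketch of that simpler approach: the function only calls warnings.warn, and any code that wants to see or capture the warning scopes its own filter with warnings.catch_warnings(), which restores the previous filters on exit. toy_gwss is a hypothetical stand-in, and stacklevel=2 is an added assumption so that the warning points at the caller.

```python
import warnings


def toy_gwss(region=None, contig=None):
    if contig is not None:
        # Just warn; leave the global filter configuration alone.
        warnings.warn(
            "The 'contig' parameter has been deprecated. Please use 'region' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        region = contig
    return region


filters_before = list(warnings.filters)

# Caller-side opt-in, scoped to this block only.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    result = toy_gwss(contig="2L")

filters_after = list(warnings.filters)
```

In tests, pytest.warns(DeprecationWarning) serves the same purpose as the catch_warnings block.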

Replace contig with region in H12 GWSS functions and tests

23105a4

leehart added 8 commits December 6, 2024 18:21

Replace contig with region in G123 GWSS functions and tests

acdba78

Replace contig with region in iHS GWSS functions and tests

226c471

Change iHS GWSS cache names

1409756

Merge branch 'master' into GH375_allow_region_instead_of_contig_param

3825a92

Replace contig with region in FST GWSS functions and tests

65c048f

Change cache name for fst_gwss()

bf5b2f9

Replace contig with region in H1X GWSS functions and tests. Change ca…

ab45382

…che name.

Replace contig with region in XP-EHH GWSS functions and tests. Change…

95fb659

… cache names.

leehart marked this pull request as ready for review December 9, 2024 17:45

leehart requested review from alimanfoo and jonbrenas December 9, 2024 17:46

jonbrenas requested changes Dec 10, 2024

leehart marked this pull request as draft December 13, 2024 13:00

leehart added 5 commits December 13, 2024 14:07

WIP: use random_region_str() for random region in GWSS function tests

d7a7698

Merge branch 'master' into GH375_allow_region_instead_of_contig_param

9dc4e76

WIP: use random contig for GWSS function tests

8dfa5f8

Use random_region_str() instead of random contig for test_fst_gwss()

e668189

Use random_region_str() instead of random contig for test_g123_gwss_w…

b1e0e24

…ith_default_sites()

Replcase random contig with random region of fixed size in gwss funct…

97c63f3

…ion tests

leehart requested a review from jonbrenas January 23, 2025 12:59

leehart marked this pull request as ready for review January 23, 2025 12:59

leehart marked this pull request as draft January 24, 2025 11:55

leehart mentioned this pull request Jan 31, 2025

Allow multiple SNP transcripts in plot_diplotype_clustering_advanced() #703

Merged

Merge branch 'master' into GH375_allow_region_instead_of_contig_param

288d40c

Add region_size for random_region_str in test_fst_gwss()

aa41bbd

Increase random region size to 10_000 for test_g123_gwss_with_default…

2221fd7

…_sites() and test_h12_gwss_multi_with_default_analysis()

leehart added 2 commits February 3, 2025 15:54

Support deprecated contig param in fst_gwss()

69eec31

Raise ValueError for missing required alternative args in fst_gwss()

eb463e9

Fix logic bug in fst_gwss() re missing alt args

88a87cb

Copy locals() in fst_gwss()

8a0aaba

alimanfoo reviewed Feb 25, 2025
