Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace contig with region in GWSS functions #691

Draft
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

leehart
Copy link
Collaborator

@leehart leehart commented Dec 6, 2024

Resolves #375

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@leehart leehart marked this pull request as ready for review December 9, 2024 17:45
Copy link
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe using a random region instead of a random contig in at least some of the tests would be better. The function random_region_str from tests/anoph/conftest.py exists and there are a few examples of use in test_snp_data.py, for instance.

@leehart
Copy link
Collaborator Author

leehart commented Dec 13, 2024

Just to note that there are other places in the code unrelated to this PR where a random contig is used for a region in testing, e.g.

  • test_biallelic_diplotype_pairwise_distance_with_metric
  • test_njt_with_metric
  • test_njt_with_algorithm
  • test_plot_njt
  • test_average_fst
  • test_average_fst_with_min_cohort_size
  • test_pca_plotting
  • test_pca_exclude_samples
  • test_pca_fit_exclude_samples
  • test_plink_converter

There might be a valid reason for choosing a random contig over a random region string, I'm not sure, but we could deal with those cases in another PR, if needs be.

@leehart
Copy link
Collaborator Author

leehart commented Dec 13, 2024

I'm getting IndexError: index 0 is out of bounds for axis 0 with size 0 in a few tests after switching to random region strings, so it seems worthwhile!

@leehart leehart marked this pull request as draft December 13, 2024 13:00
@leehart
Copy link
Collaborator Author

leehart commented Jan 20, 2025

It looks like the error is coming from cases where, e.g. in plot_g123_gwss_track():

        x, g123 = self.g123_gwss(
            region=region,
            [...]
        )
        # determine X axis range
        x_min = x[0]

But x isn't what's expected [It is an empty list, i.e. [].], hence IndexError: index 0 is out of bounds for axis 0 with size 0

@leehart
Copy link
Collaborator Author

leehart commented Jan 20, 2025

Noting here that x is meant to be

"An array containing the window centre point genomic positions."

which is calculated via

x = allel.moving_statistic(pos, statistic=np.mean, size=window_size)

in _g123_gwss()

@leehart
Copy link
Collaborator Author

leehart commented Jan 23, 2025

It looks like the test failures for regions were happening whenever the region was sufficiently small or in the wrong place so that no sites were captured.

It looks like this problem is avoided in other places by setting the random region to a fixed size, usually 5000, but I increased this to 10_000 where phased sites were being used.

This does not prevent these IndexError: index 0 is out of bounds for axis 0 with size 0 test failures from ever occurring, since there is still a chance that a random region and sample set are chosen that do not yield anything for x, which breaks assumptive code such as x_min = x[0], which currently occurs in 7 different places in the codebase.

I'm not sure what a more robust solution to this looks like yet, but it seems less related to this particular issue and more to do with the fact that functions such as g123_gwss() cannot currently be guaranteed to return a non-empty x. Perhaps we need to raise another issue to handle those edge cases and their errors, so that the code (and these tests) won't randomly fail as a result.

@leehart leehart requested a review from jonbrenas January 23, 2025 12:59
@leehart leehart marked this pull request as ready for review January 23, 2025 12:59
@leehart leehart marked this pull request as draft January 24, 2025 11:55
@leehart
Copy link
Collaborator Author

leehart commented Feb 3, 2025

Noting test failure for (3.12, numpy~=2.0)

=========================== short test summary info ============================
FAILED tests/anoph/test_snp_frq.py::test_allele_frequencies_with_str_cohorts_and_sample_query[af1_sim] - ValueError: > No amino acid change SNPs found for the given transcript and site mask.
=========== 1 failed, 523 passed, 154 warnings in 272.09s (0:04:32) ============

There was also another failure for "tests with coverage", of the same kind previously encountered, so this is not yet resolved:

=========================== short test summary info ============================
FAILED tests/anoph/test_fst.py::test_fst_gwss[af1_sim] - IndexError: index 0 is out of bounds for axis 0 with size 0
=========== 1 failed, 523 passed, 174 warnings in 532.16s (0:08:52) ============

@leehart
Copy link
Collaborator Author

leehart commented Feb 3, 2025

Plan to add support and show some form of deprecation warning for the contig params (in favour of region) in the following public functions:

  • fst_gwss
  • plot_fst_gwss_track
  • plot_fst_gwss
  • g123_gwss
  • g123_calibration
  • plot_g123_gwss_track
  • plot_g123_gwss
  • plot_g123_calibration
  • h12_calibration
  • plot_h12_calibration
  • h12_gwss
  • plot_h12_gwss_track
  • plot_h12_gwss
  • plot_h12_gwss_multi_overlay_track
  • plot_h12_gwss_multi_overlay
  • plot_h12_gwss_multi_panel
  • h1x_gwss
  • plot_h1x_gwss_track
  • plot_h1x_gwss
  • ihs_gwss
  • plot_ihs_gwss_track
  • plot_xpehh_gwss
  • plot_ihs_gwss
  • xpehh_gwss
  • plot_xpehh_gwss_track

I suspect a similar approach might be applied in the future if we ever support multiple regions instead of as singular region in these functions.

I reckon the current plan is to drop support for the contig param in these functions completely in some future major release, as yet undetermined.

@leehart
Copy link
Collaborator Author

leehart commented Feb 3, 2025

Noting more failures for tests (3.12, numpy==1.26.4), which might necessitate increasing the random region_size from 5000 to 10_000, at least in these tests.

=========================== short test summary info ============================
FAILED tests/anoph/test_g123.py::test_g123_gwss_with_default_sites[af1_sim] - IndexError: index 0 is out of bounds for axis 0 with size 0
FAILED tests/anoph/test_h12.py::test_h12_gwss_multi_with_default_analysis[af1_sim] - IndexError: index 0 is out of bounds for axis 0 with size 0
=========== 2 failed, 522 passed, 174 warnings in 288.72s (0:04:48) ============

…_sites() and test_h12_gwss_multi_with_default_analysis()
@leehart
Copy link
Collaborator Author

leehart commented Feb 3, 2025

One complication is that many, if not all, of these functions currently use positional arguments, rather than requiring keyword-only arguments.

Conveniently, this might not have the usual impact in this case, because region is a superset of contig, so function calls that don't specify the parameter name should still work as they are, and function calls that do specify the parameter name(s) should also work but will receive a deprecation warning.

@leehart
Copy link
Collaborator Author

leehart commented Feb 3, 2025

@ahernank I don't know yet why some tests relating to cohorts and allele frequencies have randomly started failing, e.g. test_allele_frequencies_with_str_cohorts_and_sample_query and test_allele_frequencies_with_min_cohort_size. Could it be something to do with the recent cohorts_20250131 ? Although all other cohorts tests are usually succeeding.

=========================== short test summary info ============================
FAILED tests/anoph/test_snp_frq.py::test_allele_frequencies_with_min_cohort_size[ag3_sim-0] - ValueError: No amino acid change SNPs found for the given transcript and site mask.
=========== 1 failed, 523 passed, 171 warnings in 492.38s (0:08:12) ============

@ahernank
Copy link
Collaborator

ahernank commented Feb 3, 2025

Thanks @leehart, I believe these are failures related to the randomness of the region selected for tests rather than any changes in cohorts -- I've re-run the tests on this PR, and they have now passed with a different region.

@leehart
Copy link
Collaborator Author

leehart commented Feb 3, 2025

Thanks @ahernank , that would make sense but I can't see where random regions come into play for test_allele_frequencies_with_str_cohorts_and_sample_query or test_allele_frequencies_with_min_cohort_size?

I can see random site_mask, transcript, cohorts, country for test_allele_frequencies_with_str_cohorts_and_sample_query.
I can see random sample_sets, site_mask and transcript for test_allele_frequencies_with_min_cohort_size.

@leehart
Copy link
Collaborator Author

leehart commented Feb 3, 2025

Unfortunately, with regards to providing a deprecation path for the contig parameter, the current requirement for positional arguments and the current lack of defaults in other params is causing a problem.

For example, if we kept support for the contig param as optional, but actually required the region parameter, then when a user tried to use the old contig parameter like this:

fst_gwss = ag3.fst_gwss(
    contig="2L",
    window_size=10_000,
    cohort1_query="cohort_admin2_year == 'ML-2_Kati_colu_2014'",
    cohort2_query="cohort_admin2_year == 'ML-2_Kati_gamb_2014'",
    site_mask="gamb_colu",
    cohort_size=10,
    sample_sets="3.0",
)

...then they wouldn't get the DeprecationWarning, because they would get the TypeError: fst_gwss() missing 1 required positional argument: 'region' first.

We can't solve that by making both the new region parameter and the old contig parameters optional, which would then require defaults (None), because the we currently need to have either the contig or the region parameter appear first in the order of positional arguments, in order to preserve backwards compatibility, and the second positional argument in this case is window_size, which currently doesn't have a default set. (We can't have positional params with defaults appear before params without defaults.)

Perhaps one way forwards is to fill in the missing defaults between the first param and the next param that has a default, which in this case would mean setting defaults for window_size, cohort1_query and cohort2_query. Perhaps if a no values are specified for those parameters then we should just raise a ValueError. 🤔

Copy link

codecov bot commented Feb 3, 2025

Codecov Report

Attention: Patch coverage is 71.87500% with 9 lines in your changes missing coverage. Please review.

Project coverage is 93.94%. Comparing base (79a1d69) to head (eb463e9).
Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
malariagen_data/anoph/fst.py 65.38% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #691      +/-   ##
==========================================
- Coverage   94.93%   93.94%   -0.99%     
==========================================
  Files          44       46       +2     
  Lines        4541     4627      +86     
==========================================
+ Hits         4311     4347      +36     
- Misses        230      280      +50     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@leehart
Copy link
Collaborator Author

leehart commented Feb 3, 2025

@alimanfoo @jonbrenas Before I apply the same to the other 24 public functions here, is this an agreeable approach to the deprecation of params such as contig in favour of region in the example committed to this PR for fst_gwss()?

   def fst_gwss(
        self,
        region: Optional[base_params.region] = None,
        window_size: Optional[fst_params.window_size] = None,
        cohort1_query: Optional[base_params.sample_query] = None,
        cohort2_query: Optional[base_params.sample_query] = None,
        sample_query_options: Optional[base_params.sample_query_options] = None,
        sample_sets: Optional[base_params.sample_sets] = None,
        site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
        cohort_size: Optional[base_params.cohort_size] = fst_params.cohort_size_default,
        min_cohort_size: Optional[
            base_params.min_cohort_size
        ] = fst_params.min_cohort_size_default,
        max_cohort_size: Optional[
            base_params.max_cohort_size
        ] = fst_params.max_cohort_size_default,
        random_seed: base_params.random_seed = 42,
        inline_array: base_params.inline_array = base_params.inline_array_default,
        chunks: base_params.chunks = base_params.native_chunks,
        clip_min: fst_params.clip_min = 0.0,
        contig: Optional[base_params.region] = None,  # Deprecated
    ) -> Tuple[np.ndarray, np.ndarray]:
        # Change this name if you ever change the behaviour of this function, to
        # invalidate any previously cached data.
        name = "fst_gwss_v3"

        # Specify which quasi-positional args are required.
        required_args = ("window_size", "cohort1_query", "cohort2_query")

        # Raise an error for any missing required args.
        missing_args = []
        for required_arg in required_args:
            if locals().get(required_arg) is None:
                missing_args.append(required_arg)
        if missing_args:
            raise ValueError(f"Missing required arguments: {missing_args}")

        # Specify which sets of alternative args are required.
        required_alternative_arg_sets = (("contig", "region"),)

        # Raise an error for any missing required alternative args.
        missing_alt_args = []
        for args_set in required_alternative_arg_sets:
            # Check if all alternative arguments are missing
            args_set_values = []
            for arg in args_set:
                args_set_values.append(locals().get(arg))
            if not any(args_set_values):
                missing_alt_args.append(args_set)
        if missing_alt_args:
            raise ValueError(
                f"Missing required alternative arguments: {missing_alt_args}"
            )

In this case, when contig is provided as a named parameter instead of region, the user should see the warning:

DeprecationWarning: The 'contig' parameter has been deprecated. Please use 'region' instead.

Since we have to enable that type of warning, I have also included code to switch it off again, to avoid unintended side-effects of warnings showing up where we want them switched off.

In the case where the user provides an unnamed value in the first position, they should see no warning.

Due to the issue around these functions have positional arguments, I needed to give some of the other parameters a default value, which is checked manually, such that missing arguments would raise a ValueError. This is instead of the TypeError usually seen when one or more positional arguments are missing, e.g. TypeError: fst_gwss() missing 2 required positional arguments: 'region' and 'window_size'.

For example, if cohort2_query is missing, via:

ag3.fst_gwss(
    "2L",
    10_000,
    "cohort_admin2_year == 'ML-2_Kati_colu_2014'",
    site_mask="gamb_colu",
    cohort_size=10,
    sample_sets="3.0",
)

....then the user would see a corresponding ValueError rather than a TypeError:

ValueError: missing required arguments: ['cohort2_query']

To avoid a cryptic TypeError when both the "optional" contig and region params are supplied, e.g.

  str: is not an instance of str
  malariagen_data.util.Region: is not an instance of malariagen_data.util.Region
  Mapping: is not a mapping
  List[Union[str, malariagen_data.util.Region, Mapping]]: is not a list
  Tuple[Union[str, malariagen_data.util.Region, Mapping], ellipsis]: is not a tuple

...which would otherwise be caused by code like this:

ag3.fst_gwss(
    window_size=10_000,
    cohort1_query="cohort_admin2_year == 'ML-2_Kati_colu_2014'",
    cohort2_query="cohort_admin2_year == 'ML-2_Kati_gamb_2014'",
)

...instead, the user would instead see a corresponding ValueError, e.g.

ValueError: Missing required alternative arguments: [('contig', 'region')]

Note: the code here uses locals() but we could accept all **kwargs instead, which would need to be defined as a parameter for the documentation. I've gone with locals() to keep the docs cleaner, so that the contig param gets explicitly mentioned instead of the catch-all and vague kwargs param. However, locals() comes with its own complications, e.g. around function scope, which might lead to subtle bugs, so it might not be the best choice. Happy to switch to **kwargs instead, if needs be,[see comments below] or anything else that's preferable.

@leehart
Copy link
Collaborator Author

leehart commented Feb 4, 2025

I've changed my mind! I plan change this code to use kwargs instead of locals() after all, mainly because it seems more natural and is less risky. I doubt having another kwargs parameter in the docs would be an issue.

@leehart
Copy link
Collaborator Author

leehart commented Feb 4, 2025

I've changed my mind again! Using kwargs doesn't look as straightforward as I first imagined for this situation. We need to support situations where some positional arguments are provided, and we need to temporarily support both contig and region params, whether they are given by position or keyword. I guess there might be a way, but it's looking tricky.

Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @leehart, couple of suggestions...

Comment on lines +110 to +112
window_size: Optional[fst_params.window_size] = None,
cohort1_query: Optional[base_params.sample_query] = None,
cohort2_query: Optional[base_params.sample_query] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why the type annotations of these parameters needs to change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like I tried to explain the reason for this in a comment above but I should revisit this to double-check.

Comment on lines +134 to +145
local_vars = locals().copy()

# Specify which quasi-positional args are required.
required_args = ("window_size", "cohort1_query", "cohort2_query")

# Raise an error for any missing required args.
missing_args = []
for required_arg in required_args:
if local_vars.get(required_arg) is None:
missing_args.append(required_arg)
if missing_args:
raise ValueError(f"Missing required arguments: {missing_args}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary I don't think, if the type annotations are left the same.

Comment on lines +148 to +179
required_alternative_arg_sets = (("contig", "region"),)

# Raise an error for any missing required alternative args.
missing_alt_args = []
for args_set in required_alternative_arg_sets:
# Check if all alternative arguments are missing
args_set_values = []
for arg in args_set:
args_set_values.append(local_vars.get(arg))
if not any(args_set_values):
missing_alt_args.append(args_set)
if missing_alt_args:
raise ValueError(
f"Missing required alternative arguments: {missing_alt_args}"
)

if contig is not None:
# Get the current warning filters.
original_warning_filters = warnings.filters[:]

# Trigger the warning.
warnings.simplefilter("default", DeprecationWarning)
warnings.warn(
"The 'contig' parameter has been deprecated. Please use 'region' instead.",
DeprecationWarning,
)

# Restore the original warning filters.
warnings.filters = original_warning_filters

# If contig and region are both given, then prefer region.
region = contig if region is None else region
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to handle this within a helper function. E.g., replace all of this with:

        region = _handle_deprecated_contig_param(region=region, contig=contig)
        del contig

The implementation if this helper function could then live in a convenient common location somewhere, and could look something like:

def _handle_deprecated_contig_param(region, contig):
    if contig is None:
        # User is not using the old 'contig' parameter, all good.
        return region
    elif region is None:
        # User is using the old 'contig' parameter, raise a warning.
        warnings.warn(
            "The 'contig' parameter has been deprecated. Please use 'region' instead.",
            DeprecationWarning,
        )
        # A contig is a valid region, so return the contig as the region.
        return contig
    else:
        # User is using both 'region' and 'contig' parameters, raise an error.
        raise ValueError("Found both 'region' and 'contig' parameters, please provide 'region' parameter only.")

@alimanfoo
Copy link
Member

Since we have to enable that type of warning, I have also included code to switch it off again, to avoid unintended side-effects of warnings showing up where we want them switched off.

FWIW I would just raise a warning and not try to override any warning filters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support region instead of contig parameter in H12, G123, IHS and Fst functions
4 participants