
Fix strategy validation thread-safety #4473


Merged

Conversation

@tybug (Member) commented Jul 15, 2025

Part of #4451. This one is pretty rare, only triggering about once per full test suite run under --parallel-threads 2. There might be a way to write do_validate that takes recursion into account and only sets validate_called = True at the end, but that's a bigger rewrite.

I'm somewhat worried about the unconditional lock overhead in the single-threaded case. It's ~pure lock/unlock overhead; I expect ~zero contention. Here's a Claude-written benchmark of .validate for lambda i: st.integers().map(lambda x: x + i), where the lambda i and .map are cache-busters so each strategy really gets .validate called.

master : 0.000080612s per call (12405 calls/sec)
pr     : 0.000086325s per call (11584 calls/sec)
Details
import random
import statistics
import timeit

from hypothesis import strategies as st

def benchmark_integers_validate():
    """Benchmark st.integers().validate() performance with cached strategies"""

    # Test different integer strategy configurations with unique mapped functions
    test_cases = [
        ("default", lambda i: st.integers().map(lambda x: x + i)),
        ("bounded", lambda i: st.integers(min_value=0, max_value=1000).map(lambda x: x * i)),
        ("min_only", lambda i: st.integers(min_value=0).map(lambda x: x - i)),
        ("max_only", lambda i: st.integers(max_value=1000).map(lambda x: x // (i + 1))),
        ("large_range", lambda i: st.integers(min_value=-1000000, max_value=1000000).map(lambda x: x % (i + 1))),
    ]

    results = {}

    for name, strategy_func in test_cases:
        print(f"\nBenchmarking {name} integers strategy with unique mapped functions:")

        # Benchmark creating and validating unique strategies
        def run_validate_with_unique_strategy():
            # Create a unique strategy for each call using a different function
            i = random.randint(1, 1000)  # Random value to make function unique
            strategy = strategy_func(i)
            return strategy.validate()

        # Run multiple times for accuracy
        times = timeit.repeat(run_validate_with_unique_strategy, repeat=100, number=100)

        # Calculate statistics
        mean_time = statistics.mean(times)
        median_time = statistics.median(times)
        min_time = min(times)
        max_time = max(times)
        std_dev = statistics.stdev(times) if len(times) > 1 else 0

        results[name] = {
            'mean': mean_time,
            'median': median_time,
            'min': min_time,
            'max': max_time,
            'std_dev': std_dev,
            'total_calls': 100 * 100
        }

        print(f"  Mean time: {mean_time:.6f} seconds per 100 calls")
        print(f"  Median time: {median_time:.6f} seconds per 100 calls")
        print(f"  Min time: {min_time:.6f} seconds per 100 calls")
        print(f"  Max time: {max_time:.6f} seconds per 100 calls")
        print(f"  Std dev: {std_dev:.6f} seconds")
        print(f"  Per call: {mean_time/100:.9f} seconds")
        print(f"  Calls per second: {100/mean_time:.0f}")

    # Summary comparison
    print("\n" + "="*50)
    print("SUMMARY COMPARISON")
    print("="*50)

    sorted_results = sorted(results.items(), key=lambda x: x[1]['mean'])

    for name, stats in sorted_results:
        print(f"{name:15s}: {stats['mean']/100:.9f}s per call ({100/stats['mean']:.0f} calls/sec)")

    return results

def benchmark_strategy_caching_comparison():
    """Compare performance of reused vs unique strategies"""

    print("\n" + "="*60)
    print("STRATEGY CACHING COMPARISON")
    print("="*60)

    # Test 1: Reusing the same strategy
    print("\nTest 1: Reusing same strategy")
    same_strategy = st.integers().map(lambda x: x + 1)

    def run_same_strategy():
        return same_strategy.validate()

    times_same = timeit.repeat(run_same_strategy, repeat=50, number=1000)
    mean_same = statistics.mean(times_same)

    print(f"  Mean time (same strategy): {mean_same:.6f} seconds per 1000 calls")
    print(f"  Per call: {mean_same/1000:.9f} seconds")

    # Test 2: Creating unique strategies each time
    print("\nTest 2: Creating unique strategies each time")

    def run_unique_strategies():
        i = random.randint(1, 1000000)
        strategy = st.integers().map(lambda x: x + i)
        return strategy.validate()

    times_unique = timeit.repeat(run_unique_strategies, repeat=50, number=100)
    mean_unique = statistics.mean(times_unique)

    print(f"  Mean time (unique strategies): {mean_unique:.6f} seconds per 100 calls")
    print(f"  Per call: {mean_unique/100:.9f} seconds")

    # Compare
    print(f"\nComparison:")
    print(f"  Same strategy per call: {mean_same/1000:.9f}s")
    print(f"  Unique strategy per call: {mean_unique/100:.9f}s")
    print(f"  Overhead ratio: {(mean_unique/100) / (mean_same/1000):.2f}x")
    print(f"  Caching saves: {((mean_unique/100) - (mean_same/1000)) / (mean_unique/100) * 100:.1f}% per call")

if __name__ == "__main__":
    print("Benchmarking st.integers().validate() performance with cached strategies...")
    benchmark_integers_validate()
    benchmark_strategy_caching_comparison()

Around 7% slower. Not great, since I've seen .validate be a performance hotspot before.

The following command reproduces the relevant failure on master: counter=1; while pytest hypothesis-python/tests/ --parallel-threads 2 -k test_invalid_args; do ((counter++)); done;

@tybug tybug force-pushed the free-threading-strategy-validation branch from 9378c6f to 42299bb on July 15, 2025 05:52

with validate_lock:
    try:
        self.validate_called = True
Contributor:

Should validate_called = True be moved below do_validate(), given the early return outside the lock on l.486?

That would leave open the possibility of do_validate being called multiple times in an initialization race (unless explicitly checked inside), though that's probably harmless?

Contributor:

Hm... per the comment above, this isn't sufficient. Maybe the early return needs to move inside the critical region, or, if that is too expensive, possibly a two-stage process, i.e. an early return like

if self.validate_called and not self.validate_in_progress:
    return

I'm not thinking clearly right now, so please consider this a hint, not a recipe.

@tybug (Member, Author) Jul 15, 2025:

if self.validate_called and not self.validate_in_progress: is also going to run into infinite recursion, but possibly what we could do is keep validate_calleds: dict[int, bool], which tracks thread_id: validate_called, and only return early if validate has already been called on this thread. This would be lock-free and allow for concurrent validates. If threading.get_ident() is not expensive then this could work. I'll test it.

@tybug (Member, Author):

The downside of this is that every thread reruns validation for every strategy. It's a tradeoff between single-threaded and multithreaded performance. I'm defaulting to prioritizing single-threaded performance, but I could see us changing this in the future (or now, if people have preferences).

Member:

I'd prefer to keep prioritizing single-threaded perf for now, at least when it's not super-lopsided.

Potential caveat in this case: is it safe to concurrently run validation for a strategy in multiple threads? If not, we should probably go with the smallish cost of a lock.

@tybug (Member, Author):

Concurrent validation should be safe, yeah (in theory, and I've tested a full run in practice).

@tybug (Member, Author) commented Jul 15, 2025

I went with the approach described in #4473 (comment). Benchmark:

validate:  master
validate2: pr-threading-ident
validate3: pr-locks

[benchmark image: per-call timing comparison of the three variants]

@tybug tybug merged commit ad9d60c into HypothesisWorks:master Jul 15, 2025
120 of 121 checks passed
@tybug tybug deleted the free-threading-strategy-validation branch July 15, 2025 23:20
spencerkclark added a commit to pydata/xarray that referenced this pull request Jul 17, 2025
In the next version of `hypothesis`, subclasses of `hypothesis.strategies.SearchStrategy` will be required to call `super().__init__()` in their `__init__` method (HypothesisWorks/hypothesis#4473). This PR addresses this in the two subclasses in our codebase: `CFTimeStrategy` and `CFTimeStrategyISO8601`.

Apparently this kind of subclassing is not actually part of the public API ([link](https://github.com/HypothesisWorks/hypothesis/pull/4473/files#diff-9abc0311b216f25f0b71cfff6b7043b22071d09a58cb949f6bc5022ddeaa8e7f)), so maybe we should adjust the approach here long term, but this at least gets the tests passing for now.

- [x] Closes #10541