
Conversation

@MichaReiser (Member) commented Jun 17, 2025

Summary

This PR adds some larger mypy_primer projects to our benchmarks. Smaller projects run on codspeed's instrumented runners. Larger projects run on codspeed's walltime runner because they take too long on the instrumented runners.

I had to use divan instead of criterion for the walltime runner because criterion requires at least 10 iterations for each benchmark, which is too long for our use case. Divan allows us to use arbitrarily few iterations (okay, at least one). I picked a sample of 3x1 (3 iterations), which completes in a reasonable time and hopefully is stable enough not to be noisy. One added advantage of using the walltime runner is that we can measure the end-to-end performance, including file discovery.
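
For reference, a minimal sketch of what a 3x1 divan benchmark can look like (this assumes divan as a dev-dependency; `sample_count`/`sample_size` are divan's attribute options, and the workload below is just a placeholder, not the actual ty benchmark):

fn main() {
    // Collect and run every function annotated with `#[divan::bench]`.
    divan::main();
}

// 3 samples of 1 iteration each, so the expensive workload runs only three times.
#[divan::bench(sample_count = 3, sample_size = 1)]
fn check_project() -> u64 {
    // Placeholder workload standing in for an end-to-end check of a project;
    // returning the result keeps it from being optimized away.
    (0..10_000_000u64).map(|x| x.wrapping_mul(31)).sum()
}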

Keeping the smaller projects on the instrumented runners has the advantage that the results will be more precise (and we're charged less).

The smaller projects are picked arbitrarily. I'm happy to change them if someone has other suggestions.

I picked the larger projects mainly based on @AlexWaygood's suggestions (thank you!):

  • altair: That's one I picked. It was pointed out in ty#240 (Add a benchmark involving realistic code that creates large unions) that it makes heavy use of TypedDicts
  • colour-science: very large unions
  • Pydantic: Pydantic creates large unions of TypedDicts, which can often be slow to type-check
  • Hydra-zen and sympy: Extensive use of generics.
  • Pandas: A lot of very, very large classes with a lot of implicit instance attributes.

Both benchmark jobs now complete in roughly 10 minutes (compared to 6 minutes before).

For the new wall-time benchmarks, see https://codspeed.io/astral-sh/ruff/branches/micha%2Freal-world-benchmarks

Part of astral-sh/ty#240

@MichaReiser MichaReiser added the ty Multi-file analysis & type inference label Jun 17, 2025
github-actions bot commented Jun 17, 2025

mypy_primer results

No ecosystem changes detected ✅

github-actions bot commented Jun 17, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch from 5fc1f43 to 6b0f777 Compare June 17, 2025 08:53
@MichaReiser MichaReiser changed the title Add real world benchmarks → [tuAdd real world benchmarks Jun 17, 2025
@MichaReiser MichaReiser changed the title [tuAdd real world benchmarks → [ty] Add larger benchmarks Jun 17, 2025
@MichaReiser MichaReiser added the testing Related to testing Ruff itself label Jun 17, 2025
@MichaReiser MichaReiser changed the title [ty] Add larger benchmarks → [ty] Add more benchmarks Jun 17, 2025
@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch from 09a37d4 to a78c609 Compare June 17, 2025 12:05
@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch from adaa735 to 7cdfea4 Compare June 17, 2025 12:20
@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch from 1eddff6 to a669a2a Compare June 17, 2025 14:15
@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch 2 times, most recently from 38b7b42 to 9a6a491 Compare June 17, 2025 15:30
@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch from 9a6a491 to 7d13208 Compare June 17, 2025 15:41
@sharkdp (Contributor) left a comment

Very much looking forward to this — thank you!

I didn't really review the project selection as such. Having this in place is the important part. Iterating on which projects we want to use is probably a continuous process.

Comment on lines +408 to +414
assert!(
    diagnostics > 1 && diagnostics <= max_diagnostics,
    "Expected between {} and {} diagnostics but got {}",
    1,
    max_diagnostics,
    diagnostics
);
@sharkdp (Contributor)

Why do we have this? As some kind of low-fidelity filter confirming that we do "actual type checking work" on the project? I'm not opposed to it, but it might require updating the thresholds from time to time?

@MichaReiser (Member, Author)

Yeah, it's more a stop-gap to ensure the dependencies are correctly installed and that we actually check any files (although that would probably be visible now :))

@MichaReiser (Member, Author)

I also think that it will be useful when bumping the commit. A sudden increase could indicate that the dependencies need to be updated.

@AlexWaygood (Member)

I picked the larger projects mainly on @AlexWaygood suggestions (thank you!)

I was quite confused for a while because I couldn't see where these larger projects were being reported on codspeed, but I eventually figured out that you need to click the "wall time" tab here:

[Screenshot: codspeed's "wall time" tab, 2025-06-18 11:02]

I assume if we significantly regress one of the walltime benchmarks, codspeed will post a comment on the PR to tell us off about it, the same as it does with the instrumented benchmarks?

I don't feel like I have a good understanding of what the practical difference is between using a walltime runner and using an instrumented runner, but I trust you that this is the correct choice for the larger projects :-)

@AlexWaygood (Member) commented Jun 18, 2025

The smaller projects are picked arbitrarily. I'm happy to change them if someone has other suggestions.

They seem reasonable to me (definitely good enough to land now). anyio and attrs are both very popular projects, and I'm pretty sure hydra-zen has caused performance issues for type checkers in the past!

@hauntsaninja or @erictraut -- I don't suppose either of you have suggestions for smaller projects that have caused performance issues for mypy or pyright, which would be good for us to include in our benchmark suite?

@sharkdp (Contributor) commented Jun 18, 2025

I don't feel like I have a good understanding of what the practical difference is between using a walltime runner and using an instrumented runner, but I trust you that this is the correct choice for the larger projects :-)

My understanding is the following. The benchmarks that we ran on codspeed so far were "instrumented". What this means is that codspeed runs the executable using valgrind (i.e. on a virtual CPU). Instead of measuring actual time, valgrind counts CPU instructions and then converts them to a time (using some arbitrary but fixed clock speed). This approach has the benefit that it's noise-free, and the benchmark only needs to run once (but on a virtual CPU, which is much much slower than on an actual CPU). The big disadvantage is that you don't see the real impact of IO operations. I believe codspeed makes some assumptions about the time it takes for various syscalls to complete (how long does a read operation take if it reads x bytes, …), but this is just guesswork.

Walltime runners use much less magic. They execute the benchmark on an actual CPU and really measure wall clock time. This process is inherently noisy, even if they certainly have measures to reduce noise. That's why the benchmark should ideally run multiple times to gather some statistics. The disadvantage here is that this is harder to measure accurately and reliably. The advantage is that it includes IO operations and reflects the actual runtime that a user would see more directly.

@MichaReiser (Member, Author)

Thanks @sharkdp for the very detailed comparison. In this case, the only reason for using walltime is that the instrumented runners are simply too slow. On the other hand, having some wall-time runners is nice (and e.g. revealed a salsa performance issue when using multiple dbs).

@MichaReiser (Member, Author)

I'll go ahead and merge this. We can easily change the projects in follow-up PRs, and I'm also curious to see how stable/flaky the benchmarks are (practice will show).

@MichaReiser MichaReiser merged commit 23261a3 into main Jun 18, 2025
35 of 36 checks passed
@MichaReiser MichaReiser deleted the micha/real-world-benchmarks branch June 18, 2025 11:41
@MichaReiser (Member, Author)

The CI error seems unrelated to this PR.

dcreager added a commit that referenced this pull request Jun 19, 2025
* main: (68 commits)
  Unify `OldDiagnostic` and `Message` (#18391)
  [`pylint`] Detect more exotic NaN literals in `PLW0177` (#18630)
  [`flake8-async`] Mark autofix for `ASYNC115` as unsafe if the call expression contains comments (#18753)
  [`flake8-bugbear`] Mark autofix for `B004` as unsafe if the `hasattr` call expr contains comments (#18755)
  Enforce `pytest` import for decorators (#18779)
  [`flake8-comprehension`] Mark autofix for `C420` as unsafe if there's comments inside the dict comprehension (#18768)
  [flake8-async] fix detection for large integer sleep durations in `ASYNC116` rule (#18767)
  Update dependency ruff to v0.12.0 (#18790)
  Update taiki-e/install-action action to v2.53.2 (#18789)
  Add lint rule for calling chmod with non-octal integers (#18541)
  Mark `RET501` fix unsafe if comments are inside (#18780)
  Use `LintContext::report_diagnostic_if_enabled` in `check_tokens` (#18769)
  [UP008]: use `super()`, not `__super__` in error messages (#18743)
  Use Depot Windows runners for `cargo test` (#18754)
  Run ty benchmarks when `ruff_benchmark` changes (#18758)
  Disallow newlines in format specifiers of single quoted f- or t-strings (#18708)
  [ty] Add more benchmarks (#18714)
  [ty] Anchor all exclude patterns (#18685)
  Include changelog reference for other major versions (#18745)
  Use updated pre-commit id (#18718)
  ...
@hauntsaninja (Contributor)

I don't know about smaller, but you could try some of:

colour-science/colour
home-assistant/core
pandas-dev/pandas
sympy/sympy
pydantic/pydantic
Rapptz/discord.py
FasterSpeeding/Tanjun
python/mypy

@AlexWaygood (Member) commented Jun 20, 2025

FasterSpeeding/Tanjun

Ah yeah, this one in particular might add some coverage that we don't currently have, @MichaReiser -- IIRC they do a lot of stuff with ParamSpecs
