
Conversation

@MichaReiser (Member) commented Jun 17, 2025

Summary

This PR adds some larger mypy_primer projects to our benchmarks. Smaller projects run on codspeed's instrumented runners. Larger projects run on codspeed's walltime runner because they take too long on the instrumented runners.

I had to use divan instead of criterion for the walltime runner because criterion requires at least 10 iterations for each benchmark, which is too long for our use case. Divan allows us to use arbitrarily few iterations (okay, at least one). I picked a sample of 3x1 (3 iterations), which completes in a reasonable time and hopefully is stable enough not to be noisy. One added advantage of using the walltime runner is that we can measure the end-to-end performance, including file discovery.
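
For reference, a minimal sketch of what a 3x1 divan benchmark can look like (this assumes divan as a dev-dependency; `sample_count`/`sample_size` are divan's attribute options, and the workload below is just a placeholder, not the actual ty benchmark):

fn main() {
    // Collect and run every function annotated with `#[divan::bench]`.
    divan::main();
}

// 3 samples of 1 iteration each, so the expensive workload runs only three times.
#[divan::bench(sample_count = 3, sample_size = 1)]
fn check_project() -> u64 {
    // Placeholder workload standing in for an end-to-end check of a project;
    // returning the result keeps it from being optimized away.
    (0..10_000_000u64).map(|x| x.wrapping_mul(31)).sum()
}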

Keeping the smaller projects on the instrumented runners has the advantage that the results will be more precise (and we're charged less).

The smaller projects are picked arbitrarily. I'm happy to change them if someone has other suggestions.

I picked the larger projects mainly based on @AlexWaygood's suggestions (thank you!):

  • altair: That's one I picked. It was pointed out in ty#240 (Add a benchmark involving realistic code that creates large unions) that it makes heavy use of TypedDicts
  • colour-science: very large unions
  • Pydantic: Pydantic creates large unions of TypedDicts, which can often be slow to type-check
  • Hydra-zen and sympy: Extensive use of generics.
  • Pandas: A lot of very, very large classes with a lot of implicit instance attributes.

Both benchmark jobs now complete in roughly 10 minutes (compared to 6 minutes before).

For the new wall-time benchmarks, see https://codspeed.io/astral-sh/ruff/branches/micha%2Freal-world-benchmarks

Part of astral-sh/ty#240

@MichaReiser MichaReiser added the ty Multi-file analysis & type inference label Jun 17, 2025
github-actions bot commented Jun 17, 2025

mypy_primer results

No ecosystem changes detected ✅

github-actions bot commented Jun 17, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch from 5fc1f43 to 6b0f777 Compare June 17, 2025 08:53
@MichaReiser MichaReiser changed the title Add real world benchmarks → [tuAdd real world benchmarks Jun 17, 2025
@MichaReiser MichaReiser changed the title [tuAdd real world benchmarks → [ty] Add larger benchmarks Jun 17, 2025
@MichaReiser MichaReiser added the testing Related to testing Ruff itself label Jun 17, 2025
@MichaReiser MichaReiser changed the title [ty] Add larger benchmarks → [ty] Add more benchmarks Jun 17, 2025
@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch from 09a37d4 to a78c609 Compare June 17, 2025 12:05
@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch from adaa735 to 7cdfea4 Compare June 17, 2025 12:20
@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch from 1eddff6 to a669a2a Compare June 17, 2025 14:15
@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch 2 times, most recently from 38b7b42 to 9a6a491 Compare June 17, 2025 15:30
@MichaReiser MichaReiser force-pushed the micha/real-world-benchmarks branch from 9a6a491 to 7d13208 Compare June 17, 2025 15:41
@sharkdp (Contributor) left a comment

Very much looking forward to this — thank you!

I didn't really review the project selection as such. Having this in place is the important part. Iterating on which projects we want to use is probably a continuous process.

Comment on lines +408 to +414
assert!(
    diagnostics > 1 && diagnostics <= max_diagnostics,
    "Expected between {} and {} diagnostics but got {}",
    1,
    max_diagnostics,
    diagnostics
);
@sharkdp (Contributor)

Why do we have this? As some kind of low-fidelity filter confirming that we do "actual type checking work" on the project? I'm not opposed to it, but it might require updating the thresholds from time to time?

@MichaReiser (Member, Author)

Yeah, it's more a stop-gap to ensure the dependencies are correctly installed and that we actually check any files (although that would probably be visible now :))

@MichaReiser (Member, Author)

I also think that it will be useful when bumping the commit. A sudden increase could indicate that the dependencies need to be updated.

@AlexWaygood (Member)

I picked the larger projects mainly on @AlexWaygood suggestions (thank you!)

I was quite confused for a while because I couldn't see where these larger projects were being reported on codspeed, but I eventually figured out that you need to click the "wall time" tab here:

[Screenshot: codspeed's "wall time" tab, 2025-06-18 11:02]

I assume if we significantly regress one of the walltime benchmarks, codspeed will post a comment on the PR to tell us off about it, the same as it does with the instrumented benchmarks?

I don't feel like I have a good understanding of what the practical difference is between using a walltime runner and using an instrumented runner, but I trust you that this is the correct choice for the larger projects :-)

@AlexWaygood (Member) commented Jun 18, 2025

The smaller projects are picked arbitrarily. I'm happy to change them if someone has other suggestions.

They seem reasonable to me (definitely good enough to land now). anyio and attrs are both very popular projects, and I'm pretty sure hydra-zen has caused performance issues for type checkers in the past!

@hauntsaninja or @erictraut -- I don't suppose either of you have suggestions for smaller projects that have caused performance issues for mypy or pyright, which would be good for us to include in our benchmark suite?

@sharkdp (Contributor) commented Jun 18, 2025

I don't feel like I have a good understanding of what the practical difference is between using a walltime runner and using an instrumented runner, but I trust you that this is the correct choice for the larger projects :-)

My understanding is the following. The benchmarks that we ran on codspeed so far were "instrumented". What this means is that codspeed runs the executable using valgrind (i.e. on a virtual CPU). Instead of measuring actual time, valgrind counts CPU instructions and then converts them to a time (using some arbitrary but fixed clock speed). This approach has the benefit that it's noise-free, and the benchmark only needs to run once (but on a virtual CPU, which is much much slower than on an actual CPU). The big disadvantage is that you don't see the real impact of IO operations. I believe codspeed makes some assumptions about the time it takes for various syscalls to complete (how long does a read operation take if it reads x bytes, …), but this is just guesswork.

Walltime runners use much less magic. They execute the benchmark on an actual CPU and really measure wall clock time. This process is inherently noisy, even if they certainly have measures to reduce noise. That's why the benchmark should ideally run multiple times to gather some statistics. The disadvantage here is that this is harder to measure accurately and reliably. The advantage is that it includes IO operations and reflects the actual runtime that a user would see more directly.

@MichaReiser (Member, Author)

Thanks @sharkdp for the very detailed comparison. In this case, the only reason for using walltime is that the instrumented runners are simply too slow. On the other hand, having some wall-time runners is nice (and e.g. revealed a salsa performance issue when using multiple dbs).

@MichaReiser (Member, Author)

I'll go ahead and merge this. We can easily change the projects in follow-up PRs, and I'm also curious to see how stable/flaky the benchmarks are (practice will show).

@MichaReiser MichaReiser merged commit 23261a3 into main Jun 18, 2025
35 of 36 checks passed
@MichaReiser MichaReiser deleted the micha/real-world-benchmarks branch June 18, 2025 11:41
@MichaReiser (Member, Author)

The CI error seems unrelated to this PR.

dcreager added a commit that referenced this pull request Jun 19, 2025
* main: (68 commits)
  Unify `OldDiagnostic` and `Message` (#18391)
  [`pylint`] Detect more exotic NaN literals in `PLW0177` (#18630)
  [`flake8-async`] Mark autofix for `ASYNC115` as unsafe if the call expression contains comments (#18753)
  [`flake8-bugbear`] Mark autofix for `B004` as unsafe if the `hasattr` call expr contains comments (#18755)
  Enforce `pytest` import for decorators (#18779)
  [`flake8-comprehension`] Mark autofix for `C420` as unsafe if there's comments inside the dict comprehension (#18768)
  [flake8-async] fix detection for large integer sleep durations in `ASYNC116` rule (#18767)
  Update dependency ruff to v0.12.0 (#18790)
  Update taiki-e/install-action action to v2.53.2 (#18789)
  Add lint rule for calling chmod with non-octal integers (#18541)
  Mark `RET501` fix unsafe if comments are inside (#18780)
  Use `LintContext::report_diagnostic_if_enabled` in `check_tokens` (#18769)
  [UP008]: use `super()`, not `__super__` in error messages (#18743)
  Use Depot Windows runners for `cargo test` (#18754)
  Run ty benchmarks when `ruff_benchmark` changes (#18758)
  Disallow newlines in format specifiers of single quoted f- or t-strings (#18708)
  [ty] Add more benchmarks (#18714)
  [ty] Anchor all exclude patterns (#18685)
  Include changelog reference for other major versions (#18745)
  Use updated pre-commit id (#18718)
  ...
@hauntsaninja (Contributor)

I don't know about smaller, but you could try some of:

colour-science/colour
home-assistant/core
pandas-dev/pandas
sympy/sympy
pydantic/pydantic
Rapptz/discord.py
FasterSpeeding/Tanjun
python/mypy

@AlexWaygood (Member) commented Jun 20, 2025

FasterSpeeding/Tanjun

Ah yeah, this one in particular might add some coverage that we don't currently have, @MichaReiser -- IIRC they do a lot of stuff with ParamSpecs
