[ty] Add more benchmarks #18714
Conversation
sharkdp left a comment:
Very much looking forward to this — thank you!
I didn't really review the project selection as such. Having this in place is the important part. Iterating on which projects we want to use is probably a continuous process.
```rust
assert!(
    diagnostics > 1 && diagnostics <= max_diagnostics,
    "Expected between {} and {} diagnostics but got {}",
    1,
    max_diagnostics,
    diagnostics
);
```
Why do we have this? As some kind of low-fidelity filter that we do "actual type checking work" on the project? I'm not opposed to it, but it might require updating the thresholds from time to time?
Yeah, it's more of a stopgap to ensure the dependencies are correctly installed and that we actually check any files (although that would probably be visible now :))
I also think that it will be useful when bumping the commit. A sudden increase could indicate that the dependencies need to be updated.
I was quite confused for a while because I couldn't see where these larger projects were being reported on codspeed, but I eventually figured out that you need to click the "wall time" tab.
I assume if we significantly regress one of the walltime benchmarks, codspeed will post a comment on the PR to tell us off about it, the same as it does with the instrumented benchmarks? I don't feel like I have a good understanding of what the practical difference is between using a walltime runner and using an instrumented runner, but I trust you that this is the correct choice for the larger projects :-)
They seem reasonable to me (definitely good enough to land now). anyio and attrs are both very popular projects, and I'm pretty sure hydra-zen has caused performance issues for type checkers in the past! @hauntsaninja or @erictraut -- I don't suppose either of you have suggestions for smaller projects that have caused performance issues for mypy or pyright, which would be good for us to include in our benchmark suite?
My understanding is the following. The benchmarks that we ran on codspeed so far were "instrumented". What this means is that codspeed runs the executable using valgrind (i.e. on a virtual CPU). Instead of measuring actual time, valgrind counts CPU instructions and then converts them to a time (using some arbitrary but fixed clock speed). This approach has the benefit that it's noise-free, and the benchmark only needs to run once (but on a virtual CPU, which is much, much slower than an actual CPU). The big disadvantage is that you don't see the real impact of IO operations. I believe codspeed makes some assumptions about the time it takes for various syscalls to complete.

Walltime runners use much less magic. They execute the benchmark on an actual CPU and really measure wall clock time. This process is inherently noisy, even if they certainly have measures to reduce noise. That's why the benchmark should ideally run multiple times to gather some statistics. The disadvantage here is that this is harder to measure accurately and reliably. The advantage is that it includes IO operations and reflects the actual runtime that a user would see more directly.
Thanks @sharkdp for the very detailed comparison. In this case, the only reason for using walltime is that the instrumented runners are simply too slow. On the other hand, having some wall-time runners is nice (and e.g. revealed a salsa performance issue when using multiple dbs).
I'll go ahead and merge this. We can easily change the projects in follow-up PRs, and I'm also curious to see how stable or flaky the benchmarks are in practice.
The CI error seems unrelated to this PR. |
* main: (68 commits)
  * Unify `OldDiagnostic` and `Message` (#18391)
  * [`pylint`] Detect more exotic NaN literals in `PLW0177` (#18630)
  * [`flake8-async`] Mark autofix for `ASYNC115` as unsafe if the call expression contains comments (#18753)
  * [`flake8-bugbear`] Mark autofix for `B004` as unsafe if the `hasattr` call expr contains comments (#18755)
  * Enforce `pytest` import for decorators (#18779)
  * [`flake8-comprehension`] Mark autofix for `C420` as unsafe if there's comments inside the dict comprehension (#18768)
  * [flake8-async] fix detection for large integer sleep durations in `ASYNC116` rule (#18767)
  * Update dependency ruff to v0.12.0 (#18790)
  * Update taiki-e/install-action action to v2.53.2 (#18789)
  * Add lint rule for calling chmod with non-octal integers (#18541)
  * Mark `RET501` fix unsafe if comments are inside (#18780)
  * Use `LintContext::report_diagnostic_if_enabled` in `check_tokens` (#18769)
  * [UP008]: use `super()`, not `__super__` in error messages (#18743)
  * Use Depot Windows runners for `cargo test` (#18754)
  * Run ty benchmarks when `ruff_benchmark` changes (#18758)
  * Disallow newlines in format specifiers of single quoted f- or t-strings (#18708)
  * [ty] Add more benchmarks (#18714)
  * [ty] Anchor all exclude patterns (#18685)
  * Include changelog reference for other major versions (#18745)
  * Use updated pre-commit id (#18718)
  * ...
I don't know about smaller, but you could try some of: colour-science/colour |
Ah yeah, this one in particular might add some coverage that we don't currently have IIRC @MichaReiser -- they do a lot of stuff with |

Summary
This PR adds some larger mypy primer projects to our benchmarks. Smaller projects run on codspeed's instrumented runners. Larger projects run on codspeed's walltime runner because they take too long on the instrumented runner.
I had to use divan instead of criterion for the walltime runner because criterion requires at least 10 iterations for each benchmark, which is too long for our use case. Divan allows us to use arbitrarily few iterations (okay, at least one). I picked a sample of 3x1 (3 iterations), which completes in a reasonable time and is hopefully stable enough not to be noisy. One added advantage of using the walltime runner is that we can measure the end-to-end performance, including file discovery.
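For reference, a minimal sketch of what a divan benchmark with a 3x1 sample configuration can look like. The `check_project` helper and the `anyio` benchmark name are hypothetical placeholders, not the actual code added in this PR:

```rust
fn main() {
    // Collect and run every function annotated with `#[divan::bench]`.
    divan::main();
}

// Hypothetical stand-in for running ty over a checked-out project and
// returning the number of reported diagnostics.
fn check_project(name: &str) -> usize {
    divan::black_box(name).len()
}

// 3 samples of 1 iteration each ("3x1"), so a slow end-to-end check
// only runs a handful of times per benchmark.
#[divan::bench(sample_count = 3, sample_size = 1)]
fn anyio() -> usize {
    check_project("anyio")
}
```

Returning a value from the benchmark function lets divan black-box the result so the work isn't optimized away.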
Keeping the smaller projects on instrumented runners has the advantage that the results will be more precise (and we're charged less).
The smaller projects are picked arbitrarily. I'm happy to change them if someone has other suggestions.
I picked the larger projects mainly based on @AlexWaygood's suggestions (thank you!).
Both benchmark jobs now complete in roughly 10 minutes (compared to 6 minutes before).
For the new wall-time benchmarks, see https://codspeed.io/astral-sh/ruff/branches/micha%2Freal-world-benchmarks
Part of astral-sh/ty#240