Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace String with SmolStr for nodes Identifier and ExprName #9202

Closed
wants to merge 7 commits into from

Conversation

LaBatata101
Copy link
Contributor

Summary

Changes the type of the id field in the Identifier and ExprName nodes from String to SmolStr. This is a required change for the new parser.

This PR is part of #9152

Test Plan

cargo test --lib

@charliermarsh charliermarsh added performance Potential performance improvement parser Related to the parser labels Dec 19, 2023
Copy link
Contributor

github-actions bot commented Dec 19, 2023

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

The code changes look good to me. The two main questions that I have are:

  • How did you decide on smol_str over any of the other small string crates? For example, compact_str supports up to 24 bytes and 12 bytes on 32-bit architectures. But I'm unfamiliar with all the crates, so there might be other considerations. The intention isn't that you change the small string crate, but that we document how we made the decision in case we want to re-visit it in the future.
  • How did you decide on where to use SmolStr? E.g. you use it for Names and Identifiers but not for String literals.

@@ -131,7 +131,7 @@ fn get_undecorated_methods(

if let Expr::Name(ast::ExprName { id, .. }) = &arguments.args[0] {
if target_name == *id {
explicit_decorator_calls.insert(id.clone(), stmt.range());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub doesn't allow me to make a suggestion on the entire code. We can avoid calling to_string here by changing explicit_decorator_calls to HashMap<&str, TextRange>.

The id.to_string call on linne 128 also seems unnecessary (we can use the SmolStr directly or use as_str)

@@ -1739,7 +1740,7 @@ impl From<ExprStarred> for Expr {
#[derive(Clone, Debug, PartialEq)]
pub struct ExprName {
pub range: TextRange,
pub id: String,
pub id: SmolStr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you decide where to use SmolStr? Should ExprIpyEscapeCommand, StringLiteral, and FStringElement use SmolStr too or is it intentional that they do not because they're less likely to be short?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already answered most of this question in another comment, but I didn't mention ExprIpyEscapeCommand. I haven't seen many examples of real world use of ExprIpyEscapeCommand so I don't know if they are usually short, like Names and Identifiers. Can't really tell if they are a suitable candidate to use SmolStr.

@LaBatata101
Copy link
Contributor Author

How did you decide on smol_str over any of the other small string crates? For example, compact_str supports up to 24 bytes and 12 bytes on 32-bit architectures. But I'm unfamiliar with all the crates, so there might be other considerations. The intention isn't that you change the small string crate, but that we document how we made the decision in case we want to re-visit it in the future.

I haven't investigated much about the different string crates, I found out about smol_str when l was looking into rust-analyzer parser source code. I tested in the new parser to see if it had any performance improvement, and it gave a slight performance increase IIRC, that was the reason I chose it 🤣. But it's probably a good idea to do a proper comparison between different string crates in the future.

How did you decide on where to use SmolStr? E.g. you use it for Names and Identifiers but not for String literals.

Since SmolStr allocates strings up to 23 bytes long on the stack, and strings in Python are also used for documentation, they can get quite large, so I didn't see much benefit from using SmolStr for string literals. However, for Names and Identifiers, which are usually small, and are a common token in the code, using SmolStr for them will, probably, save quite a lot of allocations.

@charliermarsh
Copy link
Member

Thank for separating this out!

@charliermarsh
Copy link
Member

Do you mind either merging main (there's a type error against new changes), or giving me edit privileges on this branch to clean up prior to merging?

@LaBatata101
Copy link
Contributor Author

Do you mind either merging main (there's a type error against new changes), or giving me edit privileges on this branch to clean up prior to merging?

I think you already have edit privileges on this branch

@charliermarsh
Copy link
Member

Ah right, thank you, it was an HTTPS vs. SSH thing.

@charliermarsh charliermarsh force-pushed the smolstr-id branch 2 times, most recently from f06094b to ee373e3 Compare December 20, 2023 21:31
@BurntSushi
Copy link
Member

It looks like there is almost no difference in the micro-benchmarks? Is this change still worth doing?

Tok::Name {
name: name.to_owned(),
}
Tok::Name { name: name.into() }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Tok::Name { name: name.into() }
Tok::Name { name: ComactString::new_inline(name) }

@MichaReiser
Copy link
Member

It looks like there is almost no difference in the micro-benchmarks? Is this change still worth doing?

That's the observation I made as well when I did this change a while back #6174

I wouldn't expect any changes for all benchmarks other than the lexer because our (old) parser allocates way too much, so any other improvements are lost in the noise.

I'm surprised that the lexer benchmarks are regressing. But the microbenchmarks might just be too short and exercise the exact same memory pattern for each run, allowing the allocator to re-use the allocations cheaply.

Or the change indeed is not as impactful as we thought.

@MichaReiser
Copy link
Member

MichaReiser commented Dec 21, 2023

Using CompactString seems more promising (but the API is more annoying). The main difference to SmolStr is that:

  • CompactString::clone is O(n) compared to SmolStr's O(1). Cloning in O(1) could be useful but we don't see any significant benefits from it from our existing benchmarks. Maybe we would need to update some downstream code to use SmolStr too.
  • CompactString has no optimisation to store more than 23 whitespace only characters on the stack. This is an optimisation that we don't need.

@MichaReiser
Copy link
Member

Copying this here from the Discord conversation with @LaBatata101

My current thinking on small-string crates:

  • SmolStr shows too little benefits at the moment to justify adding a new dependency. That's why I wouldn't merge the PR just yet. We can revisit after switching the parser to see if it brings greater benefits (I'm doubtful but performance is hard to predict)
    CompactStr: It improves the lexer performance by about 5%, but the speedups don't manifest in parser or linter improvements. I suspect that it is because our current parser is slow, and that it reduces the number of allocations for an Identifier from 2 to one, which is significant but maybe not significant enough. The API is a bit cumbersome when it comes to comparing with str, which raises the entry barrier for new contributors (applies in general, yet another string type in Rust for them to understand)
  • Hipstr: As @LaBatata101 pointed out, it requires adding lifetimes to Token and all Nodes which is too big a refactor for now.

I'm leaning towards deferring the refactor until we've switched to the new parser to re-evaluate the benefits.
It's also worth considering alternatives or combining it with other improvements reducing the number of allocations. For example, we could explore using an Arena for the AST. This greatly reduces the number of allocations (and removes the Drop time), with the result that allocating for Names proportionally becomes more significant (and removing it yields greater performance improvements).

What I haven't analyzed yet is whether using a small string crate improves Ruff's memory consumption. A reduced memory consumption on its own would be justification enough for me to switch to a small string crate.

@LaBatata101 do you want to analyze the memory consumption or should we move forward without the small string optimisation for now (and close our both PRs?)

@LaBatata101
Copy link
Contributor Author

do you want to analyze the memory consumption or should we move forward without the small string optimisation for now (and close our both PRs?)

I think it's better to close both of our PRs for now, and do an in-depth analysis after the new parser is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parser Related to the parser performance Potential performance improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants