Replace `String` with `SmolStr` for nodes `Identifier` and `ExprName` #9202

LaBatata101 · 2023-12-19T21:11:48Z

Summary

Changes the type of the id field in the Identifier and ExprName nodes from String to SmolStr. This is a required change for the new parser.

This PR is part of #9152

Test Plan

cargo test --lib

github-actions · 2023-12-19T21:43:29Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

MichaReiser

Thank you.

The code changes look good to me. The two main questions that I have are:

How did you decide on smol_str over any of the other small string crates? For example, compact_str supports up to 24 bytes and 12 bytes on 32-bit architectures. But I'm unfamiliar with all the crates, so there might be other considerations. The intention isn't that you change the small string crate, but that we document how we made the decision in case we want to re-visit it in the future.
How did you decide on where to use SmolStr? E.g. you use it for Names and Identifiers but not for String literals.

crates/ruff_linter/src/rules/pyflakes/rules/strings.rs

MichaReiser · 2023-12-19T23:18:08Z

crates/ruff_linter/src/rules/pylint/rules/no_method_decorator.rs

@@ -131,7 +131,7 @@ fn get_undecorated_methods(

                        if let Expr::Name(ast::ExprName { id, .. }) = &arguments.args[0] {
                            if target_name == *id {
-                                explicit_decorator_calls.insert(id.clone(), stmt.range());


GitHub doesn't allow me to make a suggestion on the entire code. We can avoid calling to_string here by changing explicit_decorator_calls to HashMap<&str, TextRange>.

The id.to_string call on linne 128 also seems unnecessary (we can use the SmolStr directly or use as_str)

MichaReiser · 2023-12-19T23:35:07Z

crates/ruff_python_ast/src/nodes.rs

@@ -1739,7 +1740,7 @@ impl From<ExprStarred> for Expr {
 #[derive(Clone, Debug, PartialEq)]
 pub struct ExprName {
    pub range: TextRange,
-    pub id: String,
+    pub id: SmolStr,


How did you decide where to use SmolStr? Should ExprIpyEscapeCommand, StringLiteral, and FStringElement use SmolStr too or is it intentional that they do not because they're less likely to be short?

I already answered most of this question in another comment, but I didn't mention ExprIpyEscapeCommand. I haven't seen many examples of real world use of ExprIpyEscapeCommand so I don't know if they are usually short, like Names and Identifiers. Can't really tell if they are a suitable candidate to use SmolStr.

LaBatata101 · 2023-12-20T20:58:36Z

How did you decide on smol_str over any of the other small string crates? For example, compact_str supports up to 24 bytes and 12 bytes on 32-bit architectures. But I'm unfamiliar with all the crates, so there might be other considerations. The intention isn't that you change the small string crate, but that we document how we made the decision in case we want to re-visit it in the future.

I haven't investigated much about the different string crates, I found out about smol_str when l was looking into rust-analyzer parser source code. I tested in the new parser to see if it had any performance improvement, and it gave a slight performance increase IIRC, that was the reason I chose it 🤣. But it's probably a good idea to do a proper comparison between different string crates in the future.

How did you decide on where to use SmolStr? E.g. you use it for Names and Identifiers but not for String literals.

Since SmolStr allocates strings up to 23 bytes long on the stack, and strings in Python are also used for documentation, they can get quite large, so I didn't see much benefit from using SmolStr for string literals. However, for Names and Identifiers, which are usually small, and are a common token in the code, using SmolStr for them will, probably, save quite a lot of allocations.

charliermarsh · 2023-12-20T20:59:25Z

Thank for separating this out!

charliermarsh · 2023-12-20T21:08:17Z

Do you mind either merging main (there's a type error against new changes), or giving me edit privileges on this branch to clean up prior to merging?

LaBatata101 · 2023-12-20T21:21:37Z

Do you mind either merging main (there's a type error against new changes), or giving me edit privileges on this branch to clean up prior to merging?

I think you already have edit privileges on this branch

charliermarsh · 2023-12-20T21:26:08Z

Ah right, thank you, it was an HTTPS vs. SSH thing.

BurntSushi · 2023-12-20T22:17:04Z

It looks like there is almost no difference in the micro-benchmarks? Is this change still worth doing?

MichaReiser · 2023-12-20T23:50:58Z

crates/ruff_python_parser/src/soft_keywords.rs

-    Tok::Name {
-        name: name.to_owned(),
-    }
+    Tok::Name { name: name.into() }


Suggested change

Tok::Name { name: name.into() }

Tok::Name { name: ComactString::new_inline(name) }

MichaReiser · 2023-12-20T23:53:33Z

It looks like there is almost no difference in the micro-benchmarks? Is this change still worth doing?

That's the observation I made as well when I did this change a while back #6174

I wouldn't expect any changes for all benchmarks other than the lexer because our (old) parser allocates way too much, so any other improvements are lost in the noise.

I'm surprised that the lexer benchmarks are regressing. But the microbenchmarks might just be too short and exercise the exact same memory pattern for each run, allowing the allocator to re-use the allocations cheaply.

Or the change indeed is not as impactful as we thought.

MichaReiser · 2023-12-21T10:38:43Z

Using CompactString seems more promising (but the API is more annoying). The main difference to SmolStr is that:

CompactString::clone is O(n) compared to SmolStr's O(1). Cloning in O(1) could be useful but we don't see any significant benefits from it from our existing benchmarks. Maybe we would need to update some downstream code to use SmolStr too.
CompactString has no optimisation to store more than 23 whitespace only characters on the stack. This is an optimisation that we don't need.

MichaReiser · 2023-12-22T03:41:55Z

Copying this here from the Discord conversation with @LaBatata101

My current thinking on small-string crates:

SmolStr shows too little benefits at the moment to justify adding a new dependency. That's why I wouldn't merge the PR just yet. We can revisit after switching the parser to see if it brings greater benefits (I'm doubtful but performance is hard to predict)
CompactStr: It improves the lexer performance by about 5%, but the speedups don't manifest in parser or linter improvements. I suspect that it is because our current parser is slow, and that it reduces the number of allocations for an Identifier from 2 to one, which is significant but maybe not significant enough. The API is a bit cumbersome when it comes to comparing with str, which raises the entry barrier for new contributors (applies in general, yet another string type in Rust for them to understand)

Hipstr: As @LaBatata101 pointed out, it requires adding lifetimes to Token and all Nodes which is too big a refactor for now.

I'm leaning towards deferring the refactor until we've switched to the new parser to re-evaluate the benefits.
It's also worth considering alternatives or combining it with other improvements reducing the number of allocations. For example, we could explore using an Arena for the AST. This greatly reduces the number of allocations (and removes the Drop time), with the result that allocating for Names proportionally becomes more significant (and removing it yields greater performance improvements).

What I haven't analyzed yet is whether using a small string crate improves Ruff's memory consumption. A reduced memory consumption on its own would be justification enough for me to switch to a small string crate.

@LaBatata101 do you want to analyze the memory consumption or should we move forward without the small string optimisation for now (and close our both PRs?)

LaBatata101 · 2023-12-23T20:48:19Z

do you want to analyze the memory consumption or should we move forward without the small string optimisation for now (and close our both PRs?)

I think it's better to close both of our PRs for now, and do an in-depth analysis after the new parser is merged.

LaBatata101 and others added 3 commits December 19, 2023 18:04

Replace String with SmolStr for nodes Identifier and ExprName

d6a2083

Fix clippy lint

bdd8f54

Avoid allocations

687dc2f

charliermarsh added performance Potential performance improvement parser Related to the parser labels Dec 19, 2023

Avoid more allocations

fdbb201

MichaReiser reviewed Dec 19, 2023

View reviewed changes

MichaReiser approved these changes Dec 19, 2023

View reviewed changes

charliermarsh approved these changes Dec 20, 2023

View reviewed changes

Merge branch 'main' into smolstr-id

4cd4c03

charliermarsh force-pushed the smolstr-id branch 2 times, most recently from f06094b to ee373e3 Compare December 20, 2023 21:31

charliermarsh added 2 commits December 20, 2023 16:31

Fix type error

ee373e3

Fix Clippy

1969019

MichaReiser reviewed Dec 20, 2023

View reviewed changes

LaBatata101 closed this Dec 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace `String` with `SmolStr` for nodes `Identifier` and `ExprName` #9202

Replace `String` with `SmolStr` for nodes `Identifier` and `ExprName` #9202

LaBatata101 commented Dec 19, 2023

github-actions bot commented Dec 19, 2023 •

edited

Loading

MichaReiser left a comment •

edited

Loading

MichaReiser Dec 19, 2023

MichaReiser Dec 19, 2023

LaBatata101 Dec 20, 2023

LaBatata101 commented Dec 20, 2023

charliermarsh commented Dec 20, 2023

charliermarsh commented Dec 20, 2023

LaBatata101 commented Dec 20, 2023

charliermarsh commented Dec 20, 2023

BurntSushi commented Dec 20, 2023

MichaReiser Dec 20, 2023

MichaReiser commented Dec 20, 2023

MichaReiser commented Dec 21, 2023 •

edited

Loading

MichaReiser commented Dec 22, 2023

LaBatata101 commented Dec 23, 2023

	Tok::Name { name: name.into() }
	Tok::Name { name: ComactString::new_inline(name) }

Replace String with SmolStr for nodes Identifier and ExprName #9202

Replace String with SmolStr for nodes Identifier and ExprName #9202

Conversation

LaBatata101 commented Dec 19, 2023

Summary

Test Plan

github-actions bot commented Dec 19, 2023 • edited Loading

ruff-ecosystem results

Linter (stable)

Linter (preview)

Formatter (stable)

Formatter (preview)

MichaReiser left a comment • edited Loading

Choose a reason for hiding this comment

MichaReiser Dec 19, 2023

Choose a reason for hiding this comment

MichaReiser Dec 19, 2023

Choose a reason for hiding this comment

LaBatata101 Dec 20, 2023

Choose a reason for hiding this comment

LaBatata101 commented Dec 20, 2023

charliermarsh commented Dec 20, 2023

charliermarsh commented Dec 20, 2023

LaBatata101 commented Dec 20, 2023

charliermarsh commented Dec 20, 2023

BurntSushi commented Dec 20, 2023

MichaReiser Dec 20, 2023

Choose a reason for hiding this comment

MichaReiser commented Dec 20, 2023

MichaReiser commented Dec 21, 2023 • edited Loading

MichaReiser commented Dec 22, 2023

LaBatata101 commented Dec 23, 2023

Replace `String` with `SmolStr` for nodes `Identifier` and `ExprName` #9202

Replace `String` with `SmolStr` for nodes `Identifier` and `ExprName` #9202

github-actions bot commented Dec 19, 2023 •

edited

Loading

`ruff-ecosystem` results

MichaReiser left a comment •

edited

Loading

MichaReiser commented Dec 21, 2023 •

edited

Loading