Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unnecessary_join lint #8579

Merged
merged 1 commit into from
Mar 24, 2022
Merged

unnecessary_join lint #8579

merged 1 commit into from
Mar 24, 2022

Conversation

yoav-lavi
Copy link
Contributor

@yoav-lavi yoav-lavi commented Mar 24, 2022

Adds a lint called unnecessary_join that detects cases of .collect::<Vec<String>>.join("") or .collect::<Vec<_>>.join("") on an iterator, suggesting .collect::<String>() instead

Fixes: #8570

This is a reopen of #8573

changelog: add lint [unnecessary_join]

@rust-highfive
Copy link

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @Manishearth (or someone else) soon.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties label Mar 24, 2022
@yoav-lavi
Copy link
Contributor Author

r? @flip1995

@yoav-lavi
Copy link
Contributor Author

@flip1995 this is the same PR only with a squashed commit

@flip1995
Copy link
Member

Thanks! @bors r+

@flip1995
Copy link
Member

@bors r+

Seems like bors lost track of this PR. I re-synced bors, let's hope it works this time.

@bors
Copy link
Contributor

bors commented Mar 24, 2022

📌 Commit b60a7fb has been approved by flip1995

@bors
Copy link
Contributor

bors commented Mar 24, 2022

⌛ Testing commit b60a7fb with merge e29b3b5...

@bors
Copy link
Contributor

bors commented Mar 24, 2022

☀️ Test successful - checks-action_dev_test, checks-action_remark_test, checks-action_test
Approved by: flip1995
Pushing e29b3b5 to master...

@bors bors merged commit e29b3b5 into rust-lang:master Mar 24, 2022
@flip1995
Copy link
Member

Great first contribution!

I noticed today that I never asked first-time contributors about our dev experience. So:

How was your experience as a first time contributor to Clippy? Did you find the docs helpful? How was your experience with our tooling? Is there anything you found in the docs/tooling that was unclear? Any other things we could improve on?

It would be great if you could give me/us a bit of feedback on (some of) those questions. You can also write me on Zulip/Discord/email, if you prefer. However, don't feel obligated to give us feedback, if you don't have time / don't want to :)

@yoav-lavi
Copy link
Contributor Author

yoav-lavi commented Mar 24, 2022

Thank you @flip1995!

It was a good experience! The docs were helpful for getting the project working and working with VSCode / Clion, and cargo dev was very nice to work with.

Here's a list of some things I had to gather from context, some might be due to being relatively new to Rust / compilers (some of these might be in the docs and I missed them):

  • Naming conventions in the project, e.g. "recv" (recursive? receive? if so, what's recursive or being received?), "ty" (type), "tcx" (T context? the lifetime of the context), "bless" (basically snapshot / approve the current stdout / stderr files)
  • The types of contexts and which to use (I recall seeing a general definition in the docs but not why you'd chose one or the other when creating a lint)
  • Where things are (rustc_hir / rustc_ast / rustc_middle etc.)
  • Method lints aren't normal lints and as such aren't covered by the lint generator directly (might be worth it to make an interactive step there and ask which kind of lint the user wants to create)
  • The suggestion in the lint implementation is also what's used for automatic fixing
  • Finding which span is the correct one to use (might be easier if there was some sort of automatic debug / pretty print showing the current match for the lint + which spans cover what within the match, would especially be helpful for with_hi / with_lo situations)
    • This point might be related to the fact that debugging mostly involved printing, the tests I ran (I'm assuming) used a binary and didn't stop on breakpoints
  • Which fields are what when destructing a node (e.g. ExprKind::MethodCall(_path, [join_self_arg, join_arg], _span))
  • Creating a new test (where, what it should include, etc.)
  • Not being able to use a raw string in the lint description might be something others run into
  • When matching a method in method/mod.rs, the name you're matching is last rather than first (e.g. "join" for .collect().join())
  • The fact that a method isn't being handled in methods/mod.rs doesn't mean it shouldn't be there / it's is specifically for iterators rather than vectors / etc. but rather that it wasn't needed until now
  • Knowing whether my lint already existed / was relevant

General suggestions:

  • Example implementations for common lints / operations

Hopefully this helps! I had fun contributing and this is mostly a list of what came to mind and the items aren't in a particular order or were all necessarily large hurdles. I might be able to expand on the list later

@flip1995
Copy link
Member

Thanks for the summary!

I will include some of your review once I finally finish the Clippy book. I copied your comment to #6628 to not forget about it.

I can give you some answers now though:

  • recv = receiver and means the self argument of a method. We have a table of general abbreviations here, but it looks like we should extend this. 'tcx is the lifetime of everything that is connected with type information in rustc. Basically the lifetime of THIR.
  • Where things are: I guess this is just a matter of getting familiar with the compiler. the rustc_dev_guide is a great place to get a high level overview.
  • Knowing whether my lint already existed / was relevant

I guess we should monitor and respond to issues better. I guess we don't have the capacity to triage our issues as well as I would like it.

Example implementations for common lints / operations

Common Tools for writing lints should contain common operations for lints. "Common lints" is basically looking at existing lints.

@yoav-lavi
Copy link
Contributor Author

yoav-lavi commented Mar 24, 2022

You're welcome @flip1995!

Thanks for the answers!

I will include some of your review once I finally finish the Clippy book. I copied your comment to #6628 to not forget about it.

In what context would it be included?

I guess we should monitor and respond to issues better. I guess we don't have the capacity to triage our issues as well as I would like it.

Didn't mean this as an accusation, more in the direction of possibly creating an index of sort of existing lints (rather than just a list of names, with which items they relate to etc.) and possibly a playground to check if a piece of code is linted using all rules. The Rust playground has clippy but I'm not sure it turns everything on by default.

Edit: The clippy lint list seems to work with keywords like "iterator" so this might already be covered

Common Tools for writing lints should contain common operations for lints. "Common lints" is basically looking at existing lints.

Not sure I saw this when working on the lint, will take a deeper look. Regarding common lints - You can look at existing code, but having a few examples that are indexed by what they're doing would probably be helpful (rather than trying to find lints which do what you want) (the previous point might tie into this using the indexed list of lints).

@flip1995
Copy link
Member

In what context would it be included?

I plan to rewrite some of our dev decumentation to make it easier finding things. I will use your feedback as some pointers where I might want to start with that.

Didn't mean this as an accusation, more in the direction of possibly creating an index of sort of existing lints

No worries, I didn't took it that way.

We have the Lint documentation where searching for some keywords like "join" would show you that such a lint didn't exist. Maybe we could improve the search, but I'm not sure how further improve that.

Not sure I saw this when working on the lint, will take a deeper look.

If you find things are missing, that you feel would have been useful there, feel free to open a PR adding those or leave a comment in #6628

@yoav-lavi
Copy link
Contributor Author

yoav-lavi commented Mar 24, 2022

Sounds good!

We have the Lint documentation where searching for some keywords like "join" would show you that such a lint didn't exist. Maybe we could improve the search, but I'm not sure how further improve that.

I just noticed that that works and edited the original, that does cover the general usecase!

@yoav-lavi
Copy link
Contributor Author

@flip1995 something else I ran into today and might be worth considering:

image

It seems like the pedantic category isn't always known to users

@flip1995
Copy link
Member

Thanks for pointing this out! We have that kind of on the radar and will also improve on that with the Clippy book. (If only the day had more than 24 hours, I would've already finished that 😐)

It's done by adding #![warn(clippy::pedantic)] to your lib.rs/main.rs file, btw. :)

@yoav-lavi
Copy link
Contributor Author

yoav-lavi commented Mar 25, 2022

Yeah the first comment on that question had the answer (which has eerily similar wording to your comment here haha):

image

I ran into https://reddit.com/r/rust/comments/a4wblu/how_to_configure_clippy_to_be_as_annoying_as/ a while back when looking up how to add more lint categories (this isn't my comment), but thought to share that with you since it may indicate a certain lack of visibility for non-default lints.

(If only the day had more than 24 hours, I would've already finished that 😐)

Not at all meant as criticism! Open source is hard, often thankless work done for free and from good intentions, my point is to bring new information to help the project rather than imply anything

@flip1995
Copy link
Member

Not at all meant as criticism! Open source is hard, often thankless work done for free and from good intentions, my point is to bring new information to help the project rather than imply anything

Thanks a lot for this, that really helps, even though I can't act on it right away! I definitely do not take your comments as if they were bad intended. I just have a 10 hour work day behind me and don't have the patience to put more time in formulating my replies better. 😅 Thanks again for your feedback. That will really help! 🚀

@yoav-lavi
Copy link
Contributor Author

That's completely fine, just wanted to make sure I'm not sending across the wrong message.

Have a great weekend!

@matthiasbeyer
Copy link

I found this via discu.eu - very useful lint indeed.

May I ask: This seems to only be applicable if the join() argument is an empty string, right?
If I want to join a string by something else, say ", ", what would I rather write:

iter.collect::<Vec<String>>().join(", ");

// or
iter.intersperse(", ").collect::<String>()

I like to believe that the second version is more performant, but who I am to just guess this? Either way: Maybe there's potential for another lint?

@yoav-lavi
Copy link
Contributor Author

yoav-lavi commented Mar 26, 2022

Interesting! You may be onto something. I'm not next to a computer right now but I would benchmark the two using a few different variations to check. Notice that intersperse is nightly though so it may make sense to wait for it to be stable. We should let the maintainers weigh in though (cc @flip1995)

Edit: regarding your question, the current version supports "" specifically

@matthiasbeyer
Copy link

Notice that intersperse is nightly though so it may make sense to wait for it to be stable.

Ah, I didn't even notice this. Maybe though even some other iterator-chaining is more performant. I will also do some benchmarks (will be a nice learning experience) and post them here.

@matthiasbeyer
Copy link

So writing the benchmark was less complex than I thought. Maybe I did something wrong, because I never wrote benchmarks for rust code before. Here's the playground, it is implemented on nightly.

I get varying results though:

    Finished bench [optimized] target(s) in 0.01s
     Running unittests src/lib.rs (target/release/deps/bench_join-5b81b7a4eb2ee675)

running 4 tests
test bench_collect_join             ... bench:         146 ns/iter (+/- 7)
test bench_intersperce_collect_join ... bench:         200 ns/iter (+/- 7)
test bench_intersperce_manual_join  ... bench:         301 ns/iter (+/- 8)
test bench_manual_join              ... bench:         103 ns/iter (+/- 2)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 12.13s

On my machine, the "manual join" implementation (the one with collect().join("") seems to produce the fastest code. When using the intersperse() function, the collect::<String>() implementation is way faster. Of course the problem size is really small here, maybe I need to provide a larger problem to the implementations.

@matthiasbeyer
Copy link

matthiasbeyer commented Mar 26, 2022

Ah yes, a bigger problem yields better visibility in the results:

running 4 tests
test bench_collect_join             ... bench:     570,037 ns/iter (+/- 64,868)
test bench_intersperce_collect_join ... bench:     699,265 ns/iter (+/- 39,398)
test bench_intersperce_manual_join  ... bench:   1,205,594 ns/iter (+/- 49,033)
test bench_manual_join              ... bench:     823,865 ns/iter (+/- 23,774)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 7.41s

with (1..10_000) as problem on my machine.

@yoav-lavi
Copy link
Contributor Author

The first result might be due to loop unrolling, does it work differently with iter or with a slightly longer input? Or when using a vec rather than an array?

@matthiasbeyer
Copy link

Might be, but a quick test (using vec![...] instead of [...]) showed similiar results.

@yoav-lavi
Copy link
Contributor Author

I'll be able to check this later on but we did run into cases where the compiler unrolls the manual join but not the collect join, makes sense to me that this is the case here

@yoav-lavi
Copy link
Contributor Author

@matthiasbeyer I get correct results:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn criterion_benchmark(criterion: &mut Criterion) {
    let mut benchmark_group = criterion.benchmark_group("unnecessary_join");

    benchmark_group.bench_function("collect and join", |bencher| {
        bencher.iter(|| {
            black_box(["1", "2", "3", "4", "5"])
                .into_iter()
                .map(String::from)
                .collect::<Vec<String>>()
                .join("")
        })
    });

    benchmark_group.bench_function("collect only", |bencher| {
        bencher.iter(|| {
            black_box(["1", "2", "3", "4", "5"])
                .into_iter()
                .map(String::from)
                .collect::<String>()
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
Benchmarking unnecessary_join/collect and join: Collecting 100 samples in estimated                                                                                     unnecessary_join/collect and join                        
                        time:   [198.64 ns 199.32 ns 199.96 ns]
Benchmarking unnecessary_join/collect only: Collecting 100 samples in estimated 5.00                                                                                    unnecessary_join/collect only                        
                        time:   [157.65 ns 157.98 ns 158.32 ns]

@yoav-lavi
Copy link
Contributor Author

yoav-lavi commented Mar 26, 2022

@matthiasbeyer with the other benchmarks:

(MacBook Pro (14-inch, 2021), 8 core M1 Pro, 16 GB RAM)

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn criterion_benchmark(criterion: &mut Criterion) {
    let mut benchmark_group = criterion.benchmark_group("unnecessary_join");

    benchmark_group.bench_function("array_into_iter/collect_and_join", |bencher| {
        bencher.iter(|| {
            black_box(["1", "2", "3", "4", "5"])
                .into_iter()
                .map(String::from)
                .collect::<Vec<String>>()
                .join("")
        })
    });

    benchmark_group.bench_function("array_into_iter/collect_only", |bencher| {
        bencher.iter(|| {
            black_box(["1", "2", "3", "4", "5"])
                .into_iter()
                .map(String::from)
                .collect::<String>()
        })
    });

    benchmark_group.bench_function("vec_iter/collect_and_join", |bencher| {
        bencher.iter(|| {
            let vector = black_box(vec!["hello", "world"]);
            let _output = vector
                .iter()
                .map(|item| item.to_uppercase())
                .collect::<Vec<String>>()
                .join("");
        })
    });

    benchmark_group.bench_function("vec_iter/collect_only", |bencher| {
        bencher.iter(|| {
            let vector = black_box(vec!["hello", "world"]);
            let _output = vector
                .iter()
                .map(|item| item.to_uppercase())
                .collect::<String>();
        })
    });

    benchmark_group.bench_function("deno/collect_and_join", |bencher| {
        bencher.iter(|| {
            let url = black_box("https://google.com".to_owned());
            let split_specifier = url.as_str().split(':');
            let _ = split_specifier.skip(1).collect::<Vec<_>>().join("");
        })
    });

    benchmark_group.bench_function("deno/collect_only", |bencher| {
        bencher.iter(|| {
            let url = black_box("https://google.com".to_owned());
            let split_specifier = url.as_str().split(':');
            let _ = split_specifier.skip(1).collect::<String>();
        })
    });

    benchmark_group.bench_function("not_unrolled/collect_and_join", |bencher| {
        bencher.iter(|| {
            let _ = black_box(std::iter::repeat("'a, "))
                .take(10)
                .collect::<Vec<_>>()
                .join("");
        })
    });

    benchmark_group.bench_function("not_unrolled/collect_only", |bencher| {
        bencher.iter(|| {
            let _ = black_box(std::iter::repeat("'a, "))
                .take(10)
                .collect::<String>();
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
unnecessary_join/array_into_iter/collect_and_join                        
                        time:   [197.77 ns 197.92 ns 198.06 ns]    
unnecessary_join/array_into_iter/collect_only                        
                        time:   [160.49 ns 160.86 ns 161.24 ns]                                                                            

unnecessary_join/vec_iter/collect_and_join                        
                        time:   [156.61 ns 156.72 ns 156.83 ns]       
unnecessary_join/vec_iter/collect_only                        
                        time:   [118.48 ns 118.53 ns 118.59 ns]                                                                           

unnecessary_join/deno/collect_and_join                        
                        time:   [93.969 ns 94.001 ns 94.041 ns]      
unnecessary_join/deno/collect_only                        
                        time:   [70.695 ns 70.762 ns 70.835 ns]                                                                            

unnecessary_join/not_unrolled/collect_and_join                        
                        time:   [90.768 ns 90.830 ns 90.920 ns]   
unnecessary_join/not_unrolled/collect_only                        
                        time:   [139.73 ns 139.87 ns 140.02 ns]

The last benchmark being an example where the loop doesn't unroll when using collect::<String>()

@yoav-lavi
Copy link
Contributor Author

yoav-lavi commented Mar 26, 2022

It should yield a ~22% improvement in normal cases according to this benchmark

@yoav-lavi
Copy link
Contributor Author

yoav-lavi commented Mar 26, 2022

@matthiasbeyer It seems that when compiling these to x86_64-apple-darwin and running under Rosetta 2 the first two benchmarks are slightly in favor of collect and join, I think that's what you're seeing:

unnecessary_join/array_into_iter/collect_and_join                        
                        time:   [326.13 ns 326.82 ns 327.52 ns]
unnecessary_join/array_into_iter/collect_only                        
                        time:   [347.46 ns 347.93 ns 348.39 ns]
      
unnecessary_join/vec_iter/collect_and_join
                        time:   [261.55 ns 262.66 ns 263.74 ns]
unnecessary_join/vec_iter/collect_only                        
                        time:   [285.76 ns 286.24 ns 286.70 ns]     

unnecessary_join/deno/collect_and_join                        
                        time:   [169.84 ns 170.09 ns 170.32 ns]   
unnecessary_join/deno/collect_only                        
                        time:   [134.34 ns 134.47 ns 134.60 ns]

unnecessary_join/not_unrolled/collect_and_join                        
                        time:   [185.72 ns 185.91 ns 186.10 ns]
unnecessary_join/not_unrolled/collect_only                        
                        time:   [281.17 ns 281.30 ns 281.48 ns]

This is the assembly diff for an example that behaves like the first benchmark on x86_64: https://rust.godbolt.org/z/GKqhq5o3a

@flip1995 let me know if you think we should change the wording

@matthiasbeyer
Copy link

Thanks for helping me understanding what is going on here! 👍

@flip1995
Copy link
Member

flip1995 commented Apr 6, 2022

@flip1995 let me know if you think we should change the wording

Maybe "usually more performant" -> "might be more performant" and "in most cases" -> "sometimes".

@yoav-lavi
Copy link
Contributor Author

@flip1995 should I create a new PR?

@flip1995
Copy link
Member

flip1995 commented Apr 6, 2022

That would be great!

@yoav-lavi
Copy link
Contributor Author

yoav-lavi commented Apr 6, 2022

@flip1995 I know some that people / projects use the pedantic category entirely rather than on a case basis, would it be a concern that those people / projects would have this change made automatically?

It's sort of contrary to the description of the category but I want to make sure we don't need to do anything specific regarding those cases

bors added a commit that referenced this pull request Apr 6, 2022
update unnecessary_join documentation

changelog: none

Updates the description of `unnecessary_join` in accordance with #8579 (comment). I've also added a line regarding differences in assembly output, please let me know if it should also make it in.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties
Projects
None yet
Development

Successfully merging this pull request may close these issues.

needless_join for Iter<String>
6 participants