Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group_by is weird #374

Closed
Tolsi opened this issue Oct 13, 2019 · 20 comments · Fixed by #866
Closed

group_by is weird #374

Tolsi opened this issue Oct 13, 2019 · 20 comments · Fixed by #866

Comments

@Tolsi
Copy link

Tolsi commented Oct 13, 2019

I spent several hours debugging my code written using the group_by function.

I expected from this function the logic of grouping the same objects by a key, how surprised I was that he groups only some nearby elements while the group key has the same value.

In the documentation it is also indicated somehow strange and until the last, I was not sure if this is a bug or feature. I carefully read the documentation 5 times and it turned out to be a feature! Awful incomprehensible feature!

I do not think that this is the behavior that many expect for this method name. I demand to rename this method to something more suitable like cons_group_by or lazy_group_by not to mislead users.

Example

I'm not the only one to think so

@phimuemue
Copy link
Member

I agree that we could make the examples in the documentation a bit clearer.

You may be looking for partition.

@Tolsi
Copy link
Author

Tolsi commented Oct 13, 2019

Thank you for your answer.

This example is not my case, but it shows the strange logic of the method, from the name of which you do not expect this.

What I do is here, but it should be like this. It looks like I need into_group_map, but, guys, I think these names stole not only my hours of life.

@bluss
Copy link
Member

bluss commented Oct 13, 2019

It groups consecutive runs, just like python's itertools.groupby, so there is some precedent.

@Tolsi
Copy link
Author

Tolsi commented Oct 13, 2019

This is an angry post about the contracts and expectations that users expect from methods with some name. I write in Rust not so long ago and CLion automatically advised me to import itertools when I wrote group_by, but this is not what I expected.

@bluss Hm, I have no experience with python and its itertools and I think it is also misleading. But I have a lot of Java/Scala/C#/.../SQL experience and it is the different.

I understand that now you can't rename the method for backward compatibility, but at least you can add the result of an example to the documentation. Or maybe you can rename it in the next major version update?

@eldritchconundrum
Copy link

I just had the exact same problem of expecting group_by to work like in most other languages, and I think I have an idea: the current group_by function should be renamed chunk_by.

Don't you think it would be a better name for its current behavior?

@porglezomp
Copy link

I agree, chunk_by seems to me to intuitively match the behavior of the chunk methods in the standard library.

@gin-ahirsch
Copy link
Contributor

The documentation states pretty clearly that it groups "consecutive elements". The example also shows this, though still I think #375 is a good change anyway.
I don't find the name misleading, but I like chunk_by just as much.
Maybe I just have the advantage of reading this issue first and knowing that I need a more careful look at the documented semantics though. Anyways, just my 2¢.

bors bot added a commit that referenced this issue Dec 3, 2019
375: Clarify documentation for group_by r=jswrenn a=phimuemue

This refers to #374 (comment).

I tried to make the docs a bit clearer by `collect`ing into a vector, showing the groups explicitly.

Co-authored-by: philipp <descpl@yahoo.de>
bors bot added a commit that referenced this issue Dec 3, 2019
375: Clarify documentation for group_by r=jswrenn a=phimuemue

This refers to #374 (comment).

I tried to make the docs a bit clearer by `collect`ing into a vector, showing the groups explicitly.

Co-authored-by: philipp <descpl@yahoo.de>
@ramosbugs
Copy link

ramosbugs commented Sep 21, 2020

just want to start off by saying this is one of my favorite crates and makes it a lot easier to write functional Rust!

that said, I have made this group_by mistake myself more than once. I've even read the docs, fixed my code, and then made the same mistake a few months later after forgetting the non-obvious semantics. I've seen some colleagues introduce the same bug as well.

the current group_by semantics are nice and efficient when we know the input iterator is already sorted, and as mentioned above there is precedent in other languages. it's also similar to the coreutils comm command, which assumes that the files you're comparing are pre-sorted.

unfortunately though, in all of these cases it's easy to forget. worse, it's easy to pass unit tests and only run into this issue in production, once you're iterating through a larger set of data that exposes the bug due to lack of sorting.

some potential improvements that could help here, which unfortunately all introduce breaking changes to varying degrees:

  • rename to presorted_group_by or unchecked_group_by (the unchecked word seems to have pretty widespread usage in Rust to imply "users are responsible for ensuring some invariant"); chunk_by as suggested above seems fine too, though perhaps a bit less self-explanatory
  • add a Sorted marker trait (or similar) that gets returned by sort_by_key and its friends or that users can manually impl on their types. then, move this function to a SortedItertools or something trait that is only impled on Iterator + Sorted
  • for debug builds, add (expensive) pre-sort checks and panic on failure

@tobymurray
Copy link

tobymurray commented Nov 13, 2023

Adding another anecdote supporting the "I'm confused" position. I acknowledge that upon rereading "Consecutive elements that map to the same key (“runs”), are assigned to the same group." it lines up with the behavior, but as someone with a preconceived notion of what I expected (and unfortunately for the first set of data it lined up) I just glossed over that statement because it didn't make sense to me. It could be argued that users should understand what they're doing, but I'd counter that a quality of a strong interface is that it's hard to misuse and I think group_by fails that test.

It groups consecutive runs, just like python's itertools.groupby, so there is some precedent.

While acknowledging that this may be strictly true, the existence of SQL's group by doing something different feels like it muddies the waters significantly. Heavily biased, but I feel like the SQL interpretation of "group by" matches its name more intuitively. The confusion is particularly insidious because certain inputs will provide the same outputs with both interpretations (how I ended up here).

I'm a strong +1 for suggesting the function name be deprecated and moved to something else. That it can be broadly misunderstood is one thing, but that it can look like it's working as expected (when misunderstood) and then act "differently" with a different set of data is pretty troubling.

@phimuemue
Copy link
Member

Is group_consecutive_by or group_adjacent_by a better name?

@tobymurray
Copy link

I don't feel like I understand why group_by as it exists today is actually useful in practice, so I'm not sure I can suggest a descriptive name. Is it a subset of the SQL-group-by, as in an optimization for pre-sorted lists, or is it useful in its own right?

@Philippe-Cholet
Copy link
Member

I get that people are not familiar with python (where this came from) but its own documentation is interesting. So I'm quoting it here:

Make an iterator that returns consecutive keys and groups from the iterable. The key is a function computing a key value for each element. If not specified or is None, key defaults to an identity function and returns the element unchanged. Generally, the iterable needs to already be sorted on the same key function.

The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list:

@tobymurray
Copy link

tobymurray commented Nov 17, 2023

I think I understand the implementation and that it can theoretically be more efficient if the data is sorted or the aggregated groups aren't actually needed, I'm just not sure where this would be used. From the python docs:

Generally, the iterable needs to already be sorted on the same key function.

Does anyone have some examples of where this would be useful outside of this case? That one string compression interview question perhaps (aaabbcccc -> a3b2c4) but surely that's not a compelling use case. It feels like maybe leaning in to interpreting this as is an optimized version of group_by for pre-sorted data makes sense.

My totally biased and likely somewhat ignorant opinion is that this version of group_by's applicability is extremely narrow and nuanced and SQL-style group by is common. Picking a "bad" name for this so only people who know the implications (if it's not supported by the type system) end up using it feels like it's a relatively small downside.

Support above seems to be for chunk_by, I marginally like group_consecutive_by more than group_adjacent_by but all 3 would work for me not assuming it's not the "group_by" method I was looking for.

This result on StackOverflow is pretty damning for the same choice made by Python's itertools: https://stackoverflow.com/questions/773/how-do-i-use-itertools-groupby. (Bonus points for the extra StackOverflow toxicity of "The example on the Python docs is quite straightforward").

@Philippe-Cholet
Copy link
Member

I used it mostly when using python and each time because I knew something about the iterable data (that elements were consecutively grouped for some reason) or I wanted to know things about groups in the data.
I do not think I've sorted data and then grouped it by a same key, that would be the exception for me.

I certainly see that group elements non-consecutively has more usage.
We have other grouping methods like into_grouping_map[_by] (which has several useful methods) and into_group_map[_by] resulting in a HashMap (and if your usage is not covered but common enough, we might consider adding another).

I'm personally not opposed to changing the method name (I would say chunk_by or group_consecutive_by) but it's a breaking change and I'm wondering if we could just clarify the documentation enough to avoid any weirdness.

@tobymurray
Copy link

I'm wondering if we could just clarify the documentation enough to avoid any weirdness.

My instinctual reaction is "no, this can't effectively be fixed by documentation".

I think the name "group by" means "SQL-style group by" to significantly more people than it means "Python-itertools-style group by". The StackOverflow question with a huge number of people publicly confused by how it's intended to be used supports that position. To me, the issue is the method name itself not aligning with the expected functionality - documentation won't fix that for the majority of consumers. Even within Python, it looks like the Panda GroupBy does not align with the itertools interpretation.

Put another way, if it's feeling like the first line of the documentation should be "this method likely doesn't do what you think it does", I'd argue that's not a great result and a breaking change is a better result. A deprecation and replacement feels pretty palatable from that perspective - renames are relatively easy to accommodate in Rust.

@Philippe-Cholet
Copy link
Member

@jswrenn @phimuemue It's not ideal but I think we should consider deprecate group_by and rename it group_consecutive_by.
chunk_by is also a possibility but I don't find it clear enough for my taste.

@jswrenn
Copy link
Member

jswrenn commented Nov 17, 2023

I'm not opposed, but I'd like to wait for the naming debate in rust-lang/rust#80552 to settle. There's a strong case to be made for consistency with the Rust standard library.

@tobymurray
Copy link

I think that naming debate has settled? The PR is up for standard library to use chunk_by, i.e.. rust-lang/rust#117678

@RaitoBezarius
Copy link

I stumbled yet again on this confusion after 2 hours of debugging while re-reading and missing consecutive multiple times. :)

tvlbot pushed a commit to tvlfyi/tvix that referenced this issue Jan 17, 2024
Previously, we were assembling very naively an attribute set composed of context we saw.

But it was forgetting that `"${drv}${drv.drvPath}"` would contain 2 contexts with the same key, but
with different values, one with `outputs = [ "out" ];` and `allOutputs = true;`.

Following this reasoning and comparing with what Nix does, we ought to merge underlying values systematically.

Hence, I bring `itertools` to perform a group by on the key and merge everything on the fly, it's not
beautiful but it's the best I could find, notice that I don't use
`group_by` but I talk about group by, that is, because `group_by` is a
`group_by_consecutive`, see
rust-itertools/itertools#374.

Initially, I tried to do it without a `into_grouping_map_by`, it was akin to assemble the final `NixAttrs` directly,
it was less readable and harder to pull out because we don't have a lot of in-place mutable functions on
our data structures.

Change-Id: I9933c9bd88ffe04de50dda14f21879b60d8b8cd4
Reviewed-on: https://cl.tvl.fyi/c/depot/+/10620
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
@Philippe-Cholet
Copy link
Member

Philippe-Cholet commented Jan 30, 2024

Oh... The naming debate has settled and rust-lang/rust#117678 was merged four days ago (and added to the 1.77.0 milestone).

PR done and merged, it will be released in our next release 0.13.0.

facebook-github-bot pushed a commit to facebookexperimental/rust-shed that referenced this issue Oct 16, 2024
Summary:
# Motivation

And if there is one is one design mistake Rust ecosystem made, it is making [`group_by` weird](rust-itertools/itertools#374) and possible for users to, quote:

> I spent several hours debugging my code written using the group_by function.

Luckily for internal users, the build tooling responds accordingly and fails build for all usages of deprecated API. Thus, there is a need to adjust all of them manually.

# [Release notes](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md#0130)

### Breaking
- Removed implementation of `DoubleEndedIterator` for `ConsTuples` (#853)
- Made `MultiProduct` fused and fixed on an empty iterator (#835, #834)
- Changed `iproduct!` to return tuples for maxi one iterator too (#870)
- Changed `PutBack::put_back` to return the old value (#880)
- Removed deprecated `repeat_call, Itertools::{foreach, step, map_results, fold_results}` (#878)
- Removed `TakeWhileInclusive::new` (#912)

NOTE: Quick search didn't tell me anything related to breaking changes above, CI will tell. And, of course, scream to me if it breaks your personal build.

Reviewed By: anps77

Differential Revision: D64306014

fbshipit-source-id: 881ac716e1dc23968d4a28000fdaccdbf9097ec2
facebook-github-bot pushed a commit to facebookincubator/below that referenced this issue Oct 16, 2024
Summary:
# Motivation

And if there is one is one design mistake Rust ecosystem made, it is making [`group_by` weird](rust-itertools/itertools#374) and possible for users to, quote:

> I spent several hours debugging my code written using the group_by function.

Luckily for internal users, the build tooling responds accordingly and fails build for all usages of deprecated API. Thus, there is a need to adjust all of them manually.

# [Release notes](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md#0130)

### Breaking
- Removed implementation of `DoubleEndedIterator` for `ConsTuples` (#853)
- Made `MultiProduct` fused and fixed on an empty iterator (#835, #834)
- Changed `iproduct!` to return tuples for maxi one iterator too (#870)
- Changed `PutBack::put_back` to return the old value (#880)
- Removed deprecated `repeat_call, Itertools::{foreach, step, map_results, fold_results}` (#878)
- Removed `TakeWhileInclusive::new` (#912)

NOTE: Quick search didn't tell me anything related to breaking changes above, CI will tell. And, of course, scream to me if it breaks your personal build.

Reviewed By: anps77

Differential Revision: D64306014

fbshipit-source-id: 881ac716e1dc23968d4a28000fdaccdbf9097ec2
facebook-github-bot pushed a commit to facebook/errpy that referenced this issue Oct 16, 2024
Summary:
# Motivation

And if there is one is one design mistake Rust ecosystem made, it is making [`group_by` weird](rust-itertools/itertools#374) and possible for users to, quote:

> I spent several hours debugging my code written using the group_by function.

Luckily for internal users, the build tooling responds accordingly and fails build for all usages of deprecated API. Thus, there is a need to adjust all of them manually.

# [Release notes](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md#0130)

### Breaking
- Removed implementation of `DoubleEndedIterator` for `ConsTuples` (#853)
- Made `MultiProduct` fused and fixed on an empty iterator (#835, #834)
- Changed `iproduct!` to return tuples for maxi one iterator too (#870)
- Changed `PutBack::put_back` to return the old value (#880)
- Removed deprecated `repeat_call, Itertools::{foreach, step, map_results, fold_results}` (#878)
- Removed `TakeWhileInclusive::new` (#912)

NOTE: Quick search didn't tell me anything related to breaking changes above, CI will tell. And, of course, scream to me if it breaks your personal build.

Reviewed By: anps77

Differential Revision: D64306014

fbshipit-source-id: 881ac716e1dc23968d4a28000fdaccdbf9097ec2
facebook-github-bot pushed a commit to facebookincubator/reindeer that referenced this issue Oct 16, 2024
Summary:
# Motivation

And if there is one is one design mistake Rust ecosystem made, it is making [`group_by` weird](rust-itertools/itertools#374) and possible for users to, quote:

> I spent several hours debugging my code written using the group_by function.

Luckily for internal users, the build tooling responds accordingly and fails build for all usages of deprecated API. Thus, there is a need to adjust all of them manually.

# [Release notes](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md#0130)

### Breaking
- Removed implementation of `DoubleEndedIterator` for `ConsTuples` (#853)
- Made `MultiProduct` fused and fixed on an empty iterator (#835, #834)
- Changed `iproduct!` to return tuples for maxi one iterator too (#870)
- Changed `PutBack::put_back` to return the old value (#880)
- Removed deprecated `repeat_call, Itertools::{foreach, step, map_results, fold_results}` (#878)
- Removed `TakeWhileInclusive::new` (#912)

NOTE: Quick search didn't tell me anything related to breaking changes above, CI will tell. And, of course, scream to me if it breaks your personal build.

Reviewed By: anps77

Differential Revision: D64306014

fbshipit-source-id: 881ac716e1dc23968d4a28000fdaccdbf9097ec2
facebook-github-bot pushed a commit to facebook/sapling that referenced this issue Oct 16, 2024
Summary:
# Motivation

And if there is one is one design mistake Rust ecosystem made, it is making [`group_by` weird](rust-itertools/itertools#374) and possible for users to, quote:

> I spent several hours debugging my code written using the group_by function.

Luckily for internal users, the build tooling responds accordingly and fails build for all usages of deprecated API. Thus, there is a need to adjust all of them manually.

# [Release notes](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md#0130)

### Breaking
- Removed implementation of `DoubleEndedIterator` for `ConsTuples` (#853)
- Made `MultiProduct` fused and fixed on an empty iterator (#835, #834)
- Changed `iproduct!` to return tuples for maxi one iterator too (#870)
- Changed `PutBack::put_back` to return the old value (#880)
- Removed deprecated `repeat_call, Itertools::{foreach, step, map_results, fold_results}` (#878)
- Removed `TakeWhileInclusive::new` (#912)

NOTE: Quick search didn't tell me anything related to breaking changes above, CI will tell. And, of course, scream to me if it breaks your personal build.

Reviewed By: anps77

Differential Revision: D64306014

fbshipit-source-id: 881ac716e1dc23968d4a28000fdaccdbf9097ec2
facebook-github-bot pushed a commit to facebook/hhvm that referenced this issue Oct 16, 2024
Summary:
# Motivation

And if there is one is one design mistake Rust ecosystem made, it is making [`group_by` weird](rust-itertools/itertools#374) and possible for users to, quote:

> I spent several hours debugging my code written using the group_by function.

Luckily for internal users, the build tooling responds accordingly and fails build for all usages of deprecated API. Thus, there is a need to adjust all of them manually.

# [Release notes](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md#0130)

### Breaking
- Removed implementation of `DoubleEndedIterator` for `ConsTuples` (#853)
- Made `MultiProduct` fused and fixed on an empty iterator (#835, #834)
- Changed `iproduct!` to return tuples for maxi one iterator too (#870)
- Changed `PutBack::put_back` to return the old value (#880)
- Removed deprecated `repeat_call, Itertools::{foreach, step, map_results, fold_results}` (#878)
- Removed `TakeWhileInclusive::new` (#912)

NOTE: Quick search didn't tell me anything related to breaking changes above, CI will tell. And, of course, scream to me if it breaks your personal build.

Reviewed By: anps77

Differential Revision: D64306014

fbshipit-source-id: 881ac716e1dc23968d4a28000fdaccdbf9097ec2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.