Skip to content

Documentation for sorting #1568

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Nov 8, 2024
Merged

Documentation for sorting #1568

merged 10 commits into from
Nov 8, 2024

Conversation

132ikl
Copy link
Member

@132ikl 132ikl commented Sep 30, 2024

Companion PR to nushell/nushell#13154. Should not be merged until the nushell PR is merged.

Adds a new page to the book explaining how Nushell's many different sorting behaviors work, including old features and new features.

@132ikl 132ikl changed the title Documentation for new sort features Documentation for sorting Oct 1, 2024
Copy link
Member

@sholderbach sholderbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for being so thorough about all those semantics.

Maybe some of it is very thorough for end-user documentation but I think it is good if we start out with well documented/specified semantics. We can still move the details to the reference material and have a simpler intro at a later point in time.

book/sorting.md Outdated
If you _do_ want a sort containing differing types to error, see [strict sort](#strict-sort).
:::

Nushell's sort is also **stable**, meaning equal values will retain their original ordering relative to each other. This is illustrated here using the [insensitive](#insensitive-sort) sort option:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to commit to all our sorting implementations being stable or should that be relaxed to sort or "stable by default"?

For clarity I would specify the insensitive

Suggested change
Nushell's sort is also **stable**, meaning equal values will retain their original ordering relative to each other. This is illustrated here using the [insensitive](#insensitive-sort) sort option:
Nushell's sort is also **stable**, meaning equal values will retain their original ordering relative to each other. This is illustrated here using the [case insensitive](#insensitive-sort) sort option:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort and sort-by both use Rust's sort_by, which is stable. I also have a unit test which does something similar to the example here which demonstrates the sort being stable. I'm not sure what other sorting implementation there might be?

I agree, I'll rename the heading to "case insensitive" as well.

╰───┴────────╯
```

We can see that the numbers are sorted in order, and the strings are sorted to the end of the list, also in order. If you are coming from other programming languages, this may not be quite what you expect. In Nushell, as a general rule, **data can always be sorted without erroring**.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some part of my inner "language patent lawyer" thinks that promising this leads to some tricky rules to define a total order over all data types with challenges as soon as new types would be introduced and are messy to teach if you run into them. But for the general usability it is definitely a bonus if you don't have to worry about this.

Copy link
Member Author

@132ikl 132ikl Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least currently, there is indeed an ordering between all types. It's effectively the order of the Value enum's variants, with some exceptions. As I see it, the tradeoff here is between "sort which errors on mixed types" or "sort that never errors". Personally, if it would be possible for sort to error because of incompatible types, I would rather make it fully strict at that point.

I wasn't planning on specifying what order the types would be, except maybe that within the same version of Nushell the ordering between types is always the same. I would like to guarantee at least that null comes at the end, and the corresponding Nushell PR does do this. Since Nothing is kind of a special type, and you might expect it to appear a lot more than say, an integer and a string, I think it's a useful property for it to always appear at the end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's unlikely that we will add more types (I would probably be against it). For the current set of types, a total order is totally doable. Floats are really the only issue, and could be ordered using total_cmp or sorting NaNs last like ordered_float.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good point. I might look into adding a total order 🤔

IanManske added a commit to nushell/nushell that referenced this pull request Oct 10, 2024
…#13154)

# Description

Closes #12535
Implements sort-by functionality of #8322
Fixes sort-by part of #8667

This PR does two main things: add a new cell path and closure parameter
to `sort-by`, and attempt to make Nushell's sorting behavior
well-defined.

## `sort-by` features

The `columns` parameter is replaced with a `comparator` parameter, which
can be a cell path or a closure. Examples are from docs PR.

1. Cell paths

The basic interactive usage of `sort-by` is the same. For example, `ls |
sort-by modified` still works the same as before. It is not quite a
drop-in replacement, see [behavior changes](#behavior-changes).
   
   Here's an example of how the cell path comparator might be useful:
   
   ```nu
   > let cities = [
{name: 'New York', info: { established: 1624, population: 18_819_000 } }
{name: 'Kyoto', info: { established: 794, population: 37_468_000 } }
{name: 'São Paulo', info: { established: 1554, population: 21_650_000 }
}
   ]
   > $cities | sort-by info.established
   ╭───┬───────────┬────────────────────────────╮
   │ # │   name    │            info            │
   ├───┼───────────┼────────────────────────────┤
   │ 0 │ Kyoto     │ ╭─────────────┬──────────╮ │
   │   │           │ │ established │ 794      │ │
   │   │           │ │ population  │ 37468000 │ │
   │   │           │ ╰─────────────┴──────────╯ │
   │ 1 │ São Paulo │ ╭─────────────┬──────────╮ │
   │   │           │ │ established │ 1554     │ │
   │   │           │ │ population  │ 21650000 │ │
   │   │           │ ╰─────────────┴──────────╯ │
   │ 2 │ New York  │ ╭─────────────┬──────────╮ │
   │   │           │ │ established │ 1624     │ │
   │   │           │ │ population  │ 18819000 │ │
   │   │           │ ╰─────────────┴──────────╯ │
   ╰───┴───────────┴────────────────────────────╯
   ```

2. Key closures

You can supply a closure which will transform each value into a sorting
key (without changing the underlying data). Here's an example of a key
closure, where we want to sort a list of assignments by their average
grade:

   ```nu
   > let assignments = [
       {name: 'Homework 1', grades: [97 89 86 92 89] }
       {name: 'Homework 2', grades: [91 100 60 82 91] }
       {name: 'Exam 1', grades: [78 88 78 53 90] }
       {name: 'Project', grades: [92 81 82 84 83] }
   ]
   > $assignments | sort-by { get grades | math avg }
   ╭───┬────────────┬───────────────────────╮
   │ # │    name    │        grades         │
   ├───┼────────────┼───────────────────────┤
   │ 0 │ Exam 1     │ [78, 88, 78, 53, 90]  │
   │ 1 │ Project    │ [92, 81, 82, 84, 83]  │
   │ 2 │ Homework 2 │ [91, 100, 60, 82, 91] │
   │ 3 │ Homework 1 │ [97, 89, 86, 92, 89]  │
   ╰───┴────────────┴───────────────────────╯
   ```

3. Custom sort closure

The `--custom`, or `-c`, flag will tell `sort-by` to interpret closures
as custom sort closures. A custom sort closure has two parameters, and
returns a boolean. The closure should return `true` if the first
parameter comes _before_ the second parameter in the sort order.
   
For a simple example, we could rewrite a cell path sort as a custom sort
(see
[here](https://github.com/nushell/nushell.github.io/pull/1568/files#diff-a7a233e66a361d8665caf3887eb71d4288000001f401670c72b95cc23a948e86R231)
for a more complex example):
   
   ```nu
   > ls | sort-by -c {|a, b| $a.size < $b.size }
   ╭───┬─────────────────────┬──────┬──────────┬────────────────╮
   │ # │        name         │ type │   size   │    modified    │
   ├───┼─────────────────────┼──────┼──────────┼────────────────┤
   │ 0 │ my-secret-plans.txt │ file │    100 B │ 10 minutes ago │
   │ 1 │ shopping_list.txt   │ file │    100 B │ 2 months ago   │
   │ 2 │ myscript.nu         │ file │  1.1 KiB │ 2 weeks ago    │
   │ 3 │ bigfile.img         │ file │ 10.0 MiB │ 3 weeks ago    │
   ╰───┴─────────────────────┴──────┴──────────┴────────────────╯
   ```
   

## Making sort more consistent

I think it's important for something as essential as `sort` to have
well-defined semantics. This PR contains some changes to try to make the
behavior of `sort` and `sort-by` consistent. In addition, after working
with the internals of sorting code, I have a much deeper understanding
of all of the edge cases. Here is my attempt to try to better define
some of the semantics of sorting (if you are just interested in changes,
skip to "User-Facing changes")

- `sort`, `sort -v`, and `sort-by` now all work the same. Each
individual sort implementation has been refactored into two functions in
`sort_utils.rs`: `sort`, and `sort_by`. These can also be used in other
parts of Nushell where values need to be sorted.
  - `sort` and `sort-by` used to handle `-i` and `-n` differently.
- `sort -n` would consider all values which can't be coerced into a
string to be equal
- `sort-by -i` and `sort-by -n` would only work if all values were
strings
- In this PR, insensitive sort only affects comparison between strings,
and natural sort only applies to numbers and strings (see below).
- (not a change) Before and after this PR, `sort` and `sort-by` support
sorting mixed types. There was a lot of discussion about potentially
making `sort` and `sort-by` only work on lists of homogeneous types, but
the general consensus was that `sort` should not error just because its
input contains incompatible types.
- In order to try to make working with data containing `null` values
easier, I changed the PartialOrd order to sort `Nothing` values to the
end of a list, regardless of what other types the list contains. Before,
`null` would be sorted before `Binary`, `CellPath`, and `Custom` values.
- (not a change) When sorted, lists of mixed types will contain sorted
values of each type in order, for the most part
- (not a change) For example, `[0x[1] (date now) "a" ("yesterday" | into
datetime) "b" 0x[0]]` will be sorted as `["a", "b", a day ago, now, [0],
[1]]`, where sorted strings appear first, then sorted datetimes, etc.
- (not a change) The exception to this is `Int`s and `Float`s, which
will intermix, `Strings` and `Glob`s, which will intermix, and `None` as
described above. Additionally, natural sort will intermix strings with
ints and floats (see below).
- Natural sort no longer coerce all inputs to strings.
- I did originally make natural only apply to strings, but @fdncred
pointed out that the previous behavior also allowed you to sort numeric
strings with numbers. This seems like a useful feature if we are trying
to support sorting with mixed types, so I settled on coercing only
numbers (int, float). This can be reverted if people don't like it.
- Here is an example of this behavior in action, which is the same
before and after this PR:
      ```nushell
      $ [1 "4" 3 "2"] | sort --natural
      ╭───┬───╮
      │ 0 │ 1 │
      │ 1 │ 2 │
      │ 2 │ 3 │
      │ 3 │ 4 │
      ╰───┴───╯
      ```



# User-Facing Changes

## New features

- Replaces the `columns` string parameter of `sort-by` with a cell path
or a closure.
  - The cell path parameter works exactly as you would expect
- By default, the `closure` parameter acts as a "key sort"; that is,
each element is transformed by the closure into a sorting key
- With the `--custom` (`-c`) parameter, you can define a comparison
function for completely custom sorting order.

## Behavior changes

<details>
<summary><code>sort -v</code> does not coerce record values to
strings</summary>

This was a bit of a surprising behavior, and is now unified with the
behavior of `sort` and `sort-by`. Here's an example where you can
observe the values being implicitly coerced into strings for sorting, as
they are sorted like strings rather than numbers:

Old behavior:

```nushell
$ {foo: 9 bar: 10} | sort -v
╭─────┬────╮
│ bar │ 10 │
│ foo │ 9  │
╰─────┴────╯
```

New behavior:

```nushell
$ {foo: 9 bar: 10} | sort -v
╭─────┬────╮
│ foo │ 9  │
│ bar │ 10 │
╰─────┴────╯
```

</details>


<details>
<summary>Changed <code>sort-by</code> parameters from
<code>string</code> to <code>cell-path</code> or <code>closure</code>.
Typical interactive usage is the same as before, but if passing a
variable to <code>sort-by</code> it must be a cell path (or closure),
not a string</summary>

Old behavior:

```nushell
$ let sort = "modified"
$ ls | sort-by $sort
╭───┬──────┬──────┬──────┬────────────────╮
│ # │ name │ type │ size │    modified    │
├───┼──────┼──────┼──────┼────────────────┤
│ 0 │ foo  │ file │  0 B │ 10 hours ago   │
│ 1 │ bar  │ file │  0 B │ 35 seconds ago │
╰───┴──────┴──────┴──────┴────────────────╯
```

New behavior:

```nushell
$ let sort = "modified"
$ ls | sort-by $sort
Error: nu::shell::type_mismatch

  × Type mismatch.
   ╭─[entry #10:1:14]
 1 │ ls | sort-by $sort
   ·              ──┬──
   ·                ╰── Cannot sort using a value which is not a cell path or closure
   ╰────
$ let sort = $."modified"
$ ls | sort-by $sort
╭───┬──────┬──────┬──────┬───────────────╮
│ # │ name │ type │ size │   modified    │
├───┼──────┼──────┼──────┼───────────────┤
│ 0 │ foo  │ file │  0 B │ 10 hours ago  │
│ 1 │ bar  │ file │  0 B │ 2 minutes ago │
╰───┴──────┴──────┴──────┴───────────────╯
```
</details>

<details>
<summary>Insensitve and natural sorting behavior reworked</summary>

Previously, the `-i` and `-n` worked differently for `sort` and
`sort-by` (see "Making sort more consistent"). Here are examples of how
these options result in different sorts now:

1. `sort -n`
- Old behavior (types other than numbers, strings, dates, and binary
sorted incorrectly)
      ```nushell
      $ [2sec 1sec] | sort -n
      ╭───┬──────╮
      │ 0 │ 2sec │
      │ 1 │ 1sec │
      ╰───┴──────╯
      ```
    - New behavior
      ```nushell
      $ [2sec 1sec] | sort -n
      ╭───┬──────╮
      │ 0 │ 1sec │
      │ 1 │ 2sec │
      ╰───┴──────╯
      ```
    
2. `sort-by -i`
- Old behavior (uppercase words appear before lowercase words as they
would in a typical sort, indicating this is not actually an insensitive
sort)
     ```nushell
     $ ["BAR" "bar" "foo" 2 "FOO" 1] | wrap a | sort-by -i a
     ╭───┬─────╮
     │ # │  a  │
     ├───┼─────┤
     │ 0 │   1 │
     │ 1 │   2 │
     │ 2 │ BAR │
     │ 3 │ FOO │
     │ 4 │ bar │
     │ 5 │ foo │
     ╰───┴─────╯
     ```
- New behavior (strings are sorted stably, indicating this is an
insensitive sort)
     ```nushell
     $ ["BAR" "bar" "foo" 2 "FOO" 1] | wrap a | sort-by -i a
     ╭───┬─────╮
     │ # │  a  │
     ├───┼─────┤
     │ 0 │   1 │
     │ 1 │   2 │
     │ 2 │ BAR │
     │ 3 │ bar │
     │ 4 │ foo │
     │ 5 │ FOO │
     ╰───┴─────╯
     ```

3. `sort-by -n`
- Old behavior (natural sort does not work when data contains non-string
values)
     ```nushell
     $ ["10" 8 "9"] | wrap a | sort-by -n a
     ╭───┬────╮
     │ # │ a  │
     ├───┼────┤
     │ 0 │  8 │
     │ 1 │ 10 │
     │ 2 │ 9  │
     ╰───┴────╯
     ```
   - New behavior
     ```nushell
     $ ["10" 8 "9"] | wrap a | sort-by -n a
     ╭───┬────╮
     │ # │ a  │
     ├───┼────┤
     │ 0 │  8 │
     │ 1 │ 9  │
     │ 2 │ 10 │
     ╰───┴────╯
     ```

</details>

<details>
<summary>
Sorting a list of non-record values with a non-existent column/path now
errors instead of sorting the values directly (<code>sort</code> should
be used for this, not <code>sort-by</code>)
</summary>

Old behavior:

```nushell
$ [2 1] | sort-by foo
╭───┬───╮
│ 0 │ 1 │
│ 1 │ 2 │
╰───┴───╯
```

New behavior:

```nushell
$ [2 1] | sort-by foo
Error: nu:🐚:incompatible_path_access

  × Data cannot be accessed with a cell path
   ╭─[entry #29:1:17]
 1 │ [2 1] | sort-by foo
   ·                 ─┬─
   ·                  ╰── int doesn't support cell paths
   ╰────
```

</details>

<details>
<summary><code>sort</code> and <code>sort-by</code> output
<code>List</code> instead of <code>ListStream</code> </summary>

This isn't a meaningful change (unless I misunderstand the purpose of
ListStream), since `sort` and `sort-by` both need to collect in order to
do the sorting anyway, but is user observable.

Old behavior:

```nushell
$ ls | sort | describe -d
╭──────────┬───────────────────╮
│ type     │ stream            │
│ origin   │ nushell           │
│ subtype  │ {record 3 fields} │
│ metadata │ {record 1 field}  │
╰──────────┴───────────────────╯
```

```nushell
$ ls | sort-by name | describe -d
╭──────────┬───────────────────╮
│ type     │ stream            │
│ origin   │ nushell           │
│ subtype  │ {record 3 fields} │
│ metadata │ {record 1 field}  │
╰──────────┴───────────────────╯
```

New behavior:


```nushell
ls | sort | describe -d
╭────────┬─────────────────╮
│ type   │ list            │
│ length │ 22              │
│ values │ [table 22 rows] │
╰────────┴─────────────────╯
```

```nushell
$ ls | sort-by name | describe -d
╭────────┬─────────────────╮
│ type   │ list            │
│ length │ 22              │
│ values │ [table 22 rows] │
╰────────┴─────────────────╯
```

</details>

- `sort` now errors when nothing is piped in (`sort-by` already did
this)

# Tests + Formatting

I added lots of unit tests on the new sort implementation to enforce new
sort behaviors and prevent regressions.

# After Submitting

See [docs PR](nushell/nushell.github.io#1568),
which is ~2/3 finished.

---------

Co-authored-by: NotTheDr01ds <32344964+NotTheDr01ds@users.noreply.github.com>
Co-authored-by: Ian Manske <ian.manske@pm.me>
@NotTheDr01ds NotTheDr01ds added the wait-after-next-release The PR will be merged after next Nu release label Oct 11, 2024
@kubouch kubouch mentioned this pull request Oct 15, 2024
10 tasks
@devyn
Copy link
Contributor

devyn commented Oct 16, 2024

Is this ready to go? Now that 0.99.0 has released we can merge

@devyn devyn removed the wait-after-next-release The PR will be merged after next Nu release label Oct 16, 2024
@132ikl 132ikl marked this pull request as ready for review October 16, 2024 21:10
@132ikl
Copy link
Member Author

132ikl commented Oct 16, 2024

@devyn Just finished up everything I was planning 😄

@sholderbach What are your thoughts on the semantics you brought up? I also added some more info about how the ordering between types works in the "Sorting with mixed types" section, I would appreciate your feedback on that as well!

@132ikl
Copy link
Member Author

132ikl commented Nov 8, 2024

I'd like to get this landed, @sholderbach if you are still concerned about those sections you commented on I can modify them or outright remove them. Let me know what you think 😄

Copy link
Member

@sholderbach sholderbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with the write up thanks! Sorry for not getting back.

Have some vague thoughts about forward compatibility around any type extensions but those are future problems 😄

@sholderbach sholderbach merged commit e701983 into nushell:main Nov 8, 2024
2 checks passed
@fdncred
Copy link
Contributor

fdncred commented Nov 9, 2024

I just wanted to drop by and say that this sorting documentation is completely awesome! I'm impressed by the thoroughness and all the details but keeping it very approachable with samples. Great work!

@132ikl
Copy link
Member Author

132ikl commented Nov 9, 2024

Thanks so much, I appreciate it 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants