Clarify SortingHOWTO regarding locale aware string sorting #91415

Closed
CendioOssman mannequin opened this issue Apr 8, 2022 · 3 comments

@CendioOssman
Mannequin

CendioOssman mannequin commented Apr 8, 2022

BPO 47259
Nosy @rhettinger, @stevendaprano, @CendioOssman

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/rhettinger'
closed_at = None
created_at = <Date 2022-04-08.10:31:59.766>
labels = ['3.11', '3.10', 'docs']
title = 'Clarify SortingHOWTO regarding locale aware string sorting'
updated_at = <Date 2022-04-08.17:49:04.631>
user = 'https://github.com/CendioOssman'

bugs.python.org fields:

activity = <Date 2022-04-08.17:49:04.631>
actor = 'rhettinger'
assignee = 'rhettinger'
closed = False
closed_date = None
closer = None
components = ['Documentation']
creation = <Date 2022-04-08.10:31:59.766>
creator = 'CendioOssman'
dependencies = []
files = []
hgrepos = []
issue_num = 47259
keywords = []
message_count = 2.0
messages = ['416972', '416997']
nosy_count = 3.0
nosy_names = ['rhettinger', 'steven.daprano', 'CendioOssman']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue47259'
versions = ['Python 3.10', 'Python 3.11']

@CendioOssman
Mannequin Author

CendioOssman mannequin commented Apr 8, 2022

There is a big gotcha in Python that is easily overlooked and should at the very least be more prominently pointed out in the documentation.

Sorting strings will produce results that are very confusing for humans.

It happens to work for ASCII, but will generally produce bad results for other things, as code points do not always follow alphabetical order.
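
A minimal sketch of the effect described here, using a made-up word list (the words are only an illustration, not taken from the documentation). Plain `sorted()` compares strings code point by code point, so every uppercase ASCII letter sorts before every lowercase one, and accented letters sort after `z`:

```python
# Hypothetical sample data: a mix of ASCII and accented initial letters.
words = ["zebra", "Apple", "éclair", "apple", "Zürich", "ähnlich"]

# Default string ordering compares Unicode code points, not alphabet positions.
by_code_point = sorted(words)
print(by_code_point)

# The code point of each first letter explains the result.
for word in by_code_point:
    print(f"{word!r}: first letter U+{ord(word[0]):04X}")
```

Here `"Zürich"` lands before `"apple"` and `"ähnlich"` lands after `"zebra"`, which is not what a reader of German or French would expect.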

The expressions chapter¹ mentions this fact, but you have to dig quite a bit to reach that. It also mentions that normalization is an issue, but it never mentions the issue about code point order versus alphabetical order.
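
The normalization issue mentioned above can be sketched as follows. The sample strings are hypothetical, and the example assumes NFC is the desired normal form. Two strings that render identically can compare unequal and sort far apart unless they are normalized first with `unicodedata.normalize()`:

```python
import unicodedata

composed = "caf\u00e9"      # "café" with é as the single code point U+00E9
decomposed = "cafe\u0301"   # "café" as "e" followed by combining acute U+0301

# Visually identical, but not equal as sequences of code points.
print(composed == decomposed)

# Unnormalized, the two spellings end up on opposite sides of "cafz".
words = [composed, "cafz", decomposed]
raw_order = sorted(words)

# Normalizing inside the key function makes both spellings compare alike.
def nfc(text):
    return unicodedata.normalize("NFC", text)

normalized_order = sorted(words, key=nfc)
print(raw_order)
print(normalized_order)
```

Normalization only removes the composed/decomposed ambiguity. The result is still ordered by code point, so it does not by itself give an alphabetical order.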

The sorting tutorial mentions under "Odds and ends"² that you need to use a special key or comparison function to get locale aware sorting. It doesn't mention that this also includes respecting alphabetical order, which might be overlooked unless you are very familiar with how the sorting works. The tutorial is also something you have to dig a bit to reach.
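
A hedged sketch of the special key function the HOWTO refers to, built on `locale.strxfrm()`. The helper name `sort_for_locale` and the locale name `de_DE.UTF-8` are assumptions made for illustration. Whether a given locale exists, and what order it produces, depends on the operating system:

```python
import locale

def sort_for_locale(words, locale_name=""):
    """Return words sorted by the collation rules of locale_name.

    Hypothetical helper: it switches LC_COLLATE temporarily and restores
    the previous setting afterwards. An empty name means "use the user's
    environment". Raises locale.Error if the locale is not available.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm() turns each string into a key that compares in locale order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)

words = ["zebra", "Apple", "ähnlich", "apple"]
try:
    print(sort_for_locale(words, "de_DE.UTF-8"))
except locale.Error:
    print("locale de_DE.UTF-8 is not installed on this system")
```

The comparison-function route would be `sorted(words, key=functools.cmp_to_key(locale.strcoll))`. Note that `locale.setlocale()` changes process-wide state, so a helper like this is not safe to call from several threads at once.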

Ideally string comparison would always be locale aware in a high level language such as Python. However, a smaller step would be a note on sorted()³ that extra care needs to be taken for strings as the default behaviour will produce unexpected results once your strings include anything outside the English alphabet.

¹ https://docs.python.org/3/reference/expressions.html
² https://docs.python.org/3/howto/sorting.html#odd-and-ends
³ https://docs.python.org/3/library/functions.html#sorted

@CendioOssman CendioOssman mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Apr 8, 2022
@rhettinger
Contributor

I don't think splashing this everywhere else in the docs would be helpful. Tools like list.sort, sorted, min, max, nlargest, nsmallest use whatever sort order is provided by the underlying object whether it be a string, tuple, float, or int.
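
To illustrate the point about the tools sharing one ordering, here is a small sketch with made-up sample words. `sorted`, `min`, `max`, `heapq.nsmallest`, and `heapq.nlargest` all rely on the comparison defined by `str` itself, so they agree with each other, code point order included:

```python
import heapq

# Hypothetical sample data.
words = ["zebra", "Apple", "apple", "éclair"]

ordered = sorted(words)
print(ordered)

# Every tool delegates to the same string comparison.
print(min(words), heapq.nsmallest(1, words)[0], ordered[0])
print(max(words), heapq.nlargest(1, words)[0], ordered[-1])
```

Because the ordering comes from the object, changing it is done the same way in each tool, through the `key` argument that all of these functions accept.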

The section on expressions is the intended place to cover how comparisons are defined for core objects: https://docs.python.org/3/reference/expressions.html#value-comparisons

As suggested, I will edit the sorting howto to make it clearer that locale aware sort ordering refers to alphabetical orderings, which can vary (for example, the Spanish ll sorts differently in different locales).
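
The Spanish example can be made concrete with a hand-written key function. This is only a sketch and not how any locale is implemented: it assumes the traditional Spanish alphabet in which "ch" and "ll" count as separate letters, and the names `ALPHABET`, `RANK`, and `traditional_spanish_key` are invented for the illustration. Accented vowels are not handled and simply sort last:

```python
# Traditional Spanish alphabet, with "ch" and "ll" treated as single letters.
ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: position for position, letter in enumerate(ALPHABET)}

def traditional_spanish_key(word):
    """Map a word to a list of alphabet positions (hypothetical helper)."""
    word = word.lower()
    ranks = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            # Consume both characters of the digraph as one letter.
            ranks.append(RANK[pair])
            i += 2
        else:
            # Unknown characters sort after the whole alphabet.
            ranks.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return ranks

words = ["luz", "llama", "lobo", "lima"]
print(sorted(words))                               # code point order
print(sorted(words, key=traditional_spanish_key))  # "ll" after every "l" word
```

The same four words come out in two different orders, which is the sense in which alphabetical ordering is a property of the locale rather than of the strings.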

@rhettinger rhettinger added docs Documentation in the Doc dir and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Apr 8, 2022
@rhettinger rhettinger self-assigned this Apr 8, 2022
@rhettinger rhettinger added 3.10 only security fixes 3.11 only security fixes labels Apr 8, 2022
@rhettinger rhettinger changed the title string sorting often incorrect Clarify SortingHOWTO regarding locale aware string sorting Apr 8, 2022
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@andjc

andjc commented Apr 14, 2022

Sorting, and collation in general, can be quite a complex topic. I am slowly working through writing my own notes on it, and it can bend the mind at times.

You have various approaches to sorting, including:

  • ISO/IEC 14651
  • normalised ISO/IEC 14651
  • language tailored ISO/IEC 14651
  • normalised language tailored ISO/IEC 14651
  • EOR (either ISO/IEC 14651 based implementation or CLDR based implementation)
  • DUCET
  • CLDR collation algorithm
  • language tailored CLDR collation
  • customisations to the CLDR collation algorithm or language tailored CLDR collations

You could probably write a whole book just to properly address available customisations to CLDR collations.

Added to that, you have the differences in locales between GNU/Linux, BSD/macOS, and Windows, and the fact that many locale collation tables on BSD and macOS are symlinked to one specific collation table (which negates the ability for language tailored collation).
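
Since locale names and availability differ between platforms, code that wants locale aware collation usually has to probe for a usable locale first. The sketch below is one way to do that. The helper name `first_available_locale` is invented, and the candidate names are assumptions about typical spellings on Linux/macOS and Windows, not a guaranteed list:

```python
import locale

def first_available_locale(candidates, category=locale.LC_COLLATE):
    """Return the first name in candidates that setlocale() accepts, else None.

    Hypothetical helper: the original setting is restored before returning.
    """
    previous = locale.setlocale(category)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(category, name)
            except locale.Error:
                continue  # not installed or not recognised on this platform
            return name
        return None
    finally:
        locale.setlocale(category, previous)

# Assumed spellings of a German locale on different platforms.
german = first_available_locale(["de_DE.UTF-8", "de_DE", "German_Germany.1252"])
print(german)
```

A locale that can be selected is not necessarily tailored for its language. As noted above, some platforms point many locales at one shared collation table, so the resulting order still needs to be checked on the target system.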

Then you have emoji collation sequences and the possibility of building custom collation rules, combining language tailored collation with emoji collation.

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 16, 2022
…TO (pythonGH-98336)

(cherry picked from commit ae19217)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>
miss-islington added a commit that referenced this issue Oct 16, 2022
…-98336)

(cherry picked from commit ae19217)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>
carljm added a commit to carljm/cpython that referenced this issue Oct 17, 2022
* main: (31 commits)
  pythongh-95913: Move subinterpreter exper removal to 3.11 WhatsNew (pythonGH-98345)
  pythongh-95914: Add What's New item describing PEP 670 changes (python#98315)
  Remove unused arrange_output_buffer function from zlibmodule.c. (pythonGH-98358)
  pythongh-98174: Handle EPROTOTYPE under macOS in test_sendfile_fallback_close_peer_in_the_middle_of_receiving (python#98316)
  pythonGH-98327: Reduce scope of catch_warnings() in _make_subprocess_transport (python#98333)
  pythongh-93691: Compiler's code-gen passes location around instead of holding it on the global compiler state (pythonGH-98001)
  pythongh-97669: Create Tools/build/ directory (python#97963)
  pythongh-95534: Improve gzip reading speed by 10% (python#97664)
  pythongh-95913: Forward-port int/str security change to 3.11 What's New in main (python#98344)
  pythonGH-91415: Mention alphabetical sort ordering in the Sorting HOWTO (pythonGH-98336)
  pythongh-97930: Merge with importlib_resources 5.9 (pythonGH-97929)
  pythongh-85525: Remove extra row in doc (python#98337)
  pythongh-85299: Add note warning about entry point guard for asyncio example (python#93457)
  pythongh-97527: IDLE - fix buggy macosx patch (python#98313)
  pythongh-98307: Add docstring and documentation for SysLogHandler.createSocket (pythonGH-98319)
  pythongh-94808: Cover `PyFunction_GetCode`, `PyFunction_GetGlobals`, `PyFunction_GetModule` (python#98158)
  pythonGH-94597: Deprecate child watcher getters and setters (python#98215)
  pythongh-98254: Include stdlib module names in error messages for NameErrors (python#98255)
  Improve speed. Reduce auxiliary memory to 16.6% of the main array. (pythonGH-98294)
  [doc] Update logging cookbook with an example of custom handling of levels. (pythonGH-98290)
  ...
pablogsal pushed a commit that referenced this issue Oct 22, 2022
…-98336)

(cherry picked from commit ae19217)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>