Evaluate consistency and naming of char vs u32 methods in icu_collections and icu_properties #2413

sffc · 2022-08-19T23:20:46Z

We inconsistently name methods in the various properties and collections classes that deal with char vs u32. Examples: contains(char), contains_32(u32), get(char), and get_u32(u32), but sometimes it is get(u32). And the get_u32 name sounds like it is returning a u32, similar to get_ule, when in fact it is an overload of the get method.

Feedback from @markusicu.

Thoughts?

The text was updated successfully, but these errors were encountered:

markusicu · 2022-08-19T23:23:34Z

I think the contains overloads work well.
Consider changing get_u32 to get_for_u32 or get_from_u32.
If a class/trait only ever deals with u32 and not char, then get(u32) should be fine.

Manishearth · 2022-08-19T23:53:23Z

I think get_from_u32 might be good yeah

Though I'm skeptical we should have these in the first place, I guess. it's easy enough to as u32 the char.

sffc · 2022-08-20T03:45:57Z

Concretely, the classes and functions in question are

CodePointInversionList: contains, contains_u32
CodePointSetDataBorrowed: contains, contains_u32
CodePointTrie: get, get_u32, get_ule
CanonicalCombiningClassMap: get, get_u32
CodePointMapDataBorrowed: get, get_u32
PropertyCodePointMapV1 internal type: get, get_u32
Char16TrieIterator: next, next_u16, next_u32

CodePointTrie::get is the only non-suffixed function to take a u32 argument. In CodePointTrie, get_u32 returns a u32. In all other places, we are consistent in taking a char.

I'm not sure what my preference is. I'm okay leaving things the way they are, and considering CodePointTrie a special case since it is a low-level collection type. If we start renaming things, what about:

get32 (more concise and doesn't as strongly suggest that we are getting a u32)
geti ("get by integer")
getu ("get by unsigned integer")

markusicu · 2022-08-20T22:48:23Z

Note: The data structures are designed to map from code points to values. In Rust, supporting all code points requires u32 because char forbids surrogate code points.

Therefore, one could argue that the primary input should be a u32. Lookup via char would use a cast, or an "override".

get_u32 taking a u32 instead of returning a u32 seems misleading. getu would be better.

sffc · 2022-08-25T17:36:11Z

Discussion:

@robertbastian - u32_get()
@zbraniecki - Rust generally recommends not to have get
@zbraniecki - get_with_u32, get_from_u32
@sffc - These are basically overloads; should we consider shorter method names? like get32
@Manishearth - I like get32 because it doesn't tell my brain that I am getting a u32 return value
@zbraniecki - Do we need both versions with the overloads?
@Manishearth - The use case is that if you have a u32 character, but you don't know if it's valid, you can query the ICU4X collection and get back a valid value, which could be a default/error value.
@zbraniecki - Seems like the Rusty thing is to use TryInto for char. The version using u32 is more specific.
@sffc - There are valid use cases for the u32. If we have only one getter, it should be the u32 version, not the char version.
@robertbastian - Should we implement the Index trait?
@sffc - The Index trait returns by reference and we return by value
@Manishearth - That's a long-standing issue with the Index trait
@robertbastian - TryInto is not the Rusty way to access a collection

Proposal:

get(char)
get32(u32)
contains(char)
contains32(u32)
next(char)
next16(u16) // code unit (Char16Trie only)
next32(u32)

OK: @sffc @Manishearth @robertbastian @nordzilla

markusicu · 2022-08-25T18:09:12Z

The proposal wfm.

robertbastian · 2024-10-02T15:54:25Z

Still needs docs work

robertbastian · 2024-10-03T08:22:39Z

Given that we have decided to use try_from_utf8 for unvalidated string constructors, I'd like to reopen this discussion. I think a more consistent naming for the 32 methods would now be contains_utf32. Is this worth changing?

sffc · 2024-10-03T11:56:20Z

The problem was with get according to the discussion above. If you say get_utf32, the thinking was, then it looks like you are getting a UTF-32 code unit, when in reality you are passing one in as a parameter. (I don't know how I personally feel)

sffc · 2024-10-03T18:29:32Z

@Manishearth I'm fine with get and contains being in different namespaces
@sffc We have things like get32_u32
@sffc A difference between these and try_from_utf8 is that one works on code points and the other works on strings
@robertbastian - I think it's almost the same to take a sequence of unvalidated UTF-8 code units, or a single unvalidated UTF-32 code unit
@robertbastian - There's also next16, whose documentation isn't great: https://unicode-org.github.io/icu4x/rustdoc/icu/collections/char16trie/struct.Char16TrieIterator.html#method.next16
@sffc why do we even return unvalidated things?
@robertbastian - Seems like we should return the upgraded type and let people convert to a downgraded type.
@sffc There's also get32_ule which returns a reference, which we actually want.
@echeran It seems fine to me if get and contains are different signatures. contains always returns a bool.
@robertbastian We also have next and other methods. I would like if we wouldn't need to make this decision on each API. If we get rid of get32_u32, we can adopt my proposed naming scheme.
@sffc How would we name get32_ule?
@robertbastian Probably get_ule_utf32
@robertbastian We say "utf32" which is short for "potential_utf32"
@echeran It seems redundant to say "potential_utf32" for a single code unit; may as well just be "u32". If you want to communicate any further, you have docs.
@robertbastian - u32 is already in the signature, it's redundant to put it twice. Also, utf32 highlights that it's meant to be an unvalidated char, whereas a u32 could be a number or something
@sffc I don't really think that the decision we made in 2.0 for string functions should invalidate the decision we made in 1.0 for code point functions. A lot of the arguments being brought up here are the same ones that we had previously discussed.
@roberbastian - a utf32 suffix was not discussed back then and I think it's way better. Otherwise I'll need to go and write that the u32 is a potential UTF-32 code unit in all the docs

No conclusion yet.

markusicu · 2024-10-03T18:41:23Z

utf32 is a string encoding. u32 is one possible type for a code point.

sffc · 2024-10-23T15:33:41Z

@sffc We previously discussed the naming with regard to underscores. I don't see a great way to address that.
@robertbastian I think utf32 is more clear, but it's just an improvement, and if we don't have consensus, it's less work.

Conclusion:

Stick with get32-type naming in ICU4X 2.0
Discuss with @markusicu (not blocking 2.0) what to say in the docs:
- "a potentially ill-formed Unicode code point"
- "a UTF-32 code unit"
- "a potentially ill-formed UTF-32 code unit"

LGTM: @sffc @robertbastian

Manishearth · 2024-10-23T15:58:41Z

LGTM

sffc added the discuss-priority Discuss at the next ICU4X meeting label Aug 19, 2022

sffc added this to the ICU4X 1.0 (Final) milestone Aug 19, 2022

sffc added T-core Type: Required functionality C-unicode Component: Props, sets, tries S-small Size: One afternoon (small bug fix or enhancement) good first issue Good for newcomers and removed discuss-priority Discuss at the next ICU4X meeting labels Aug 25, 2022

robertbastian assigned dsipasseuth Aug 26, 2022

dsipasseuth mentioned this issue Aug 26, 2022

renaming functions to match agreed conventions #2460

Merged

robertbastian closed this as completed in #2460 Sep 7, 2022

This was referenced Nov 22, 2023

Bump Diplomat and use DiplomatChar #4349

Merged

Change CodePointInversionListBuilder::add_u32 to add32 and check for others #4354

Closed

Bump Diplomat and use DiplomatStr[16] #4353

Merged

robertbastian reopened this Sep 30, 2024

robertbastian assigned robertbastian and unassigned dsipasseuth Sep 30, 2024

robertbastian mentioned this issue Sep 30, 2024

More foo32 consistency #5619

Merged

robertbastian modified the milestones: ICU4X 1.0 (Final), ICU4X 2.0 Beta ⟨P1⟩ Oct 1, 2024

robertbastian closed this as completed Oct 1, 2024

robertbastian reopened this Oct 2, 2024

robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Oct 3, 2024

robertbastian removed the good first issue Good for newcomers label Oct 3, 2024

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Oct 31, 2024

sffc modified the milestones: ICU4X 2.0 Beta ⟨P1⟩, ICU4X 2.0 ⟨P1⟩ Oct 31, 2024

robertbastian removed their assignment Nov 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluate consistency and naming of char vs u32 methods in icu_collections and icu_properties #2413

Evaluate consistency and naming of char vs u32 methods in icu_collections and icu_properties #2413

sffc commented Aug 19, 2022 •

edited

Loading

markusicu commented Aug 19, 2022

Manishearth commented Aug 19, 2022

sffc commented Aug 20, 2022 •

edited

Loading

markusicu commented Aug 20, 2022

sffc commented Aug 25, 2022

markusicu commented Aug 25, 2022

robertbastian commented Oct 2, 2024

robertbastian commented Oct 3, 2024

sffc commented Oct 3, 2024

sffc commented Oct 3, 2024

markusicu commented Oct 3, 2024

sffc commented Oct 23, 2024

Manishearth commented Oct 23, 2024

Evaluate consistency and naming of char vs u32 methods in icu_collections and icu_properties #2413

Evaluate consistency and naming of char vs u32 methods in icu_collections and icu_properties #2413

Comments

sffc commented Aug 19, 2022 • edited Loading

markusicu commented Aug 19, 2022

Manishearth commented Aug 19, 2022

sffc commented Aug 20, 2022 • edited Loading

markusicu commented Aug 20, 2022

sffc commented Aug 25, 2022

markusicu commented Aug 25, 2022

robertbastian commented Oct 2, 2024

robertbastian commented Oct 3, 2024

sffc commented Oct 3, 2024

sffc commented Oct 3, 2024

markusicu commented Oct 3, 2024

sffc commented Oct 23, 2024

Manishearth commented Oct 23, 2024

sffc commented Aug 19, 2022 •

edited

Loading

sffc commented Aug 20, 2022 •

edited

Loading