Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluate consistency and naming of char vs u32 methods in icu_collections and icu_properties #2413

Open
2 tasks
sffc opened this issue Aug 19, 2022 · 13 comments · Fixed by #2460
Open
2 tasks
Labels
C-unicode Component: Props, sets, tries S-small Size: One afternoon (small bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Aug 19, 2022

We inconsistently name methods in the various properties and collections classes that deal with char vs u32. Examples: contains(char), contains_32(u32), get(char), and get_u32(u32), but sometimes it is get(u32). And the get_u32 name sounds like it is returning a u32, similar to get_ule, when in fact it is an overload of the get method.

Feedback from @markusicu.

Thoughts?

@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Aug 19, 2022
@sffc sffc added this to the ICU4X 1.0 (Final) milestone Aug 19, 2022
@markusicu
Copy link
Member

I think the contains overloads work well.
Consider changing get_u32 to get_for_u32 or get_from_u32.
If a class/trait only ever deals with u32 and not char, then get(u32) should be fine.

@Manishearth
Copy link
Member

I think get_from_u32 might be good yeah

Though I'm skeptical we should have these in the first place, I guess. it's easy enough to as u32 the char.

@sffc
Copy link
Member Author

sffc commented Aug 20, 2022

Concretely, the classes and functions in question are

  • CodePointInversionList: contains, contains_u32
  • CodePointSetDataBorrowed: contains, contains_u32
  • CodePointTrie: get, get_u32, get_ule
  • CanonicalCombiningClassMap: get, get_u32
  • CodePointMapDataBorrowed: get, get_u32
  • PropertyCodePointMapV1 internal type: get, get_u32
  • Char16TrieIterator: next, next_u16, next_u32

CodePointTrie::get is the only non-suffixed function to take a u32 argument. In CodePointTrie, get_u32 returns a u32. In all other places, we are consistent in taking a char.

I'm not sure what my preference is. I'm okay leaving things the way they are, and considering CodePointTrie a special case since it is a low-level collection type. If we start renaming things, what about:

  • get32 (more concise and doesn't as strongly suggest that we are getting a u32)
  • geti ("get by integer")
  • getu ("get by unsigned integer")

@markusicu
Copy link
Member

Note: The data structures are designed to map from code points to values. In Rust, supporting all code points requires u32 because char forbids surrogate code points.

Therefore, one could argue that the primary input should be a u32. Lookup via char would use a cast, or an "override".


get_u32 taking a u32 instead of returning a u32 seems misleading. getu would be better.

@sffc
Copy link
Member Author

sffc commented Aug 25, 2022

Discussion:

  • @robertbastian - u32_get()
  • @zbraniecki - Rust generally recommends not to have get
  • @zbraniecki - get_with_u32, get_from_u32
  • @sffc - These are basically overloads; should we consider shorter method names? like get32
  • @Manishearth - I like get32 because it doesn't tell my brain that I am getting a u32 return value
  • @zbraniecki - Do we need both versions with the overloads?
  • @Manishearth - The use case is that if you have a u32 character, but you don't know if it's valid, you can query the ICU4X collection and get back a valid value, which could be a default/error value.
  • @zbraniecki - Seems like the Rusty thing is to use TryInto for char. The version using u32 is more specific.
  • @sffc - There are valid use cases for the u32. If we have only one getter, it should be the u32 version, not the char version.
  • @robertbastian - Should we implement the Index trait?
  • @sffc - The Index trait returns by reference and we return by value
  • @Manishearth - That's a long-standing issue with the Index trait
  • @robertbastian - TryInto is not the Rusty way to access a collection

Proposal:

  • get(char)
  • get32(u32)
  • contains(char)
  • contains32(u32)
  • next(char)
  • next16(u16) // code unit (Char16Trie only)
  • next32(u32)

OK: @sffc @Manishearth @robertbastian @nordzilla

@sffc sffc added T-core Type: Required functionality C-unicode Component: Props, sets, tries S-small Size: One afternoon (small bug fix or enhancement) good first issue Good for newcomers and removed discuss-priority Discuss at the next ICU4X meeting labels Aug 25, 2022
@markusicu
Copy link
Member

The proposal wfm.

@robertbastian
Copy link
Member

Still needs docs work

@robertbastian robertbastian reopened this Oct 2, 2024
@robertbastian
Copy link
Member

Given that we have decided to use try_from_utf8 for unvalidated string constructors, I'd like to reopen this discussion. I think a more consistent naming for the 32 methods would now be contains_utf32. Is this worth changing?

@robertbastian robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Oct 3, 2024
@robertbastian robertbastian removed the good first issue Good for newcomers label Oct 3, 2024
@sffc
Copy link
Member Author

sffc commented Oct 3, 2024

The problem was with get according to the discussion above. If you say get_utf32, the thinking was, then it looks like you are getting a UTF-32 code unit, when in reality you are passing one in as a parameter. (I don't know how I personally feel)

@sffc
Copy link
Member Author

sffc commented Oct 3, 2024

  • @Manishearth I'm fine with get and contains being in different namespaces
  • @sffc We have things like get32_u32
  • @sffc A difference between these and try_from_utf8 is that one works on code points and the other works on strings
  • @robertbastian - I think it's almost the same to take a sequence of unvalidated UTF-8 code units, or a single unvalidated UTF-32 code unit
  • @robertbastian - There's also next16, whose documentation isn't great: https://unicode-org.github.io/icu4x/rustdoc/icu/collections/char16trie/struct.Char16TrieIterator.html#method.next16
  • @sffc why do we even return unvalidated things?
  • @robertbastian - Seems like we should return the upgraded type and let people convert to a downgraded type.
  • @sffc There's also get32_ule which returns a reference, which we actually want.
  • @echeran It seems fine to me if get and contains are different signatures. contains always returns a bool.
  • @robertbastian We also have next and other methods. I would like if we wouldn't need to make this decision on each API. If we get rid of get32_u32, we can adopt my proposed naming scheme.
  • @sffc How would we name get32_ule?
  • @robertbastian Probably get_ule_utf32
  • @robertbastian We say "utf32" which is short for "potential_utf32"
  • @echeran It seems redundant to say "potential_utf32" for a single code unit; may as well just be "u32". If you want to communicate any further, you have docs.
  • @robertbastian - u32 is already in the signature, it's redundant to put it twice. Also, utf32 highlights that it's meant to be an unvalidated char, whereas a u32 could be a number or something
  • @sffc I don't really think that the decision we made in 2.0 for string functions should invalidate the decision we made in 1.0 for code point functions. A lot of the arguments being brought up here are the same ones that we had previously discussed.
  • @roberbastian - a utf32 suffix was not discussed back then and I think it's way better. Otherwise I'll need to go and write that the u32 is a potential UTF-32 code unit in all the docs

No conclusion yet.

@markusicu
Copy link
Member

utf32 is a string encoding. u32 is one possible type for a code point.

@sffc
Copy link
Member Author

sffc commented Oct 23, 2024

  • @sffc We previously discussed the naming with regard to underscores. I don't see a great way to address that.
  • @robertbastian I think utf32 is more clear, but it's just an improvement, and if we don't have consensus, it's less work.

Conclusion:

  • Stick with get32-type naming in ICU4X 2.0
  • Discuss with @markusicu (not blocking 2.0) what to say in the docs:
    • "a potentially ill-formed Unicode code point"
    • "a UTF-32 code unit"
    • "a potentially ill-formed UTF-32 code unit"

LGTM: @sffc @robertbastian

@Manishearth
Copy link
Member

LGTM

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Oct 31, 2024
@robertbastian robertbastian removed their assignment Nov 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries S-small Size: One afternoon (small bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants