Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss data key category for properties resources #1196

Closed
3 tasks done
echeran opened this issue Oct 21, 2021 · 13 comments · Fixed by #1406
Closed
3 tasks done

Discuss data key category for properties resources #1196

echeran opened this issue Oct 21, 2021 · 13 comments · Fixed by #1406
Assignees
Labels
C-unicode Component: Props, sets, tries S-small Size: One afternoon (small bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt

Comments

@echeran
Copy link
Contributor

echeran commented Oct 21, 2021

Currently, the ResourceKey category name is uniset for the UnicodeSet data (for binary and enumerated properties) and for CodePointTrie data. For more context, see this comment.

Briefly, that means we have keys like

...
"uniset/AHex@1"  // UnicodeSet for binary property (AHex aka ASCII_Hex_Digit)
...
"uniset/gc=Lu@1" // UnicodeSet for enumerated property prop=val (gc=Lu aka General_Category=Uppercase_Letter)
...
"uniset/gc@1" // CodePointTrie for enumerated property
...

We could change the ResourceKey category from uniset to properties. That would keep all of the keys in the same category, as they currently are. But it does not reflect the type of data in the payload being returned.

Alternatively, we could change the category for CodePointTrie keys from uniset to codepointttrie or cpt. Then, "uniset/gc@1" becomes "codepointtrie/gc@1". This would make the category name reflect the type of data in the data payload / data struct, but some properties will "appear" in these 2 different keys (albeit differently) -- ex: "uniset/gc=Lu@1" vs. "codepointtrie/gc@1".

Need approval:

@echeran echeran added discuss Discuss at a future ICU4X-SC meeting C-unicode Component: Props, sets, tries T-techdebt Type: ICU4X code health and tech debt labels Oct 21, 2021
@sffc sffc added the S-small Size: One afternoon (small bug fix or enhancement) label Oct 21, 2021
@sffc
Copy link
Member

sffc commented Oct 21, 2021

CC @iainireland

I could go either way on this one. I think I lean toward putting all the data in the properties category just for simplicity.

@iainireland
Copy link
Contributor

iainireland commented Oct 21, 2021

Right now, we distinguish between enumerated property sets and binary property sets by checking whether the key contains an '='. Will putting additional keys in the same namespace interfere with that? It seems fine in the specialized case (eg DataProvider<'_, UnicodePropertyV1Marker>), but I don't know much about how our more generic data providers work.

If it doesn't cause any problems, I'm fine either way.

@echeran
Copy link
Contributor Author

echeran commented Oct 21, 2021

Right now, we distinguish between enumerated property sets and binary property sets by [checking](https://github.com/unicode-org/icu4x/blob/main/provider/uprops/src/provider.rs#L37-L41 whether the key contains an '='. Will putting additional keys in the same namespace interfere with that? It seems fine in the specialized case (eg DataProvider<'_, UnicodePropertyV1Marker>), but I don't know much about how our more generic data providers work.

If it doesn't cause any problems, I'm fine either way.

I lean towards having different category names because the type/shape of the return data is different.

It is analogous to what we did for data struct naming -- even though we have the UnicodeProperty data struct that transports UnicodeSet data, we still had to create a UnicodePropertyMap to transport CodePointTrie data. (It would have been awkward to try to shoehorn a CodePointTrie into UnicodeProperty -- would you have to use Option<...> for all the fields? -- even though the name suggests that it should be encompassing of any type of property data).

From the data struct naming ambiguity, I conclude that it is simpler to name data structs after the shape of the data they return, not how they are used or what functionality they represent. A further example: in the hypothetical scenario that Iain wants to generate UnicodeSets from enumerated property key=val pairings using CodePointTrie (for overall memory savings at cost of some perf hit), he will care about the shape of data, not just the purpose or code component.

@sapriyag sapriyag added this to the 2021 Q4 0.5 Sprint B milestone Nov 8, 2021
@echeran echeran added needs-approval One or more stakeholders need to approve proposal and removed discuss Discuss at a future ICU4X-SC meeting labels Nov 9, 2021
@sffc
Copy link
Member

sffc commented Nov 12, 2021

We should consider the namespacing to have four types:

  1. Binary property of code points
  2. Binary property of code points or strings
  3. Enumerated property of code points
  4. Enumerated property of code points or strings

How about

  • prop_cp_set
  • prop_uni_set
  • prop_cp_map
  • prop_uni_map

@sffc
Copy link
Member

sffc commented Nov 15, 2021

Okay, after thinking about this again, I've gone back to preferring a single properties namespace. Reasons:

  1. The one and only thing that determines the data schema is the key itself. The resource category (top-level folder) is merely an organizational mechanism; it has no semantic meaning in terms of the data schema.
  2. Not all property data fits nicely into these four types. For example, Script_Extensions has its own schema. Do we put it into an existing category, or do we make a new category for it? I definitely do not think we should be creating new data categories for special cases like this.
  3. All data categories currently are per-component (one category per component), and I like it that way. It would imply renaming the category for properties data to "properties" (or "props" or "uprops").

@iainireland
Copy link
Contributor

Shane's position seems reasonable. I'm in favour of lumping everything into the same namespace, unless we need to split up namespaces to avoid ambiguities.

Is there any risk of ambiguity between Shane's categories 3+4 above (enumerated properties of code points vs enumerated properties of code points or strings)? Will we ever want to provide two versions of the same property, where one supports both strings and code points, and the other supports only code points?

@sffc
Copy link
Member

sffc commented Nov 15, 2021

Is there any risk of ambiguity between Shane's categories 3+4 above (enumerated properties of code points vs enumerated properties of code points or strings)? Will we ever want to provide two versions of the same property, where one supports both strings and code points, and the other supports only code points?

As I understand, Unicode will never change a property of code points to a property of code points or strings, so category 4 would only be used for new properties that don't exist yet. It could be used for a theoretical property like "emoji emotional state": given an emoji sequence, tell whether it is happy, sad, or neutral (enumerated property of strings).

@aethanyc
Copy link
Contributor

I don't have a strong opinion for either approach, but going for the simplicity is more appeal to me.

A further example: in the hypothetical scenario that Iain wants to generate UnicodeSets from enumerated property key=val pairings using CodePointTrie (for overall memory savings at cost of some perf hit), he will care about the shape of data, not just the purpose or code component.

This scenario is similar to my comment in #1273 (comment).

@iainireland
Copy link
Contributor

As I understand, Unicode will never change a property of code points to a property of code points or strings, so category 4 would only be used for new properties that don't exist yet. It could be used for a theoretical property like "emoji emotional state": given an emoji sequence, tell whether it is happy, sad, or neutral (enumerated property of strings).

I was imagining a scenario where there's a property that is defined for strings and code points, but there's a compelling use case that only needs code points. In theory we might want to be able to provide only the code point data for that property, in which case the unified namespace might hypothetically be an obstacle.

@sffc
Copy link
Member

sffc commented Nov 16, 2021

Interesting scenario. I think in such a case, we would just invent a new key in the same namespace; something like "foo@1" for the full version and "foo;cp@1" for the code-point-only version. The keys like "sc=Cyrl" are already special cases.

@echeran echeran added the discuss-priority Discuss at the next ICU4X meeting label Nov 18, 2021
@echeran
Copy link
Contributor Author

echeran commented Nov 18, 2021

I can see the argument for consistency in having the ResourceKey category be "properties" to directly correspond with its source code component.

Let me sketch this out and try to organize it for my own understanding...

We have 2 dimensions: *

  1. return type of data **
  2. property

* Maybe these 2 dimensions are not exactly independent, depending on how we want to interpret things. The Script property can be returned in its entirety as a code point map (for which our data key subcategory string currently uses sc), or the property in combination with a specific property value determines that we are returning a UnicodeSet (for which our data key subcategory string currently might use sc=Knda).
** In addition to the 4 listed above (prop_cp_set, prop_uni_set, prop_cp_map, prop_uni_map), we also need one for Script / Script_Extensions because it returns a CodePointTrie with specially modified data array values, plus a companion array. Options for this name: script or sc_scx ?

I wonder if we will have enough characters in the TinyStr16 to hold all this info based on the above discussion. Let me see if I can sketch out examples to answer that question one way or another:

Hex_Digit General_Category Script_Extensions Basic_Emoji
Binary property of code points Hex;cp gc=Lu;cp scx=Knda;cp Basic_Emoji;cp (don't use)
Binary property of code points or strings Hex (same as Hex;cp) gc=Lu scx=Knda Basic_Emoji
Enumerated property of code points ~ gc;cp ~ ~
Enumerated property of code points or strings ~ gc (same as gc;cp) ~ ~
Script_Extensions ~ ~ scx ~

The "Script Extensions" row has to be different from enumerated properties because the return value for a code point is an array (not a single value representable as an integer), so we need to have unique getter fns for it, similar to what ICU does.

@echeran echeran self-assigned this Nov 18, 2021
@sffc
Copy link
Member

sffc commented Nov 18, 2021

  • @sffc - I think the ;cp should only be used for the micro-optimization use case, not in general
  • @iainireland - It's only used in cases where you know for sure you don't want the string data

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Nov 18, 2021
@echeran echeran added the discuss Discuss at a future ICU4X-SC meeting label Dec 9, 2021
@sffc
Copy link
Member

sffc commented Dec 9, 2021

  • @iainireland - Do the conceptually simple thing
  • @sffc - I agree with the general concept, but the annotations like ;cp is not something we need right now
  • @aethanyc - I don't have a strong opinion

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Dec 9, 2021
@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label Dec 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries S-small Size: One afternoon (small bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants