Discuss data key category for properties resources #1196

echeran · 2021-10-21T00:39:40Z

Currently, the ResourceKey category name is uniset for the UnicodeSet data (for binary and enumerated properties) and for CodePointTrie data. For more context, see this comment.

Briefly, that means we have keys like

...
"uniset/AHex@1"  // UnicodeSet for binary property (AHex aka ASCII_Hex_Digit)
...
"uniset/gc=Lu@1" // UnicodeSet for enumerated property prop=val (gc=Lu aka General_Category=Uppercase_Letter)
...
"uniset/gc@1" // CodePointTrie for enumerated property
...

We could change the ResourceKey category from uniset to properties. That would keep all of the keys in the same category, as they currently are. But it does not reflect the type of data in the payload being returned.

Alternatively, we could change the category for CodePointTrie keys from uniset to codepointttrie or cpt. Then, "uniset/gc@1" becomes "codepointtrie/gc@1". This would make the category name reflect the type of data in the data payload / data struct, but some properties will "appear" in these 2 different keys (albeit differently) -- ex: "uniset/gc=Lu@1" vs. "codepointtrie/gc@1".

Need approval:

The text was updated successfully, but these errors were encountered:

sffc · 2021-10-21T01:02:42Z

CC @iainireland

I could go either way on this one. I think I lean toward putting all the data in the properties category just for simplicity.

iainireland · 2021-10-21T16:27:05Z

Right now, we distinguish between enumerated property sets and binary property sets by checking whether the key contains an '='. Will putting additional keys in the same namespace interfere with that? It seems fine in the specialized case (eg DataProvider<'_, UnicodePropertyV1Marker>), but I don't know much about how our more generic data providers work.

If it doesn't cause any problems, I'm fine either way.

echeran · 2021-10-21T17:00:15Z

Right now, we distinguish between enumerated property sets and binary property sets by [checking](https://github.com/unicode-org/icu4x/blob/main/provider/uprops/src/provider.rs#L37-L41 whether the key contains an '='. Will putting additional keys in the same namespace interfere with that? It seems fine in the specialized case (eg DataProvider<'_, UnicodePropertyV1Marker>), but I don't know much about how our more generic data providers work.

If it doesn't cause any problems, I'm fine either way.

I lean towards having different category names because the type/shape of the return data is different.

It is analogous to what we did for data struct naming -- even though we have the UnicodeProperty data struct that transports UnicodeSet data, we still had to create a UnicodePropertyMap to transport CodePointTrie data. (It would have been awkward to try to shoehorn a CodePointTrie into UnicodeProperty -- would you have to use Option<...> for all the fields? -- even though the name suggests that it should be encompassing of any type of property data).

From the data struct naming ambiguity, I conclude that it is simpler to name data structs after the shape of the data they return, not how they are used or what functionality they represent. A further example: in the hypothetical scenario that Iain wants to generate UnicodeSets from enumerated property key=val pairings using CodePointTrie (for overall memory savings at cost of some perf hit), he will care about the shape of data, not just the purpose or code component.

sffc · 2021-11-12T23:40:02Z

We should consider the namespacing to have four types:

Binary property of code points
Binary property of code points or strings
Enumerated property of code points
Enumerated property of code points or strings

How about

prop_cp_set
prop_uni_set
prop_cp_map
prop_uni_map

sffc · 2021-11-15T21:03:21Z

Okay, after thinking about this again, I've gone back to preferring a single properties namespace. Reasons:

The one and only thing that determines the data schema is the key itself. The resource category (top-level folder) is merely an organizational mechanism; it has no semantic meaning in terms of the data schema.
Not all property data fits nicely into these four types. For example, Script_Extensions has its own schema. Do we put it into an existing category, or do we make a new category for it? I definitely do not think we should be creating new data categories for special cases like this.
All data categories currently are per-component (one category per component), and I like it that way. It would imply renaming the category for properties data to "properties" (or "props" or "uprops").

iainireland · 2021-11-15T23:27:25Z

Shane's position seems reasonable. I'm in favour of lumping everything into the same namespace, unless we need to split up namespaces to avoid ambiguities.

Is there any risk of ambiguity between Shane's categories 3+4 above (enumerated properties of code points vs enumerated properties of code points or strings)? Will we ever want to provide two versions of the same property, where one supports both strings and code points, and the other supports only code points?

sffc · 2021-11-15T23:41:16Z

Is there any risk of ambiguity between Shane's categories 3+4 above (enumerated properties of code points vs enumerated properties of code points or strings)? Will we ever want to provide two versions of the same property, where one supports both strings and code points, and the other supports only code points?

As I understand, Unicode will never change a property of code points to a property of code points or strings, so category 4 would only be used for new properties that don't exist yet. It could be used for a theoretical property like "emoji emotional state": given an emoji sequence, tell whether it is happy, sad, or neutral (enumerated property of strings).

aethanyc · 2021-11-15T23:42:38Z

I don't have a strong opinion for either approach, but going for the simplicity is more appeal to me.

A further example: in the hypothetical scenario that Iain wants to generate UnicodeSets from enumerated property key=val pairings using CodePointTrie (for overall memory savings at cost of some perf hit), he will care about the shape of data, not just the purpose or code component.

This scenario is similar to my comment in #1273 (comment).

iainireland · 2021-11-16T00:34:46Z

As I understand, Unicode will never change a property of code points to a property of code points or strings, so category 4 would only be used for new properties that don't exist yet. It could be used for a theoretical property like "emoji emotional state": given an emoji sequence, tell whether it is happy, sad, or neutral (enumerated property of strings).

I was imagining a scenario where there's a property that is defined for strings and code points, but there's a compelling use case that only needs code points. In theory we might want to be able to provide only the code point data for that property, in which case the unified namespace might hypothetically be an obstacle.

sffc · 2021-11-16T00:57:30Z

Interesting scenario. I think in such a case, we would just invent a new key in the same namespace; something like "foo@1" for the full version and "foo;cp@1" for the code-point-only version. The keys like "sc=Cyrl" are already special cases.

echeran · 2021-11-18T17:59:23Z

I can see the argument for consistency in having the ResourceKey category be "properties" to directly correspond with its source code component.

Let me sketch this out and try to organize it for my own understanding...

We have 2 dimensions: *

return type of data **
property

* Maybe these 2 dimensions are not exactly independent, depending on how we want to interpret things. The Script property can be returned in its entirety as a code point map (for which our data key subcategory string currently uses sc), or the property in combination with a specific property value determines that we are returning a UnicodeSet (for which our data key subcategory string currently might use sc=Knda).
** In addition to the 4 listed above (prop_cp_set, prop_uni_set, prop_cp_map, prop_uni_map), we also need one for Script / Script_Extensions because it returns a CodePointTrie with specially modified data array values, plus a companion array. Options for this name: script or sc_scx ?

I wonder if we will have enough characters in the TinyStr16 to hold all this info based on the above discussion. Let me see if I can sketch out examples to answer that question one way or another:

	Hex_Digit	General_Category	Script_Extensions	Basic_Emoji
Binary property of code points	Hex;cp	gc=Lu;cp	scx=Knda;cp	Basic_Emoji;cp (don't use)
Binary property of code points or strings	Hex (same as `Hex;cp`)	gc=Lu	scx=Knda	Basic_Emoji
Enumerated property of code points	~	gc;cp	~	~
Enumerated property of code points or strings	~	gc (same as `gc;cp`)	~	~
Script_Extensions	~	~	scx	~

The "Script Extensions" row has to be different from enumerated properties because the return value for a code point is an array (not a single value representable as an integer), so we need to have unique getter fns for it, similar to what ICU does.

sffc · 2021-11-18T18:38:27Z

@sffc - I think the ;cp should only be used for the micro-optimization use case, not in general
@iainireland - It's only used in cases where you know for sure you don't want the string data

sffc · 2021-12-09T18:55:38Z

@iainireland - Do the conceptually simple thing
@sffc - I agree with the general concept, but the annotations like ;cp is not something we need right now
@aethanyc - I don't have a strong opinion

echeran added discuss Discuss at a future ICU4X-SC meeting C-unicode Component: Props, sets, tries T-techdebt Type: ICU4X code health and tech debt labels Oct 21, 2021

sffc added the S-small Size: One afternoon (small bug fix or enhancement) label Oct 21, 2021

echeran mentioned this issue Oct 28, 2021

Connect properties provider to the icu4x_datagen exporter tool #1204

Merged

sapriyag added this to the 2021 Q4 0.5 Sprint B milestone Nov 8, 2021

echeran added needs-approval One or more stakeholders need to approve proposal and removed discuss Discuss at a future ICU4X-SC meeting labels Nov 9, 2021

echeran added the discuss-priority Discuss at the next ICU4X meeting label Nov 18, 2021

echeran self-assigned this Nov 18, 2021

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Nov 18, 2021

sffc modified the milestones: 2021 Q4 0.5 Sprint B, 2021 Q4 0.5 Sprint C Nov 18, 2021

echeran added the discuss Discuss at a future ICU4X-SC meeting label Dec 9, 2021

echeran modified the milestones: 2021 Q4 0.5 Sprint C, 2021 Q4 0.5 Sprint D Dec 9, 2021

sffc removed the discuss Discuss at a future ICU4X-SC meeting label Dec 9, 2021

sffc removed the needs-approval One or more stakeholders need to approve proposal label Dec 9, 2021

echeran mentioned this issue Dec 16, 2021

Rename resource key category for properties #1406

Merged

echeran closed this as completed in #1406 Dec 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discuss data key category for properties resources #1196

Discuss data key category for properties resources #1196

echeran commented Oct 21, 2021 •

edited by sffc

Loading

sffc commented Oct 21, 2021

iainireland commented Oct 21, 2021 •

edited by aethanyc

Loading

echeran commented Oct 21, 2021

sffc commented Nov 12, 2021

sffc commented Nov 15, 2021

iainireland commented Nov 15, 2021

sffc commented Nov 15, 2021 •

edited

Loading

aethanyc commented Nov 15, 2021

iainireland commented Nov 16, 2021

sffc commented Nov 16, 2021

echeran commented Nov 18, 2021

sffc commented Nov 18, 2021

sffc commented Dec 9, 2021

Discuss data key category for properties resources #1196

Discuss data key category for properties resources #1196

Comments

echeran commented Oct 21, 2021 • edited by sffc Loading

sffc commented Oct 21, 2021

iainireland commented Oct 21, 2021 • edited by aethanyc Loading

echeran commented Oct 21, 2021

sffc commented Nov 12, 2021

sffc commented Nov 15, 2021

iainireland commented Nov 15, 2021

sffc commented Nov 15, 2021 • edited Loading

aethanyc commented Nov 15, 2021

iainireland commented Nov 16, 2021

sffc commented Nov 16, 2021

echeran commented Nov 18, 2021

sffc commented Nov 18, 2021

sffc commented Dec 9, 2021

echeran commented Oct 21, 2021 •

edited by sffc

Loading

iainireland commented Oct 21, 2021 •

edited by aethanyc

Loading

sffc commented Nov 15, 2021 •

edited

Loading