Support for UTF-8 Identifiers #2848

erslavin · 2023-12-07T16:38:36Z

Description of Change(s)

- Added UTF-8 character class support to Path parser
- Added UTF-8 based validation rules for SDF identifier methods
- Added UTF-8 support to lex-based text file format lexer
- Added tests for UTF-8 based identifiers

Fixes Issue(s)

I have verified that all unit tests pass with the proposed changes

I have submitted a signed Contributor License Agreement

jesschimein · 2023-12-08T22:10:17Z

Filed as internal issue #USD-9056

pxr/usd/sdf/path.cpp

gitamohr · 2023-12-19T18:14:07Z

pxr/usd/sdf/pathParser.h

+struct Utf8Identifier : PEGTL_NS::seq<
+    Utf8IdentifierStart,
+    PEGTL_NS::star<XidContinue>> {};
+


If it turns out this slows down the parser, one way we could perhaps speed it up is to make an 'Identifier' rule that tries an ASCII identifier first and only invokes the UTF-8 rule if that doesn't match (along the lines of sor<PEGTL_NS::identifier, Utf8Identifier>).

erslavin · 2023-12-20

The perf runs came back OK; the cost is really just that of decoding and character class validation.
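The ASCII-first fallback suggested in this thread might look roughly like the following when written outside of PEGTL. This is a minimal sketch, not the parser's actual rules: every function name here is hypothetical, and the placeholder classification (any code point at or above U+0080 is accepted) only stands in for the real XID_Start / XID_Continue tables.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Fast path: plain ASCII identifier, [A-Za-z_][A-Za-z0-9_]*.
inline bool IsAsciiIdentifier(std::string_view s) {
    auto isStart = [](unsigned char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    };
    if (s.empty() || !isStart(static_cast<unsigned char>(s[0]))) return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (!isStart(c) && !(c >= '0' && c <= '9')) return false;
    }
    return true;
}

// Decodes one code point starting at s[pos] and advances pos past it.
// Rejects truncated, overlong, surrogate, and out-of-range sequences.
inline bool DecodeUtf8(std::string_view s, std::size_t& pos, char32_t& cp) {
    assert(pos < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    std::size_t len;
    char32_t min;
    if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte
    if (pos + len > s.size()) return false;
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    pos += len;
    return true;
}

// ASSUMPTION: placeholders for the Unicode XID_Start / XID_Continue classes.
// Real code would consult generated Unicode tables instead.
inline bool IsStartPlaceholder(char32_t cp) {
    return cp == U'_' || (cp >= U'a' && cp <= U'z') ||
           (cp >= U'A' && cp <= U'Z') || cp >= 0x80;
}
inline bool IsContinuePlaceholder(char32_t cp) {
    return IsStartPlaceholder(cp) || (cp >= U'0' && cp <= U'9');
}

// Slow path: decode every code point and classify it.
inline bool IsUtf8Identifier(std::string_view s) {
    if (s.empty()) return false;
    std::size_t pos = 0;
    char32_t cp = 0;
    if (!DecodeUtf8(s, pos, cp) || !IsStartPlaceholder(cp)) return false;
    while (pos < s.size()) {
        if (!DecodeUtf8(s, pos, cp) || !IsContinuePlaceholder(cp)) return false;
    }
    return true;
}

// Ordered choice, analogous to sor<ascii, utf8>: the decoder only runs
// when the ASCII rule fails.
inline bool IsIdentifier(std::string_view s) {
    return IsAsciiIdentifier(s) || IsUtf8Identifier(s);
}
```

Whether the fast path pays off depends on how ASCII-heavy typical input is; per the perf runs reported in this thread, the plain UTF-8 rule was already acceptable.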

pxr/usd/sdf/pathParser.h

pxr/usd/sdf/textFileFormat.ll

pxr/usd/sdf/testenv/testSdfParsing.py

nvmkuruc · 2023-12-19T21:37:46Z

pxr/usd/sdf/textFileFormat.ll

+  */
+{UTF8NODIGU}{UTF8U}*(::{UTF8NODIGU}{UTF8U}*)+ {
+    (*yylval_param) = std::string(yytext, yyleng);
+    return TOK_CXX_NAMESPACED_IDENTIFIER;


@gitamohr In the spirit of your note about not adding UTF-8 support to mappers, should we leave CXX_NAMESPACED_IDENTIFIERs alone too?

gitamohr · 2023-12-20

I'm okay with that, I think.

nvmkuruc · 2023-12-20T18:37:41Z

pxr/usd/sdf/path.cpp

-            return !*p;
+
+        // substring must be a valid identifier
+        if (!SdfPath::IsValidIdentifier(std::string(remainder.substr(0, index)))) {


In lieu of creating a std::string, can we just make the internal _IsValidIdentifier (which SdfPath::IsValidIdentifier calls) take a string_view?
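As a rough illustration of this suggestion, the sketch below routes both a std::string entry point and a substring check through one std::string_view helper, so validating a prefix of a larger buffer needs no temporary string. The names IsValidIdentifierImpl and PrefixIsIdentifier are hypothetical, and the validation rule is reduced to ASCII for brevity; it is not the actual Sdf implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical internal helper: takes a view, so callers can pass
// substrings without allocating. ASCII-only rule for brevity.
inline bool IsValidIdentifierImpl(std::string_view name) {
    if (name.empty()) return false;
    auto isAlphaOrUnderscore = [](char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    };
    if (!isAlphaOrUnderscore(name.front())) return false;
    for (char c : name.substr(1)) {
        if (!isAlphaOrUnderscore(c) && !(c >= '0' && c <= '9')) return false;
    }
    return true;
}

// Public entry point keeps the std::string signature and forwards.
// std::string converts to std::string_view implicitly, without copying.
inline bool IsValidIdentifier(const std::string& name) {
    return IsValidIdentifierImpl(name);
}

// Caller that validates the first `index` characters of a larger buffer.
// string_view::substr only adjusts a pointer and a length.
inline bool PrefixIsIdentifier(std::string_view remainder, std::size_t index) {
    assert(index <= remainder.size());
    return IsValidIdentifierImpl(remainder.substr(0, index));
}
```

The public signature stays unchanged for existing callers, while internal code that already holds a view avoids the allocation.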

gitamohr · 2023-12-20T19:05:53Z

pxr/usd/sdf/textFileFormat.ll

+    // to make sure what we matched is actually a valid
+    // identifier because we can overmatch UTF-8 characters
+    // based on this definition
+    if (!SdfPath::IsValidIdentifier(matched)) {


Hmm -- I think we need a non-SdfPath specific function that checks if an identifier is the utf-8 style. This TOK_IDENTIFIER thing is used for other things in the parser like dictionary keys, or metadata keys, not just prim & property names. Which raises the question, are we intending to add utf-8 support for those elements too? Or would it be preferable to scope this just to prim & property names?
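The over-matching mentioned in the diff comment above can be illustrated with a toy model: a lexer-style byte class that accepts any non-ASCII byte, followed by a stricter check on the matched text. This is only a sketch under stated assumptions: the function names are hypothetical, the validator handles ASCII and two-byte sequences only, and its letter table is a tiny hand-picked stand-in (U+00C0 to U+00FF, minus the multiplication and division signs) rather than the real Unicode identifier classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Lexer-style rule: ASCII identifier bytes, or ANY byte >= 0x80.
// This deliberately over-matches, since it cannot tell letters from symbols.
inline bool LexerWouldMatch(std::string_view s) {
    if (s.empty()) return false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        const bool alpha = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
        const bool digit = (c >= '0' && c <= '9');
        if (c >= 0x80 || alpha || (digit && i > 0)) continue;
        return false;
    }
    return true;
}

// ASSUMPTION: stand-in for a real identifier character table.
// U+00D7 (multiplication sign) and U+00F7 (division sign) are symbols.
inline bool IsLetterSubset(char32_t cp) {
    return cp >= 0xC0 && cp <= 0xFF && cp != 0xD7 && cp != 0xF7;
}

// Post-match validation, run on text the lexer rule already accepted.
inline bool ValidateMatched(std::string_view s) {
    assert(LexerWouldMatch(s));  // precondition: s came from the lexer rule
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }  // ASCII bytes were vetted by the lexer
        // Only two-byte sequences are handled in this sketch.
        if ((c & 0xE0) != 0xC0 || i + 1 >= s.size()) return false;
        const unsigned char c1 = static_cast<unsigned char>(s[i + 1]);
        if ((c1 & 0xC0) != 0x80) return false;
        const char32_t cp = (static_cast<char32_t>(c & 0x1F) << 6) | (c1 & 0x3F);
        if (!IsLetterSubset(cp)) return false;
        i += 2;
    }
    return true;
}
```

Here the byte sequence for "x×" (x followed by U+00D7) passes the lexer rule but fails validation, while "café" passes both. Which validator the lexer calls for non-path tokens such as dictionary or metadata keys is exactly the scoping question raised in this comment.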

gitamohr · 2024-01-02T23:33:11Z

pxr/usd/sdf/path.h

+/// Returns whether \p name is a legal namespaced identifier.
+SDF_API
+bool SdfIsValidNamespacedIdentifier(const std::string& name);
+


Sorry I missed this before, but I think the right concept for these already exists as SdfSchema::IsValid[Namespaced]Identifier(). The current implementations of those just call through to the equivalently-named SdfPath functions so hopefully that should work for us.

erslavin force-pushed the lex_yacc_utf8_text_file_format branch 2 times, most recently from 1261a11 to 1af2f1e on December 8, 2023 15:06

erslavin force-pushed the lex_yacc_utf8_text_file_format branch from 1af2f1e to e8fefa5 on December 13, 2023 01:09

nvmkuruc mentioned this pull request Dec 15, 2023

Replace boost::optional with std::optional in SdfAllowed #2874

Merged

gitamohr reviewed Dec 19, 2023

nvmkuruc reviewed Dec 19, 2023

erslavin force-pushed the lex_yacc_utf8_text_file_format branch from e8fefa5 to 9fa1f1f on December 20, 2023 18:19

nvmkuruc reviewed Dec 20, 2023

gitamohr reviewed Dec 20, 2023

erslavin force-pushed the lex_yacc_utf8_text_file_format branch 2 times, most recently from 0103665 to 732eb1f on December 26, 2023 15:18

tylerm-nv mentioned this pull request Dec 27, 2023

UTF8 search support in usdviewq #2890

Merged

gitamohr reviewed Jan 2, 2024

Support for UTF-8 Identifiers

cf05b2c

- Added UTF-8 character class support to Path parser
- Added UTF-8 based validation rules for SDF identifier methods
- Added UTF-8 support to lex-based text file format lexer
- Added tests for UTF-8 based identifiers

erslavin force-pushed the lex_yacc_utf8_text_file_format branch from 732eb1f to cf05b2c on January 3, 2024 21:25

pixar-oss merged commit 252ac1e into PixarAnimationStudios:dev Jan 6, 2024
5 checks passed

sunyab added the usd-utf8-identifiers Issues/PRs for Unicode Identifiers in USD proposal label Jan 11, 2024
