Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for UTF-8 Identifiers #2848

Conversation

erslavin
Copy link
Contributor

@erslavin erslavin commented Dec 7, 2023

Description of Change(s)

  • Added UTF-8 character class support to Path parser
  • Added UTF-8 based validation rules for SDF identifier methods
  • Added UTF-8 support to lex-based text file format lexer
  • Added tests for UTF-8 based identifiers

Fixes Issue(s)

  • I have verified that all unit tests pass with the proposed changes
  • I have submitted a signed Contributor License Agreement

@erslavin erslavin force-pushed the lex_yacc_utf8_text_file_format branch 2 times, most recently from 1261a11 to 1af2f1e Compare December 8, 2023 15:06
@jesschimein
Copy link
Contributor

Filed as internal issue #USD-9056

pxr/usd/sdf/path.cpp Outdated Show resolved Hide resolved
pxr/usd/sdf/path.cpp Outdated Show resolved Hide resolved
struct Utf8Identifier : PEGTL_NS::seq<
Utf8IdentifierStart,
PEGTL_NS::star<XidContinue>> {};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it turns out this slows down the parser, one way we could speed it up perhaps is to make an 'Identifier' rule that first tries an ASCII ident first and only invokes the UTF-8 rule if that doesn't match (along the lines of sor<PEGTL_NS::identifier, Utf8Identifier>),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The perf runs came back ok, the cost is really just the cost of decoding and class validation

pxr/usd/sdf/pathParser.h Outdated Show resolved Hide resolved
pxr/usd/sdf/textFileFormat.ll Show resolved Hide resolved
pxr/usd/sdf/testenv/testSdfParsing.py Show resolved Hide resolved
*/
{UTF8NODIGU}{UTF8U}*(::{UTF8NODIGU}{UTF8U}*)+ {
(*yylval_param) = std::string(yytext, yyleng);
return TOK_CXX_NAMESPACED_IDENTIFIER;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gitamohr In the spirit of your note about not adding UTF-8 support to mappers should, we leave CXX_NAMESPACED_IDENTIFIERs alone too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with that, I think.

@erslavin erslavin force-pushed the lex_yacc_utf8_text_file_format branch from e8fefa5 to 9fa1f1f Compare December 20, 2023 18:19
return !*p;

// substring must be a valid identifier
if (!SdfPath::IsValidIdentifier(std::string(remainder.substr(0, index)))) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In lieu of creating a std::string, can we just make the internal _IsValidIdentifier (which SdfPath::IsValidIdentifier calls) take a string_view?

// to make sure what we matched is actually a valid
// identifier because we can overmatch UTF-8 characters
// based on this definition
if (!SdfPath::IsValidIdentifier(matched)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm -- I think we need a non-SdfPath specific function that checks if an identifier is the utf-8 style. This TOK_IDENTIFIER thing is used for other things in the parser like dictionary keys, or metadata keys, not just prim & property names. Which raises the question, are we intending to add utf-8 support for those elements too? Or would it be preferable to scope this just to prim & property names?

@erslavin erslavin force-pushed the lex_yacc_utf8_text_file_format branch 2 times, most recently from 0103665 to 732eb1f Compare December 26, 2023 15:18
@tylerm-nv tylerm-nv mentioned this pull request Dec 27, 2023
2 tasks
/// Returns whether \p name is a legal namespaced identifier.
SDF_API
bool SdfIsValidNamespacedIdentifier(const std::string& name);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this before, but I think the right concept for these already exists as SdfSchema::IsValid[Namespaced]Identifier(). The current implementations of those just call through to the equivalently-named SdfPath functions so hopefully that should work for us.

- Added UTF-8 character class support to Path parser
- Added UTF-8 based validation rules for SDF identifier methods
- Added UTF-8 support to lex-based text file format lexer
- Added tests for UTF-8 based identifiers
@erslavin erslavin force-pushed the lex_yacc_utf8_text_file_format branch from 732eb1f to cf05b2c Compare January 3, 2024 21:25
@pixar-oss pixar-oss merged commit 252ac1e into PixarAnimationStudios:dev Jan 6, 2024
5 checks passed
@sunyab sunyab added the usd-utf8-identifiers Issues/PRs for Unicode Identifiers in USD proposal label Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
usd-utf8-identifiers Issues/PRs for Unicode Identifiers in USD proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants