Support for UTF-8 Identifiers #2848
1261a11 to 1af2f1e
Filed as internal issue #USD-9056
1af2f1e to e8fefa5
struct Utf8Identifier : PEGTL_NS::seq<
    Utf8IdentifierStart,
    PEGTL_NS::star<XidContinue>> {};
If it turns out this slows down the parser, one way we could perhaps speed it up is to make an 'Identifier' rule that tries an ASCII identifier first and only invokes the UTF-8 rule if that doesn't match (along the lines of sor<PEGTL_NS::identifier, Utf8Identifier>).
The perf runs came back OK; the cost is really just the cost of decoding and class validation.
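As a rough illustration of the fast-path idea discussed above, here is a minimal standalone sketch in plain C++ (not PEGTL, and not the code in this PR). It validates a whole string: it tries the ASCII-only rule first and only decodes UTF-8 when that fails. `IsXidStartSketch` and `IsXidContinueSketch` are hypothetical stand-ins that accept a few hand-picked ranges; a real check would consult the Unicode XID_Start / XID_Continue tables. Every function name below is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-ins for the Unicode XID_Start / XID_Continue lookups.
// Only a few illustrative ranges are accepted; this is NOT the real property.
inline bool IsXidStartSketch(char32_t c) {
    return c == U'_' || (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           (c >= 0x00C0 && c <= 0x024F && c != 0x00D7 && c != 0x00F7) ||
           (c >= 0x4E00 && c <= 0x9FFF);
}
inline bool IsXidContinueSketch(char32_t c) {
    return IsXidStartSketch(c) || (c >= U'0' && c <= U'9');
}

// ASCII fast path: [A-Za-z_][A-Za-z0-9_]*, and the whole string must match.
inline bool IsAsciiIdentifier(std::string_view s) {
    auto isStart = [](unsigned char c) {
        return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    };
    if (s.empty() || !isStart(s[0])) return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        const auto c = static_cast<unsigned char>(s[i]);
        if (!isStart(c) && !(c >= '0' && c <= '9')) return false;
    }
    return true;
}

// Decodes one code point at s[pos] and advances pos.
// Returns false on malformed, truncated, overlong, or surrogate input.
inline bool DecodeUtf8(std::string_view s, std::size_t& pos, char32_t& out) {
    assert(pos < s.size());  // caller guarantees at least one byte remains
    const auto b0 = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return false;  // stray continuation byte or invalid lead byte
    if (len > s.size() - pos) return false;  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        const auto b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (b & 0x3F);
    }
    constexpr char32_t kMinForLen[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < kMinForLen[len]) return false;  // overlong encoding
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    pos += len;
    out = cp;
    return true;
}

// Slow path: decode every code point and check its character class.
inline bool IsUtf8IdentifierSketch(std::string_view s) {
    if (s.empty()) return false;
    std::size_t pos = 0;
    char32_t cp = 0;
    if (!DecodeUtf8(s, pos, cp) || !IsXidStartSketch(cp)) return false;
    while (pos < s.size()) {
        if (!DecodeUtf8(s, pos, cp) || !IsXidContinueSketch(cp)) return false;
    }
    return true;
}

// Ordered choice: the cheap ASCII check first, the UTF-8 check only on failure.
inline bool IsIdentifierSketch(std::string_view s) {
    return IsAsciiIdentifier(s) || IsUtf8IdentifierSketch(s);
}
```

One caveat if this were expressed as a PEG ordered choice rather than a whole-string check: the ASCII alternative would succeed on the ASCII prefix of a mixed identifier, so it would presumably need a lookahead guard (for example a not_at on the continue class) before it could safely short-circuit the UTF-8 rule.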
pxr/usd/sdf/textFileFormat.ll
Outdated
*/
{UTF8NODIGU}{UTF8U}*(::{UTF8NODIGU}{UTF8U}*)+ {
    (*yylval_param) = std::string(yytext, yyleng);
    return TOK_CXX_NAMESPACED_IDENTIFIER;
@gitamohr In the spirit of your note about not adding UTF-8 support to mappers, should we leave CXX_NAMESPACED_IDENTIFIERs alone too?
I'm okay with that, I think.
e8fefa5 to 9fa1f1f
pxr/usd/sdf/path.cpp
Outdated
return !*p;

// substring must be a valid identifier
if (!SdfPath::IsValidIdentifier(std::string(remainder.substr(0, index)))) {
In lieu of creating a std::string, can we just make the internal _IsValidIdentifier (which SdfPath::IsValidIdentifier calls) take a string_view?
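To make the suggestion concrete, here is a minimal sketch of the shape it might take, assuming C++17 `std::string_view`. The names `IsValidIdentifierImpl`, `IsValidIdentifierSketch`, and `FirstComponentIsValid` are hypothetical, and the ASCII-only rule inside the helper is just a placeholder for whatever validation the real internal function performs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

namespace {

// Hypothetical internal helper. Taking a string_view lets callers pass a
// slice of a larger buffer without allocating a temporary std::string.
// The ASCII-only rule here is a placeholder for the real validation.
bool IsValidIdentifierImpl(std::string_view name) {
    auto isStart = [](unsigned char c) {
        return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    };
    if (name.empty() || !isStart(name[0])) return false;
    for (std::size_t i = 1; i < name.size(); ++i) {
        const auto c = static_cast<unsigned char>(name[i]);
        if (!isStart(c) && !(c >= '0' && c <= '9')) return false;
    }
    return true;
}

}  // namespace

// The public entry point keeps its std::string signature; std::string
// converts to string_view implicitly, so existing callers are unaffected.
bool IsValidIdentifierSketch(const std::string& name) {
    return IsValidIdentifierImpl(name);
}

// A caller that only has a view can validate the part before the first ':'
// directly. substr on a string_view returns another view, with no copy.
bool FirstComponentIsValid(std::string_view remainder) {
    const std::size_t index = remainder.find(':');
    return IsValidIdentifierImpl(remainder.substr(0, index));
}
```

The point of the design is that the allocation disappears from the hot path while the public signature stays source-compatible.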
pxr/usd/sdf/textFileFormat.ll
Outdated
// to make sure what we matched is actually a valid
// identifier because we can overmatch UTF-8 characters
// based on this definition
if (!SdfPath::IsValidIdentifier(matched)) {
Hmm -- I think we need a non-SdfPath-specific function that checks whether an identifier is a valid UTF-8-style identifier. TOK_IDENTIFIER is used for other things in the parser, like dictionary keys and metadata keys, not just prim and property names. Which raises the question: are we intending to add UTF-8 support for those elements too? Or would it be preferable to scope this to just prim and property names?
0103665 to 732eb1f
pxr/usd/sdf/path.h
Outdated
/// Returns whether \p name is a legal namespaced identifier.
SDF_API
bool SdfIsValidNamespacedIdentifier(const std::string& name);
Sorry I missed this before, but I think the right concept for these already exists as SdfSchema::IsValid[Namespaced]Identifier(). The current implementations of those just call through to the equivalently-named SdfPath functions, so hopefully that should work for us.
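For readers unfamiliar with the call-through arrangement described here, the following is a minimal sketch of the pattern, not the actual USD code. It assumes a namespaced identifier is a ':'-separated sequence of plain identifiers; `sketch::SchemaSketch` and the free functions are hypothetical names, and the ASCII-only identifier rule is a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

namespace sketch {

// Placeholder identifier rule (ASCII only) standing in for the real check.
inline bool IsValidIdentifier(std::string_view name) {
    auto isStart = [](unsigned char c) {
        return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    };
    if (name.empty() || !isStart(name[0])) return false;
    for (std::size_t i = 1; i < name.size(); ++i) {
        const auto c = static_cast<unsigned char>(name[i]);
        if (!isStart(c) && !(c >= '0' && c <= '9')) return false;
    }
    return true;
}

// Assumption: every ':'-separated component must be a valid identifier, so
// empty components (leading, trailing, or doubled delimiters) are rejected.
inline bool IsValidNamespacedIdentifier(std::string_view name) {
    while (true) {
        const std::size_t delim = name.find(':');
        if (!IsValidIdentifier(name.substr(0, delim))) return false;
        if (delim == std::string_view::npos) return true;
        name.remove_prefix(delim + 1);
    }
}

// Hypothetical schema-level entry points that simply forward to the
// path-level functions, mirroring the call-through described above.
struct SchemaSketch {
    static bool IsValidIdentifier(std::string_view name) {
        return sketch::IsValidIdentifier(name);
    }
    static bool IsValidNamespacedIdentifier(std::string_view name) {
        return sketch::IsValidNamespacedIdentifier(name);
    }
};

}  // namespace sketch
```

With this layout, changing the identifier rule in one place changes what both entry points accept, which is presumably why reusing the existing schema functions is attractive.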
- Added UTF-8 character class support to Path parser
- Added UTF-8 based validation rules for SDF identifier methods
- Added UTF-8 support to lex-based text file format lexer
- Added tests for UTF-8 based identifiers
732eb1f to cf05b2c
Description of Change(s)
Fixes Issue(s)