Relaxes node name sanitization in gltf documents. #45545

abaire · 2021-01-28T21:21:30Z

Expands the set of allowed characters in _sanitize_scene_name.

This is what my team uses to patch around #45544 though it'd probably be better to use some global method to sanitize names consistently across Godot.

Bugsquad edit: Fixes #45544.

fire · 2021-01-28T22:40:56Z

What test cases did you use to show it works?

modules/gltf/gltf_document.cpp

abaire · 2021-01-29T02:37:10Z

Tested with a few glb's generated via Blender (one of which is attached to #45544). The more extensive one has a cube named Cube=!%^&*(){}\\[\\]|<>,~ and a torus named Torus=!%^&*(){}\\[\\]|<>,~-noimp, just verified that the cube is imported as a MeshInstance w/ the full name and the torus is not imported at all.

This was intended as a demonstration of the workaround we're using for #45544 (on the 3.2 branch) and a discussion point more than a fully tested patch. Unfortunately I don't know enough about actual limitations on node names in Godot other than what I found enforced here:

godot/scene/main/node.cpp

Line 994 in 5496174

bool Node::_validate_node_name(String &p_name) {

. I'm not sure what the intent was behind the original sanitization (which looks like it initially broke import hints and was relaxed in a followup, and I also found that preserving # breaks hints but haven't tracked down why) so I'm not sure what cases to look out for.

I'd be happy to dig/test further if I can get a pointer in the right direction.

Wrt. non-English names there should be no change in the existing behavior; anything outside of ASCII and the explicitly listed symbolic characters would still be stripped.

fire · 2021-01-30T05:30:31Z

We had some fun discussion about adding the 💩 emoji and Japanese / Korean / Chinese / etc node names. So that's undecided.

However https://github.com/KhronosGroup/glTF/tree/master/specification/2.0#json-encoding says

Implementation Note: This allows generic glTF client implementations to not have full Unicode support. Application-specific strings (e.g., values of "name" properties or content of extras fields) may use any symbols.

However, they will still need to be unique.

Edited:

This expands the list of characters and it's something is hard to discover unless you read or try it, I can understand expanding to other languages, but these list of symbols have other meanings in code, not sure about the interaction with $ get node.

Want to have some other opinions, so I'll get back to you.

abaire · 2021-01-30T06:35:51Z

This expands the list of characters and it's something is hard to discover unless you read or try it, I can understand expanding to other languages, but these list of symbols have other meanings in code, not sure about the interaction with $ get node.

Totally agree, I'm not at all sure what ill effects there could be and would definitely appreciate the guidance.

Quickly tested the get_node behavior for the symbols in this PR:

	var t1 = $"normal=!%^&*(){}[]|<>,~"
	var t2 = get_node("normal=!%^&*(){}[]|<>,~")

and they are resolved correctly.

That being said, if there are any other consequences of allowing these symbols to be part of the node names I think it'd be good to capture them in node.cpp's invalid_character list, since those seem to be the only things that can't be entered when renaming nodes in the editor.

I think adding unicode support would also be great, having this sanitization apply the superset of node.cpp and glTF limitations feels like it'd be the best user experience.
Quickly testing in the main and 3.2 branches it looks like attempting to add a unicode emoji into a node name isn't prevented, but it's also not displayed correctly in the editor (whereas adding Japanese is correctly displayed). Similar to the symbols added in this PR, both emoji and Japanese will cause $ to fail unless the name is quoted (which appears to work as expected, even with the incorrect display of emoji), though I noticed that in master having a name "文字" causes the editor to behave abnormally.

akien-mga · 2021-02-01T20:13:56Z

Some comments I put in the Godot Contributors chat:

I'm not sure why we have to whitelist specific characters instead of just letting Node::set_name() do the validation? It calls Node::_validate_node_name() which strips:
String Node::invalid_character = ". : @ / \"";
The only problem would be if a node has an empty name once validated (e.g. if the name is .:@).
Or are there additional characters forbidden in the spec?

abaire · 2021-02-02T00:01:52Z

Some comments I put in the Godot Contributors chat:
I'm not sure why we have to whitelist specific characters instead of just letting Node::set_name() do the validation? It calls Node::_validate_node_name() which strips:
String Node::invalid_character = ". : @ / \"";
The only problem would be if a node has an empty name once validated (e.g. if the name is .:@).
Or are there additional characters forbidden in the spec?

Totally agree. I'd hope that we can assume any additional forbidden chars in the glTF spec would've already been stripped/replaced/whatever'd by the time the importer is run (unless you mean the Godot node spec, in which case I'd expect set_name to strip them).

I did notice that names including # caused issues with the importer hint processing but never got around to tracking down why.

fire · 2021-02-05T17:54:23Z

Anyone opposing:

Letting Node::set_name() do the validation?

Note this should probably match skeleton validation.

abaire · 2021-02-06T15:26:28Z

Anyone opposing:

Letting Node::set_name() do the validation?

No objection from my side. We'd just have to track down and fix the behavior I observed with '#' breaking import hints. I'm happy to take a pass at that.

Note this should probably match skeleton validation.

I agree in principle that it's a better user experience to be as permissive as possible but haven't looked at the bone code myself. My understanding is that they're not Node's so I'm not sure if they need something more restrictive than the Node enforcement. That said, I think it'd be a big improvement to centralize bone name sanitization across all the import paths as well for the sake of consistency/ease of documentation.
Do you recall the reason behind the current sanitization approach for bones?

Also, is there a mechanism to test the import behavior automatically that I should leverage?

fire · 2021-02-08T03:44:28Z

See https://github.com/godotengine/godot-tests/tree/master/tests/blend_export for some automated testing. You can add new test cases.

abaire · 2021-02-13T16:56:48Z

scene/main/node.h

@@ -187,7 +187,9 @@ class Node : public Object {

 #ifdef TOOLS_ENABLED
 friend class SceneTreeEditor;
+ friend class GLTFDocument;


I'm curious as to whether _validate_node_name should just be made public instead.

Is anything else using it? Do we want other things to use it?

I don't think anything else is using it today, but I'm also not sure if there are other cases where sanitization is being performed (maybe in the other import modules?). In my mind it'd be better for consistency's sake if we'd have everything that generates a Node and needs fine control over naming would leverage _validate_node_name so any future modifications to expand/reduce the restricted characters would be applied uniformly.

Similarly since AnimationPlayer adds a couple additional restrictions, I think we would ideally have a similar public static _validate_animation_name that can perform animation-specific cleanup.

I would try hard to find at least one other use or invent a use because if nothing uses it, we don't need to expose.

I would try hard to find at least one other use or invent a use because if nothing uses it, we don't need to expose.

You mean making _validate_node_name public specifically? On the one hand it feels odd to make GLTFDocument a full friend class just to get access to one method (I assume SceneTreeEditor does more interesting Node manipulation), but if there's a general trend towards being conservative about making utility methods public that's fine as well.

Glancing at the blame I suspect it was just to facilitate the use in STE which wants to display an error message if the node name had to be sanitized: 6758b6c
If that's the case it seems like it'd be fine to have it return the string and then do a (admittedly expensive) equality check on the original and returned string in that one spot.

I'm happy to make that change if you think it's worthwhile?

[Edit] We could also have a private version that does an in-place update for use in set_name since that seems to be the main potentially performance critical usecase?

Sorry, missed the later comment regarding moving the validation to String.
My initial reaction is that approach feels like it smears Node concerns into a potentially unexpected place (naively if I was looking for node validation and saw a Node class I'd assume it was there). That said, String does have things like simplify_path that arguably would be appropriate in _File.

Poking around a bit more I see that there doesn't appear to be a way to expose a static method via ClassDB so if there's a strong desire to make this bindable, moving it to String with the signature: String String::validate_node_name() const (where it returns a copy of the string after sanitizing invalid characters) seems like it'd be the most straightforward approach, though it still feels slightly out of place to me.

@akien-mga I'll wait for direction from you/@reduz before going that path, but in the meantime I've pushed a new commit with a script friendly API in case there's a way to bind static methods that I've missed. If it's not necessary to bind this method then it'd probably make sense to revert this latest commit to avoid the performance penalty of copying the string.

Yeah the scripting bindings don't support static methods, so we have to work it around for now. That's why we tend to (ab)use the String API for some File and Node checks. If we do implement support for static methods, then it would indeed make sense to have this under Node, but for the time being the String approach is the simplest one for a binding.

Makes sense, thanks for the guidance! I've made the change (and added documentation and a basic unit test).

modules/gltf/gltf_document.cpp

abaire · 2021-02-13T17:08:03Z

Switched over to leverage Node's validation. Still a couple outstanding issues with Animation handling and the Emoji tests.

For Emoji specifically, they are not displayed correctly in Blender or in Godot and they're encoded in the reserved range, causing them to trigger the Unicode parsing error: Invalid unicode codepoint handlers in ustring.cpp. This results in the creation of nodes that display but can't easily be referenced via getNode since the emoji chars get sanitized to the � replacement character.
@fire do you think we should explicitly strip Emoji in the sanitizers or just leave it as semi-supported behavior?

fire · 2021-02-13T18:15:44Z

I'll ping @bruvzg and the rocket chat development community.

bruvzg · 2021-02-13T18:23:08Z

Unicode parsing error: Invalid unicode codepoint

If it's triggering this error, the source string is probably UTF-16 (Godot strings are UTF-32) and emojis are encoded as surrogate pairs. It should be decoded using String::utf16(const char16_t *p_utf16, int p_len) method.

bruvzg · 2021-02-13T18:26:30Z

And there is no emoji font in the editor, but the correct string should not give any errors, emoji will be displayed as it hex code:

bruvzg · 2021-02-13T18:56:50Z

If it's triggering this error, the source string is probably UTF-16

Since it's JSON, UTF-16 pair might be stored as "\u0000\u0000" UTF-8 string, we should add handling for this case to the JSON::_get_token function (right now it's reading it as is, which will cause Invalid unicode codepoint errors).

Or it might be some sort of pseudo-UTF-8 (CESU-8, WTF-8 or some other garbage), in this case it's better to strip it.

Draft of JSON::get token change (untested):

diff --git a/core/io/json.cpp b/core/io/json.cpp
index bc4527869b..0a6279f9c6 100644
--- a/core/io/json.cpp
+++ b/core/io/json.cpp
@@ -233,6 +233,51 @@ Error JSON::_get_token(const char32_t *p_str, int &index, int p_len, Token &r_to
 									res |= v;
 								}
 								index += 4; //will add at the end anyway
+								if ((res & 0xfffffc00) == 0xd800) {
+									if (p_str[index + 1] != '\\' || p_str[index + 2] != 'u') {
+										r_err_str = "Invalid UTF-16 sequence in string, unpaired lead surrogate";
+										return ERR_PARSE_ERROR;
+									}
+									index += 2;
+									char32_t trail = 0;
+									for (int j = 0; j < 4; j++) {
+										char32_t c = p_str[index + j + 1];
+										if (c == 0) {
+											r_err_str = "Unterminated String";
+											return ERR_PARSE_ERROR;
+										}
+										if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'))) {
+											r_err_str = "Malformed hex constant in string";
+											return ERR_PARSE_ERROR;
+										}
+										char32_t v;
+										if (c >= '0' && c <= '9') {
+											v = c - '0';
+										} else if (c >= 'a' && c <= 'f') {
+											v = c - 'a';
+											v += 10;
+										} else if (c >= 'A' && c <= 'F') {
+											v = c - 'A';
+											v += 10;
+										} else {
+											ERR_PRINT("Bug parsing hex constant.");
+											v = 0;
+										}
+
+										trail <<= 4;
+										trail |= v;
+									}
+									if ((trail & 0xfffffc00) == 0xdc00) {
+										res = (res << 10UL) + trail - ((0xd800 << 10UL) + 0xdc00 - 0x10000);
+										index += 4; //will add at the end anyway
+									} else {
+										r_err_str = "Invalid UTF-16 sequence in string, unpaired lead surrogate";
+										return ERR_PARSE_ERROR;
+									}
+								} else if ((res & 0xfffffc00) == 0xdc00) {
+									r_err_str = "Invalid UTF-16 sequence in string, unpaired trail surrogate";
+									return ERR_PARSE_ERROR;
+								}
 
 							} break;
 							default: {

abaire · 2021-02-13T20:38:49Z

If it's triggering this error, the source string is probably UTF-16

Since it's JSON, UTF-16 pair might be stored as "\u0000\u0000" UTF-8 string, we should add handling for this case to the JSON::_get_token function (right now it's reading it as is, which will cause Invalid unicode codepoint errors).

It is indeed a UTF-16 pair in the glTF JSON: "name" : "Basic_Emoji_\ud83d\ude00\ud83e\udd16",

@bruvzg your attached fix for _get_token fixes this case, the string is now correctly displayed with the hex codes for the two emoji characters - thanks! Is that change something you're working on in a separate PR that I should rebase off of? I can also just disable my emoji-specific test cases until the _get_token improvement is available.

fire · 2021-02-14T17:12:16Z

Are there any specific places you wanted more review? I hope that the test suite is able to give you a feedback improvement cycle.

abaire · 2021-02-14T17:38:34Z

Are there any specific places you wanted more review? I hope that the test suite is able to give you a feedback improvement cycle.

I think the test suite change is the more interesting bit to review, assuming the general approach here seems reasonable to you. I'll switch that from a WIP to an actual review; there's still a bit of work left to be done and I may need to remove the emoji tests since they won't pass until the fix that @bruvzg suggested is incorporated.

fire · 2021-02-14T17:58:41Z

Well you could add that change to this pr for the sake of easier review.

abaire · 2021-02-15T02:47:15Z

Well you could add that change to this pr for the sake of easier review.

Sorry, can you point me at how? I didn't realize you can have a PR span multiple repos (and I'm failing to find out how).

akien-mga · 2021-02-22T09:48:58Z

modules/gltf/gltf_document.cpp

+
+ // TODO: Consider adding invalid_characters or a _validate_animation_name to animation_player to mirror Node.
+ String p_name = _sanitize_scene_name(name);
+ RegEx regex("([\\[,]+)");


I'd suggest doing a plain .replace() for the invalid characters instead of using RegEx. The regex module can be disabled, and here this would trigger compilation errors if you don't guard it with #ifdef MODULE_REGEX_ENABLED (that's a pre-existing issue in this class though).

Done, thanks!
I left the regex handling in _sanitize_bone_name but can do a followup if we decide to relax sanitization there as well.

abaire · 2021-02-22T17:07:05Z

To make the link bidirectional: tests are in godotengine/godot-tests#21

akien-mga · 2021-02-23T09:58:23Z

core/string/ustring.cpp

+const String String::invalid_node_name_characters = ". : @ / \"";
+
+String String::validate_node_name() const {
+ Vector<String> chars = String::invalid_node_name_characters.split(" ");


I know this is pre-existing code but I'm not sure if there's much gain having this as a static constant which then gets split() on each call. Compilers might be able to optimize this I guess? But we could just as well have const String restricted_chars = ".:@/\""; inside the function and iterate over the characters directly.

CC @bruvzg if you have a suggestion on how best to do that (don't want to get mixed up with const char * and char32_t :)).

Ah I see invalid_node_name_characters is used in SceneTreeEditor` to show a nice error message. Makes sense then I guess.

I thought the same (I'm dubious that this would get memoized by the compiler) but my gut feel is that validate_node_name is probably not called frequently enough for the split to have a meaningful impact on performance in typical scenarios. If there are scenarios where very large numbers of nodes are getting created it'd be trivial to modify the error message case (which should be even more rare), but I wanted to keep changes limited to the original bug as much as practical.

Suggested change

Vector<String> chars = String::invalid_node_name_characters.split(" ");

// Changes made to the set of invalid characters must also be reflected in the String documentation.

const String String::invalid_node_name_characters = ".:@/\"";

String String::validate_node_name() const {

String invalid_char(&String::invalid_node_name_characters[0], 1);

String replacement("");

String name = this->replace(invalid_char, replacement);

for (int i = 1; i < String::invalid_node_name_characters.length(); ++i) {

invalid_char[0] = String::invalid_node_name_characters[i];

name = name.replace(invalid_char, replacement);

}

return name;

}

doc/classes/String.xml

akien-mga

Core/scene changes look good to me. Could you squash the commits into one?

Would still like confirmation from @fire that the glTF-specific changes are also good.

fire · 2021-02-24T12:48:52Z

Test plan: load test suite once and check for irregularities. Then I’ll approve. Getting there later today.

abaire · 2021-02-24T15:26:13Z

Squashed and repushed.
@fire as a heads up the Emoji tests are still dependent on the _get_token patch from @bruvzg which is not included in this PR (so I'd expect those 3 cases to fail). I can either disable those tests, update them to expect the unicode replacement char (matching the current behavior), or let them fail with the expectation that _get_token will be fixed in the near future.

fire

Looks good.

Order of operations is:

Submit PR / Merge _get_token changes.
_get_token change allows conversion to unknown font glyph because the current behavior of invalid text isn't the correct behavior and this is a way to avoid creating mojibake. https://en.wikipedia.org/wiki/Mojibake
Merge this 45545 pr.
Merge the tests linked.

bruvzg · 2021-03-01T14:55:04Z

Submit PR / Merge _get_token changes.

Submitted and merged.

akien-mga · 2021-03-09T13:54:36Z

Thanks! And congrats for your first merged Godot contribution 🎉

akien-mga · 2021-03-12T09:11:23Z

This doesn't seem trivial to cherry-pick for the 3.2 branch, so it will likely require a dedicated backport PR.

Calinou added bug cherrypick:3.x Considered for cherry-picking into a future 3.x release topic:import labels Jan 28, 2021

Calinou added this to the 4.0 milestone Jan 28, 2021

akien-mga requested a review from a team January 28, 2021 22:09

fire reviewed Jan 29, 2021

View reviewed changes

modules/gltf/gltf_document.cpp Outdated Show resolved Hide resolved

abaire mentioned this pull request Feb 11, 2021

Adds tests for 45545 godotengine/godot-tests#21

Merged

abaire force-pushed the relaxes_gltf_name_sanitization branch 3 times, most recently from a83d4c1 to af7c199 Compare February 13, 2021 16:41

abaire commented Feb 13, 2021

View reviewed changes

modules/gltf/gltf_document.cpp Show resolved Hide resolved

Calinou mentioned this pull request Feb 20, 2021

glTF import process strips most non-alphanumeric characters when generating Nodes #45544

Closed

akien-mga reviewed Feb 22, 2021

View reviewed changes

abaire requested a review from a team as a code owner February 22, 2021 16:59

abaire requested review from a team as code owners February 22, 2021 21:38

akien-mga reviewed Feb 23, 2021

View reviewed changes

doc/classes/String.xml Outdated Show resolved Hide resolved

akien-mga approved these changes Feb 24, 2021

View reviewed changes

abaire force-pushed the relaxes_gltf_name_sanitization branch from f526ffc to bbe46d1 Compare February 24, 2021 15:23

Relaxes Node naming constraints in glTF documents to match the Editor.

61cc1c8

abaire force-pushed the relaxes_gltf_name_sanitization branch from bbe46d1 to 61cc1c8 Compare February 24, 2021 16:23

fire approved these changes Mar 1, 2021

View reviewed changes

bruvzg mentioned this pull request Mar 1, 2021

JSON parser: add UTF-16 surrogate pairs support. #46551

Merged

akien-mga merged commit 83b1acd into godotengine:master Mar 9, 2021

abaire mentioned this pull request Mar 12, 2021

Relaxes Node naming constraints in glTF documents to match the Editor. #46939

Merged

abaire deleted the relaxes_gltf_name_sanitization branch March 12, 2021 16:29

akien-mga removed the cherrypick:3.x Considered for cherry-picking into a future 3.x release label Mar 13, 2021

Calinou mentioned this pull request Nov 27, 2021

when importing .gltf with separate meshes, naming of imported meshes is unintuitive and different than what it was in < 3.4 #54922

Closed

raulsntos mentioned this pull request Oct 7, 2022

C#: Cleanup and sync StringExtensions with core #67031

Merged

- Vector<String> chars = String::invalid_node_name_characters.split(" ");
+// Changes made to the set of invalid characters must also be reflected in the String documentation.
+const String String::invalid_node_name_characters = ".:@/\"";
+String String::validate_node_name() const {
+ String invalid_char(&String::invalid_node_name_characters[0], 1);
+ String replacement("");
+ String name = this->replace(invalid_char, replacement);
+ for (int i = 1; i < String::invalid_node_name_characters.length(); ++i) {
+ invalid_char[0] = String::invalid_node_name_characters[i];
+ name = name.replace(invalid_char, replacement);
+ }
+ return name;
+}

Relaxes node name sanitization in gltf documents. #45545

Relaxes node name sanitization in gltf documents. #45545

Conversation

abaire commented Jan 28, 2021 • edited by akien-mga Loading

fire commented Jan 28, 2021

abaire commented Jan 29, 2021

fire commented Jan 30, 2021 • edited Loading

abaire commented Jan 30, 2021

akien-mga commented Feb 1, 2021

abaire commented Feb 2, 2021

fire commented Feb 5, 2021

abaire commented Feb 6, 2021

fire commented Feb 8, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

abaire Feb 22, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

abaire commented Feb 13, 2021

fire commented Feb 13, 2021

bruvzg commented Feb 13, 2021

bruvzg commented Feb 13, 2021

bruvzg commented Feb 13, 2021 • edited Loading

abaire commented Feb 13, 2021

fire commented Feb 14, 2021

abaire commented Feb 14, 2021

fire commented Feb 14, 2021

abaire commented Feb 15, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

abaire commented Feb 22, 2021

akien-mga Feb 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

akien-mga left a comment

Choose a reason for hiding this comment

fire commented Feb 24, 2021

abaire commented Feb 24, 2021

fire left a comment

Choose a reason for hiding this comment

bruvzg commented Mar 1, 2021

akien-mga commented Mar 9, 2021 • edited Loading

akien-mga commented Mar 12, 2021

abaire commented Jan 28, 2021 •

edited by akien-mga

Loading

fire commented Jan 30, 2021 •

edited

Loading

abaire Feb 22, 2021 •

edited

Loading

bruvzg commented Feb 13, 2021 •

edited

Loading

akien-mga Feb 23, 2021 •

edited

Loading

akien-mga commented Mar 9, 2021 •

edited

Loading