Add GDExtension function to construct StringName directly from char*
#78580
Thanks! The overall idea here looks great :-)
That's a really good question. Looking through the code, it seems as though
Maybe
I'm really not sure about this one. I think it may still be OK to invoke the destructor? I'm actually quite confused about how |
/**
 * @name string_name_new_with_utf8_chars
 * @since 4.1
This is probably actually Godot 4.2 material -- 4.1 is about to go into RC next week!
10c9fad
to
131fda4
Compare
It looks like the encoding depends on whether the I mentioned the semantics in the docs, but would appreciate a 2nd opinion:
Since the semantics vary wildly (encoding, lifetime, destruction), would it make sense to split this into two functions?
(the "with_xy_chars" naming follows the |
Regarding that, last time I ran into problems when destroying static
From UTF-32, so you mean converted from a But that's an interesting limitation... Why is the buffer not normalized to one format? |
Meant It would match with Latin-1/ASCII, but with anything in utf8 outside the ASCII range the hash won't match unless the hash function is altered If it creates the |
But doesn't Godot use that internally with Or that just happens to work because |
If you use |
Reason being: godot/core/string/string_name.cpp Lines 212 to 223 in 75f9c97
Lines 2765 to 2775 in 75f9c97
vs: Lines 2816 to 2826 in 75f9c97
|
This causes, for example, "£" to be hashed as the two bytes "0xC2 0xA3" (its UTF-8 encoding) instead of the single value "0xA3".
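To make the "£" example concrete, here is a minimal sketch. It assumes a djb2-style hash that widens each element to `uint32_t`; `toy_hash` and the `hash_pound_*` helpers are hypothetical names for illustration, not Godot's actual functions.

```cpp
#include <cstddef>
#include <cstdint>

// Toy djb2-style hash (assumption: illustrative only, not Godot's exact code).
// Every element is widened to uint32_t before being mixed in.
template <typename T>
uint32_t toy_hash(const T *p_data, size_t p_len) {
	uint32_t h = 5381;
	for (size_t i = 0; i < p_len; i++) {
		h = ((h << 5) + h) + static_cast<uint32_t>(p_data[i]);
	}
	return h;
}

// "£" (U+00A3) in three representations.
static const unsigned char POUND_UTF8[] = { 0xC2, 0xA3 }; // two UTF-8 bytes
static const unsigned char POUND_LATIN1[] = { 0xA3 }; // one Latin-1 byte
static const char32_t POUND_UTF32[] = { 0xA3 }; // one UTF-32 code point

uint32_t hash_pound_utf8() { return toy_hash(POUND_UTF8, 2); }
uint32_t hash_pound_latin1() { return toy_hash(POUND_LATIN1, 1); }
uint32_t hash_pound_utf32() { return toy_hash(POUND_UTF32, 1); }
```

With unsigned element types, the Latin-1 and UTF-32 hashes agree because both feed the single value 0xA3 into the hash. The raw UTF-8 bytes feed two values, so that hash differs.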
So getting back to this, is it even a good idea to expose the UTF-8 based construction to GDExtension? You can be sure that users will run into this trap sooner or later... Should we maybe just start by exposing the static ASCII constructor? Or allow UTF-8 construction and convert to UTF-32 in the presence of non-ASCII chars (and leaving it otherwise)? Might still be fast enough for the average case where |
I'd say to err on the side of caution and do the conversion, arguably unconditionally, unless the performance benefits of having the shorter encoding are worth it. So either:
The difference here for ASCII/Latin-1 is that if it is encoded in Latin-1 it'll work, but if it's Latin-1 encoded as UTF-8 it'll break. While it won't cause problems with built-in |
Going always through Option (1) could still provide an string_name_new_with_latin1_chars(dest, c_string, is_static); It would then be the caller's responsibility to make sure that Later (or now?), we could then provide a constructor from UTF-8 that would unconditionally convert, just like the |
Sounds good! |
131fda4
to
2dcb5f8
Compare
I'd personally vote for option nr 1
I'd vote for later. Given that it's already possible to convert a (UPDATE: Although, if you really want to address it now, I'd be fine with that too :-)) |
I think this is where things break: while Maybe the hash functions should be aligned? If yes, that would become blocking. Or we only support ASCII and validate it. But in these constructor functions, there is currently no way to report errors. We could add a
I already implemented it in the draft 🙂 still needs some testing though. |
Hm, looking at the code, I guess I still don't see why the hash is different for Latin-1 vs UTF-32. The example of "£" is 0xA3 in Latin-1 and 0x000000A3 in UTF-32. I had thought this worked for all the Latin-1 characters? But I'm not an expert on this, so I could be wrong. Looking at the hash function (for Latin-1 data):
It's casting each character to a Of course, if you passed UTF-8 data into
However, if we operate under the assumption that I'm wrong about the hashing above - which very well may be the case :-) - and we can only correctly handle ASCII, I think it'd be fine documenting that it only takes ASCII and naming it accordingly (ie. The binding will have more knowledge about the data coming from the developer, and may already have validated that the data is ASCII (perhaps the language even only supports ASCII?) and adding extra validation in the GDExtension interface is unnecessary. There's already definitely parts of the GDExtension interface that expect you to pass correct data in with no validation, so this wouldn't be anything new. |
To clarify what I was saying: Latin-1, as in the encoding, should work correctly. My only point of concern is that I don't know what encodings are used in source, since C++ largely uses UTF-8 AFAIK. Assuming the data is encoded as Latin-1, it should be fully safe. |
Ah, ok, thanks for clarifying! The encoding used by the end-developer is going to depend on their specific programming language, so, in my opinion, it should be up to the language binding to ensure that it's passing data with the right encoding to GDExtension. |
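As a rough illustration of the binding-side validation mentioned above, a language binding could check that a buffer is pure ASCII before handing it to a constructor that expects ASCII/Latin-1. `is_ascii` is a hypothetical helper, not part of the GDExtension interface.

```cpp
#include <cstddef>

// Hypothetical binding-side helper: returns true if every byte is 7-bit ASCII.
// Bytes are compared as unsigned char so the check does not depend on whether
// plain char is signed on the target platform.
bool is_ascii(const char *p_data, size_t p_len) {
	for (size_t i = 0; i < p_len; i++) {
		if (static_cast<unsigned char>(p_data[i]) > 0x7F) {
			return false;
		}
	}
	return true;
}
```

A binding that already knows its strings are ASCII (for example, identifiers generated at build time) could skip this check entirely, which is the argument for not validating inside the interface itself.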
2dcb5f8
to
83ff151
Compare
Discussed at the GDExtension meeting, and we think this API looks great! The only remaining question is if there is still an encoding issue. If not, this is probably good to move forward. If so, limiting the API to UTF-8 and ASCII (not Latin-1) would also be fine. |
Found the reason for the discrepancy between the hashes:
How should we proceed?
I'll try with interpreting |
83ff151
to
13fb823
Compare
Thanks for the review, will change parameter names. Added a 2nd commit that would modify the We also need to see how we would do it for |
13fb823
to
1fe778b
Compare
So far, an indirection via String was necessary, causing at least 2 allocations and copies (String; String inside StringName). Since StringNames often refer to string literals, this allows them to be directly constructed from C strings. There are two formats: Latin-1 and UTF-8. The Latin-1 constructor also provides the `p_is_static` flag: when the source has static storage duration, no copy/allocation will be needed. However, the extension developer needs to uphold this lifetime guarantee.
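The lifetime contract behind the static flag can be sketched with a toy type. Everything here (`ToyName`, `toy_name_new_with_latin1_chars`) is a hypothetical stand-in that only mirrors the `(dest, c_string, is_static)` shape discussed in this thread; it is not Godot's implementation.

```cpp
#include <string>

// Hypothetical stand-in for a StringName-like handle; not Godot's type.
struct ToyName {
	const char *borrowed = nullptr; // set when the caller promised static storage
	std::string owned; // filled when a copy was required

	const char *c_str() const { return borrowed ? borrowed : owned.c_str(); }
};

// Hypothetical constructor: borrows the pointer when p_is_static is true,
// otherwise copies the contents.
void toy_name_new_with_latin1_chars(ToyName *r_dest, const char *p_contents, bool p_is_static) {
	if (p_is_static) {
		// No allocation: the pointer must stay valid for as long as r_dest is used.
		r_dest->borrowed = p_contents;
		r_dest->owned.clear();
	} else {
		// Copy: the caller may release its buffer right after this call.
		r_dest->borrowed = nullptr;
		r_dest->owned = p_contents;
	}
}
```

Passing a string literal with the static flag set is safe because literals have static storage duration. Passing a temporary buffer with the flag set would leave a dangling pointer, which is the guarantee the extension developer has to uphold.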
1fe778b
to
f284d55
Compare
Added a UTF-8 constructor that takes length, in addition to the one relying on null-termination. Should greatly enhance interoperability with other string types. (I can also remove the one expecting null-terminated Furthermore, I noticed the docs are imprecise -- the What's left is mostly the question whether the hash for |
Aha, great catch! I wouldn't have thought to consider the signedness of character type.
Yes, this is important, I don't want to have to map more hashes :-)
Sounds good to me!
Could we do something with
|
f284d55
to
2608410
Compare
Great idea! I actually ended up using
CI would report incompatibility if that happened, correct? |
This code looks great to me! However, we'll have to see how the core and production teams feel about using
Yes, it would, and the tests have passed, so I think we're good :-) |
Since char/wchar_t can be either signed or unsigned, their conversion to uint32_t leads to different values depending on platform. In particular, the same string represented as char* (Latin-1; StringName direct construction) or uint32_t (UTF-8; constructed via String) previously resulted in different hashes.
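The signedness effect can be reproduced in isolation. This is a minimal sketch with hypothetical helper names; it shows one way to normalize the value (going through `unsigned char` first) and is not necessarily what this PR ended up doing.

```cpp
#include <cstdint>

// Direct conversion: a negative signed char is sign-extended, so the byte
// 0xA3 (value -93 as a signed char) becomes 0xFFFFFFA3.
uint32_t widen_direct(signed char p_c) {
	return static_cast<uint32_t>(p_c);
}

// Conversion through unsigned char first: the same byte becomes 0x000000A3,
// which matches the UTF-32 code point for Latin-1 characters.
uint32_t widen_via_unsigned(signed char p_c) {
	return static_cast<uint32_t>(static_cast<unsigned char>(p_c));
}
```

On a platform where plain `char` is signed, a hash that widens characters directly sees 0xFFFFFFA3 for "£" in a Latin-1 buffer, while a UTF-32 buffer yields 0xA3, hence the mismatch for characters above 0x7F.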
… vs. character count)
2608410
to
c770937
Compare
I see! I changed it to |
Thanks! |
So far, an indirection via String was necessary, causing at least 2 allocations and copies (String; String inside StringName). Since StringNames often refer to string literals, this allows them to be directly constructed from C strings.
It also provides the `is_static` flag: when the source has static storage duration, no copy/allocation will be needed. However, the extension developer needs to uphold this lifetime guarantee.

Some things need to be discussed:
String
constructor function, feel free to suggest a better one (especially if 1 doesn't hold).