-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Utf8String class to directly hold utf8 strings passed via JSRT #5348
base: master
Are you sure you want to change the base?
Conversation
Is this the same as the previous PR, or slightly reduced? |
This is the same as the previous PR except I split out the |
d5ac264
to
f22a24b
Compare
@@ -3729,26 +3729,40 @@ JsErrorCode GetScriptBufferDetails( | |||
} | |||
const bool isUtf8 = !isString && !(parseAttributes & JsParseScriptAttributeArrayBufferIsUtf16Encoded); | |||
|
|||
*script = isExternalArray ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@boingoing I've refactored this method and added a use in CompileRun
. As part of that I noticed that this method previously didn't cater for the case of an ExternalArrayBuffer
with utf16 encoding; my change supports that, but I'm not sure if that was intentional or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that case was an oversight. Thanks for cleaning this up, looks good.
Also refactoring JSRT script handling behavior to detect Utf8Strings and pass them through without conversion to functions which can already deal with utf8 sources. Ensuring that OOM handling is present when calling GetSz which may allocate
5e6374f
to
7715a6a
Compare
|
||
private: | ||
|
||
void SetUtf8Buffer(_In_reads_(utf8Length) char* buffer, size_t utf8Length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I very much prefer reading the code with it all in the header like this, but I think we're meant to put most function definitions in cpp files unless they're templates
FieldNoBarrier(size_t) length; | ||
Field(char*) buffer; | ||
} PrefixedUtf8String; | ||
Field(PrefixedUtf8String*) utf8String; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If PrefixedUtf8String is only two pointers big, could we just put it directly in the object rather than requiring a separate allocation (and a separate pointer dereference every time we use it)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the original motivation of this was to allow other kinds of strings to convert themselves into a Utf8String
, and when I was doing that one of the candidate string types only had 1 pointer's worth of space available, which is why I went down this path (see the ConvertString
method which implicitly assumes that there is space for the one pointer available). However right now that's not happening, so I could undo that for now and re-do it later on if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What had only one pointer's worth of space? All allocations are bumped up to the nearest 16 bytes anyway, so a 40-byte object actually gets allocated as 48
In reply to: 205622702 [](ancestors = 205622702)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it was specifically on 32bit one of the types was otherwise full... I don't recall which at the moment though.
{ | ||
private: | ||
typedef struct { | ||
FieldNoBarrier(size_t) length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why size_t not charcount_t?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the length of the utf8 string, which can be up to 3 times as long as the same string in utf16. Since a utf16 string can be up to ~2^31 in chakra, the corresponding utf8 string may be more than 2^32 in length, so utf8 lengths need to be size_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return this->UnsafeGetBuffer(); | ||
} | ||
|
||
// TODO: This is currently wrong in the presence of unmatched surrogate pairs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it might be nice to move this comment down to right before DecodeUnitsIntoAndNullTerminateNoAdvance, because in this position I originally though it was talking about the allocation size being wrong (which would be Very Bad).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's also not the 'correct' place for this comment; been a while since I wrote it. The actual issue is with the encoding into utf8, and determining how to decode it. As things stand I'm using 'actual' utf8 which doesn't allow lone surrogate halves, but we could instead use 'cesu-8' which does allow that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know canonical utf-8 doesn’t encode surrogate halves at all (paired or otherwise)—the character is encoded directly. If you need to preserve the surrogates independently I think cesu-8 is your only option.
|
||
Assert(decodeLength == this->GetLength()); | ||
|
||
buffer[this->GetLength()] = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to already be done by the AndNullTerminate
portion of the call above; maybe just assert that it is zero?
SetUtf8Buffer(buffer, utf8Length); | ||
|
||
this->SetLength(originalString->GetLength()); | ||
this->SetBuffer(originalString->UnsafeGetBuffer()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This worries me. Maybe originalString->GetSz()
instead?
- SubString's buffer is not guaranteed to be null-terminated
- SubString and PropertyString both rely on holding other pointers to keep their buffers alive, because the buffer itself is not recycler allocated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was that this should keep the finalized-ness of the source string: if the original string isn't finalized, it won't have a buffer so we would be setting it to nullptr (and then if someone asks for it, we generate it from the utf8 version), and otherwise it was finalized so we can remain finalized. I hadn't considered the case when a string would have an invalid buffer like a substring would.
Using getSz
would be safe, but could maybe flatten more strings than expected... but it avoids having to decode the utf8 in future anyway. I think I will go with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be fine with something that said explicitly "if finalized, GetSz, else null", but UnsafeGetBuffer is marked unsafe for a reason :)
In reply to: 205623819 [](ancestors = 205623819)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually even with GetSz the PropertyString problem persists. It returns an interior pointer to a PropertyRecord, which is kept alive by the string itself. We risk it getting collected from under us with this setup.
In reply to: 205624092 [](ancestors = 205624092,205623819)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffer ownership is also a very real problem with the conversion code: we swap out a vtable, overwrite our owning pointer with some other data, and then our data might get collected.
In reply to: 205624399 [](ancestors = 205624399,205624092,205623819)
template <typename StringType> | ||
static Utf8String * ConvertString(StringType * originalString, _In_reads_(utf8Length) char* buffer, size_t utf8Length) | ||
{ | ||
VirtualTableInfo<Utf8String>::SetVirtualTable(originalString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of expected SetVirtualTable to be a template that static-asserted that the source type was big enough for the destination type, but I don't see any such assertion. Could you please add one? Otherwise this is concerning because LiteralString is not big enough to convert successfully.
@@ -497,6 +497,8 @@ enum tagDEBUG_EVENT_INFO_TYPE | |||
#include "Library/PropertyRecordUsageCache.h" | |||
#include "Library/PropertyString.h" | |||
#include "Library/SingleCharString.h" | |||
#include "Library/Utf8String.h" | |||
#include "Library/LazyJSONString.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this is left over from when it was combined with another change. Didn't notice that before.
@@ -27,6 +27,7 @@ | |||
#include "Library/SingleCharString.h" | |||
#include "Library/SubString.h" | |||
#include "Library/BufferStringBuilder.h" | |||
#include "Library/Utf8String.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already included in Runtime.h above, can we remove here?
{ | ||
if (this->IsFinalized()) | ||
{ | ||
return this->UnsafeGetBuffer(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this was the result of ConvertString from a SubString, then this buffer might not be null terminated, violating the rules of GetSz. This part I think is not actually dangerous, because any substring should have a null character eventually in the source string it came from, but we might end up reading a lot more data than intended.
@@ -4781,9 +4795,17 @@ CHAKRA_API JsCreateString( | |||
|
|||
return ContextAPINoScriptWrapper([&](Js::ScriptContext *scriptContext, TTDRecorder& _actionEntryPopper) -> JsErrorCode { | |||
|
|||
Js::JavascriptString *stringValue = Js::LiteralStringWithPropertyStringPtr:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any data on how many strings from JsCreateString end up used as property keys? We might possibly be regressing some scenarios by changing what is essentially a heuristic here.
LoadScriptFlag_ExternalArrayBuffer); | ||
} | ||
else | ||
if (error != JsNoError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely we have some fancy macro for this
@@ -5165,14 +5163,23 @@ CHAKRA_API JsRunSerialized( | |||
{ | |||
PARAM_NOT_NULL(bufferVal); | |||
const WCHAR *url; | |||
JsErrorCode errorCode = ContextAPINoScriptWrapper_NoRecord([&](Js::ScriptContext *scriptContext) -> JsErrorCode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does our usual rule about {
on new line not apply in lambdas? Just curious, I see some of both in this file so either way can be argued to be consistent with the surrounding style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
Last I looked this had a minor perf impact, but it lays some of the groundwork which could be expanded in future to handle more cases.