Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Utf8String class to directly hold utf8 strings passed via JSRT #5348

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

MSLaguana
Copy link
Contributor

Last I looked this had a minor perf impact, but it lays some of the groundwork which could be expanded in future to handle more cases.

@dilijev
Copy link
Contributor

dilijev commented Jun 22, 2018

Is this the same as the previous PR, or slightly reduced?

@MSLaguana
Copy link
Contributor Author

This is the same as the previous PR except I split out the JSONString changes to another PR. Looks like I'll need to fix up a couple of test failures too

@@ -3729,26 +3729,40 @@ JsErrorCode GetScriptBufferDetails(
}
const bool isUtf8 = !isString && !(parseAttributes & JsParseScriptAttributeArrayBufferIsUtf16Encoded);

*script = isExternalArray ?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@boingoing I've refactored this method and added a use in CompileRun. As part of that I noticed that this method previously didn't cater for the case of an ExternalArrayBuffer with utf16 encoding; my change supports that, but I'm not sure if that was intentional or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that case was an oversight. Thanks for cleaning this up, looks good.

Also refactoring JSRT script handling behavior to detect Utf8Strings and pass
them through without conversion to functions which can already deal with utf8 sources.

Ensuring that OOM handling is present when calling GetSz which may allocate

private:

void SetUtf8Buffer(_In_reads_(utf8Length) char* buffer, size_t utf8Length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much prefer reading the code with it all in the header like this, but I think we're meant to put most function definitions in cpp files unless they're templates

FieldNoBarrier(size_t) length;
Field(char*) buffer;
} PrefixedUtf8String;
Field(PrefixedUtf8String*) utf8String;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If PrefixedUtf8String is only two pointers big, could we just put it directly in the object rather than requiring a separate allocation (and a separate pointer dereference every time we use it)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the original motivation of this was to allow other kinds of strings to convert themselves into a Utf8String, and when I was doing that one of the candidate string types only had 1 pointer's worth of space available, which is why I went down this path (see the ConvertString method which implicitly assumes that there is space for the one pointer available). However right now that's not happening, so I could undo that for now and re-do it later on if necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What had only one pointer's worth of space? All allocations are bumped up to the nearest 16 bytes anyway, so a 40-byte object actually gets allocated as 48


In reply to: 205622702 [](ancestors = 205622702)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it was specifically on 32bit one of the types was otherwise full... I don't recall which at the moment though.

{
private:
typedef struct {
FieldNoBarrier(size_t) length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why size_t not charcount_t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the length of the utf8 string, which can be up to 3 times as long as the same string in utf16. Since a utf16 string can be up to ~2^31 in chakra, the corresponding utf8 string may be more than 2^32 in length, so utf8 lengths need to be size_t

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, thanks


In reply to: 205622942 [](ancestors = 205622942)

return this->UnsafeGetBuffer();
}

// TODO: This is currently wrong in the presence of unmatched surrogate pairs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it might be nice to move this comment down to right before DecodeUnitsIntoAndNullTerminateNoAdvance, because in this position I originally though it was talking about the allocation size being wrong (which would be Very Bad).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's also not the 'correct' place for this comment; been a while since I wrote it. The actual issue is with the encoding into utf8, and determining how to decode it. As things stand I'm using 'actual' utf8 which doesn't allow lone surrogate halves, but we could instead use 'cesu-8' which does allow that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know canonical utf-8 doesn’t encode surrogate halves at all (paired or otherwise)—the character is encoded directly. If you need to preserve the surrogates independently I think cesu-8 is your only option.


Assert(decodeLength == this->GetLength());

buffer[this->GetLength()] = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to already be done by the AndNullTerminate portion of the call above; maybe just assert that it is zero?

SetUtf8Buffer(buffer, utf8Length);

this->SetLength(originalString->GetLength());
this->SetBuffer(originalString->UnsafeGetBuffer());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worries me. Maybe originalString->GetSz() instead?

  • SubString's buffer is not guaranteed to be null-terminated
  • SubString and PropertyString both rely on holding other pointers to keep their buffers alive, because the buffer itself is not recycler allocated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that this should keep the finalized-ness of the source string: if the original string isn't finalized, it won't have a buffer so we would be setting it to nullptr (and then if someone asks for it, we generate it from the utf8 version), and otherwise it was finalized so we can remain finalized. I hadn't considered the case when a string would have an invalid buffer like a substring would.

Using getSz would be safe, but could maybe flatten more strings than expected... but it avoids having to decode the utf8 in future anyway. I think I will go with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine with something that said explicitly "if finalized, GetSz, else null", but UnsafeGetBuffer is marked unsafe for a reason :)


In reply to: 205623819 [](ancestors = 205623819)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually even with GetSz the PropertyString problem persists. It returns an interior pointer to a PropertyRecord, which is kept alive by the string itself. We risk it getting collected from under us with this setup.


In reply to: 205624092 [](ancestors = 205624092,205623819)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffer ownership is also a very real problem with the conversion code: we swap out a vtable, overwrite our owning pointer with some other data, and then our data might get collected.


In reply to: 205624399 [](ancestors = 205624399,205624092,205623819)

template <typename StringType>
static Utf8String * ConvertString(StringType * originalString, _In_reads_(utf8Length) char* buffer, size_t utf8Length)
{
VirtualTableInfo<Utf8String>::SetVirtualTable(originalString);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of expected SetVirtualTable to be a template that static-asserted that the source type was big enough for the destination type, but I don't see any such assertion. Could you please add one? Otherwise this is concerning because LiteralString is not big enough to convert successfully.

@@ -497,6 +497,8 @@ enum tagDEBUG_EVENT_INFO_TYPE
#include "Library/PropertyRecordUsageCache.h"
#include "Library/PropertyString.h"
#include "Library/SingleCharString.h"
#include "Library/Utf8String.h"
#include "Library/LazyJSONString.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is left over from when it was combined with another change. Didn't notice that before.

@@ -27,6 +27,7 @@
#include "Library/SingleCharString.h"
#include "Library/SubString.h"
#include "Library/BufferStringBuilder.h"
#include "Library/Utf8String.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already included in Runtime.h above, can we remove here?

{
if (this->IsFinalized())
{
return this->UnsafeGetBuffer();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this was the result of ConvertString from a SubString, then this buffer might not be null terminated, violating the rules of GetSz. This part I think is not actually dangerous, because any substring should have a null character eventually in the source string it came from, but we might end up reading a lot more data than intended.

@@ -4781,9 +4795,17 @@ CHAKRA_API JsCreateString(

return ContextAPINoScriptWrapper([&](Js::ScriptContext *scriptContext, TTDRecorder& _actionEntryPopper) -> JsErrorCode {

Js::JavascriptString *stringValue = Js::LiteralStringWithPropertyStringPtr::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any data on how many strings from JsCreateString end up used as property keys? We might possibly be regressing some scenarios by changing what is essentially a heuristic here.

LoadScriptFlag_ExternalArrayBuffer);
}
else
if (error != JsNoError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely we have some fancy macro for this

@@ -5165,14 +5163,23 @@ CHAKRA_API JsRunSerialized(
{
PARAM_NOT_NULL(bufferVal);
const WCHAR *url;
JsErrorCode errorCode = ContextAPINoScriptWrapper_NoRecord([&](Js::ScriptContext *scriptContext) -> JsErrorCode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does our usual rule about { on new line not apply in lambdas? Just curious, I see some of both in this file so either way can be argued to be consistent with the surrounding style.

Copy link
Contributor

@sethbrenith sethbrenith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

@rhuanjl rhuanjl mentioned this pull request Jan 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants