-
Notifications
You must be signed in to change notification settings - Fork 424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a field to the string record to store a cached size #15758
Add a field to the string record to store a cached size #15758
Conversation
a8b6390
to
59154ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have few mostly-stylistic comments, otherwise this looks good.
compiler/AST/symbol.cpp
Outdated
@@ -1493,7 +1493,9 @@ VarSymbol *new_StringSymbol(const char *str) { | |||
|
|||
std::string unescapedString = unescapeString(str, cstrMove); | |||
|
|||
if (!isValidString(unescapedString)) { | |||
int numCodepoints = 0; | |||
int ret = isValidString(unescapedString, &numCodepoints); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function is a bool
function, let's make ret
a bool
or const bool
, too.
modules/internal/String.chpl
Outdated
@@ -575,13 +578,14 @@ module String { | |||
} | |||
|
|||
pragma "no doc" | |||
proc chpl_createStringWithLiteral(x: c_string, length:int) { | |||
proc chpl_createStringWithLiteral(x: c_string, length:int, numCodepoints: int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not part of your work, but while touching this part of the code you can also add a space between length:
and int
. That seems to be the "convention" for these functions.
modules/internal/String.chpl
Outdated
// NOTE: This is a "wellknown" function used by the compiler to create | ||
// string literals. Inlining this creates some bloat in the AST, slowing the | ||
// compilation. | ||
return chpl_createStringWithBorrowedBufferNV(x:c_ptr(uint(8)), | ||
length=length, | ||
size=length+1); | ||
size=length+1, | ||
numCodepoints); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am surprised that this works.
I think it is a very good practice to stick with named arguments once you start using them in a call. i.e. length
and size
are explicitly named, and so should be numCodepoints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To solve this, would you say that I should have numCodepoints=cachedNumCodepoints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, numCodepoints=numCodepoints
(just like for length
two lines above). First numCodepoints
is the name of the formal argument to the function chpl_createStringWithBorrowedBufferNV
, the other is the formal argument to the chpl_createStringWithLiteral
(that becomes the actual for the NV
call).
I forgot to mention in my comment above, but; could you also add a test where We have the concept of futures in our testing system that can allow you to create special tests that shows a bug or missing implementation. Alternatively you can do that to get some practice, but I hope to fix those soon, so it is OK if you don't do that right now. |
@@ -276,6 +281,7 @@ module BytesStringCommon { | |||
x.buff = other; | |||
x.buffSize = size; | |||
x.buffLen = length; | |||
if t == string then x.cachedNumCodepoints = x.numCodepoints; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this. WithOwnedBuffer
variations of factory functions can't be called with other string
s but only with C buffers. When that happens, we must be calling validateEncoding
in the call chain somewhere else.
} | ||
else { | ||
// if s is local create a copy of its buffer and own it | ||
const (buff, allocSize) = bufferCopyLocal(other.buff, otherLen); | ||
x.buff = buff; | ||
x.buff[x.buffLen] = 0; | ||
x.buffSize = allocSize; | ||
if t == string then x.cachedNumCodepoints = other.cachedNumCodepoints; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably add this line only once outside of if
somewhere
@@ -839,4 +848,13 @@ module BytesStringCommon { | |||
} | |||
return hash:uint; | |||
} | |||
|
|||
inline proc setCodepoints(ref lhs: ?t1, rhs: ?t2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect types to be (ref lhs: string, rhs: string)
. This function is not applicable to anything else but strings and as such, we want the compiler to be able to catch any error where this was used otherwise.
Just as a side note, as it is now this would allow any two arguments of any type (not necessarily the same). If you want them to be of same type you should do ref lhs: ?t1, rhs: t1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also make two other changes:
- Function name can be more verbose:
incrementCodepoints
maybe? Or you can make itaddCachedNumCodepoints
and make it return anint
instead of changinglhs
. - We can make this function
private
, right? In other words, I don't expect it to be used outside of this module, but if you are using it inString.chpl
, then leave it be. (And maybe add a comment to the private follow up issue, so that we can see if we can make itprivate
with some more effort)
…ache-string-length
…ache-string-length
Test status:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good to go!
But I believe it either needs its commits to be cleaned up, or merged as a squash-merge.
Thanks!
Improve how we create string/bytes internally This PR mainly improves how we internally create strings and to some extent bytes. #15758 added a `cachedNumCodepoints` field in the string record to avoid computing that value on-demand. However, it also opened a can of worms about how we create strings. In general, that involved default initing a string and poking its fields. So, although the main motivation for this PR is to make `numCodepoint` management better, it has bunch of other changes that are related to that to different extents. More or less in the order of significance, this PR does the following. - Strengthen our management of `cachedNumCodepoints` - Add the argument to all non-validating string factory functions to force the caller to pre-compute the value either by counting it, or doing something smart about it. - Update quite a few helpers to try to produce this value efficiently - Update `getView` to return a tuple that contains number of codepoints in the view, if it is something that's easily computable. - Add 2 `countNumCodepoints` overloads to check whether `cachedNumCodepoints` is correct with `boundsChecking` - Serialize/deserialize that field - Make `c_string` to `string` assignment use appropriate factory functions instead of `reinitString`. - Remove `reinitString` from `string` and `bytes` in favor of `reinitWithNewBuffer` and `reinitWithOwnedBuffer` in `BytesStringCommon` - Change the strategy of creating string and bytes in functions in `BytesStringCommon` and few other similar ones. We used to do something like: ```chapel var ret: string; ret.isOwned = true; ret.someOtherField = otherValue; ret.yetAnotherField = 0; // .. return ret; ``` This was too cumbersome and felt principally wrong to me. With this PR we compute what we need to compute to create new string's (1) buffer, (2) bufferSize, (3) bufferLength and (4) numCodepoints, and then call a factory function. I hope this'll help reduce the overhead of adding new fields to the string record. - Fix a few bugs where the helpers attempted to return a `string` if the main argument in question was an empty `bytes`. - Add an optimization to `getView` for the cases where the view we are trying to get is actually the full string - Remove an internal cast to c_string to enable using explicit `bufferType` as an arg type in `bufferMemcpyLocal` - Update string and bytes doc to render references to types as "string" and not as "String" - Add tests [Reviewed by @mppf] Test: - [x] memleaks on string/bytes tests - [x] ml-memleaks on string/bytes tests - [x] valgrind on string/bytes tests - [x] string performance is back where we want it to be (if not better) - [x] full standard - [x] full gasnet
Control numCodepoints caching with a config param This PR avoids an expensive check in string implementation, but adds a secret back door to re-enable it. This is what I propose we should do instead of #18399 This code was added in #15870 to tighten up the ways in which strings are created. When we added number-of-codepoints caching in #15758, we have seen that we've had to change how strings are created in many places in the standard libraries. I think 15870 put us in a good place, but I still have some uneasiness around caching number of codepoints and relying on that cached value. But the current state of things is not ideal, either: we just halt with `--boundsChecking` if there's a discrepancy. With no way for the user to escape that, it is only marginally better than letting the wrong value survive. So, this PR adds a no-doced `config param useCachedNumCodepoints = true` in the string implementation. This way, even with `--boundsChecking` we don't count number of codepoints each time we need that value, but we have a way of avoiding using the cached value if we see any issue with caching. [Reviewed by @bradcray and @mppf] Test: - [x] types/string/cachedNumCodepoints - [x] full standard - [x] full gasnet
Add "cachedNumCodepoints" to string record and modify several methods to store that cached string size upon string creation.