Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a field to the string record to store a cached size #15758

Merged
merged 22 commits into from
Jun 10, 2020

Conversation

bmcdonald3
Copy link
Member

Add "cachedNumCodepoints" to string record and modify several methods to store that cached string size upon string creation.

@bmcdonald3 bmcdonald3 force-pushed the cache-string-length branch from a8b6390 to 59154ce Compare June 2, 2020 22:51
Copy link
Contributor

@e-kayrakli e-kayrakli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have few mostly-stylistic comments, otherwise this looks good.

@@ -1493,7 +1493,9 @@ VarSymbol *new_StringSymbol(const char *str) {

std::string unescapedString = unescapeString(str, cstrMove);

if (!isValidString(unescapedString)) {
int numCodepoints = 0;
int ret = isValidString(unescapedString, &numCodepoints);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is a bool function, let's make ret a bool or const bool, too.

modules/internal/String.chpl Show resolved Hide resolved
@@ -575,13 +578,14 @@ module String {
}

pragma "no doc"
proc chpl_createStringWithLiteral(x: c_string, length:int) {
proc chpl_createStringWithLiteral(x: c_string, length:int, numCodepoints: int) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not part of your work, but while touching this part of the code you can also add a space between length: and int. That seems to be the "convention" for these functions.

// NOTE: This is a "wellknown" function used by the compiler to create
// string literals. Inlining this creates some bloat in the AST, slowing the
// compilation.
return chpl_createStringWithBorrowedBufferNV(x:c_ptr(uint(8)),
length=length,
size=length+1);
size=length+1,
numCodepoints);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised that this works.

I think it is a very good practice to stick with named arguments once you start using them in a call. i.e. length and size are explicitly named, and so should be numCodepoints

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To solve this, would you say that I should have numCodepoints=cachedNumCodepoints?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, numCodepoints=numCodepoints (just like for length two lines above). First numCodepoints is the name of the formal argument to the function chpl_createStringWithBorrowedBufferNV, the other is the formal argument to the chpl_createStringWithLiteral (that becomes the actual for the NV call).

modules/internal/String.chpl Show resolved Hide resolved
@e-kayrakli
Copy link
Contributor

I forgot to mention in my comment above, but; could you also add a test where cachedNumCodepoints is -1 with this PR (you can include multiple different strings that this is the case in a single test) That would basically be a test that will fail after I do some of the follow up I had in mind, and I'll go and change their good files.

We have the concept of futures in our testing system that can allow you to create special tests that shows a bug or missing implementation. Alternatively you can do that to get some practice, but I hope to fix those soon, so it is OK if you don't do that right now.

@@ -276,6 +281,7 @@ module BytesStringCommon {
x.buff = other;
x.buffSize = size;
x.buffLen = length;
if t == string then x.cachedNumCodepoints = x.numCodepoints;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this. WithOwnedBuffer variations of factory functions can't be called with other strings but only with C buffers. When that happens, we must be calling validateEncoding in the call chain somewhere else.

}
else {
// if s is local create a copy of its buffer and own it
const (buff, allocSize) = bufferCopyLocal(other.buff, otherLen);
x.buff = buff;
x.buff[x.buffLen] = 0;
x.buffSize = allocSize;
if t == string then x.cachedNumCodepoints = other.cachedNumCodepoints;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably add this line only once outside of if somewhere

@@ -839,4 +848,13 @@ module BytesStringCommon {
}
return hash:uint;
}

inline proc setCodepoints(ref lhs: ?t1, rhs: ?t2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect types to be (ref lhs: string, rhs: string). This function is not applicable to anything else but strings and as such, we want the compiler to be able to catch any error where this was used otherwise.

Just as a side note, as it is now this would allow any two arguments of any type (not necessarily the same). If you want them to be of same type you should do ref lhs: ?t1, rhs: t1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also make two other changes:

  • Function name can be more verbose: incrementCodepoints maybe? Or you can make it addCachedNumCodepoints and make it return an int instead of changing lhs.
  • We can make this function private, right? In other words, I don't expect it to be used outside of this module, but if you are using it in String.chpl, then leave it be. (And maybe add a comment to the private follow up issue, so that we can see if we can make it private with some more effort)

@bmcdonald3
Copy link
Member Author

bmcdonald3 commented Jun 9, 2020

Test status:

  • standard
  • gasnet

Copy link
Contributor

@e-kayrakli e-kayrakli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good to go!

But I believe it either needs its commits to be cleaned up, or merged as a squash-merge.

Thanks!

@bmcdonald3 bmcdonald3 merged commit 9f7daa0 into chapel-lang:master Jun 10, 2020
@bmcdonald3 bmcdonald3 deleted the cache-string-length branch June 10, 2020 16:47
e-kayrakli added a commit that referenced this pull request Jun 22, 2020
Improve how we create string/bytes internally

This PR mainly improves how we internally create strings and to some extent
bytes.

#15758 added a `cachedNumCodepoints` field in the string record to avoid
computing that value on-demand. However, it also opened a can of worms about how
we create strings. In general, that involved default initing a string and poking
its fields. So, although the main motivation for this PR is to make
`numCodepoint` management better, it has bunch of other changes that are related
to that to different extents.


More or less in the order of significance, this PR does the following.


- Strengthen our management of `cachedNumCodepoints`
  - Add the argument to all non-validating string factory functions to force the
    caller to pre-compute the value either by counting it, or doing something
    smart about it.
  - Update quite a few helpers to try to produce this value efficiently
  - Update `getView` to return a tuple that contains number of codepoints in the
    view, if it is something that's easily computable.
  - Add 2 `countNumCodepoints` overloads to check whether `cachedNumCodepoints`
    is correct with `boundsChecking`
  - Serialize/deserialize that field
- Make `c_string` to `string` assignment use appropriate factory functions
  instead of `reinitString`.
- Remove `reinitString` from `string` and `bytes` in favor of
  `reinitWithNewBuffer` and `reinitWithOwnedBuffer` in `BytesStringCommon`
- Change the strategy of creating string and bytes in functions in
  `BytesStringCommon` and few other similar ones.

  We used to do something like:

```chapel
var ret: string;
ret.isOwned = true;
ret.someOtherField = otherValue;
ret.yetAnotherField = 0;

// ..

return ret;
```

  This was too cumbersome and felt principally wrong to me. With this PR we
  compute what we need to compute to create new string's (1) buffer, (2)
  bufferSize, (3) bufferLength and (4) numCodepoints, and then call a factory
  function. I hope this'll help reduce the overhead of adding new fields to
  the string record.


- Fix a few bugs where the helpers attempted to return a `string` if the main
  argument in question was an empty `bytes`.
- Add an optimization to `getView` for the cases where the view we are trying to
  get is actually the full string
- Remove an internal cast to c_string to enable using explicit `bufferType` as
  an arg type in `bufferMemcpyLocal`
- Update string and bytes doc to render references to types as "string" and not
  as "String"
- Add tests

[Reviewed by @mppf]

Test:
- [x] memleaks on string/bytes tests
- [x] ml-memleaks on string/bytes tests
- [x] valgrind on string/bytes tests
- [x] string performance is back where we want it to be (if not better)
- [x] full standard
- [x] full gasnet
e-kayrakli added a commit that referenced this pull request Jun 22, 2020
Add performance annotations

- String regressions

  Caused by: #15758
  Likely fix: #15870

- Trivial leak

  Caused by: #15713
  Fixed by: #15871

- AoA Forall Intent Improvement

  Caused by: #15767

- Sampler compile time regression

  Likely caused by: #15800

[Reviewed by @ronawho]
e-kayrakli added a commit that referenced this pull request Sep 13, 2021
Control numCodepoints caching with a config param

This PR avoids an expensive check in string implementation, but adds a secret
back door to re-enable it.

This is what I propose we should do instead of #18399

This code was added in #15870 to tighten up the ways in which strings are
created. When we added number-of-codepoints caching in #15758, we have seen that
we've had to change how strings are created in many places in the standard
libraries. I think 15870 put us in a good place, but I still have some
uneasiness around caching number of codepoints and relying on that cached value.

But the current state of things is not ideal, either: we just halt with
`--boundsChecking` if there's a discrepancy. With no way for the user to escape
that, it is only marginally better than letting the wrong value survive.

So, this PR adds a no-doced `config param useCachedNumCodepoints = true` in the
string implementation. This way, even with `--boundsChecking` we don't count
number of codepoints each time we need that value, but we have a way of avoiding
using the cached value if we see any issue with caching.

[Reviewed by @bradcray and @mppf]

Test:
- [x] types/string/cachedNumCodepoints
- [x] full standard
- [x] full gasnet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants