Add a field to the string record to store a cached size #15758

bmcdonald3 · 2020-06-02T22:41:05Z

Add "cachedNumCodepoints" to string record and modify several methods to store that cached string size upon string creation.

e-kayrakli

I have a few mostly stylistic comments; otherwise this looks good.

e-kayrakli · 2020-06-03T16:37:59Z

compiler/AST/symbol.cpp

@@ -1493,7 +1493,9 @@ VarSymbol *new_StringSymbol(const char *str) {

  std::string unescapedString = unescapeString(str, cstrMove);

-  if (!isValidString(unescapedString)) {
+  int numCodepoints = 0;
+  int ret = isValidString(unescapedString, &numCodepoints);


The function is a bool function, let's make ret a bool or const bool, too.
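To illustrate the pattern under review, here is a minimal C++ sketch of a validity check that returns `bool` and reports the codepoint count through an out-parameter. The name `isValidUtf8` and its body are assumptions made for illustration, not the compiler's actual `isValidString`. The check is deliberately simplified: it verifies lead and continuation bytes but does not reject overlong encodings or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for isValidString (not the Chapel compiler's code).
// Returns true when `s` is structurally well-formed UTF-8 and, on success,
// stores the number of codepoints in *numCodepoints.
// Simplified: overlong encodings and surrogates are not rejected.
bool isValidUtf8(const std::string &s, int64_t *numCodepoints) {
  assert(numCodepoints != nullptr);
  int64_t count = 0;
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    if (lead < 0x80) len = 1;                 // ASCII
    else if ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
    else return false;                        // stray continuation or bad lead
    if (i + len > s.size()) return false;     // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return false;   // expected 10xxxxxx
    }
    i += len;
    ++count;
  }
  *numCodepoints = count;
  return true;
}
```

At the call site, the result is then held in a `const bool`, for example `const bool ok = isValidUtf8(str, &numCodepoints);`, which matches the function's return type.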

e-kayrakli · 2020-06-03T16:41:39Z

modules/internal/String.chpl

@@ -575,13 +578,14 @@ module String {
  }

  pragma "no doc"
-  proc chpl_createStringWithLiteral(x: c_string, length:int) {
+  proc chpl_createStringWithLiteral(x: c_string, length:int, numCodepoints: int) {


It's not part of your work, but while touching this part of the code you can also add a space between length: and int. That seems to be the "convention" for these functions.

e-kayrakli · 2020-06-03T16:48:38Z

modules/internal/String.chpl

    // NOTE: This is a "wellknown" function used by the compiler to create
    // string literals. Inlining this creates some bloat in the AST, slowing the
    // compilation.
    return chpl_createStringWithBorrowedBufferNV(x:c_ptr(uint(8)),
                                                 length=length,
-                                                 size=length+1);
+                                                 size=length+1,
+                                                 numCodepoints);


I am surprised that this works.

I think it is very good practice to stick with named arguments once you start using them in a call; i.e., length and size are explicitly named, so numCodepoints should be too.

bmcdonald3 · Jun 4, 2020

To solve this, would you say that I should have numCodepoints=cachedNumCodepoints?

e-kayrakli · Jun 4, 2020

Nope, numCodepoints=numCodepoints (just like for length two lines above). The first numCodepoints is the name of the formal argument of chpl_createStringWithBorrowedBufferNV; the second is the formal argument of chpl_createStringWithLiteral, which becomes the actual for the NV call.

e-kayrakli · 2020-06-03T17:04:47Z

I forgot to mention this in my comment above: could you also add a test with this PR where cachedNumCodepoints is -1? (You can cover several different strings for which this is the case in a single test.) That would basically be a test that will fail after I do some of the follow-up I have in mind, and I'll go and change its good files then.

We have the concept of futures in our testing system, which lets you create special tests that show a bug or a missing implementation. Alternatively, you can write these as futures to get some practice, but I hope to fix those cases soon, so it is OK if you don't do that right now.

e-kayrakli · 2020-06-05T00:04:27Z

modules/internal/BytesStringCommon.chpl

@@ -276,6 +281,7 @@ module BytesStringCommon {
    x.buff = other;
    x.buffSize = size;
    x.buffLen = length;
+    if t == string then x.cachedNumCodepoints = x.numCodepoints;


I don't think we need this. WithOwnedBuffer variations of factory functions can't be called with other strings but only with C buffers. When that happens, we must be calling validateEncoding in the call chain somewhere else.

e-kayrakli · 2020-06-05T00:04:49Z

modules/internal/BytesStringCommon.chpl

      }
      else {
        // if s is local create a copy of its buffer and own it
        const (buff, allocSize) = bufferCopyLocal(other.buff, otherLen);
        x.buff = buff;
        x.buff[x.buffLen] = 0;
        x.buffSize = allocSize;
+        if t == string then x.cachedNumCodepoints = other.cachedNumCodepoints;


You can probably add this line only once, somewhere outside the if.

e-kayrakli · 2020-06-05T00:07:55Z

modules/internal/BytesStringCommon.chpl

@@ -839,4 +848,13 @@ module BytesStringCommon {
    }
    return hash:uint;
  }
+
+  inline proc setCodepoints(ref lhs: ?t1, rhs: ?t2) {


I expect types to be (ref lhs: string, rhs: string). This function is not applicable to anything else but strings and as such, we want the compiler to be able to catch any error where this was used otherwise.

Just as a side note: as it is now, this would allow any two arguments of any types (not necessarily the same). If you want them to be of the same type, you should write ref lhs: ?t1, rhs: t1.

I would also make two other changes:

Function name can be more verbose: incrementCodepoints maybe? Or you can make it addCachedNumCodepoints and make it return an int instead of changing lhs.

We can make this function private, right? In other words, I don't expect it to be used outside of this module, but if you are using it in String.chpl, then leave it be. (And maybe add a comment to the private follow up issue, so that we can see if we can make it private with some more effort)
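The two suggestions above (concrete parameter types, and returning the combined count instead of mutating `lhs`) can be sketched in C++ as follows. `StringRec` is a hypothetical one-field model of the string record, and treating a negative cached value as "not computed" is an assumption based on the -1 case mentioned earlier in this review, not a statement about the merged code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, minimal model of the string record: only the cached count.
// Assumption: -1 means "number of codepoints not computed yet".
struct StringRec {
  int64_t cachedNumCodepoints = -1;
};

// Concrete parameter types let the compiler reject calls with anything that
// is not a StringRec, which is the point of avoiding fully generic formals.
// Returns the combined count, or -1 if either side is unknown.
int64_t addCachedNumCodepoints(const StringRec &lhs, const StringRec &rhs) {
  assert(lhs.cachedNumCodepoints >= -1 && rhs.cachedNumCodepoints >= -1);
  if (lhs.cachedNumCodepoints < 0 || rhs.cachedNumCodepoints < 0) {
    return -1;  // cannot combine an unknown count
  }
  return lhs.cachedNumCodepoints + rhs.cachedNumCodepoints;
}
```

Returning the value keeps the helper free of side effects, so a caller that appends `rhs` to `lhs` can assign the result to `lhs.cachedNumCodepoints` explicitly.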

bmcdonald3 · 2020-06-09T21:28:59Z

Test status:

standard
gasnet

e-kayrakli

I think this is good to go!

But I believe it either needs its commits to be cleaned up, or merged as a squash-merge.

Thanks!

@mppf

Improve how we create string/bytes internally

This PR mainly improves how we internally create strings and, to some extent, bytes.

#15758 added a `cachedNumCodepoints` field in the string record to avoid computing that value on demand. However, it also opened a can of worms about how we create strings. In general, that involved default-initializing a string and poking its fields. So, although the main motivation for this PR is to make `numCodepoint` management better, it has a bunch of other changes that are related to that to different extents.

More or less in order of significance, this PR does the following:

- Strengthen our management of `cachedNumCodepoints`
  - Add the argument to all non-validating string factory functions to force the caller to pre-compute the value, either by counting it or by doing something smart about it.
  - Update quite a few helpers to try to produce this value efficiently
  - Update `getView` to return a tuple that contains the number of codepoints in the view, if it is something that's easily computable.
  - Add 2 `countNumCodepoints` overloads to check whether `cachedNumCodepoints` is correct with `boundsChecking`
  - Serialize/deserialize that field
- Make `c_string` to `string` assignment use appropriate factory functions instead of `reinitString`.
- Remove `reinitString` from `string` and `bytes` in favor of `reinitWithNewBuffer` and `reinitWithOwnedBuffer` in `BytesStringCommon`
- Change the strategy of creating string and bytes in functions in `BytesStringCommon` and a few other similar ones. We used to do something like:

  ```chapel
  var ret: string;
  ret.isOwned = true;
  ret.someOtherField = otherValue;
  ret.yetAnotherField = 0;
  // ..
  return ret;
  ```

  This was too cumbersome and felt principally wrong to me. With this PR we compute what we need to create the new string's (1) buffer, (2) bufferSize, (3) bufferLength and (4) numCodepoints, and then call a factory function. I hope this'll help reduce the overhead of adding new fields to the string record.
- Fix a few bugs where the helpers attempted to return a `string` if the main argument in question was an empty `bytes`.
- Add an optimization to `getView` for the cases where the view we are trying to get is actually the full string
- Remove an internal cast to c_string to enable using explicit `bufferType` as an arg type in `bufferMemcpyLocal`
- Update string and bytes doc to render references to types as "string" and not as "String"
- Add tests

[Reviewed by @mppf]

Test:
- [x] memleaks on string/bytes tests
- [x] ml-memleaks on string/bytes tests
- [x] valgrind on string/bytes tests
- [x] string performance is back where we want it to be (if not better)
- [x] full standard
- [x] full gasnet

@ronawho

Add performance annotations

- String regressions
  Caused by: #15758
  Likely fix: #15870
- Trivial leak
  Caused by: #15713
  Fixed by: #15871
- AoA Forall Intent Improvement
  Caused by: #15767
- Sampler compile time regression
  Likely caused by: #15800

[Reviewed by @ronawho]

@bradcray

Control numCodepoints caching with a config param

This PR avoids an expensive check in the string implementation, but adds a secret back door to re-enable it. This is what I propose we should do instead of #18399.

This code was added in #15870 to tighten up the ways in which strings are created. When we added number-of-codepoints caching in #15758, we saw that we had to change how strings are created in many places in the standard libraries. I think #15870 put us in a good place, but I still have some uneasiness around caching the number of codepoints and relying on that cached value. But the current state of things is not ideal, either: we just halt with `--boundsChecking` if there's a discrepancy. With no way for the user to escape that, it is only marginally better than letting the wrong value survive.

So, this PR adds a no-doced `config param useCachedNumCodepoints = true` in the string implementation. This way, even with `--boundsChecking` we don't count the number of codepoints each time we need that value, but we have a way of avoiding the cached value if we see any issue with caching.

[Reviewed by @bradcray and @mppf]

Test:
- [x] types/string/cachedNumCodepoints
- [x] full standard
- [x] full gasnet
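As a rough illustration of the compile-time switch described above, the following C++ sketch models the `config param` as a template parameter. `StringRec`, `countNumCodepoints` and `numCodepoints` are hypothetical names for this sketch and do not reproduce the Chapel implementation. Counting assumes the buffer already holds valid UTF-8.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical model of a string record with a cached codepoint count.
struct StringRec {
  std::string buff;
  int64_t cachedNumCodepoints;
};

// Counts codepoints by counting bytes that are not UTF-8 continuation
// bytes (10xxxxxx). Assumes `buff` is valid UTF-8.
int64_t countNumCodepoints(const std::string &buff) {
  int64_t n = 0;
  for (unsigned char c : buff) {
    if ((c & 0xC0) != 0x80) ++n;
  }
  return n;
}

// UseCache plays the role of the compile-time switch: with `true` the cached
// value is trusted, with `false` the count is recomputed on every call.
template <bool UseCache = true>
int64_t numCodepoints(const StringRec &s) {
  if (UseCache) {
    return s.cachedNumCodepoints;
  }
  return countNumCodepoints(s.buff);
}
```

With the default, a stale cache would be returned as-is. Instantiating `numCodepoints<false>` is the escape hatch: it ignores the cached field and recounts.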

bmcdonald3 assigned e-kayrakli Jun 2, 2020

bmcdonald3 added 7 commits June 2, 2020 15:51

c code changed, string uncomplete

63d2a6a

testing size

2ba0226

changes without compiler

e30c231

unfinished compiler

42e0882

unfinished compiler

dce28b0

finished compiler

4d41ba5

String.size returns result of numCodepoints

59154ce

bmcdonald3 force-pushed the cache-string-length branch from a8b6390 to 59154ce Compare June 2, 2020 22:51

bmcdonald3 added 2 commits June 2, 2020 16:10

Fixed function header for isValidString()

d99641f

add tests for cachedNumCodepoints

15d7f92

e-kayrakli reviewed Jun 3, 2020

bmcdonald3 added 4 commits June 3, 2020 17:23

Borrowed buffer cachedNumCodepoints properly saves

210d19b

Explicitly named arguments and formatting

5b5e71d

add helper function to set cachedNumCodepoints

26c6591

Changed from 'getCodepoints' to 'setCodepoints'

0749e23

e-kayrakli reviewed Jun 5, 2020

bmcdonald3 added 9 commits June 5, 2020 16:08

Merge branch 'master' of https://github.com/chapel-lang/chapel into c…

c11d8b1

…ache-string-length

clean up of type checks, made helper function private and changed name

eb6e989

remove inline from helper function

b5ffa2c

Merge branch 'master' of https://github.com/chapel-lang/chapel into c…

b02f090

…ache-string-length

whitespace fix in BytesStringCommon

1876b31

change numCodepoints type from int to int64

e30d8cf

change compiler type from int to int64

4eecb0c

change to int64_t from int

d1d0e0a

change offset test to account for new field

674b554

e-kayrakli reviewed Jun 9, 2020

e-kayrakli approved these changes Jun 10, 2020

bmcdonald3 merged commit 9f7daa0 into chapel-lang:master Jun 10, 2020

bmcdonald3 deleted the cache-string-length branch June 10, 2020 16:47

This was referenced Jun 17, 2020

Improve how we create string/bytes internally #15870

Merged

Add performance annotations #15880

Merged

ronawho mentioned this pull request Jul 14, 2020

String performance regressions from #12899 #13130

Closed

e-kayrakli mentioned this pull request Sep 12, 2021

Control numCodepoints caching with a config param #18403

Merged
