Unicode String #179

rowland66 · 2020-08-03T02:22:46Z

RFC for changes to String class to better support unicode.

jemc

Thanks for putting this together. I have left some comments about things I would prefer to see changed.

Also, I suspect that this RFC may not be totally complete yet, in terms of tracking what all of the implications would be. I think it would be valuable for someone to start implementing this change (particularly the method signatures, even if the code for encoding/decoding is not fully robust yet) so that we could examine in more detail all of the things that will be affected by this change.

I also suspect there could be certain implications within this RFC that may be implicitly understood by the author, but may be not clear to the reader, resulting in confusion about exactly how these changes will look. The step I mentioned above of putting together a proof of concept level implementation could also help with this problem of potential misunderstandings of this very complex change.

jemc · 2020-08-03T15:04:06Z

text/0000-unicode-string.md

+1. The values() function will return an iterator over the string codepoints. Same as runes().
+1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes.
+
+Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate. 


Adding these type names to the builtin package means that no Pony program will be able to use those type names for user-defined types. The words Encoder and Decoder are incredibly general, and I can imagine many Pony programs or libraries using them to refer to things other than this string encoding use case.

I suggest using the names StringEncoder and StringDecoder instead.

jemc · 2020-08-03T15:10:53Z

text/0000-unicode-string.md

+
+The ByteSeq type defined in std_stream.pony will be changed to remove String.
+```
+type ByteSeq is (Array[U8] val)


There's not much point in having this type alias if it is not a union.

The purpose of this type is to define the types for which it is safe to call cpointer and size to pass to FFI calls. In the standard library it is only safe to do those for standard library types, since we can't trust any user-defined types that implement the cpointer and size interface to give us a pointer that is valid for the given number of bytes.

Given that size is no longer the right method to call (it needs to call byte_size instead, I recommend just removing this union altogether and using (String | Array[U8] val) in places where it is currently used, and in each such place we would need to introduce a match statement that checks which type it is and uses size for Array and byte_size for String.

jemc · 2020-08-03T15:12:44Z

text/0000-unicode-string.md

+
+Change Writer in buffered/writer.pony to accept Encoder parameters in the write() and writev() functions. 
+
+Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.


I think you meant to put this in the files package (where it can live alongside FileLines) rather than the buffered package.

Suggested change

Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.

Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.

jemc · 2020-08-03T15:14:47Z

text/0000-unicode-string.md

+1. The truncate() function will only support len parameter values less than the string size.
+1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed.
+1. The insert_byte() function will be changed to insert_utf32()
+1. The values() function will return an iterator over the string codepoints. Same as runes().


I suggest that we rename values to bytes instead of removing it entirely, so that the user can still iterate over bytes when they need to, without introducing an allocation to convert to Array[U8] val

Perhaps it's unclear to me about the allocations, but it does seem like the array() method has not been removed? If so, then one can still use that method for an allocation-less iteration method (and if necessary that type can be improved to fun box (): this->Array[U8] box). That said, I believe that bytes() is a much clearer name for that use case

jemc · 2020-08-03T15:25:25Z

text/0000-unicode-string.md

+# Alternatives
+
+1. Leave the String class as it is. This is likely to result in hidden errors in many programs that work with Strings as they will not work correctly with unicode data if they encounter it. It will also make it difficult to use Pony in settings where ASCII is not sufficient for local language (most non English speaking countries).
+1. A more complicated implementation with multiple String types capable of storing text data internally using different byte encodings. This approach would improve performance in situations where strings needed to be created from bytes of various encodings. The String type could be matched to the native byte encoding to eliminate any need for conversion. Such an implementation would add complexity to the String API as it would require use of a String trait with multiple implementations. It would also add considerable complexity to the String class implementation.


As I mentioned in our discussion in Zulip, I would prefer to see a version of this RFC that takes the encoding as a type parameter to all encoding-aware operations within the String class, using UTF-8 as the default (so that everything in this RFC works for the user as you describe by default), but so that non-UTF-8 use cases can also be supported.

This does not necessarily mean you need multiple String classes or a String trait - just multiple encoding classes and an encoding trait (I believe you already have an encoder/decoder traits as part of this RFC, so that could potentially be used).

mfelsche

Thanks for creating this RFC!

I left a few comments.

The main concern here is that this very substantial change to the language is missing some motivational comments. Especially: What will be improved? What will this change make possible, which isnt possible yet?

I wanna see the costs of implementing this RFC be justified.

mfelsche · 2020-08-03T20:07:26Z

text/0000-unicode-string.md

+
+# Summary
+
+Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value.


It is not 100% clear to me what the internal representation of String will be. Will it remain a pointer and a size only that the bytes pointed to will be actual U32 codepoints without encoding? Or will the actual underlying data remain encoded? This RFC should also state why this approach was chosen and the other ones dicarded.

I get the sense that by "present a sequence..." he may just mean that it implements Seq[U32] and would support those APIs and that as he's said below the data would still be represented internally as UTF-8.

So to make sure I understand correctly, "hello" would be internally stored as five bytes: 104, 101, 108, 108, 111, but if we asked for third character it would return 0,0,0,108 (assuming endian)?

and, when you stray out of ASCII into codepoints, for example:

"hӧle" => 104, **211, 167**, 108, 101

Would still have a length of 4, and the third character would be l?

mfelsche · 2020-08-03T20:17:58Z

text/0000-unicode-string.md

+  fun encode(codepoint: U32): (USize, U8, U8, U8, U8)
+
+trait val Decoder
+  fun decode(bytes: U32): (U32, U8)


What is the meaning of the elements of the returned tuple? Should this function be partial considering that not all byte-combinations are valid in every encoding?

Maybe a better signature/approach for encoding/decoding would be to encode a whole String into an Array[U8] and vice versa for decoding. This might seem more work on the implementor of Encoder/Decoder, but might allow to employ more tricks and be slightly more performant/convenient for some encodings. I also suspect, that implementing a special Encoder/Decoder is fairly rare.

rowland66 · 2020-08-04T14:53:04Z

Hi Joe, Thanks for providing feedback so quickly. I have make some comments below. *Rowland E. Smith* +1-201-396-3842

On 8/3/20 11:29 AM, Joe Eli McIlvain wrote: ***@***.**** commented on this pull request. Thanks for putting this together. I have left some comments about things I would prefer to see changed. Also, I suspect that this RFC may not be totally complete yet, in terms of tracking what all of the implications would be. I think it would be valuable for someone to start implementing this change (particularly the method signatures, even if the code for encoding/decoding is not fully robust yet) so that we could examine in more detail all of the things that will be affected by this change. I also suspect there could be certain implications within this RFC that may be implicitly understood by the author, but may be not clear to the reader, resulting in confusion about exactly how these changes will look. The step I mentioned above of putting together a proof of concept level implementation could also help with this problem of potential misunderstandings of this very complex change.

I agree that a sample implementation would help. I have 70% of an implementation done. How I go about sharing that with the group? Create a pull request?

------------------------------------------------------------------------ In text/0000-unicode-string.md <#179 (comment)>: > + +Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32). +``` +class val String is (Seq[U32] & Comparable[String box] & Stringable) +``` +Where String functions took parameters of type U8 or returned values of U8, they will take or return values of type U32. All indexes will be interpreted as codepoints. Only functions that manage String allocated memory will continue to refer to bytes. + +The following additional changes will be made to the String class: +1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes. +1. The truncate() function will only support len parameter values less than the string size. +1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed. +1. The insert_byte() function will be changed to insert_utf32() +1. The values() function will return an iterator over the string codepoints. Same as runes(). +1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes. + +Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate. Adding these type names to the builtin package means that no Pony program will be able to use those type names for user-defined types. The words |Encoder| and |Decoder| are incredibly general, and I can imagine many Pony programs or libraries using them to refer to things other than this string encoding use case. I suggest using the names |StringEncoder| and |StringDecoder| instead.

I agree, I will make the change.

------------------------------------------------------------------------ In text/0000-unicode-string.md <#179 (comment)>: > +1. The insert_byte() function will be changed to insert_utf32() +1. The values() function will return an iterator over the string codepoints. Same as runes(). +1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes. + +Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate. +``` +trait val Encoder + fun encode(codepoint: U32): (USize, U8, U8, U8, U8) + +trait val Decoder + fun decode(bytes: U32): (U32, U8) +``` + +The ByteSeq type defined in std_stream.pony will be changed to remove String. +``` +type ByteSeq is (Array[U8] val) There's not much point in having this type alias if it is not a union. The purpose of this type is to define the types for which it is safe to call |cpointer| and |size| to pass to FFI calls. In the standard library it is only safe to do those for standard library types, since we can't trust any user-defined types that implement the |cpointer| and |size| interface to give us a pointer that is valid for the given number of bytes. Given that |size| is no longer the right method to call (it needs to call |byte_size| instead, I recommend just removing this union altogether and using |(String | Array[U8] val)| in places where it is currently used, and in each such place we would need to introduce a |match| statement that checks which type it is and uses |size| for Array and |byte_size| for String.

I agree. I left the type in to improve backward compatibility.

------------------------------------------------------------------------ In text/0000-unicode-string.md <#179 (comment)>: > +Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter. + +A new StringIter interface will be added to std_stream.pony +``` +interface val StringIter + """ + An iterable collection of String box. + """ + fun values(): Iterator[this->String box] +``` + +Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed. + +Change Writer in buffered/writer.pony to accept Encoder parameters in the write() and writev() functions. + +Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class. I think you meant to put this in the |files| package (where it can live alongside |FileLines|) rather than the |buffered| package. ⬇️ Suggested change -Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class. +Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.

Yes, that is a mistake. I will fix.

------------------------------------------------------------------------ In text/0000-unicode-string.md <#179 (comment)>: > +Unicode is an accepted standard for representing text in all languages used on planet earth. Modern programming languages should have first class support for unicode. + +# Detailed design + +Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32). +``` +class val String is (Seq[U32] & Comparable[String box] & Stringable) +``` +Where String functions took parameters of type U8 or returned values of U8, they will take or return values of type U32. All indexes will be interpreted as codepoints. Only functions that manage String allocated memory will continue to refer to bytes. + +The following additional changes will be made to the String class: +1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes. +1. The truncate() function will only support len parameter values less than the string size. +1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed. +1. The insert_byte() function will be changed to insert_utf32() +1. The values() function will return an iterator over the string codepoints. Same as runes(). I suggest that we rename |values| to |bytes| instead of removing it entirely, so that the user can still iterate over bytes when they need to, without introducing an allocation to convert to |Array[U8] val|

||A bytes function would need to take an Encoder as a parameter, but I am OK with adding it.

------------------------------------------------------------------------ In text/0000-unicode-string.md <#179 (comment)>: > +This can be presented as a continuation of existing Pony patterns. + +The Pony tutorial will need to be updated to reflect these changes. + +# How We Test This + +Extend CI coverage to cover these changes and new features. + +# Drawbacks + +This is a change to the String API, and as such will break existing programs. String functions returning or taking as parameters U8 values now returning or taking U32 values will probably be the most common source of program breakage. Also, programs that used the existing unicode friendly functions in String will need to change, but they should be greatly simplified. + +# Alternatives + +1. Leave the String class as it is. This is likely to result in hidden errors in many programs that work with Strings as they will not work correctly with unicode data if they encounter it. It will also make it difficult to use Pony in settings where ASCII is not sufficient for local language (most non English speaking countries). +1. A more complicated implementation with multiple String types capable of storing text data internally using different byte encodings. This approach would improve performance in situations where strings needed to be created from bytes of various encodings. The String type could be matched to the native byte encoding to eliminate any need for conversion. Such an implementation would add complexity to the String API as it would require use of a String trait with multiple implementations. It would also add considerable complexity to the String class implementation. As I mentioned in our discussion in Zulip <https://ponylang.zulipchat.com/#narrow/stream/192795-contribute-to.20Pony/topic/Changes.20to.20String.20class/near/202574862>, I would prefer to see a version of this RFC that takes the encoding as a type parameter to all encoding-aware operations within the |String| class, using UTF-8 as the default (so that everything in this RFC works for the user as you describe by default), but so that non-UTF-8 use cases can also be supported. This does not necessarily mean you need multiple String classes or a String trait - just multiple encoding classes and an encoding trait (I believe you already have an encoder/decoder traits as part of this RFC, so that could potentially be used).

I am not sure what you mean by the "encoding-aware operations". I think that what would work would be adding an "internal encoding" parameter to all of the String constructors. This would require storing the internal encoding in every String instance, so some impact on memory footprint. Adding an "internal encoding" parameter to every encoding-aware operation would cover many operations. For example, the eq function would need to be encoding aware because it takes a String as a parameter. It would not be possible to compare 2 strings without knowing how the codepoints were encoded in each of them. It would also require that any class that takes a String as a parameter would need to know the strings are internal encoding in order to call its functions. That does not sound workable. Multiple encodings would also complicate FFI. Even in the standard lib, strings encoded with UTF-16 would require some changes in the pony runtime. I would prefer to defer this enhancement as it could be implemented later with minimal disruption if needed, and it would significantly increase the complexity and work required to make this String change.

…

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#179 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLV4FJR2Z4KZP66XU3WDR63JXNANCNFSM4PS4AMOQ>.

rowland66 · 2020-08-04T15:15:45Z

Hi @mfelsche, Thanks for the feedback. Comments below. *Rowland E. Smith* +1-201-396-3842

On 8/3/20 4:25 PM, Matthias Wahl wrote: ***@***.**** commented on this pull request. Thanks for creating this RFC! I left a few comments. The main concern here is that this very substantial change to the language is missing some motivational comments. Especially: What will be improved? What will this change make possible, which isnt possible yet? I wanna see the costs of implementing this RFC be justified. ------------------------------------------------------------------------ In text/0000-unicode-string.md <#179 (comment)>: > @@ -0,0 +1,97 @@ +- Feature Name: unicode-string +- Start Date: 2020-07-30 +- RFC PR: (leave this empty) +- Pony Issue: (leave this empty) + +# Summary + +Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value. It is not 100% clear to me what the internal representation of String will be. Will it remain a pointer and a size only that the bytes pointed to will be actual |U32| codepoints without encoding? Or will the actual underlying data remain encoded? This RFC should also state why this approach was chosen and the other ones dicarded.

The internal representation of String will remain as it is now except that the pointer will always reference a UTF-8 encoding of the strings codepoints. Today, there is no such certainty.

------------------------------------------------------------------------ In text/0000-unicode-string.md <#179 (comment)>: > + +The following additional changes will be made to the String class: +1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes. +1. The truncate() function will only support len parameter values less than the string size. +1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed. +1. The insert_byte() function will be changed to insert_utf32() +1. The values() function will return an iterator over the string codepoints. Same as runes(). +1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes. + +Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate. +``` +trait val Encoder + fun encode(codepoint: U32): (USize, U8, U8, U8, U8) + +trait val Decoder + fun decode(bytes: U32): (U32, U8) What is the meaning of the elements of the returned tuple? Should this function be partial considering that not all byte-combinations are valid in every encoding?

The tuple returned by the encode function contains the number of bytes required to encode the codepoint (1-4), and the byte values of the encoded codepoint. Perhaps it would be more efficient to compact the 4 bytes in a single U32. That is what is done with decode. In the case of bytes that do not correctly decode into codepoints, the unicode replacement character (0xFFFD) is inserted for each invalid byte consumed. This is how the existing UTF-8 decoding function worked, and I think it is the best approach. Without it, many String constructors become partial, and that is a big backward compatibility issue.

------------------------------------------------------------------------ In text/0000-unicode-string.md <#179 (comment)>: > + +The following additional changes will be made to the String class: +1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes. +1. The truncate() function will only support len parameter values less than the string size. +1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed. +1. The insert_byte() function will be changed to insert_utf32() +1. The values() function will return an iterator over the string codepoints. Same as runes(). +1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes. + +Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate. +``` +trait val Encoder + fun encode(codepoint: U32): (USize, U8, U8, U8, U8) + +trait val Decoder + fun decode(bytes: U32): (U32, U8) Maybe a better signature/approach for encoding/decoding would be to encode a whole String into an Array[U8] and vice versa for decoding. This might seem more work on the implementor of |Encoder|/|Decoder|, but might allow to employ more tricks and be slightly more performant/convenient for some encodings. I also suspect, that implementing a special |Encoder|/|Decoder| is fairly rare.

The Encoder/Decoder classes are used in a variety of situations. When reading from a file or a socket, the bytes read might not decode to a characters completely. An additional few bytes from the next read might be required. This approach deals with that situation with reasonable ease. Yes, I don't expect that there will be very many Encoder/Decoder's.

…

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#179 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLV644N6VX2AEHQSKKHTR64MK3ANCNFSM4PS4AMOQ>.

jemc · 2020-08-04T17:02:42Z

I agree that a sample implementation would help. I have 70% of an
implementation done. How I go about sharing that with the group? Create
a pull request?

You don't necessarily need to make a pull request - you can just make sure your branch is public on GitHub in your fork of the ponyc repo, then share the link here so we can go inspect your branch as we read the RFC.

jemc · 2020-08-04T17:03:54Z

Regarding some of your other responses - I'm a bit confused currently, but I'll wait and look at your sample implementation so I can be sure I understand your comments correctly before I respond.

rowland66 · 2020-08-04T17:31:44Z

Thanks Joe I will distribute the link as soon as I can. My power is out due to the storm on the east coast, so it may take me a while. Also it looks like am going to miss the weekly sync call.

…

On Tue, Aug 4, 2020, 1:02 PM Joe Eli McIlvain ***@***.***> wrote: I agree that a sample implementation would help. I have 70% of an implementation done. How I go about sharing that with the group? Create a pull request? You don't necessarily need to make a pull request - you can just make sure your branch is public on GitHub in your fork of the ponyc repo, then share the link here so we can go inspect your branch as we read the RFC. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#179 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLV3E73QK6M4F5E3KZRDR7A5MFANCNFSM4PS4AMOQ> .I

jasoncarr0 · 2020-08-05T03:51:55Z

I'm worried about the encoding, and it wasn't clear to me if the proposal would leave that unchanged and merely implement the Seq trait as though it were a sequence of U32 elements, or make the representation more pervasive.

With a representation as a subset of byte sequences we can have Array[U8] val -> String? and fun bytes(): this->Array[U8] box which would remain safe and allocation-less, enabling more interop and smoother interaction with existing code.

rowland66 · 2020-08-07T00:38:45Z

Hi All, My house has finally gotten electric power back, and I have been able to fork the ponyc repo and push a branch with my string class changes. The URL for the branch is: https://github.com/rowland66/ponyc/tree/string I hope that this make some of the items that we have been discussing clearer. I am new to GitHub, so if you have any trouble with access, please let me know. *Rowland E. Smith* +1-201-396-3842

…

On 8/4/20 1:02 PM, Joe Eli McIlvain wrote: I agree that a sample implementation would help. I have 70% of an implementation done. How I go about sharing that with the group? Create a pull request? You don't necessarily need to make a pull request - you can just make sure your branch is public on GitHub in your fork of the ponyc repo, then share the link here so we can go inspect your branch as we read the RFC. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#179 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLV3E73QK6M4F5E3KZRDR7A5MFANCNFSM4PS4AMOQ>.

SeanTAllen · 2020-08-11T18:11:46Z

Rowland has updated the RFC in response to comments.

@jemc @jasoncarr0 @mfelsche can you have another look?

jemc · 2020-08-18T19:40:02Z

Some notes from the post-sync-call on suggestions to improve the performance and flexibility of various use cases, based on some limited review of Rowland's branch so far:

add embed _array: Array[U8] as the field within String, instead of the three fields we have now that happen to be the same as Array[U8]
- this makes it so that string to array conversions can be zero allocation - every string already contains an array that it can share as a box/val. See the next bullet point for more info;
fun val array: Array[U8] val become fun array: this->Array[U8] box
- it is a superset of what we have now, and allows a readable "byte string" reference to a String ref, rather than requiring val.
add whatever methods we need to add Array[U8] that are restricted by A:U8 (e.g. read_u8) to make Array[U8] have everything that a "byte string" class needs
- this will mitigate the usability issues that will result from bytewise operations disappearing from String.
encoder as a type parameter rather than a runtime parameter
- also make the method use iftype E <: UTF8StringEncoder instead of if encoder is UTF8StringEncoder
- these together make it so that there is zero cost for detecting the encoder at runtime - it is known at compile time.

jasoncarr0 · 2020-08-18T20:07:28Z

* encoder as a type parameter rather than a runtime parameter
  
  * also make the method use `iftype E <: UTF8StringEncoder` instead of `if encoder is UTF8StringEncoder`
  * these together make it so that there is zero cost for detecting the encoder at runtime - it is known at compile time.

Though, we should perhaps verify that there is a runtime cost being paid here without the generics. I suspect given the size and structure of the method that LLVM would have an easy time eliminating it.

rowland66 · 2020-08-20T01:18:41Z

Hi Joe and all, I have added comments below.

On Tue, Aug 18, 2020 at 3:40 PM Joe Eli McIlvain ***@***.***> wrote: Some notes from the post-sync-call on suggestions to improve the performance and flexibility of various use cases, based on some limited review of Rowland's branch so far: - add embed _array: Array[U8] as the field within String, instead of the three fields we have now that happen to be the same as Array[U8] - this makes it so that string to array conversions can be zero allocation - every string already contains an array that it can share as a box/val. See the next bullet point for more info; [RS] I have no objection to doing this in principal. It would be a lot of

work, and is not related to the changes that I am proposing. Does anyone know why String is not implemented this way currently? I am just concerned that there might be some subtle reason why it won't work.

- fun val array: Array[U8] val become fun array: this->Array[U8] box - - it is a superset of what we have now, and allows a readable "byte string" reference to a String ref, rather than requiring val. [RS] OK. Again, this is just an improvement to String that could be

implemented independently of this RFC, right?

- add whatever methods we need to add Array[U8] that are restricted by A:U8 (e.g. read_u8) to make Array[U8] have everything that a "byte string" class needs - this will mitigate the usability issues that will result from bytewise operations disappearing from String. [RS] I like this idea. This way, it would not be necessary to convert a

byte array to a String just to get access to some String functions. I will look at both classes and figure out what is missing.

…

- encoder as a type parameter rather than a runtime parameter - - also make the method use iftype E <: UTF8StringEncoder instead of if encoder is UTF8StringEncoder - these together make it so that there is zero cost for detecting the encoder at runtime - it is known at compile time. [RS] I don't understand this. Can you provide a more complete example? - — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#179 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLVZFBZXHVKO3JOFVYZTSBLKKBANCNFSM4PS4AMOQ> .

-- *Rowland E. Smith* P: (862) 260-4163 M: (201) 396-3842

jasoncarr0 · 2020-08-20T01:40:11Z

@rowland66 The reasons that these changes are being mentioned now is because of the change to byte-wise API in the String. Efficient access to the underlying array and byte array methods is one way to allow us to access these lost APIs more effectively, without compromising on performance. It is true that I certainly don't know why the existing String has not been implemented that way.

Hence why these suggestions are intended to be part of this RFC (but note that implementation details such as the embed could be implemented in multiple PRs). Anyone else feel free to chime in if I'm not understanding correctly.

ramonacat · 2020-10-20T12:12:31Z

text/0000-unicode-string.md

+
+# Detailed design
+
+Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32). 


I think it would make sense to mention endianness here (which I assume will be the one of the machine for which the application is compiled)

ramonacat · 2020-10-20T12:22:43Z

text/0000-unicode-string.md

+  fun values(): Iterator[this->String box]
+```
+
+Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed.


I think it's worth clarifying here if only \r\n and \n will be considered line breaks or if support for unicode line breaks (like U+2028, U+2029) will be added

ramonacat · 2020-10-20T12:24:15Z

text/0000-unicode-string.md

+
+Change Writer in buffered/writer.pony to accept StringEncoder parameters in the write() and writev() functions. 
+
+Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class in files/file_lines.pony.


Since the Unicode standard does not define what a character is, I think it's worth claryfing if it means codepoints, grapheme clusters or something else

rowland66 · 2020-10-21T02:40:10Z

Hi Agares and all, Thank you for the feedback. I have added comments below, and will make changes to the RFC document.

On Tue, Oct 20, 2020 at 8:25 AM agares ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In text/0000-unicode-string.md <#179 (comment)>: > +- Feature Name: unicode-string +- Start Date: 2020-07-30 +- RFC PR: (leave this empty) +- Pony Issue: (leave this empty) + +# Summary + +Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value. + +# Motivation + +Unicode is an accepted standard for representing text in all languages used on planet earth. Modern programming languages should have first class support for unicode. + +# Detailed design + +Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32). I think it would make sense to mention endianness here (which I assume will be the one of the machine for which the application is compiled)

For the provided encoders, I have UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE. As I understand it, UTF-8 has no concept of endianness. I will update the text above to indicate the additional provided encoders. Does this cover your concern?

------------------------------ In text/0000-unicode-string.md <#179 (comment)>: > +The ByteSeq type defined in std_stream.pony will be changed to remove String. +``` +type ByteSeq is (Array[U8] val) +``` +Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter. + +A new StringIter interface will be added to std_stream.pony +``` +interface val StringIter + """ + An iterable collection of String box. + """ + fun values(): Iterator[this->String box] +``` + +Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed. I think it's worth clarifying here if only \r\n and \n will be considered line breaks or if support for unicode line breaks (like U+2028, U+2029) will be added I will add this clarification. Only \r\n and \n will be considered line

breaks. No change to what is done today. As per wikipedia https://en.wikipedia.org/wiki/Newline "Recognizing and using the newline codes greater than 0x7F (NEL, LS and PS) is not often done."

In text/0000-unicode-string.md <#179 (comment)>: > +Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter. + +A new StringIter interface will be added to std_stream.pony +``` +interface val StringIter + """ + An iterable collection of String box. + """ + fun values(): Iterator[this->String box] +``` + +Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed. + +Change Writer in buffered/writer.pony to accept StringEncoder parameters in the write() and writev() functions. + +Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class in files/file_lines.pony. Since the Unicode standard does not define what a character is, I think it's worth claryfing if it means codepoints, grapheme clusters or

something else

Agreed. I will change the FileCharacters class to FileCodepoints.

…

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#179 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLVYAKMNI5CLAKNDENVLSLV6UTANCNFSM4PS4AMOQ> .

-- *Rowland E. Smith* P: (862) 260-4163 M: (201) 396-3842

jasoncarr0 · 2020-12-04T20:36:52Z

I'd like to leave a note here that, after looking over the PR (which has not yet implemented this portion), I believe my suggested change for the capability of the array() method causes very confusing behavior ~~and unsafety in some edge cases~~ with the expected implementation.

This can be fixed by using an embedded array, or two separate methods with restricted types and necessary warnings about confusing behavior.

jasoncarr0 · 2020-12-08T19:25:57Z

We talked about this over Sync. The plan we went forward with is keeping the array method as val only. This will continue the behavior of not copying for UTF8Encoder

The low-level access can be provided by a method with the suggested type signature:
fun current_byte_buffer(): this->Array[U8] box
which always constructs the underlying byte array without regard to encoding. This will share the buffer with the String, but construct a new read-only Array object pointing to it.

The documentation for the method will indicate that the array will track writes until the String's buffer is reallocated due to length changes.

rowland66 · 2020-12-09T04:06:57Z

Hi Jason, I have implemented this with this function. fun current_byte_buffer(): Array[U8] => Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc) I did not understand what "this->Array[U8] box" would mean. As I understand it, "this->" mean return the reference capability of the receiver, but then why say box. I think that we want Array[U8] box, but since the default reference capability for a function is box, so the function will return a box also. This code seems to work. I tried it with a val, trn and ref String. If this makes sense, I will add some tests and commit the change. Also, how does Pony garbage collection work for this type of thing. I seems that we now have 2 references to the same memory. How does the Pony runtime know than and not collect the underlying array bytes while they are still referenced?

…

On Tue, Dec 8, 2020 at 2:26 PM Jason Carr ***@***.***> wrote: We talked about this over Sync. The plan we went forward with is keeping the array method as val only. This will continue the behavior of not copying for UTF8Encoder The low-level access can be provided by a method with the suggested type signature: fun current_byte_buffer(): this->Array[U8] box which always constructs the underlying byte array without regard to encoding. The documentation for the method will indicate that the array will track updates until the String's buffer is reallocated. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#179 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLVZMTE2BHJRQRVFE5ETSTZ4VRANCNFSM4PS4AMOQ> .

-- *Rowland E. Smith* P: (862) 260-4163 M: (201) 396-3842

jasoncarr0 · 2020-12-09T04:44:41Z

I did not understand what "this->Array[U8] box" would mean

this->Array[U8] box is valid syntax for the return type.
What that means can be factored into
this->(Array[U8] box)
i.e. it's the type that the current receiver views box as. Specifically this means that if the receiver is box or ref, it is indeed just box. But if the receiver is val then it means val. This is still useful given the other array method as it doesn't copy for any encoding.

In this case we want to ensure regardless of the receiver that it is read-only (so as to not break the string invariants), but we can be a bit stronger for val than just using box

Also, how does Pony garbage collection work for this type of thing. I seems
that we now have 2 references to the same memory. How does the Pony runtime
know than and not collect the underlying array bytes while they are still
referenced?

As I understand it, the allocation as a whole is still garbage collected (and it needs to be read by the garbage collector for general arrays) and deallocation doesn't occur until neither the String nor Array are pointing to it.

rowland66 · 2020-12-11T04:07:55Z

Hi Jason, I have been struggling with this with no success. I can write three separate functions and it compiles: fun ref current_byte_buffer_ref(): Array[U8] ref => Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc) fun box current_byte_buffer_box(): Array[U8] box => Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc) fun val current_byte_buffer_val(): Array[U8] val => recover val Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc) end I can see how current_byte_buffer_ref() and current_byte_buffer_box() could be combined as a single function that returned this->Array[U8] box. But I can't figure out how to handle the val case, and since a box function works with a ref, val and box receiver I need to handle all three. What am I missing?

…

On Tue, Dec 8, 2020 at 11:44 PM Jason Carr ***@***.***> wrote: I did not understand what "this->Array[U8] box" would mean this->Array[U8] box is valid syntax for the return type. What that means can be factored into this->(Array[U8] box) i.e. it's the type that the current receiver views box as. Specifically this means that if the receiver is box or ref, it is indeed just box. But if the receiver is val then it means val. This is still useful given the other array method as it doesn't copy for any encoding. Also, how does Pony garbage collection work for this type of thing. I seems that we now have 2 references to the same memory. How does the Pony runtime know than and not collect the underlying array bytes while they are still referenced? As I understand it, the allocation as a whole is still garbage collected (and it needs to be read by the garbage collector for general arrays) and deallocation doesn't occur until neither the String nor Array are pointing to it. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#179 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLV2T5RF2IYXVHNMWFUDST36ENANCNFSM4PS4AMOQ> .

-- *Rowland E. Smith* P: (862) 260-4163 M: (201) 396-3842

jasoncarr0 · 2020-12-11T04:58:20Z

In this case you can use a bare recover without specifying a capability. I think this is a bit of a bug that complex capabilities cannot be written as the destination cap of a recover block (although it's only needed because of pointers here rather than actual fields). That is:

  fun current_byte_buffer(): this->Array[U8] box =>
    recover Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc) end

Strictly speaking you should also be able to recover to val and cover it, but it's not actually "correct" for most of the cases

jasoncarr0 · 2020-12-11T08:02:29Z

Finally got my ponyc env working again. The field access checks for recover blocks are still a tiny bit too strict, they should drop down to tag instead of yelling in this case, so you actually need to do:

  fun current_byte_buffer(): this->Array[U8] box =>
    let ptr = _ptr
    recover Array[U8].from_cpointer(ptr._unsafe(), _size, _alloc) end

But this code builds for me

esoterra · 2021-02-28T00:38:38Z

Hello, just realized this PR existed and am going to carry my feedback over from the issue I originally commented on. I provided a quick overview of some differences between what was suggested there and what is done in Rust and the benefits of those design choices.

My understanding is that this RFC encodes all strings as UTF-8 byte arrays and only allows users to operate on them using their character indices. Assuming that is correct and all reads and mutations of strings are performed on the byte representation, then I am concerned there will be issues with offering efficient string operations.

Specifically because on an arbitrary UTF-8 byte sequence computing the byte index from the character index is O(n).
To my knowledge the string representations used by existing languages tend to avoid this issue by either not presenting the user with character indexing (e.g. Rust, see my prior response) or by encoding strings so that indexing by character can be done in constant time (e.g. Python, see this).

What is the intended approach of this RFC and how does it avoid or mitigate this issue?

rowland66 · 2021-03-01T00:49:05Z

Hi Kyle,

Thank you for taking the time to look at this. I am the author of the RFC, and have worked on the implementation. As you mention, the biggest down side to using a variable size encoding for unicode is that index operations are slow, and the implementation of this RFC has no particularly good solution. Pony does have an Array class that has many of the same functions as String, so for performance critical code, one option is to use Array's to hold String bytes. Also, some of the String functions take positive and negative index values. Negative indexes are relative to the end of the String, so in some cases index access using a negative index would be much faster.

In how different programming languages deal with Strings seems to be a tradeoff between speed, efficiency and correctness. As I understand it, Rust strings are UTF-8 encoded internally, but there are functions to access the underlying bytes, but they are prone to panics if used incorrectly. So Rust seems to be leaning toward correctness, but provides a loaded gun in case you want to take it and shoot yourself. Phython seems to be favoring efficiency and speed with the ability to choose between 1, 2 and 4 bytes per character strings. I assume that adding certain unicode characters to a 1 or 2 byte per character string results in some type of runtime error. So it is easier to write programs that fail at runtime unless you use 4 character byte strings, but that is pretty inefficient in most cases. The Pony implemenation in this RFC favors correctness (number 1 on the list in the Pony Philosophy).

In the RFC design, we have tried to leave room for alternate String implementations that might be better suited for some situations. The new String design also tries to remain as comaptible as possible with the existing Pony String class. The existing Pony String has a mix of functions some of which assume 1 byte per character and others that support a UTF-8 encoding. This was a confusing API that left a lot of room for programming error.

Hope this makes sense, and I am happy to discuss further if you have some ideas for improvement.

esoterra · 2021-03-01T17:59:01Z

Hi Rowland,

Thanks for getting back to me quickly, I have a few specific comments below about things you mentioned and have tried to better outline which Rust features would best inform a Pony Unicode String implementation.
Hopefully this will help clarify why I think Rust and Python act as archetypes for their chosen string strategy and why I think they are a good reference for designing a new string API.

RE: Negative Indices

That is an interesting point about negative indices. If reverse unicode segmentation is used, you can potentially traverse fewer bytes.

RE: Rust

they are prone to panics if used incorrectly

This is technically true of the Rust design, but there are panic-free options available and the ones that can panic are meant to be used with something like char_indices which ensure that you are at a valid boundary. The idiomatic ways of using Rust's String API don't meaningfully sacrifice correctness.

Additionally, the fact that it does allow panic using &a[i..j] is not a feature I am recommending Pony replicate.
The rest of the API and its tradeoffs can exist entirely independent of it because a.get(i..j) provides the same functionality.

RE: Python

I assume that adding certain unicode characters to a 1 or 2 byte per character string results in some type of runtime error

This is not correct because strings in python are immutable and you cannot create such a scenario. Python can more correctly be said to emphasize correctness over speed/efficiency in the same way that concatenating Java, JavaScript, or C# strings directly does.

i.e. Python is in the "shared immutable strings" family and is optimized for that case

What would it look like to "follow Rust's example"?

Where Python offers shared immutable strings and is a good example for how to do so, Rust offers owned mutable append-only strings with an efficient and clean (ignoring the aforementioned) API.

In Pony terms, as I understand them, implementing an approach similar to Rust's would involve the following characteristics.

Internally represent strings as UTF-8
Represent characters using a 4-byte wide type
Make strings character iterable
Make strings (character, byte-index) iterable
Support byte-index based reading operations using partial functions which check whether the indices are valid (this check is quite inexpensive)
Treat the string length as its byte length
Require users to use the char and char_indices-like iterators for cases where they want to think in terms of characters, explicitly taking on the cost of iterating over the string.

jemc · 2021-03-02T19:25:34Z

@Kylebrown9 - if I understand your proposal correctly, it sounds like you are recommending something like: Keep byte-based operations as the first-class operations, but also have character-based operations.

This seems similar to what we have in Pony today (perhaps with the differences of: preference for UTF8 rather than UTF32 and perhaps with support for more character operations), so I'd be curious to hear you compare your suggested approach to Pony's current approach, rather than the approach proposed in this PR.

esoterra · 2021-03-02T20:09:06Z

In Rust, making a string from bytes is effectively partial because not all byte sequences are valid UTF-8 strings.
Additionally, operations on strings preserve validity of the string as UTF-8, if an operation would violate this it instead fails.
This means that all Rust strings can be assumed to be valid UTF-8 and as I understand it is not something Pony guarantees.
Python actually makes a similar guarantee in a different way by keeping bytes and str distinct, converting between them can fail and operations that make new strings guarantee unicode validity.

As a change to Pony today

Making a string from a sequence of U8 would become partial, only succeeding when the resulting string would be valid utf-8.
Operations that use byte indices (e.g. apply, trim, chop, delete, substring) would become partial, failing if they index somewhere that is not a valid codepoint boundary.
Operations that read a U8 as a character (e.g. apply, pop, shift) could either
- be made explicitly byte-specific where possible (e.g. pop_byte),
- be converted to versions that produce U32 or a specific character type,
- or be removed.
Operations that provide a U8 as a character (e.g. update, unshift, push) could either
- be made explicitly byte-specific where possible (e.g. ushift_byte) although I wouldn't recommend this most of the time,
- be converted to versions that take in a U32 or special character type,
- or be removed.
The runes method would be kept, although I'd be tempted to call it chars, and another iterator would be added that also tells you the byte-index of the current character.

Edit: If this was going to follow the Rust approach, then a different string type(s) could/would be added to handle platform specific strings (see OsString) which may not satisfy these invariants or have other requirements.

esoterra · 2021-03-02T20:27:53Z

Also, I realize I've ended up proposing and defending a specific prescription for how to do strings in Pony.
That wasn't really my intention on arrival, I just ended up doing so in clarifying that I think it's important
to offer a string API with always-valid unicode strings that offers indexing which matches the stored representation.

The language comparisons and proposal outlines I've made are just intended to demonstrate what
I mean by those statements and ways they can be achieved.

jasoncarr0 · 2021-03-02T20:30:46Z

For byte reading in this RFC, we have the current_byte_buffer method (EDIT: actually it still seems to be called array here). Do you think that's clear enough for you to find? And does it make sense to you to have byte-reading but not byte-writing?

esoterra · 2021-03-02T20:47:20Z

The only two things I would ask of a byte-based method on a string is that

it has byte or bytes in the name
and it cannot cause the string to become invalid.

There is an argument to be made that some methods might confuse users or encourage them to ignore the fact
that a string is made of characters and not bytes, and in those cases it may make sense to not include them.
Each method and name should be treated on a case-by-case basis and I don't necessarily have strong views
on what the answer is because I don't know all of the existing idioms and use cases out there that they may impact.

Note: I don't consider a method to be "byte-based" if it does not take in or return U8 values, even if it uses byte-indices.
i.e. substring and chop would not be byte-based as long as they return strings and not bytes

Edit: Forgot to directly answer your question. What you described meets the requirements I would set for it and sounds fine but others may have opinions on those choices.

rowland66 · 2021-03-03T04:42:45Z

Hi Kyle,

There is a lot to digest here. Let me start with your list from above of String features in Rust and how the compare to Pony post adoption of this RFC.

Internally represent strings as UTF-8

Same

Represent characters using a 4-byte wide type

Same. Pony has no Character type, but String functions return unicode codepoints as U32

Make strings character iterable

Same

Make strings (character, byte-index) iterable

No. As mentioned above, String has an array function that is very efficient assuming you want the String bytes UTF-8 encoded. However, the resulting Array is only safe to use if you know that your String contains only ASCII characters. The resulting Array is also a val, so it cannot be modified. Since the Arrays shares the memory used by the String, modifying the Array might corrupt the String.

Support byte-index based reading operations using partial functions which check whether the indices are valid (this check is quite inexpensive)

This would add a bunch of partial functions to String which in undesireable. Byte index values are never safe to change (increment or decrement), so their value is limited to remembing a position in a String. They could be useful if you know that your String is ASCII, but then you can use the array function and work with the String as an array of bytes. There is no right or wrong answer here. Its a decision on whether adding functions that are likely to be misused resulting in runtime errors is worth some added utility provided by correct use of the functions.

Treat the string length as its byte length

There is already a byte_size() function that serves this purpose. Knowing a String size in codepoints is more often what is needed, so I think that the size() behavior is correct.

Require users to use the char and char_indices-like iterators for cases where they want to think in terms of characters, explicitly taking on the cost of iterating over the string.

I think that it makes sense that String be a sequence of Characters, and Characters are no longer single bytes. For applications that want to work with bytes, Pony has the Array class. I think that keeping String's and Arrays of bytes separate and distinct results in less error prone code and I value that highly.

Something worth considering is whether Pony should support alternate internal presentations of String data. The intent is to leave that possibility open for the future.

Another thing to consider is optimizations to the current String class. For example, a String could maintain a boolean indicator of whether it contains any multi-byte encoded characters. In cases where it does not, character index based operations could be much faster. This does add additional data to every String instance, so such a feature would need to be carefully considered.

esoterra · 2021-03-03T15:02:25Z

Hi Rowland,

I think so far it sounds like we're both in agreement about my priority A that
Pony should "offer a string API with always-valid unicode strings".

It seems like we disagree about priority B, that it should use "indexing which matches the stored representation".
My motivation for priority B could be rephrased as "indexing operations should be constant time"
and I think it is born out by (1) the principle of least surprise and (2) the goal of performance.

1. Principle of Least Surprise

Popular programming languages usually offer constant time indexing for strings and I believe users now expect this.
When users write for-loops that iterate over the index range or attempt to perform a series of random accesses because they don't realize this choice was made, they will encounter a performance cost that vastly exceeds their expectation, ergo surprise.

2. Performance

I think there are also other use cases where code that could have worked well on the existing string design or other options we have now will be meaningfully less performant.
e.g. a compiler which tokenizes a string (which may not be all ascii) and then refers to spans of the string using character indices throughout the rest of the compiler. Whenever they want to read that span, they will take on an extra linear time cost and "remembering" the byte-index they're actually referring to is actually quite useful.

jasoncarr0 · 2021-03-03T15:39:49Z

Honestly for 1. I'd be surprised to see constant-time indexing for anything other than ascii.
But for 2. Don't substrings cover this use-case for relatively low space cost? Obviously this costs a bit less in e.g. Rust or C than in Pony. It would be nice to have a raw slice type in Pony for this, but otherwise I would just use a string slice in Rust (i.e. a pointer,length pair)

rowland66 · 2021-03-03T18:23:24Z

@Kylebrown9 For 1, I agree with @jasoncarr0 why would you expect constant-time indexing for variable size elements. Also there is another way to look at principal of least surprise. I should not be surprised when my code fails because it encounters strings containing multi-byte characters. Having more functions in String that use byte indexes increases the likelyhood of these unpleasant surprises.

For 2, I agree that there is a performance pennalty in some cases for using Strings. In cases where the pennalty is large (a compiler tokenizing text) there is always the option to use Array.

esoterra · 2021-03-03T18:32:07Z

How many average programming language users are thinking actively about the fact that strings are variable length when using a unicode API that gives character indices? I think they'll tend to expect indexing into this Seq[U32] to be O(1).

As a simple example program that I think will surprise people, imagine writing a palindrome checker.
I think many users would iterate from 0 to n/2 comparing the values at index i and n-i (or -i).
I don't think they're going to expect that that strategy will put them in O(n^2).

esoterra · 2021-03-03T18:32:25Z

Here's a table I'm putting together to summarize different language's design choices on this matter.
I've tried to fill in as much of the most popular programming languages on GitHub,
feel free to suggest corrections or additions and I'll render them.

Language	Encodings	Indexing	Notes
JavaScript	UTF-16	?	Surprisingly hard to find info on indexing operations
Python	Latin-1, UCS-2, UCS-4	`O(1)`	Strings are immutable, encoding selected automatically
Java	UTF-16	`O(1)`	Caveat: may only return half a character
C++	ASCII / UTF-8?	`O(1)`	?
Rust	UTF-8	`O(1)`	Uses byte-indexes, must occur at codepoint boundary
Ruby	ASCII, EUC, UTF-8	?	Allows users to choose and convert encodings
PHP	Latin-1	`O(1)`	No default support for unicode

jasoncarr0 · 2021-03-04T00:52:39Z

Ah, so you're specifically talking about O(1) byte-indexing. I'm more thinking character indexing, and hence I would be surprised to see it be O(1), at least with good character support.

Could you help explain the use case for byte-indexing, particularly if there's something that's not covered by substrings? Or O(1) character-indexing?

esoterra · 2021-03-04T02:45:32Z

Among those examples there are some that index by character (e.g. Python) and some that index by byte (e.g. Rust).
As far as I can tell, irrespective of that choice, most popular programming languages indexing is O(1).
I'm suggesting that regardless of how we choose to index, some people will expect it to be O(1) (see palindrome example).

Most cases for byte-indexing (e.g. compiler lexers & error reporting and text editor piece tables) can probably be replaced by allocating a new substring if the cost of doing so is acceptable.
In Rust, character-based work is usually done by s.chars() with things like s.chars().nth(...) and s.chars().rev().nth(...) filling the role that the proposed character indexing would, while being obvious about their cost.

I think that offering O(n) character indexing is going to trip up some new users and that some form of O(1) indexing or slicing is probably needed for certain specific high-performance scenarios.
However, I'll be fine with any decision if it is aware those concerns and is clear about the chosen performance characteristics.

(I've spent quite a lot of time writing these responses and will be dialing that back, but I can clarify things if needed)

jasoncarr0 · 2021-03-04T03:36:31Z

Got it, thanks for the thoughtfulness @Kylebrown9

redvers · 2021-03-04T10:49:06Z

One more datapoint - The BEAM (erlang, elixir, lfe et al) uses utf-8 encoded binaries as the native string datatype. There are multiple other formats supported for external compatibility - but that is its primary.

All the string operations work on graphemes, not bytes.

redvers · 2021-03-04T10:56:11Z

(I'm also going to demonstrate my lack of CS education and admit I don't understand how you can have an algorithm which can identify say the 10,000th character in a string with O(1) performance when you have the variable width encoding that utf8 provides)

esoterra · 2021-03-04T14:22:49Z

The only examples on my list above that uses UTF-8 and offers O(1) indexing (Rust/C++) uses byte-indexing.
i.e. it does not identify the 10,000th character, it identifies the 10,000th byte

redvers · 2021-03-04T19:41:03Z

I think the only other thing I think I should mention should be in the "How We Teach" section.

Since we're counting and indexing on codepoints and not graphemes we should probably mention that it is possible to get legal codepoints out of this interface which are illegal graphemes.

For example, in a grapheme-based implementation:

iex(260)> "👨‍🦰" |> String.length
1
iex(261)> "👨‍🦰" |> String.codepoints
["👨", "‍", "🦰"]

ie: "Man" + "ZWJ" + "Red Hair"

In this RFC implementation, the length would be 3 (as there are three codepoints). Without documenting it - that's arguably unexpected behaviour because most (or many) developers don't know that there isn't a 1:1 mapping between codepoints and characters.

Having the interface be grapheme-based would end up having our functions that return single characters have to return Array[U8] again instead of U32. A significantly higher lift for the 99% use-case.

rowland66 added 2 commits August 2, 2020 20:47

Initial commit of unicode string rfc.

9518166

Minor edits to unicode string rfc

0183a7a

jemc reviewed Aug 3, 2020

View reviewed changes

mfelsche reviewed Aug 3, 2020

View reviewed changes

Made a number of changes based on rfc feedback.

56e1a46

rowland66 added 2 commits September 8, 2020 09:25

Incorporated Joe Eli McIlvain's changes.

151af1f

Removed String implementation change to use embedded array.

8101a70

ramonacat reviewed Oct 20, 2020

View reviewed changes

This was referenced Dec 1, 2020

Changes for rfc-179 unicode string ponylang/ponyc#3679

Closed

Changes resulting from RFC 179 Unicode-String. ponylang/corral#161

Closed

Base automatically changed from master to main February 8, 2021 22:15

SeanTAllen mentioned this pull request Feb 27, 2021

String class is error prone and has usability issues #107

Open


		Change Writer in buffered/writer.pony to accept Encoder parameters in the write() and writev() functions.

		Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.

	Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.
	Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.


		# Summary

		Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value.


		# Detailed design

		Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32).

Unicode String #179

Are you sure you want to change the base?

Unicode String #179

Conversation

rowland66 commented Aug 3, 2020

jemc left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jasoncarr0 Aug 5, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mfelsche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

redvers Mar 4, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rowland66 commented Aug 4, 2020 via email

rowland66 commented Aug 4, 2020 via email

jemc commented Aug 4, 2020

jemc commented Aug 4, 2020

rowland66 commented Aug 4, 2020 via email

jasoncarr0 commented Aug 5, 2020 • edited Loading

rowland66 commented Aug 7, 2020 via email

SeanTAllen commented Aug 11, 2020

jemc commented Aug 18, 2020

jasoncarr0 commented Aug 18, 2020 • edited Loading

rowland66 commented Aug 20, 2020 via email

jasoncarr0 commented Aug 20, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rowland66 commented Oct 21, 2020 via email

jasoncarr0 commented Dec 4, 2020 • edited Loading

jasoncarr0 commented Dec 8, 2020 • edited Loading

rowland66 commented Dec 9, 2020 via email

jasoncarr0 commented Dec 9, 2020 • edited Loading

rowland66 commented Dec 11, 2020 via email

jasoncarr0 commented Dec 11, 2020 • edited Loading

jasoncarr0 commented Dec 11, 2020 • edited Loading

esoterra commented Feb 28, 2021 • edited Loading

rowland66 commented Mar 1, 2021

esoterra commented Mar 1, 2021 • edited Loading

RE: Negative Indices

RE: Rust

RE: Python

What would it look like to "follow Rust's example"?

jemc commented Mar 2, 2021

esoterra commented Mar 2, 2021 • edited Loading

As a change to Pony today

esoterra commented Mar 2, 2021 • edited Loading

jasoncarr0 commented Mar 2, 2021 • edited Loading

esoterra commented Mar 2, 2021 • edited Loading

rowland66 commented Mar 3, 2021 • edited Loading

esoterra commented Mar 3, 2021

1. Principle of Least Surprise

2. Performance

jasoncarr0 commented Mar 3, 2021 • edited Loading

rowland66 commented Mar 3, 2021

esoterra commented Mar 3, 2021

esoterra commented Mar 3, 2021 • edited Loading

jasoncarr0 commented Mar 4, 2021

esoterra commented Mar 4, 2021

jasoncarr0 commented Mar 4, 2021

redvers commented Mar 4, 2021

redvers commented Mar 4, 2021

esoterra commented Mar 4, 2021 • edited Loading

redvers commented Mar 4, 2021 • edited Loading

jasoncarr0 Aug 5, 2020 •

edited

Loading

redvers Mar 4, 2021 •

edited

Loading

jasoncarr0 commented Aug 5, 2020 •

edited

Loading

jasoncarr0 commented Aug 18, 2020 •

edited

Loading

jasoncarr0 commented Aug 20, 2020 •

edited

Loading

jasoncarr0 commented Dec 4, 2020 •

edited

Loading

jasoncarr0 commented Dec 8, 2020 •

edited

Loading

jasoncarr0 commented Dec 9, 2020 •

edited

Loading

jasoncarr0 commented Dec 11, 2020 •

edited

Loading

jasoncarr0 commented Dec 11, 2020 •

edited

Loading

esoterra commented Feb 28, 2021 •

edited

Loading

esoterra commented Mar 1, 2021 •

edited

Loading

esoterra commented Mar 2, 2021 •

edited

Loading

esoterra commented Mar 2, 2021 •

edited

Loading

jasoncarr0 commented Mar 2, 2021 •

edited

Loading

esoterra commented Mar 2, 2021 •

edited

Loading

rowland66 commented Mar 3, 2021 •

edited

Loading

jasoncarr0 commented Mar 3, 2021 •

edited

Loading

esoterra commented Mar 3, 2021 •

edited

Loading

esoterra commented Mar 4, 2021 •

edited

Loading

redvers commented Mar 4, 2021 •

edited

Loading