String basic units #2112

hydroper · 2022-08-22T12:17:16Z

This is related to issue #2065. Has Carbon ever considered whether it is possible to implement "code-point" strings, like Python's str? Such a type is obviously incompatible with C++ strings, but it could work as an alternative string type. C++ strings are usually treated as UTF-8 encoded; Python strings, by contrast, don't expose any particular encoding.

var s:String = '\u{10ffff}';
s.length == 1;
s.charCodeAt(0) == 0x10ffff;
s.slice(0, 1) == '\u{10ffff}';
String.fromCharCode(0x10ffff).length == 1;

Python represents strings in different ways depending on which code points they contain. For example, if all code points in a string are less than U+100, it stores 1 byte per character internally.
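
The behaviour described above can be checked from Python itself. The sketch below assumes CPython 3.3 or later (other implementations may store strings differently), and `bytes_per_code_point` is a hypothetical helper, not a standard function: it estimates the per-character storage by comparing the sizes of two strings of different lengths.

```python
import sys


def bytes_per_code_point(ch: str) -> int:
    """Estimate CPython's per-character storage for a string made only of `ch`."""
    one, many = ch * 1, ch * 1001
    # The fixed object header is the same for both strings, so it cancels out.
    return (sys.getsizeof(many) - sys.getsizeof(one)) // 1000


# Code-point semantics: one code point is one element, whatever its value.
s = "\U0010FFFF"
assert len(s) == 1
assert ord(s[0]) == 0x10FFFF
assert chr(0x10FFFF) == s

# Storage width grows with the largest code point in the string (CPython-specific).
widths = {ch: bytes_per_code_point(ch) for ch in ("a", "\u00e9", "\u0100", "\U0001F600")}
for ch, width in widths.items():
    print(f"U+{ord(ch):04X}: {width} byte(s) per code point")
```

The width is chosen per string, so a single code point above U+FFFF makes every character in that string take 4 bytes.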

OlaFosheimGrostad · 2022-08-22T15:17:25Z

In C++ I think std::u8string, std::u8string_view, char8_t, and std::span<char8_t> can all be tied to UTF-8. C++ uses char16_t and char32_t for the UTF-16 and UTF-32 encodings. I believe the encodings of std::string and std::wstring, with char and wchar_t, are platform specific in the recent C++ standard?

I think it would be very difficult to deal with an unspecified encoding in system-level programming. And C++ already has UTF-32 if you want code points, but that is very wasteful.
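
To make the "wasteful" point concrete, here is a small sketch that compares encoded sizes for a few sample strings. It uses Python only because its codecs are convenient for measuring; `encoded_sizes` and the sample strings are made up for illustration.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes `text` occupies in each Unicode encoding form."""
    # The "-le" variants are used so that no byte order mark is added to the count.
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for sample in ("hello", "héllo", "こんにちは", "😀"):
    print(repr(sample), encoded_sizes(sample))
```

For ASCII-heavy text, UTF-32 takes four times the space of UTF-8, which is the cost of getting O(1) indexing by code point.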

It might make some sense for an OS/platform-specific OsString type to have an unspecified encoding and only be accessible as raw bytes in Carbon, in addition to being convertible to/from UTF-8 and UTF-32.

On the other hand, this could be done just as well as a standard library type, I think?

hydroper · 2022-08-22T16:03:35Z

> On the other hand, this could be done just as well as a standard library type, I think?

That's kind of what I'm thinking. Besides Carbon providing specific encoding strings such as UTF-8 encoded ones, it can provide a multi-view string type like Python's str (possibly interned?), which uses the same optimization technique:

> if all code points are less than U+100, then it stores 1 byte per character internally.

It uses 2 bytes per character when at least one code point is greater than U+FF and, finally, 4 bytes per character if any code point is greater than U+FFFF.
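
As a rough sketch of that technique, here is how a library type could pick the element width from the largest code point and still offer O(1) indexing. `FlexString` and its methods are hypothetical names used only for illustration; this is neither CPython's actual implementation nor a proposed Carbon API.

```python
class FlexString:
    """Hypothetical fixed-width code-point string (illustration only)."""

    def __init__(self, text: str) -> None:
        code_points = [ord(c) for c in text]
        highest = max(code_points, default=0)
        # Pick the narrowest width that fits every code point in the string.
        if highest <= 0xFF:
            self.width = 1
        elif highest <= 0xFFFF:
            self.width = 2
        else:
            self.width = 4
        self.data = b"".join(cp.to_bytes(self.width, "little") for cp in code_points)

    def __len__(self) -> int:
        # Length is counted in code points, not bytes.
        return len(self.data) // self.width

    def code_point_at(self, index: int) -> int:
        if not 0 <= index < len(self):
            raise IndexError("code point index out of range")
        # Fixed-width elements make the byte offset a simple multiplication.
        start = index * self.width
        return int.from_bytes(self.data[start:start + self.width], "little")


s = FlexString("a\U0010FFFF")
print(len(s), s.width, hex(s.code_point_at(1)))
```

The trade-off is visible in the sketch: appending a single wide code point to a narrow string would force the whole buffer to be re-encoded at the larger width.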

AFAIK the Swift language also has a complex string type, but I think it's not code point oriented.

OlaFosheimGrostad · 2022-08-24T13:45:13Z

> That's kind of what I'm thinking. Besides Carbon providing specific encoding strings such as UTF-8 encoded ones, it can provide a multi-view string type like Python's str (possibly interned?), which uses the same optimization technique:

Python interop would in general be interesting, I think.

One interesting project might be to look at the possibility of creating a library for pickle/unpickle with tagged unions that can hold most of the basic Python types.

https://docs.python.org/3/library/pickle.html

Not sure how difficult that would be, but it could be very useful.
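
For a feel of what such a library would have to decode, here is a Python-side sketch that pickles a few basic types and lists the opcodes in the resulting stream. Roughly speaking, each opcode says which kind of value follows, so it plays a role similar to the tag of a tagged union. The sample `value` is made up, and nothing here is a Carbon API.

```python
import pickle
import pickletools

# A value built only from basic Python types.
value = {
    "name": "Carbon",
    "count": 3,
    "ratio": 0.5,
    "tags": ["a", "b"],
    "raw": b"\x00\x01",
    "none": None,
}

payload = pickle.dumps(value, protocol=4)

# genops yields (opcode, argument, position) for each instruction in the stream.
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(payload)]
print(sorted(set(opcode_names)))

# The stream round-trips back to an equal value.
assert pickle.loads(payload) == value
```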

nacaclanga · 2022-08-24T15:05:38Z

I'd say the number of applications where you need random access by Unicode scalar value is quite limited. In most cases, you need read-only iteration or access to cached locations, both of which can be achieved with any specific string type (Case A). Even more limited is the number of applications where you only need read-only random access by Unicode scalar value (Case B). Hence, in most cases, a Python-like representation is either unnecessary (Case A) or insufficient (Case B). I think Carbon is well served by having a UTF-8 based string type for most applications, byte strings for raw byte / undecoded / exotically encoded string operations, UTF-16 for ABI compatibility, and UTF-32 for applications where random access by Unicode scalar value is needed. In each case a Python-like variable encoding would be inferior.
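
A minimal sketch of Case A, written in Python for brevity: iterating over UTF-8 bytes by scalar value while recording byte offsets, so that a cached offset can be revisited later without random access by index. `iter_scalars` is a hypothetical helper and assumes its input is valid UTF-8.

```python
from typing import Iterator, Tuple


def iter_scalars(data: bytes) -> Iterator[Tuple[int, int]]:
    """Yield (byte_offset, scalar_value) pairs from valid UTF-8 bytes."""
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte alone determines how long the encoded sequence is.
        if lead < 0x80:
            n = 1
        elif lead >> 5 == 0b110:
            n = 2
        elif lead >> 4 == 0b1110:
            n = 3
        else:
            n = 4
        yield i, ord(data[i:i + n].decode("utf-8"))
        i += n


data = "aé€😀".encode("utf-8")
positions = list(iter_scalars(data))
print(positions)

# A cached byte offset gives direct access to that scalar value later on.
offset_of_euro = positions[2][0]
assert data[offset_of_euro:offset_of_euro + 3].decode("utf-8") == "€"
```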

Notice that one of the reasons Python adopted this layout was that the string datatype previously supported random access and also used this format to access cached locations, which the designers had to consider.
