String basic units #2112
Replies: 4 comments
-
In C++, I think std::u8string, std::u8string_view, char8_t, and std::span<char8_t> can all be tied to UTF-8, while char16_t and char32_t are used for the UTF-16 and UTF-32 encodings. I believe std::string and std::wstring, with char and wchar_t, are platform-specific in recent C++ standards? I think it would be very difficult to deal with an unspecified encoding in system-level programming. C++ already has UTF-32 if you want code points, but that is very wasteful. It might make some sense for an OS/platform-specific OsString type to have an unspecified encoding and be accessible only as raw bytes in Carbon, in addition to being convertible to/from UTF-8 and UTF-32. On the other hand, this could be done just as well as a standard library type, I think?
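To put rough numbers on the "UTF-32 is wasteful" point, here is a minimal Python sketch that counts the bytes the same text needs in each encoding form. The helper name `encoded_sizes` and the sample strings are made up for illustration; only the standard `str.encode` codecs are used.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return how many bytes `text` occupies in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" codec variants are used so no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Illustrative samples: ASCII, Latin-1 range, CJK, and one non-BMP emoji.
samples = ["hello", "h\u00e9llo", "\u65e5\u672c\u8a9e", "\U0001F600"]
for sample in samples:
    print(ascii(sample), encoded_sizes(sample))
```

For mostly-ASCII text, UTF-32 takes four times the space of UTF-8, which is presumably the waste being referred to; for text outside the Basic Multilingual Plane the three forms end up the same size.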
-
That's kind of what I'm thinking. Besides providing strings with a specific encoding, such as UTF-8, Carbon could provide a multi-view string type like Python's str, which uses 2 bytes per character when there is at least one code point greater than U+FF and, finally, 4 bytes per character if any code point is greater than U+FFFF. AFAIK the Swift language also has a complex string type, but I think it's not code point oriented.
-
Python interop would in general be interesting, I think. One interesting project might be to look at the possibility of creating a library for pickle/unpickle with tagged unions that can hold most of the basic Python types. https://docs.python.org/3/library/pickle.html Not sure how difficult that would be, but it could be very useful.
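For a feel of what such a library would have to read, here is a small Python sketch that pickles a few basic types and lists the opcode names in the resulting stream. The sample value is made up; `pickle.dumps`, `pickle.loads`, and `pickletools.genops` are standard library calls. How those opcodes would map onto tagged unions is left open here.

```python
import pickle
import pickletools

# An illustrative value built only from basic Python types.
value = {
    "text": "carbon",
    "count": 42,
    "ratio": 0.5,
    "items": ["utf-8", b"\x00\xff"],
    "missing": None,
}

# Protocol 4 is pinned so the stream layout does not depend on the
# interpreter's default protocol.
payload = pickle.dumps(value, protocol=4)

# Each pickle opcode acts like a tag followed by an optional argument.
opcodes = [opcode.name for opcode, arg, pos in pickletools.genops(payload)]
restored = pickle.loads(payload)

print(opcodes)
print(restored == value)
```

A reader on the Carbon side would need one union alternative per value kind it chooses to support, and would have to interpret the stack-based opcodes to rebuild containers.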
-
I'd say the number of applications where you need random access by Unicode scalar value is quite limited. In most cases, you need read-only iteration or access to cached locations, both of which can be achieved with any specific string type (Case A). Even more limited is the number of applications where you need only read-only random access by Unicode scalar value (Case B). Hence, in most cases, a Python-like representation is either unnecessary (Case A) or insufficient (Case B). I think Carbon is well served by having a UTF-8 based string type for most applications, byte strings for raw-byte / undecoded / exotically encoded string operations, UTF-16 for ABI compatibility, and UTF-32 for applications where random access by Unicode scalar value is needed. In each case, a Python-like variable encoding would be inferior. Note that one of the reasons Python adopted this layout was that its string type previously supported random access and also used that format to access cached locations, which the designers had to take into account.
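Case A can be sketched in a few lines of Python: walk a UTF-8 byte string once, yielding each scalar value together with its byte offset, and keep the offsets you care about as cached locations. The function names `scalar_length` and `iter_scalars` are hypothetical, and the sketch assumes well-formed UTF-8 input.

```python
def scalar_length(lead: int) -> int:
    """Return the encoded length implied by a UTF-8 lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError(f"not a UTF-8 lead byte: {lead:#x}")


def iter_scalars(data: bytes):
    """Yield (byte_offset, character) pairs in a single forward pass."""
    offset = 0
    while offset < len(data):
        length = scalar_length(data[offset])
        yield offset, data[offset:offset + length].decode("utf-8")
        offset += length


# Illustrative input: 1-, 2-, 3- and 4-byte sequences in a row.
data = "a\u00e9\u20ac\U0001F600".encode("utf-8")
offsets = [offset for offset, _ in iter_scalars(data)]

# A cached byte offset gives constant-time access back to that position.
cached = offsets[2]
resumed = next(iter_scalars(data[cached:]))[1]
```

Iteration is sequential, but any position found once can be revisited through its byte offset, so no fixed-width representation is needed for this access pattern.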
-
This is related to issue #2065. Did Carbon ever consider whether it is possible to implement "code-point" strings, like Python's str? It's obviously incompatible with C++ strings, but it could work as an alternative string type. C++ strings are UTF-8 encoded; Python strings, by contrast, aren't tied to any one encoding. Python represents strings in different ways depending on which code points they contain. For example, if all code points are less than U+100, it stores 1 byte per character internally.
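The width selection described in this thread can be sketched as a small Python function. The name `storage_width` is made up for illustration; it only mirrors the rule as stated here (1, 2, or 4 bytes per character depending on the highest code point) and does not inspect CPython's actual internal layout.

```python
def storage_width(text: str) -> int:
    """Return the bytes per character implied by the highest code point."""
    if not text:
        # Assumption for this sketch: an empty string uses the narrowest form.
        return 1
    highest = max(map(ord, text))
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


print(storage_width("hello"))             # all code points below U+100
print(storage_width("h\u20acllo"))        # contains U+20AC
print(storage_width("hi \U0001F600"))     # contains U+1F600
```

One consequence worth noting: a single high code point widens every character in the string, which is the trade made to keep indexing by code point constant-time.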