Use UTF-8 internally for strings. #684

Open
markshannon opened this issue Jun 12, 2024 · 17 comments

Comments

markshannon (Member) commented Jun 12, 2024

I'm sure this has been discussed elsewhere and I don't know when, or if, we'll have time to implement it, but I think it's worth adding here.

Currently, Python strs (PyUnicodeObject in C) are implemented as arrays of 1-, 2- or 4-byte Unicode code points, plus a header.
I propose that we implement them as a UTF-8-encoded array of bytes, plus a header.

The advantages are numerous:

  • Non-ASCII strings will generally be more compact, without making ASCII strings any bigger.
  • Strings can be joined easily by simple concatenation of the data.
  • The internet is UTF-8, so the vast majority of encoding and decoding operations should be fast.
  • There need only be one implementation of each str method and each PyUnicode_ C function, saving considerable code size and simplifying the code.
  • Algorithms for fast operations on UTF-8 are well known, so many operations will be fast, despite the variable-length encoding.
  • The C struct is clearer, as we don't need an awkward union of uint8_t, uint16_t and uint32_t for character data.

However there are two problems:

  • Indexing into strings is supposed to be O(1).
  • Some of the C API exposes the internal encoding

Keeping indexing O(1)

To maintain this property we will have to lazily create an offset table for larger, non-ASCII strings.
This won't be a problem in practice because:

  • Creating the index is no more expensive than creating the current UCS1/2/4 strings.
  • We only allocate indexes if we need them, which should be relatively rare.

The C API

We will have to deprecate, and then remove, the C API that exposes the implementation details.
We should probably deprecate in 3.14, so that we can implement UTF-8 strings in 3.16, allowing a proper deprecation period.

Implementation

We will probably want to embed the string data directly into the object, so the struct will look something like this:

typedef struct {
    PyObject_HEAD
    uintptr_t interned: 2;      /* Interning state */
    uintptr_t ascii: 1;         /* All code points are < 128 */
    uintptr_t valid_utf8: 1;    /* Data is valid UTF-8 (see below) */
    uintptr_t length: (WORD_SIZE-4);    /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    PyUnicodeIndex *index;      /* NULL unless needed */
    uint8_t data[1];
} PyUnicodeObject;

The valid_utf8 bit helps fast encoding. It is false if the string contains lone surrogates or any other code point not allowed in legal UTF-8.
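
For illustration, the fast path for encoding could look something like this (a sketch only, not existing CPython API; it assumes the byte length of the data is tracked, like the byte_count field added in a follow-up comment below, and encode_utf8_slow() is hypothetical):

/* Sketch: when the valid_utf8 bit is set, encoding to UTF-8 is a plain copy
   of the internal data; otherwise a slow path has to deal with lone
   surrogates and error handlers. */
static PyObject *
unicode_encode_utf8(PyUnicodeObject *s)
{
    if (s->valid_utf8) {
        return PyBytes_FromStringAndSize((const char *)s->data, (Py_ssize_t)s->byte_count);
    }
    return encode_utf8_slow(s);   /* hypothetical fallback */
}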

Indexing operations

getitem(self, index) would be implemented something like this:

def getitem(self, index):
    if self.ascii:
        return self.data[index]
    if self.index is NULL:
        self.index = make_index(self)
    offset = offset_from_index(self.index, index)
    return read_one_char(self.data, offset)

The index table would be composed of len(s)/64 entries, each entry being:

struct index_entry {
     uintptr_t base_offset;
     uint8_t additional_offset[64];
};

With the offset being computed as base_offset[index/64] + additional_offset[index%64].
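
As a rough illustration (not part of the proposal text), the index could be built in a single pass over the UTF-8 data. make_index() appears in the pseudocode above; utf8_char_len() and the treatment of PyUnicodeIndex as a bare array of index_entry records are assumptions:

/* Sketch: record the byte offset of every 64th code point plus per-code-point
   deltas within each block. The delta is at most 63 * 4, so it fits a uint8_t. */
static PyUnicodeIndex *
make_index(const PyUnicodeObject *s)
{
    size_t n_entries = (s->length + 63) / 64;
    struct index_entry *entries = PyMem_Malloc(n_entries * sizeof(*entries));
    if (entries == NULL) {
        return NULL;
    }
    size_t offset = 0;
    for (size_t cp = 0; cp < s->length; cp++) {
        struct index_entry *e = &entries[cp / 64];
        if (cp % 64 == 0) {
            e->base_offset = offset;
        }
        e->additional_offset[cp % 64] = (uint8_t)(offset - e->base_offset);
        offset += utf8_char_len(s->data[offset]);   /* 1-4, from the first byte */
    }
    return (PyUnicodeIndex *)entries;
}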

@markshannon (Member Author)

We will also need the length in bytes, which adds another uintptr_t field that I forgot above.

typedef struct {
    PyObject_HEAD
    uintptr_t interned: 2;   
    uintptr_t ascii: 1;   
    uintptr_t valid_utf8: 1;   
    uintptr_t length: (WORD_SIZE-4);    /* Number of code points in the string */
    uintptr_t byte_count;      /* Number of bytes in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    PyUnicodeIndex *index;    /* NULL unless needed */
    uint8_t data[1];
} PyUnicodeObject;

On 64-bit machines that's 48 bytes of header, compared to the current 40 bytes, unfortunately.

Given that for most strings len(s) is much less than the maximum value, we can reduce this to 32 bytes for ASCII strings and 40 bytes for most other strings:

typedef struct {
    PyObject_HEAD
    uint32_t is_small: 1;   
    uint32_t is_ascii: 1;   
    uint32_t interned: 2;   
    uint32_t ascii: 1;   
    uint32_t valid_utf8: 1;   
    uint32_t small_length: 24;    /* Number of code points in the string */
    uint32_t small_count;      /* Number of bytes in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
} PyUnicodeObjectBase;

typedef struct {
    PyUnicodeObjectBase base;
    uint8_t data[1];
} PySmallAsciiObject;

typedef struct {
    PyUnicodeObjectBase base;
    PyUnicodeIndex *index;    /* NULL unless needed */
    uint8_t data[1];
} PySmallUnicodeObject;

typedef struct {
    PyUnicodeObjectBase base;
    PyUnicodeIndex *index;    /* NULL unless needed */
    uint64_t large_length;    /* Number of code points in the string */
    uint64_t large_count;      /* Number of bytes in the string */
    uint8_t data[1];
} PyLargeUnicodeObject;

typedef union { PySmallAsciiObject ascii; PySmallUnicodeObject small; PyLargeUnicodeObject large; } PyUnicodeObject;

This makes the code more complex, but still considerably simpler than what we have now.
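
As a small illustration of the dispatch this implies (sketch only; unicode_length() is hypothetical, field names as in the structs above):

/* Sketch: length lookup under the split layout. Reading the common base
   through any union member is fine because all three share an initial
   PyUnicodeObjectBase sequence. */
static uint64_t
unicode_length(const PyUnicodeObject *u)
{
    if (u->ascii.base.is_small) {
        return u->ascii.base.small_length;   /* 24-bit length, shared by both small layouts */
    }
    return u->large.large_length;
}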

mdboom (Contributor) commented Jun 14, 2024

It does seem that the world at large has come around to UTF-8, with fixed-size encodings feeling more like legacy at this point. See UTF-8 Everywhere.

The relevant PEP 393 provides background for the current state of Python strings.

The effects on the C API and the runtime/memory semantic changes seem big enough that this should also have a PEP, IMHO.

cfbolz commented Jun 17, 2024

PyPy has been using this approach for a number of years and we are indeed very happy with it (we even have zero-copy UTF-8 decoding of bytes, at the cost of an extra indirection from our equivalent of the PyUnicode structure to the actual string data).

One of the most annoying parts of this approach is actually str.find for non-ASCII strings. The bytes find algorithm can be used just fine on the underlying bytes. However, the resulting byte index needs to be converted to a codepoint index, which is the opposite direction from what the index structure speeds up. What we do there is a conversion in O(log n) time using the index and a bit of local counting. A similar problem exists in our regex engine, but I don't know enough about CPython's to know whether that would affect you.
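
A sketch of that byte-to-codepoint conversion, expressed in terms of the index layout from the top of this issue (the function and helper names are made up here; PyPy's actual code differs):

/* Sketch: binary-search for the last 64-code-point block starting at or
   before byte_index (O(log n)), then count code points locally (at most 63 steps). */
static size_t
byte_index_to_codepoint_index(const struct index_entry *entries, size_t n_entries,
                              const uint8_t *data, size_t byte_index)
{
    size_t lo = 0, hi = n_entries;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (entries[mid].base_offset <= byte_index)
            lo = mid;
        else
            hi = mid;
    }
    size_t cp = lo * 64;
    size_t offset = entries[lo].base_offset;
    while (offset < byte_index) {
        offset += utf8_char_len(data[offset]);   /* 1-4 bytes, from the first byte */
        cp++;
    }
    return cp;
}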

I've been vaguely considering rewriting the string find algorithm to also count the code points it skips over, but haven't gotten around to that yet.

(The JIT actually helps us here sometimes, because it knows that the functions byte_index_to_codepoint_index and codepoint_index_to_byte_index are inverses of each other. This helps if, after a str.find call, the result is then used to index into the string. But that is very hard to transfer to CPython, I think.)

@brandtbucher (Member)

Indexing operations

getitem(self, index) would be implemented something like this:

def getitem(self, index):
     if self.ascii:
         return self.data[index]
    if self.index is NULL:
        self.index = make_index(self)
    offset = offset_from_index(self.index, index)
    return read_one_char(self.data, offset)

The index table would be composed of len(s)/64 entries, each entry being:

struct index_entry {
     uintptr_t base_offset;
     uint8_t additional_offset;
};

With the offset being computed as base_offset[index/64] + additional_offset[index%64].

Can you elaborate on this more? This scheme seems clever, but after staring at it a while I still have no idea how it is supposed to work.

I'm assuming that you meant index_entries[index/64].base_offset + index_entries[index%64].additional_offset. But then "len(s)/64" entries seems wrong... the rest of the scheme seems to imply that there are at least max(len(s)/64, min(len(s), 64)) entries. Or maybe the struct is wrong and there is one array of len(s)/64 base offsets and another array of min(len(s), 64) additional offsets.

Either way, it seems like there are offsets that are impossible to encode... for example, a string with 64 one-byte characters followed by a bunch of two-byte characters, I think?

gnprice commented Jul 22, 2024

I think the struct for an index entry may have been intended to be:

struct index_entry {
  uintptr_t base_offset;
  uint8_t additional_offsets[64];
};

and then lookup looks like

  entry = index_entries[index/64];
  offset = entry.base_offset + entry.additional_offsets[index%64];

This works because the UTF-8 encoding of a code point is at most 4 bytes, so the largest value one can need to store in additional_offsets is at most 63 * 4, which fits in a uint8_t. Effectively it's a lightweight compression scheme on having just an array of all the offsets.


A possible variation here, trading a bit of computation at lookup time for further reducing the size of the index, would be to have a sparse list of offsets (like only every 2 or 4 or 8 code points), and then partially decode the UTF-8 to skip past a few code points as necessary. Like this:

// or this could be 2 or 4, or 16
#define INDEX_SKIP_FACTOR 8

struct index_entry {
  uintptr_t base_offset;
  uint8_t additional_offsets[64 / INDEX_SKIP_FACTOR];
};

// …

  entry = index_entries[index/64];
  offset = entry.base_offset + entry.additional_offsets[(index%64) / INDEX_SKIP_FACTOR];
  for (index = index % INDEX_SKIP_FACTOR; index > 0; index--) {
    offset += utf8_codepoint_length_from_first_byte(data[offset]);
  }

where utf8_codepoint_length_from_first_byte is a small helper function from the UTF-8 decoder.

With the first scheme above (my interpretation of what Mark had in mind in the OP) the index is fairly large (about 112% of the size of the string itself if the string is long but mostly ASCII), so this sort of further compression might be worthwhile. With INDEX_SKIP_FACTOR of 8, the index would be 25% of the size of the string itself in that case.

For short strings, the relative size of the index can be greater: 72 bytes, reduced to 16 bytes using a skip factor of 8.
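
(For concreteness: each index entry covers 64 code points, so for mostly-ASCII text that is 8 + 64 = 72 index bytes per roughly 64 bytes of string data, about 112.5%; with a skip factor of 8 it is 8 + 64/8 = 16 bytes per block, 25%. Any string that needs an index at all pays for at least one full entry, hence the 72-byte or 16-byte floor for short strings.)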

@markshannon (Member Author)

I think the struct for an index entry may have been intended to be:

struct index_entry {
  uintptr_t base_offset;
  uint8_t additional_offsets[64];
};

Yes, thanks. I've corrected the original.

@markshannon (Member Author)

A possible variation here ... partially decode the UTF-8

My thinking was that either you don't need an index (all ASCII strings and the majority of strings, where indexing isn't used), or you do. If you do need indexing you'll want it to be fast, and having to parse the UTF-8 might be too slow.

so this sort of further compression might be worthwhile.

The nice thing about this is that we can experiment with various schemes, to find the sweet spot.

@markshannon (Member Author)

This works because the UTF-8 encoding of a code point is at most 4 bytes, so the largest value one can need to store in additional_offsets is at most 63 * 4, which fits in a uint8_t. Effectively it's a lightweight compression scheme on having just an array of all the offsets.

Yes, exactly. Thanks for the clarification.

gnprice commented Jul 23, 2024

My thinking was that either you don't need an index (all ASCII strings and the majority of strings, where indexing isn't used), or you do.

Reasonable! And yeah, this should lend itself very well to experimentation.

ramalho commented Aug 15, 2024

My take is that my_str[n] is extremely rare with long strings. In general, we use __getitem__ with very small strings that are database fields, log lines, and other structured or semi-structured data. Retrieving bible_full_text[1_000_000] is not useful and very rare in practice.

With .find() a sequential scan is inevitable anyway.

I've seen papers about fast UTF-8 indexing, and one of them proposed an index entry for every 1000 or so characters. I am sorry but I don't have the reference to share now.

This would be a great enhancement proposal. In Fluent Python I wrote that the current scheme means that concatenating a single 🐀 (U+1F400) onto "pure" Latin-1 text produces a str that takes 4 times as much memory as the original 😱.

ramalho commented Aug 15, 2024

Python's current scheme uses 1 byte per character for Latin-1 strings, not just ASCII. Latin-1 text is ASCII letters with some accents (e.g. "façade") plus curly braces, m-dashes, and other common symbols used in typeset text. In UTF-8, non-ASCII Latin-1 characters take 2 bytes each instead of 1 in the current scheme. I believe migrating to UTF-8 is a good idea, but just beware of the impact on text in Western European languages that use Latin-1, which is mostly ASCII with accents here and there.

@vstinner

We will have to deprecate, and then remove, the C API that exposes the implementation details.

I don't think that we can simply remove functions without providing better replacements.

I added a new PyUnicodeWriter API to Python 3.14 to create a string without leaking implementation details. It may be a replacement for PyUnicode_WRITE() for example.

We should get rid of functions that write directly into Python strings (PyUnicode_Fill(), PyUnicode_CopyCharacters(), etc.), since strings are immutable! I would prefer abstractions such as PyUnicodeWriter, which only produce a Python str object once the caller is done writing.
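
For illustration, the writer pattern looks roughly like this (based on the PyUnicodeWriter API added in 3.14; build_greeting() is a made-up example, and exact signatures should be checked against the 3.14 docs):

/* Sketch: accumulate text in the writer and only materialize the immutable
   str at the very end. */
PyObject *
build_greeting(const char *name)
{
    PyUnicodeWriter *writer = PyUnicodeWriter_Create(0);
    if (writer == NULL) {
        return NULL;
    }
    if (PyUnicodeWriter_WriteUTF8(writer, "hello ", -1) < 0
        || PyUnicodeWriter_WriteUTF8(writer, name, -1) < 0) {
        PyUnicodeWriter_Discard(writer);
        return NULL;
    }
    return PyUnicodeWriter_Finish(writer);   /* produces the final str object */
}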

Allowing an immutable str to be modified is a nightmare for caches. How can you make sure that a string is ASCII if it can be modified, for example?

I'm also working on PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions, which is essentially an abstraction over the "kind" and "data" parts of the PEP 393 C APIs.

@vstinner

    uintptr_t ascii: 1;   
    uintptr_t valid_utf8: 1;   

Can we afford to use 2 bits for each of these members, so that they can start out uninitialized and only be computed on demand? Checking whether a string is ASCII or valid UTF-8 is an O(n) operation.
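
Something along these lines, as a sketch (unicode_check_ascii() is hypothetical and assumes the ascii field is widened to 2 bits; valid_utf8 could be handled the same way):

enum tristate { FLAG_UNKNOWN = 0, FLAG_NO = 1, FLAG_YES = 2 };

/* Sketch: the property starts as FLAG_UNKNOWN and is computed at most once,
   on first use, by an O(n) scan of the data. */
static int
unicode_check_ascii(PyUnicodeObject *s)
{
    if (s->ascii == FLAG_UNKNOWN) {
        int all_ascii = 1;
        for (uintptr_t i = 0; i < s->byte_count; i++) {
            if (s->data[i] & 0x80) {
                all_ascii = 0;
                break;
            }
        }
        s->ascii = all_ascii ? FLAG_YES : FLAG_NO;
    }
    return s->ascii == FLAG_YES;
}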

methane commented Sep 23, 2024

PyUnicode_WriteChar() is part of the stable ABI. And PyUnicode_New() is a recommended API, although it is not part of the stable ABI.
I think we need a deprecation period much longer than usual. The legacy Unicode APIs were removed 10+ years after PEP 393 was implemented. After that period, PyUnicode_WriteChar() would always return -1.

During the deprecation period, we need to support PyUnicode_New() + PyUnicode_WriteChar() in some way.

I removed the "legacy" Unicode object when I removed the wchar_t* representation. We can revive it for the PEP 393 representation. Its memory layout separates the Python object from its data.

Or we can over-allocate memory and convert in place, so that a contiguous memory layout is always used:

  • When maxchar is 127, it's ASCII. No conversion is needed.
  • When maxchar is 255, it's Latin-1. We need to allocate length*2 for UTF-8. That is 2x larger than Latin-1.
  • When maxchar is 65535, we need to allocate length*3 for UTF-8. That is 1.5x larger than UCS-2.
  • When maxchar is 1114111, we need to allocate length*4 for UCS-4. That is 4/3 larger than UTF-8.

PyUnicode_READY() would do the conversion, as we did for legacy Unicode objects.
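
A sketch of the over-allocation sizes implied by the list above (write_buffer_size() is a hypothetical helper, not an existing API):

/* Sketch: bytes to allocate so the buffer can hold both the PEP 393 form
   used while writing and the UTF-8 form after an in-place conversion. */
static size_t
write_buffer_size(uint32_t maxchar, size_t length)
{
    if (maxchar < 128)
        return length;          /* ASCII is already UTF-8 */
    if (maxchar < 256)
        return length * 2;      /* Latin-1: at most 2 UTF-8 bytes per character */
    if (maxchar < 65536)
        return length * 3;      /* BMP: at most 3 UTF-8 bytes per character */
    return length * 4;          /* UCS-4 write buffer; UTF-8 never needs more */
}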

vasily-v-ryabov commented Oct 9, 2024

Checking if a string is ASCII or if it's valid UTF-8 are O(n) operations.

@vstinner
We can implement a method is_ascii() extremely quickly: return byte_count == length. Of course, this holds only if UTF-8 validation has already passed: for example, when the string is loaded from a *.pyc file (as I understand it, UTF-8 validation happens at compile time).
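
In terms of the fields from the proposed struct, the check is just (sketch; is_ascii_by_counts() is a made-up name):

/* Sketch: if the data is already known to be valid UTF-8 and the byte count
   equals the code-point count, every code point took exactly one byte,
   i.e. the string is ASCII. */
static int
is_ascii_by_counts(const PyUnicodeObject *s)
{
    return s->valid_utf8 && s->byte_count == s->length;
}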

Mojo🔥 implementation also looks very promising in general: modularml/mojo#3401 (this is in fact merged to internal OpenAI branch). It looks even faster than simdutf.

ramalho commented Oct 9, 2024 via email

@vasily-v-ryabov

@ramalho thanks. According to this table https://www.ascii-code.com/, 127 is still an ASCII symbol, so the check should be necessary and sufficient. Only code points 128 and above become 2-byte (or longer) sequences in UTF-8.
