
Graphemes


When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceivable unit of text, which the Unicode standard refers to as a grapheme. The usual way I see developers handling character-level tokenisation of English is via a list comprehension or by casting a string to a list:

>>> t1 = "transformation"
>>> [char for char in t1]
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']
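
Casting the string to a list gives the same result:

>>> list(t1)
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']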

This will give you discrete characters or codepoints. But this approach doesn't work as well for other languages. Let's take a Dinka string as an example:

>>> t2 = "dɛ̈tëicëkäŋ akɔ̈ɔ̈n"
>>> [char for char in t2]
['d', 'ɛ', '̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ', '̈', 'ɔ', '̈', 'n']

There is a mixture of precomposed and decomposed character sequences.

>>> import unicodedata as ud
>>> ud.is_normalized('NFC', t2)
True

The text is fully precomposed, using Unicode Normalization Form C, but many character sequences that should be treated as a single unit do not exist as single codepoints. How do we work with ä vs ɛ̈? Unicode defines grapheme cluster boundaries in the Unicode Text Segmentation annex ([UAX #29](https://www.unicode.org/reports/tr29/)).
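
A quick check shows why ä and ɛ̈ behave differently: ä survives NFC as a single precomposed codepoint, while ɛ̈ has no precomposed form and remains a base letter plus a combining diaeresis:

>>> len('ä'), len('ɛ̈')
(1, 2)
>>> [ud.name(c) for c in 'ɛ̈']
['LATIN SMALL LETTER OPEN E', 'COMBINING DIAERESIS']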

We can use Python packages such as grapheme and PyICU, or we can get the same results from a simple regular expression pattern.

Unicode documentation refers to legacy grapheme clusters, extended grapheme clusters, and tailored grapheme clusters. Generally, when graphemes are referred to, extended grapheme clusters are meant.

Regex

I will use the regex module rather than re, since regex has more comprehensive Unicode support.

import regex as re
re.findall(r'\X', t2)
# ['d', 'ɛ̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ̈', 'ɔ̈', 'n']

As can be seen, each grapheme is treated as a single unit: rather than splitting ɛ̈ into two codepoints, the pattern matches it as one grapheme.
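
A handy sanity check: joining the graphemes back together reproduces the original string, so the segmentation is lossless:

''.join(re.findall(r'\X', t2)) == t2
# True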

For the Dinka string, grapheme segmentation will be consistent across implementations.

If we look at a Devanagari example, we can observe differences in segmentation between implementations. On older versions of regex:

t3 = "हिन्दी"
re.findall(r'\X', t3)
# ['हि', 'न्', 'दी']

This generates three grapheme clusters for the string.

While on the latest versions:

t3 = "हिन्दी"
re.findall(r'\X', t3)
# ['हि', 'न्दी']

We will get two grapheme clusters.

Why the difference? The update to UAX #29 for Unicode 15.1 introduced a new grapheme cluster boundary rule, [GB9c](https://www.unicode.org/reports/tr29/tr29-43.html#GB9c): do not break within certain combinations with Indic_Conjunct_Break (InCB)=Linker. Prior to Unicode 15.1, grapheme boundaries for Devanagari occurred within conjunct consonants; from Unicode 15.1 onwards, conjunct consonants form a single grapheme.

The following overview of other grapheme solutions uses the latest version of each module. For the Devanagari example, modules built on a pre-Unicode 15.1 implementation will give three graphemes, and newer ones will give two. We can use PyICU and unicodedataplus to list the characters with the InCB=Linker property:

import icu
import unicodedataplus as ud
linkers = icu.UnicodeSet(r'\p{InCB=Linker}')
for linker in linkers:
    print(f'{ord(linker):04X}\t{linker}\t{ud.name(linker)}\t{ud.block(linker)}')
# 094D	्	DEVANAGARI SIGN VIRAMA	Devanagari
# 09CD	্	BENGALI SIGN VIRAMA	Bengali
# 0ACD	્	GUJARATI SIGN VIRAMA	Gujarati
# 0B4D	୍	ORIYA SIGN VIRAMA	Oriya
# 0C4D	్	TELUGU SIGN VIRAMA	Telugu
# 0D4D	്	MALAYALAM SIGN VIRAMA	Malayalam

The key difference is that from Unicode 15.1 onwards, UAX #29 treats U+094D ( ् ) DEVANAGARI SIGN VIRAMA as extending the grapheme cluster, whereas before Unicode 15.1 it did not.
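
Listing the codepoints of the Devanagari string makes the effect of the rule visible: the virama sits between न and द, so GB9c keeps the whole conjunct sequence in a single cluster (reusing the ud alias from the snippet above):

for char in t3:
    print(f'{ord(char):04X}\t{ud.name(char)}')
# 0939	DEVANAGARI LETTER HA
# 093F	DEVANAGARI VOWEL SIGN I
# 0928	DEVANAGARI LETTER NA
# 094D	DEVANAGARI SIGN VIRAMA
# 0926	DEVANAGARI LETTER DA
# 0940	DEVANAGARI VOWEL SIGN II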

Grapheme

Alternatively, we could use the grapheme package, which provides a number of functions for manipulating strings with graphemes as the basic unit, rather than the individual codepoints that standard Python string operations work on.

N.B. The grapheme package hasn't been updated since 2020, and is based on an older version of UAX #29.

import grapheme
grapheme.UNICODE_VERSION
# '13.0.0'
list(grapheme.graphemes(t2))
# ['d', 'ɛ̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ̈', 'ɔ̈', 'n']
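
The package also offers grapheme-aware counterparts of ordinary string operations, such as length() and slice():

grapheme.length(t2)
# 16
grapheme.slice(t2, 0, 3)
# 'dɛ̈t'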

For the Devanagari string, the grapheme package, with its pre-15.1 rules, gives us three grapheme clusters:

list(grapheme.graphemes(t3))
# ['हि', 'न्', 'दी']

PyICU

Using a break iterator

Alternatively, we could use PyICU, with icu4c, for grapheme tokenisation.
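
Since the results depend on the Unicode version implemented by the underlying icu4c library, it is worth checking what your installation reports (the values below are illustrative):

import icu
icu.ICU_VERSION      # e.g. '74.2'
icu.UNICODE_VERSION  # e.g. '15.1'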

Using PyICU is more complex than using regex or grapheme: many of the functions available in icu4c are low-level, and it is necessary to develop your own wrapper around them. The break iterator returns a set of breakpoints in the string, which need to be iterated through. The PyICU cheat sheet has a useful function, iterate_breaks(), for iterating through each breakpoint in the string.

We then need to create a break iterator instance and pass the string and the instance to iterate_breaks(). Here, I will use the root locale for the break iterator.

import icu

def iterate_breaks(text, break_iterator):
    # Yield each segment of text between successive boundaries.
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1:   # -1 signals no more boundaries
            return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary

bi = icu.BreakIterator.createCharacterInstance(icu.Locale.getRoot())
list(iterate_breaks(t2, bi))
# ['d', 'ɛ̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ̈', 'ɔ̈', 'n']

Alternative code we use in our internal projects:

import icu

def generate_tokens(text, brkiter=None):
    # Default to a word break iterator for the root locale.
    if not brkiter:
        brkiter = icu.BreakIterator.createWordInstance(icu.Locale.getRoot())
    brkiter.setText(text)
    i = brkiter.first()
    for j in brkiter:
        yield text[i:j]
        i = j

def get_generated_tokens(text, bi=None):
    return [*generate_tokens(text, brkiter=bi)]

bi = icu.BreakIterator.createCharacterInstance(icu.Locale('und'))
get_generated_tokens(t2, bi)
# ['d', 'ɛ̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ̈', 'ɔ̈', 'n']

If we look at the Devanagari string:

get_generated_tokens(t3, bi)
# ['हि', 'न्दी']

The icu4c break iterator generates two grapheme clusters.

ICU regular expressions

Alternatively, it is possible to use icu.RegexMatcher to split a string into graphemes:

def find_all(pattern, text, flag=0):
    results = []
    matcher = icu.RegexMatcher(pattern, text, flag)
    while matcher.find():   # step through successive matches
        results.append(matcher.group())
    return results

find_all(r'\X', t3)
# ['हि', 'न्दी']

icu.RegexMatcher returns two graphemes.

pyuegc

Then there is pyuegc, an implementation of the Unicode extended grapheme cluster algorithm.

from pyuegc import EGC, UCD_VERSION
UCD_VERSION
# '16.0.0'

The latest pyuegc package uses the current Unicode version.

EGC(t2)
# ['d', 'ɛ̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ̈', 'ɔ̈', 'n']

EGC(t3)
# ['हि', 'न्दी']

pyuegc splits the string into two grapheme clusters.

ugrapheme

The final solution is ugrapheme, which also uses the latest version of Unicode. Unlike the other solutions, ugrapheme provides a class whose instances offer a range of methods for working with and manipulating strings at the grapheme level, rather than the character or codepoint level. To mimic the other solutions, you can cast the object to a list:

from ugrapheme import graphemes
grs = graphemes(t3)
[*grs]    # list(grs)
# ['हि', 'न्दी']

or, alternatively, use the grapheme_split() function:

from ugrapheme import grapheme_split
grapheme_split(t3)
# ['हि', 'न्दी']
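
Beyond splitting, ugrapheme is designed for grapheme-level string handling more generally. As a sketch, assuming the graphemes object supports standard sequence operations such as len() and indexing (iteration, as shown above, already works per grapheme):

grs = graphemes(t3)
len(grs)   # 2: length in graphemes, not codepoints (assumes len() support)
grs[0]     # 'हि' (assumes grapheme-level indexing)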