GitHub - riwux/libencode: A simple C99 Unicode library

 libencode
+----------+

libencode provides interfaces to operate on plain Unicode codepoints and UTF-8
encoded strings.
It doesn't provide any I/O functions, advanced buffering or other complicated
features. libencode is meant to be used as a base for more sophisticated
interfaces and programs. It handles the low-level encoding and decoding of
data, so the user can build upon it and focus on the specific needs of the
application in question.

  Glossary
+----------+

The following definitions attempt to explain the terms used throughout the
project in an understandable way, while still reflecting the original meaning
as specified by the Unicode Consortium.


codespace         | The range of integers available for encoding characters.
                  |   It spans from 0x0 to 0x10FFFF.

codepoint         | A number in the Unicode codespace that is assigned to an
                  |   abstract character or another part of a writing system.

grapheme          | The smallest unit of a writing system that carries meaning,
                  |   i.e. what a human perceives as a character. It is also
                  |   called a 'user-perceived character'.

grapheme cluster  | A sequence of one or more codepoints that represents a grapheme.
                  |   This is what you want "character" to mean if you operate on text.

character         | A character is *NOT* a 7-bit ASCII value.
                  |   The word 'character' is a synonym for 'user-perceived character',
                  |   but since people commonly assume that a character is an ASCII
                  |   value and one byte in size, the term 'grapheme' was
                  |   introduced and is preferred to avoid ambiguity.

UTF-8 sequence    | A series of code units (bytes in this case) that represents a single
                  |   codepoint encoded in the UTF-8 encoding scheme.

code unit         | Code units (or simply 'units') are the building blocks out of which
                  |   a valid encoding is made. In the case of UTF-8, each unit is 8 bits
                  |   (hence the eight) in size. Chained together, they form the
                  |   UTF-8 sequence denoting an encoded codepoint.
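
To make the relationship between codespace, codepoint, code unit and UTF-8
sequence more concrete, here is a minimal C99 sketch of how a single codepoint
can be turned into a UTF-8 sequence. It is an illustration only: the names
example_in_codespace and example_utf8_encode are hypothetical and are not part
of libencode's interface, and the sketch assumes that surrogate codepoints
(0xD800 to 0xDFFF) are rejected, as standard UTF-8 requires.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not libencode API): is cp inside the codespace? */
static int
example_in_codespace(uint32_t cp)
{
	return cp <= 0x10FFFF;
}

/*
 * Hypothetical helper (not libencode API): encode one codepoint as a
 * UTF-8 sequence. buf must have room for four code units (bytes).
 * Returns the number of code units written, or 0 if cp lies outside
 * the codespace or is a surrogate (0xD800..0xDFFF).
 */
static size_t
example_utf8_encode(unsigned char buf[4], uint32_t cp)
{
	if (!example_in_codespace(cp) || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;

	if (cp < 0x80) {                /* 1 unit: 0xxxxxxx */
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {               /* 2 units: 110xxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {             /* 3 units: 1110xxxx 10xxxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	/* 4 units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
	buf[0] = (unsigned char)(0xF0 | (cp >> 18));
	buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}
```

With this sketch, the codepoint 0x41 ('A') becomes a sequence of one code unit,
while 0x20AC (the euro sign) becomes the three code units 0xE2 0x82 0xAC. A
grapheme cluster such as 'e' followed by the combining acute accent (0x65,
0x301) consists of two codepoints and therefore two UTF-8 sequences, three
code units in total, even though a human perceives a single character.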