utf.h
utf.c and utf.h are a newly written, unit-testable module created to implement logic related to UTF-8 handling. The module supports conversion from a small number of ISO-Latin1-class character sets to UTF-8.
Unit tests for utf.c are in tests/test_utf.c.
Trn was originally written on the assumption that every character is a single ASCII byte. That assumption shows up in practices such as the copious use of the ++ operator to advance pointers by one byte and the clearing up of “grey space” (control and 8-bit characters), all of which corrupt UTF-8. Adding UTF-8 support therefore means finding every place that relies on this assumption. Because UTF-8 is variable-width (and because we will eventually need to support more character sets), the byte and pointer manipulations needed to process it correctly are too involved to fit neatly into macros, so new functions have to be written.
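To make the problem concrete, here is a small standalone sketch. It is not code from trn, and `count_bytes` and `count_chars` are hypothetical names. It walks the same string once with a plain `s++` loop and once while skipping UTF-8 continuation bytes (bit pattern 10xxxxxx), assuming the input is well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: steps through the string one byte at a time,
 * the way one-byte-per-character code does. */
size_t count_bytes(const char *s)
{
    size_t n = 0;
    while (*s) {
        s++;
        n++;
    }
    return n;
}

/* Hypothetical helper: counts characters by ignoring UTF-8 continuation
 * bytes (10xxxxxx). Assumes well-formed UTF-8 input. */
size_t count_chars(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "naïve" encoded as UTF-8 (`"na\xc3\xaf" "ve"`), the first function reports 6 and the second reports 5. Any code that treats the byte count as a character or column count is off by one for every two-byte character it passes.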
CODE_POINT: an unsigned long, used to represent a Unicode code point.
Constant | Meaning |
---|---|
CHARSET_ASCII | The US-ASCII character set. |
CHARSET_ISO8859_1 | The ISO 8859-1 (Latin1) character set. |
CHARSET_ISO8859_15 | The ISO 8859-15 (Latin9) character set. |
CHARSET_UNKNOWN | An unknown character set. |
CHARSET_UTF8 | The UTF-8 character set. |
CHARSET_WINDOWS_1252 | The Windows-1252 character set. |
Constant | Meaning |
---|---|
TAG_ASCII | The US-ASCII character set. |
TAG_UTF8 | The UTF-8 character set. |
TAG_ISO8859_1 | The Latin1 character set. |
TAG_ISO8859_15 | The Latin9 character set. |
TAG_WINDOWS_1252 | The Windows-1252 character set. |
Constant | Meaning |
---|---|
INVALID_CODE_POINT | An invalid code point. |
- bool at_norm_char(const char *s)
- s: string to check
- returns: whether the character at *s should not be replaced by a space
Drop-in replacement for the AT_NORM_CHAR macro in util.h. Checks whether the (potentially non-ASCII) character at *s is a “normal” character (i.e., should not be replaced by a space). Returns 1 if the character at *s is “normal”, 0 if not.
This should be called through either the AT_NORM_CHAR or AT_GREY_SPACE macros in util.h, which have been modified to use the at_norm_char() function.
- int byte_length_at(const char *s)
- s: string to check
- returns: number of bytes taken up by the character at *s
Determines how many bytes the (potentially non-ASCII) character at *s takes up. Returns an int from 0 to 6.
(0 should only ever be returned if s is NULL; otherwise byte_length_at should return a value from 1 to 6.)
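One way to obtain a length in the 0 to 6 range is to look only at the lead byte, as in the original (pre-RFC 3629) UTF-8 definition. The following is a sketch under that assumption, not the code in utf.c. `sketch_byte_length_at` is a hypothetical name, and treating a stray continuation byte or 0xFE/0xFF as a one-byte character is a choice made here for illustration.

```c
#include <stddef.h>

/* Hypothetical sketch: byte length of the character at *s, judged from
 * the lead byte alone. Does not verify the continuation bytes. */
int sketch_byte_length_at(const char *s)
{
    unsigned char c;

    if (s == NULL)
        return 0;
    c = (unsigned char)*s;
    if (c < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((c & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    if ((c & 0xFC) == 0xF8) return 5;  /* 111110xx */
    if ((c & 0xFE) == 0xFC) return 6;  /* 1111110x */
    return 1;  /* assumption: stray continuation byte or 0xFE/0xFF */
}
```

With a helper like this, a loop that used to advance with `s++` can instead advance with `s += sketch_byte_length_at(s)`.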
- CODE_POINT code_point_at(const char *s)
- s: string to check
- returns: code point at start of s
Returns the Unicode code point for the character at *s, as an unsigned long. Returns INVALID_CODE_POINT if s is NULL or if *s contains a bit pattern that’s invalid for UTF-8.
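Decoding can be sketched as follows. This is an illustration rather than the implementation in utf.c: `sketch_code_point`, `SKETCH_INVALID`, and `sketch_code_point_at` are hypothetical names, and the sentinel value is an assumption because the actual value of INVALID_CODE_POINT is not stated here. The sketch accepts sequences of up to 6 bytes and does not reject overlong encodings.

```c
#include <stddef.h>

typedef unsigned long sketch_code_point;            /* stand-in for CODE_POINT */
#define SKETCH_INVALID ((sketch_code_point)-1)      /* assumed sentinel value */

/* Hypothetical sketch: decode the UTF-8 character at *s. */
sketch_code_point sketch_code_point_at(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    sketch_code_point cp;
    int len, i;

    if (p == NULL)
        return SKETCH_INVALID;
    if (p[0] < 0x80)
        return p[0];                                 /* plain ASCII */

    /* The lead byte gives the sequence length and the top payload bits. */
    if ((p[0] & 0xE0) == 0xC0)      { len = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; cp = p[0] & 0x07; }
    else if ((p[0] & 0xFC) == 0xF8) { len = 5; cp = p[0] & 0x03; }
    else if ((p[0] & 0xFE) == 0xFC) { len = 6; cp = p[0] & 0x01; }
    else
        return SKETCH_INVALID;   /* stray continuation byte, 0xFE or 0xFF */

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits.
     * A premature '\0' fails this check, so the loop never overreads. */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return SKETCH_INVALID;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return cp;
}
```

For example, the three bytes E2 82 AC decode to 0x20AC (the euro sign), while a lone 0x80 yields the invalid sentinel.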
- const char *input_charset_name()
Returns a short label representing the currently active input character set. Used in current_charsubst() in charsubst.c to display the current conversion on screen and in decode_header() in cache.c to save system state.
- int insert_unicode_at(char *s, CODE_POINT c)
- s: buffer to modify
- c: Unicode code point to insert
- returns: number of bytes written
Inserts one UTF-8 character with code point c at the buffer pointed to by s; returns an int representing the number of bytes written. Caller must ensure there is enough space for the worst-case of 6 bytes plus 1 (6 for the character, 1 for the terminating '\0'). This is used for implementing numerical character references.
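The encoding direction, and the reason for the 7-byte worst case, can be sketched like this. The names `sketch_code_point` and `sketch_insert_unicode_at` are hypothetical, and two details are assumptions of the sketch rather than documented behaviour of insert_unicode_at(): the returned count excludes the terminating '\0', and a code point above 0x7FFFFFFF writes nothing and returns 0.

```c
#include <stddef.h>

typedef unsigned long sketch_code_point;   /* stand-in for CODE_POINT */

/* Hypothetical sketch: write code point c at s as UTF-8, then a '\0'.
 * The caller must provide at least 7 bytes (6 for the character, 1 for
 * the terminator). Returns the number of character bytes written. */
int sketch_insert_unicode_at(char *s, sketch_code_point c)
{
    int len, i;
    unsigned char lead;

    if (s == NULL)
        return 0;

    /* Pick the shortest form that can hold c, and its lead-byte marker. */
    if (c < 0x80UL)              { len = 1; lead = 0x00; }
    else if (c < 0x800UL)        { len = 2; lead = 0xC0; }
    else if (c < 0x10000UL)      { len = 3; lead = 0xE0; }
    else if (c < 0x200000UL)     { len = 4; lead = 0xF0; }
    else if (c < 0x4000000UL)    { len = 5; lead = 0xF8; }
    else if (c <= 0x7FFFFFFFUL)  { len = 6; lead = 0xFC; }
    else
        return 0;                /* assumption: out of range, write nothing */

    /* Fill continuation bytes from the end, 6 bits at a time. */
    for (i = len - 1; i > 0; i--) {
        s[i] = (char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    s[0] = (char)(lead | c);     /* remaining high bits go in the lead byte */
    s[len] = '\0';
    return len;
}
```

A numerical character reference such as `&#8364;` would then be handled by parsing the number (8364, or 0x20AC) and passing it as `c`, which produces the bytes E2 82 AC.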
- const char *output_charset_name()
Returns a short label representing the currently active output character set. Used in current_charsubst() in charsubst.c to display the current conversion on screen and in decode_header() in cache.c to save system state.
- int put_char_adv(char **sp, bool_int outputok)
- sp: pointer to string to output
- outputok: whether to actually output anything
- returns: number of bytes written (or would have written)
Displays one UTF-8 character at **sp, then increments the pointer at *sp by the number of bytes the character takes up, then returns the number of character cells the character takes up. If outputok is FALSE the character is not actually written to stdout.
This (e.g., w = put_char_adv(&s, TRUE);) is intended as a not-quite-drop-in replacement for both putchar(*s++) and i = putsubstchar(c, limit, outputok).
Example:
Old code | Unicode-safe equivalent |
---|---|
putchar(*s); | i += put_char_adv(&s, TRUE) - 1; s--; |
Note: The existing code often assumes counters and pointers are always incremented by one. Until these assumptions are completely rooted out, it is necessary to resort to kludges like the s-- in the example.
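The pointer-advancing part of this pattern can be sketched in isolation. The function below is a hypothetical stand-in, not put_char_adv() itself: it takes a plain `int` flag instead of `bool_int`, and it returns the number of bytes consumed, because computing the on-screen width would need a width lookup that is left out here.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch: emit (or skip) one whole UTF-8 character and
 * advance *sp past it. Returns the number of bytes consumed. */
int sketch_put_char_adv(char **sp, int outputok)
{
    const unsigned char *p;
    int len = 1, i;

    if (sp == NULL || *sp == NULL || **sp == '\0')
        return 0;
    p = (const unsigned char *)*sp;

    /* Sequence length from the lead byte; anything else counts as 1. */
    if ((p[0] & 0xE0) == 0xC0)      len = 2;
    else if ((p[0] & 0xF0) == 0xE0) len = 3;
    else if ((p[0] & 0xF8) == 0xF0) len = 4;
    else if ((p[0] & 0xFC) == 0xF8) len = 5;
    else if ((p[0] & 0xFE) == 0xFC) len = 6;

    /* Stop early at '\0' so truncated input never runs past the end. */
    for (i = 0; i < len && p[i] != '\0'; i++) {
        if (outputok)
            putchar(p[i]);
    }
    *sp += i;
    return i;
}
```

A caller that previously wrote `while (*s) putchar(*s++);` could loop on `while (*s) sketch_put_char_adv(&s, 1);` instead, so that a multi-byte character is never split between iterations.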
- int utf_init(const char *f, const char *t)
- f: input character set
- t: output character set (must be "utf-8")
- returns: an int representing the active input character set
Sets the input character set to f, if supported, and output character set to t, if supported. Returns an int representing the currently active input character set, which could correspond to either f (if supported) or the previously active input character set.
- int visual_length_of(const char *s)
Determines how many character cells the (potentially non-ASCII) string at s takes up on screen, as an int. The byte equivalent is just strlen().
- int visual_width_at(const char *s)
Determines how many character cells the (potentially non-ASCII) character at *s takes up on screen. Returns an int from 0 to 2.
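The relationship between the two width functions can be sketched as a per-character width lookup summed over the string. Everything below is hypothetical and heavily simplified: the `sketch_` names are invented, invalid bytes are treated as one-cell replacement characters, and the width table covers only a handful of Unicode ranges (combining diacritical marks as 0 cells, some East Asian blocks as 2 cells, everything else as 1). The real functions may decide widths quite differently.

```c
#include <stddef.h>

/* Hypothetical helper: decode one character (up to 4 bytes) and store
 * its byte length in *len. Invalid input yields U+FFFD. */
static unsigned long sketch_decode(const unsigned char *p, int *len)
{
    unsigned long cp;
    int n, i;

    if (p[0] < 0x80) { *len = 1; return p[0]; }
    if ((p[0] & 0xE0) == 0xC0)      { n = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; cp = p[0] & 0x07; }
    else { *len = 1; return 0xFFFD; }
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) { *len = i; return 0xFFFD; }
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *len = n;
    return cp;
}

/* Hypothetical, deliberately incomplete width table. */
static int sketch_width_of(unsigned long cp)
{
    if (cp == 0)
        return 0;
    if (cp >= 0x0300 && cp <= 0x036F)       /* combining diacritical marks */
        return 0;
    if ((cp >= 0x1100 && cp <= 0x115F) ||   /* Hangul Jamo initial consonants */
        (cp >= 0x3040 && cp <= 0x30FF) ||   /* Hiragana and Katakana blocks */
        (cp >= 0x4E00 && cp <= 0x9FFF) ||   /* CJK Unified Ideographs */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||   /* Hangul syllables */
        (cp >= 0xFF01 && cp <= 0xFF60))     /* fullwidth forms */
        return 2;
    return 1;
}

/* Cells taken up by the single character at *s (0 to 2). */
int sketch_visual_width_at(const char *s)
{
    int len;
    if (s == NULL)
        return 0;
    return sketch_width_of(sketch_decode((const unsigned char *)s, &len));
}

/* Cells taken up by the whole string: the sum of per-character widths. */
int sketch_visual_length_of(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int total = 0, len;

    if (p == NULL)
        return 0;
    while (*p) {
        total += sketch_width_of(sketch_decode(p, &len));
        p += len;
    }
    return total;
}
```

Under these assumptions the six-byte string `"\xe6\x97\xa5\xe6\x9c\xac"` (two CJK ideographs) has a visual length of 4, whereas strlen() reports 6. That gap is why column arithmetic cannot rely on byte counts.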