Ambrose Li edited this page Aug 8, 2020 · 26 revisions

utf.c and utf.h are a newly written, unit-testable module created to implement logic related to UTF-8 handling. The module supports conversion from a small number of ISO-Latin1-class character sets to UTF-8.

Unit tests for utf.c are in tests/test_utf.c.

Rationale

Trn was originally written on the assumption that every character is a single byte of ASCII. This assumption leads to practices such as the copious use of the ++ operator to advance pointers by one byte and the clearing of “grey space” (control and 8-bit characters), all of which corrupt UTF-8. Adding support for UTF-8 thus involves finding every place that relies on this assumption. Because UTF-8 is variable-width (and because we will eventually need to support more character sets), the byte and pointer manipulations needed to process it correctly are too involved to fit neatly into macros, so new functions need to be written.

Exported types

CODE_POINT

An unsigned long, used to represent a Unicode code point.

Exported constants

Numeric identifiers for supported character sets

| Constant | Meaning |
| --- | --- |
| CHARSET_ASCII | The US-ASCII character set. |
| CHARSET_ISO8859_1 | The ISO 8859-1 (Latin1) character set. |
| CHARSET_ISO8859_15 | The ISO 8859-15 (Latin9) character set. |
| CHARSET_UNKNOWN | An unknown character set. |
| CHARSET_UTF8 | The UTF-8 character set. |
| CHARSET_WINDOWS_1252 | The Windows-1252 character set. |
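
To illustrate what distinguishes these character sets when bytes are converted to Unicode, here is a minimal sketch. It is not the code in utf.c: the enum, the SKETCH_INVALID value, and the function names are hypothetical, and the Windows-1252 mappings for 0x81 to 0x9F are deliberately left out.

```c
#include <assert.h>

/* Hypothetical identifiers; the real CHARSET_* values are not shown here. */
enum sketch_charset {
    SKETCH_ASCII, SKETCH_ISO8859_1, SKETCH_ISO8859_15, SKETCH_WINDOWS_1252
};

#define SKETCH_INVALID 0xFFFFFFFFUL       /* assumed sentinel value */

/* Maps one byte of a single-byte character set to a Unicode code point. */
unsigned long sketch_to_code_point(enum sketch_charset cs, unsigned char b)
{
    if (b < 0x80)
        return b;                         /* all four sets share ASCII */
    switch (cs) {
    case SKETCH_ISO8859_1:
        return b;                         /* Latin1 bytes equal code points */
    case SKETCH_ISO8859_15:
        switch (b) {                      /* the eight Latin9 differences */
        case 0xA4: return 0x20AC;         /* euro sign */
        case 0xA6: return 0x0160;
        case 0xA8: return 0x0161;
        case 0xB4: return 0x017D;
        case 0xB8: return 0x017E;
        case 0xBC: return 0x0152;
        case 0xBD: return 0x0153;
        case 0xBE: return 0x0178;
        default:   return b;              /* otherwise same as Latin1 */
        }
    case SKETCH_WINDOWS_1252:
        if (b == 0x80) return 0x20AC;     /* euro sign */
        if (b >= 0xA0) return b;          /* upper half matches Latin1 */
        return SKETCH_INVALID;            /* other 0x81-0x9F bytes omitted */
    default:
        return SKETCH_INVALID;            /* ASCII has no bytes above 0x7F */
    }
}

/* Usage sketch: the same euro sign arrives as different bytes. */
void sketch_demo(void)
{
    assert(sketch_to_code_point(SKETCH_ISO8859_15, 0xA4) == 0x20AC);
    assert(sketch_to_code_point(SKETCH_WINDOWS_1252, 0x80) == 0x20AC);
}
```

The resulting code point would then be encoded as UTF-8 for output.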

String labels for supported character sets

| Constant | Meaning |
| --- | --- |
| TAG_ASCII | The US-ASCII character set. |
| TAG_UTF8 | The UTF-8 character set. |
| TAG_ISO8859_1 | The Latin1 character set. |
| TAG_ISO8859_15 | The Latin9 character set. |
| TAG_WINDOWS_1252 | The Windows-1252 character set. |

Others

| Constant | Meaning |
| --- | --- |
| INVALID_CODE_POINT | An invalid code point. |

Exported functions

at_norm_char

  • bool at_norm_char(const char *s)
  • s: string to check
  • returns: whether the character at *s should not be replaced by a space

Drop-in replacement for the AT_NORM_CHAR macro in util.h. Checks whether the (potentially non-ASCII) character at *s is a “normal” character (i.e., should not be replaced by a space). Returns 1 if the character at *s is “normal”, 0 if not.

This should be called through either the AT_NORM_CHAR or AT_GREY_SPACE macros in util.h, which have been modified to use the at_norm_char() function.

byte_length_at

  • int byte_length_at(const char *s)
  • s: string to check
  • returns: number of bytes taken up by the character at *s

Determines how many bytes the (potentially non-ASCII) character at *s takes up. Returns an int from 0 to 6.

(0 should only ever be returned if s is NULL; otherwise byte_length_at should return a value from 1 to 6.)
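
The relationship between a lead byte and the length of its sequence can be sketched as follows. This is an illustration rather than the code in utf.c: sketch_byte_length and sketch_count_chars are hypothetical names, the 1-to-6-byte layout is inferred from the documented return range, and treating a stray or invalid byte as one byte is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for byte_length_at(); not the code in utf.c.
 * Only the lead byte is inspected; continuation bytes are not validated. */
int sketch_byte_length(const char *s)
{
    unsigned char c;

    if (s == NULL)
        return 0;                         /* documented NULL case */
    c = (unsigned char)*s;
    if (c < 0x80)           return 1;     /* 0xxxxxxx: ASCII */
    if ((c & 0xE0) == 0xC0) return 2;     /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;     /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;     /* 11110xxx */
    if ((c & 0xFC) == 0xF8) return 5;     /* 111110xx */
    if ((c & 0xFE) == 0xFC) return 6;     /* 1111110x */
    return 1;   /* assumption: a stray or invalid byte counts as one byte */
}

/* Counts characters by stepping over whole sequences instead of using s++. */
size_t sketch_count_chars(const char *s)
{
    size_t count = 0;

    while (s != NULL && *s != '\0') {
        int n = sketch_byte_length(s);
        assert(n >= 1);                   /* the loop must always advance */
        while (n-- > 0 && *s != '\0')     /* never step past the terminator */
            s++;
        count++;
    }
    return count;
}
```

With this sketch, the six-byte string for “naïve” counts as five characters, which is the distinction a plain s++ loop misses.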

code_point_at

  • CODE_POINT code_point_at(const char *s)
  • s: string to check
  • returns: code point at start of s

Returns the Unicode code point for the character at *s, as an unsigned long. Returns INVALID_CODE_POINT if s is NULL or if *s contains a bit pattern that’s invalid for UTF-8.
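
Decoding a code point from a UTF-8 sequence might look like the sketch below. It is not the code in utf.c: the SKETCH_ names are hypothetical, the value chosen for the invalid sentinel is an assumption, and overlong encodings are not rejected.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long SKETCH_CODE_POINT;             /* mirrors CODE_POINT */
#define SKETCH_INVALID ((SKETCH_CODE_POINT)-1)       /* assumed sentinel */

/* Hypothetical stand-in for code_point_at(); not the code in utf.c. */
SKETCH_CODE_POINT sketch_code_point_at(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    SKETCH_CODE_POINT cp;
    int extra, i;

    if (p == NULL)
        return SKETCH_INVALID;
    if (p[0] < 0x80)
        return p[0];                                  /* plain ASCII */
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else if ((p[0] & 0xFC) == 0xF8) { cp = p[0] & 0x03; extra = 4; }
    else if ((p[0] & 0xFE) == 0xFC) { cp = p[0] & 0x01; extra = 5; }
    else
        return SKETCH_INVALID;        /* stray continuation, 0xFE or 0xFF */

    for (i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80)    /* also catches a premature '\0' */
            return SKETCH_INVALID;
        cp = (cp << 6) | (p[i] & 0x3F);   /* append 6 payload bits */
    }
    return cp;
}

/* Usage sketch: decoding the three-byte euro sign. */
void sketch_demo(void)
{
    assert(sketch_code_point_at("\xE2\x82\xAC") == 0x20AC);
}
```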

input_charset_name

  • const char *input_charset_name();

Returns a short label representing the currently active input character set. Used in current_charsubst() in charsubst.c to display the current conversion on screen and in decode_header() in cache.c to save system state.

insert_unicode_at

  • int insert_unicode_at(char *s, CODE_POINT c)
  • s: buffer to modify
  • c: Unicode code point to insert
  • returns: number of bytes written

Inserts one UTF-8 character with code point c at the buffer pointed to by s; returns an int representing the number of bytes written. Caller must ensure there is enough space for the worst-case of 6 bytes plus 1 (6 for the character, 1 for the terminating '\0'). This is used for implementing numerical character references.
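
The encoding step and the 7-byte buffer requirement can be sketched as follows. This is not the code in utf.c: the sketch_ names are hypothetical, and both the handling of out-of-range values (nothing is written) and the return value excluding the terminator are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long SKETCH_CODE_POINT;  /* mirrors CODE_POINT */

/* Hypothetical stand-in for insert_unicode_at(); not the code in utf.c. */
int sketch_insert_unicode(char *s, SKETCH_CODE_POINT c)
{
    /* Lead-byte prefixes indexed by sequence length. */
    static const unsigned char lead[7] =
        { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    int n, i;

    if (s == NULL)
        return 0;
    if (c < 0x80UL)              n = 1;
    else if (c < 0x800UL)        n = 2;
    else if (c < 0x10000UL)      n = 3;
    else if (c < 0x200000UL)     n = 4;
    else if (c < 0x4000000UL)    n = 5;
    else if (c <= 0x7FFFFFFFUL)  n = 6;
    else
        return 0;    /* assumption: out-of-range values write nothing */

    /* Fill continuation bytes from the right, 6 payload bits each. */
    for (i = n - 1; i > 0; i--) {
        s[i] = (char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    s[0] = (char)(lead[n] | c);           /* remaining bits go in the lead */
    s[n] = '\0';
    return n;
}

/* Usage sketch: expanding the numeric character reference "&#8364;". */
void sketch_demo(void)
{
    char buf[7];                          /* 6 bytes plus the terminator */
    int n = sketch_insert_unicode(buf, strtoul("8364", NULL, 10));

    assert(n == 3);
    assert(strcmp(buf, "\xE2\x82\xAC") == 0);   /* U+20AC, the euro sign */
}
```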

output_charset_name

  • const char *output_charset_name();

Returns a short label representing the currently active output character set. Used in current_charsubst() in charsubst.c to display the current conversion on screen and in decode_header() in cache.c to save system state.

put_char_adv

  • int put_char_adv(char **sp, bool_int outputok)
  • sp: pointer to string to output
  • outputok: whether to actually output anything
  • returns: number of character cells the character takes up (whether or not it was actually written)

Displays one UTF-8 character at **sp, then increments the pointer at *sp by the number of bytes the character takes up, then returns the number of character cells the character takes up. If outputok is FALSE the character is not actually written to stdout.

This (e.g., w = put_char_adv(&s, TRUE);) is intended as a not-quite-drop-in replacement for both putchar(*s++) and i = putsubstchar(c, limit, outputok).

Example:

| Old code | Unicode-safe equivalent |
| --- | --- |
| putchar(*s); | i += put_char_adv(&s, TRUE) - 1; s--; |

Note: The existing code often assumes counters and pointers are always incremented by one. Until these assumptions are completely rooted out, it is necessary to resort to kludges like the s-- in the example.
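
Once the surrounding loop no longer assumes one-byte steps, the kludge is unnecessary. The sketch below shows that shape under stated assumptions: sketch_put_char_adv and sketch_put_string are hypothetical stand-ins rather than the code in utf.c, and every character is assumed to occupy exactly one cell.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for put_char_adv(); not the code in utf.c.
 * Simplification: every character is assumed to occupy one cell. */
int sketch_put_char_adv(char **sp, int outputok)
{
    unsigned char c;
    int n, i;

    if (sp == NULL || *sp == NULL || **sp == '\0')
        return 0;
    c = (unsigned char)**sp;
    if (c < 0x80)                n = 1;
    else if ((c & 0xE0) == 0xC0) n = 2;
    else if ((c & 0xF0) == 0xE0) n = 3;
    else if ((c & 0xF8) == 0xF0) n = 4;
    else if ((c & 0xFC) == 0xF8) n = 5;
    else if ((c & 0xFE) == 0xFC) n = 6;
    else                         n = 1;   /* stray byte: emit it alone */

    for (i = 0; i < n && **sp != '\0'; i++) {
        if (outputok)
            putchar(**sp);
        (*sp)++;                          /* advance the caller's pointer */
    }
    return 1;
}

/* Prints a whole string and returns the number of cells it used. */
int sketch_put_string(char *s, int outputok)
{
    int cells = 0;

    while (s != NULL && *s != '\0') {
        char *before = s;
        cells += sketch_put_char_adv(&s, outputok);
        assert(s > before);               /* each call consumes input */
    }
    return cells;
}
```

Because the pointer is advanced inside the call, the loop body needs neither s++ nor s--.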

utf_init

  • int utf_init(const char *f, const char *t);
  • f: input character set
  • t: output character set (must be "utf-8")
  • returns: an int representing the active input character set

Sets the input character set to f, if supported, and output character set to t, if supported. Returns an int representing the currently active input character set, which could correspond to either f (if supported) or the previously active input character set.

visual_length_of

  • int visual_length_of(const char *s)

Determines how many character cells the (potentially non-ASCII) string at s takes up on screen, as an int. The byte equivalent is just strlen().

visual_width_at

  • int visual_width_at(const char *s)

Determines how many character cells the (potentially non-ASCII) character at *s takes up on screen. Returns an int from 0 to 2.
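
How the two visual functions relate can be sketched as a per-character width summed over a string. This is an illustration, not the code in utf.c: all sketch_ names are hypothetical, the width ranges are a deliberately small subset of the real Unicode data, and malformed bytes are assumed to take one cell each.

```c
#include <assert.h>
#include <stddef.h>

/* Decodes one character (up to 4 bytes here), stores its code point and
 * returns its byte length. Malformed input is treated as a single byte. */
static int sketch_decode(const unsigned char *p, unsigned long *cp)
{
    int n, i;

    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; n = 4; }
    else { *cp = p[0]; return 1; }
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) { *cp = p[0]; return 1; }
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return n;
}

/* Rough stand-in for visual_width_at(): 0, 1 or 2 cells per code point. */
static int sketch_width(unsigned long cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;                         /* combining diacritical marks */
    if ((cp >= 0x1100 && cp <= 0x115F) || /* Hangul Jamo initial consonants */
        (cp >= 0x3040 && cp <= 0x30FF) || /* Hiragana and Katakana */
        (cp >= 0x4E00 && cp <= 0x9FFF) || /* CJK Unified Ideographs */
        (cp >= 0xAC00 && cp <= 0xD7A3) || /* Hangul syllables */
        (cp >= 0xFF01 && cp <= 0xFF60))   /* fullwidth forms */
        return 2;
    return 1;
}

/* Rough stand-in for visual_length_of(): sums the widths over the string. */
int sketch_visual_length(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int cells = 0;

    if (p == NULL)
        return 0;
    while (*p != '\0') {
        unsigned long cp;
        int n = sketch_decode(p, &cp);
        assert(n >= 1);                   /* the loop must always advance */
        p += n;
        cells += sketch_width(cp);
    }
    return cells;
}
```

For the two ideographs of “日本”, strlen() reports 6 bytes while this sketch reports 4 cells, which is why byte counts cannot stand in for screen widths.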