Character Encodings and Unicode

Andy Theuninck edited this page Jan 16, 2017 · 2 revisions

CORE does not require or use the mbstring extension, so its handling of Unicode is probably far from perfect. Character sets tend to cause a lot of confusion, and this page attempts to clear up some points.

Basics

A string in PHP is just a sequence of bytes. It has no particular meaning until you try to compare it to something or manipulate it. When a string uses a multibyte encoding, some of PHP's basic string functions break down. If the first character of $str is 2 bytes, then substr($str, 0, 1) will return only the first byte rather than the first character. The fact that basic string functions don't handle multibyte encodings correctly is not always a problem. For instance, strlen($str) > 0 will be true for any non-empty byte sequence in any encoding. There are certainly places where CORE will break down when handling Unicode strings, but blindly replacing string functions with mb_* equivalents is probably more work than is actually necessary.
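As a rough illustration of the byte-versus-character distinction, the sketch below uses a short UTF-8 string written with explicit byte escapes so that it does not depend on the encoding of the source file. The variable names are made up for this example, and the preg_match call with the /u modifier is just one way to pick out a whole character without mbstring; it is not something CORE is known to do.

```php
<?php
// "é" in UTF-8 is the two bytes 0xC3 0xA9, so this is "été" in 5 bytes.
$str = "\xC3\xA9t\xC3\xA9";

// strlen() counts bytes, not characters.
$byteLength = strlen($str);

// substr() slices bytes, so this is only half of the first character.
$firstByte = substr($str, 0, 1);

// With the /u modifier PCRE treats the subject as UTF-8,
// so "." matches one whole character.
preg_match('/^./us', $str, $m);
$firstChar = $m[0];

// A check that does not depend on character boundaries still works.
$nonEmpty = strlen($str) > 0;

echo bin2hex($firstByte), "\n"; // c3
echo bin2hex($firstChar), "\n"; // c3a9
```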

The Database

Database tables have a character set. This dictates how values are stored as bytes. During a given session, a client connected to the database also specifies a character set, and the database returns results in the client's character set. The important point here is that a table declared as "utf8" does not guarantee that PHP is receiving utf8 byte sequences. This is why various resources suggest issuing a SET NAMES utf8 query immediately after opening a connection. Using the same character set for all tables and clients may be ideal from a performance and simplicity standpoint, but it is not strictly necessary if the characters you need to store can be encoded in both utf8 and ISO-8859-1 (latin1).
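To make the point about byte sequences concrete, the sketch below shows the same word as latin1 bytes and as utf8 bytes. Both helper functions are hypothetical names written for this example (they are not part of CORE), and the conversion is done by hand so that no extension is needed. If a connection's character set is latin1, PHP would see something like the first form even when the table itself is utf8.

```php
<?php
// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8 by hand.
// Every latin1 byte >= 0x80 becomes a two-byte UTF-8 sequence.
function latin1ToUtf8(string $latin1): string
{
    $out = '';
    for ($i = 0, $n = strlen($latin1); $i < $n; $i++) {
        $byte = ord($latin1[$i]);
        if ($byte < 0x80) {
            $out .= chr($byte); // ASCII is identical in both encodings
        } else {
            $out .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $out;
}

// Hypothetical helper: preg_match with /u fails on input that is
// not a valid UTF-8 byte sequence.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$asLatin1 = "caf\xE9";               // "café" as a latin1 client would receive it
$asUtf8   = latin1ToUtf8($asLatin1); // "café" as a utf8 client would receive it

echo bin2hex($asLatin1), "\n"; // 636166e9
echo bin2hex($asUtf8), "\n";   // 636166c3a9
```

A validity check like isValidUtf8 can be a quick way to spot a connection that is not delivering utf8, although pure ASCII data passes under either character set.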

The Web Server

Your web server specifies a character set as an HTTP header. For example,

Content-Type: text/html; charset=iso-8859-1

How this is configured differs from server to server. Some browsers may use this value regardless of what any <meta> tags on the page itself say, so to eliminate potential problems make sure the Content-Type header matches the encoding you're trying to use. Headers can be examined using curl -I http://path/to/core.
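If you would rather check the header from a script than by eye, something along these lines could pull the charset out of a Content-Type value. The function name is made up for this example and the pattern is a simplified sketch, not a full parser for HTTP header syntax.

```php
<?php
// Hypothetical helper: extract the charset parameter from a
// Content-Type header line or value. Returns null when none is present.
function charsetFromContentType(string $header): ?string
{
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $header, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

echo charsetFromContentType('Content-Type: text/html; charset=iso-8859-1'), "\n";
```

The header lines themselves could come from the curl -I output or from PHP's get_headers() function.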

Other Considerations

Using a <meta> tag with a content type may help browsers identify the page's character set. Form data also needs to be encoded correctly; the <form> element has an accept-charset attribute for this purpose.

One Path for Unicode Support in CORE

  • Add a configuration option to include a SET NAMES query at the start of each connection
  • Make sure the web server charset is UTF-8.
  • Make sure the <meta> content tag also says UTF-8.
  • Use some non-ASCII characters that can be represented in both UTF-8 and ISO-8859-1 (e.g., accented vowels) and look for problems.

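Problems will likely involve string functions and preg_* functions. One of the more common uses of substr is to trim a value down to a size that will fit in a database column. For these cases the correct replacement is mb_strcut, not mb_substr: mb_strcut takes its length in bytes and will not cut through the middle of a multibyte character, whereas mb_substr counts characters.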
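A minimal sketch of that kind of trimming follows. The function name is hypothetical, and the fallback branch is only an assumption about how one might cope when mbstring is missing (which CORE does not require); it also assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: trim a UTF-8 string to at most $maxBytes bytes
// without leaving a partial character at the end.
function truncateUtf8Bytes(string $str, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (function_exists('mb_strcut')) {
        // mb_strcut measures in bytes but respects character boundaries.
        return mb_strcut($str, 0, $maxBytes, 'UTF-8');
    }
    // Fallback without mbstring: cut on bytes, then drop trailing bytes
    // until what remains is valid UTF-8 again.
    $cut = substr($str, 0, $maxBytes);
    while ($cut !== '' && preg_match('//u', $cut) !== 1) {
        $cut = substr($cut, 0, -1);
    }
    return $cut;
}

$value = "caf\xC3\xA9"; // "café", 5 bytes in UTF-8

echo bin2hex(substr($value, 0, 4)), "\n";            // 636166c3 (ends mid-character)
echo bin2hex(truncateUtf8Bytes($value, 4)), "\n";    // 636166
```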

Receipt Printers

The Epson receipt printers currently supported by CORE do not support any variety of Unicode. They support a number of code pages, all of which are single-byte encodings. The printers can therefore address at most 256 characters at a time, where byte values 0 through 127 are ASCII and byte values 128 through 255 are determined by the selected Epson code page.
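In practice this means UTF-8 text has to be mapped down to single bytes before it reaches the printer. The sketch below is illustrative only: the function and constant names are made up, the lookup table covers just three characters, and it assumes the printer is set to code page 437, which may not match a given installation. Characters without a mapping are replaced with a fallback byte.

```php
<?php
// Illustrative and deliberately incomplete: UTF-8 characters mapped to
// their byte values in code page 437 (an assumed printer setting).
const CP437_SAMPLE = [
    "\xC3\xBC" => "\x81", // ü
    "\xC3\xA9" => "\x82", // é
    "\xC3\xB1" => "\xA4", // ñ
];

// Hypothetical helper: convert UTF-8 text to one byte per character.
function toSingleByte(string $utf8, array $map = CP437_SAMPLE, string $fallback = '?'): string
{
    // Split into whole characters; on invalid UTF-8 fall back to raw bytes.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        $chars = str_split($utf8);
    }
    $out = '';
    foreach ($chars as $ch) {
        if (strlen($ch) === 1 && ord($ch) < 0x80) {
            $out .= $ch;        // ASCII passes through unchanged
        } elseif (isset($map[$ch])) {
            $out .= $map[$ch];  // known character in the code page
        } else {
            $out .= $fallback;  // not representable on the printer
        }
    }
    return $out;
}

echo bin2hex(toSingleByte("caf\xC3\xA9")), "\n"; // 63616682
```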
