
How character set works

Mark Rotteveel edited this page Jul 19, 2015 · 1 revision

Encoding Types

Firebird uses character encodings in two different areas:

  • The database encoding defines a character set in which all CHAR, VARCHAR and BLOB SUB_TYPE 1 fields are physically stored on the disk. There is a default database encoding that is specified during database creation. It is also possible to specify character sets on per column basis.
  • The client connection encoding defines the character set in which the client sends and expects to receive character data. This encoding might or might not match the database default encoding.

The Firebird RDBMS performs translation between character sets of the client connection and the character set of the content. The list of allowed character sets as well as the allowed translations between them are specified in the fbintl shared library located in the intl/ directory of the Firebird installation. There is also a special character set NONE that tells Firebird not to interpret the contents of the character field.

Firebird uses the following algorithm when performing translations:

  • If source and target character sets match, send the content unchanged.
  • If the target character set is NONE, send source content unchanged.
  • If the source character set is NONE, send source content unchanged.
  • If there is a direct translation rule between source and target character sets, use that rule.
  • If there is no direct translation rule, check if there is a rule to translate the source character set into the UTF8 (or UNICODE_FSS) character set and a rule to translate from UTF8 (or UNICODE_FSS) into the target character set. If yes, use these two rules for translation.
  • If no suitable translation rule can be found, throw an exception.

This algorithm is simple and straightforward, but it can create unexpected results in more complex cases, where multiple translations are required for a single result set row.
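The selection steps above can be sketched in Java. The rule table below is a hypothetical stand-in for the rules actually defined in the fbintl library, and the method names are illustrative, not part of any Firebird or Jaybird API:

```java
import java.util.Set;

/**
 * A minimal sketch of the translation-selection algorithm described above.
 * DIRECT_RULES is a made-up sample; the real translation rules are defined
 * in the fbintl shared library shipped with Firebird.
 */
public class TranslationLookup {

    // Hypothetical direct translation rules, keyed as "source->target".
    private static final Set<String> DIRECT_RULES = Set.of(
            "ISO8859_1->WIN1252", "WIN1252->ISO8859_1",
            "ISO8859_1->UTF8", "UTF8->ISO8859_1",
            "WIN1251->UTF8", "UTF8->WIN1251");

    static String selectTranslation(String source, String target) {
        // Matching character sets: send content unchanged.
        if (source.equals(target)) return "no translation";
        // NONE on either side: send content unchanged.
        if (source.equals("NONE") || target.equals("NONE")) return "no translation";
        // Direct rule between the two character sets.
        if (DIRECT_RULES.contains(source + "->" + target)) return "direct rule";
        // Otherwise, try pivoting through UTF8 (or UNICODE_FSS in older versions).
        if (DIRECT_RULES.contains(source + "->UTF8")
                && DIRECT_RULES.contains("UTF8->" + target)) {
            return "two-step rule via UTF8";
        }
        throw new IllegalArgumentException(
                "No translation from " + source + " to " + target);
    }

    public static void main(String[] args) {
        // WIN1251 -> ISO8859_1 has no direct rule, but both legs via UTF8 exist.
        System.out.println(selectTranslation("WIN1251", "ISO8859_1"));
    }
}
```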

Encodings in Java

Java introduces additional complexity when the NONE character set is used. The reason is that Java internally stores all strings in Unicode format, and the application must tell the JVM which character encoding to use for byte content. When the NONE character set is used, the driver does not know how to interpret the received data. The two choices left to the driver are to construct a string using the default character encoding of the JVM, or to use the hint specified in the connection properties. The default encoding usually matches the regional settings of the operating system and can be accessed from within the JVM through the file.encoding system property. At this point the second conversion happens: the JVM interprets the received content as if it were stored using the default VM character set.

It is clear that this interpretation inside the JVM can lead to errors when the same content is accessed from two or more Java Virtual Machines that have different configurations.

The most severe consequences of this phenomenon occur on Linux and other Unix platforms, where it is possible that no regional settings are configured and the default "C" locale is used. If, for example, the default character set of the database is NONE, and the connection encoding is also NONE, no translation errors or problems occur between the driver and the database server. Jaybird will correctly receive the contents of the database and keep it internally as a byte array. However, when this content is accessed from the Java application, Jaybird converts the byte array into a java.lang.String instance using the default JVM encoding, which in this case leads to all non-ASCII characters being replaced with question marks ("?").
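A minimal sketch of this failure mode, using a sample word and character sets chosen for illustration (when decoding, Java substitutes the Unicode replacement character U+FFFD for undecodable bytes rather than a literal "?", but the information is lost the same way):

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustrates the double conversion described above: the same stored bytes,
 * decoded under two different "default" encodings, yield different strings.
 * The sample bytes are the ISO-8859-1 encoding of "köln" (an illustrative
 * value, not taken from any actual database).
 */
public class NoneCharsetDemo {
    public static void main(String[] args) {
        // What the driver receives from the server for a NONE column.
        byte[] stored = "köln".getBytes(StandardCharsets.ISO_8859_1);

        // A JVM whose default encoding is ISO-8859-1 reconstructs the text.
        String ok = new String(stored, StandardCharsets.ISO_8859_1);

        // A JVM running under the "C" locale effectively decodes as US-ASCII:
        // the non-ASCII byte cannot be mapped and becomes U+FFFD.
        String broken = new String(stored, StandardCharsets.US_ASCII);

        System.out.println(ok);
        System.out.println(broken);
    }
}
```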

Therefore, an application should use the NONE character encoding for both the database and the connection only when at least one of the following conditions is met:

  • The database will contain only ASCII characters.
  • It is guaranteed that all Java Virtual Machines accessing the database will have the same default encoding that can correctly handle all characters stored in the database.

As a partial workaround, one can use the charSet connection property to specify the encoding that should be used to interpret bytes coming from the server. The following rules are used when interpreting the encoding and charSet properties:

  • When only the encoding property is specified, the driver uses the default mapping between server and Java encodings. When the encoding property is not set or is set to NONE, the default JVM encoding is used to interpret bytes coming from the server.
  • When only the charSet property is specified, the driver uses the reverse mapping to determine the connection encoding for the server and interprets the byte stream according to the value of the property.
  • When both the encoding and charSet properties are specified, the driver sets the connection encoding according to the value of the encoding property, but interprets the byte stream according to the charSet property.

The last case is the most powerful, but also the most dangerous. Used properly, it can solve problems with legacy databases; used incorrectly, it can trash the content of the database.
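A sketch of that last combination. The credentials and the WIN1252/Cp1252 pairing are placeholders for illustration; only the encoding and charSet property names come from the rules above:

```java
import java.util.Properties;

/**
 * Builds connection properties for the "both encoding and charSet" case
 * described above: the server is told the connection uses one character
 * set, while the Java side interprets the byte stream with another.
 */
public class ConnectionEncodingExample {

    static Properties legacyRepairProps() {
        Properties props = new Properties();
        props.setProperty("user", "sysdba");        // placeholder credentials
        props.setProperty("password", "masterkey"); // placeholder credentials

        // Server-side connection character set:
        props.setProperty("encoding", "WIN1252");
        // Java-side interpretation of the byte stream:
        props.setProperty("charSet", "Cp1252");
        return props;
    }

    public static void main(String[] args) {
        // These properties would be passed to
        // DriverManager.getConnection("jdbc:firebirdsql://host:3050/db.fdb", props)
        System.out.println(legacyRepairProps());
    }
}
```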

Unfortunately, we cannot recommend using the UNICODE_FSS character set as the database default encoding in versions prior to Firebird 2.0 because of two limitations:

  • UNICODE_FSS characters can occupy up to 3 bytes, which decreases the allowed maximum size of a column and an index.
  • The UNICODE_FSS character set does not define any collation and does not produce correct results when used with the ORDER BY clause.

Firebird 2.0 has better support for Unicode, as well as an increased maximum size of the index entry (it must be smaller than a quarter of the database page size; e.g. the maximum size of indexed columns with a 4k page is 1k).
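A quick back-of-the-envelope check of that limit, assuming the worst case of 3 bytes per UNICODE_FSS character:

```java
/**
 * Back-of-the-envelope arithmetic for the limits mentioned above: an index
 * key must be smaller than a quarter of the page size, and a UNICODE_FSS
 * character can occupy up to 3 bytes.
 */
public class IndexSizeMath {

    static int maxIndexedChars(int pageSizeBytes, int maxBytesPerChar) {
        int maxKeyBytes = pageSizeBytes / 4;   // e.g. 4096 / 4 = 1024
        return maxKeyBytes / maxBytesPerChar;  // worst-case character count
    }

    public static void main(String[] args) {
        // With a 4k page, a UNICODE_FSS index key fits about 341 characters.
        System.out.println(maxIndexedChars(4096, 3));
    }
}
```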

Character encodings

See the list of available character encodings and their mapping to Java character sets.