Support encodings other than UTF-8 and support BOM handling #226

AzimMuradov · 2023-10-02T22:12:04Z

Different encoding formats

Although UTF-8 is quite popular these days, the sad reality is that sometimes we need to handle other encodings (UTF-16, windows-1251, etc.).

Handling of byte order mark (BOM)

There are also problems with BOMs. There should be an API for taking them into account in UTF-16 and other formats that are affected by a BOM. Sometimes I even encounter them in UTF-8 encoded files (which is allowed, but not recommended, by the Unicode standard). In this case, the BOM should be stripped whenever possible.
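To make the UTF-8 case concrete, here is a minimal sketch of what "stripping the BOM" could mean for a whole document held in memory. The helper name `decodeUtf8Document` is hypothetical and is not an existing kotlinx-io API; it uses only the Kotlin standard library and assumes the bytes represent a complete document rather than an arbitrary fragment.

```kotlin
// Hypothetical helper for illustration; not an existing kotlinx-io API.
// The UTF-8 encoding of U+FEFF (the byte order mark) is EF BB BF.
val UTF8_BOM = byteArrayOf(0xEF.toByte(), 0xBB.toByte(), 0xBF.toByte())

// Decodes a complete UTF-8 document, dropping one leading BOM if present.
fun decodeUtf8Document(bytes: ByteArray): String {
    val hasBom = bytes.size >= UTF8_BOM.size &&
        UTF8_BOM.indices.all { bytes[it] == UTF8_BOM[it] }
    val start = if (hasBom) UTF8_BOM.size else 0
    return bytes.decodeToString(startIndex = start)
}
```

Only a BOM at the very start is removed; a U+FEFF that appears later in the data is left untouched.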

Regarding the UTF-8 with BOM issue, I think that it should be the responsibility of kotlinx-io.

See also:

https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-with-bom

Related discussions:


JakeWharton · 2023-10-02T23:53:43Z

To what API are you referring as needing BOM support?

BOMs are a document-level concept and much of this library deals with arbitrary bytes that could come from database rows, HTTP/2 frames, files, and many more. You don't want to be checking for BOMs any time strings are requested, but if you have an API that goes directly from document to string they can be queried.

For example, Okio doesn't do anything with BOMs. But OkHttp, which deals with response documents, will do BOM detection: https://github.com/square/okhttp/blob/master/okhttp/src/jvmMain/kotlin/okhttp3/internal/-UtilJvm.kt#L89-L99. But even with OkHttp you can read bytes and sometimes have to do BOM detection even later as Retrofit does: https://github.com/square/retrofit/blob/master/retrofit-converters/moshi/src/main/java/retrofit2/converter/moshi/MoshiResponseBodyConverter.java#L40-L44.
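As a rough illustration of document-level BOM detection, the sketch below inspects the first bytes of a document and picks a charset accordingly. It is not the OkHttp or Retrofit code linked above; `detectBom` and `decodeDocument` are hypothetical names, and the sketch assumes Kotlin/JVM (it relies on `java.nio.charset.Charset`).

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: returns the charset implied by a leading BOM together
// with the BOM's length in bytes, or null when no known BOM is present.
fun detectBom(bytes: ByteArray): Pair<Charset, Int>? {
    fun startsWith(vararg prefix: Int) =
        bytes.size >= prefix.size && prefix.indices.all { bytes[it] == prefix[it].toByte() }
    return when {
        startsWith(0xEF, 0xBB, 0xBF) -> Charsets.UTF_8 to 3
        // UTF-32 is tested before UTF-16 because FF FE 00 00 begins with the
        // UTF-16LE BOM FF FE. This is a heuristic: UTF-16LE text whose first
        // character is NUL would be misread as UTF-32LE.
        startsWith(0x00, 0x00, 0xFE, 0xFF) -> Charsets.UTF_32BE to 4
        startsWith(0xFF, 0xFE, 0x00, 0x00) -> Charsets.UTF_32LE to 4
        startsWith(0xFE, 0xFF) -> Charsets.UTF_16BE to 2
        startsWith(0xFF, 0xFE) -> Charsets.UTF_16LE to 2
        else -> null
    }
}

// Decodes a complete document, using the BOM when present and `fallback` otherwise.
fun decodeDocument(bytes: ByteArray, fallback: Charset = Charsets.UTF_8): String {
    val (charset, bomLength) = detectBom(bytes) ?: (fallback to 0)
    return String(bytes, bomLength, bytes.size - bomLength, charset)
}
```

The detection runs once, at the point where bytes are known to be a document, which is the distinction drawn above between document-level APIs and arbitrary byte streams.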

This library certainly already facilitates BOM detection with its APIs, but building it in needs to be careful to not assume the use of documents.
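One way a caller might do this today is an opt-in extension that peeks at the first three bytes of a `Source` and skips them only when they form a UTF-8 BOM. This is a sketch under assumptions: `skipUtf8Bom` is a hypothetical name, not part of kotlinx-io, and it relies on the `Source.request`, `Source.peek`, and `Source.skip` functions of kotlinx-io core.

```kotlin
import kotlinx.io.Buffer
import kotlinx.io.Source
import kotlinx.io.readString
import kotlinx.io.writeString

// Hypothetical extension, not part of kotlinx-io. Skips a leading UTF-8 BOM
// (EF BB BF) and reports whether one was found. The caller decides when the
// source is a document and therefore when calling this makes sense.
fun Source.skipUtf8Bom(): Boolean {
    // request() buffers at least 3 bytes without consuming them;
    // it returns false when the source holds fewer than 3 bytes.
    if (!request(3)) return false
    // Reads on the peek source do not consume bytes from this source.
    val peeked = peek()
    val hasBom = peeked.readByte() == 0xEF.toByte() &&
        peeked.readByte() == 0xBB.toByte() &&
        peeked.readByte() == 0xBF.toByte()
    if (hasBom) skip(3)
    return hasBom
}

// Usage sketch: a Buffer stands in for a file or response body.
fun readDocument(): String {
    val document = Buffer().apply {
        write(byteArrayOf(0xEF.toByte(), 0xBB.toByte(), 0xBF.toByte()))
        writeString("hello")
    }
    document.skipUtf8Bom()
    return document.readString()
}
```

Because the check is an explicit call, plain `readString` keeps treating its input as arbitrary bytes and nothing is assumed about documents.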

fzhinkin · 2023-10-09T10:02:55Z

@AzimMuradov thanks for opening the issue.

There were some thoughts about supporting encodings other than UTF-8, but currently there are no concrete plans for when or how they will be supported.

Could you please clarify what kind of UTF-8 BOM support you're expecting from kotlinx-io?
As Jake wrote, BOMs are a document-level concept, so we can't simply skip a BOM-like prefix on every readString call.

fzhinkin added the encodings label Oct 12, 2023

fzhinkin closed this as not planned Jan 18, 2024
