Support encodings other than UTF-8 and support BOM handling #226

Closed
AzimMuradov opened this issue Oct 2, 2023 · 2 comments


AzimMuradov commented Oct 2, 2023

Different encoding formats

Although UTF-8 is quite popular these days, the sad reality is that sometimes we need to handle other encodings (UTF-16, windows-1251, etc.).

See also:

Handling of byte order mark (BOM)

There are also problems with BOMs. There should be an API for taking them into account in UTF-16 and other formats that are affected by BOMs. Sometimes I even encounter them in UTF-8 encoded files (which is allowed, but not recommended by the Unicode standard). In this case, the BOM should be stripped whenever possible.

Regarding the UTF-8 with BOM issue, I think it should be kotlinx-io's responsibility.
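
To make the request concrete, here is a minimal sketch of stripping a UTF-8 BOM before decoding. It uses only the Kotlin standard library; `decodeUtf8SkippingBom` is a hypothetical helper name for illustration, not an existing kotlinx-io API.

```kotlin
// Hypothetical helper (not part of kotlinx-io): decodes UTF-8 bytes,
// skipping a leading UTF-8 BOM (EF BB BF) when one is present.
fun ByteArray.decodeUtf8SkippingBom(): String {
    val hasBom = size >= 3 &&
        this[0] == 0xEF.toByte() &&
        this[1] == 0xBB.toByte() &&
        this[2] == 0xBF.toByte()
    val start = if (hasBom) 3 else 0
    return decodeToString(startIndex = start, endIndex = size)
}

fun main() {
    val bom = byteArrayOf(0xEF.toByte(), 0xBB.toByte(), 0xBF.toByte())
    val withBom = bom + "hello".encodeToByteArray()
    println(withBom.decodeUtf8SkippingBom())                     // prints "hello"
    println("hello".encodeToByteArray().decodeUtf8SkippingBom()) // prints "hello"
}
```

The check runs only once, at the start of the byte array, so it fits the case where the bytes are known to be a whole document such as a file.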

See also:

Related discussions:

@JakeWharton
Contributor

To what API are you referring as needing BOM support?

BOMs are a document-level concept and much of this library deals with arbitrary bytes that could come from database rows, HTTP/2 frames, files, and many more. You don't want to be checking for BOMs any time strings are requested, but if you have an API that goes directly from document to string they can be queried.

For example, Okio doesn't do anything with BOMs. But OkHttp, which deals with response documents, will do BOM detection: https://github.com/square/okhttp/blob/master/okhttp/src/jvmMain/kotlin/okhttp3/internal/-UtilJvm.kt#L89-L99. But even with OkHttp you can read bytes and sometimes have to do BOM detection even later as Retrofit does: https://github.com/square/retrofit/blob/master/retrofit-converters/moshi/src/main/java/retrofit2/converter/moshi/MoshiResponseBodyConverter.java#L40-L44.

This library certainly already facilitates BOM detection with its APIs, but building it in requires care not to assume the use of documents.
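
As an illustration of document-level detection, the sketch below inspects the first bytes of a document and reports which BOM, if any, is present. It is written against plain `ByteArray` with the Kotlin standard library only; `DetectedBom` and `detectBom` are hypothetical names, and this is not the OkHttp or kotlinx-io implementation.

```kotlin
// Hypothetical result type for illustration: the encoding a BOM implies
// and how many bytes the BOM occupies.
data class DetectedBom(val encoding: String, val length: Int)

// Signatures are listed so that UTF-32LE (FF FE 00 00) is tested before
// UTF-16LE (FF FE), because the former starts with the latter.
private val BOMS = listOf(
    DetectedBom("UTF-8", 3) to intArrayOf(0xEF, 0xBB, 0xBF),
    DetectedBom("UTF-32BE", 4) to intArrayOf(0x00, 0x00, 0xFE, 0xFF),
    DetectedBom("UTF-32LE", 4) to intArrayOf(0xFF, 0xFE, 0x00, 0x00),
    DetectedBom("UTF-16BE", 2) to intArrayOf(0xFE, 0xFF),
    DetectedBom("UTF-16LE", 2) to intArrayOf(0xFF, 0xFE),
)

// Returns the BOM found at the start of `head`, or null if there is none.
// The caller decides whether `head` really is the start of a document.
fun detectBom(head: ByteArray): DetectedBom? =
    BOMS.firstOrNull { (_, signature) ->
        head.size >= signature.size &&
            signature.indices.all { (head[it].toInt() and 0xFF) == signature[it] }
    }?.first

fun main() {
    val utf16le = byteArrayOf(0xFF.toByte(), 0xFE.toByte(), 0x41, 0x00)
    println(detectBom(utf16le)?.encoding)                  // prints "UTF-16LE"
    println(detectBom("plain".encodeToByteArray()))        // prints "null"
}
```

Keeping detection as a separate function that the caller invokes explicitly matches the point above: only code that knows it holds a document should look for a BOM.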

@fzhinkin
Collaborator

fzhinkin commented Oct 9, 2023

@AzimMuradov thanks for opening the issue.

There were some thoughts regarding supporting encodings other than UTF-8, but currently, there are no particular plans on when and how it'll be supported.

Could you please clarify what kind of UTF-8 BOM support you're expecting from kotlinx-io?
As Jake wrote, BOMs are a document-level concept, so we can't simply skip a BOM-like prefix on every readString call.

@fzhinkin closed this as not planned on Jan 18, 2024

3 participants