Suggestion: Add bit_array.to_lossy_string() or similar #797
Comments
Does the following do what you're after? If it can't match a UTF-8 code point then it inserts the replacement character and tries again with the next byte.

    import gleam/string

    pub fn to_string_lossy(bits: BitArray) -> String {
      to_string_lossy_impl(bits, "")
    }

    fn to_string_lossy_impl(bits: BitArray, acc: String) -> String {
      case bits {
        // Valid UTF-8 codepoint: append it and continue.
        <<x:utf8_codepoint, rest:bits>> ->
          to_string_lossy_impl(rest, acc <> string.from_utf_codepoints([x]))
        // Invalid byte: append U+FFFD and retry from the next byte.
        <<_, rest:bits>> -> to_string_lossy_impl(rest, acc <> "�")
        // Nothing left (or fewer than 8 trailing bits): done.
        _ -> acc
      }
    }
Pretty much, I think. But it should simply drop the unknown char.
Sure, yes, the replacement char could be empty and/or configurable. The Rust function linked above adds U+FFFD, so the proposed Gleam code currently matches that behaviour.
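To make the behaviour being discussed concrete, here is a minimal sketch that runs the first helper on a sample input. The `main` function and the sample bytes are illustrative assumptions, not part of the proposal; `0xFF` is used because it can never start a valid UTF-8 sequence.

```gleam
import gleam/io
import gleam/string

pub fn to_string_lossy(bits: BitArray) -> String {
  to_string_lossy_impl(bits, "")
}

fn to_string_lossy_impl(bits: BitArray, acc: String) -> String {
  case bits {
    // Valid UTF-8 codepoint: append it and continue.
    <<x:utf8_codepoint, rest:bits>> ->
      to_string_lossy_impl(rest, acc <> string.from_utf_codepoints([x]))
    // Invalid byte: append U+FFFD and retry from the next byte.
    <<_, rest:bits>> -> to_string_lossy_impl(rest, acc <> "\u{FFFD}")
    _ -> acc
  }
}

// Hypothetical entry point, only here to show the result.
pub fn main() {
  let input = <<"ab":utf8, 0xFF, "cd":utf8>>
  // Prints "ab�cd": the single invalid byte becomes one U+FFFD.
  io.println(to_string_lossy(input))
}
```

Dropping the invalid byte instead, as requested above, would only require replacing `"\u{FFFD}"` with `""` in the second clause.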
Being able to configure the behaviour when it fails would be useful. Would we want a fixed replacement or would we want a function that offers the non-unicode bit array and you pick a substitution?
There are some use cases for having full control of the substitution, so maybe this:

    import gleam/string

    pub fn to_string_lossy(
      bits: BitArray,
      map_invalid_byte: fn(Int) -> String,
    ) -> String {
      to_string_lossy_impl(bits, map_invalid_byte, "")
    }

    fn to_string_lossy_impl(
      bits: BitArray,
      map_invalid_byte: fn(Int) -> String,
      acc: String,
    ) -> String {
      case bits {
        // Valid UTF-8 codepoint: append it and continue.
        <<x:utf8_codepoint, rest:bits>> ->
          to_string_lossy_impl(
            rest,
            map_invalid_byte,
            acc <> string.from_utf_codepoints([x]),
          )
        // Invalid byte: let the caller choose the substitution.
        <<x, rest:bits>> ->
          to_string_lossy_impl(rest, map_invalid_byte, acc <> map_invalid_byte(x))
        _ -> acc
      }
    }

The above isn't compatible with the JavaScript target; I can rework it to that end once the function signature is stabilised.
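As an illustration of what "full control of the substitution" could look like with the `fn(Int) -> String` callback, the sketch below escapes each invalid byte as hex. The `escape_byte` helper and the `main` function are hypothetical names introduced for this example only; the sketch assumes the Erlang target.

```gleam
import gleam/int
import gleam/io
import gleam/string

pub fn to_string_lossy(
  bits: BitArray,
  map_invalid_byte: fn(Int) -> String,
) -> String {
  to_string_lossy_impl(bits, map_invalid_byte, "")
}

fn to_string_lossy_impl(
  bits: BitArray,
  map_invalid_byte: fn(Int) -> String,
  acc: String,
) -> String {
  case bits {
    <<x:utf8_codepoint, rest:bits>> ->
      to_string_lossy_impl(
        rest,
        map_invalid_byte,
        acc <> string.from_utf_codepoints([x]),
      )
    <<x, rest:bits>> ->
      to_string_lossy_impl(rest, map_invalid_byte, acc <> map_invalid_byte(x))
    _ -> acc
  }
}

// Hypothetical callback: render an invalid byte as "\x" followed by hex.
fn escape_byte(byte: Int) -> String {
  "\\x" <> int.to_base16(byte)
}

// Hypothetical entry point, only here to show the result.
pub fn main() {
  let input = <<"ab":utf8, 0xFF, "cd":utf8>>
  // Prints: ab\xFFcd
  io.println(to_string_lossy(input, escape_byte))
}
```

Passing `fn(_) { "" }` drops invalid bytes, and `fn(_) { "\u{FFFD}" }` reproduces the fixed-replacement behaviour of the earlier version.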
What about when it's not a byte-aligned bit array? Would be nice to map the final bits rather than always delete them.
How about this, which parses any trailing bits at the end as a final codepoint rather than dropping them:

    import gleam/bit_array
    import gleam/string

    pub fn to_string_lossy(
      bits: BitArray,
      map_invalid_byte: fn(Int) -> String,
    ) -> String {
      to_string_lossy_impl(bits, map_invalid_byte, "")
    }

    fn to_string_lossy_impl(
      bits: BitArray,
      map_invalid_byte: fn(Int) -> String,
      acc: String,
    ) -> String {
      case bits {
        // Valid UTF-8 codepoint: append it and continue.
        <<x:utf8_codepoint, rest:bits>> ->
          to_string_lossy_impl(
            rest,
            map_invalid_byte,
            acc <> string.from_utf_codepoints([x]),
          )
        // Invalid byte: let the caller choose the substitution.
        <<x, rest:bits>> ->
          to_string_lossy_impl(rest, map_invalid_byte, acc <> map_invalid_byte(x))
        // Fewer than 8 bits remain.
        _ ->
          case bit_array.bit_size(bits) {
            0 -> acc
            s -> {
              // Read the 1 to 7 trailing bits as an integer (0 to 127),
              // which is always a valid codepoint.
              let assert <<x:size(s)>> = bits
              let assert Ok(cp) = string.utf_codepoint(x)
              acc <> string.from_utf_codepoints([cp])
            }
          }
      }
    }
It seems incorrect to me to use a different mapping function for those bits. If we are to let the programmer configure it then it should always be up to the programmer how to handle invalid bits, rather than only when there's at least 1 byte.
Ok, this changes the signature of the mapping function to take a BitArray:

    import gleam/string

    pub fn to_string_lossy(
      bits: BitArray,
      map_invalid_bits: fn(BitArray) -> String,
    ) -> String {
      to_string_lossy_impl(bits, map_invalid_bits, "")
    }

    fn to_string_lossy_impl(
      bits: BitArray,
      map_invalid_bits: fn(BitArray) -> String,
      acc: String,
    ) -> String {
      case bits {
        <<>> -> acc
        // Valid UTF-8 codepoint: append it and continue.
        <<x:utf8_codepoint, rest:bits>> ->
          to_string_lossy_impl(
            rest,
            map_invalid_bits,
            acc <> string.from_utf_codepoints([x]),
          )
        // Invalid byte: pass it to the callback as a one-byte bit array.
        <<x, rest:bits>> ->
          to_string_lossy_impl(rest, map_invalid_bits, acc <> map_invalid_bits(<<x>>))
        // Fewer than 8 trailing bits: pass them to the same callback.
        _ -> acc <> map_invalid_bits(bits)
      }
    }
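A self-contained sketch of the `fn(BitArray) -> String` variant, with a usage example covering both an invalid byte and trailing non-byte-aligned bits. Two things here are assumptions rather than part of the thread: the invalid byte is wrapped as `<<x>>` so that it type-checks against the callback's `BitArray` parameter, and the `main` function with its sample input is purely illustrative. The sketch assumes the Erlang target.

```gleam
import gleam/io
import gleam/string

pub fn to_string_lossy(
  bits: BitArray,
  map_invalid_bits: fn(BitArray) -> String,
) -> String {
  to_string_lossy_impl(bits, map_invalid_bits, "")
}

fn to_string_lossy_impl(
  bits: BitArray,
  map_invalid_bits: fn(BitArray) -> String,
  acc: String,
) -> String {
  case bits {
    <<>> -> acc
    <<x:utf8_codepoint, rest:bits>> ->
      to_string_lossy_impl(
        rest,
        map_invalid_bits,
        acc <> string.from_utf_codepoints([x]),
      )
    // Assumption: hand the invalid byte to the callback as a 1-byte BitArray.
    <<x, rest:bits>> ->
      to_string_lossy_impl(rest, map_invalid_bits, acc <> map_invalid_bits(<<x>>))
    // Fewer than 8 trailing bits: same callback, so the caller decides.
    _ -> acc <> map_invalid_bits(bits)
  }
}

// Hypothetical entry point, only here to show the results.
pub fn main() {
  // One invalid byte in the middle and three trailing bits at the end.
  let input = <<"ab":utf8, 0xFF, "cd":utf8, 1:size(3)>>

  // Drop everything invalid. Prints: abcd
  io.println(to_string_lossy(input, fn(_) { "" }))

  // Substitute U+FFFD for each invalid chunk. Prints: ab�cd�
  io.println(to_string_lossy(input, fn(_) { "\u{FFFD}" }))
}
```

With this shape, the caller handles an invalid byte and leftover trailing bits through the same callback, which is the point raised in the preceding comment.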
That sounds good!
There is currently no (easy) way to convert a BitArray that contains non-UTF-8 data to a String.
This is useful when, for example, you need to handle file paths and you're not concerned with the exact naming but want a best-effort conversion.
In Rust there are OsStr and Path derivations for this. A first step could be to implement something like Rust's to_string_lossy.