Suggestion: Add `bit_array.to_lossy_string()` or similar #797

JonasHedEng · 2025-01-28T17:57:16Z

There is currently no (easy) way to convert a BitArray that contains bytes which are not valid UTF-8 into a String.
This would be useful when, for example, you need to handle file paths and you're not concerned with the exact naming but want a best-effort conversion.

In Rust there are `OsStr` and the `Path` derivations for this. A first step could be to implement something like Rust's `to_string_lossy`:

fn to_lossy_string(bytes: BitArray) -> String {
    todo
}


richard-viney · 2025-01-29T11:46:11Z

Does the following do what you're after? If it can't match a UTF-8 code point then it inserts the replacement character and tries again with the next byte.

import gleam/string

pub fn to_string_lossy(bits: BitArray) -> String {
  to_string_lossy_impl(bits, "")
}

fn to_string_lossy_impl(bits: BitArray, acc: String) -> String {
  case bits {
    <<x:utf8_codepoint, rest:bits>> ->
      to_string_lossy_impl(rest, acc <> string.from_utf_codepoints([x]))
    <<_, rest:bits>> -> to_string_lossy_impl(rest, acc <> "�")
    _ -> acc
  }
}

JonasHedEng · 2025-01-29T12:16:08Z

Pretty much, I think. But it should simply drop the unknown char
to_string_lossy_impl(rest, acc <> "�") -> to_string_lossy_impl(rest, acc)

richard-viney · 2025-01-29T12:37:54Z

Sure, yes, the replacement character could be empty and/or configurable. The Rust function linked above inserts U+FFFD, so the proposed Gleam code currently matches that behaviour.

lpil · 2025-02-02T17:29:23Z

Being able to configure the behaviour when it fails would be useful. Would we want a fixed replacement or would we want a function that offers the non-unicode bit array and you pick a substitution?
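
To make the two options concrete: a fixed replacement can always be expressed through the function-based option by ignoring the argument, so the callback form is the more general of the two. A minimal sketch, assuming the callback receives the invalid data as a `BitArray`; the helper name `always` is hypothetical and not part of the standard library:

```gleam
// Hypothetical helper: wraps a fixed replacement string in a callback that
// ignores the invalid bits it is handed.
pub fn always(replacement: String) -> fn(BitArray) -> String {
  fn(_invalid) { replacement }
}
```

With something like this, `always("\u{FFFD}")` would give the Rust-style behaviour and `always("")` would drop invalid input entirely.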

richard-viney · 2025-02-02T20:36:46Z

There are some use cases for having full control of the substitution, so maybe this:

import gleam/string

pub fn to_string_lossy(
  bits: BitArray,
  map_invalid_byte: fn(Int) -> String,
) -> String {
  to_string_lossy_impl(bits, map_invalid_byte, "")
}

fn to_string_lossy_impl(
  bits: BitArray,
  map_invalid_byte: fn(Int) -> String,
  acc: String,
) -> String {
  case bits {
    <<x:utf8_codepoint, rest:bits>> ->
      to_string_lossy_impl(
        rest,
        map_invalid_byte,
        acc <> string.from_utf_codepoints([x]),
      )

    <<x, rest:bits>> ->
      to_string_lossy_impl(rest, map_invalid_byte, acc <> map_invalid_byte(x))

    _ -> acc
  }
}

The above isn't compatible with the JavaScript target. I can rework it to that end once the function signature is stabilised.

lpil · 2025-02-04T11:17:56Z

What about when it's not a byte-aligned bit array? Would be nice to map the final bits rather than always delete them.
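
To illustrate the concern: in the version above, the final `_ -> acc` branch matches both an empty bit array and a leftover of fewer than 8 bits, so the leftover is discarded without the mapping function ever seeing it. A small sketch of how those cases fall through the patterns (the function `describe_remainder` is hypothetical, for illustration only, and assumes a target that supports non-byte-aligned bit arrays, such as Erlang):

```gleam
// Hypothetical illustration: shows which branch a remaining bit array hits.
pub fn describe_remainder(bits: BitArray) -> String {
  case bits {
    // Nothing left to decode.
    <<>> -> "empty"
    // At least 8 bits remain, so a byte-wise branch can consume them.
    <<_, _:bits>> -> "at least one whole byte"
    // Between 1 and 7 bits remain: no byte-wise pattern matches, so a
    // catch-all branch is the only place these bits can be handled.
    _ -> "only a partial byte left"
  }
}
```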

richard-viney · 2025-02-04T11:49:45Z

How about this, which parses any trailing bits at the end as a final codepoint rather than dropping them:

import gleam/bit_array
import gleam/string

pub fn to_string_lossy(
  bits: BitArray,
  map_invalid_byte: fn(Int) -> String,
) -> String {
  to_string_lossy_impl(bits, map_invalid_byte, "")
}

fn to_string_lossy_impl(
  bits: BitArray,
  map_invalid_byte: fn(Int) -> String,
  acc: String,
) -> String {
  case bits {
    <<x:utf8_codepoint, rest:bits>> ->
      to_string_lossy_impl(
        rest,
        map_invalid_byte,
        acc <> string.from_utf_codepoints([x]),
      )

    <<x, rest:bits>> ->
      to_string_lossy_impl(rest, map_invalid_byte, acc <> map_invalid_byte(x))

    _ ->
      case bit_array.bit_size(bits) {
        0 -> acc
        s -> {
          let assert <<x:size(s)>> = bits
          let assert Ok(cp) = string.utf_codepoint(x)

          acc <> string.from_utf_codepoints([cp])
        }
      }
  }
}

lpil · 2025-02-04T11:52:08Z

It seems incorrect to me to use a different mapping function for those bits. If we are to let the programmer configure it, then it should always be up to the programmer how to handle invalid bits, rather than only when there's at least 1 byte.

richard-viney · 2025-02-04T12:58:25Z

Ok, this changes the signature of the mapping function to take a BitArray, and any trailing partial byte is also passed to it:

import gleam/string

pub fn to_string_lossy(
  bits: BitArray,
  map_invalid_bits: fn(BitArray) -> String,
) -> String {
  to_string_lossy_impl(bits, map_invalid_bits, "")
}

fn to_string_lossy_impl(
  bits: BitArray,
  map_invalid_bits: fn(BitArray) -> String,
  acc: String,
) -> String {
  case bits {
    <<>> -> acc

    <<x:utf8_codepoint, rest:bits>> ->
      to_string_lossy_impl(
        rest,
        map_invalid_bits,
        acc <> string.from_utf_codepoints([x]),
      )

    // Wrap the invalid byte in a BitArray to match the callback's type
    <<x, rest:bits>> ->
      to_string_lossy_impl(
        rest,
        map_invalid_bits,
        acc <> map_invalid_bits(<<x>>),
      )

    _ -> acc <> map_invalid_bits(bits)
  }
}
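
As an example of the kind of full control this signature allows, here is a sketch of a callback that keeps invalid data visible instead of replacing or dropping it. The name `escape_invalid` and its output format are assumptions made for illustration, not part of the proposal:

```gleam
import gleam/bit_array
import gleam/int

// Hypothetical callback: renders whatever could not be decoded so that it
// stays visible in the resulting string.
pub fn escape_invalid(invalid: BitArray) -> String {
  case invalid {
    // A single undecodable byte becomes a hex escape, e.g. \xFF.
    <<byte>> -> "\\x" <> int.to_base16(byte)
    // Anything else (such as a trailing partial byte) is summarised by its
    // size in bits.
    _ -> "<" <> int.to_string(bit_array.bit_size(invalid)) <> " bits>"
  }
}
```

Passed as `map_invalid_bits`, a callback like this would render a stray `0xFF` byte as `\xFF`, and a trailing 3-bit remainder as `<3 bits>`.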

lpil · 2025-02-05T15:44:18Z

That sounds good!

richard-viney linked a pull request Feb 6, 2025 that will close this issue

Add bit_array.to_string_lossy #800

Open
