No way to (de)serialize a String from binary data? #187
-
I've found myself in a situation where the data I'm reading gives strings of up to N bytes. In my example I'll use 16:

#[derive(BinRead, PartialEq, Debug)]
#[br(little)]
struct BinaryResourceData {
    #[br(count = 16)]
    reference: Vec<u8>,
    type_id: u16,
    id: u32,
}

impl BinaryResourceData {
    pub fn name(&self) -> String {
        std::str::from_utf8(&self.reference)
            .expect("Unable to read the data from the reference.")
            .trim_matches('\x00')
            .to_owned()
    }
}

It seems a little unusual to me that there's no native way to deserialize a String directly.
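As a point of reference, the trimming step in name() above can be reproduced with the standard library alone. This is a sketch; the helper name is hypothetical and nothing here depends on binrw:

```rust
// Std-only sketch of the trimming step from name() above; the helper
// name is hypothetical and binrw is not involved.
fn padded_to_string(raw: &[u8]) -> Option<String> {
    // Fail (rather than panic) on invalid UTF-8.
    let s = std::str::from_utf8(raw).ok()?;
    // trim_end_matches strips only trailing NULs; trim_matches would
    // also strip leading ones, which padded fields normally don't have.
    Some(s.trim_end_matches('\0').to_owned())
}

fn main() {
    let raw: [u8; 16] = *b"hello\0\0\0\0\0\0\0\0\0\0\0";
    assert_eq!(padded_to_string(&raw).unwrap(), "hello");
}
```

Returning Option instead of calling expect lets the caller decide how to handle non-UTF-8 bytes.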
Replies: 3 comments 4 replies
-
Thank you for the possible answers. I'm sure most of my trouble here is down to my very limited Rust knowledge to date. With some help we came up with the following:

#[derive(NamedArgs, Clone)]
struct PaddedStringArgs {
    count: usize,
}
#[binrw::parser(reader)]
fn padded_string_parser(args: PaddedStringArgs, ...) -> binrw::BinResult<String> {
    let pos = reader.stream_position()?;
    let mut bytes = Vec::with_capacity(args.count);
    let bytes_read = reader.take(args.count as u64).read_to_end(&mut bytes)?;
    // Surface a binrw error instead of silently accepting a short read.
    if bytes_read != args.count {
        return Err(binrw::Error::AssertFail {
            pos,
            message: format!("expected {} bytes, read only {}", args.count, bytes_read),
        });
    }
    // Report invalid UTF-8 as an error rather than panicking.
    let slice = std::str::from_utf8(&bytes).map_err(|e| binrw::Error::AssertFail {
        pos,
        message: e.to_string(),
    })?;
    Ok(slice.trim_end_matches('\0').to_owned())
}
#[derive(BinRead, PartialEq, Debug)]
#[br(little)]
struct BinaryResourceData {
    #[br(count = 16, parse_with = padded_string_parser)]
    reference: String,
    type_id: u16,
    id: u32,
}

The reason this is preferred is that it takes a fixed size of binary data (16 bytes) and then trims the null padding, which is exactly what I need. I initially opened this issue thinking I may have missed some trickery with the macro magic, but I see the issue is much wider than anticipated.
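For readers without binrw on hand, the body of that parser can be mirrored with std alone. This is a sketch: the helper name and sample bytes are hypothetical, and std::io::Error stands in for binrw's error type:

```rust
use std::io::{Cursor, Read};

// Std-only mirror of the padded-string parser above: read exactly
// `count` bytes, validate UTF-8, and trim trailing NUL padding.
// Hypothetical helper; std::io::Error stands in for binrw's error type.
fn read_padded_string<R: Read>(reader: &mut R, count: usize) -> std::io::Result<String> {
    let mut bytes = Vec::with_capacity(count);
    let bytes_read = reader.take(count as u64).read_to_end(&mut bytes)?;
    if bytes_read != count {
        return Err(std::io::Error::new(
            std::io::ErrorKind::UnexpectedEof,
            format!("expected {count} bytes, read {bytes_read}"),
        ));
    }
    let s = std::str::from_utf8(&bytes)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))?;
    Ok(s.trim_end_matches('\0').to_owned())
}

fn main() {
    // A 16-byte padded field followed by the next field's bytes.
    let mut cursor = Cursor::new(b"swpc_torch01\0\0\0\0\x01\x00".to_vec());
    assert_eq!(read_padded_string(&mut cursor, 16).unwrap(), "swpc_torch01");
    // take() leaves the cursor positioned right after the field,
    // so subsequent fields parse from the correct offset.
    let mut rest = Vec::new();
    cursor.read_to_end(&mut rest).unwrap();
    assert_eq!(rest, vec![0x01, 0x00]);
}
```

Using Read::take caps the read at `count` bytes even when the underlying stream is longer, which is what keeps the following fields aligned.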
-
Posting this compact implementation of read/write of UTF-8 strings prefixed with a u32 size, for reference, because I was asked about the topic:

use std::io::Cursor;
use binrw::*; // binrw = "0.11"

#[binrw]
struct SizedUTF8 {
    #[br( temp )]
    #[bw( calc = content.as_bytes().len() as u32 )]
    embedded_size: u32,
    #[br( count = embedded_size, try_map = |data: Vec<u8>| std::str::from_utf8(&data).map(|s| s.to_string()) )]
    #[bw( map = |s| s.as_bytes().to_vec() )]
    content: String
}

// read
let mut cursor = Cursor::new(b"\x00\x00\x00\x05hello".to_vec());
let blah = SizedUTF8::read_be(&mut cursor).unwrap();
assert_eq!(&blah.content, "hello");

// write
let mut cursor = Cursor::new(Vec::new());
let blah = SizedUTF8 { content: "goodbye".to_string() };
blah.write_be(&mut cursor).unwrap();
assert_eq!(&cursor.into_inner(), b"\x00\x00\x00\x07goodbye");

You can modify this to change the size type or make the struct take an arg; I just wanted to post an example that doesn't use a custom parser or look too scary.
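The wire format that example produces can be checked with std alone. These encode/decode helpers are hypothetical stand-ins for illustration, not part of binrw:

```rust
// Std-only encode/decode of the same wire format the struct above uses:
// a big-endian u32 byte length followed by the UTF-8 bytes. Helper
// names are hypothetical; binrw is not involved.
fn encode(s: &str) -> Vec<u8> {
    let mut out = (s.len() as u32).to_be_bytes().to_vec();
    out.extend_from_slice(s.as_bytes());
    out
}

fn decode(buf: &[u8]) -> Option<String> {
    // First four bytes: big-endian length prefix.
    let len = u32::from_be_bytes(buf.get(..4)?.try_into().ok()?) as usize;
    // Remaining `len` bytes: the UTF-8 payload.
    let bytes = buf.get(4..4 + len)?;
    String::from_utf8(bytes.to_vec()).ok()
}

fn main() {
    assert_eq!(encode("hello").as_slice(), b"\x00\x00\x00\x05hello");
    assert_eq!(decode(b"\x00\x00\x00\x07goodbye").unwrap(), "goodbye");
}
```

Note that `s.len()` on a &str is the byte length, not the character count, which is exactly what a byte-size prefix needs for non-ASCII content.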
String has no default implementation because there is no obvious canonical representation of string data like there is for most other primitive types. Strings may be null-terminated, dollar-sign-terminated, length-prefixed (Pascal), fixed-length space-padded, fixed-length null-padded, length-prefixed and null-terminated, delimited with quotes, in a big block with a separate lookup table, etc. The encoding could be UTF-8, WTF-8, UTF-16, Win-1252, MacRoman, ISO-8859, Shift-JIS, EBCDIC, etc.

If you have a fixed-length array containing string data you know is ASCII or UTF-8 then you can do basically what you are now, or you can do something like

#[br(try_map = |data: [u8; 16]| str::from_utf8(…
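To make that variety concrete, here is a std-only sketch of how the same payload looks in three of the layouts listed above. The helper names are hypothetical illustrations, not binrw APIs:

```rust
// Three of the layouts mentioned above, for the same payload. Helper
// names are hypothetical illustrations, not binrw APIs.

// C style: payload followed by a single NUL byte.
fn null_terminated(s: &str) -> Vec<u8> {
    let mut out = s.as_bytes().to_vec();
    out.push(0);
    out
}

// Pascal style: a single u8 length prefix (payload capped at 255 bytes).
fn pascal(s: &str) -> Vec<u8> {
    let mut out = vec![s.len() as u8];
    out.extend_from_slice(s.as_bytes());
    out
}

// Fixed-length field of `width` bytes, padded with NULs.
fn fixed_null_padded(s: &str, width: usize) -> Vec<u8> {
    let mut out = s.as_bytes().to_vec();
    out.resize(width, 0); // also truncates if the payload is too long
    out
}

fn main() {
    assert_eq!(null_terminated("ab"), vec![b'a', b'b', 0]);
    assert_eq!(pascal("ab"), vec![2, b'a', b'b']);
    assert_eq!(fixed_null_padded("ab", 4), vec![b'a', b'b', 0, 0]);
}
```

The same five bytes of text yield three different on-disk sizes, which is why a format-agnostic library cannot pick one of these as "the" String representation.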