Resolve z_charset confusion and byte-swapping issue. #127

davidben · 2014-03-22T17:49:59Z

This is probably a better venue for discussion than -c davidben. Alright, so a while ago there were some ramblings on -c zephyr-dev about the z_charset field and how messed up it is. From memory, here's a summary of the situation:

Zephyr 3 added a charset field to notices with the character set of the message. It can have values UTF-8, ISO-8859-1, and UNKNOWN.
Prior to that, zephyrgrams didn't really have any character set associated with them.
Old zwrite does not set the charset field and just dumps the bytes it receives over the wire.
BarnOwl ignores the charset field on receive and sniffs for valid UTF-8. Yes => UTF-8, no => ISO-8859-1.
New zwgc interprets the charset field and converts accordingly before displaying the notice.
Empirically, from trying to do the right thing in Roost, there exist senders which tag as ISO-8859-1 and send as UTF-8. I think it was mostly bots, but I forget if any humans managed it too. I had to back out of doing it in Roost. Roost currently blindly assumes all messages are UTF-8, which seems to work pretty much okay, though it should grow BarnOwl's sniffing logic as I have seen ISO-8859-1 messages in the wild. (Unfortunately, I'm dumb and fail to save either the z_charset field or the original bytes, so we don't have historical data here.)
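
For reference, the BarnOwl-style sniff described above amounts to something like the following. This is a minimal sketch in Python rather than BarnOwl's actual code, and `sniff_decode` is a name made up for illustration.

```python
from typing import Tuple


def sniff_decode(body: bytes) -> Tuple[str, str]:
    """Return (text, guessed_encoding) for a raw message body.

    Hypothetical helper: valid UTF-8 is treated as UTF-8,
    anything else falls back to ISO-8859-1.
    """
    try:
        return body.decode("utf-8"), "UTF-8"
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return body.decode("iso-8859-1"), "ISO-8859-1"


# "café" encoded both ways: the lone 0xE9 byte is not valid UTF-8.
print(sniff_decode("caf\u00e9".encode("utf-8")))
print(sniff_decode("caf\u00e9".encode("iso-8859-1")))
```

The fallback is safe only in the sense that it never fails. A short ISO-8859-1 message that happens to be valid UTF-8 would still be misread, which is the usual cost of sniffing.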

In addition, I discovered a new issue today. z_charset is endian-confused over the wire! This line (and a corresponding one for formatting notices) shouldn't be there.
https://github.com/zephyr-im/zephyr/blob/master/lib/ZParseNot.c#L292

So, to add to our situation list:

Messages from a big-endian machine received on a little-endian machine and vice versa see the charset fields byte-swapped from each other.
I assert the vast majority of zephyr senders and receivers are on little-endian machines.
But there do exist multics.mit.edu users and perhaps others.
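
To make the byte-swap concrete, here is a small Python sketch of what a 16-bit charset value looks like when one side treats it as little-endian and the other as big-endian. The numeric constants are the IANA MIBenum values, which I believe are what zephyr.h uses, but treat them as an assumption and check the header. The helper names are made up for illustration.

```python
import struct

# Assumed values (IANA MIBenum numbers); verify against zephyr.h.
ZCHARSET_UNKNOWN = 0
ZCHARSET_ISO_8859_1 = 4
ZCHARSET_UTF_8 = 106


def byteswap16(value: int) -> int:
    """Swap the two bytes of a 16-bit unsigned integer."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


def seen_by_opposite_endian(value: int) -> int:
    """Lay out a uint16 in little-endian order, then read it as big-endian."""
    return struct.unpack(">H", struct.pack("<H", value))[0]


for name, value in [("UNKNOWN", ZCHARSET_UNKNOWN),
                    ("ISO-8859-1", ZCHARSET_ISO_8859_1),
                    ("UTF-8", ZCHARSET_UTF_8)]:
    swapped = seen_by_opposite_endian(value)
    print(f"{name}: 0x{value:04x} -> 0x{swapped:04x}")
```

Under these assumed values, UNKNOWN is 0 and survives the swap unchanged, so only the two real charset tags are affected.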

This is a mess. It should get resolved.

So, I'm uneasy about switching Roost back over to assuming ISO-8859-1-tagged messages are actually telling the truth because I've been burned by that before. I also think protocols should minimize variability for the sake of sanity. (And, for entirely selfish reasons, I'm working on a new from-scratch implementation and don't want more test vectors in my unit tests.) Here are two proposals I think I would be happy with to start things off:

Proposal davidben-there-is-no-multics

UTF-8 is the One True Encoding.
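From now on, the correct encoding of the charset field is little-endian. Change the htons calls in libzephyr to something that byteswaps unconditionally, on both big- and little-endian hosts, so that big-endian machines produce the same wire bytes little-endian machines produce today.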
All new senders send UTF-8 over the wire and write ZCHARSET_UTF_8 into the charset field. If a sender doesn't know whether its input is UTF-8 or not, use ZCHARSET_UNKNOWN and make loud noises.
All new receivers, when receiving a message:
- If the charset field is ZCHARSET_UTF_8, assume the message is UTF-8.
- If the charset field is missing or any other value, sniff. If valid UTF-8, it's UTF-8. Otherwise, it's ISO-8859-1.
Senders and receivers dealing with non-UTF-8 have the responsibility to transcode. UTF-8-only senders and UTF-8-only receivers should not care about other encodings apart from the sniff. When the few senders producing non-UTF-8 get fixed, we can move to blindly assuming UTF-8.
- Unfortunately, a non-UTF-8 locale is not enough for zwrite to transcode. Presumably the people sending ISO-8859-1-tagged UTF-8 have some confused configuration that would confuse the new zwrite too. To avoid introducing problems when they upgrade, zwrite does NOT tag with ZCHARSET_UTF_8 if the system is on a non-UTF-8 locale unless -x is explicitly passed and/or maybe some environment variable is set. Maybe print an angry message to stderr or something so we can get those setups fixed.
- "zwrite: Assume UTF-8 rather than ISO-8859-1 in an ASCII locale" (#132) should be sufficient to deal with this.
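
The receiver rule in this proposal can be sketched roughly as follows. This is Python for illustration, not libzephyr code: `decode_notice_body` is a hypothetical name, the constant is the IANA MIBenum value I'm assuming zephyr.h uses, and replacing invalid bytes in a UTF-8-tagged message is my own choice, since the proposal only says to assume UTF-8.

```python
from typing import Optional

ZCHARSET_UTF_8 = 106  # assumed value (IANA MIBenum); verify against zephyr.h


def decode_notice_body(charset: Optional[int], body: bytes) -> str:
    """Decode a message body given its parsed charset field.

    charset is None when the notice carries no charset field at all.
    """
    if charset == ZCHARSET_UTF_8:
        # Tagged UTF-8: trust the tag. Invalid bytes become U+FFFD
        # (an assumption; the proposal does not specify error handling).
        return body.decode("utf-8", errors="replace")
    # Missing field or any other value: ignore the tag and sniff.
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return body.decode("iso-8859-1")


# A message tagged ISO-8859-1 (4) that is really UTF-8 still decodes correctly.
print(decode_notice_body(4, "caf\u00e9".encode("utf-8")))
```

Note that the ISO-8859-1 tag carries no weight here. It goes through the same sniff as a missing field, which is what protects against the mistagged senders mentioned earlier.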

Proposal davidben-okay-maybe-multics-exists

Same as above but replace "ZCHARSET_UTF_8" in the receiver section with "ZCHARSET_UTF_8 or byteswap16(ZCHARSET_UTF_8)". Big-endian senders still follow the rule about little-endian being the correct encoding. Transition back to davidben-there-is-no-multics when all big-endian machines are updated.
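
On the receive side, the only difference from the first proposal is which tag values count as UTF-8. A minimal Python sketch, where `is_tagged_utf8` is a hypothetical name and the constant is the IANA MIBenum value I'm assuming zephyr.h uses:

```python
from typing import Optional

ZCHARSET_UTF_8 = 106  # assumed value (IANA MIBenum); verify against zephyr.h


def byteswap16(value: int) -> int:
    """Swap the two bytes of a 16-bit unsigned integer."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


# Accept the tag as parsed on this host and its byte-swapped twin,
# so notices from opposite-endian senders are still recognized.
UTF_8_TAGS = frozenset({ZCHARSET_UTF_8, byteswap16(ZCHARSET_UTF_8)})


def is_tagged_utf8(charset: Optional[int]) -> bool:
    """True if the parsed charset field should be treated as a UTF-8 tag."""
    return charset in UTF_8_TAGS


print(sorted(hex(tag) for tag in UTF_8_TAGS))
```

Everything else (sniffing for missing or other values) stays as in davidben-there-is-no-multics, so dropping the compatibility case later means shrinking this set back to one value.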

When shiny new Roost finally happens, we can get data on when and how often the backwards-compatibility cases occur to guide when we can drop them.
