huffman code error for single key dict #172

eyaler · 2022-04-13T23:36:40Z

This is an edge case, where huffman_code() fails if the dict has only a single key:

huffman_code(Counter('xxx'))

Out[1]: {'x': bitarray()}

This is not useful and will of course cause encode() to fail:

bitarray().encode(huffman_code(Counter('xxx')), 'xxx')

ValueError: non-empty bitarray expected


eyaler · 2022-04-13T23:49:10Z

This is my fix; I also find it useful to handle the empty case gracefully:

        if len(counter) > 1:
            table = huffman_code(counter)
        elif len(counter) == 1:
            table = {list(counter)[0]: bitarray(1)}
        else:
            table = {}

ilanschnell · 2022-04-14T01:47:46Z

Thank you for using bitarray and discovering this special edge case!

I'm not sure if this is actually a bug. Obviously the Huffman code {'x': bitarray()} isn't very useful. However, Huffman is meant to produce a minimal-length sequence of bits that contains all the information in the original sequence of symbols, assuming that the decoder already knows the set of symbols. If there's only one symbol, the input data contains no information except its length. In Huffman-based data formats, length is usually encoded separately, not as part of the Huffman-encoded bit sequence itself. The decoder of a single-symbol Huffman code therefore has all the information it needs to reconstruct the input without needing to read anything from the Huffman-encoded bit sequence. It is logical, then, that the Huffman encoder's output should be 0 bits long.
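The argument above can be sketched in plain Python. This is only an illustration: huffman_table() is a hypothetical stand-in written with the textbook heap algorithm and '0'/'1' strings, not bitarray's huffman_code(), and the decoder is assumed to receive the symbol set and the message length out of band.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_table(freq):
    """Build a prefix code as {symbol: '0'/'1' string} (hypothetical helper)."""
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(n, next(tiebreak), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, t0 = heapq.heappop(heap)
        n1, _, t1 = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code below them.
        merged = {sym: "0" + code for sym, code in t0.items()}
        merged.update({sym: "1" + code for sym, code in t1.items()})
        heapq.heappush(heap, (n0 + n1, next(tiebreak), merged))
    return heap[0][2] if heap else {}


message = "xxx"
table = huffman_table(Counter(message))           # a single leaf, never merged
encoded = "".join(table[sym] for sym in message)  # zero bits in total

# Assumption: the decoder knows the symbol set and the length separately,
# so it never has to read the (empty) encoded bit sequence.
(only_symbol,) = table
decoded = only_symbol * len(message)
```

With a single symbol the merge loop never runs, so the only code is the empty string and the encoded message is 0 bits long, which matches the {'x': bitarray()} result reported above.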

On the other hand, one could argue that one symbol can be encoded as a single bit (like you did), but would that still be "Huffman code"? Or is this a special case outside the scope of the function huffman_code()?

Note that in your fix bitarray(1) is uninitialized, so the single bit may be either 0 or 1. Either is fine, but it makes the table non-deterministic. So you probably want to be explicit (and replace bitarray(1) with bitarray('0')).


eyaler · 2022-04-19T18:17:43Z

@ilanschnell thanks for your detailed response and commit! (and bitarray of course)

as regards the edge cases:

i am not aware of the practice of encoding the length with huffman. anyway this is not a requirement of huffman, and moreover streaming decoders do exist, so sometimes the length is not defined. specifically i implemented one of the stream decoders from here: https://www.researchgate.net/publication/3159499_On_the_implementation_of_minimum_redundancy_prefix_codes

although encoding the length for a single symbol makes sense from a compression perspective, this takes us to the domain of modified Huffman which makes use of run-length encoding, which sounds like what you described. also as you've mentioned, you still need to save the cleartext symbol in the table.

anyway, i was stress testing some edge cases where huffman encoding/decoding is part of the process and i did want the flow to go through for these degenerate cases as well. so this was more of a pragmatic remark than a purist or philosophical one. while failing on empty strings seems like a reasonable design decision, i would at least expect the single symbol case to work.
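A minimal sketch of why a stream decoder needs at least one bit per symbol, assuming a table of '0'/'1' strings rather than bitarray objects. stream_decode() is a hypothetical helper for illustration only; it is neither the decoder from the linked paper nor part of bitarray.

```python
def stream_decode(table, bits):
    """Yield symbols from an iterable of '0'/'1' characters (hypothetical helper).

    No message length is known in advance, so symbols can only be
    recognized by consuming bits from the stream.
    """
    if any(code == "" for code in table.values()):
        # An empty code would be "present" infinitely often in any stream.
        raise ValueError("every symbol needs a non-empty code for stream decoding")
    lookup = {code: sym for sym, code in table.items()}
    buf = ""
    for bit in bits:
        buf += bit
        if buf in lookup:  # safe for prefix codes: no code is a prefix of another
            yield lookup[buf]
            buf = ""
    if buf:
        raise ValueError("stream ended inside a codeword")


# With a one-bit code the single-symbol case flows through unchanged.
single = list(stream_decode({"x": "0"}, "000"))

# The same decoder handles an ordinary multi-symbol prefix code.
mixed = list(stream_decode({"a": "0", "b": "10", "c": "11"}, "010110"))
```

With the empty code {'x': ''} the helper raises ValueError as soon as it is iterated, which is the pure-Python analogue of the encode() failure shown at the top of the thread.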

as regards bitarray(1), i was indeed aiming for bitarray('1'). as this may be a common cause of bugs, i would suggest requiring a keyword like length/shape/size, e.g. bitarray(length=1), or you could even make the keyword mandatory just for the cases 0 and 1 (assuming only binary codes are supported). otherwise, since this does not seem like a familiar idiom for python array constructors, i find the current behavior lacking in terms of least surprise and discoverability.

ilanschnell · 2022-04-19T21:17:54Z

Thanks for your detailed response. I agree that having an empty bitarray as a result is surprising, and most likely not what one would expect or want. I'm thinking about returning {list(counter)[0]: bitarray('0')} for this case.

Regarding bitarray(1), this mirrors Python's built-in bytearray, where an integer argument is interpreted as a length:

>>> bytearray(1)
bytearray(b'\x00')

I have thought about the cases bitarray(True) and bitarray(False), for which you get TypeError: cannot create bitarray from bool. Requiring a length keyword would break existing code.
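For comparison, here is how the built-in bytearray constructor treats different argument types. This uses only the standard library; the variable names are illustrative, and nothing here is a claim about bitarray beyond what is stated in this thread.

```python
zero_filled = bytearray(1)    # integer argument: a length, giving one zero byte
from_items = bytearray([1])   # iterable argument: the byte values themselves
from_bytes = bytearray(b"1")  # bytes argument: copied as-is (ASCII '1' is 0x31)
from_bool = bytearray(True)   # bool is an int subclass, so this means length 1

print(zero_filled, from_items, from_bytes, from_bool)
```

The last line is the case that bitarray rejects with a TypeError, as noted above; bytearray accepts it silently and treats True as the length 1.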

BTW: I've been working on canonical Huffman coding for the last few days, see #173. This will be part of the upcoming bitarray 2.5.0 release.

ilanschnell added a commit that referenced this issue Apr 14, 2022

handle empty files as well as files which only one contain (possibly …

65c8f33

…many of the same single character), see also #172

eyaler closed this as completed Apr 19, 2022

ilanschnell added a commit that referenced this issue Apr 20, 2022

add speical cases for single symbol Huffman codes, see #172

7d80bae
