Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

huffman code error for single key dict #172

Closed
eyaler opened this issue Apr 13, 2022 · 4 comments
Closed

huffman code error for single key dict #172

eyaler opened this issue Apr 13, 2022 · 4 comments

Comments

@eyaler
Copy link

eyaler commented Apr 13, 2022

This is an edge case, where huffman_code() fails if the dict has only a single key:

huffman_code(Counter('xxx'))

Out[1]: {'x': bitarray()}

This is not useful and will of course cause encode() to fail:

bitarray().encode(huffman_code(Counter('xxx')), 'xxx')

ValueError: non-empty bitarray expected

@eyaler
Copy link
Author

eyaler commented Apr 13, 2022

This is my fix, where also I find it useful to gracefully deal with the empty case:

        if len(counter) > 1:
            table = huffman_code(counter)
        elif len(counter) == 1:
            table = {list(counter)[0]: bitarray(1)}
        else:
            table = {}

@ilanschnell
Copy link
Owner

Thank you for using bitarray and discovering this special edge case!

I'm not sure if this is actually a bug. Obviously the huffman code {'x': bitarray()} isn't very useful. However, Huffman is meant to produce a minimal-length sequence of bits that contains all the information in the original sequence of symbols, assuming that the decoder already knows the set of symbols. If there's only one symbol, the input data contains no information except its length. In Huffman-based data formats, length is usually encoded separately, not as part of the Huffman-encoded bit sequence itself. The decoder of a single-symbol Huffman code therefore has all the information it needs to reconstruct the input without needing to read anything from the Huffman-encoded bit sequence. it is logical, then, that the Huffman encoder's output should be 0 bits long.

On the other hand, one could argue that one symbol can be encoded as a single bit (like you did), but would that still be "Huffman code"? Or is this a special case outside the scope of the function huffman_code()?

Note that in your fix bitarray(1) is uninitialized, so you either get a single bit which is 0 or 1. Either is fine, but it makes the table non-deterministic. So you probably want to be explicit (and replace bitarray(1) with bitarray('0').

ilanschnell added a commit that referenced this issue Apr 14, 2022
@eyaler
Copy link
Author

eyaler commented Apr 19, 2022

@ilanschnell thanks for your detailed response and commit! (and bitarray of course)

as regard to the edge cases:

i am not aware of the practice of encoding the length with huffman. anyway this is not a requirement of huffman, and moreover streaming decoders do exist, so sometimes the length is not defined. specifically i implemented one of the stream decoders from here: https://www.researchgate.net/publication/3159499_On_the_implementation_of_minimum_redundancy_prefix_codes

although encoding the length for a single symbol makes sense from a compression perspective, this takes us to the domain of modified Huffman which makes use of run-length encoding, which sounds like what you described. also as you've mentioned, you still need to save the cleartext symbol in the table.

anyway, i was stress testing some edge cases where huffman encoding/decoding is part of the process and i did want the flow to go through also for these degenerate cases. so this was more of a pragmatic remark rather than a purist or philosophical one. while failing on empty strings seems like a reasonable design decision, i would at least except the single symbol case to work.

as regard to bitarray(1) i was indeed aiming for bitarray('1'). as this may be a common cause for bugs, i would suggest to require a keyword like length/shape/size, e.g. bitarray(length=1), or you could even make the keyword necessary just for the cases 0, 1 (assuming only binary codes are supported). otherwise, since this is not seem like a familiar idiom for python array constructors, i find the current behavior lacking in terms of least surprise and discoverability.

@eyaler eyaler closed this as completed Apr 19, 2022
@ilanschnell
Copy link
Owner

Thanks for your detailed response. I agree that the having an empty bitarray as a result is surprising, and most likely not what one would expect or want. I'm thinking about returning {list(counter)[0]: bitarray('0')} for this case.

In regards to bitarray(1), this is the same behavior of Python's built-in bytearray:

>>> bytearray(1)
bytearray(b'\x00')

I have though about the cases bitarray(True) and bitarray(False) for which you get TypeError: cannot create bitarray from bool. Requiring a length keyword would break existing code.

BTW: I've been working on canonical Huffman coding for the last few days, see #173. This is be part of the upcoming bitarray 2.5.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants