A Huffman Coding Tree program that encodes and decodes text based on the Huffman coding scheme.

HuffmanCodeTree

Overview

Huffman coding is a lossless data compression algorithm. The idea is to assign variable-length codes to input characters, where the lengths of the assigned codes are based on the frequencies of the corresponding characters.

Concepts of Encoding

1. Fixed Length Encoding

Encoding text files in a computer may be an everyday task, but it was originally a difficult challenge – one to which many different solutions exist. Typical approaches to encoding text files use what is known as “fixed-length” encoding. This means that each letter corresponds to a fixed number (say 8, or 16) of bits (zeros/ones) in the file. Each 8- or 16-bit pattern uniquely identifies one single letter. A selection of standard fixed-length codes under the ASCII coding scheme can be seen in the table below:

Fixed-length codes work particularly well in hardware, where the predictable and regular length of data helps make processing quick and efficient. Let’s take a look at a few quick examples:

Given the string “Egg cab”, the scheme would encode it as follows:

So the encoded string would be 01000101011001110110011100100000011000110110000101100010

Going in the reverse direction, you could decode 010010000110010101100001011001000010000001100111011000010110011001100110 as follows:

[table3]

So the decoded string would be “Head gaff”.
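The fixed-length scheme above can be sketched in a few lines. This is an illustration in Python, not the repository's own implementation:

```python
# A minimal sketch of fixed-length (8-bit ASCII) encoding and decoding.

def encode_fixed(text):
    """Encode each character as its 8-bit ASCII pattern."""
    return "".join(format(ord(ch), "08b") for ch in text)

def decode_fixed(bits):
    """Split the bit string into 8-bit chunks and look each one up."""
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))
```

For example, `encode_fixed("Egg cab")` reproduces the bit string shown above, and `decode_fixed` recovers "Head gaff" from the second example.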

2. Variable Length Encoding

Fixed-length codes are easy to work with, but they are not always the most efficient. Conceptually, it doesn’t make sense that common letters like ‘e’, ‘t’, ‘a’, ‘i’, ‘n’, ‘o’, and ‘s’ take up the same amount of space as more obscure characters like ‘%’ or ‘q’. Therefore, one solution that has been explored to compress text is a variable-length coding scheme, in which different letters encode to shorter or longer sequences of binary. If commonly used letters are made to encode to shorter binary sequences (often forcing less common letters to use longer binary sequences), substantially smaller binary encodings can be achieved.

One of the most common schemes for creating variable-length encodings is known as the Huffman coding scheme. A Huffman code is one in which different letters have representations of different lengths, and certain properties are met. The differing code lengths make the encoding and decoding process more complicated, but also make the files more efficient on disk. The codes generated can, for example, reduce the size of an ebook by around 45%.
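The construction of such a code is driven by the character frequencies mentioned in the overview. As a sketch, the standard greedy algorithm (repeatedly merge the two lowest-frequency subtrees) can be written as follows; this is the textbook approach, not necessarily how this repository builds its tree:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code from character frequencies using the standard
    greedy construction: repeatedly merge the two lowest-frequency subtrees."""
    # Each heap entry is (frequency, tie-breaker id, subtree); a subtree is
    # either a single character (a leaf) or a (left, right) pair.
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1

    codes = {}
    def walk(subtree, prefix):
        if isinstance(subtree, tuple):
            walk(subtree[0], prefix + "0")   # left edge = 0
            walk(subtree[1], prefix + "1")   # right edge = 1
        else:
            codes[subtree] = prefix or "0"   # single-symbol edge case
    walk(heap[0][2], "")
    return codes
```

Because the two rarest subtrees are merged first, frequent letters end up near the root with short codes, and the resulting code is prefix free by construction (letters only ever appear at leaves).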

Before discussing specific details, an example will be useful. Below is a simplified example that only covers the letters ‘e’, ‘t’, ‘a’, ‘b’, ‘k’, and ‘ ‘ (space). A real Huffman code would need to cover all of the letters used in whatever text you wish to compress.

[table4]

It’s common to present Huffman codes in two ways: first as a “lookup table” (codebook), where you can quickly go from a letter to the binary pattern that encodes it, and second as a tree. The codebook makes it easy to encode text following the same process as fixed-length codes. For example, to encode the name “kate” you would simply look up each letter in the codebook:

[table5]
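Encoding with a codebook can be sketched as below. The original table images are not reproduced here, so the bit patterns are inferred from the decoding walkthrough in this README (e=00, space=01, t=10, a=111); the codes for ‘b’ and ‘k’ are assumptions filling the remaining prefix-free slots:

```python
# A hypothetical codebook consistent with this README's decoding example.
# The 'b' and 'k' entries are assumed, since the table image is not shown.
CODEBOOK = {"e": "00", " ": "01", "t": "10", "b": "1100", "k": "1101", "a": "111"}

def encode(text, codebook):
    """Look up each letter and concatenate its code."""
    return "".join(codebook[ch] for ch in text)
```

With these codes, `encode("eat a tea", CODEBOOK)` produces exactly the bit string decoded in the next section.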

Decoding with only the lookup table is possible but not efficient. With fixed-length codes you could easily split a long sequence up by length and look up the letter for each 8-bit sub-sequence. With variable-length Huffman codes this isn’t possible, so an alternate process is needed. This is where the tree perspective comes in handy. Decoding simply becomes the process of letting the binary string guide you as you walk through the tree. For example, given the sequence “001111001111011000111”, you take each bit one at a time, following the edges from the root of the tree based on the next bit. Starting from the root, the first two bits “00” bring you to the first letter, ‘e’. You then restart from the root. The next three bits, “111”, bring you to ‘a’. You restart again, and “10” brings you to ‘t’. Following this process bit by bit eventually retrieves the full message: “eat a tea”.
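The tree walk described above can be sketched as follows, using a nested-tuple tree matching the same codes (e=00, space=01, t=10, a=111, with ‘b’ and ‘k’ assumed to fill the remaining leaves):

```python
# Internal nodes are (zero_child, one_child) pairs; leaves are letters.
# This tree yields the codes e=00, ' '=01, t=10, b=1100, k=1101, a=111.
TREE = (("e", " "), ("t", (("b", "k"), "a")))

def decode(bits, tree):
    """Walk from the root, following 0 = left and 1 = right; emit a letter
    at each leaf, then restart from the root."""
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):   # reached a leaf
            out.append(node)
            node = tree
    return "".join(out)
```

Running `decode("001111001111011000111", TREE)` retraces the walkthrough above and returns "eat a tea".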

From this example, you can see the most important property: Huffman codes are “prefix free”. This means that no code can be the prefix of (the same as the beginning of) another code. If you had ‘a’ encoded as “01”, ‘b’ encoded as “0100”, and ‘c’ encoded as “00”, for example, the code would not be prefix free, and you wouldn’t know what “010001” encodes – is it “aca” or “ba”?
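A quick way to check the prefix-free property is to test every pair of distinct codes, as in this small sketch:

```python
def is_prefix_free(codes):
    """True if no code in the collection is a prefix of a different code."""
    codes = list(codes)
    return all(not a.startswith(b) for a in codes for b in codes if a != b)
```

The ambiguous example above fails this check (`"0100"` starts with `"01"`), while the six codes used in this README's example pass it.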

Translated structurally, this means that the code tree should only have two types of nodes – leaves (which store a letter but have no children) and internal nodes (which have both a zero child and a one child, but no letter). So long as the code tree has this property, you can rely on decoding to be quick and efficient.
