\documentclass[]{article}
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhead{}
\fancyfoot{}
\fancyhead[C]{Data Compression Techniques $\bullet$ Information Theory $\bullet$ Shox96 - Guaranteed Compression of Printable Short Strings}
\fancyfoot[L]{ Licensed under CC 4.0 Int'l Attribution License \copyright 2019 Siara Logics (cc)}
\fancyfoot[R]{\thepage}
%opening
\title{Shox96 - Guaranteed Compression of Printable Short Strings}
\author{Arundale Ramanathan, Charumathi A, Harsha N}
\begin{document}
\maketitle
\begin{abstract}
We formulate a hybrid encoding method with which short strings can be compressed using context-aware pre-mapped codes, resulting in surprisingly good compression ratios.
We further prove that this technique guarantees compression for any English-language sentence of at least three words.
\end{abstract}
\section{Summary}
Compression of short strings of arbitrary length has not been addressed sufficiently by lossless entropy encoding methods so far. Although it appears inconsequential, the space occupied by such strings becomes significant in memory-constrained environments such as the Arduino Uno, and when storing large numbers of independent strings in a database. While block compression is available for databases, retrieval efficiency can be improved if the strings are individually compressed.
\section{Basic Definitions}
In information theory, \emph{entropy encoding} is a lossless data compression scheme that is independent of the specific characteristics of the medium \cite{1}.
One of the main types of entropy coding creates and assigns a unique prefix-free code to each unique symbol that occurs in the input. These entropy encoders then compress data by replacing each fixed-length input symbol with the corresponding variable-length prefix-free output codeword.
According to Shannon's source coding theorem, the optimal code length for a symbol is $-\log_b P$, where $b$ is the number of symbols used to make output codes and $P$ is the probability of the input symbol \cite{2}. Therefore, the most common symbols use the shortest codes.
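For instance, with binary output codes ($b = 2$), a symbol occurring with probability $P = 1/8$ has an optimal code length of $-\log_2(1/8) = 3$ bits.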
Even today, the most widely used method for forming optimal prefix-free discrete codes is Huffman coding \cite{3}.
In contrast to entropy encoding, there are various other approaches to lossless coding including Lempel-Ziv coding \cite{4} and Burrows-Wheeler coding \cite{5}.
\section{Existing techniques - Smaz and shoco}
While technologies such as GZip, Deflate, Zip and LZMA are available for file compression, they do not provide optimal compression for short strings. Even though these methods compress far better than what we propose on larger texts, they often expand the original source when it is a short string, because the symbol-code mapping also needs to be attached to aid decompression.
To our knowledge, only two other competing technologies exist: Smaz and shoco.
Smaz is a simple compression library suitable for compressing very short strings \cite{9}. It was developed by Salvatore Sanfilippo and is released under the BSD license.
Shoco is a C library to compress short strings \cite{10}. It was developed by Christian Schramm and is released under the MIT license.
While both are lossless encoding methods, Smaz is dictionary-based and shoco is classified as an entropy coder \cite{10}.
In addition to providing a default frequency table as its model, shoco provides an option to re-define the frequency table based on training text \cite{10}.
\section{This research}
We propose a hybrid encoding method that relies mainly on entropy encoding for compression.
Unlike shoco, we propose a fixed frequency table generated from the characteristics of English-language letter frequency. We re-use the research carried out by Oxford University \cite{7} and other sources \cite{6} \cite{8} and arrive at a unique method that takes advantage of the conventions of the language.
We propose a single model that is presently fixed, because of the advantages it offers over the training models of shoco.
The disadvantage of a trained model, although it may appear to offer more compression, is that it does not consider the patterns that usually appear during text formation.
We can actually see that our model performs better than the pre-trained model of shoco (see the Performance Comparison section).
For supporting other languages and types of text, we propose to use different models tailored to each of them, considering the patterns of the specific language.
Unlike Smaz and shoco, we assume no \emph{a priori} knowledge about the input text. However, we rely on \emph{a posteriori} knowledge from research carried out on the language and on common patterns of sentence formation, and arrive at pre-assigned codes for each letter.
\section{Model}
In the ASCII chart, there are 95 printable characters, with codes 32 through 126. For the purpose of arriving at fixed codes for each of these characters, we use two sets of prefix-free codes.
The first set consists of 11 codes: 00, 010, 011, 100, 1010, 1011, 1100, 1101, 1110, 11110 and 11111 (the vertical codes, or \emph{vcodes}, in the table below). The second set consists of 7 codes: 0, 10, 110, 11100, 11101, 11110 and 11111 (the horizontal codes, or \emph{hcodes}).
With these two sets of codes, we form several sets of letters as shown in the table below, and use rules based on how patterns appear in short strings to arrive at the frequency table.
\begin{center}
\begin{tabular}{ | l | l | l | l | l | l | l | l | } \hline
\textbf{hcode $\rightarrow$} & \textbf{10} & \textbf{0} & \textbf{110} & \textbf{11100} & \textbf{11101} & \textbf{11110} & \textbf{11111} \\ \hline
\textbf{$\downarrow$ vcode} & \textbf{Set 1} & \textbf{Set 1a} & \textbf{Set 1b} & \textbf{Set 2} & \textbf{Set 3} & \textbf{Set 4} & \textbf{Set 4a} \\ \hline
\textbf{00} & switch & c / C & v / V & switch & . & \& & @ \\ \hline
\textbf{010} & sp / tb & d / D & k / K & 9 & , & ; & ? \\ \hline
\textbf{011} & e / E & h / H & q / Q & 0 & - & : & ' \\ \hline
\textbf{100} & t / T & u / U & j / J & 1 & / & $\textless$ & \^{} \\ \hline
\textbf{1010} & a / A & p / P & x / X & 2 & = & $\textgreater$ & \# \\ \hline
\textbf{1011} & o / O & m / M & z / Z & 3 & + & * & \_ \\ \hline
\textbf{1100} & i / I & b / B & cr + lf & 4 & sp & \textquotedblleft & ! \\ \hline
\textbf{1101} & n / N & g / G & lf / cr & 5 & ( & \{ & \textbackslash \\ \hline
\textbf{1110} & s / S & w / W & dict & 6 & ) & \} & $|$ \\ \hline
\textbf{11110} & r / R & f / F & rpt & 7 & \$ & [ & \~{} \\ \hline
\textbf{11111} & l / L & y / Y & term & 8 & \% & ] & ` \\ \hline
\end{tabular}
\end{center}
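Both code sets are prefix-free, so a bit stream built from them can be decoded without separators. The following C sketch (our illustration, not part of the reference implementation) checks this property for the first set:
\begin{verbatim}
#include <stdio.h>
#include <string.h>

/* Returns 1 if no code in the array is a prefix of another. */
static int is_prefix_free(const char **codes, int n) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      if (i != j && strncmp(codes[i], codes[j],
                            strlen(codes[i])) == 0)
        return 0;
  return 1;
}

int main(void) {
  const char *vcodes[] = { "00", "010", "011", "100", "1010",
      "1011", "1100", "1101", "1110", "11110", "11111" };
  printf("prefix-free: %s\n",
         is_prefix_free(vcodes, 11) ? "yes" : "no");
  return 0;
}
\end{verbatim}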
\section{Rules}
\subsection{Basic rules}
\begin{itemize}
\item[$\bullet$] It can be seen that the more frequent symbols are assigned shorter codes.
\item[$\bullet$] Set 1 is always active when compression begins. So the letter e has the code 011, t 100 and so on.
\item[$\bullet$] If a letter in Set 1a needs to be encoded, the switch code is used, followed by 0 to indicate Set 1a. So the letter c is encoded as 00000, d as 000010 and so on.
\item[$\bullet$] Similarly, if a letter in Set 1b needs to be encoded, the switch code is used, followed by 110. So k is encoded as 00110010, q as 00110011 and so on. Note that the terminator symbol is encoded as 0011011111. A worked example follows this list.
\end{itemize}
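As a worked example of these rules, the word \emph{tea} is encoded as 100 (t), 011 (e) and 1010 (a), taking 10 bits in place of 24 bits of ASCII, while the word \emph{code}, which draws c and d from Set 1a, is encoded as 00000 (c), 1011 (o), 000010 (d) and 011 (e), taking 18 bits in place of 32.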
\subsection{Upper case symbols}
\begin{itemize}
\item[$\bullet$] For encoding upper-case letters in Set 1, the switch symbol is used, followed by 10 and then the code for the symbol itself. For example, E is encoded as 0010011. The same mechanism yields the tab character as the upper-case counterpart of space.
\item[$\bullet$] For encoding upper-case letters in Set 1a, the switch symbol is used, followed by 10, then the switch symbol again, followed by 0 and the corresponding code for the letter. For instance, the letter P is encoded as 00100001010.
\item[$\bullet$] Similarly, for upper-case letters in Set 1b, the prefix 001000110 is used. So the symbol X is encoded as 0010001101010.
\item[$\bullet$] If upper-case letters appear continuously, the encoder may decide to switch to upper case using the prefix 00100010. After that, the same codes as for lower case indicate upper-case letters, until the sequence 0010 is encoded again to return to lower case.
\end{itemize}
\subsection{Numbers and related symbols}
\begin{itemize}
\item[$\bullet$] Symbols in Set 2 are encoded by first switching to the set using 00 followed by 11100. So the symbol 9 is encoded as 0011100010.
\item[$\bullet$] Whenever a switch is made to Set 2, it becomes the active set. So subsequent numerals are encoded without the switch symbol, as in 1011 for 3, 1100 for 4 and so on (a worked example follows this list).
\item[$\bullet$] To return to Set 1, the prefix 0010 is used.
\item[$\bullet$] To encode symbols in Set 3, the prefix 00 is used, followed by 11101 and the corresponding code for the symbol. For example, + is encoded as 00111011011.
\end{itemize}
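For example, to encode the digits 2019 in the middle of lower-case text, the encoder switches to Set 2 with 0011100 (7 bits), emits 1010 (2), 011 (0), 100 (1) and 010 (9), and returns to Set 1 with 0010 (4 bits) if further lower-case text follows: 24 bits in place of 32 bits of ASCII.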
\subsection{Other symbols (Set 4)}
\begin{itemize}
\item[$\bullet$] The special characters in Set 4 can be encoded by using the prefix 0011110 followed by the corresponding code for the letter. Example: ; is encoded as 0011110010.
\item[$\bullet$] The symbols in Set 4a are encoded using the prefix 0011111. Example: ? is encoded as 0011111010.
\end{itemize}
\subsection{Sticky sets}
\begin{itemize}
\item[$\bullet$] When a switch is made to Set 2, it becomes active and is said to be sticky until Set 1 is made active again using the code 0010.
\item[$\bullet$] No other set is sticky in this way. Set 1 is the default; Set 2 becomes sticky when switched to using 0011100; and upper case can be made sticky using 00100010.
\item[$\bullet$] Symbols in Set 1a, Set 1b, Set 3, Set 4 and Set 4a are never sticky. Once such a symbol is encoded, the previous sticky set becomes active again.
\end{itemize}
\subsection{Special symbols}
\begin{itemize}
\item[$\bullet$] term in Set 1b indicates termination of the encoding. It is used if the length of the encoded string is not stored.
\item[$\bullet$] rpt in Set 1b indicates that the symbol just encoded is to be repeated a specified number of times.
\item[$\bullet$] dict in Set 1b indicates that a specified offset and length in the file is to be copied to the current position. This is dictionary-based encoding.
\item[$\bullet$] CRLF in Set 1b is encoded using a single code and is expanded as the two bytes CR LF. If only LF is used, as on Unix-like systems, a separate code in Set 1b is used. Also, in the rare case that only CR appears, a longer code is provided (CR being the upper-case counterpart of LF in Set 1b).
\end{itemize}
\subsection{Dual access for Set 3}
\begin{itemize}
\item[$\bullet$] Set 3 can be accessed when either Set 1 or Set 2 is active. This is because its symbols occur commonly alongside both letters and numerals, so minimum-length codes are needed for them in both contexts.
\item[$\bullet$] For the same reason, the space symbol appears in both Set 1 and Set 3.
\end{itemize}
\subsection{Repeating letters}
\begin{itemize}
\item[$\bullet$] If any letter repeats more than 3 times, we use the special code rpt shown in Set 1b of the model.
\item[$\bullet$] The encoder first codes the letter using the above codes. Then the rpt code is emitted, followed by the number of times the letter repeats (a worked example follows this list).
\item[$\bullet$] The repeat count is coded using a special bit sequence as explained in the section ``Encoding counts'' below.
\end{itemize}
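For instance, assuming the count field holds the number of additional repetitions, a run such as \emph{aaaa} would be encoded as 1010 (a), 0011011110 (rpt) and the count 3, which under the scheme of the section ``Encoding counts'' takes 3 bits (0 followed by 11): 17 bits in place of 32 bits of ASCII.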
\subsection{Repeating sections}
\begin{itemize}
\item[$\bullet$] If a section of text repeats, another code of Set 1b (dict) is used, followed by four fields as described next.
\item[$\bullet$] The first field is a single bit: 0 indicates that the distance of the earlier occurrence is counted from the beginning of the text; 1 indicates that it is counted backwards from the current position of the encoder.
\item[$\bullet$] The second field indicates the length of the repeating section.
\item[$\bullet$] The third field indicates the distance of the repeating section.
\item[$\bullet$] The fourth field is coded only if the distance is counted from the beginning. It is a number identifying the set of text that the section belongs to, assuming several sets of text are being encoded. If only one set of text is encoded, the first field is coded as 1, indicating that the distance is counted from the current position.
\item[$\bullet$] The second, third and fourth fields are encoded as explained in the following section ``Encoding counts'' (a worked example follows this list).
\end{itemize}
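As an illustration, with the field semantics assumed as described above, copying a 10-letter section that occurred 50 positions earlier would be encoded as 001101110 (dict), a 1 bit (distance counted backwards from the current position), the length 10 encoded as 10 00110 and the distance 50 encoded as 110 0001110, with the fourth field omitted since the distance is relative.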
\subsection{Encoding Counts}
\begin{itemize}
\item[$\bullet$] For encoding counts such as lengths and distances, the horizontal codes shown in the model are re-used, with each code indicating how many bits follow to represent the count.
\item[$\bullet$] If the code is 0, 2 bits follow; that is, the count is between 0 and 3.
\item[$\bullet$] If the code is 10, 5 bits follow; that is, the count is between 4 and 35.
\item[$\bullet$] If the code is 110, 7 bits follow; that is, the count is between 36 and 163.
\item[$\bullet$] If the code is 11100, 9 bits follow; that is, the count is between 164 and 675.
\item[$\bullet$] If the code is 11101, 12 bits follow; that is, the count is between 676 and 4771.
\item[$\bullet$] If the code is 11110, 16 bits follow; that is, the count is between 4772 and 70307.
\item[$\bullet$] If the code is 11111, a VarInt (a variable-length integer that uses a terminating byte) follows to indicate the count. This format is reserved for future expansion, as an implementation of Shox96 is not expected to need it. A C sketch of this scheme follows the list.
\end{itemize}
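The following C sketch illustrates this scheme. It is our illustration rather than an extract from the reference implementation, and append\_bits is an assumed helper that appends the nbits low-order bits of val to the output stream.
\begin{verbatim}
#include <stdint.h>

typedef struct {
  uint32_t code;  /* prefix bits, e.g. 110 -> 0x06 */
  int code_len;   /* number of prefix bits          */
  int bit_len;    /* count bits that follow         */
  uint32_t adder; /* smallest count of this slot    */
} count_slot;

static const count_slot slots[] = {
  {0x00, 1,  2,    0},  /* 0     -> 0..3          */
  {0x02, 2,  5,    4},  /* 10    -> 4..35         */
  {0x06, 3,  7,   36},  /* 110   -> 36..163       */
  {0x1C, 5,  9,  164},  /* 11100 -> 164..675      */
  {0x1D, 5, 12,  676},  /* 11101 -> 676..4771     */
  {0x1E, 5, 16, 4772},  /* 11110 -> 4772..70307   */
};

/* Emits the prefix and count bits via append_bits;
   returns the slot index used, or -1 if the varint
   escape (11111) would be needed. */
int encode_count(uint32_t count,
                 void (*append_bits)(uint32_t val, int nbits)) {
  for (int i = 0; i < 6; i++) {
    uint32_t max = slots[i].adder
                   + ((1u << slots[i].bit_len) - 1);
    if (count <= max) {
      append_bits(slots[i].code, slots[i].code_len);
      append_bits(count - slots[i].adder, slots[i].bit_len);
      return i;
    }
  }
  return -1; /* reserved: 11111 followed by a varint */
}
\end{verbatim}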
Based on the above rules, the following frequency table is formed.
\section{Frequency table}
\begin{center}
\begin{tabular}{ | l | l | l | l |} \hline
\textbf{ASCII Code} & \textbf{Letter} & \textbf{Code} & \textbf{Length} \\ \hline
32 & & 010 & 3 \\ \hline
33 & ! & 00111111100 & 11 \\ \hline
34 & \textquotedblleft & 00111101100 & 11 \\ \hline
35 & \# & 00111111010 & 11 \\ \hline
36 & \$ & 001110111110 & 12 \\ \hline
37 & \% & 001110111111 & 12 \\ \hline
38 & \& & 001111000 & 9 \\ \hline
39 & ' & 0011111011 & 10 \\ \hline
40 & ( & 00111011101 & 11 \\ \hline
41 & ) & 00111011110 & 11 \\ \hline
42 & * & 00111101011 & 11 \\ \hline
43 & + & 00111011011 & 11 \\ \hline
44 & , & 0011101010 & 10 \\ \hline
45 & - & 0011101011 & 10 \\ \hline
46 & . & 001110100 & 9 \\ \hline
47 & / & 0011101100 & 10 \\ \hline
48 & 0 & 0011100011 & 10 \\ \hline
49 & 1 & 0011100100 & 10 \\ \hline
50 & 2 & 00111001010 & 11 \\ \hline
51 & 3 & 00111001011 & 11 \\ \hline
52 & 4 & 00111001100 & 11 \\ \hline
53 & 5 & 00111001101 & 11 \\ \hline
54 & 6 & 00111001110 & 11 \\ \hline
55 & 7 & 001110011110 & 12 \\ \hline
56 & 8 & 001110011111 & 12 \\ \hline
57 & 9 & 0011100010 & 10 \\ \hline
58 & : & 0011110011 & 10 \\ \hline
59 & ; & 0011110010 & 10 \\ \hline
60 & $\textless$ & 0011110100 & 10 \\ \hline
61 & = & 00111011010 & 11 \\ \hline
62 & $\textgreater$ & 00111101010 & 11 \\ \hline
63 & ? & 0011111010 & 10 \\ \hline
64 & @ & 001111100 & 9 \\ \hline
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{ | l | l | l | l |} \hline
\textbf{ASCII Code} & \textbf{Letter} & \textbf{Code} & \textbf{Length} \\ \hline
65 & A & 00101010 & 8 \\ \hline
66 & B & 00100001100 & 11 \\ \hline
67 & C & 001000000 & 9 \\ \hline
68 & D & 0010000010 & 10 \\ \hline
69 & E & 0010011 & 7 \\ \hline
70 & F & 001000011110 & 12 \\ \hline
71 & G & 00100001101 & 11 \\ \hline
72 & H & 0010000011 & 10 \\ \hline
73 & I & 00101100 & 8 \\ \hline
74 & J & 001000110100 & 12 \\ \hline
75 & K & 001000110010 & 12 \\ \hline
76 & L & 001011111 & 9 \\ \hline
77 & M & 00100001011 & 11 \\ \hline
78 & N & 00101101 & 8 \\ \hline
79 & O & 00101011 & 8 \\ \hline
80 & P & 00100001010 & 11 \\ \hline
81 & Q & 001000110011 & 12 \\ \hline
82 & R & 001011110 & 9 \\ \hline
83 & S & 00101110 & 8 \\ \hline
84 & T & 0010100 & 7 \\ \hline
85 & U & 0010000100 & 10 \\ \hline
86 & V & 00100011000 & 11 \\ \hline
87 & W & 00100001110 & 11 \\ \hline
88 & X & 0010001101010 & 13 \\ \hline
89 & Y & 001000011111 & 12 \\ \hline
90 & Z & 0010001101011 & 13 \\ \hline
91 & [ & 001111011110 & 12 \\ \hline
92 & \textbackslash & 00111111101 & 11 \\ \hline
93 & ] & 001111011111 & 12 \\ \hline
94 & \^{} & 0011111100 & 10 \\ \hline
95 & \_ & 00111111011 & 11 \\ \hline
96 & ` & 001111111111 & 12 \\ \hline
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{ | l | l | l | l |} \hline
\textbf{ASCII Code} & \textbf{Letter} & \textbf{Code} & \textbf{Length} \\ \hline
97 & a & 1010 & 4 \\ \hline
98 & b & 0001100 & 7 \\ \hline
99 & c & 00000 & 5 \\ \hline
100 & d & 000010 & 6 \\ \hline
101 & e & 011 & 3 \\ \hline
102 & f & 00011110 & 8 \\ \hline
103 & g & 0001101 & 7 \\ \hline
104 & h & 000011 & 6 \\ \hline
105 & i & 1100 & 4 \\ \hline
106 & j & 00110100 & 8 \\ \hline
107 & k & 00110010 & 8 \\ \hline
108 & l & 11111 & 5 \\ \hline
109 & m & 0001011 & 7 \\ \hline
110 & n & 1101 & 4 \\ \hline
111 & o & 1011 & 4 \\ \hline
112 & p & 0001010 & 7 \\ \hline
113 & q & 00110011 & 8 \\ \hline
114 & r & 11110 & 5 \\ \hline
115 & s & 1110 & 4 \\ \hline
116 & t & 100 & 3 \\ \hline
117 & u & 000100 & 6 \\ \hline
118 & v & 0011000 & 7 \\ \hline
119 & w & 0001110 & 7 \\ \hline
120 & x & 001101010 & 9 \\ \hline
121 & y & 00011111 & 8 \\ \hline
122 & z & 001101011 & 9 \\ \hline
123 & \{ & 00111101101 & 11 \\ \hline
124 & | & 00111111110 & 11 \\ \hline
125 & \} & 00111101110 & 11 \\ \hline
126 & \~{} & 001111111110 & 12 \\ \hline
\end{tabular}
\end{center}
Even after the frequency table is formed, the original model is still needed for encoding sticky sets, as explained in the Rules section.
\section{Implementation}
According to the above rules and frequency table, a reference implementation has been developed and made available at https://github.com/siara-cc/Shox96 as shox96\_0\_2\_0.c. It is released under the Apache License 2.0; a brief usage sketch appears after the list below.
Further, Shox96 has been used in several open source projects, listed below:
\begin{itemize}
\item[$\bullet$] Shox96 Compression Library for Arduino Progmem - https://github.com/siara-cc/Shox96\_Arduino\_Progmem\_lib
\item[$\bullet$] Shox96 Compression Library for Arduino - https://github.com/siara-cc/Shox96\_Arduino\_lib
\item[$\bullet$] Sqlite3 User Defined Function as loadable extension - https://github.com/siara-cc/Shox96\_Sqlite\_UDF
\item[$\bullet$] Sqlite3 Library for ESP32 - https://github.com/siara-cc/esp32\_arduino\_sqlite3\_lib
\item[$\bullet$] Sqlite3 Library for ESP8266 - https://github.com/siara-cc/esp\_arduino\_sqlite3\_lib
\item[$\bullet$] Sqlite3 Library for ESP-IDF - https://github.com/siara-cc/esp32-idf-sqlite3
\end{itemize}
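The following usage sketch shows how the reference implementation might be called. The prototypes here are assumptions for illustration; the actual function names and signatures in shox96\_0\_2\_0.c may differ.
\begin{verbatim}
#include <stdio.h>
#include <string.h>

/* Assumed prototypes (see shox96_0_2_0.c for the actual API);
   each is assumed to return the output length in bytes. */
int shox96_0_2_compress(const char *in, int len, char *out);
int shox96_0_2_decompress(const char *in, int len, char *out);

int main(void) {
  const char *msg = "Hello World";
  char cbuf[64], dbuf[64];
  int clen = shox96_0_2_compress(msg, (int) strlen(msg), cbuf);
  int dlen = shox96_0_2_decompress(cbuf, clen, dbuf);
  printf("%d bytes -> %d bytes\n", (int) strlen(msg), clen);
  printf("round trip: %.*s\n", dlen, dbuf);
  return 0;
}
\end{verbatim}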
\section{Performance Comparison}
The compression performance of the three techniques (Smaz, shoco and Shox96) was compared for different types of strings, and the results are tabulated below:
\begin{center}
\begin{tabular}{ | p{0.5\textwidth} | p{0.1\textwidth} | p{0.1\textwidth} | p{0.1\textwidth} | p{0.1\textwidth} |} \hline
\textbf{String} & \textbf{Length} & \textbf{Smaz} & \textbf{shoco} & \textbf{Shox96} \\ \hline
Hello World & 11 & 10 & 8 & 8 \\ \hline
The quick brown fox jumps over the lazy dog & 43 & 30 & 34 & 29 \\ \hline
I would have NEVER said that & 28 & 20 & 20 & 18 \\ \hline
In (1970-89), \$25.9 billion; OPEC bilateral aid [1979-89], \$213 million & 67 & 65 & 52 & 54 \\ \hline
\end{tabular}
\end{center}
Further, world95.txt, the text file obtained from \emph{The Project Gutenberg Etext of the 1995 CIA World Factbook}, was compressed using the three techniques, with the following results:
\begin{itemize}
\item[$\bullet$] Original size: 2988577 bytes
\item[$\bullet$] After compression using shoco's original model: 2385934 bytes
\item[$\bullet$] After compression using shoco trained on world95.txt: 2088141 bytes
\item[$\bullet$] After compression using Shox96: 1966019 bytes
\end{itemize}
As for memory requirements, shoco requires over 2K bytes and Smaz over 1K, but Shox96 requires only around $95 \times 3 = 285$ bytes for the compressor and decompressor together, which makes it ideal even for the Arduino Uno.
\section{Proving guaranteed compression}
Guaranteed compression means that the length of the compressed text will never exceed the length of the source text.
While it is not possible to prove this for arbitrary text, we can prove it for most real-life scenarios, which is good enough to use the method without fear of expanding the original length.
First, we make the following assumptions for a given sentence in the English language:
\begin{itemize}
\item[$\bullet$] The sentence will start with a capital letter.
\item[$\bullet$] The sentence will end with a period (.).
\item[$\bullet$] The sentence will have at least 3 words.
\item[$\bullet$] Special characters other than a-z, A-Z and space will not number more than 2 or 3.
\item[$\bullet$] The terminator symbol is not needed; that is, the length of the compressed string in bits is maintained separately.
\end{itemize}
With the above assumptions, we try to prove guaranteed compression as follows:
\begin{itemize}
\item[$\bullet$] Since the sentence will have at least two spaces, and the space code is 3 bits against 8 bits of ASCII, they save $5 + 5 = 10$ bits.
\item[$\bullet$] Since any English word contains a vowel and the average vowel code length in our frequency table is about 4 bits, another $3 \times (8 - 4) = 12$ bits are saved, unless the vowel u (whose code is 6 bits) appears as the only vowel in all three words, which is unlikely in real life.
\end{itemize}
So, with a saving of at least 22 bits, we have more than enough to offset any costlier symbols, such as upper-case letters or special characters, provided such symbols do not exceed 4 in number, since \emph{the maximum length of any code in our frequency table is only 13 bits}. Even if there are 4 such codes, they occupy at most $(13 - 8) \times 4 = 20$ extra bits.
This is a deliberately conservative limit; since there will be further savings from the known general frequency of letters, we can safely assume this guarantee.
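As a concrete check, consider the sentence \emph{The cat sat.} (12 bytes of ASCII). From the frequency table, T takes 7 bits, h 6, e 3, space 3, c 5, a 4, t 3, space 3, s 4, a 4, t 3 and the period 9, for a total of 54 bits, which fits in 7 bytes.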
\section{Conclusion}
As can be seen from the performance numbers, Shox96 generally performs better than the available techniques, and it holds up across a variety of texts, including those having a mixture of numbers and special characters.
\section{Further work}
We propose to improve Shox96 by including further rules for better compression. We also propose to develop such models for other languages and for other types of text, such as program source code.
\section{About the Authors}
\emph{Arundale Ramanathan} has over 20 years of experience in the IT industry. He has worked alternately in large corporates, MNCs and startups, including Viewlocity Asia Pacific Pte. Ltd., IBC Systems Pte. Ltd. and Polaris Software Lab Ltd. He founded Siara Logics (cc) and Siara Logics (in) and publishes his open source work at https://github.com/siara-cc and https://github.com/siara-in. He holds a master's degree in Computer Science from Anna University. He can be reached at arun@siara.cc.
\emph{Charumathi A.} is a budding Electronics and Communications Engineer studying at Jerusalem College of Engineering, Chennai, affiliated to Anna University. She is passionate about building useful technical gadgets, 3D printing, technical research and fashion design. She can be reached at cm@cm21.in.
\emph{Harsha N.} is a budding Computer Science Engineer studying at Jerusalem College of Engineering, Chennai, affiliated to Anna University. She is passionate about building software projects in C and Java and about technical research. She can be reached at nnpharsha562@gmail.com.
\begin{thebibliography}{1}
\bibitem{1} David MacKay. {\em Information Theory, Inference, and Learning Algorithms}, Cambridge University Press, 2003.
\bibitem{2} Shannon, Claude E., {\em A Mathematical Theory of Communication}, Bell System Technical Journal, vol. 27, July 1948.
\bibitem{3} D. A. Huffman, {\em A method for the construction of minimum-redundancy codes}, Proc. IRE, vol. 40, pp. 1098-1101, 1952.
\bibitem{4} J. Ziv and A. Lempel. A Universal Algorithm for Data Compression. IEEE Transactions on Information Theory, 23(3):337–343, May 1977.
\bibitem{5} M. Burrows and D. Wheeler. A Block-Sorting Lossless Data Compression Algorithm. Research Report 124, Digital Equipment Corporation, Palo Alto, CA, USA, May 1994.
\bibitem{6} {\em "Statistical Distributions of English Text"}. data-compression.com. Archived from the original on 2017-09-18.
\bibitem{7} What is the frequency of the letters of the alphabet in English?, Oxford Dictionary. Oxford University Press. Retrieved 29 December 2012.
\bibitem{8} Wikipedia, {\em Letter frequency}, https://en.wikipedia.org/wiki/Letter\_frequency, updated December 2018.
\bibitem{9} Salvatore Sanfilippo, {\em SMAZ - compression for very small strings}, https://github.com/antirez/smaz, February 2012.
\bibitem{10} Christian Schramm, {\em shoco: a fast compressor for short strings}, https://github.com/Ed-von-Schleck/shoco, December 2015.
\end{thebibliography}
\end{document}