
Unishox Javascript Library - Guaranteed Compression for Short Strings


This library was developed to compress and decompress small strings individually. General-purpose compression utilities such as zip and gzip do not compress short strings well and often expand them. They also use a lot of memory, which makes them unusable in constrained environments like Arduino.

Note: The present byte-code version is 2, and it replaces Unishox 1. Unishox 1 is still available, but only as a C library.

Applications

  • Compression for low memory devices such as Arduino, ESP8266 and ESP32
  • Sending messages over Websockets
  • Compression of Chat application text exchange including Emojis
  • Storing compressed text in database
  • Bandwidth and storage cost reduction for Cloud technologies


How it works

Unishox is a hybrid encoder combining entropy, dictionary and delta coding. It works by assigning fixed prefix-free codes to each letter in the supported character set (entropy coding; see the Character Set section below). It also encodes repeating letter sequences separately (dictionary coding). For Unicode characters, delta coding is used.
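As a rough illustration of the delta-coding idea (a simplified sketch, not Unishox's actual bit format), consecutive Unicode code points in the same script tend to be numerically close, so storing differences takes fewer bits than storing absolute values:

```javascript
// Simplified sketch of delta coding: not Unishox's real encoding,
// just an illustration of why deltas between code points are small.
const text = "αβγδ"; // Greek letters: code points 945, 946, 947, 948

const codePoints = Array.from(text, ch => ch.codePointAt(0));
const deltas = codePoints.map((cp, i) => (i === 0 ? cp : cp - codePoints[i - 1]));

console.log(codePoints); // [945, 946, 947, 948]
console.log(deltas);     // [945, 1, 1, 1], and the small deltas are cheap to encode
```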

The model used for arriving at the prefix-free code is shown below:

[Model diagram]

The complete specification can be found in this article: Unishox 2 - Guaranteed Configurable Compression for Short Strings using Entropy, Dictionary and Delta encoding techniques.

Getting started

This is a Node.js library, but it is written to work even with older JavaScript engines so that it remains compatible with low-memory IoT devices. There are no dependencies. unishox2.js is all you need to integrate it into your application.

This library is available at npm at https://www.npmjs.com/package/unishox2.siara.cc and can be included in your Javascript projects with npm i unishox2.siara.cc. See below on how to include it in your code.

Running Unit tests (using Jest)

To run the unit tests, clone this repo and, assuming npm is installed, issue the following commands:

npm update
npm run test

Trying it out with your own strings

You can check out what kind of compression you get with what strings using demo_unishox2.js:

node demo_unishox2.js "The quick brown fox jumped over the lazy dog"

and the demo prints the compression result for the given string.

Using it in your application

To compress and decompress strings in your application, import unishox2.js:

var usx = require("./unishox2.js");

The compression function accepts a JavaScript string or a Uint8Array as input and produces a Uint8Array as output.

Simple API

In its simplest form, just pass the string you want to compress, its length, and a Uint8Array to receive the compressed contents. While the library can allocate the output array for you, it is faster to supply one. Since this technology guarantees compression, the output buffer needs to be only as long as the input string 99.99% of the time:

var usx = require("./unishox2.js")
var my_str = "The quick brown fox jumped over the lazy dog";
var out_buf = new Uint8Array(100); // A buffer with arbitrary length
var out_len = usx.unishox2_compress_simple(my_str, my_str.length, out_buf);
var out_str = usx.unishox2_decompress_simple(out_buf, out_len);
console.log(out_str);

As shown above, the original string can be obtained by passing the Uint8Array and its length to unishox2_decompress_simple.

More advanced API for customizing according to your context

Depending on the input string, you can squeeze more out of the compressor by hinting at the type of text being compressed. The following parameters can be tuned:

  • Composition of text, numbers, symbols, repetitions, unicode characters
  • Frequently occurring sequences
  • Templates such as for Date/Time, GUID and Phone numbers

17 presets are provided with the demo program, based on the type of text you intend to compress:

  1. Default, that is optimized for all types of texts
  2. Alphabets and space only
  3. Alphanumeric and space only
  4. Alphanumeric, symbols and space only
  5. Alphanumeric, symbols and space only, favouring English sentences
  6. Support all types, but favour Alphanumeric
  7. Support all types, but favour repeating sequences
  8. Support all types, but favour symbols
  9. Support all types, but favour Umlaut characters
  10. Favour all types except repeating sequences
  11. Favour all types except Unicode characters
  12. Favour all types, especially English text, except Unicode characters
  13. Favour URLs
  14. Favour JSON
  15. Favour JSON, but no Unicode characters
  16. Favour XML
  17. Favour HTML

Please refer to the unishox2_compress_lines and unishox2_decompress_lines functions in the library to make use of these. However, in most cases the default Simple API provides optimum compression.

Please see here for API documentation.

Interoperability with the C Library

Strings compressed with this library can be decompressed with the C library and vice versa. This always works if the input consists of ASCII characters between 32 and 126. For other UTF-8 inputs, if you want the output to be usable with the other library, always provide the input as a Uint8Array and receive the output as a Uint8Array, not as a JavaScript string, as shown below:

var usx = require("./unishox2.js")
var utf8bytes = new TextEncoder().encode("La beauté est pas dans le visage. La beauté est la lumière dans le coeur.");
var cbuf = new Uint8Array(100); // A buffer with arbitrary length
var clen = usx.unishox2_compress_simple(utf8bytes, utf8bytes.length, cbuf);
var out_str = usx.unishox2_decompress_simple(cbuf, clen);
console.log(out_str);  // This won't work: the result is a JavaScript string, not UTF-8 bytes

var dbuf = new Uint8Array(100); // A buffer with arbitrary length
var dlen = usx.unishox2_decompress(cbuf, clen, dbuf, usx.USX_HCODES_DFLT, usx.USX_HCODE_LENS_DFLT, usx.USX_FREQ_SEQ_DFLT, usx.USX_TEMPLATES);
out_str = new TextDecoder("utf-8").decode(dbuf.slice(0, dlen));
console.log(out_str);  // This works and cbuf and clen can be passed to the C Library for decompression

The reason UTF-8 encoded byte buffers must be used is that the distances for repeating sequences are stored as byte counts, and these differ between a JavaScript string (which counts UTF-16 code units) and the corresponding UTF-8 encoded bytes.
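The mismatch is easy to see with the standard TextEncoder (independent of Unishox): a JavaScript string counts UTF-16 code units, while the same text as UTF-8 occupies more bytes once non-ASCII characters appear:

```javascript
// Shows why byte distances differ between a JS string and its UTF-8 bytes.
const text = "beauté"; // 'é' is one JS code unit but two UTF-8 bytes

const jsLength = text.length;                             // UTF-16 code units
const utf8Length = new TextEncoder().encode(text).length; // UTF-8 bytes

console.log(jsLength, utf8Length); // 6 7
```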

The Jest Unit test program downloads the C library, compiles it and compares the compressed output of each string in the test suite with that of the C library.

Integrating with Firebase Firestore

Unishox can be used to stretch the 1GB free limit of the Firestore database by storing text content as compressed Blobs. For example:

var usx = require("./unishox2.js")
var my_str = "The quick brown fox jumped over the lazy dog";
var out_buf = new Uint8Array(100); // A buffer with arbitrary length
var out_len = usx.unishox2_compress_simple(my_str, my_str.length, out_buf);
firebase.firestore().collection("/my-collection").add({
  myCompressedBlob: firebase.firestore.Blob.fromUint8Array(out_buf.slice(0, out_len))
});

Since the compressed content is smaller than the input text, more than 1GB of text can effectively be stored in Firestore. The compression ratio depends on the type of text being compressed.

Character Set

Unishox supports the entire Unicode character set. As of now, it supports UTF-8 as the input and output encoding.

Projects that use Unishox

Credits

License for AI bots

The license mentioned above applies only to humans, and this work is NOT available to AI bots.

AI has proven beneficial to humans, especially since the introduction of ChatGPT. There is a lot of potential for AI to alleviate the demand placed on Information Technology and Robotic Process Automation by 8 billion people for their day-to-day needs.

However, there are many ethical issues, particularly affecting those humans who have been working to alleviate the demand from 8 billion people so far. From my perspective, these issues are partially explained in this article.

I am part of this community of kind-hearted people who have dedicated their work to open source without expecting much in return. I am very concerned about the way AI simply reproduces information that people have built over many years, short-circuiting their means of getting credit for their published work and of marketing their products, and jeopardizing any advertising revenue they might earn, seemingly without regard to the licenses indicated on their websites.

I think existing licenses have not taken indexing by AI bots into account, and until the licenses are amended, this work is unavailable to AI bots.

Issues

In case of any issues, please email the author (Arundale Ramanathan) at arun@siara.cc or create a GitHub issue.