Most time series databases (e.g. DuckDB) store floating point data as 64-bit values and report extraordinary compression ratios using Gorilla/Chimp-like algorithms.
However, as shown in this benchmark, a lot of time series data (e.g. temperature, climate data, stocks, ...) has no more than 1 or 2 fixed decimal digits and can be stored losslessly in 16/32-bit integers. Integer compression algorithms can then be used, which yields significantly better compression ratios and is several times faster than the Gorilla-like algorithms.
Gorilla- or Chimp-based algorithms cannot exceed 1 GB/s in decompression, whereas TurboPFor is in the order of 15 GB/s and TurboBitByte is >100 GB/s.
We can also get an instant 50% reduction simply by storing these values as 32-bit floats instead of 64-bit floats.
Note that outside of time series with small changes between successive values, the Gorilla algorithm is of little use. The sketch below shows why.
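For reference, the core trick of Gorilla-style float compression is XORing each value with its predecessor: nearly-equal floats share sign, exponent and high mantissa bits, so the XOR has long runs of leading zero bits that can be coded compactly; a jump between values destroys that. A minimal illustration of the idea (not TurboPFor's fpgenc32 code; uses the GCC/Clang __builtin_clzll builtin):

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Reinterpret a double's bits as an integer (well-defined via memcpy). */
static uint64_t d2u(double d) { uint64_t u; memcpy(&u, &d, 8); return u; }

int main(void) {
  double t[] = { 21.3, 21.4, 21.4, 21.5, 35.0 }; /* small steps, then a jump */
  uint64_t prev = d2u(t[0]);
  for (int i = 1; i < 5; i++) {
    uint64_t cur = d2u(t[i]), x = prev ^ cur;
    /* many leading zeros -> cheap to encode; the jump to 35.0 is expensive */
    printf("xor=%016llx leading zeros=%d\n", (unsigned long long)x,
           x ? __builtin_clzll(x) : 64);
    prev = cur;
  }
  return 0;
}
```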
Not convinced? Verify all these claims yourself: download icapp and run your own benchmark with your data.
1 - Floating point mode (32-bit floats). In the tables below, E MB/s = encoding speed and D MB/s = decoding speed.
icapp city_temperature-fixed.csv -Ftf -K3 -I15 -J16 -e64,12,63,62
E MB/s size ratio D MB/s function floating point size=32 bits (lz=lz4,1)
593 6858573 59.01% 1262 64:fpcenc32 TurboFloat LzX bitio
1541 8323945 71.61% 9502 12:p4nzenc256v32 TurboPFor256 zigzag
690 8395129 72.23% 757 63:fphenc32 Chimp algo bitio
591 8741963 75.21% 826 62:fpgenc32 TurboGorilla bitio
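The "zigzag" in function names like p4nzenc256v32 refers to the standard zigzag transform applied to the deltas between successive values before bit packing: it folds signed differences into small unsigned integers so that moves in either direction pack into few bits. A sketch of the transform itself (the packing step is TurboPFor's job):

```c
#include <stdint.h>
#include <stddef.h>

/* Standard zigzag mapping: 0,-1,1,-2,2,... -> 0,1,2,3,4,... */
static inline uint32_t zigzag32(int32_t v)    { return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); }
static inline int32_t  unzigzag32(uint32_t v) { return (int32_t)(v >> 1) ^ -(int32_t)(v & 1); }

/* Delta + zigzag over an array, as done before the bit-packing stage. */
void delta_zigzag(const int32_t *in, uint32_t *out, size_t n) {
  int32_t prev = 0;
  for (size_t i = 0; i < n; i++) { out[i] = zigzag32(in[i] - prev); prev = in[i]; }
}
```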
2 - Integer mode: convert the floating point values to 16-bit integers (scaled by 10^d, where d = the number of fixed decimal digits; here d=1, so ×10) and use integer compression (see the sketch after the table below). 'city_temperature' is now compressed to 2998946 bytes vs. the 6858573 above.
icapp city_temperature-fixed.csv -Fts.1 -K3 -I15 -J15 -v0 -e11,39,51
E MB/s size ratio D MB/s function integer size=16 bits (lz=lz4,1)
653.64 2998946 51.60% 3683.83 11:p4nzenc128v16 TurboPFor128 zigzag
366.69 3130714 53.87% 1123.45 39:vszenc16 TurboVSimple zigzag
4596.49 3221225 55.43% 6979.32 51:v8nzenc128v16 TByte+TPackV zigzag
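The -Fts.1 option makes icapp apply this scaling while parsing the csv file. The equivalent transform in plain C, assuming 1 fixed decimal digit and values within int16_t range (hypothetical helpers for illustration, not icapp's internal code):

```c
#include <stdint.h>
#include <stddef.h>
#include <math.h>

/* Lossless for data with at most d decimal digits: 21.7 <-> 217 for d=1.
   scale = 10^d. Assumes |v*scale| <= 32767, i.e. the data fits in int16_t. */
void f32_to_i16(const float *in, int16_t *out, size_t n, float scale) {
  for (size_t i = 0; i < n; i++)
    out[i] = (int16_t)lroundf(in[i] * scale);
}

void i16_to_f32(const int16_t *in, float *out, size_t n, float scale) {
  for (size_t i = 0; i < n; i++)
    out[i] = (float)in[i] / scale;
}
```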
3 - Quasi-lossless mode (32-bit floats): each input value is preprocessed by the TurboRazor fprazor32 function, using a pointwise relative error (PWE) chosen according to the number of decimal digits of the value (applicable only to text/csv files: option -FcA or -FtA). Ex.: for the value 72.9 a PWE of 0.5 is used, and TurboRazor rounds the value to 72.8906; for the value 72.326 a PWE of 0.005 is used. A sketch of the idea follows the table below.
icapp city_temperature-fixed.csv -FtfA -K3 -I15 -J15 -e62,65,66,67,68,77,64
E MB/s size ratio D MB/s function floating point size=32 bits (lz=lz4,1)
979 5118429 44.03% 1632 62:fpgenc32 TurboGorilla bitio
1543 5589023 48.08% 7822 65:fpxenc32 TurboFloat XOR
1207 5757626 49.53% 1818 66:fpfcmenc32 TurboFloat FCM
1286 5844696 50.28% 1561 67:fpdfcmenc32 TurboFloat DFCM
1276 5936571 51.07% 1559 68:fp2dfcmenc32 TurboFloat DFCM 2D
609 7935767 68.27% 1338 64:fpcenc32 TurboFloat LzX bitio
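The effect of such pre-rounding is to zero the trailing mantissa bits that carry only sub-tolerance noise, leaving long zero runs for the XOR/FCM/LZ stages to exploit. A sketch of that idea under a per-value error bound, for finite inputs (an illustration only; fprazor32's actual rounding and API may differ):

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Clear as many trailing mantissa bits of a 32-bit float as possible
   while staying within the error bound e. Finite inputs assumed.
   (Sketch of the idea only; TurboPFor's fprazor32 may differ.) */
float razor_sketch(float v, float e) {
  uint32_t u, best;
  memcpy(&u, &v, 4);
  best = u;
  for (int b = 1; b <= 23; b++) {        /* try clearing 1..23 low bits */
    uint32_t t = u & ~((1u << b) - 1u);  /* zero the b lowest mantissa bits */
    float r; memcpy(&r, &t, 4);
    if (fabsf(r - v) > e) break;         /* past the error budget: stop */
    best = t;
  }
  float out; memcpy(&out, &best, 4);
  return out;
}
```

With the example above, razor_sketch(72.9f, 0.5f) yields a value such as 72.8906 whose low mantissa bits are all zero, which is why Gorilla/XOR-based coders do so much better in this mode.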
TurboFloat LzX: TurboPFor's LZ77-based algorithm for floating point data (works with 16/32/64-bit floats). TurboFloat LzX compresses better and decompresses faster than Chimp128.
TurboGorilla: improved compression and speed over the Gorilla algorithm, in TurboPFor.
Chimp: faster C implementation of the Chimp (Gorilla-based) algorithm, in TurboPFor.
Datasets: all benchmarks above use city_temperature-fixed.csv.