From 2b91c213328b61d28f30627d50a16e168e586497 Mon Sep 17 00:00:00 2001 From: loquacity Date: Wed, 30 Aug 2023 11:43:46 +1000 Subject: [PATCH 1/7] Pull policy section out of about --- .../compression/about-compression.md | 82 ---------------- .../compression/compression-policy.md | 96 +++++++++++++++++++ use-timescale/compression/index.md | 5 - use-timescale/page-index/page-index.js | 7 +- 4 files changed, 102 insertions(+), 88 deletions(-) create mode 100644 use-timescale/compression/compression-policy.md diff --git a/use-timescale/compression/about-compression.md b/use-timescale/compression/about-compression.md index 243f268ec2..953f7d26e3 100644 --- a/use-timescale/compression/about-compression.md +++ b/use-timescale/compression/about-compression.md @@ -23,84 +23,6 @@ best possible compression ratio. For more information about compressing chunks, see [manual compression][manual-compression]. -## Enable compression - -You can enable compression on individual hypertables, by declaring which column -you want to segment by. This procedure uses an example table, called `example`, -and segments it by the `device_id` column. Every chunk that is more than seven -days old is then marked to be automatically compressed. - -|time|device_id|cpu|disk_io|energy_consumption| -|-|-|-|-|-| -|8/22/2019 0:00|1|88.2|20|0.8| -|8/22/2019 0:05|2|300.5|30|0.9| - - - -### Enabling compression - -1. At the `psql` prompt, alter the table: - - ```sql - ALTER TABLE example SET ( - timescaledb.compress, - timescaledb.compress_segmentby = 'device_id' - ); - ``` - -1. Add a compression policy to compress chunks that are older than seven days: - - ```sql - SELECT add_compression_policy('example', INTERVAL '7 days'); - ``` - - - -For more information, see the API reference for -[`ALTER TABLE (compression)`][alter-table-compression] and -[`add_compression_policy`][add_compression_policy]. - -You can also set a compression policy through -the Timescale console. The compression tool automatically generates and -runs the compression commands for you. To learn more, see the -[Timescale documentation](/use-timescale/latest/services/service-explorer/#setting-a-compression-policy-from-timescale-cloud-console). - -## View current compression policy - -To view the compression policy that you've set: - -```sql -SELECT * FROM timescaledb_information.jobs - WHERE proc_name='policy_compression'; -``` - -For more information, see the API reference for [`timescaledb_information.jobs`][timescaledb_information-jobs]. - -## Remove compression policy - -To remove a compression policy, use `remove_compression_policy`. For example, to -remove a compression policy for a hypertable named `cpu`: - -```sql -SELECT remove_compression_policy('cpu'); -``` - -For more information, see the API reference for -[`remove_compression_policy`][remove_compression_policy]. - -## Disable compression - -You can disable compression entirely on individual hypertables. This command -works only if you don't currently have any compressed chunks: - -```sql -ALTER TABLE SET (timescaledb.compress=false); -``` - -If your hypertable contains compressed chunks, you need to -[decompress each chunk][decompress-chunks] individually before you can disable -compression. - ## Compression policy intervals Data is usually compressed after an interval of time, and not @@ -275,9 +197,5 @@ chunks. When you do this,the data that is being inserted is not compressed immediately. 
It is stored alongside the chunk it has been inserted into, and then a separate job merges it with the chunk and compresses it later on. -[alter-table-compression]: /api/:currentVersion:/compression/alter_table_compression/ -[add_compression_policy]: /api/:currentVersion:/compression/add_compression_policy/ [decompress-chunks]: /use-timescale/:currentVersion:/compression/decompress-chunks -[remove_compression_policy]: /api/:currentVersion:/compression/remove_compression_policy/ -[timescaledb_information-jobs]: /api/:currentVersion:/informational-views/jobs/ [manual-compression]: /use-timescale/:currentVersion:/compression/manual-compression/ diff --git a/use-timescale/compression/compression-policy.md b/use-timescale/compression/compression-policy.md new file mode 100644 index 0000000000..8a6f414853 --- /dev/null +++ b/use-timescale/compression/compression-policy.md @@ -0,0 +1,96 @@ +--- +title: Create a compression policy +excerpt: Create a compression policy on a hypertable +products: [cloud, mst, self_hosted] +keywords: [compression, hypertables, policy] +--- + +import CompressionIntro from 'versionContent/_partials/_compression-intro.mdx'; + +# Compression policy + +You can enable compression on individual hypertables, by declaring which column +you want to segment by. + +## Enable a compression policy + +This procedure uses an example table, called `example`, +and segments it by the `device_id` column. Every chunk that is more than seven +days old is then marked to be automatically compressed. + +|time|device_id|cpu|disk_io|energy_consumption| +|-|-|-|-|-| +|8/22/2019 0:00|1|88.2|20|0.8| +|8/22/2019 0:05|2|300.5|30|0.9| + + + +### Enabling compression + +1. At the `psql` prompt, alter the table: + + ```sql + ALTER TABLE example SET ( + timescaledb.compress, + timescaledb.compress_segmentby = 'device_id' + ); + ``` + +1. Add a compression policy to compress chunks that are older than seven days: + + ```sql + SELECT add_compression_policy('example', INTERVAL '7 days'); + ``` + + + +For more information, see the API reference for +[`ALTER TABLE (compression)`][alter-table-compression] and +[`add_compression_policy`][add_compression_policy]. + +You can also set a compression policy through +the Timescale console. The compression tool automatically generates and +runs the compression commands for you. To learn more, see the +[Timescale documentation](/use-timescale/latest/services/service-explorer/#setting-a-compression-policy-from-timescale-cloud-console). + +## View current compression policy + +To view the compression policy that you've set: + +```sql +SELECT * FROM timescaledb_information.jobs + WHERE proc_name='policy_compression'; +``` + +For more information, see the API reference for [`timescaledb_information.jobs`][timescaledb_information-jobs]. + +## Remove compression policy + +To remove a compression policy, use `remove_compression_policy`. For example, to +remove a compression policy for a hypertable named `cpu`: + +```sql +SELECT remove_compression_policy('cpu'); +``` + +For more information, see the API reference for +[`remove_compression_policy`][remove_compression_policy]. + +## Disable compression + +You can disable compression entirely on individual hypertables. This command +works only if you don't currently have any compressed chunks: + +```sql +ALTER TABLE SET (timescaledb.compress=false); +``` + +If your hypertable contains compressed chunks, you need to +[decompress each chunk][decompress-chunks] individually before you can turn off +compression. 
+ +[alter-table-compression]: /api/:currentVersion:/compression/alter_table_compression/ +[add_compression_policy]: /api/:currentVersion:/compression/add_compression_policy/ +[decompress-chunks]: /use-timescale/:currentVersion:/compression/decompress-chunks +[remove_compression_policy]: /api/:currentVersion:/compression/remove_compression_policy/ +[timescaledb_information-jobs]: /api/:currentVersion:/informational-views/jobs/ diff --git a/use-timescale/compression/index.md b/use-timescale/compression/index.md index 159d5624d1..db120fcfc4 100644 --- a/use-timescale/compression/index.md +++ b/use-timescale/compression/index.md @@ -17,8 +17,3 @@ data to the form of compressed columns. This occurs across chunks of Timescale hypertables. - -[backfill-historical]: /use-timescale/:currentVersion:/compression/backfill-historical-data/ -[decompress-chunks]: /use-timescale/:currentVersion:/compression/decompress-chunks/ -[modify-schema]: /use-timescale/:currentVersion:/compression/modify-a-schema/ -[compression-tshoot]: /use-timescale/:currentVersion:/compression/troubleshooting/ diff --git a/use-timescale/page-index/page-index.js b/use-timescale/page-index/page-index.js index 478181a940..a61c63d626 100644 --- a/use-timescale/page-index/page-index.js +++ b/use-timescale/page-index/page-index.js @@ -251,7 +251,12 @@ module.exports = [ { title: "About compression", href: "about-compression", - excerpt: "Compress data chunks", + excerpt: "Learn about how compression works", + }, + { + title: "Enable a compression policy", + href: "compression-policy", + excerpt: "Create a compression policy on a hypertable", }, { title: "Manual compression", From df1e84542170a346669ad718356f5e92909f4575 Mon Sep 17 00:00:00 2001 From: loquacity Date: Wed, 30 Aug 2023 16:44:25 +1000 Subject: [PATCH 2/7] add blog post content and edit --- .../compression/compression-methods.md | 291 ++++++++++++++++++ 1 file changed, 291 insertions(+) create mode 100644 use-timescale/compression/compression-methods.md diff --git a/use-timescale/compression/compression-methods.md b/use-timescale/compression/compression-methods.md new file mode 100644 index 0000000000..88b812491e --- /dev/null +++ b/use-timescale/compression/compression-methods.md @@ -0,0 +1,291 @@ +--- +title: About compression methods +excerpt: Understand the different compression methods +products: [cloud, mst, self_hosted] +keywords: [compression] +--- + +# About compression methods + +TimescaleDB uses different compression algorithms, depending on the data type +that is being compressed. + +For integers, timestamps, and other integer-like types, a combination of +compression methods are used: [delta encoding][delta], +[delta-of-delta][delta-delta], [simple-8b][simple-8b], and +[run-length encoding][run-length]. + +For columns that do not have a high amount of repeated values, +[XOR-based][xor] compression is used, with some +[dictionary compression][dictionary]. + +For all other types, [dictionary compression][dictionary] is used. + +## Integer compression + +For integers, timestamps, and other integer-like types TimescaleDB uses a +combination of delta encoding, delta-of-delta, simple 8-b, and run-length +encoding. + +The simple-8b compression method has been extended so that data can be +decompressed in reverse order. Backward scanning queries are common in +time-series workloads. This means that these types of queries run much faster. 
+
+### Delta encoding
+
+Delta encoding reduces the amount of information required to represent a data
+object by only storing the difference, sometimes referred to as the delta,
+between that object and one or more reference objects. These algorithms work
+best where there is a lot of redundant information, and they are often used in
+workloads like versioned file systems. For example, this is how Dropbox keeps
+your files synchronized. Applying delta-encoding to time-series data means that
+you can use fewer bytes to represent a data point, because you only need to
+store the delta from the previous data point.
+
+For example, imagine you had a dataset that collected CPU, free memory,
+temperature, and humidity over time. If your time column was stored as an integer
+value, like seconds since UNIX epoch, your raw data would look a little like
+this:
+
+|time|cpu|mem_free_bytes|temperature|humidity|
+|-|-|-|-|-|
+|2023-04-01 10:00:00|82|1,073,741,824|80|25|
+|2023-04-01 10:00:05|98|858,993,459|81|25|
+|2023-04-01 10:00:10|98|858,904,583|81|25|
+
+With delta encoding, you only need to store how much each value changed from the
+previous data point, resulting in smaller values to store. So after the first
+row, you can represent subsequent rows with less information, like this:
+
+|time|cpu|mem_free_bytes|temperature|humidity|
+|-|-|-|-|-|
+|2023-04-01 10:00:00|82|1,073,741,824|80|25|
+|5 seconds|16|-214,748,365|1|0|
+|5 seconds|0|-88,876|0|0|
+
+Applying delta encoding to time-series data takes advantage of the fact that
+most time-series datasets are not random, but instead represent something that
+is slowly changing over time. The storage savings over millions of rows can be
+substantial, especially if the value changes very little, or doesn't change at
+all.
+
+### Delta-of-delta encoding
+
+Delta-of-delta encoding takes delta encoding one step further and applies
+delta-encoding over data that has previously been delta-encoded. With
+time-series datasets where data collection happens at regular intervals, you can
+apply delta-of-delta encoding to the time column, which results in only needing to
+store a series of 0s.
+
+In other words, delta encoding stores the first derivative of the dataset, while
+delta-of-delta encoding stores the second derivative of the dataset.
+
+Applied to the example dataset from earlier, delta-of-delta encoding results in this:
+
+|time|cpu|mem_free_bytes|temperature|humidity|
+|-|-|-|-|-|
+|2023-04-01 10:00:00|82|1,073,741,824|80|25|
+|5 seconds|16|-214,748,365|1|0|
+|0|0|-88,876|0|0|
+
+In this example, delta-of-delta further compresses 5 seconds in the time column
+down to 0 for every entry in the time column after the second row, because the
+five second gap remains constant for each entry. Note that you see two entries
+in the table before the delta-delta 0 values, because you need two deltas to
+compare.
+
+This compresses a full timestamp of 8 bytes, or 64 bits, down to just a single
+bit, resulting in 64x compression.
+
+### Simple-8b
+
+With delta and delta-of-delta encoding, you can significantly reduce the number
+of digits you need to store. But you still need an efficient way to store the
+smaller integers. The previous examples used a standard integer datatype for the
+time column, which needs 64 bits to represent the value of 0 when delta-delta
+encoded. This means that even though you are only storing the integer 0, you are
+still consuming 64 bits to store it, so you haven't actually saved anything.
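+
+The following Python sketch is not TimescaleDB's implementation, just a rough
+illustration of the two encodings described above. It shows how delta and
+delta-of-delta turn a regularly spaced timestamp column into a handful of very
+small integers, which is exactly the kind of input that Simple-8b is designed
+to pack efficiently:
+
+```python
+def delta_encode(values):
+    """Keep the first value, then store only the change from the previous value."""
+    return [values[0]] + [curr - prev for prev, curr in zip(values, values[1:])]
+
+# Timestamps collected every 5 seconds, stored as seconds since the UNIX epoch.
+timestamps = [1696118400, 1696118405, 1696118410, 1696118415, 1696118420]
+
+deltas = delta_encode(timestamps)
+print(deltas)           # [1696118400, 5, 5, 5, 5]
+
+# Delta-of-delta: apply delta encoding a second time, to the deltas themselves.
+delta_of_deltas = [deltas[0]] + delta_encode(deltas[1:])
+print(delta_of_deltas)  # [1696118400, 5, 0, 0, 0]
+```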
+
+Simple-8b is one of the simplest and smallest methods of storing variable-length
+integers. In this method, integers are stored as a series of fixed-size blocks.
+For each block, every integer within the block is represented by the minimal
+bit-length needed to represent the largest integer in that block. The first bits
+of each block denote the minimum bit-length for the block.
+
+This technique has the advantage of only needing to store the length once for a
+given block, instead of once for each integer. Because the blocks are of a fixed
+size, you can infer the number of integers in each block from the size of the
+integers being stored.
+
+For example, if you wanted to store a temperature that changed over time, and
+you applied delta encoding, you might end up needing to store this set of
+integers:
+
+|temperature (deltas)|
+|-|
+|1|
+|10|
+|11|
+|13|
+|9|
+|100|
+|22|
+|11|
+
+With a block size of 10 digits, you could store this set of integers as two
+blocks: one block storing 5 2-digit numbers, and a second block storing 3
+3-digit numbers, like this:
+
+|Block|Digits per integer|Contents|
+|-|-|-|
+|Block 1|2|01, 10, 11, 13, 09|
+|Block 2|3|100, 022, 011|
+
+In this example, both blocks store about 10 digits worth of data, even though
+some of the numbers have to be padded with a leading 0. You might also notice
+that the second block only stores 9 digits, because 10 is not evenly divisible
+by 3.
+
+Simple-8b works in this way, except it uses binary numbers instead of decimal,
+and it usually uses 64-bit blocks. In general, the longer the integers, the
+fewer integers can be stored in each block.
+
+### Run-length encoding
+
+Simple-8b compresses integers very well, however, if you have a large number of
+repeats of the same value, you can get even better compression with run-length
+encoding. This method works well for values that don't change very often, or if
+an earlier transformation removes the changes.
+
+Run-length encoding is one of the classic compression algorithms. For
+time-series data with billions of contiguous 0s, or even a document with a
+million identically repeated strings, run-length encoding works incredibly well.
+
+For example, if you wanted to store a temperature that changed minimally over
+time, and you applied delta encoding, you might end up needing to store this set
+of integers:
+
+|temperature (deltas)|
+|-|
+|11|
+|12|
+|12|
+|12|
+|12|
+|12|
+|12|
+|1|
+|12|
+|12|
+|12|
+|12|
+
+For values like these, you do not need to store each instance of the value, but
+rather how long the run, or number of repeats, is. You can store this set of
+numbers as `{run; value}` pairs like this:
+
+|run|value|
+|-|-|
+|1|11|
+|6|12|
+|1|1|
+|4|12|
+
+This technique uses 11 digits of storage (1, 1, 1, 6, 1, 2, 1, 1, 4, 1, 2),
+rather than the 23 digits that an optimal series of variable-length integers
+requires (11, 12, 12, 12, 12, 12, 12, 1, 12, 12, 12, 12).
+
+Run-length encoding is also used as a building block for many more advanced
+algorithms, such as Simple-8b RLE, which is an algorithm that combines
+run-length and Simple-8b techniques. TimescaleDB implements a variant of
+Simple-8b RLE. This variant uses different sizes from standard Simple-8b, in
+order to handle 64-bit values and RLE.
+
+## Floating point compression
+
+For columns that do not have a high amount of repeated values, TimescaleDB uses
+XOR-based compression.
+
+The standard XOR-based compression method has been extended so that data can be
+decompressed in reverse order. Backward scanning queries are common in
+time-series workloads. This means that queries that use backwards scans run much
+faster.
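+
+To get a feel for the core idea before the detailed walkthrough in the next
+section, here is a minimal Python sketch, not TimescaleDB's implementation, of
+XOR-comparing successive floating point values. The readings used here are
+arbitrary example values:
+
+```python
+import struct
+
+def float_bits(x):
+    """Return the 64-bit IEEE 754 representation of a float as an integer."""
+    return struct.unpack("<Q", struct.pack("<d", x))[0]
+
+readings = [70.11, 70.12, 70.14]
+prev = float_bits(readings[0])
+for value in readings[1:]:
+    curr = float_bits(value)
+    xor = prev ^ curr
+    # Nearby values share the sign, exponent, and leading mantissa bits, so the
+    # XOR of their bit patterns has long runs of zero bits that can be stored
+    # very compactly.
+    print(f"{value}: XOR with previous value has {64 - xor.bit_length()} leading zero bits")
+    prev = curr
+```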
+
+### XOR-based compression
+
+Floating point numbers are usually more difficult to compress than integers.
+Fixed-length integers often have leading 0s, but floating point numbers usually
+use all of their available bits, especially if they are converted from decimal
+numbers, which can't be represented precisely in binary.
+
+Techniques like delta-encoding don't work well for floats, because they do not
+reduce the number of bits sufficiently. This means that most floating-point
+compression algorithms tend to be either complex and slow, or truncate
+significant digits. One of the few simple and fast lossless floating-point
+compression algorithms is XOR-based compression, built on top of Facebook's
+Gorilla compression.
+
+XOR is the binary function `exclusive or`. In this algorithm, successive
+floating point numbers are compared with XOR, and a difference results in a bit
+being stored. The first data point is stored without compression, and subsequent
+data points are represented using their XOR'd values.
+
+## Data-agnostic compression
+
+For values that are not integers or floating point, TimescaleDB uses dictionary
+compression.
+
+### Dictionary compression
+
+One of the earliest lossless compression algorithms, dictionary compression is
+the basis of many popular compression methods. Dictionary compression can also
+be found in areas outside of computer science, such as medical coding.
+
+Instead of storing values directly, dictionary compression works by making a
+list of the possible values that can appear, and then storing an index into a
+dictionary containing the unique values. This technique is quite versatile, can
+be used regardless of data type, and works especially well when you have a
+limited set of values that repeat frequently.
+
+For example, if you had the list of temperatures shown earlier, but you wanted
+an additional column storing a city location for each measurement, you might
+have a set of values like this:
+
+|City|
+|-|
+|New York|
+|San Francisco|
+|San Francisco|
+|Los Angeles|
+
+Instead of storing all the city names directly, you can instead store a
+dictionary, like this:
+
+|Index|City|
+|-|-|
+|0|New York|
+|1|San Francisco|
+|2|Los Angeles|
+
+You can then store just the indices in your column, like this:
+
+|City|
+|-|
+|0|
+|1|
+|1|
+|2|
+
+For a dataset with a lot of repetition, this can offer significant compression.
+In the example, each city name is on average 11 bytes in length, while the
+indices are never going to be more than 4 bytes long, reducing space usage
+nearly 3 times. In TimescaleDB, the list of indices is compressed even further
+with the Simple-8b+RLE method, making the storage cost even smaller.
+
+Dictionary compression doesn't always result in savings. If your dataset doesn't
+have a lot of repeated values, then the dictionary is the same size as the
+original data. TimescaleDB automatically detects this case, and falls back to
+not using a dictionary in that scenario.
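+
+If it helps to see the idea in code, here is a minimal Python sketch of
+dictionary encoding. It is a simplified illustration only, not TimescaleDB's
+implementation, which also compresses the index list with Simple-8b and RLE:
+
+```python
+def dictionary_encode(values):
+    """Return the list of unique values and an index into it for each input value."""
+    dictionary = []
+    positions = {}   # value -> its index in the dictionary
+    indexes = []
+    for value in values:
+        if value not in positions:
+            positions[value] = len(dictionary)
+            dictionary.append(value)
+        indexes.append(positions[value])
+    return dictionary, indexes
+
+cities = ["New York", "San Francisco", "San Francisco", "Los Angeles"]
+dictionary, indexes = dictionary_encode(cities)
+print(dictionary)   # ['New York', 'San Francisco', 'Los Angeles']
+print(indexes)      # [0, 1, 1, 2]
+```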
+ +[decompress-chunks]: /use-timescale/:currentVersion:/compression/decompress-chunks +[manual-compression]: /use-timescale/:currentVersion:/compression/manual-compression/ +[delta]: /use-timescale/:currentVersion:/compression/compression-methods/#delta-encoding +[delta-delta]: /use-timescale/:currentVersion:/compression/compression-methods/#delta-of-delta-encoding +[simple-8b]: /use-timescale/:currentVersion:/compression/compression-methods/#simple-8b +[run-length]: /use-timescale/:currentVersion:/compression/compression-methods/#run-length-encoding +[xor]: /use-timescale/:currentVersion:/compression/compression-methods/#xor-based-encoding +[dictionary]: /use-timescale/:currentVersion:/compression/compression-methods/#dictionary-compression From 5cd55e7be2602a96efe41f5e637fcb2214ffebdb Mon Sep 17 00:00:00 2001 From: loquacity Date: Wed, 30 Aug 2023 16:45:16 +1000 Subject: [PATCH 3/7] add methods page to index --- use-timescale/page-index/page-index.js | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/use-timescale/page-index/page-index.js b/use-timescale/page-index/page-index.js index a61c63d626..ee8887b49f 100644 --- a/use-timescale/page-index/page-index.js +++ b/use-timescale/page-index/page-index.js @@ -253,6 +253,11 @@ module.exports = [ href: "about-compression", excerpt: "Learn about how compression works", }, + { + title: "About compression methods", + href: "compression-methods", + excerpt: "Learn about the different compression methods", + }, { title: "Enable a compression policy", href: "compression-policy", From a27ae3f43c7604a07c56a738cd9efc6754e729b5 Mon Sep 17 00:00:00 2001 From: loquacity Date: Thu, 31 Aug 2023 15:59:35 +1000 Subject: [PATCH 4/7] compression design --- .../compression/about-compression.md | 2 - .../compression/compression-design.md | 135 ++++++++++++++++++ .../compression/compression-policy.md | 11 +- 3 files changed, 138 insertions(+), 10 deletions(-) create mode 100644 use-timescale/compression/compression-design.md diff --git a/use-timescale/compression/about-compression.md b/use-timescale/compression/about-compression.md index 953f7d26e3..733b404f87 100644 --- a/use-timescale/compression/about-compression.md +++ b/use-timescale/compression/about-compression.md @@ -21,8 +21,6 @@ This section explains how to enable native compression, and then goes into detail on the most important settings for compression, to help you get the best possible compression ratio. -For more information about compressing chunks, see [manual compression][manual-compression]. - ## Compression policy intervals Data is usually compressed after an interval of time, and not diff --git a/use-timescale/compression/compression-design.md b/use-timescale/compression/compression-design.md new file mode 100644 index 0000000000..013c6538ef --- /dev/null +++ b/use-timescale/compression/compression-design.md @@ -0,0 +1,135 @@ +--- +title: Designing your database for compression +excerpt: Learn how to design your database for the most effective compression +products: [cloud, mst, self_hosted] +keywords: [compression, schema, tables] +--- + +# Designing for compression + +Traditionally, databases are considered either row-based, or column based. And +each type of database brings benefits and drawbacks, including query speed, +insert speed, and the level to which they can effectively compress data. +Generally speaking, column-oriented databases are highly compressible, but +inserting data can take longer. Conversely, row-oriented databases have faster +queries, but can't compress as well. 
+
+Time-series data can be unique, in that it needs to handle both shall and wide
+queries, such as "What's happened across the deployment in the last 10 minutes,"
+and deep and narrow, such as "What is the average CPU usage for this server
+over the last 24 hours." Time-series data usually has a very high rate of
+inserts as well; hundreds of thousands of writes per second can be very normal
+for a time-series dataset. Additionally, time-series data is often very
+granular, and data is collected at a higher resolution than many other
+datasets. This can result in terabytes of data being collected over time.
+
+All this means that if you need great compression rates, you probably need to
+consider the design of your database, before you start ingesting data. This
+section covers some of the things you need to take into consideration when
+designing your database for maximum compression effectiveness.
+
+## Array format
+
+TimescaleDB is built on PostgreSQL which is, by nature, a row-based database.
+Because time-series data is accessed in order of time, TimescaleDB converts many
+wide rows of data into a single row of data, called an array form. This means
+that each field of that new, wide row stores an ordered set of data comprising
+the entire column.
+
+For example, if you had a table with data that looked a bit like this:
+
+|Timestamp|Device ID|Status Code|Temperature|
+|-|-|-|-|
+|12:00:01|A|0|70.11|
+|12:00:01|B|0|69.70|
+|12:00:02|A|0|70.12|
+|12:00:02|B|0|69.69|
+|12:00:03|A|0|70.14|
+|12:00:03|B|4|69.70|
+
+You can convert this to a single row in array form, like this:
+
+|Timestamp|Device ID|Status Code|Temperature|
+|-|-|-|-|
+|[12:00:01, 12:00:01, 12:00:02, 12:00:02, 12:00:03, 12:00:03]|[A, B, A, B, A, B]|[0, 0, 0, 0, 0, 4]|[70.11, 69.70, 70.12, 69.69, 70.14, 69.70]|
+
+Even before you compress any data, this format immediately saves storage by
+reducing the per-row overhead. PostgreSQL typically adds around 27 bytes of
+overhead per row. So even without any compression, if our schema above is say 32
+bytes, then 1000 rows of data which previously took about 59 kilobytes
+(`1000 x (32 + 27) ~= 59 kilobytes`), now takes about 32 kilobytes
+(`1000 x 32 + 27 ~= 32 kilobytes`) in this format.
+
+This format arranges the data so that similar data, such as timestamps, device
+IDs, or temperature readings, is stored contiguously. This means that you can
+then use type-specific compression algorithms to compress the data further, and
+each array is separately compressed. For more information about the compression
+methods used, see the [compression methods section][compression-methods].
+
+When the data is in array format, you can perform queries that require a subset
+of the columns very quickly. For example, if you have a query like this one, that
+asks for the average temperature over the past day:
+
+```sql
+SELECT time_bucket('1 minute', timestamp) AS minute,
+    avg(temperature)
+FROM example
+WHERE timestamp > now() - interval '1 day'
+GROUP BY minute
+ORDER BY minute DESC;
+```
+
+The query engine can fetch and decompress only the timestamp and temperature
+columns to efficiently compute and return these results.
+
+Finally, TimescaleDB uses non-inline disk pages to store the compressed arrays.
+This means that the in-row data points to a secondary disk page that stores the
+compressed array, and the actual row in the main table becomes very small,
+because it is now just pointers to the data. When data stored like this is
+queried, only the compressed arrays for the required columns are read from disk,
+further improving performance by reducing disk reads and writes.
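+
+As a rough illustration of the row-to-array transformation described above,
+here is a short Python sketch, not how TimescaleDB stores data internally, that
+collapses a set of narrow rows into one wide row of ordered arrays, one per
+column:
+
+```python
+from collections import defaultdict
+
+rows = [
+    {"time": "12:00:01", "device_id": "A", "status": 0, "temperature": 70.11},
+    {"time": "12:00:01", "device_id": "B", "status": 0, "temperature": 69.70},
+    {"time": "12:00:02", "device_id": "A", "status": 0, "temperature": 70.12},
+]
+
+# Gather the values of each column, in row order, into a single array per column.
+columns = defaultdict(list)
+for row in rows:
+    for name, value in row.items():
+        columns[name].append(value)
+
+print(dict(columns))
+# {'time': ['12:00:01', '12:00:01', '12:00:02'],
+#  'device_id': ['A', 'B', 'A'],
+#  'status': [0, 0, 0],
+#  'temperature': [70.11, 69.7, 70.12]}
+```
+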
+ +## Querying compressed data + +In the previous example, the database has no way of knowing which rows need to +be fetched and decompressed to resolve a query. For example, the database can't +easily determine which rows contain data from the past day, as the timestamp +itself is in a compressed column. You don't want to have to decompress all the +data in a chunk, or even an entire hypertable, to determine which rows are +required. + +TimescaleDB automatically includes more information in the row and includes +additional groupings to improve query performance. When you compress a +hypertable, either manually or through a compression policy, you need to specify +an `ORDER BY` column. + +`ORDER BY` columns specify how the rows that are part of a compressed patch are +ordered. For most time-series workloads, this is by timestamp, but you can also +specify a second dimension, such as location. + +For each `ORDER BY` column, TimescaleDB automatically creates additional columns +that store the minimum and maximum value of that column. This way, the query +planner can look at the range of timestamps in the compressed column, without +having to do any decompression, and determine whether the row could possibly +match the query. + +When you compress your hypertable, you can also choose to specify a `SEGMENT BY` +column. This allows you to segment compressed rows by a specific column, +so that each compressed row corresponds to a data about a single item such as, +for example, a specific device ID. This further allows the query planner to +determine if the row could possibly match the query. For example: + +|Device ID|Timestamp|Status Code|Temperature|Min Timestamp|Max Timestamp| +|-|-|-|-|-|-| +|A|[12:00:01, 12:00:02, 12:00:03]|[0, 0, 0]|[70.11, 70.12, 70.14]|12:00:01|12:00:03| +|B|[12:00:01, 12:00:02, 12:00:03]|[0, 0, 0]|[70.11, 70.12, 70.14]|12:00:01|12:00:03| + +With the data segmented in this way, a query for device A between a time +interval becomes quite fast. The query planner can use an index to find those +rows for device A that contain at least some timestamps corresponding to the +specified interval, and even a sequential scan is quite fast since evaluating +device IDs or timestamps does not require decompression. This means the the +query executor only decompresses the timestamp and temperature columns +corresponding to those selected rows. + +[compression-methods]: /use-timescale/:currentVersion:/compression/compression-methods/ diff --git a/use-timescale/compression/compression-policy.md b/use-timescale/compression/compression-policy.md index 8a6f414853..470b86b51e 100644 --- a/use-timescale/compression/compression-policy.md +++ b/use-timescale/compression/compression-policy.md @@ -14,9 +14,9 @@ you want to segment by. ## Enable a compression policy -This procedure uses an example table, called `example`, -and segments it by the `device_id` column. Every chunk that is more than seven -days old is then marked to be automatically compressed. +This procedure uses an example table, called `example`, and segments it by the +`device_id` column. Every chunk that is more than seven days old is then marked +to be automatically compressed. The source data is organized like this: |time|device_id|cpu|disk_io|energy_consumption| |-|-|-|-|-| @@ -48,11 +48,6 @@ For more information, see the API reference for [`ALTER TABLE (compression)`][alter-table-compression] and [`add_compression_policy`][add_compression_policy]. -You can also set a compression policy through -the Timescale console. 
The compression tool automatically generates and -runs the compression commands for you. To learn more, see the -[Timescale documentation](/use-timescale/latest/services/service-explorer/#setting-a-compression-policy-from-timescale-cloud-console). - ## View current compression policy To view the compression policy that you've set: From 31b451abf5f8d37a96b89a4bcd8860b41384280b Mon Sep 17 00:00:00 2001 From: loquacity Date: Mon, 4 Sep 2023 16:49:29 +1000 Subject: [PATCH 5/7] Edits per feedback --- .../compression/compression-design.md | 45 ++++++++----------- 1 file changed, 19 insertions(+), 26 deletions(-) diff --git a/use-timescale/compression/compression-design.md b/use-timescale/compression/compression-design.md index 013c6538ef..e162beb465 100644 --- a/use-timescale/compression/compression-design.md +++ b/use-timescale/compression/compression-design.md @@ -7,14 +7,7 @@ keywords: [compression, schema, tables] # Designing for compression -Traditionally, databases are considered either row-based, or column based. And -each type of database brings benefits and drawbacks, including query speed, -insert speed, and the level to which they can effectively compress data. -Generally speaking, column-oriented databases are highly compressible, but -inserting data can take longer. Conversely, row-oriented databases have faster -queries, but can't compress as well. - -Time-series data can be unique, in that it needs to handle both shall and wide +Time-series data can be unique, in that it needs to handle both shallow and wide queries, such as "What's happened across the deployment in the last 10 minutes," and deep and narrow, such as "What is the average CPU usage for this server over the last 24 hours." Time-series data usually has a very high rate of @@ -28,13 +21,13 @@ consider the design of your database, before you start ingesting data. This section covers some of the things you need to take into consideration when designing your database for maximum compression effectiveness. -## Array format +## Compressing data TimescaleDB is built on PostgreSQL which is, by nature, a row-based database. -Because time-series data is accessed in order of time, TimescaleDB converts many -wide rows of data into a single row of data, called an array form. This means -that each field of that new, wide row stores an ordered set of data comprising -the entire column. +Because time-series data is accessed in order of time, when you enable +compression, TimescaleDB converts many wide rows of data into a single row of +data, called an array form. This means that each field of that new, wide row +stores an ordered set of data comprising the entire column. For example, if you had a table with data that looked a bit like this: @@ -54,11 +47,9 @@ You can convert this to a single row in array form, like this: |[12:00:01, 12:00:01, 12:00:02, 12:00:02, 12:00:03, 12:00:03]|[A, B, A, B, A, B]|[0, 0, 0, 0, 0, 4]|[70.11, 69.70, 70.12, 69.69, 70.14, 69.70]| Even before you compress any data, this format immediately saves storage by -reducing the per-row overhead. PostgreSQL typically adds around 27 bytes of -overhead per row. So even without any compression, if our schema above is say 32 -bytes, then 1000 rows of data which previously took about 59 kilobytes -(`1000 x (32 + 27) ~= 59 kilobytes`), now takes about 32 kilobytes -(`1000 x 32 + 27 ~= 32 kilobytes`) in this format. +reducing the per-row overhead. PostgreSQL typically adds a small number of bytes +of overhead per row. 
So even without any compression, the schema in this example +is now smaller on disk than the previous format. This format arranges the data so that similar data, such as timestamps, device IDs, or temperature readings, is stored contiguously. This means that you can @@ -100,12 +91,13 @@ required. TimescaleDB automatically includes more information in the row and includes additional groupings to improve query performance. When you compress a -hypertable, either manually or through a compression policy, you need to specify +hypertable, either manually or through a compression policy, it can help to specify an `ORDER BY` column. -`ORDER BY` columns specify how the rows that are part of a compressed patch are -ordered. For most time-series workloads, this is by timestamp, but you can also -specify a second dimension, such as location. +`ORDER BY` columns specify how the rows that are part of a compressed batch are +ordered. For most time-series workloads, this is by timestamp, so if you don't +specify an `ORDER BY` column, TimescaleDB defaults to using the time column. You +can also specify additional dimensions, such as location. For each `ORDER BY` column, TimescaleDB automatically creates additional columns that store the minimum and maximum value of that column. This way, the query @@ -114,10 +106,11 @@ having to do any decompression, and determine whether the row could possibly match the query. When you compress your hypertable, you can also choose to specify a `SEGMENT BY` -column. This allows you to segment compressed rows by a specific column, -so that each compressed row corresponds to a data about a single item such as, -for example, a specific device ID. This further allows the query planner to -determine if the row could possibly match the query. For example: +column. This allows you to segment compressed rows by a specific column, so that +each compressed row corresponds to a data about a single item such as, for +example, a specific device ID. This further allows the query planner to +determine if the row could possibly match the query without having to decompress +the column first. For example: |Device ID|Timestamp|Status Code|Temperature|Min Timestamp|Max Timestamp| |-|-|-|-|-|-| From d86c2ccb6b0170d167888ef9744d449ede028f40 Mon Sep 17 00:00:00 2001 From: loquacity Date: Mon, 4 Sep 2023 16:49:46 +1000 Subject: [PATCH 6/7] remove repeated word --- use-timescale/compression/compression-design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/use-timescale/compression/compression-design.md b/use-timescale/compression/compression-design.md index e162beb465..b82a924df3 100644 --- a/use-timescale/compression/compression-design.md +++ b/use-timescale/compression/compression-design.md @@ -121,7 +121,7 @@ With the data segmented in this way, a query for device A between a time interval becomes quite fast. The query planner can use an index to find those rows for device A that contain at least some timestamps corresponding to the specified interval, and even a sequential scan is quite fast since evaluating -device IDs or timestamps does not require decompression. This means the the +device IDs or timestamps does not require decompression. This means the query executor only decompresses the timestamp and temperature columns corresponding to those selected rows. 
From 407784974f1fb15120dbba212bdd6e0f7f974034 Mon Sep 17 00:00:00 2001 From: Mats Kindahl Date: Wed, 1 Nov 2023 12:10:24 +0100 Subject: [PATCH 7/7] Use "zeroes" instead of "0s" --- use-timescale/compression/compression-methods.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/use-timescale/compression/compression-methods.md b/use-timescale/compression/compression-methods.md index 88b812491e..58e457630f 100644 --- a/use-timescale/compression/compression-methods.md +++ b/use-timescale/compression/compression-methods.md @@ -75,7 +75,7 @@ Delta-of-delta encoding takes delta encoding one step further and applies delta-encoding over data that has previously been delta-encoded. With time-series datasets where data collection happens at regular intervals, you can apply delta-of-delta encoding to the time column, which results in only needing to -store a series of 0s. +store a series of zeroes. In other words, delta encoding stores the first derivative of the dataset, while delta-of-delta encoding stores the second derivative of the dataset. @@ -157,7 +157,7 @@ encoding. This method works well for values that don't change very often, or if an earlier transformation removes the changes. Run-length encoding is one of the classic compression algorithms. For -time-series data with billions of contiguous 0s, or even a document with a +time-series data with billions of contiguous zeroes, or even a document with a million identically repeated strings, run-length encoding works incredibly well. For example, if you wanted to store a temperature that changed minimally over @@ -210,7 +210,7 @@ faster. ### XOR-based compression Floating point numbers are usually more difficult to compress than integers. -Fixed-length integers often have leading 0s, but floating point numbers usually +Fixed-length integers often have leading zeroes, but floating point numbers usually use all of their available bits, especially if they are converted from decimal numbers, which can't be represented precisely in binary.