Update Compression content #2664

Merged · 7 commits · Nov 1, 2023
84 changes: 0 additions & 84 deletions use-timescale/compression/about-compression.md
@@ -21,86 +21,6 @@ This section explains how to enable native compression, and then goes into
detail on the most important settings for compression, to help you get the
best possible compression ratio.

For more information about compressing chunks, see [manual compression][manual-compression].

## Enable compression

You can enable compression on individual hypertables by declaring which column
you want to segment by. This procedure uses an example table, called `example`,
and segments it by the `device_id` column. Every chunk that is more than seven
days old is then marked to be automatically compressed.

|time|device_id|cpu|disk_io|energy_consumption|
|-|-|-|-|-|
|8/22/2019 0:00|1|88.2|20|0.8|
|8/22/2019 0:05|2|300.5|30|0.9|

<Procedure>

### Enabling compression

1. At the `psql` prompt, alter the table:

```sql
ALTER TABLE example SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'device_id'
);
```

1. Add a compression policy to compress chunks that are older than seven days:

```sql
SELECT add_compression_policy('example', INTERVAL '7 days');
```

</Procedure>

For more information, see the API reference for
[`ALTER TABLE (compression)`][alter-table-compression] and
[`add_compression_policy`][add_compression_policy].

You can also set a compression policy through
the Timescale console. The compression tool automatically generates and
runs the compression commands for you. To learn more, see the
[Timescale documentation](/use-timescale/latest/services/service-explorer/#setting-a-compression-policy-from-timescale-cloud-console).

## View current compression policy

To view the compression policy that you've set:

```sql
SELECT * FROM timescaledb_information.jobs
WHERE proc_name='policy_compression';
```

For more information, see the API reference for [`timescaledb_information.jobs`][timescaledb_information-jobs].

## Remove compression policy

To remove a compression policy, use `remove_compression_policy`. For example, to
remove a compression policy for a hypertable named `cpu`:

```sql
SELECT remove_compression_policy('cpu');
```

For more information, see the API reference for
[`remove_compression_policy`][remove_compression_policy].

## Disable compression

You can disable compression entirely on individual hypertables. This command
works only if you don't currently have any compressed chunks:

```sql
ALTER TABLE <TABLE_NAME> SET (timescaledb.compress=false);
```

If your hypertable contains compressed chunks, you need to
[decompress each chunk][decompress-chunks] individually before you can disable
compression.

## Compression policy intervals

Data is usually compressed after an interval of time, and not
@@ -275,9 +195,5 @@ chunks. When you do this, the data that is being inserted is not compressed
immediately. It is stored alongside the chunk it has been inserted into, and
then a separate job merges it with the chunk and compresses it later on.
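As a sketch, assuming the `example` hypertable from earlier on this page, an
ordinary `INSERT` into a time range covered by a compressed chunk just works:

```sql
-- Hypothetical insert into a compressed region: the row is stored
-- uncompressed alongside the chunk and merged in later by a background job.
INSERT INTO example (time, device_id, cpu, disk_io, energy_consumption)
VALUES ('2019-08-22 00:10', 1, 90.1, 25, 0.85);
```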

[alter-table-compression]: /api/:currentVersion:/compression/alter_table_compression/
[add_compression_policy]: /api/:currentVersion:/compression/add_compression_policy/
[decompress-chunks]: /use-timescale/:currentVersion:/compression/decompress-chunks
[remove_compression_policy]: /api/:currentVersion:/compression/remove_compression_policy/
[timescaledb_information-jobs]: /api/:currentVersion:/informational-views/jobs/
[manual-compression]: /use-timescale/:currentVersion:/compression/manual-compression/
128 changes: 128 additions & 0 deletions use-timescale/compression/compression-design.md
@@ -0,0 +1,128 @@
---
title: Designing your database for compression
excerpt: Learn how to design your database for the most effective compression
products: [cloud, mst, self_hosted]
keywords: [compression, schema, tables]
---

# Designing for compression

Time-series data can be unique, in that it needs to handle both shallow and
wide queries, such as "What happened across the deployment in the last 10
minutes?", and deep and narrow queries, such as "What is the average CPU usage
for this server over the last 24 hours?" Time-series data usually has a very
high rate of inserts as well: hundreds of thousands of writes per second can
be normal for a time-series dataset. Additionally, time-series data is often
very granular, and is collected at a higher resolution than many other
datasets. This can result in terabytes of data accumulating over time.

All this means that if you need great compression rates, you probably need to
consider the design of your database, before you start ingesting data. This
section covers some of the things you need to take into consideration when
designing your database for maximum compression effectiveness.

## Compressing data

TimescaleDB is built on PostgreSQL, which is, by nature, a row-based database.
Because time-series data is accessed in order of time, when you enable
compression, TimescaleDB converts many wide rows of data into a single row,
called array form. Each field of that new, wide row stores an ordered set of
data comprising the entire column.

For example, if you had a table with data that looked a bit like this:

|Timestamp|Device ID|Status Code|Temperature|
|-|-|-|-|
|12:00:01|A|0|70.11|
|12:00:01|B|0|69.70|
|12:00:02|A|0|70.12|
|12:00:02|B|0|69.69|
|12:00:03|A|0|70.14|
|12:00:03|B|4|69.70|

You can convert this to a single row in array form, like this:

|Timestamp|Device ID|Status Code|Temperature|
|-|-|-|-|
|[12:00:01, 12:00:01, 12:00:02, 12:00:02, 12:00:03, 12:00:03]|[A, B, A, B, A, B]|[0, 0, 0, 0, 0, 4]|[70.11, 69.70, 70.12, 69.69, 70.14, 69.70]|

Even before you compress any data, this format immediately saves storage by
reducing the per-row overhead. PostgreSQL typically adds about 27 bytes of
overhead per row, so collapsing 1,000 rows into a single array row eliminates
roughly 27 KB of overhead on its own. Even without any compression, the schema
in this example is now smaller on disk than the previous format.

This format arranges the data so that similar data, such as timestamps, device
IDs, or temperature readings, is stored contiguously. This means that you can
then use type-specific compression algorithms to compress the data further, and
each array is separately compressed. For more information about the compression
methods used, see the [compression methods section][compression-methods].

When the data is in array format, you can perform queries that require only a
subset of the columns very quickly. For example, consider a query like this
one, which asks for the average temperature over the past day:

<CodeBlock canCopy={false} showLineNumbers={false} children={`
SELECT time_bucket('1 minute', timestamp) AS minute,
    AVG(temperature)
FROM table
WHERE timestamp > now() - INTERVAL '1 day'
GROUP BY minute
ORDER BY minute DESC;
`} />

The query engine can fetch and decompress only the timestamp and temperature
columns to efficiently compute and return these results.

Finally, TimescaleDB uses non-inline disk pages to store the compressed arrays.
This means that the in-row data points to a secondary disk page that stores the
compressed array, and the actual row in the main table becomes very small,
because it is now just pointers to the data. When data stored like this is
queried, only the compressed arrays for the required columns are read from disk,
further improving performance by reducing disk reads and writes.

## Querying compressed data

In the previous example, the database has no way of knowing which rows need to
be fetched and decompressed to resolve a query. For example, the database can't
easily determine which rows contain data from the past day, as the timestamp
itself is in a compressed column. You don't want to have to decompress all the
data in a chunk, or even an entire hypertable, to determine which rows are
required.

TimescaleDB automatically stores additional information in each row and adds
extra groupings to improve query performance. When you compress a hypertable,
either manually or through a compression policy, it helps to specify an
`ORDER BY` column.

`ORDER BY` columns specify how the rows that are part of a compressed batch are
ordered. For most time-series workloads, this is by timestamp, so if you don't
specify an `ORDER BY` column, TimescaleDB defaults to using the time column. You
can also specify additional dimensions, such as location.

For each `ORDER BY` column, TimescaleDB automatically creates additional columns
that store the minimum and maximum value of that column. This way, the query
planner can look at the range of timestamps in the compressed column, without
having to do any decompression, and determine whether the row could possibly
match the query.
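As a minimal sketch, assuming a hypertable named `metrics` with a `timestamp`
column, you can set the ordering explicitly when enabling compression:

```sql
-- Order rows within each compressed batch by timestamp, newest first.
-- TimescaleDB also stores per-batch min/max timestamp metadata.
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_orderby = 'timestamp DESC'
);
```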

When you compress your hypertable, you can also choose to specify a
`SEGMENT BY` column. This segments compressed rows by a specific column, so
that each compressed row corresponds to data about a single item, such as a
specific device ID. This further allows the query planner to determine whether
a row could match the query, without having to decompress the column first.
For example:

|Device ID|Timestamp|Status Code|Temperature|Min Timestamp|Max Timestamp|
|-|-|-|-|-|-|
|A|[12:00:01, 12:00:02, 12:00:03]|[0, 0, 0]|[70.11, 70.12, 70.14]|12:00:01|12:00:03|
|B|[12:00:01, 12:00:02, 12:00:03]|[0, 0, 4]|[69.70, 69.69, 69.70]|12:00:01|12:00:03|

With the data segmented in this way, a query for device A over a given time
interval becomes quite fast. The query planner can use an index to find the
rows for device A that contain at least some timestamps in the specified
interval, and even a sequential scan is quick, because evaluating device IDs
or timestamps does not require decompression. The query executor then
decompresses only the timestamp and temperature columns for the selected rows.
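As a sketch, these are settings that would produce a layout like the one
above, again assuming a hypertable named `metrics`:

```sql
-- Segment compressed rows by device, and order each batch by timestamp.
-- device_id stays uncompressed in each row, so it can be indexed and
-- filtered without any decompression.
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'timestamp'
);
```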

[compression-methods]: /use-timescale/:currentVersion:/compression/compression-methods/