chore: add first cookbook recipe. (#3550)

* chore: add first cookbook recipe. Co-authored-by: atovpeko <114177030+atovpeko@users.noreply.github.com> Co-authored-by: Jônatas Davi Paganini <jonatas@timescale.com>
timescale · Nov 28, 2024 · 0afb88f · 0afb88f
1 parent 20262e0
commit 0afb88f
Show file tree

Hide file tree

Showing 4 changed files with 236 additions and 0 deletions.
diff --git a/_partials/_cookbook-hypertables.md b/_partials/_cookbook-hypertables.md
@@ -0,0 +1,67 @@
+
+## Hypertable recipes
+
+This section contains recipes about hypertables.
+
+### Remove duplicates from an existing hypertable
+
+Looking to remove duplicates from an existing hypertable? One method is to run a `PARTITION BY` query to get
+`ROW_NUMBER()` and then the `ctid` of rows where `row_number>1`. You then delete these rows.  However,
+you need to check `tableoid` and `ctid`. This is because `ctid` is not unique and might be duplicated in
+different chunks. The following code example took 17 hours to process a table with 40 million rows:
+
+```sql
+CREATE OR REPLACE FUNCTION deduplicate_chunks(ht_name TEXT, partition_columns TEXT, bot_id INT DEFAULT NULL)
+    RETURNS TABLE
+            (
+                chunk_schema  name,
+                chunk_name    name,
+                deleted_count INT
+            )
+AS
+$$
+DECLARE
+    chunk         RECORD;
+    where_clause  TEXT := '';
+    deleted_count INT;
+BEGIN
+    IF bot_id IS NOT NULL THEN
+        where_clause := FORMAT('WHERE bot_id = %s', bot_id);
+    END IF;
+
+    FOR chunk IN
+        SELECT c.chunk_schema, c.chunk_name
+        FROM timescaledb_information.chunks c
+        WHERE c.hypertable_name = ht_name
+        LOOP
+            EXECUTE FORMAT('
+            WITH cte AS (
+                SELECT ctid,
+                       ROW_NUMBER() OVER (PARTITION BY %s ORDER BY %s ASC) AS row_num,
+                       *
+                FROM %I.%I
+                %s
+            )
+            DELETE FROM %I.%I
+            WHERE ctid IN (
+                SELECT ctid
+                FROM cte
+                WHERE row_num > 1
+            )
+            RETURNING 1;
+        ', partition_columns, partition_columns, chunk.chunk_schema, chunk.chunk_name, where_clause, chunk.chunk_schema,
+                           chunk.chunk_name)
+                INTO deleted_count;
+
+            RETURN QUERY SELECT chunk.chunk_schema, chunk.chunk_name, COALESCE(deleted_count, 0);
+        END LOOP;
+END
+$$ LANGUAGE plpgsql;
+
+
+SELECT *
+FROM deduplicate_chunks('nudge_events', 'bot_id, session_id, nudge_id, time', 2540);
+```
+
+Shoutout to **Mathias Ose** and **Christopher Piggott** for this recipe. 
+
diff --git a/_partials/_cookbook-iot.md b/_partials/_cookbook-iot.md
@@ -0,0 +1,131 @@
+## IoT recipes
+
+This section contains recipes for IoT issues:
+
+### Work with columnar IoT data
+
+Narrow and medium width tables are a great way to store IoT data. A lot of reasons are outlined in
+[Designing Your Database Schema: Wide vs. Narrow Postgres Tables][blog-wide-vs-narrow].
+
+One of the key advantages of narrow tables is that the schema does not have to change when you add new
+sensors. Another big advantage is that each sensor can sample at different rates and times. This helps
+support things like hysteresis, where new values are written infrequently unless the value changes by a
+certain amount.
+
+#### Narrow table format example
+
+Working with narrow table data structures presents a few challenges. In the IoT world one concern is that
+many data analysis approaches - including machine learning as well as more traditional data analysis -
+require that your data is resampled and synchronized to a common time basis. Fortunately, TimescaleDB provides
+you with [hyperfunctions][hyperfunctions] and other tools to help you work with this data. 
+
+An example of a narrow table format is:
+
+| ts                      | sensor_id | value |
+|-------------------------|-----------|-------|
+| 2024-10-31 11:17:30.000 | 1007      | 23.45 |
+
+Typically you would couple this with a sensor table:
+
+| sensor_id | sensor_name  | units                    |
+|-----------|--------------|--------------------------|
+| 1007      | temperature  | degreesC                 |
+| 1012      | heat_mode    | on/off                   |
+| 1013      | cooling_mode | on/off                   | 
+| 1041      | occupancy    | number of people in room |
+
+A medium table retains the generic structure but adds columns of various types so that you can
+use the same table to store float, int, bool, or even JSON (jsonb) data:
+
+| ts                      | sensor_id | d     | i    | b    | t    | j    |
+|-------------------------|-----------|-------|------|------|------|------|
+| 2024-10-31 11:17:30.000 | 1007      | 23.45 | null | null | null | null |
+| 2024-10-31 11:17:47.000 | 1012      | null  | null | TRUE | null | null |
+| 2024-10-31 11:18:01.000 | 1041      | null  | 4    | null | null | null |
+
+To remove all-null entries, use an optional constraint such as:
+
+```sql
+    CONSTRAINT at_least_one_not_null
+        CHECK ((d IS NOT NULL) OR (i IS NOT NULL) OR (b IS NOT NULL) OR (j IS NOT NULL) OR (t IS NOT NULL))
+```
+
+#### Get the last value of every sensor
+
+There are several ways to get the latest value of every sensor. The following examples use the
+structure defined in [Narrow table format example][setup-a-narrow-table-format] as a reference:
+
+- [SELECT DISTINCT ON][select-distinct-on]
+- [JOIN LATERAL][join-lateral]
+
+##### SELECT DISTINCT ON
+
+If you have a list of sensors, the easy way to get the latest value of every sensor is to use 
+`SELECT DISTINCT ON`:
+
+```sql
+WITH latest_data AS (
+    SELECT DISTINCT ON (sensor_id) ts, sensor_id, d
+    FROM iot_data
+    WHERE d is not null
+      AND ts > CURRENT_TIMESTAMP - INTERVAL '1 week'  -- important
+    ORDER BY sensor_id, ts DESC
+)
+SELECT
+    sensor_id, sensors.name, ts, d
+FROM latest_data
+LEFT OUTER JOIN sensors ON latest_data.sensor_id = sensors.id
+WHERE latest_data.d is not null
+ORDER BY sensor_id, ts; -- Optional, for displaying results ordered by sensor_id
+```
+
+The common table expression (CTE) used above is not strictly necessary. However, it is an elegant way to join 
+to the sensor list to get a sensor name in the output.  If this is not something you care about,
+you can leave it out:
+
+```sql
+SELECT DISTINCT ON (sensor_id) ts, sensor_id, d
+    FROM iot_data
+    WHERE d is not null
+      AND ts > CURRENT_TIMESTAMP - INTERVAL '1 week'  -- important
+    ORDER BY sensor_id, ts DESC
+```
+
+It is important to take care when down-selecting this data. In the previous examples,
+the time that the query would scan back was limited. However, if there any sensors that have either 
+not reported in a long time or in the worst case, never reported, this query devolves to a full table scan.  
+In a database with 1000+ sensors and 41 million rows, an unconstrained query takes over an hour.
+
+#### JOIN LATERAL
+
+An alternative to [SELECT DISTINCT ON][select-distinct-on] is to use a `JOIN LATERAL`. By selecting your entire 
+sensor list from the sensors table rather than pulling the IDs out using `SELECT DISTINCT`, `JOIN LATERAL` can offer 
+some improvements in performance:
+
+```sql
+SELECT sensor_list.id, latest_data.ts, latest_data.d
+FROM sensors sensor_list
+    -- Add a WHERE clause here to downselect the sensor list, if you wish
+LEFT JOIN LATERAL (
+    SELECT ts, d
+    FROM iot_data raw_data
+    WHERE sensor_id = sensor_list.id
+    ORDER BY ts DESC
+    LIMIT 1
+) latest_data ON true
+WHERE latest_data.d is not null -- only pulling out float values ("d" column) in this example
+  AND latest_data.ts > CURRENT_TIMESTAMP - interval '1 week' -- important
+ORDER BY sensor_list.id, latest_data.ts;
+```
+
+Limiting the time range is important, especially if you have a lot of data. Best practice is to use these 
+kinds of queries for dashboards and quick status checks. To query over a much larger time range, encapsulate 
+the previous example into a materialized query that refreshes infrequently, perhaps once a day.
+
+Shoutout to **Christopher Piggott** for this recipe.
+
+[blog-wide-vs-narrow]: https://www.timescale.com/learn/designing-your-database-schema-wide-vs-narrow-postgres-tables
+[setup-a-narrow-table-format]: /tutorials/:currentVersion:/cookbook/#narrow-table-format-example
+[select-distinct-on]: /tutorials/:currentVersion:/cookbook/#select-distinct-on
+[join-lateral]: /tutorials/:currentVersion:/cookbook/#join-lateral
+[hyperfunctions]: /use/:currentVersion:/hyperfunctions/
diff --git a/tutorials/cookbook.md b/tutorials/cookbook.md
@@ -0,0 +1,33 @@
+---
+title: Timescale cookbook
+excerpt: Code examples from the community that help you with loads of common conundrums.   
+product: [cloud, mst, self_hosted] 
+---
+
+import Hypertables from "versionContent/_partials/_cookbook-hypertables.mdx";
+import IOT from "versionContent/_partials/_cookbook-iot.mdx";
+
+
+# Timescale community cookbook
+
+This page contains suggestions from the [TimescaleDB Community](https://timescaledb.slack.com/) about how to resolve 
+common issues. Use these code examples as guidance to work with your own data. 
+
+
+## Prerequisites
+
+To follow the examples in this page, you need a:
+
+- [Target Timescale Cloud service][create-a-service]
+- [Connection to your service][connect-to-service]
+
+
+
+<Hypertables />
+
+<IOT />
+
+
+
+[create-a-service]: /getting-started/:currentVersion:/services/#create-a-timescale-cloud-service
+[connect-to-service]: /getting-started/:currentVersion:/run-queries-from-console/
diff --git a/tutorials/page-index/page-index.js b/tutorials/page-index/page-index.js
@@ -154,6 +154,11 @@ module.exports = [
         href: "simulate-iot-sensor-data",
         excerpt: "Simulate and query an IoT sensor dataset",
       },
+      {
+        title: "Timescale community cookbook",
+        href: "cookbook",
+        excerpt: "Code examples from the community that help you with loads of common conundrums.",
+      },
     ],
   },
 ];