[CrateDB] Add support for data acquisition and data export #148

Draft: wants to merge 7 commits into base: main

amotl
Member

@amotl amotl commented Jun 9, 2023

About

Migrating to InfluxDB version 2 would mean leaving SQL behind [1]. While the Flux query language is intriguing, and I will not reject bringing in support for InfluxDB 2 and its successor IOx, supporting an SQL-based time-series database makes sense to me, and this time maybe even a more capable one than InfluxDB, in terms of broader support for data types and SQL operations.

So, I think both CrateDB and TimescaleDB [2] are viable alternatives. They may even share parts of their adapter implementations, because both build upon PostgreSQL standards. This patch makes a start by adding support for CrateDB; let's have a look at TimescaleDB later.

Documentation

https://kotori--148.org.readthedocs.build/en/148/database/cratedb.html

Backlog

  • Make Grafana instant dashboard provisioning work.
  • Make the data export feature work.
  • Update documentation across the board.
  • Demonstrate LTTB downsampling on a secondary Grafana panel.
  • Investigate whether some of the pandas routines on the data export subsystem could be replaced/optimized by query statements using LOCF and NOCB.
  • Investigate how and where max_by and min_by could also be utilized in a sensible way.
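
For reference, the semantics of LOCF (last observation carried forward) and NOCB (next observation carried backward) mentioned in the backlog can be sketched in plain Python. This is only an illustration of what the two fill strategies do to gaps in a series; the function names are hypothetical, and it is neither the pandas code of the export subsystem nor a CrateDB query.

```python
def locf(values):
    """Fill gaps (None) with the most recent preceding observation."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def nocb(values):
    """Fill gaps (None) with the next following observation."""
    # NOCB is LOCF applied to the reversed series.
    return locf(values[::-1])[::-1]


series = [None, 1, None, None, 4, None]
print(locf(series))
print(nocb(series))
```

Leading gaps stay empty with LOCF and trailing gaps stay empty with NOCB, because there is no observation to carry in that direction.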

Footnotes

  1. https://docs.influxdata.com/influxdb/v2.7/query-data/flux/

  2. With the drawback that TimescaleDB also changed the license for parts of their code to non-FOSS, see https://github.com/timescale/timescaledb/blob/main/tsl/LICENSE-TIMESCALE.

@codecov

codecov bot commented Jun 9, 2023

Codecov Report

Attention: Patch coverage is 89.88764% with 18 lines in your changes missing coverage. Please review.

Project coverage is 78.86%. Comparing base (90e815a) to head (c30b6ad).

Current head c30b6ad differs from pull request most recent head a6d66d5

Please upload reports for the commit a6d66d5 to get more accurate results.

Files                                      Patch %   Lines
kotori/daq/storage/cratedb.py              88.33%    14 Missing ⚠️
kotori/daq/graphing/grafana/manager.py     83.33%    2 Missing ⚠️
kotori/daq/graphing/grafana/dashboard.py   92.30%    1 Missing ⚠️
kotori/daq/services/mig.py                 90.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #148      +/-   ##
==========================================
+ Coverage   78.59%   78.86%   +0.26%     
==========================================
  Files          55       58       +3     
  Lines        3014     3180     +166     
==========================================
+ Hits         2369     2508     +139     
- Misses        645      672      +27     
Flag        Coverage Δ
unittests   78.86% <89.88%> (+0.26%) ⬆️


@amotl force-pushed the amo/cratedb branch 9 times, most recently from f395199 to f774a90, on June 9, 2023 20:36
@amotl changed the title from "[CrateDB] Add basic data acquisition support for CrateDB" to "[CrateDB] Add data acquisition support for CrateDB" on Jun 10, 2023
@amotl
Member Author

amotl commented Jun 17, 2023

Grafana instant dashboards

About

8abe55d added baseline support for producing Grafana instant dashboards. 9c663b2 improves it by using proper time bucketing within the standard SQL statement template: it emulates GROUP BY DATE_BIN() with Grafana's $__timeGroupAlias macro, which casts $__interval values. This is a workaround until CrateDB's DATE_BIN() function understands Grafana's native interval values.

Reference documentation

$__timeGroup(dateColumn, $__interval) will be replaced by an expression usable in a GROUP BY clause.
$__timeGroupAlias(dateColumn, $__interval) will be replaced identical to $__timeGroup but with an added column alias.

-- https://grafana.com/docs/grafana/latest/datasources/postgres/#macros

Thanks

Thank you for the guidance, @seut and @hammerhead.

@amotl force-pushed the amo/cratedb branch 2 times, most recently from a609bbb to 9c663b2, on June 17, 2023 22:34
@amotl changed the title from "[CrateDB] Add data acquisition support for CrateDB" to "[CrateDB] Add support for data acquisition and data export" on Jun 18, 2023
@amotl force-pushed the amo/cratedb branch 2 times, most recently from fcd4379 to 2a2ec79, on June 21, 2023 19:45
Comment on lines +134 to +141
def record_from_dict(item):
    record = OrderedDict()
    record.update({"time": item["time"]})
    record.update(item["tags"])
    record.update(item["fields"])
    return record
Member Author

@amotl amotl Jun 21, 2023

I was looking for a function to merge those two objects, tags and fields, into a single record, but had not been able to find one. Only now did I discover, in @proddata's tutorial [1], that concat() also accepts objects as concat(object, object), effectively merging them.

I think the two reasons why I have not been able to discover this function were:

a) Both sets of functions operating on container data types [2][3] are on the "Scalar functions" page, where I did not expect to find them.
b) The search term "merge" does not occur in the documentation section of the concat(object, object) function, unlike in the documentation of the array_unique() function.

The concat(object, object) function combines two objects into a new object.
The array_unique(array, array, ...) function merges two arrays into one array with unique elements.

Please let me know if you think this could be improved on the CrateDB documentation.

Footnotes

  1. https://crate.io/resources/videos/json-object-data-in-cratedb

  2. https://crate.io/docs/crate/reference/en/latest/general/builtins/scalar-functions.html#array-functions

  3. https://crate.io/docs/crate/reference/en/latest/general/builtins/scalar-functions.html#object-functions

Member Author

@amotl amotl Jun 22, 2023

Ah. concat() does not completely do what I am aiming at here.

cr> select time, concat(tags, fields) from mqttkit_2_itest.foo_bar_sensors;
+---------------+------------------------------------------+
|          time | concat(tags, fields)                     |
+---------------+------------------------------------------+
| 1687469154383 | {"humidity": 83.1, "temperature": 42.84} |
+---------------+------------------------------------------+

It is nice that it will merge two objects, but now, I would like to destructure the top-level attributes of that single object into individual fields again.

Is there any chance to do this, or, if not, would submitting a corresponding feature request make sense?

I do not think we can break down the object into fields at the moment, but maybe using object_keys and then referencing the fields one by one could be sufficient for what you are trying to do?
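
The idea of referencing the fields one by one could be sketched on the client side as follows. This is a minimal sketch, assuming the attribute names of `tags` and `fields` are already known, for example collected beforehand with `object_keys`. The helper names `build_flatten_query`, `quote_ident`, and `quote_literal` are hypothetical and are not part of Kotori or CrateDB. The code only assembles an SQL statement that addresses each attribute by subscript, so that every attribute becomes an individual result column.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Quote an SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_flatten_query(table, tag_keys, field_keys):
    """Build a SELECT that exposes each object attribute as its own column."""
    columns = ["time"]
    for key in tag_keys:
        columns.append(f"tags[{quote_literal(key)}] AS {quote_ident(key)}")
    for key in field_keys:
        columns.append(f"fields[{quote_literal(key)}] AS {quote_ident(key)}")
    return f"SELECT {', '.join(columns)} FROM {table}"


print(build_flatten_query(
    "mqttkit_2_itest.foo_bar_sensors", [], ["humidity", "temperature"]))
```

The table name is passed through unquoted here, as in the console example above, so it must come from a trusted source.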

@@ -26,14 +26,16 @@ Infrastructure components

- Kotori_, a data acquisition, graphing and telemetry toolkit
- Grafana_, a graph and dashboard builder for visualizing time series metrics
- InfluxDB_, a time-series database
- CrateDB_, a time-series database ¹
Member Author

@amotl amotl Jun 21, 2023

Does anyone have a better suggestion for describing CrateDB in a single, concise sentence?

Suggested change
- CrateDB_, a time-series database ¹
- CrateDB_, a time-series database with document features and more ¹

doc/source/handbook/usage/cratedb.rst (outdated, resolved)
doc/source/setup/linux-debian.rst (outdated, resolved)
doc/source/setup/macos.rst (outdated, resolved)
Comment on lines +166 to +174
def write(self, meta, data):
    """
    Format ingress data chunk and store it into database table.

    TODO: This dearly needs efficiency improvements. Currently, there is no
          batching, just single records/inserts. That yields bad performance.
    """
Member Author

This should not be forgotten. When wrapping up all review comments, put this TODO item into the backlog for a subsequent iteration.

Member Author

@amotl amotl Jun 22, 2023

Most probably, we should look at using the improvement from crate/crate-python#553 here, if that would be applicable.
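
Independently of that driver improvement, batching could be approached on the application side with the `executemany()` method that the Python DB API 2.0 defines for cursors. The following is only a sketch under assumptions: the helper names `chunked` and `write_batched` and the `batch_size` default are hypothetical, the `?` placeholders assume the driver uses the "qmark" parameter style, and the column layout follows the `time`, `tags`, `fields` table definition used in this patch. `RecordingCursor` is a stand-in for demonstration, not a real database cursor.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_batched(cursor, table, records, batch_size=500):
    """Insert records in batches instead of one statement per record."""
    sql = f"INSERT INTO {table} (time, tags, fields) VALUES (?, ?, ?)"
    total = 0
    for batch in chunked(records, batch_size):
        params = [(r["time"], r["tags"], r["fields"]) for r in batch]
        cursor.executemany(sql, params)
        total += len(params)
    return total


class RecordingCursor:
    """Hypothetical stand-in that records calls instead of talking to a database."""

    def __init__(self):
        self.calls = []

    def executemany(self, sql, seq_of_parameters):
        self.calls.append((sql, list(seq_of_parameters)))


cursor = RecordingCursor()
readings = [{"time": i, "tags": {}, "fields": {"value": i}} for i in range(5)]
written = write_batched(cursor, "foo_bar_sensors", readings, batch_size=2)
print(written, [len(params) for _, params in cursor.calls])
```

Whether this yields the desired throughput would still need to be measured against a real CrateDB instance.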

Comment on lines +249 to +272
class TimezoneAwareCrateJsonEncoder(json.JSONEncoder):
    epoch_aware = datetime(1970, 1, 1, tzinfo=pytz.UTC)
    epoch_naive = datetime(1970, 1, 1)

    def default(self, o):
        if isinstance(o, Decimal):
            return str(o)
        if isinstance(o, datetime):
            if o.tzinfo:
                delta = o - self.epoch_aware
            else:
                delta = o - self.epoch_naive
            return int(delta.microseconds / 1000.0 +
                       (delta.seconds + delta.days * 24 * 3600) * 1000.0)
        if isinstance(o, date):
            return calendar.timegm(o.timetuple()) * 1000
        return json.JSONEncoder.default(self, o)


# Monkey patch.
# TODO: Submit upstream.
crate.client.http.CrateJsonEncoder = TimezoneAwareCrateJsonEncoder
Member Author

@amotl amotl Jun 21, 2023

This monkeypatch should be submitted upstream to the Python driver, i.e. the crate Python package. It may resolve https://github.com/crate/crate-python/issues/361.

Effectively, it is only this change:

if o.tzinfo:
    delta = o - self.epoch_aware
else:
    delta = o - self.epoch_naive
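
The reason for keeping two epoch values is that Python refuses to subtract a naive datetime from an aware one and raises a TypeError. The effect of the change can be shown in isolation with a small standalone sketch. It uses the standard library's `timezone.utc` instead of `pytz`, and the function name `to_epoch_millis` is hypothetical; it is not the encoder from the patch, only the same conversion extracted for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH_AWARE = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_NAIVE = datetime(1970, 1, 1)


def to_epoch_millis(value):
    """Convert a datetime to milliseconds since the Unix epoch.

    Aware datetimes are compared against an aware epoch, naive ones
    against a naive epoch, which treats naive values as UTC.
    """
    epoch = EPOCH_AWARE if value.tzinfo else EPOCH_NAIVE
    return (value - epoch) // timedelta(milliseconds=1)


plus_one = timezone(timedelta(hours=1))
print(to_epoch_millis(datetime(1970, 1, 1, 1, 0, tzinfo=plus_one)))
print(to_epoch_millis(datetime(1970, 1, 2)))
```

The first call refers to the epoch itself, expressed in a UTC+01:00 offset; the second is a naive value exactly one day after the epoch.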

Comment on lines +27 to +29
# TODO: Add querying by tags.
tags = {}
# tags = CrateDBAdapter.get_tags(data)
Member Author

@amotl amotl Jun 21, 2023

Do not forget to implement querying by tags. It has not been implemented for InfluxDB either, but that does not mean it should stay like this.

Comment on lines +1 to +10
{
  "alias": "{{ alias }}",
  "format": "table",
  "resultFormat": "time_series",
  "tags": {{ tags }},
  "groupByTags": [],
  "measurement": "{{ measurement }}",
  "rawQuery": true,
  "rawSql": "SELECT $__timeGroupAlias(time, $__interval), MEAN(fields['{{ name }}']) AS {{ alias }} FROM {{ table }} WHERE $__timeFilter(time) GROUP BY time ORDER BY time"
}
Member Author

@amotl amotl Jun 21, 2023

@hammerhead helped me to discover the right solution for the SQL query here, and he also told me that the DATE_BIN() function, which @seut recommended to use, does not yet understand Grafana's interval values.

This issue is already being tracked at crate/crate#14211. After it has been resolved, adjust the SQL statement template here, to use DATE_BIN() instead of $__timeGroupAlias(time, $__interval).

Member Author

@amotl amotl Feb 22, 2024

The DATE_BIN() function, which @seut recommended to use, does not yet understand Grafana's interval values.

It looks like this may change with CrateDB 5.7. Thank you, @matriv!

"groupByTags": [],
"measurement": "{{ measurement }}",
"rawQuery": true,
"rawSql": "SELECT $__timeGroupAlias(time, $__interval), MEAN(fields['{{ name }}']) AS {{ alias }} FROM {{ table }} WHERE $__timeFilter(time) GROUP BY time ORDER BY time"
Member Author

@amotl amotl Jun 21, 2023

In a subsequent iteration, we may also want to demonstrate advanced downsampling on a secondary panel, using the Largest-Triangle-Three-Buckets (LTTB) algorithm, as presented by @hlcianfagna [1], when possible.

Footnotes

  1. https://community.crate.io/t/advanced-downsampling-with-the-lttb-algorithm/1287
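
To make the idea concrete, here is a plain-Python sketch of the LTTB algorithm. It is an illustration of the technique only and not the approach from the referenced post; the function name `lttb` and its signature are hypothetical. The algorithm always keeps the first and last point, splits the remaining points into equally sized buckets, and from each bucket picks the point that forms the largest triangle with the previously selected point and the average of the next bucket.

```python
def lttb(points, threshold):
    """Downsample a list of (x, y) tuples to `threshold` points using LTTB."""
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)

    # The first and last points are always kept; the rest is bucketed.
    bucket_size = (n - 2) / (threshold - 2)
    sampled = [points[0]]
    selected = 0

    for i in range(threshold - 2):
        # Average point of the next bucket (the last point for the final bucket).
        next_start = int((i + 1) * bucket_size) + 1
        next_end = min(int((i + 2) * bucket_size) + 1, n)
        next_bucket = points[next_start:next_end]
        avg_x = sum(p[0] for p in next_bucket) / len(next_bucket)
        avg_y = sum(p[1] for p in next_bucket) / len(next_bucket)

        # Pick the point of the current bucket spanning the largest triangle.
        start = int(i * bucket_size) + 1
        end = int((i + 1) * bucket_size) + 1
        ax, ay = points[selected]
        best_area, best_index = -1.0, start
        for j in range(start, end):
            bx, by = points[j]
            area = abs((ax - avg_x) * (by - ay) - (ax - bx) * (avg_y - ay)) / 2
            if area > best_area:
                best_area, best_index = area, j
        sampled.append(points[best_index])
        selected = best_index

    sampled.append(points[-1])
    return sampled


# A flat series with a single spike: the spike survives downsampling.
series = [(x, 0.0) for x in range(50)]
series[20] = (20, 100.0)
print(lttb(series, 5))
```

Unlike plain averaging per time bucket, this keeps visually significant outliers, which is why it is attractive for a secondary dashboard panel.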

- Extremely fast distributed query execution.
- Auto-partitioning, auto-sharding, and auto-replication.
- Self-healing and auto-rebalancing.
- User-defined functions (UDFs) can be used to extend the functionality of CrateDB.
Member Author

User-defined functions are not mentioned in the canonical CrateDB README at all. They should be added.

-- https://github.com/crate/crate/blob/master/README.rst

Comment on lines +245 to +278
.. code-block:: sql

    -- An SQL DDL statement defining a custom schema for holding sensor data.
    CREATE TABLE iot_data (
      timestamp TIMESTAMP WITH TIME ZONE,
      sensor_data OBJECT (DYNAMIC) AS (
        temperature FLOAT,
        humidity FLOAT,
        location OBJECT (DYNAMIC) AS (
          latitude DOUBLE PRECISION, longitude DOUBLE PRECISION
        )
      )
    );
Member Author

@amotl amotl Jun 22, 2023

That's the only SQL DDL statement within the "query examples" section. Adding a few more, including use of other CrateDB special data types, may be sensible. Do you have any suggestions in your toolboxes?

Any chance of using GEO_POINT maybe?

Comment on lines 83 to 87
CREATE TABLE IF NOT EXISTS {tablename} (
    time TIMESTAMP WITH TIME ZONE DEFAULT NOW() NOT NULL,
    tags OBJECT(DYNAMIC),
    fields OBJECT(DYNAMIC)
);
Member Author

@amotl amotl Jun 22, 2023

Does it make sense to use such a DDL from the very beginning? Partitioning by year seems to be a generally useful approach and a reasonable default, @hlcianfagna.

Suggested change
CREATE TABLE IF NOT EXISTS {tablename} (
    time TIMESTAMP WITH TIME ZONE DEFAULT NOW() NOT NULL,
    tags OBJECT(DYNAMIC),
    fields OBJECT(DYNAMIC)
);
CREATE TABLE IF NOT EXISTS {tablename} (
    time TIMESTAMP WITH TIME ZONE DEFAULT NOW() NOT NULL,
    tags OBJECT(DYNAMIC),
    fields OBJECT(DYNAMIC),
    year TIMESTAMP GENERATED ALWAYS AS DATE_TRUNC('year', time)
) PARTITIONED BY (year);

I see you went for this. I think it is a good idea, yes.

Comment on lines +68 to +70
self.db_client = client.connect(
    self.host_uri, username=self.username, password=self.password, pool_size=20,
)
Member Author

Also verify and demonstrate connecting to CrateDB Cloud, maybe in a subsequent iteration.

Comment on lines +29 to +30
into tables. Tables are grouped into schemas, which is equivalent to the concept of hosting
multiple databases on the same server instance.
Member Author

@proddata mentioned that, at least from a PostgreSQL perspective, databases are more like catalogs. Thanks.

Member Author

@amotl amotl left a comment

A few suggestions by @hammerhead. Thanks.

@@ -85,8 +85,8 @@ We are standing on the shoulders of giants:
- Leverage the open infrastructure based on Twisted_ - an event-driven networking engine -
to implement custom software components.
- Listen and talk M2M_ using the *MQ Telemetry Transport* connectivity protocol and software bus (MQTT_).
- Store data points into InfluxDB_, a leading open source time series database suitable
for realtime analytics and sensor data storage.
- Store data points into CrateDB_, InfluxDB_, or other open source time series databases
Member Author

Suggested change
- Store data points into CrateDB_, InfluxDB_, or other open source time series databases
- Store data points into CrateDB_, InfluxDB_, or other open source time-series databases

@@ -26,14 +26,16 @@ Infrastructure components

- Kotori_, a data acquisition, graphing and telemetry toolkit
- Grafana_, a graph and dashboard builder for visualizing time series metrics
Member Author

Suggested change
- Grafana_, a graph and dashboard builder for visualizing time series metrics
- Grafana_, a graph and dashboard builder for visualizing time-series metrics


| ¹ MongoDB is only required when doing CSV data acquisition, so it is completely
| ¹ Kotori can either use CrateDB or InfluxDB as timeseries database.
Member Author

Suggested change
| ¹ Kotori can either use CrateDB or InfluxDB as timeseries database.
| ¹ Kotori can either use CrateDB or InfluxDB as time-series database.

Purpose
=======

Kotori uses CrateDB to store **timeseries-data** of data acquisition channels.
Member Author

Suggested change
Kotori uses CrateDB to store **timeseries-data** of data acquisition channels.
Kotori uses CrateDB to store **time-series data** of data acquisition channels.

and based on Lucene.

<small>
<strong>Categories:</strong> timeseries-database, multi-modal database
Member Author

Suggested change
<strong>Categories:</strong> timeseries-database, multi-modal database
<strong>Categories:</strong> time-series database, multi-modal database

CrateDB
=======

This example uses CrateDB as timeseries-database.
Member Author

Suggested change
This example uses CrateDB as timeseries-database.
This example uses CrateDB as time-series database.

elif "influxdb" in self.config:
    self.dbtype = TimeseriesDatabaseType.INFLUXDB1
else:
    raise ValueError("Timeseries database type not defined")
Member Author

Suggested change
raise ValueError("Timeseries database type not defined")
raise ValueError("Time-series database type not defined")

Submit single reading in JSON format to HTTP API and proof
it can be retrieved back from the HTTP API in different formats.

This uses CrateDB as timeseries database.
Member Author

Suggested change
This uses CrateDB as timeseries database.
This uses CrateDB as time-series database.

2 participants