feat(reader): Add ClickHouse HTTPReader #819
Conversation
```python
query_settings: MutableMapping[str, str] = (
    {**settings} if settings is not None else {}
)

# XXX: mypy won't allow redefining ``settings`` as mutable, so delete
# the original variable to avoid accidentally referencing ``settings``
# instead of ``query_settings``.
del settings
```
This is one of my least favorite things about mypy.
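For context, a minimal standalone sketch of the pattern in question (the function name and the `max_threads` mutation are illustrative, not the actual reader code):

```python
from typing import Mapping, MutableMapping, Optional

def execute(settings: Optional[Mapping[str, str]] = None) -> None:
    # mypy rejects re-annotating an existing name with a new type, so
    # the following is an error rather than a narrowing:
    #   settings: MutableMapping[str, str] = {**settings}
    # Hence the workaround in the diff: copy under a new name, then
    # delete the original binding so it cannot be used by mistake.
    query_settings: MutableMapping[str, str] = (
        {**settings} if settings is not None else {}
    )
    del settings
    query_settings["max_threads"] = "4"  # mutating the copy is well-typed
```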
```python
query: ClickhouseQuery,
settings: Optional[Mapping[str, str]] = None,
query_id: Optional[str] = None,
with_totals: bool = False,  # NOTE: unnecessary with FORMAT JSON
```
This comment I guess is a bit misleading. `FORMAT JSON` doesn't implicitly add `WITH TOTALS` or anything, but it does already split out the totals row from the data rows so that the caller doesn't need to. When the AST is available everywhere, it'd be better to remove `with_totals` from this function signature and just use the query attribute to decide whether or not the totals row should exist for the native reader.
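To illustrate, a hand-written example of the response shape (not captured from a real server): with `FORMAT JSON`, ClickHouse returns the totals under a dedicated `totals` key, separate from the `data` rows.

```python
import json

# Hypothetical FORMAT JSON body for a ``WITH TOTALS`` query; the totals
# arrive under their own key, so the reader does not have to peel the
# totals row off the end of ``data`` itself.
body = """
{
    "meta": [{"name": "count()", "type": "UInt64"}],
    "data": [{"count()": "7"}, {"count()": "3"}],
    "totals": {"count()": "10"},
    "rows": 2
}
"""
result = json.loads(body)
data, totals = result["data"], result.get("totals")
```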
```python
class SimpleClickhouseQuery(ClickhouseQuery):
    def __init__(self, columns: Sequence[Tuple[str, str]]) -> None:
        self.__columns = columns

    def _format_query_impl(self) -> str:
        columns = ", ".join(f"{value} as {alias}" for alias, value in self.__columns)
        return f"SELECT {columns}"
```
I wanted to use the AST for this, but that would have meant either:

- relying on the other steps of query processing and executing the query against an actual dataset, coupling this test to a lot of unnecessary things
- refactoring the `AstClickhouseQuery` constructor so that it would be more like a dataclass, independent of the `Query` object entirely

Both of those are rabbit holes I have no interest in falling down right now. We should do the second one eventually.
To me it is fine not to use the AST to write a unit test at this level. Anyway, if you want to use it, you don't need to depend on a dataset to build a `Query` object and transform it into a ClickHouse query, since the `Query` object depends on a `DataSource` (which is the abstract class above `TableStorage` and can be created from scratch without all the heavy weight of the dataset).

Example: if you just want a `Query` object starting from the AST, you can do it this way:
https://github.com/getsentry/snuba/blob/master/tests/query/parser/test_query.py#L20-L32
(you have to use `ASTClickhouseQuery` instead of `DictClickhouseQuery`).
Generally some nits and questions. Do you already have a rollout plan?
```python
if settings is not None:
    assert "query_id" not in settings, "query_id cannot be passed as a setting"
```
Is this a new condition we did not have on native?
Sort of.

The settings passed here are supposed to be closer to "session settings" (`client_settings` for the `ClickhousePool` constructor) than they are to "query settings", but we don't use ClickHouse HTTP sessions here due to not wanting to deal with session timeouts, etc. There needed to be some way to allow constant (session-like) settings here to support things like `output_format_json_quote_64bit_integers` without leaking that into `raw_query`. This is more of a manifestation of `query_id` being passed as a setting, rather than some other way (e.g. a header). I don't know how the native driver would deal with `query_id` here, such as whether or not it considers that a valid setting, or if that's an implementation detail of the way settings are implemented in the HTTP interface. (I don't know what it would do if `query_id` was provided both in the `settings` mapping and as the `query_id` keyword argument to `execute`.)
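A rough sketch of the shape being described, with made-up setting names and values (this mirrors the URL construction in the snippet below rather than any public API): session-like defaults and per-query settings are merged into the query string of the HTTP interface, where `query_id` rides along as just another parameter.

```python
from urllib.parse import urlencode

default_settings = {"output_format_json_quote_64bit_integers": "0"}  # session-like
query_settings = {"max_threads": "4"}  # per-query
query_id = "00000000-0000-0000-0000-000000000000"  # illustrative value

params = {**default_settings, **query_settings}
if query_id is not None:
    params["query_id"] = query_id

print("/?" + urlencode(params))
# /?output_format_json_quote_64bit_integers=0&max_threads=4&query_id=...
```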
```python
response = self.__pool.urlopen(
    "POST",
    "/?" + urlencode({**self.__default_settings, **query_settings}),
    headers={"Connection": "keep-alive", "Accept-Encoding": "gzip,deflate"},
    body=query.format_sql("JSON"),
)
```
I think it may be useful to set up retries for network connection errors. Also what's the default timeout we would have here?
> I think it may be useful to set up retries for network connection errors.

Makes sense, will look into that.

> Also what's the default timeout we would have here?

Not sure. Will investigate tomorrow.
Context on how we handled retries and timeouts in the queries from Sentry to Snuba, with the reasoning around retries, in case it helps: getsentry/sentry@6d54866
> I think it may be useful to set up retries for network connection errors.

I started writing a comment about how, on second thought, I didn't think we should just blindly retry on errors without a reason to, but then I checked the distinction between `execute` and `execute_robust` in `ClickhousePool`, and I guess we retry there as well.

> Also what's the default timeout we would have here?

This basically falls through to `socket.getdefaulttimeout`:
```
(snuba) ted@veneno % docker run --rm -it getsentry/snuba python -c 'import socket; print(socket.getdefaulttimeout())'
+ '[' python = bash ']'
+ '[' p = - ']'
+ '[' python = api ']'
+ snuba python --help
+ exec gosu snuba python -c 'import socket; print(socket.getdefaulttimeout())'
None
```
I guess this should be able to be parameterized (the writer also has this issue.)
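A sketch of what making it explicit could look like (the host and values are illustrative; `urllib3.Timeout` and the pool kwargs are standard urllib3 API): rather than inheriting the process-wide socket default of `None` (block forever), the pool gets a timeout at construction time.

```python
import urllib3

# Separate budgets for establishing the connection vs. reading the
# response; the pool-level value applies to every request it makes.
pool = urllib3.HTTPConnectionPool(
    "localhost",
    8123,  # ClickHouse HTTP interface default port
    timeout=urllib3.Timeout(connect=1.0, read=30.0),
)
```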
About the retries, the difference between `execute_robust` and using the default `Retry` object of urllib3 is that `execute_robust` also waits by default 1 second (besides being more flexible on the type of errors).

I would say we should always retry (a given number of times) on errors when establishing the connection (`NetworkError`, `SocketTimeout`, etc.). That covers the transient network issues that we run into. In these cases you will get the exception from urllib3.

Whether to retry on `ServerError` is debatable. Here (this is only a read path, so idempotent) it is not dangerous, but it puts useless load on the DB if we do that on errors that cannot be transient. I am not sure we have a clean way to make this distinction on ClickHouse `ServerError`s, unless the ClickHouse HTTP interface used HTTP 5xx for actual server errors and HTTP 4xx for query errors. If it does not, we would have to selectively blacklist/whitelist error codes, I guess.

I'd be OK being conservative and not retrying at all for these cases. (Sentry retries them, by the way.)

I would certainly not retry on `ReadTimeout` errors.

How does that sound as a strategy?
Actually, the `Retry` object from urllib3 is already capable of distinguishing between connection errors, read errors, and redirect errors, so you do not need to implement the logic for network connection retries in your own code:
https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#module-urllib3.util.retry
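For instance, a minimal sketch of that configuration (the values are illustrative): only the connection phase is retried, so a query is never re-executed after the server has started running it.

```python
from urllib3.util.retry import Retry

# Retry only errors that occur before the request reaches the server;
# read (in-flight) and redirect failures are not retried.
retry = Retry(connect=3, read=0, redirect=0, backoff_factor=0.5)

# The policy can be attached to the pool so every urlopen() inherits it:
# pool = urllib3.HTTPConnectionPool("localhost", 8123, retries=retry)
```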
> About the retries, the difference between execute_robust and using the default Retry object of urllib3 is that execute_robust also waits by default 1 second (besides being more flexible on the type of errors).

Yeah, my original comment was unclear — I meant that I was surprised that `execute` also had some number of baked-in retries. (That doesn't really change anything about your comment, though.) It makes more sense why after this conversation.

> Whether to retry on ServerError that is debatable.

In my opinion, we should try and avoid implicit retries of application operations as much as possible, and allow (well, not so much allow as require) the caller to retry when necessary. I don't think a one-size-fits-all solution in the reader (or writer, or connection pool, etc.) is a good idea.

Like you mentioned, I do think it makes sense to always retry on transport-level errors (connection timeouts). I don't think of this so much as the responsibility of the `Reader` as of the transport itself (specifically the connection pool, native or HTTP), so it probably makes sense to just pass the instantiated pool to the constructor, rather than constructing it in the reader itself.

This is also similar to some of the conversations around #844 about connection sharing. It'd be unnecessarily wasteful to create a new connection pool for every reader if they are all communicating with the same cluster.
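A sketch of the constructor injection being proposed (the class and attribute names are hypothetical, not the actual snuba code): the transport owns the retry and timeout policy, and readers on the same cluster share one pool.

```python
import urllib3

class HTTPReader:
    """Hypothetical reader that receives its transport instead of building it."""

    def __init__(self, pool: urllib3.HTTPConnectionPool) -> None:
        self.__pool = pool  # retry/timeout policy decided by the caller

shared_pool = urllib3.HTTPConnectionPool(
    "localhost", 8123, retries=urllib3.util.Retry(connect=3, read=0)
)
reader_a = HTTPReader(shared_pool)  # both readers talk to the same
reader_b = HTTPReader(shared_pool)  # cluster through one pool
```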
I would be totally ok if we decided to retry only on transport errors. I am not a big fan of baking application specific retries in the Reader (or lower) layer either.
Will resurrect when we get to this. Anyway, the PR has to be largely rewritten, as the code structure is quite different now.
This adds a `Reader[ClickhouseQuery]` that uses the HTTP interface (and JSON format) to ClickHouse rather than the native TCP interface. This approach was based on #406, which itself was broken up into smaller changes such as #812.