
feat(reader): Add ClickHouse HTTPReader #819

Closed
wants to merge 9 commits

Conversation

tkaemming
Contributor

@tkaemming tkaemming commented Mar 5, 2020

This adds a Reader[ClickhouseQuery] that uses the HTTP interface (and JSON format) to ClickHouse rather than the native TCP interface. This approach was based on #406 which itself was broken up into smaller changes such as #812.

@tkaemming tkaemming requested a review from a team March 5, 2020 19:15
@tkaemming tkaemming mentioned this pull request Mar 5, 2020
Comment on lines +139 to +146
query_settings: MutableMapping[str, str] = (
{**settings} if settings is not None else {}
)

# XXX: mypy won't allow redefining ``settings`` as mutable, so delete
# the original variable to avoid accidentally referencing ``settings``
# instead of ``query_settings``.
del settings
Contributor Author

This is one of my least favorite things about mypy.
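For illustration, a minimal reproduction of the mypy limitation being worked around here (the function name is hypothetical; the settings handling mirrors the snippet above):

```python
from typing import Mapping, MutableMapping, Optional

def coerce_settings(
    settings: Optional[Mapping[str, str]] = None,
) -> MutableMapping[str, str]:
    # mypy will not allow ``settings`` to be re-annotated with a mutable
    # type, so the workaround is a fresh name plus ``del`` on the original
    # to avoid accidentally referencing the immutable mapping later.
    query_settings: MutableMapping[str, str] = (
        {**settings} if settings is not None else {}
    )
    del settings
    # Mutation is now allowed, and the caller's mapping is never modified.
    query_settings["readonly"] = "1"
    return query_settings
```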

query: ClickhouseQuery,
settings: Optional[Mapping[str, str]] = None,
query_id: Optional[str] = None,
with_totals: bool = False, # NOTE: unnecessary with FORMAT JSON
Contributor Author

This comment is a bit misleading, I guess. FORMAT JSON doesn't implicitly add WITH TOTALS or anything, but it does already split the totals row out from the data rows so that the caller doesn't need to. When the AST is available everywhere, it'd be better to remove with_totals from this function signature and just use the query attribute to decide whether the totals row should exist for the native reader.
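For context, a ClickHouse FORMAT JSON response already carries the totals row under its own top-level key, separate from the data rows. A sketch of that shape (the column and values here are made up):

```python
import json

# Illustrative body of a ``FORMAT JSON`` response for a ``WITH TOTALS``
# query; ClickHouse separates ``data`` and ``totals`` for the client.
response_body = json.dumps({
    "meta": [{"name": "count()", "type": "UInt64"}],
    "data": [{"count()": "10"}],
    "totals": {"count()": "10"},
    "rows": 1,
})

result = json.loads(response_body)
# No client-side splitting needed: totals is simply absent for queries
# without WITH TOTALS.
data, totals = result["data"], result.get("totals")
```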

Comment on lines +14 to +20
class SimpleClickhouseQuery(ClickhouseQuery):
def __init__(self, columns: Sequence[Tuple[str, str]]) -> None:
self.__columns = columns

def _format_query_impl(self) -> str:
columns = ", ".join(f"{value} as {alias}" for alias, value in self.__columns)
return f"SELECT {columns}"
Contributor Author

I wanted to use the AST for this but that would have meant either

  1. relying on the other steps of query processing and executing the query against an actual dataset, coupling this test to a lot of unnecessary things
  2. refactoring the AstClickhouseQuery constructor so that it would be more like a dataclass, independent of the Query object entirely

Both of those are rabbit holes I have no interest in falling down right now. We should do the second one eventually.

Contributor

To me it is fine not to use the AST to write a unit test at this level. Anyway, if you want to use it, you don't need to depend on a dataset to build a Query object and transform it into a ClickHouse query, since the Query object depends on a DataSource (which is the abstract class above TableStorage and can be created from scratch without all the heavy weight of the dataset).

Example:
If you just want a Query object starting from the AST you can do it this way:
https://github.com/getsentry/snuba/blob/master/tests/query/parser/test_query.py#L20-L32
(you have to use ASTClickhouseQuery instead of DictClickhouseQuery).

Contributor

@fpacifici fpacifici left a comment

Generally some nits and questions. Do you already have a rollout plan?

Comment on lines +120 to +122
if settings is not None:
assert "query_id" not in settings, "query_id cannot be passed as a setting"

Contributor

Is this a new condition we did not have on native?

Contributor Author

Sort of.

The settings passed here are supposed to be closer to "session settings" (client_settings for the ClickhousePool constructor) than to "query settings", but we don't use ClickHouse HTTP sessions here because we don't want to deal with session timeouts, etc. There needed to be some way to allow constant (session-like) settings here to support things like output_format_json_quote_64bit_integers without leaking that into raw_query. This is more a consequence of query_id being passed as a setting rather than some other way (e.g. a header). I don't know how the native driver deals with query_id, such as whether it considers that a valid setting, or if that's an implementation detail of the way settings are implemented in the HTTP interface. (I also don't know what would happen if query_id was provided in both the settings mapping and the query_id keyword argument to execute.)
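A sketch of the settings handling being described, stitched together from the snippets quoted in this thread (the function name is hypothetical):

```python
from typing import Mapping, MutableMapping, Optional

def build_query_settings(
    settings: Optional[Mapping[str, str]] = None,
    query_id: Optional[str] = None,
) -> MutableMapping[str, str]:
    # ``query_id`` must arrive via the keyword argument, never the settings
    # mapping, so the two sources cannot silently conflict.
    if settings is not None:
        assert "query_id" not in settings, "query_id cannot be passed as a setting"
    query_settings: MutableMapping[str, str] = (
        dict(settings) if settings is not None else {}
    )
    if query_id is not None:
        # In the HTTP interface, query_id travels as just another URL
        # parameter alongside the settings.
        query_settings["query_id"] = query_id
    return query_settings
```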

Comment on lines +155 to +160
response = self.__pool.urlopen(
"POST",
"/?" + urlencode({**self.__default_settings, **query_settings}),
headers={"Connection": "keep-alive", "Accept-Encoding": "gzip,deflate"},
body=query.format_sql("JSON"),
)
Contributor

I think it may be useful to set up retries for network connection errors. Also what's the default timeout we would have here?

Contributor Author

I think it may be useful to set up retries for network connection errors.

Makes sense, will look into that.

Also what's the default timeout we would have here?

Not sure. Will investigate tomorrow.

Contributor

Context on how we handled the retries and timeout in the queries from sentry to snuba with the reasoning around retries if it helps
getsentry/sentry@6d54866

Contributor Author

I think it may be useful to set up retries for network connection errors.

I started writing a comment about how on second thought, I didn't think we should just blindly retry on errors without a reason to, but then I checked the distinction between execute and execute_robust in ClickhousePool and I guess we retry there as well.

Also what's the default timeout we would have here?

This basically falls through to socket.getdefaulttimeout:

(snuba) ted@veneno % docker run --rm -it getsentry/snuba python -c 'import socket; print(socket.getdefaulttimeout())'
+ '[' python = bash ']'
+ '[' p = - ']'
+ '[' python = api ']'
+ snuba python --help
+ exec gosu snuba python -c 'import socket; print(socket.getdefaulttimeout())'
None

I guess this should be able to be parameterized (the writer also has this issue.)
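One way the timeout could be made explicit rather than falling through to the socket default (a sketch; host and port are hypothetical, and constructing the pool does not open a connection):

```python
import urllib3

# A default Timeout set at construction applies to every request made
# through the pool unless a per-request timeout overrides it.
pool = urllib3.HTTPConnectionPool(
    "clickhouse.example.com",
    port=8123,
    timeout=urllib3.Timeout(connect=1.0, read=30.0),
)
```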

Contributor

About the retries, the difference between execute_robust and using the default Retry object of urllib3 is that execute_robust also waits 1 second by default (besides being more flexible about the types of errors).
I would say we should always retry (a given number of times) on errors establishing the connection (NetworkError, SocketTimeout, etc.). That covers the transient network issues that we run into. In those cases you will get the exception from urllib3.
Whether to retry on ServerError is debatable. Here (this is only a read path, so idempotent) it is not dangerous, but it puts useless load on the DB if we do that for errors that cannot be transient. I am not sure we have a clean way to make this distinction on ClickHouse ServerErrors, unless the ClickHouse HTTP interface used HTTP 5xx for actual server errors and HTTP 4xx for query errors. If it does not, we would have to selectively blacklist/whitelist error codes, I guess.
I'd be ok being conservative and not retrying at all in these cases. (Sentry retries them, by the way.)
I would certainly not retry on ReadTimeout errors.
How does that sound as a strategy?

Contributor

Actually the Retry object from urllib3 is already capable of making the distinction between network errors, read errors, and redirect errors, so you do not need to implement the logic for network connection retries in your code:
https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#module-urllib3.util.retry
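A sketch of that distinction using urllib3's Retry parameters, matching the strategy discussed above (retry transient connection failures, never reads or redirects):

```python
from urllib3.util.retry import Retry

retry_policy = Retry(
    connect=3,           # retry errors raised before the request reaches the server
    read=0,              # never retry once the query may already be executing
    redirect=0,          # no redirects expected from the ClickHouse HTTP interface
    backoff_factor=0.5,  # exponential backoff between attempts
)
```

The policy would then be passed as the `retries` argument when constructing the connection pool or issuing the request.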

Contributor Author

About the retries, the difference between execute_robust and using the default Retry object of urllib3 is that execute_robust also waits by default 1 second (besides being more flexible on the type of errors).

Yeah, my original comment was unclear: I meant that I was surprised that execute also had some number of baked-in retries. (That doesn't really change anything about your comment, though.) After this conversation it makes more sense why.

Whether to retry on ServerError that is debatable.

In my opinion, we should try and avoid implicit retries of application operations as much as possible, and allow (well, not so much allow as require) the caller to retry when necessary. I don't think a one-size-fits-all solution in the reader (or writer, or connection pool, etc.) is a good idea.

Like you mentioned, I do think it makes sense to always retry on transport level errors (connection timeouts.) I don't think of this so much as the responsibility of the Reader as the transport itself (specifically the connection pool, native or HTTP) so it probably makes sense to just pass the instantiated pool to the constructor, rather than constructing it in the reader itself.

This also is similar to some of the conversations around #844 around connection sharing. It'd be unnecessarily wasteful to create a new connection pool for every reader if they are all communicating with the same cluster.

Contributor

I would be totally ok if we decided to retry only on transport errors. I am not a big fan of baking application specific retries in the Reader (or lower) layer either.

Comment on lines +14 to +20
class SimpleClickhouseQuery(ClickhouseQuery):
def __init__(self, columns: Sequence[Tuple[str, str]]) -> None:
self.__columns = columns

def _format_query_impl(self) -> str:
columns = ", ".join(f"{value} as {alias}" for alias, value in self.__columns)
return f"SELECT {columns}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it is fine not to use the AST to write a unit test at this level, anyway if you want to use it, you don't need to depend on dataset to build a Query object and transform it into a Clickhosue query since the Query object depends on a DataSource (which is the abstract class above TableStorage and can be created from scratch without all the heavy weight of the dataset.

Example:
If you just want a Query object starting from the AST you can do it this way
https://github.com/getsentry/snuba/blob/master/tests/query/parser/test_query.py#L20-L32
(you have to use ASTClickhouseQuery instead of DictClickhouseQuery).

snuba/clickhouse/http.py Show resolved Hide resolved
snuba/clickhouse/http.py Show resolved Hide resolved
@fpacifici
Contributor

Will resurrect when we get to this. Anyway, the PR has to be largely rewritten, as the code structure is quite different now.

@fpacifici fpacifici closed this Jul 7, 2022