feat(reader): Add ClickHouse HTTPReader #819
Conversation
```python
query_settings: MutableMapping[str, str] = (
    {**settings} if settings is not None else {}
)

# XXX: mypy won't allow redefining ``settings`` as mutable, so delete
# the original variable to avoid accidentally referencing ``settings``
# instead of ``query_settings``.
del settings
```
This is one of my least favorite things about mypy.
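For context, a minimal standalone sketch of the pattern in question (the function name and the `max_threads` mutation are illustrative, not the actual reader code):

```python
from typing import Mapping, MutableMapping, Optional

def execute(settings: Optional[Mapping[str, str]] = None) -> None:
    # mypy rejects re-annotating an existing name with a new type, so
    # the following is an error rather than a narrowing:
    #   settings: MutableMapping[str, str] = {**settings}
    # Hence the workaround in the diff: copy under a new name, then
    # delete the original binding so it cannot be used by mistake.
    query_settings: MutableMapping[str, str] = (
        {**settings} if settings is not None else {}
    )
    del settings
    query_settings["max_threads"] = "4"  # mutating the copy is well-typed
```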
```python
query: ClickhouseQuery,
settings: Optional[Mapping[str, str]] = None,
query_id: Optional[str] = None,
with_totals: bool = False,  # NOTE: unnecessary with FORMAT JSON
```
This comment I guess is a bit misleading. `FORMAT JSON` doesn't implicitly add `WITH TOTALS` or anything, but it does already split out the totals row from the data rows so that the caller doesn't need to. When the AST is available everywhere, it'd be better to remove `with_totals` from this function signature and just use the query attribute to decide whether or not the totals row should exist for the native reader.
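To illustrate, a hand-written example of the response shape (not captured from a real server): with `FORMAT JSON`, ClickHouse returns the totals under a dedicated `totals` key, separate from the `data` rows.

```python
import json

# Hypothetical FORMAT JSON body for a ``WITH TOTALS`` query; the totals
# arrive under their own key, so the reader does not have to peel the
# totals row off the end of ``data`` itself.
body = """
{
    "meta": [{"name": "count()", "type": "UInt64"}],
    "data": [{"count()": "7"}, {"count()": "3"}],
    "totals": {"count()": "10"},
    "rows": 2
}
"""
result = json.loads(body)
data, totals = result["data"], result.get("totals")
```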
```python
class SimpleClickhouseQuery(ClickhouseQuery):
    def __init__(self, columns: Sequence[Tuple[str, str]]) -> None:
        self.__columns = columns

    def _format_query_impl(self) -> str:
        columns = ", ".join(f"{value} as {alias}" for alias, value in self.__columns)
        return f"SELECT {columns}"
```
I wanted to use the AST for this, but that would have meant either:

- relying on the other steps of query processing and executing the query against an actual dataset, coupling this test to a lot of unnecessary things
- refactoring the `AstClickhouseQuery` constructor so that it would be more like a dataclass, independent of the `Query` object entirely

Both of those are rabbit holes I have no interest in falling down right now. We should do the second one eventually.
To me it is fine not to use the AST to write a unit test at this level. Anyway, if you want to use it, you don't need to depend on a dataset to build a `Query` object and transform it into a ClickHouse query, since the `Query` object depends on a `DataSource` (which is the abstract class above `TableStorage` and can be created from scratch without all the heavy weight of the dataset).

Example: if you just want a `Query` object starting from the AST, you can do it this way:
https://github.com/getsentry/snuba/blob/master/tests/query/parser/test_query.py#L20-L32
(you have to use `ASTClickhouseQuery` instead of `DictClickhouseQuery`).
Generally some nits and questions. Do you already have a rollout plan?
```python
if settings is not None:
    assert "query_id" not in settings, "query_id cannot be passed as a setting"
```
Is this a new condition we did not have on native?
Sort of.

The settings passed here are supposed to be closer to "session settings" (`client_settings` for the `ClickhousePool` constructor) than they are to "query settings", but we don't use ClickHouse HTTP sessions here due to not wanting to deal with session timeouts, etc. There needed to be some way to allow constant (session-like) settings here to support things like `output_format_json_quote_64bit_integers` without leaking that into `raw_query`. This is more of a manifestation of `query_id` being passed as a setting, rather than some other way (e.g. a header). I don't know how the native driver would deal with `query_id` here, such as whether or not it considers that a valid setting, or if that's an implementation detail of the way settings are implemented in the HTTP interface. (I don't know what it would do if `query_id` was provided both in the `settings` mapping and as the `query_id` keyword argument to `execute`.)
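A rough sketch of the shape being described, with made-up setting names and values (this mirrors the URL construction in the snippet below rather than any public API): session-like defaults and per-query settings are merged into the query string of the HTTP interface, where `query_id` rides along as just another parameter.

```python
from urllib.parse import urlencode

default_settings = {"output_format_json_quote_64bit_integers": "0"}  # session-like
query_settings = {"max_threads": "4"}  # per-query
query_id = "00000000-0000-0000-0000-000000000000"  # illustrative value

params = {**default_settings, **query_settings}
if query_id is not None:
    params["query_id"] = query_id

print("/?" + urlencode(params))
# /?output_format_json_quote_64bit_integers=0&max_threads=4&query_id=...
```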
```python
response = self.__pool.urlopen(
    "POST",
    "/?" + urlencode({**self.__default_settings, **query_settings}),
    headers={"Connection": "keep-alive", "Accept-Encoding": "gzip,deflate"},
    body=query.format_sql("JSON"),
)
```
I think it may be useful to set up retries for network connection errors. Also what's the default timeout we would have here?
> I think it may be useful to set up retries for network connection errors.

Makes sense, will look into that.

> Also what's the default timeout we would have here?

Not sure. Will investigate tomorrow.
Context on how we handled retries and timeouts in the queries from Sentry to Snuba, with the reasoning around retries, in case it helps: getsentry/sentry@6d54866
> I think it may be useful to set up retries for network connection errors.

I started writing a comment about how, on second thought, I didn't think we should just blindly retry on errors without a reason to, but then I checked the distinction between `execute` and `execute_robust` in `ClickhousePool`, and I guess we retry there as well.

> Also what's the default timeout we would have here?

This basically falls through to `socket.getdefaulttimeout`:
```
(snuba) ted@veneno % docker run --rm -it getsentry/snuba python -c 'import socket; print(socket.getdefaulttimeout())'
+ '[' python = bash ']'
+ '[' p = - ']'
+ '[' python = api ']'
+ snuba python --help
+ exec gosu snuba python -c 'import socket; print(socket.getdefaulttimeout())'
None
```
I guess this should be able to be parameterized (the writer also has this issue.)
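A sketch of what making it explicit could look like (the host and values are illustrative; `urllib3.Timeout` and the pool kwargs are standard urllib3 API): rather than inheriting the process-wide socket default of `None` (block forever), the pool gets a timeout at construction time.

```python
import urllib3

# Separate budgets for establishing the connection vs. reading the
# response; the pool-level value applies to every request it makes.
pool = urllib3.HTTPConnectionPool(
    "localhost",
    8123,  # ClickHouse HTTP interface default port
    timeout=urllib3.Timeout(connect=1.0, read=30.0),
)
```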
About the retries, the difference between `execute_robust` and using the default `Retry` object of urllib3 is that `execute_robust` also waits by default 1 second (besides being more flexible on the type of errors).

I would say we should always retry (a given number of times) on errors when establishing the connection (`NetworkError`, `SocketTimeout`, etc.). That covers the transient network issues that we run into. In these cases you will get the exception from urllib3.

Whether to retry on `ServerError` is debatable. Here (this is only a read path, so idempotent) it is not dangerous, but it puts useless load on the DB if we do that on errors that cannot be transient. I am not sure we have a clean way to make this distinction on ClickHouse `ServerError`s, unless the ClickHouse HTTP interface used HTTP 5xx for actual server errors and HTTP 4xx for query errors. If it does not, we would have to selectively blacklist/whitelist error codes, I guess.

I'd be OK being conservative and not retrying at all for these cases. (Sentry retries them, by the way.)

I would certainly not retry on `ReadTimeout` errors.

How does that sound as a strategy?
Actually, the `Retry` object from urllib3 is already capable of distinguishing between connection errors, read errors, and redirect errors, so you do not need to implement the logic for network connection retries in your own code:
https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#module-urllib3.util.retry
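For instance, a minimal sketch of that configuration (the values are illustrative): only the connection phase is retried, so a query is never re-executed after the server has started running it.

```python
from urllib3.util.retry import Retry

# Retry only errors that occur before the request reaches the server;
# read (in-flight) and redirect failures are not retried.
retry = Retry(connect=3, read=0, redirect=0, backoff_factor=0.5)

# The policy can be attached to the pool so every urlopen() inherits it:
# pool = urllib3.HTTPConnectionPool("localhost", 8123, retries=retry)
```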
> About the retries, the difference between execute_robust and using the default Retry object of urllib3 is that execute_robust also waits by default 1 second (besides being more flexible on the type of errors).

Yeah, my original comment was unclear — I meant that I was surprised that `execute` also had some number of baked-in retries. (That doesn't really change anything about your comment, though.) It makes more sense why after this conversation.

> Whether to retry on ServerError that is debatable.

In my opinion, we should try and avoid implicit retries of application operations as much as possible, and allow (well, not so much allow as require) the caller to retry when necessary. I don't think a one-size-fits-all solution in the reader (or writer, or connection pool, etc.) is a good idea.

Like you mentioned, I do think it makes sense to always retry on transport-level errors (connection timeouts). I don't think of this so much as the responsibility of the `Reader` as of the transport itself (specifically the connection pool, native or HTTP), so it probably makes sense to just pass the instantiated pool to the constructor, rather than constructing it in the reader itself.

This is also similar to some of the conversations around #844 about connection sharing. It'd be unnecessarily wasteful to create a new connection pool for every reader if they are all communicating with the same cluster.
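A sketch of the constructor injection being proposed (the class and attribute names are hypothetical, not the actual snuba code): the transport owns the retry and timeout policy, and readers on the same cluster share one pool.

```python
import urllib3

class HTTPReader:
    """Hypothetical reader that receives its transport instead of building it."""

    def __init__(self, pool: urllib3.HTTPConnectionPool) -> None:
        self.__pool = pool  # retry/timeout policy decided by the caller

shared_pool = urllib3.HTTPConnectionPool(
    "localhost", 8123, retries=urllib3.util.Retry(connect=3, read=0)
)
reader_a = HTTPReader(shared_pool)  # both readers talk to the same
reader_b = HTTPReader(shared_pool)  # cluster through one pool
```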
I would be totally ok if we decided to retry only on transport errors. I am not a big fan of baking application specific retries in the Reader (or lower) layer either.
Will resurrect when we get to this. Anyway, the PR has to be largely rewritten, as the code structure is quite different now.
This adds a `Reader[ClickhouseQuery]` that uses the HTTP interface (and JSON format) to ClickHouse rather than the native TCP interface. This approach was based on #406, which itself was broken up into smaller changes such as #812.