
[BUG] Error message "concurrent query failed" #224

Closed
Steuf opened this issue Sep 6, 2022 · 10 comments

Comments

@Steuf

Steuf commented Sep 6, 2022

Describe the bug
I use chproxy with a Redis cache (KeyDB in fact, but it behaves the same); chproxy sits between ClickHouse and a dataviz tool (Grafana). I randomly get the error "concurrent query failed" (and this error gets cached :/).

The message is not very explicit; looking at the code, the error is raised by the cache code. What can cause this error and how can I fix it?

Expected behavior
No error; the request should not fail.

Environment information
chproxy: 1.17.0
Clickhouse: 22.8.3.13
Keydb: v6.3.1

Kubernetes cluster

@mga-chka
Collaborator

mga-chka commented Sep 6, 2022

Hi, this error means that you tried to run the same query at the same time (or ran the second query while the first one was still running). Instead of running the same query twice against ClickHouse, chproxy runs the first one and has the second one wait so it can reuse the cached result of the first query. But if the first one fails (for example because of a timeout), chproxy returns "concurrent query failed" for the second query, and it can keep returning the same answer if you run query 2 shortly after query 1 failed (this mechanism is useful to prevent users from re-running the same query too often while ClickHouse is under water).
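
The general pattern resembles Go's singleflight: concurrent callers sharing a key share one in-flight execution, and if that execution fails, every waiter receives the same error. A minimal, self-contained sketch of the idea (not chproxy's actual implementation; the cache key and the failing query function are placeholders):

```go
package main

import (
	"errors"
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// runQuery collapses concurrent identical queries: only the first caller
// actually executes the query; the others wait and share its result or error.
func runQuery(cacheKey string) (string, error) {
	v, err, _ := group.Do(cacheKey, func() (interface{}, error) {
		// Placeholder for the real ClickHouse call; here it always times out.
		return nil, errors.New("timeout for user exceeded: 30s")
	})
	if err != nil {
		// Every waiter sees the first query's failure, which is what
		// surfaces to clients as "concurrent query failed".
		return "", err
	}
	return v.(string), nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, err := runQuery("key-for-identical-query")
			fmt.Println(err)
		}()
	}
	wg.Wait()
}
```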

@gontarzpawel
Contributor

To follow up on the answer from @mga-chka, we can consider an enhancement to this mechanism: store the error message of the originally failed query. That would be more intuitive for users and would hide the technical detail of how we optimise the execution of concurrent queries.

I'm marking this issue as an enhancement and will try to work on it in the near future.
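
A rough sketch of what that could look like, purely as an illustration (the transaction type, field names, and functions below are hypothetical, not chproxy's code):

```go
package cache

import "fmt"

type txState int

const (
	txRunning txState = iota
	txCompleted
	txFailed
)

// transaction keeps the state of the first (original) query plus, on failure,
// its root-cause error, so concurrent callers can surface it.
type transaction struct {
	state    txState
	errorMsg string // error of the original query; empty if it succeeded
}

// markFailed records both the failure and its cause.
func (t *transaction) markFailed(err error) {
	t.state = txFailed
	t.errorMsg = err.Error()
}

// awaitError is what a concurrent caller would receive instead of a bare
// "concurrent query failed".
func (t *transaction) awaitError() error {
	if t.state == txFailed {
		return fmt.Errorf("concurrent query failed: %s", t.errorMsg)
	}
	return nil
}
```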

@PhantomPhreak

I also ran into this.
Besides the unclear error message, the result is cached for longer than the configured cache TTL.

For example, I set 60s for the cache TTL and 30s for the max_execution_time setting.
Then I run a long query and get the error:

Received exception from server:
[ Id: 1712D2E3DCBFD77D; User "..."(1) proxying as "..."(1) to "..."(6); RemoteAddr: "..."; LocalAddr: ".."; Duration: 30030411 μs]: timeout for user "..." exceeded: 30s; query: "..."

After 60s (the cache TTL) I still have 2 keys in redis:

redis> keys *
1) "118713aa48f49bd1fa292673691d416d-transaction"
2) "b804d509c7159fa1923eb66f7609da0c-transaction"
redis> get 118713aa48f49bd1fa292673691d416d-transaction                                                                                       
"2"
redis> get b804d509c7159fa1923eb66f7609da0c-transaction                                                                                       
"1"
redis> TTL 118713aa48f49bd1fa292673691d416d-transaction                                                                                       
(integer) 172719
redis> TTL b804d509c7159fa1923eb66f7609da0c-transaction
(integer) 172714
redis>

Running the query again results in the error: "concurrent query failed".

"2" means "unsuccessfull query" i guess, but somehow this result is not stored for the cache TTL (60s), but for 86400*2seconds, which is weird. I have 86400s max_execution_time set for other users in chproxy, but not for the user i'm running the test queries.

Also, caching the error is not always a good idea, because errors can happen for different reasons:
a) the max memory limit of the CH instance is reached, but the memory is released after a few minutes
b) execution speed is low due to a load spike, so CH decides the query won't fit into the max_execution_time limit (this behavior is controlled by the timeout_before_checking_execution_speed setting)
... and many others.

In this case, when a client wants to retry the query after a while, it will keep receiving the cached error with no chance of a successful retry, which is really bad.

IMO, error caching should be disabled, or at least there should be a setting to explicitly control this behavior.

@gontarzpawel
Contributor

gontarzpawel commented Sep 11, 2022

Hello @PhantomPhreak ,

Thanks for sharing. Keys with the -transaction suffix relate to the handling of concurrent queries. When a query arrives at a 'cacheable' chproxy, it first starts a so-called transaction. The purpose of this transaction is to prevent the thundering herd effect: concurrent requests for exactly the same query wait for the result computed by the first request.

When ClickHouse responds to the first query, the key with the -transaction suffix is updated accordingly:

  • if the query succeeded, as completed
  • if it failed, as failed

This information is kept in redis for 2 * grace_time or 2 * max_execution_time. It has no correlation with the cache expiration / TTL whatsoever.
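
For illustration only (this is not chproxy code, just the arithmetic implied by the rule above, and using max_execution_time here is an assumption): a per-user max_execution_time of 86400s would give a transaction-key TTL of 2 * 86400s = 172800s, which is in the range of the ~172719s observed in redis earlier in this thread.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical per-user setting taken from the earlier comment.
	maxExecutionTime := 86400 * time.Second
	// "2 * max_execution_time" rule quoted above.
	txTTL := 2 * maxExecutionTime
	fmt.Printf("%.0f seconds\n", txTTL.Seconds()) // prints: 172800 seconds
}
```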

> Also, caching the error is not always a good idea, because errors can happen for different reasons:
> a) the max memory limit of the CH instance is reached, but the memory is released after a few minutes
> b) execution speed is low due to a load spike, so CH decides the query won't fit into the max_execution_time limit (this behavior is controlled by the timeout_before_checking_execution_speed setting)
> ... and many others.

Very relevant point. We came to similar conclusions, except we thought it is still worth protecting ClickHouse from fatal (unrecoverable) queries. We introduced this idea in #193.

The concept is to not mark the transaction as failed if the response received from ClickHouse is recoverable.

var clickhouseRecoverableStatusCodes = map[int]struct{}{http.StatusServiceUnavailable: {}}

Following your feedback, it may indeed be relevant to add the HTTP status code representing a timeout (408). Regarding the memory limit, it's trickier to handle because it isn't carried in a distinct HTTP status code, but inside a 500.
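
Purely as a hypothetical sketch of that suggestion (not a committed change), adding 408 would be a one-line extension of the map above:

```go
package cache

import "net/http"

// Hypothetical extension: also treat request timeouts (408) as recoverable,
// so a timed-out query would not mark the transaction as failed.
var clickhouseRecoverableStatusCodes = map[int]struct{}{
	http.StatusServiceUnavailable: {}, // 503, as today
	http.StatusRequestTimeout:     {}, // 408, suggested above
}
```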

@PhantomPhreak

> it is still worth protecting ClickHouse from fatal (unrecoverable) queries.

This is a good point, but there is a big question of how to separate recoverable and unrecoverable queries. Also, if unrecoverable queries don't do any harm to ClickHouse itself, it may be safe to allow them and to control the query execution rate with max requests per minute / max concurrent queries on the chproxy side, or even with quotas and query complexity settings in ClickHouse itself. We're doing that quite successfully with no problems.

> Following your feedback, it may indeed be relevant to add the HTTP status code representing a timeout (408). Regarding the memory limit, it's trickier to handle because it isn't carried in a distinct HTTP status code, but inside a 500.

I have to say the scope of potential causes may be even wider. Right now, for example, a query is marked as failed if I just press Ctrl+C in clickhouse-cli during query execution, and if I then repeat the query "as is", it doesn't work, which is very annoying.

IMO, if some query failed and another query comes after it, those are not concurrent queries, so the "new" query should not fail because of the "concurrent query protection" logic; it simply isn't applicable in this case. These are sequential requests, not concurrent ones.

I may be missing some details of the implementation (sorry for any naive proposals), but this behavior seems logical to me: all concurrent queries (those waiting during the "grace_period") fail if the currently running query fails (with an error/timeout received from ClickHouse, not because the user cancelled the query), but subsequent queries (retries) are allowed to execute.

@Steuf
Author

Steuf commented Sep 19, 2022

Thanks for the update, guys. I have tested release 1.17.2; at first the logs were unreadable, and I think that was caused by HTTP compression being activated on ClickHouse:

(screenshot: garbled chproxy error log)

I deactivated HTTP compression on CH to get a "readable" error, but now I just get an "empty" error :/ :

(screenshot: chproxy error log with an empty error message)

Do you have an idea of what could cause this "empty" reporting on the concurrency error?

NB: nothing in the ClickHouse logs :/

@Steuf
Author

Steuf commented Sep 20, 2022

I think I have found the pattern that generates this behavior.

Sometimes in Grafana, a query is immediately cancelled by the client and re-fired.

(screenshot: Grafana cancelling a query and immediately re-running it)

In this case it always generates the "concurrent query failed" error on chproxy.

@mga-chka
Collaborator

Yep, this is the issue @PhantomPhreak was mentioning (with the Ctrl+C).
FYI, we are planning to modify this part (and at least let the chproxy admin customize this behavior via config.yml).
cf #230

@Steuf
Author

Steuf commented Sep 21, 2022

👍

@mga-chka
Collaborator

FYI, v1.18.0 should improve this issue since a failing call is now cached for only 500 msec.
cf #235

I'm closing this issue since the new behavior of chproxy (the root-cause error behind the "cached" error is shown, and the error is cached for only 500 msec to avoid the dogpile effect) should solve most cases.
Feel free to reopen it if needed.
