Parallel replicas: Logical error: 'No pending events in epoll.'. #60140

Closed

Algunenano opened this issue Feb 19, 2024 · 6 comments · Fixed by #60430 or #61201
Labels
comp-parallel-replicas (Parallel reading from replicas), experimental feature (Bug in the feature that should not be used in production), fuzz (Problem found by one of the fuzzers)

Comments

@Algunenano
Member

https://s3.amazonaws.com/clickhouse-test-reports/60077/f06e15ce5f7c5610a4f083bec3f6c514ea332e48/ast_fuzzer__debug_.html

CREATE TABLE test2 (id UInt64, date Date)
ENGINE = MergeTree
ORDER BY id;


CREATE TABLE IF NOT EXISTS test2_d AS test2 ENGINE = Distributed(test_cluster_two_shard_three_replicas_localhost, currentDatabase(), test2, id);


SELECT count(toFixedString(toFixedString(toFixedString('-- 1 shard, 3 replicas', 22), 22), 22)) FROM test2_d SETTINGS allow_experimental_parallel_reading_from_replicas = 1023, max_parallel_replicas = 0, prefer_localhost_replica = 0, parallel_replicas_for_non_replicated_merge_tree = 1
[Mordor] 2024.02.19 14:03:42.631746 [ 164690 ] {8041e81d-5679-4451-a736-d7d9eea7860a} <Fatal> : Logical error: 'No pending events in epoll.'.
[Mordor] 2024.02.19 14:03:42.633292 [ 165712 ] <Fatal> BaseDaemon: ########################################
[Mordor] 2024.02.19 14:03:42.633501 [ 165712 ] <Fatal> BaseDaemon: (version 24.2.1.1, build id: 21B17C07C9809F300CC2CD699D2558DE155BFA1B, git hash: 04a695d05b7c2de60b2c45e4c5281ef32c94076e) (from thread 164690) (query_id: 8041e81d-5679-4451-a736-d7d9eea7860a) (query: SELECT count(toFixedString(toFixedString(toFixedString('-- 1 shard, 3 replicas', 22), 22), 22)) FROM test2_d SETTINGS allow_experimental_parallel_reading_from_replicas = 1023, max_parallel_replicas = 0, prefer_localhost_replica = 0, parallel_replicas_for_non_replicated_merge_tree = 1) Received signal Aborted (6)
[Mordor] 2024.02.19 14:03:42.633727 [ 165712 ] <Fatal> BaseDaemon: 
[Mordor] 2024.02.19 14:03:42.633885 [ 165712 ] <Fatal> BaseDaemon: Stack trace: 0x0000759bf1ef432c 0x0000759bf1ea36c8 0x0000759bf1e8b4b8 0x0000000014d02b9f 0x0000000014d02c75 0x0000000014d0327f 0x000000000adbae77 0x000000000adbd4c8 0x00000000205ac8c9 0x000000001ceb7092 0x000000001ceb5857 0x000000001ceb5705 0x000000001ceb5688 0x000000001ceb5648 0x000000001ceb554b 0x000000001ceb51fd
[Mordor] 2024.02.19 14:03:42.634083 [ 165712 ] <Fatal> BaseDaemon: 4. ? @ 0x0000759bf1ef432c
[Mordor] 2024.02.19 14:03:42.634231 [ 165712 ] <Fatal> BaseDaemon: 5. ? @ 0x0000759bf1ea36c8
[Mordor] 2024.02.19 14:03:42.634408 [ 165712 ] <Fatal> BaseDaemon: 6. ? @ 0x0000759bf1e8b4b8
[Mordor] 2024.02.19 14:03:42.631747 [ 164750 ] {8041e81d-5679-4451-a736-d7d9eea7860a} <Fatal> : Logical error: 'No pending events in epoll.'.
[Mordor] 2024.02.19 14:03:42.636150 [ 165713 ] <Fatal> BaseDaemon: ########################################
[Mordor] 2024.02.19 14:03:42.636349 [ 165713 ] <Fatal> BaseDaemon: (version 24.2.1.1, build id: 21B17C07C9809F300CC2CD699D2558DE155BFA1B, git hash: 04a695d05b7c2de60b2c45e4c5281ef32c94076e) (from thread 164750) (query_id: 8041e81d-5679-4451-a736-d7d9eea7860a) (query: SELECT count(toFixedString(toFixedString(toFixedString('-- 1 shard, 3 replicas', 22), 22), 22)) FROM test2_d SETTINGS allow_experimental_parallel_reading_from_replicas = 1023, max_parallel_replicas = 0, prefer_localhost_replica = 0, parallel_replicas_for_non_replicated_merge_tree = 1) Received signal Aborted (6)
[Mordor] 2024.02.19 14:03:42.636580 [ 165713 ] <Fatal> BaseDaemon: 
[Mordor] 2024.02.19 14:03:42.636741 [ 165713 ] <Fatal> BaseDaemon: Stack trace: 0x0000759bf1ef432c 0x0000759bf1ea36c8 0x0000759bf1e8b4b8 0x0000000014d02b9f 0x0000000014d02c75 0x0000000014d0327f 0x000000000adbae77 0x000000000adbd4c8 0x00000000205ac8c9 0x000000001ceb7092 0x000000001ceb5857 0x000000001ceb5705 0x000000001ceb5688 0x000000001ceb5648 0x000000001ceb554b 0x000000001ceb51fd
[Mordor] 2024.02.19 14:03:42.636946 [ 165713 ] <Fatal> BaseDaemon: 4. ? @ 0x0000759bf1ef432c
[Mordor] 2024.02.19 14:03:42.637091 [ 165713 ] <Fatal> BaseDaemon: 5. ? @ 0x0000759bf1ea36c8
[Mordor] 2024.02.19 14:03:42.637259 [ 165713 ] <Fatal> BaseDaemon: 6. ? @ 0x0000759bf1e8b4b8
[Mordor] 2024.02.19 14:03:42.703320 [ 165712 ] <Fatal> BaseDaemon: 7. /mnt/ch/ClickHouse/src/Common/Exception.cpp:0: DB::abortOnFailedAssertion(String const&) @ 0x0000000014d02b9f
[Mordor] 2024.02.19 14:03:42.711289 [ 165713 ] <Fatal> BaseDaemon: 7. /mnt/ch/ClickHouse/src/Common/Exception.cpp:0: DB::abortOnFailedAssertion(String const&) @ 0x0000000014d02b9f
[Mordor] 2024.02.19 14:03:42.771661 [ 165712 ] <Fatal> BaseDaemon: 8. /mnt/ch/ClickHouse/src/Common/Exception.cpp:63: DB::handle_error_code(String const&, int, bool, std::vector<void*, std::allocator<void*>> const&) @ 0x0000000014d02c75
[Mordor] 2024.02.19 14:03:42.783728 [ 165713 ] <Fatal> BaseDaemon: 8. /mnt/ch/ClickHouse/src/Common/Exception.cpp:63: DB::handle_error_code(String const&, int, bool, std::vector<void*, std::allocator<void*>> const&) @ 0x0000000014d02c75
[Mordor] 2024.02.19 14:03:42.831683 [ 165712 ] <Fatal> BaseDaemon: 9. /mnt/ch/ClickHouse/src/Common/Exception.cpp:100: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000014d0327f
[Mordor] 2024.02.19 14:03:42.845823 [ 165713 ] <Fatal> BaseDaemon: 9. /mnt/ch/ClickHouse/src/Common/Exception.cpp:100: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000014d0327f
[Mordor] 2024.02.19 14:03:42.895499 [ 165712 ] <Fatal> BaseDaemon: 10. /mnt/ch/ClickHouse/src/Common/Exception.h:90: DB::Exception::Exception(String&&, int, bool) @ 0x000000000adbae77
[Mordor] 2024.02.19 14:03:42.913256 [ 165713 ] <Fatal> BaseDaemon: 10. /mnt/ch/ClickHouse/src/Common/Exception.h:90: DB::Exception::Exception(String&&, int, bool) @ 0x000000000adbae77
[Mordor] 2024.02.19 14:03:42.959993 [ 165712 ] <Fatal> BaseDaemon: 11. /mnt/ch/ClickHouse/src/Common/Exception.h:109: DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x000000000adbd4c8
[Mordor] 2024.02.19 14:03:42.981665 [ 165713 ] <Fatal> BaseDaemon: 11. /mnt/ch/ClickHouse/src/Common/Exception.h:109: DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x000000000adbd4c8
[Mordor] 2024.02.19 14:03:43.101606 [ 165712 ] <Fatal> BaseDaemon: 12. /mnt/ch/ClickHouse/src/Client/HedgedConnections.cpp:313: DB::HedgedConnections::receivePacketUnlocked(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>) @ 0x00000000205ac8c9
[Mordor] 2024.02.19 14:03:43.128832 [ 165713 ] <Fatal> BaseDaemon: 12. /mnt/ch/ClickHouse/src/Client/HedgedConnections.cpp:313: DB::HedgedConnections::receivePacketUnlocked(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>) @ 0x00000000205ac8c9
[Mordor] 2024.02.19 14:03:43.162714 [ 165712 ] <Fatal> BaseDaemon: 13. /mnt/ch/ClickHouse/src/QueryPipeline/RemoteQueryExecutorReadContext.cpp:50: DB::RemoteQueryExecutorReadContext::Task::run(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>, std::function<void ()>) @ 0x000000001ceb7092
[Mordor] 2024.02.19 14:03:43.192633 [ 165713 ] <Fatal> BaseDaemon: 13. /mnt/ch/ClickHouse/src/QueryPipeline/RemoteQueryExecutorReadContext.cpp:50: DB::RemoteQueryExecutorReadContext::Task::run(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>, std::function<void ()>) @ 0x000000001ceb7092
[Mordor] 2024.02.19 14:03:43.210593 [ 165712 ] <Fatal> BaseDaemon: 14. /mnt/ch/ClickHouse/src/Common/AsyncTaskExecutor.cpp:88: DB::AsyncTaskExecutor::Routine::operator()(std::function<void ()>) @ 0x000000001ceb5857
[Mordor] 2024.02.19 14:03:43.242850 [ 165713 ] <Fatal> BaseDaemon: 14. /mnt/ch/ClickHouse/src/Common/AsyncTaskExecutor.cpp:88: DB::AsyncTaskExecutor::Routine::operator()(std::function<void ()>) @ 0x000000001ceb5857
[Mordor] 2024.02.19 14:03:43.258441 [ 165712 ] <Fatal> BaseDaemon: 15. /mnt/ch/ClickHouse/src/Common/Fiber.h:76: Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>::operator()(boost::context::fiber&&) @ 0x000000001ceb5705
[Mordor] 2024.02.19 14:03:43.293069 [ 165713 ] <Fatal> BaseDaemon: 15. /mnt/ch/ClickHouse/src/Common/Fiber.h:76: Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>::operator()(boost::context::fiber&&) @ 0x000000001ceb5705
[Mordor] 2024.02.19 14:03:43.307581 [ 165712 ] <Fatal> BaseDaemon: 16. /mnt/ch/ClickHouse/contrib/llvm-project/libcxx/include/__functional/invoke.h:394: ? @ 0x000000001ceb5688
[Mordor] 2024.02.19 14:03:43.344546 [ 165713 ] <Fatal> BaseDaemon: 16. /mnt/ch/ClickHouse/contrib/llvm-project/libcxx/include/__functional/invoke.h:394: ? @ 0x000000001ceb5688
[Mordor] 2024.02.19 14:03:43.357051 [ 165712 ] <Fatal> BaseDaemon: 17. /mnt/ch/ClickHouse/contrib/llvm-project/libcxx/include/__functional/invoke.h:531: ? @ 0x000000001ceb5648
[Mordor] 2024.02.19 14:03:43.395803 [ 165713 ] <Fatal> BaseDaemon: 17. /mnt/ch/ClickHouse/contrib/llvm-project/libcxx/include/__functional/invoke.h:531: ? @ 0x000000001ceb5648
[Mordor] 2024.02.19 14:03:43.404790 [ 165712 ] <Fatal> BaseDaemon: 18. /mnt/ch/ClickHouse/contrib/boost/boost/context/fiber_fcontext.hpp:157: boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>::run(void*) @ 0x000000001ceb554b
[Mordor] 2024.02.19 14:03:43.445969 [ 165713 ] <Fatal> BaseDaemon: 18. /mnt/ch/ClickHouse/contrib/boost/boost/context/fiber_fcontext.hpp:157: boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>::run(void*) @ 0x000000001ceb554b
[Mordor] 2024.02.19 14:03:43.453777 [ 165712 ] <Fatal> BaseDaemon: 19. /mnt/ch/ClickHouse/contrib/boost/boost/context/fiber_fcontext.hpp:97: void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>>(boost::context::detail::transfer_t) @ 0x000000001ceb51fd
[Mordor] 2024.02.19 14:03:43.453987 [ 165712 ] <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
[Mordor] 2024.02.19 14:03:43.454247 [ 165712 ] <Fatal> BaseDaemon: This ClickHouse version is not official and should be upgraded to the official build.
[Mordor] 2024.02.19 14:03:43.454697 [ 165712 ] <Fatal> BaseDaemon: Changed settings: max_threads = 16, use_uncompressed_cache = false, compile_aggregate_expressions = false, max_parallel_replicas = 0, allow_experimental_parallel_reading_from_replicas = 1023, parallel_replicas_for_non_replicated_merge_tree = true, insert_deduplicate = true, distributed_ddl_task_timeout = 120, joined_subquery_requires_alias = false, allow_experimental_analyzer = false, max_bytes_before_external_group_by = 1442450940, max_execution_time = 0., max_expanded_ast_elements = 50000, join_algorithm = 'hash', max_memory_usage = 50000000000, memory_usage_overcommit_max_wait_microseconds = 50000, log_query_threads = false, prefer_localhost_replica = false, max_partitions_per_insert_block = 100, optimize_monotonous_functions_in_order_by = false, enable_global_with_statement = false, optimize_rewrite_sum_if_to_count_if = false, allow_experimental_database_replicated = true, database_replicated_initial_query_timeout_sec = 120, database_replicated_enforce_synchronous_settings = true, database_replicated_always_detach_permanently = true, distributed_ddl_output_mode = 'none', insert_keeper_max_retries = 1000, insert_keeper_retry_initial_backoff_ms = 1, insert_keeper_retry_max_backoff_ms = 1, insert_keeper_fault_injection_probability = 0., input_format_null_as_default = false
[Mordor] 2024.02.19 14:03:43.497325 [ 165713 ] <Fatal> BaseDaemon: 19. /mnt/ch/ClickHouse/contrib/boost/boost/context/fiber_fcontext.hpp:97: void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>>(boost::context::detail::transfer_t) @ 0x000000001ceb51fd
[Mordor] 2024.02.19 14:03:43.497552 [ 165713 ] <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
[Mordor] 2024.02.19 14:03:43.497835 [ 165713 ] <Fatal> BaseDaemon: This ClickHouse version is not official and should be upgraded to the official build.
[Mordor] 2024.02.19 14:03:43.498276 [ 165713 ] <Fatal> BaseDaemon: Changed settings: max_threads = 16, use_uncompressed_cache = false, compile_aggregate_expressions = false, max_parallel_replicas = 0, allow_experimental_parallel_reading_from_replicas = 1023, parallel_replicas_for_non_replicated_merge_tree = true, insert_deduplicate = true, distributed_ddl_task_timeout = 120, joined_subquery_requires_alias = false, allow_experimental_analyzer = false, max_bytes_before_external_group_by = 1442450940, max_execution_time = 0., max_expanded_ast_elements = 50000, join_algorithm = 'hash', max_memory_usage = 50000000000, memory_usage_overcommit_max_wait_microseconds = 50000, log_query_threads = false, prefer_localhost_replica = false, max_partitions_per_insert_block = 100, optimize_monotonous_functions_in_order_by = false, enable_global_with_statement = false, optimize_rewrite_sum_if_to_count_if = false, allow_experimental_database_replicated = true, database_replicated_initial_query_timeout_sec = 120, database_replicated_enforce_synchronous_settings = true, database_replicated_always_detach_permanently = true, distributed_ddl_output_mode = 'none', insert_keeper_max_retries = 1000, insert_keeper_retry_initial_backoff_ms = 1, insert_keeper_retry_max_backoff_ms = 1, insert_keeper_fault_injection_probability = 0., input_format_null_as_default = false
Progress: 0.00 rows, 0.00 B (0.00 rows/s., 0.00 B/s.)
Exception on client:
Code: 32. DB::Exception: Attempt to read after eof: while receiving packet from clickhouse-01:49000. (ATTEMPT_TO_READ_AFTER_EOF)

Connecting to clickhouse-01:49000 as user default.
Code: 210. DB::NetException: Connection refused (clickhouse-01:49000). (NETWORK_ERROR)
@Algunenano added the fuzz (Problem found by one of the fuzzers), experimental feature (Bug in the feature that should not be used in production) and comp-parallel-replicas (Parallel reading from replicas) labels on Feb 19, 2024
@Algunenano
Member Author

A simpler one:

SELECT
    count(3) IGNORE NULLS,
    avg(id)
FROM test2_d
SETTINGS allow_experimental_parallel_reading_from_replicas = 0

Query id: faff8c07-4e72-4674-bc1f-b7eec7e82851

┌─count() IGNORE NULLS─┬─avg(id)─┐
│                    0 │     nan │
└──────────────────────┴─────────┘

1 row in set. Elapsed: 0.034 sec

Ok

SELECT
    count(3) IGNORE NULLS,
    avg(id)
FROM test2_d
SETTINGS allow_experimental_parallel_reading_from_replicas = 0, max_parallel_replicas = 0

KO

Something about the combination allow_experimental_parallel_reading_from_replicas = 0, max_parallel_replicas = 0 makes this fail hard for some reason.

@Avogar
Member

Avogar commented Feb 26, 2024

The exception happens because we create HedgedConnections with max_parallel_replicas = 0: HedgedConnectionsFactory just returns 0 connections, and when we try to read packets from such connections we fail with "no pending events in epoll" because no connection was ever established.
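
To see the failure mode in isolation, here is a minimal standalone C++ sketch (not ClickHouse code; the type and function names are made up for illustration): a reader whose factory produced zero connections never registers anything in its epoll, so the attempt to wait for a packet has to bail out with a logical error instead of blocking forever.

#include <stdexcept>
#include <sys/epoll.h>
#include <unistd.h>

// A reader over N remote connections: each connection fd is registered in epoll,
// and waitForPacket() blocks until one of them has data.
struct PacketReader
{
    int epoll_fd = epoll_create1(0);
    size_t registered_connections = 0;

    void addConnection(int fd)
    {
        epoll_event event{};
        event.events = EPOLLIN;
        event.data.fd = fd;
        if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event) == 0)
            ++registered_connections;
    }

    int waitForPacket()
    {
        // With max_parallel_replicas = 0 the factory produced no connections,
        // so nothing was ever registered and waiting would block forever;
        // the guard turns that into the logical error seen in the report.
        if (registered_connections == 0)
            throw std::logic_error("No pending events in epoll.");

        epoll_event event{};
        epoll_wait(epoll_fd, &event, 1, /*timeout=*/-1);
        return event.data.fd;
    }

    ~PacketReader() { close(epoll_fd); }
};

int main()
{
    PacketReader reader;        // 0 connections were established
    reader.waitForPacket();     // terminates with the logical error above
}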

BTW, AFAIK, parallel replicas should not use hedged connections at all and explicitly disables them:

if (settings.use_hedged_requests.value)
{
    if (settings.use_hedged_requests.changed)
    {
        LOG_WARNING(
            getLogger("executeQueryWithParallelReplicas"),
            "Setting 'use_hedged_requests' explicitly with enabled 'allow_experimental_parallel_reading_from_replicas' has no effect. "
            "Hedged connections are not used for parallel reading from replicas");
    }
    else
    {
        LOG_INFO(
            getLogger("executeQueryWithParallelReplicas"),
            "Disabling 'use_hedged_requests' in favor of 'allow_experimental_parallel_reading_from_replicas'. Hedged connections are "
            "not used for parallel reading from replicas");
    }

    /// disable hedged connections -> parallel replicas uses own logic to choose replicas
    new_context->setSetting("use_hedged_requests", Field{false});
}

@Avogar
Member

Avogar commented Feb 26, 2024

:) SELECT count(3) IGNORE NULLS, avg(id) FROM test2_d SETTINGS allow_experimental_parallel_reading_from_replicas = 0, max_parallel_replicas = 0, prefer_localhost_replica = 0, parallel_replicas_for_non_replicated_merge_tree = 0

This also fails, so the problem is in the hedged connections code and not in parallel replicas. We just need to do something about max_parallel_replicas = 0, maybe just throw an exception saying that this is an invalid value for the setting.

@Avogar
Member

Avogar commented Feb 26, 2024

Without hedged requests it fails with another exception:

avogar-dev :) SELECT count(3) IGNORE NULLS, avg(id) FROM test2_d SETTINGS allow_experimental_parallel_reading_from_replicas = 0, max_parallel_replicas = 0, prefer_localhost_replica = 0, parallel_replicas_for_non_replicated_merge_tree = 0, use_hedged_requests=0

SELECT
    count(3) IGNORE NULLS,
    avg(id)
FROM test2_d
SETTINGS allow_experimental_parallel_reading_from_replicas = 0, max_parallel_replicas = 0, prefer_localhost_replica = 0, parallel_replicas_for_non_replicated_merge_tree = 0, use_hedged_requests = 0

Query id: 40a1fecf-b909-4766-9d17-7945ab5e1522


Elapsed: 0.030 sec.

Received exception from server (version 24.2.1):
Code: 279. DB::Exception: Received from localhost:9000. DB::NetException. DB::NetException: All connection tries failed. Log:


: While executing Remote. (ALL_CONNECTION_TRIES_FAILED)

All connection tries failed here also means 0 connections. Let's just add a check that max_parallel_replicas is not 0; I'll create a PR.
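
A hypothetical sketch of such a check (not the actual PR; the function name and error type are made up for illustration):

#include <cstdint>
#include <stdexcept>

/// Reject max_parallel_replicas = 0 up front instead of letting the connection
/// factory silently hand back zero connections.
void checkMaxParallelReplicas(uint64_t max_parallel_replicas)
{
    if (max_parallel_replicas == 0)
        throw std::invalid_argument("Setting 'max_parallel_replicas' must be greater than 0");
}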

@Avogar
Member

Avogar commented Feb 26, 2024

Hmm, I see in some places we explicitly set max_parallel_replicas = 0:

query_context->setSetting("max_parallel_replicas", UInt64{0});

AFAIU it's done to disable parallel replicas, but shouldn't we actually set it to 1?
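
If that reasoning is right, the call above would presumably become the following (a sketch of the alternative being asked about, not a confirmed change):

query_context->setSetting("max_parallel_replicas", UInt64{1});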
