pr #1337 swarms fail investigation #1509

@alsugiliazova

Swarms node_failure Network Failure Test: Behavior

Regression test /swarms/feature/node failure/network failure fails in PR #1337.

What this test does

The scenario is implemented in clickhouse-regression/swarms/tests/node_failure.py (network_failure).

It validates failure behavior for a long-running distributed read from DataLakeCatalog
when one swarm node loses network connectivity in the middle of query execution.

High-level flow:

  1. Create and populate an Iceberg table.
  2. Verify swarm cluster is functional.
  3. Create a DataLakeCatalog database.
  4. Start a long query on initiator (clickhouse1), reading via:
    • object_storage_cluster='static_swarm_cluster' with 2 swarm nodes
  5. In parallel, disconnect a random swarm node from Docker network, wait, reconnect, then restart node.
  6. Validate that the long query fails in an expected way.
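
Step 5 can be sketched roughly as below. This is not the code from node_failure.py; the network name, the container names, and the outage duration are placeholders, and the command runner is injected so the sequence can be checked without Docker. Only the standard `docker network disconnect`, `docker network connect`, and `docker restart` commands are assumed.

```python
import random
import time
from typing import Callable, List, Optional, Sequence

Runner = Callable[[List[str]], object]


def network_fault_commands(network: str, container: str) -> List[List[str]]:
    """Docker CLI commands for: disconnect, reconnect, restart (in that order)."""
    return [
        ["docker", "network", "disconnect", network, container],
        ["docker", "network", "connect", network, container],
        ["docker", "restart", container],
    ]


def inject_network_failure(
    run: Runner,
    swarm_nodes: Sequence[str],
    network: str,
    outage_seconds: float = 5.0,  # hypothetical default, not taken from the test
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Cut one randomly chosen swarm node off the network, then bring it back."""
    rng = rng or random.Random()
    node = rng.choice(list(swarm_nodes))
    disconnect, reconnect, restart = network_fault_commands(network, node)
    run(disconnect)        # the node loses connectivity while the query runs
    sleep(outage_seconds)  # keep the outage open long enough to hit the query
    run(reconnect)
    run(restart)
    return node
```

Against a real environment the runner could be `lambda cmd: subprocess.run(cmd, check=True)`, called from a thread that runs in parallel with the long query.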

Exact expectation currently encoded in test

In network_failure, run_long_query(...) is currently called with:

  • exitcode=138
  • message="DB::Exception: Query was cancelled."

So the test currently allows only cancel-style failure.

Sometimes the test returns exit code 32 instead

Logs show:

  • RemoteSource: Error occurs on cancellation
  • DB::Exception: Connection to clickhouse2:9000 terminated. (NETWORK_ERROR)
  • DB::Exception: Attempt to read after eof ... While executing Remote. (ATTEMPT_TO_READ_AFTER_EOF)
  • final shell exit code 32

This is a valid failure path when the remote connection drops while packets are being read.

In ClickHouse code:

  • throwReadAfterEOF() throws ATTEMPT_TO_READ_AFTER_EOF (Code: 32) in src/IO/VarInt.cpp.
  • RemoteSource::onCancel() catches/logs cancellation-time exceptions ("Error occurs on cancellation.") in src/Processors/Sources/RemoteSource.cpp.
  • RemoteQueryExecutor::tryCancel() sends cancel to remote via connections->sendCancel() in src/QueryPipeline/RemoteQueryExecutor.cpp.
  • If socket is already broken, cancellation can race with connection loss and produce NETWORK_ERROR plus ATTEMPT_TO_READ_AFTER_EOF.

This means that under network fault timing, both outcomes are realistic:

  • graceful cancel outcome (138, "Query was cancelled"), or
  • transport-break outcome (32, "Attempt to read after eof").
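
Note on the numbers: 138 is a process exit status, not a ClickHouse error code. The sketch below assumes that clickhouse-client exits with the server error code and that the shell keeps only the low 8 bits of it. Under that assumption QUERY_WAS_CANCELLED (error code 394) is seen as 138, while ATTEMPT_TO_READ_AFTER_EOF (error code 32) is seen unchanged.

```python
QUERY_WAS_CANCELLED = 394        # ClickHouse error code
ATTEMPT_TO_READ_AFTER_EOF = 32   # ClickHouse error code


def shell_exit_code(clickhouse_error_code: int) -> int:
    """Exit status as reported by a POSIX shell: the low 8 bits of the code."""
    return clickhouse_error_code % 256


print(shell_exit_code(QUERY_WAS_CANCELLED))        # 138
print(shell_exit_code(ATTEMPT_TO_READ_AFTER_EOF))  # 32
```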

Expected exception message types for this scenario

For this network-failure simulation, these message families are expected in practice:

  1. Cancellation path

    • Exit code: 138 (server-side error code 394, QUERY_WAS_CANCELLED)
    • DB::Exception: Query was cancelled.
  2. Connection-break path

    • Code: 32
    • DB::Exception: Attempt to read after eof ... While executing Remote.
  3. Cancellation during broken connection (secondary log)

    • Code: 210
    • DB::Exception: Connection to <host>:9000 terminated. (NETWORK_ERROR)
    • often logged as RemoteSource: Error occurs on cancellation.

Practical conclusion

The observed Code: 32 is an expected outcome for this kind of induced network outage, but the current test assertion is narrower (exit code 138 only), so the test fails even though the query failed in a valid way.
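
One possible way to express a wider check is sketched below. It does not use the real run_long_query(...) signature, which is not shown in this issue; `ExpectedFailure` and `match_expected_failure` are hypothetical names, and the exit code and message pairs are the two outcomes listed above.

```python
from typing import NamedTuple, Optional, Sequence


class ExpectedFailure(NamedTuple):
    name: str
    exitcode: int
    message: str


# The two outcomes described in this issue; anything else is still a failure.
EXPECTED_FAILURES = (
    ExpectedFailure("cancelled", 138, "DB::Exception: Query was cancelled."),
    ExpectedFailure("connection_break", 32, "DB::Exception: Attempt to read after eof"),
)


def match_expected_failure(
    exitcode: int,
    output: str,
    expected: Sequence[ExpectedFailure] = EXPECTED_FAILURES,
) -> Optional[ExpectedFailure]:
    """Return the matching outcome, or None if the failure is not an expected one."""
    for failure in expected:
        # Exit code and message must agree, so a 32 with an unrelated
        # message does not pass by accident.
        if exitcode == failure.exitcode and failure.message in output:
            return failure
    return None
```

A test step could then assert `match_expected_failure(exitcode, output) is not None` and record which path was taken, so the frequency of each outcome stays visible in the logs.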

Initiator node trace:

2026.03.10 19:22:23.215433 [ 790 ] {23d8386b-b160-466b-afcc-87ad19862fcd} <Error> RemoteSource: Error occurs on cancellation.: Code: 210. DB::Exception: Connection to clickhouse2:9000 terminated. (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000135decdf
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000ca85b8e
2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000ca85640
3. DB::Exception::Exception<String const&>(int, FormatStringHelperImpl<std::type_identity<String const&>::type>, String const&) @ 0x000000000dd380ab
4. DB::Connection::sendCancel() @ 0x0000000019c6e6b7
5. DB::MultiplexedConnections::sendCancel() @ 0x0000000019cbd72b
6. DB::RemoteQueryExecutor::tryCancel(char const*) @ 0x000000001716d7b3
7. DB::RemoteQueryExecutor::cancel() @ 0x000000001716dd7e
8. DB::RemoteSource::onCancel() @ 0x000000001a29d3d3
9. DB::ExecutingGraph::cancel(bool) @ 0x0000000019eb6dfc
10. DB::PipelineExecutor::executeStepImpl(unsigned long, DB::IAcquiredSlot*, std::atomic<bool>*) @ 0x0000000019eaf000
11. DB::PipelineExecutor::execute(unsigned long, bool) @ 0x0000000019ead21d
12. void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0>(DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x0000000019ec78da
13. ThreadPoolImpl<std::thread>::ThreadFromThreadPool::worker() @ 0x000000001373bc92
14. void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x000000001374375a
15. ? @ 0x0000000000094ac3
16. ? @ 0x0000000000126850
 (version 25.8.16.20001.altinityantalya)
2026.03.10 19:22:23.216203 [ 44 ] {23d8386b-b160-466b-afcc-87ad19862fcd} <Error> executeQuery: Code: 32. DB::Exception: Attempt to read after eof: while receiving packet from clickhouse2:9000, 172.18.0.7, local address: 172.18.0.9:51490: While executing Remote. (ATTEMPT_TO_READ_AFTER_EOF) (version 25.8.16.20001.altinityantalya) (from 127.0.0.1:35912) (query 1, line 2) (in query: SELECT count(), hostName() FROM datalakecatalog_db_0a35ccf7_1cae_11f1_949e_920007258eee.`namespace_0a35ece2_1cae_11f1_b7e6_920007258eee.table_0a35ed8c_1cae_11f1_82ed_920007258eee` WHERE NOT ignore(sleepEachRow(1)) GROUP BY hostName() SETTINGS object_storage_cluster='static_swarm_cluster', max_threads=1 ), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000135decdf
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000ca85b8e
2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000ca85640
3. DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x000000000ca946eb
4. DB::throwReadAfterEOF() @ 0x00000000136f074f
5. DB::Connection::receivePacket() @ 0x0000000019c746fb
6. DB::MultiplexedConnections::receivePacketUnlocked(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>) @ 0x0000000019cbd11f
7. DB::RemoteQueryExecutorReadContext::Task::run(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>, std::function<void ()>) @ 0x0000000017186554
8. void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>>(boost::context::detail::transfer_t) @ 0x0000000017185a83

2026.03.10 19:22:23.218029 [ 44 ] {} <Error> TCPHandler: Code: 32. DB::Exception: Attempt to read after eof: while receiving packet from clickhouse2:9000, 172.18.0.7, local address: 172.18.0.9:51490: While executing Remote. (ATTEMPT_TO_READ_AFTER_EOF), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000135decdf
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000ca85b8e
2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000ca85640
3. DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x000000000ca946eb
4. DB::throwReadAfterEOF() @ 0x00000000136f074f
5. DB::Connection::receivePacket() @ 0x0000000019c746fb
6. DB::MultiplexedConnections::receivePacketUnlocked(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>) @ 0x0000000019cbd11f
7. DB::RemoteQueryExecutorReadContext::Task::run(std::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, String const&, unsigned int)>, std::function<void ()>) @ 0x0000000017186554
8. void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>>(boost::context::detail::transfer_t) @ 0x0000000017185a83

Disconnected swarm node trace:

2026.03.10 19:21:03.413035 [ 45 ] {d91a1e31-59cc-4f33-b6b1-13f0ef073cdd} <Error> executeQuery: Code: 236. DB::NetException: Client has dropped the connection, cancel the query. (ABORTED) (version 25.8.16.20001.altinityantalya) (from 172.18.0.9:36488) (query 1, line 2) (in query: SELECT count() AS `count()`, hostName() AS `hostName()` FROM icebergS3Cluster('static_swarm_cluster', 'http://minio:9000/warehouse/data/', 'admin', '[HIDDEN]', 'Parquet', '`name` Nullable(String), `double` Nullable(Float64), `integer` Nullable(Int64)', SETTINGS iceberg_metadata_file_path = 'metadata/00001-139f78a2-4acc-440e-9566-338649611910.metadata.json') AS __table1 WHERE NOT ignore(sleepEachRow(1)) GROUP BY hostName() SETTINGS max_threads = 1), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000135decdf
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000ca85b8e
2. DB::NetException::NetException<char const (&) [53]>(int, T&&) @ 0x0000000019e0dec4
3. DB::TCPHandler::receivePacketsExpectCancel(DB::QueryState&) @ 0x0000000019e0bb11
4. DB::TCPHandler::runImpl() @ 0x0000000019debd80
5. DB::TCPHandler::run() @ 0x0000000019e0df99
6. Poco::Net::TCPServerConnection::start() @ 0x000000001f34f207
7. Poco::Net::TCPServerDispatcher::run() @ 0x000000001f34f699
8. Poco::PooledThread::run() @ 0x000000001f3156c7
9. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001f313ac1
10. ? @ 0x0000000000094ac3
11. ? @ 0x0000000000126850

2026.03.10 19:21:03.414054 [ 45 ] {d91a1e31-59cc-4f33-b6b1-13f0ef073cdd} <Error> TCPHandler: Can't send logs or exception to client. Close connection.: Code: 210. DB::NetException: I/O error: Broken pipe, while writing to socket (172.18.0.10:9000 -> 172.18.0.9:36488). (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000135decdf
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000ca85b8e
2. DB::NetException::NetException<String, String, String>(int, FormatStringHelperImpl<std::type_identity<String>::type, std::type_identity<String>::type, std::type_identity<String>::type>, String&&, String&&, String&&) @ 0x000000001380d373
3. DB::WriteBufferFromPocoSocket::nextImpl() @ 0x000000001380e09a
4. DB::WriteBuffer::next() @ 0x000000000ca94d5e
5. DB::TCPHandler::runImpl() @ 0x0000000019decb1a
6. DB::TCPHandler::run() @ 0x0000000019e0df99
7. Poco::Net::TCPServerConnection::start() @ 0x000000001f34f207
8. Poco::Net::TCPServerDispatcher::run() @ 0x000000001f34f699
9. Poco::PooledThread::run() @ 0x000000001f3156c7
10. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001f313ac1
11. ? @ 0x0000000000094ac3
12. ? @ 0x0000000000126850
 (version 25.8.16.20001.altinityantalya)
2026.03.10 19:21:03.415299 [ 45 ] {} <Error> TCPHandler: Code: 236. DB::NetException: Client has dropped the connection, cancel the query. (ABORTED), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000135decdf
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000ca85b8e
2. DB::NetException::NetException<char const (&) [53]>(int, T&&) @ 0x0000000019e0dec4
3. DB::TCPHandler::receivePacketsExpectCancel(DB::QueryState&) @ 0x0000000019e0bb11
4. DB::TCPHandler::runImpl() @ 0x0000000019debd80
5. DB::TCPHandler::run() @ 0x0000000019e0df99
6. Poco::Net::TCPServerConnection::start() @ 0x000000001f34f207
7. Poco::Net::TCPServerDispatcher::run() @ 0x000000001f34f699
8. Poco::PooledThread::run() @ 0x000000001f3156c7
9. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001f313ac1
10. ? @ 0x0000000000094ac3
11. ? @ 0x0000000000126850

2026.03.10 19:21:25.980949 [ 45 ] {} <Error> TCPHandler: TCPHandler: Code: 210. DB::NetException: Connection reset by peer, while reading from socket (peer: 172.18.0.9:38214, local: 172.18.0.10:9000). (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000135decdf
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000ca85b8e
2. DB::NetException::NetException<String, String, String>(int, FormatStringHelperImpl<std::type_identity<String>::type, std::type_identity<String>::type, std::type_identity<String>::type>, String&&, String&&, String&&) @ 0x000000001380d373
3. DB::ReadBufferFromPocoSocketBase::socketReceiveBytesImpl(char*, unsigned long) @ 0x000000001380eff3
4. DB::ReadBufferFromPocoSocketBase::nextImpl() @ 0x000000001380f975
5. DB::ReadBuffer::next() @ 0x00000000136cdd2d
6. DB::TCPHandler::runImpl() @ 0x0000000019deaab0
7. DB::TCPHandler::run() @ 0x0000000019e0df99
8. Poco::Net::TCPServerConnection::start() @ 0x000000001f34f207
9. Poco::Net::TCPServerDispatcher::run() @ 0x000000001f34f699
10. Poco::PooledThread::run() @ 0x000000001f3156c7
11. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001f313ac1
12. ? @ 0x0000000000094ac3
13. ? @ 0x0000000000126850
 (version 25.8.16.20001.altinityantalya)
2026.03.10 19:21:25.980985 [ 45 ] {} <Error> ServerErrorHandler: Code: 210. DB::NetException: Connection reset by peer, while reading from socket (peer: 172.18.0.9:38214, local: 172.18.0.10:9000). (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000135decdf
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000ca85b8e
2. DB::NetException::NetException<String, String, String>(int, FormatStringHelperImpl<std::type_identity<String>::type, std::type_identity<String>::type, std::type_identity<String>::type>, String&&, String&&, String&&) @ 0x000000001380d373
3. DB::ReadBufferFromPocoSocketBase::socketReceiveBytesImpl(char*, unsigned long) @ 0x000000001380eff3
4. DB::ReadBufferFromPocoSocketBase::nextImpl() @ 0x000000001380f975
5. DB::ReadBuffer::next() @ 0x00000000136cdd2d
6. DB::TCPHandler::runImpl() @ 0x0000000019deaab0
7. DB::TCPHandler::run() @ 0x0000000019e0df99
8. Poco::Net::TCPServerConnection::start() @ 0x000000001f34f207
9. Poco::Net::TCPServerDispatcher::run() @ 0x000000001f34f699
10. Poco::PooledThread::run() @ 0x000000001f3156c7
11. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001f313ac1
12. ? @ 0x0000000000094ac3
13. ? @ 0x0000000000126850
 (version 25.8.16.20001.altinityantalya)
2026.03.10 19:22:23.217299 [ 45 ] {be8da4f2-b184-40e9-8fae-379043958e47} <Error> TCPHandler: Can't send logs or exception to client. Close connection.: Code: 210. DB::NetException: Connection reset by peer, while writing to socket (172.18.0.10:9000 -> 172.18.0.9:54396). (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000135decdf
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000ca85b8e
2. DB::NetException::NetException<String, String, String>(int, FormatStringHelperImpl<std::type_identity<String>::type, std::type_identity<String>::type, std::type_identity<String>::type>, String&&, String&&, String&&) @ 0x000000001380d373
3. DB::WriteBufferFromPocoSocket::nextImpl() @ 0x000000001380deda
4. DB::WriteBuffer::next() @ 0x000000000ca94d5e
5. DB::TCPHandler::runImpl() @ 0x0000000019decaa9
6. DB::TCPHandler::run() @ 0x0000000019e0df99
7. Poco::Net::TCPServerConnection::start() @ 0x000000001f34f207
8. Poco::Net::TCPServerDispatcher::run() @ 0x000000001f34f699
9. Poco::PooledThread::run() @ 0x000000001f3156c7
10. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001f313ac1
11. ? @ 0x0000000000094ac3
12. ? @ 0x0000000000126850
 (version 25.8.16.20001.altinityantalya)
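
To compare the two traces, the error header lines can be reduced to timestamp, query id, logger, code, and error name. The sketch below is a small helper written against the line layout seen in the traces above only; it is not a general ClickHouse log parser, and `parse_error_line` is a hypothetical name.

```python
import re
from typing import NamedTuple, Optional

# Layout assumed from the traces in this issue:
# "<timestamp> [ <thread> ] {<query_id>} <Level> <Logger>: ... Code: N. ... (ERROR_NAME)"
LOG_ERROR_RE = re.compile(
    r"^(?P<timestamp>\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[ (?P<thread>\d+) \] "
    r"\{(?P<query_id>[^}]*)\} "
    r"<(?P<level>\w+)> "
    r"(?P<logger>\w+): "
    r".*?Code: (?P<code>\d+)\. "
    r".*?\((?P<name>[A-Z][A-Z_]+)\)"
)


class LogError(NamedTuple):
    timestamp: str
    query_id: str
    logger: str
    code: int
    name: str


def parse_error_line(line: str) -> Optional[LogError]:
    """Parse one error header line; stack frames and other lines give None."""
    m = LOG_ERROR_RE.match(line)
    if m is None:
        return None
    return LogError(m["timestamp"], m["query_id"], m["logger"], int(m["code"]), m["name"])
```

Applied to the first initiator line above, this gives logger RemoteSource, code 210, name NETWORK_ERROR. Sorting the parsed lines from both nodes by timestamp makes the order of the cancel and the connection loss easier to see.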
