decommissioning ingest_v2 hangs #5068

Open · PSeitz opened this issue Jun 3, 2024 · 6 comments
@PSeitz (Contributor) commented Jun 3, 2024

When trying to quit quickwit, it occasionally hangs at "decommissioning ingester". Only `kill -9` works.

^C2024-06-03T09:35:15.935Z  INFO quickwit_ingest::ingest_v2::ingester: decommissioning ingester
^C

I couldn't reproduce it reliably; it seems to happen randomly.

@PSeitz added the bug label on Jun 3, 2024
@guilload self-assigned this on Jun 3, 2024
@guilload (Member) commented Jun 3, 2024

If there's data in flight, have you waited long enough for the next commit? It can take up to `commit_timeout_secs` + ε.
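For reference, a minimal sketch of how one might bound the wait on a graceful shutdown with tokio, to tell a genuinely hung decommission apart from one that is merely waiting out the commit timeout. The `shutdown()` future and the 30s value are illustrative assumptions, not Quickwit's actual API or default:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical stand-in for whatever future resolves once the node
// finishes decommissioning; not Quickwit's actual API.
async fn shutdown() { /* ... */ }

#[tokio::main]
async fn main() {
    // commit_timeout_secs (assumed 30s here) plus a small epsilon:
    // waiting longer than this suggests a real hang rather than a
    // commit still in flight.
    let bound = Duration::from_secs(30 + 5);
    match timeout(bound, shutdown()).await {
        Ok(()) => println!("decommissioned cleanly"),
        Err(_) => eprintln!("still hanging past commit_timeout_secs + ε; likely a bug"),
    }
}
```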

@PSeitz (Contributor, author) commented Jun 4, 2024

In some cases I didn't ingest any data, so there shouldn't be any data in flight. I also hadn't enabled ingest V2 via `QW_ENABLE_INGEST_V2`.

@guilload (Member) commented Jun 7, 2024

Here is the bug.

Executing the REST API tests against a running node that does not have ingest V2 enabled still creates a shard and drops a few records into it. The shard never gets indexed and is never cleaned up. The next time you decommission the node without V2 enabled, we wait for the shard to be drained, which never happens.
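A rough illustration of this failure mode (types and names are hypothetical, not the actual Quickwit code): decommissioning awaits every local shard being drained, but with no indexing pipeline ever consuming the orphaned shard, the condition can never become true:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical stand-in for a persisted shard's backlog of records.
struct Shard {
    pending_records: usize,
}

impl Shard {
    fn is_drained(&self) -> bool {
        self.pending_records == 0
    }
}

// Simplified decommission loop: wait until every shard is drained.
// If no indexer ever consumes the orphaned shard, pending_records
// never reaches zero and this loop spins forever.
async fn decommission(shards: &[Shard]) {
    while !shards.iter().all(Shard::is_drained) {
        sleep(Duration::from_millis(100)).await;
    }
}
```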

@rdettai (Contributor) commented Aug 2, 2024

> Executing the REST API tests against a running node that does not have ingest V2 enabled still creates a shard and drops a few records into it. The shard never gets indexed and is never cleaned up. The next time you decommission the node without V2 enabled, we wait for the shard to be drained, which never happens.

I don't think this is the only issue. See analysis in #5283.

TL;DR, I would say there are 2 other issues:

  • when shutting down the control plane before it has a chance to schedule the ingest pipeline of a new indexer node, that node will hang forever during shutdown because its shard never gets indexed
  • when shutting down all nodes of a cluster at once, the indexer tries to commit one last empty batch (I assume to signal that the shard is closed), but it indefinitely fails to do so because the metastore/control plane are not there anymore (see the sketch after this list)
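A hedged sketch of the second failure mode (names, error type, and retry policy are illustrative assumptions, not Quickwit's actual implementation): if the final commit is retried unconditionally, then once the metastore is gone the shutdown can never make progress:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical error for a metastore that is no longer reachable.
struct MetastoreUnavailable;

// Stand-in for the call that commits the final (empty) batch to signal
// that the shard is closed. Once the metastore is down it always fails.
async fn commit_final_batch() -> Result<(), MetastoreUnavailable> {
    Err(MetastoreUnavailable)
}

// Simplified shutdown path: retry the final commit until it succeeds.
// With the metastore/control plane already shut down, this loop never
// exits, so the node hangs instead of completing its shutdown.
async fn close_shard_on_shutdown() {
    while commit_final_batch().await.is_err() {
        sleep(Duration::from_millis(500)).await;
    }
}
```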

@rdettai (Contributor) commented Aug 2, 2024

In #5283 I added 2 integration tests covering the two issues mentioned above:

  • ingest_tests::test_shutdown_metastore_first

Both pass on ingest V1 and fail if we enable ingest V2

@rdettai (Contributor) commented Jan 6, 2025

Some extra docs: #5418
