decommissioning ingest_v2 hangs #5068

Open · PSeitz opened this issue Jun 3, 2024 · 6 comments
@PSeitz (Contributor) commented Jun 3, 2024

When trying to quit quickwit, it occasionally hangs at "decommissioning ingester". Only `kill -9` works.

^C2024-06-03T09:35:15.935Z  INFO quickwit_ingest::ingest_v2::ingester: decommissioning ingester
^C

I couldn't reproduce it reliably; it seems to happen randomly.

@PSeitz added the bug label on Jun 3, 2024
@guilload self-assigned this on Jun 3, 2024
@guilload (Member) commented Jun 3, 2024

If there's data in flight, have you waited long enough for the next commit? It can take up to `commit_timeout_secs` + ε.
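For reference, a minimal sketch of how one might bound the wait on a graceful shutdown with tokio, to tell a genuinely hung decommission apart from one that is merely waiting out the commit timeout. The `shutdown()` future and the 30s value are illustrative assumptions, not Quickwit's actual API or default:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical stand-in for whatever future resolves once the node
// finishes decommissioning; not Quickwit's actual API.
async fn shutdown() { /* ... */ }

#[tokio::main]
async fn main() {
    // commit_timeout_secs (assumed 30s here) plus a small epsilon:
    // waiting longer than this suggests a real hang rather than a
    // commit still in flight.
    let bound = Duration::from_secs(30 + 5);
    match timeout(bound, shutdown()).await {
        Ok(()) => println!("decommissioned cleanly"),
        Err(_) => eprintln!("still hanging past commit_timeout_secs + ε; likely a bug"),
    }
}
```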

@PSeitz (Contributor, author) commented Jun 4, 2024

In some cases I didn't ingest any data, so there shouldn't be any data in flight. I also hadn't enabled ingest V2 via `QW_ENABLE_INGEST_V2`.

@guilload (Member) commented Jun 7, 2024

Here is the bug.

Executing the REST API tests against a running node that does not have ingest V2 enabled still creates a shard and drops a few records into it. The shard never gets indexed and is never cleaned up. The next time you decommission the node without V2 enabled, we wait for the shard to be drained, which never happens.
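A rough illustration of this failure mode (types and names are hypothetical, not the actual Quickwit code): decommissioning awaits every local shard being drained, but with no indexing pipeline ever consuming the orphaned shard, the condition can never become true:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical stand-in for a persisted shard's backlog of records.
struct Shard {
    pending_records: usize,
}

impl Shard {
    fn is_drained(&self) -> bool {
        self.pending_records == 0
    }
}

// Simplified decommission loop: wait until every shard is drained.
// If no indexer ever consumes the orphaned shard, pending_records
// never reaches zero and this loop spins forever.
async fn decommission(shards: &[Shard]) {
    while !shards.iter().all(Shard::is_drained) {
        sleep(Duration::from_millis(100)).await;
    }
}
```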

@rdettai (Contributor) commented Aug 2, 2024

> Executing the REST API tests against a running node that does not have ingest V2 enabled still creates a shard and drops a few records into it. The shard never gets indexed and is never cleaned up. The next time you decommission the node without V2 enabled, we wait for the shard to be drained, which never happens.

I don't think this is the only issue. See analysis in #5283.

TL;DR, I would say there are 2 other issues:

  • when shutting down the control plane before it has a chance to schedule the ingest pipeline of a new indexer node, that node will hang forever during shutdown because its shard never gets indexed
  • when shutting down all nodes of a cluster at once, the indexer tries to commit one last empty batch (I assume to signal that the shard is closed), but it indefinitely fails to do so because the metastore/control plane are not there anymore (see the sketch after this list)
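A hedged sketch of the second failure mode (names, error type, and retry policy are illustrative assumptions, not Quickwit's actual implementation): if the final commit is retried unconditionally, then once the metastore is gone the shutdown can never make progress:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical error for a metastore that is no longer reachable.
struct MetastoreUnavailable;

// Stand-in for the call that commits the final (empty) batch to signal
// that the shard is closed. Once the metastore is down it always fails.
async fn commit_final_batch() -> Result<(), MetastoreUnavailable> {
    Err(MetastoreUnavailable)
}

// Simplified shutdown path: retry the final commit until it succeeds.
// With the metastore/control plane already shut down, this loop never
// exits, so the node hangs instead of completing its shutdown.
async fn close_shard_on_shutdown() {
    while commit_final_batch().await.is_err() {
        sleep(Duration::from_millis(500)).await;
    }
}
```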

@rdettai (Contributor) commented Aug 2, 2024

In #5283 I added 2 integration tests covering the two issues mentioned above:

  • ingest_tests::test_shutdown_metastore_first

Both pass on ingest V1 and fail if we enable ingest V2

@rdettai (Contributor) commented Jan 6, 2025

Some extra docs: #5418
