bug: stuck when dropping sink #7639

Closed
Tracked by #6640
fuyufjh opened this issue Feb 1, 2023 · 9 comments
Labels: component/connector, type/bug

@fuyufjh
Member

fuyufjh commented Feb 1, 2023

Describe the bug

Create a sink (nexmark_q4), wait for a while, and then drop it. The client then got stuck.

CREATE SINK nexmark_q4
AS
SELECT Q.category,
       AVG(Q.final) as avg
FROM (SELECT MAX(B.price) AS final,
             A.category
      FROM auction A,
           bid B
      WHERE A.id = B.auction
        AND B.date_time BETWEEN A.date_time AND A.expires
      GROUP BY A.id, A.category) Q
GROUP BY Q.category
WITH ( connector = 'blackhole' );

Observations:

  1. The CPU usage is zero.
  2. async stack trace (verbose):

     [image]

  3. Barriers:

     [image]

To Reproduce

Failed to reproduce it in my environment...

Expected behavior

No response

Additional context

fuyufjh added the type/bug label Feb 1, 2023
github-actions bot added this to the release-0.1.17 milestone Feb 1, 2023
@BugenZhao
Member

Looks like we're stuck in sync_epoch. Does it mean that there's a deadlock in the Hummock event loop? Note that dropping streaming jobs triggers the de-registration of the compaction groups and bloom filter catalog for internal tables, and it's easy to get it wrong.
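
For illustration only, here is a minimal sketch of why a lost completion notification would match the symptoms (zero CPU, caller parked in an await). It is not RisingWave code: `sync_epoch`, `demo`, and the reply strings are hypothetical stand-ins, and the timeout exists only so the sketch terminates instead of hanging.

```python
import asyncio


async def sync_epoch(done: asyncio.Future, timeout_s: float) -> str:
    """Hypothetical stand-in: wait until the event loop reports the sync result."""
    try:
        # shield() keeps the timeout from cancelling the underlying future.
        return await asyncio.wait_for(asyncio.shield(done), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Without a timeout this await would simply never return,
        # and the waiter would sit idle without using any CPU.
        return "timed out"


async def demo(handler_replies: bool) -> str:
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    if handler_replies:
        # Healthy path: the handler eventually completes the future.
        loop.call_later(0.01, done.set_result, "synced")
    # Faulty path (assumed): nobody ever completes `done`.
    return await sync_epoch(done, timeout_s=0.1)


if __name__ == "__main__":
    print(asyncio.run(demo(True)))
    print(asyncio.run(demo(False)))
```

If something like this were the cause, the interesting question would be which code path drops or forgets the notification, rather than where a lock is held.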

Will #7637 be related?

@fuyufjh
Member Author

fuyufjh commented Feb 1, 2023

Will #7637 be related?

I guess not? #7637 was a deterministic bug, while this one looks intermittent.

Does it mean that there's a deadlock in the Hummock event loop? Note that dropping streaming jobs triggers the de-registration of the compaction groups and bloom filter catalog for internal tables, and it's easy to get it wrong.

Can you give some hints on how to investigate this? I am trying to enable verbose_async_trace now.

@yezizp2012
Member

yezizp2012 commented Feb 1, 2023

Will #7637 be related?

Guess not, +1. #7637 only fixes the sink catalog metadata in the catalog manager. Registration and unregistration of compaction groups are already handled properly via TableFragments.

@BugenZhao
Member

I am trying to enable verbose_async_trace now.

async_stack_trace seems to be set to Verbose by default on the testing cluster as "store_await_sync" is already a verbose span. 🤔

I've quickly checked the code of the HummockEventHandler, and there isn't even an await while processing an event. I've also run gdb -batch -ex 'thread apply all bt' -p 1 to check the backtraces of all threads and found nothing interesting, such as a thread parked on a parking-lot lock. 🤔

@fuyufjh
Member Author

fuyufjh commented Feb 2, 2023

async_stack_trace seems to be set to Verbose by default on the testing cluster as "store_await_sync" is already a verbose span. 🤔

Haha, that screenshot was updated by me after I set async_stack_trace to "verbose".

@wenym1
Contributor

wenym1 commented Feb 15, 2023

I've tried to reproduce the bug both in our local environment and in the CI environment, but failed.

The CI environment used to be able to reproduce the bug, but we have now run the same benchmark setting many times and still cannot reproduce it.

I suspect that the bug has been fixed incidentally by some PR. I have also added the necessary logs to trace the progress of sync, so if there is a pending sync task, we will be able to see it in the log.
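
As a rough sketch of that kind of tracking (not the actual logging that was added; `PendingSyncTracker` and its methods are hypothetical names), one could record when each sync starts and list the epochs that have been pending for too long:

```python
import time


class PendingSyncTracker:
    """Hypothetical helper: remembers when each sync started so stalls are visible."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the behaviour can be tested
        self._started = {}    # epoch -> start time in seconds

    def start(self, epoch: int) -> None:
        self._started[epoch] = self._clock()

    def finish(self, epoch: int) -> None:
        # Finishing an unknown epoch is ignored rather than treated as an error.
        self._started.pop(epoch, None)

    def stalled(self, threshold_s: float) -> list:
        """Return the epochs whose sync has been pending for at least threshold_s."""
        now = self._clock()
        return sorted(
            epoch
            for epoch, started_at in self._started.items()
            if now - started_at >= threshold_s
        )
```

A periodic task could call `stalled(...)` and emit a warning for each returned epoch, so a hang like the one reported here would leave a trace even when it cannot be reproduced on demand.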

I think we can hold this issue and observe whether the bug happens again in the future.

@fuyufjh
Member Author

fuyufjh commented Jul 12, 2023

It seems that this bug has not been seen for several months?

@wenym1
Contributor

wenym1 commented Jul 12, 2023

It seems that this bug has not been seen for several months?

I think so. Maybe we can close this issue?

@fuyufjh
Member Author

fuyufjh commented Jul 12, 2023

Agree. Let's close it now.

fuyufjh closed this as not planned Jul 12, 2023