Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-36690][runtime] Fix schema operator hanging under extreme parallelized pressure #3680

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

yuxiqian
Copy link
Contributor

@yuxiqian yuxiqian commented Nov 1, 2024

This closes FLINK-36690 by fixing schema operator hanging glitch under extreme parallelized pressure.


After FLINK-36114, SchemaOperators will ask for schema evolve permission first, before sending FlushEvents to downstream sinks.

Normal case

However, Flink regards FlushEvents as normal data records and might block it to align checkpoint barriers. It might cause the following deadlock situation:

  • SchemaOperator A has obtained schema evolution permission
  • SchemaOperator B does not get the permission and hangs
  • SchemaOperator A sends a FlushEvent, but after a checkpoint barrier
  • SchemaOperator B received a checkpoint barrier after the schema change event (which is blocked)
Bad case

Now, neither A nor B can post any event records to downstream, and the entire job blocks with the following iconic error message (in TM):

0> Schema Registry is busy now, waiting for next request...
1> Schema Registry is busy now, waiting for next request...
2> Schema Registry is busy now, waiting for next request...
[Repeated lines]

This PR changes the schema evolution permission requesting workflow by:

  • SchemaOperators emit FlushEvent immediately when they received a SchemaChangeEvent.
  • A schema change request could only be permitted when a) SchemaRegistry is IDLE and b) the requesting SchemaOperator has finished data flushing already.

Since FlushEvent might be emitted from multiple SchemaOperators simultaneously, a nonce value that is uniquely bound to a schema change event is added into FlushEvent payload. WAITING_FOR_FLUSH stage is no longer necessary since this state will not block the SchemaRegistry but one single SchemaOperator now.

Proposed

It should be noted that current schema evolution design implicitly assumes that for each table, it won't be schema evolving in subTask A and emits normal data change events without blocking in subTask B at the same time.

So, after a SchemaOperator successfully triggered a flush event, there can't be any more uncommitted dirty data got written down since 1) any following data from this subTask is still being blocked and 2) other subTask can't carry any data belonging to this tableId (according to our previous guarantee).

@yuxiqian
Copy link
Contributor Author

yuxiqian commented Nov 1, 2024

Need more eyes on this PR since this PR tweaks schema evolution communication process. cc @leonardBang @ruanhang1993

@yuxiqian yuxiqian marked this pull request as ready for review November 1, 2024 12:16
@leonardBang leonardBang self-requested a review November 4, 2024 11:56
@yuxiqian yuxiqian changed the title [hotfix][runtime] Fix schema operator hanging under extreme parallelized pressure [FLINK-36690][runtime] Fix schema operator hanging under extreme parallelized pressure Nov 12, 2024
@yuxiqian yuxiqian marked this pull request as draft November 12, 2024 03:22
@yuxiqian yuxiqian force-pushed the fix/schema-evolve-e2e-passing-rate branch 2 times, most recently from 7da4d15 to ecd20b1 Compare November 12, 2024 06:35
@yuxiqian
Copy link
Contributor Author

yuxiqian commented Nov 12, 2024

Based on previous discussions, I've made the following changes:

  • Adjusted flush event nonce generating algorithm. Now the higher 32 bits are timestamp and lower 32 bits are Java hash code (like snowflake ID).
  • added parallelized schema change cases in MySQL integrated test & route e2e test.

As we're not going to merge it into 3.2.1, I would propose downgrading parallelized E2e cases to single-parallelism ones (#3718) to avoid breaking our CI pipeline.

@leonardBang
Copy link
Contributor

Thanks @yuxiqian for the contribution, it's great that you've added the flow picture to make the PR easy to catch.

@yuxiqian yuxiqian marked this pull request as ready for review November 21, 2024 06:25
@yuxiqian
Copy link
Contributor Author

@leonardBang Will this PR be reviewed soon? I'm planning to implement FLINK-36763 based on this.

@leonardBang
Copy link
Contributor

@leonardBang Will this PR be reviewed soon? I'm planning to implement FLINK-36763 based on this.

Added to my todo list

…llelized pressure

Signed-off-by: yuxiqian <34335406+yuxiqian@users.noreply.github.com>
@yuxiqian yuxiqian force-pushed the fix/schema-evolve-e2e-passing-rate branch from 9a17bb2 to 3109474 Compare December 11, 2024 10:15
Copy link
Contributor

@Shawn-Hx Shawn-Hx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work! Left some minor comments.

Signed-off-by: yuxiqian <34335406+yuxiqian@users.noreply.github.com>
@yuxiqian
Copy link
Contributor Author

yuxiqian commented Dec 13, 2024

Thanks for @Shawn-Hx's review! Addressed your comments.

Sorry to make your review more painful, but considering FLINK-36763 is expected to modify current codebase anyway, I've baked changes in this PR into #3801. Looking forward to your comments on that, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants