[Bug] The timeout for the UpdateMetadata request is too short and causes job failover  #311

@luoyuxia

Description

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

main

Minimal reproduce step

Currently, sendMetadataRequestAndRebuildCluster has a timeout of 3s. After 3s, the future is completed without updating the metadata.

What doesn't meet your expectations?

In my case, with a partitioned table of 512 buckets and a Flink sink parallelism of 512, the request times out easily and then causes the sink job to fail.

First, when writing to a partition, the client tries to update the metadata in the method checkAndUpdatePartitionMetadata to fetch the partition's metadata. If the request times out, the metadata is not updated in the client, and the client then throws a PartitionNotExists exception although the partition does exist.
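To make the failure mode concrete, here is a minimal, self-contained sketch. It is not Fluss code: the class SwallowedTimeoutSketch, its methods, and the map-based cache are hypothetical stand-ins, and it only assumes that the update waits on a future with a timeout and logs when the timeout expires.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the failure mode; none of these names are Fluss APIs.
final class SwallowedTimeoutSketch {
    // Client-side cache: partition name -> partition id.
    private final Map<String, Long> cache = new ConcurrentHashMap<>();

    // Waits for the server response; a timeout is only logged, not propagated.
    void updateMetadata(CompletableFuture<Map<String, Long>> response, long timeoutMs) {
        try {
            cache.putAll(response.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // The caller never learns that the update did not happen.
            System.err.println("metadata update timed out after " + timeoutMs + " ms");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    // The lookup only sees the stale cache, so it reports a missing partition.
    long partitionId(String partition) {
        Long id = cache.get(partition);
        if (id == null) {
            throw new IllegalStateException("Partition does not exist: " + partition);
        }
        return id;
    }
}
```

If the response future does not complete in time, partitionId reports that the partition does not exist, even though the only thing that went wrong was the timeout.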

In my case, a timeout of 60s works.

Anything else?

I can see that we need a request timeout mechanism to prevent a request from hanging forever. But the update-metadata request should throw a timeout exception instead of just logging the timeout, so that the caller can decide whether to retry or fail directly.

For example, when creating a FlussTable, it tries to fetch the metadata of the table in metadataUpdater.checkAndUpdateTableMetadata(Collections.singleton(tablePath)). If the metadata request times out, the metadata can't be updated, which causes it to throw a "table not found in cluster" exception although the table does exist. At the very least, it should throw a timeout exception instead of the "table not found in cluster" exception, which is really confusing.
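A rough sketch of the behavior proposed above, assuming the request is represented by a CompletableFuture. Everything here is hypothetical and not taken from the Fluss code base: the class MetadataRetrySketch, the unchecked MetadataTimeoutException, and the bounded-retry policy are only one possible shape for "throw on timeout and let the caller decide".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch; the class, exception, and method names are not Fluss APIs.
final class MetadataRetrySketch {

    // Unchecked exception that tells the caller the request timed out.
    static final class MetadataTimeoutException extends RuntimeException {
        MetadataTimeoutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Waits for one response and surfaces a timeout instead of only logging it.
    static <T> T await(CompletableFuture<T> response, long timeoutMs) {
        try {
            return response.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new MetadataTimeoutException(
                    "metadata request timed out after " + timeoutMs + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for metadata", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("metadata request failed", e.getCause());
        }
    }

    // Caller-side policy: retry a bounded number of times, then rethrow the timeout.
    static <T> T awaitWithRetry(
            Supplier<CompletableFuture<T>> request, long timeoutMs, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        MetadataTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Each attempt issues a fresh request through the supplier.
                return await(request.get(), timeoutMs);
            } catch (MetadataTimeoutException e) {
                last = e;
            }
        }
        throw last;
    }
}
```

With this shape, a caller that cannot tolerate the delay can call await once and fail fast, while a sink that prefers to ride out a slow server can use awaitWithRetry. In both cases the error that reaches the user is a timeout rather than "table not found".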

Are you willing to submit a PR?

  • I'm willing to submit a PR!
