[Bug] Fallback invalid when local storage's all disks are in high watermark  #1428

@zuston

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

I use the following configuration:

# HDFS fallback strategy
rss.server.hybrid.storage.fallback.strategy.class       org.apache.uniffle.server.storage.LocalStorageManagerFallbackStrategy
rss.server.hybrid.storage.manager.selector.class        org.apache.uniffle.server.storage.hybrid.HugePartitionSensitiveStorageManagerSelector

On a single shuffle server, fallback to Hadoop storage does not take effect when all of the local storage disks are above the high watermark.

Once all disks are above the high watermark, selectStorage returns null. Because the default retry times value in LocalStorageManagerFallbackStrategy is 0, the event is not fallen back to Hadoop storage, and it is then discarded in DefaultFlushEventHandler.
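The failure path can be modeled with a minimal sketch. This is not the Uniffle source: the class FlushFallbackSketch, its methods, and the strict "retry times greater than threshold" rule in shouldFallback are all assumptions made for illustration, chosen only so that the model reproduces the behavior described in this issue.

```java
// Hypothetical model of the failure path; none of these names are Uniffle APIs.
final class FlushFallbackSketch {

    // Models selectStorage: returns the index of the first disk that is still
    // below the high watermark, or null when every disk is above it.
    static Integer selectStorage(double[] diskUsage, double highWatermark) {
        for (int i = 0; i < diskUsage.length; i++) {
            if (diskUsage[i] < highWatermark) {
                return i;
            }
        }
        return null;
    }

    // ASSUMPTION: fallback is allowed only once the event has been retried
    // more than maxFailTimes times.
    static boolean shouldFallback(int retryTimes, int maxFailTimes) {
        return retryTimes > maxFailTimes;
    }

    // Models the handler: where does a flush event end up?
    static String handle(double[] diskUsage, double highWatermark,
                         int retryTimes, int maxFailTimes) {
        Integer disk = selectStorage(diskUsage, highWatermark);
        if (disk != null) {
            return "LOCAL";
        }
        if (shouldFallback(retryTimes, maxFailTimes)) {
            return "HADOOP";
        }
        // Null storage and no fallback: the event is dropped.
        return "DISCARDED";
    }

    public static void main(String[] args) {
        double[] allDisksFull = {0.99, 0.97};
        // A fresh event (retryTimes = 0) with the default threshold of 0.
        System.out.println(handle(allDisksFull, 0.95, 0, 0)); // prints DISCARDED
    }
}
```

Under this model, a fresh event has a retry count of 0, and 0 is not greater than the default threshold of 0, so the event never reaches Hadoop storage.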

The logs are as follows:

image

I will fix this in the following two steps:

  1. Allow a negative value for rss.server.hybrid.storage.fallback.max.fail.times.
  2. Improve how DefaultFlushEventHandler handles a flush event whose storage is null.
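The two steps above can be sketched as follows. This is a simplified model, not the actual patch: FlushRetrySketch, MAX_ATTEMPTS, and route are hypothetical names, the strict "greater than" comparison is an assumption, and treating a null storage as a retryable failure is only one possible reading of step 2, since the issue does not say what the change will be.

```java
// Hypothetical model of the proposed fix; none of these names are Uniffle APIs.
final class FlushRetrySketch {

    enum Target { LOCAL, HADOOP, DISCARDED }

    // Hypothetical cap on how often a single flush event is re-examined.
    static final int MAX_ATTEMPTS = 3;

    // ASSUMPTION: same rule as before, fall back once the retry count
    // exceeds the configured threshold. A negative threshold is now legal.
    static boolean shouldFallback(int retryTimes, int maxFailTimes) {
        return retryTimes > maxFailTimes;
    }

    static Target route(boolean localWritable, int maxFailTimes) {
        for (int retryTimes = 0; retryTimes < MAX_ATTEMPTS; retryTimes++) {
            if (localWritable) {
                return Target.LOCAL;
            }
            // Step 1: with maxFailTimes = -1 this is true on the first attempt.
            if (shouldFallback(retryTimes, maxFailTimes)) {
                return Target.HADOOP;
            }
            // Step 2 (one possible reading): a null local storage counts as a
            // failed attempt and the event is retried instead of being dropped.
        }
        return Target.DISCARDED;
    }

    public static void main(String[] args) {
        System.out.println(route(false, -1)); // prints HADOOP
        System.out.println(route(false, 0));  // prints HADOOP
    }
}
```

In this model either step alone is enough to reach Hadoop storage: a threshold of -1 falls back on the first attempt, and counting a null storage as a failure lets the default threshold of 0 be exceeded on the second attempt.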

Affects Version(s)

master

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
