
Conversation


@zczhao0809 (Owner) commented Oct 28, 2025

Improve slive stress tests with path scanning and better file handling

Problem Statement

The slive stress test framework currently has several issues that impact test reliability and usability:

  1. High Failure Rate: All operations (create, read, rename, delete) use random path generation, leading to numerous FAILURES and NOT_FOUND errors when targeting non-existent files.

  2. Random Algorithm in Upstream Source: In the Hadoop 3.5.0 slive test source code, every operation uses a random algorithm to generate paths, which is the root cause of the high FAILURES and NOT_FOUND error counts.

  3. Runtime Conflict: Directly replacing the JAR file to apply patches conflicts with other Hadoop test tools such as TestDFSIO and NNBench, making it impractical for production use.

  4. Commented-Out Features: In the Hadoop 3.5.0 slive test source code, critical operation phase configurations (beg, mid, end) were commented out, preventing proper test execution control and phase-based operation scheduling.

  5. Unclear Logging: Some log messages lack clarity, making it difficult to filter and diagnose issues during test execution.

  6. Concurrent Operation Issues: Random path generation causes significant failures when multiple mappers operate on the same directory concurrently, reducing test accuracy.

Solution Overview

This PR introduces a backward-compatible improvement to the slive test framework with the following changes:

1. Configurable Algorithm Selection

  • Add USE_NEW_ALGORITHM configuration option to enable/disable the new path selection algorithm
  • Maintains full backward compatibility with existing configurations
  • Default behavior preserves original functionality
  • Solves problem 3: no need to replace the JAR file; the new mode can be used alongside other test tools
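
As a minimal sketch, assuming the slive.use.new.algorithm property name shown in the Usage section below; the class and method names here are illustrative, not the PR's actual ConfigExtractor change:

  import org.apache.hadoop.conf.Configuration;

  // Illustrative toggle: defaults to false so existing tests keep the original behavior.
  final class NewAlgorithmToggle {
      static final String USE_NEW_ALGORITHM_KEY = "slive.use.new.algorithm";

      private final Configuration conf;

      NewAlgorithmToggle(Configuration conf) {
          this.conf = conf;
      }

      boolean useNewAlgorithm() {
          return conf.getBoolean(USE_NEW_ALGORITHM_KEY, false);
      }
  }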

2. Smart Path Selection for Operations

  • CREATE operations: Use UUID-based unique path generation to ensure file uniqueness
    • May still result in FileAlreadyExistsException in rare cases, but this exception is now properly handled
  • READ/DELETE/RENAME/APPEND/TRUNCATE operations:
    • Scan existing files before each operation
    • Randomly select from existing files to ensure operations target valid files
    • Significantly reduces NOT_FOUND errors
  • LS operations:
    • Scan existing directories before listing
    • Randomly select from existing directories
    • Improves test reliability
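
A hedged sketch of the scan-then-select idea described above; the class and method names are invented for illustration and are not the PR's actual PathFinder.java:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.UUID;
  import java.util.concurrent.ThreadLocalRandom;

  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Illustrative only: scan base_dir, then pick a real target instead of a random path.
  final class PathSelectionSketch {
      private final FileSystem fs;
      private final Path baseDir;

      PathSelectionSketch(FileSystem fs, Path baseDir) {
          this.fs = fs;
          this.baseDir = baseDir;
      }

      // CREATE: a UUID suffix makes collisions between concurrent mappers very unlikely.
      Path newUniqueFile() {
          return new Path(baseDir, "slive-" + UUID.randomUUID());
      }

      // READ/DELETE/RENAME/APPEND/TRUNCATE: select from files that actually exist.
      Path pickExistingFile() throws IOException {
          List<Path> files = new ArrayList<>();
          collectFiles(baseDir, files);
          if (files.isEmpty()) {
              return null; // caller records NOT_FOUND instead of a hard failure
          }
          return files.get(ThreadLocalRandom.current().nextInt(files.size()));
      }

      private void collectFiles(Path dir, List<Path> out) throws IOException {
          for (FileStatus status : fs.listStatus(dir)) {
              if (status.isDirectory()) {
                  collectFiles(status.getPath(), out);
              } else {
                  out.add(status.getPath());
              }
          }
      }
  }

Note that a race window between scanning and using a path still exists under concurrency, which is why the CREATE path keeps its FileAlreadyExistsException handling and the read-side operations keep their NOT_FOUND handling.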

3. Improved Exception Handling

  • Distinguish between FileAlreadyExistsException and general IOException in CREATE operations
  • Correct error classification in DELETE operations (treat false return as NOT_FOUND instead of FAILURES)
  • More accurate error statistics and reporting
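
A short sketch of this classification, under the assumption that SUCCESS/NOT_FOUND/FAILURES map to slive's report counters; the methods below are illustrative, not the PR's exact diff:

  import java.io.IOException;

  import org.apache.hadoop.fs.FileAlreadyExistsException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  final class ErrorClassificationSketch {
      // CREATE: separate "file already there" from genuine I/O failures.
      static void create(FileSystem fs, Path target) {
          try {
              fs.create(target, false).close(); // overwrite=false: an existing file throws
              // record SUCCESS
          } catch (FileAlreadyExistsException e) {
              // rare with UUID-based paths; counted on its own, not as a generic FAILURE
          } catch (IOException e) {
              // record FAILURES: a real error, not a benign collision
          }
      }

      // DELETE: a false return means the path was absent, not that the call failed.
      static void delete(FileSystem fs, Path target) throws IOException {
          if (fs.delete(target, false)) {
              // record SUCCESS
          } else {
              // record NOT_FOUND instead of FAILURES
          }
      }
  }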

4. Enhanced Logging

  • Simplified log messages for better readability
  • Remove unnecessary variable references that could cause compilation issues
  • Better error context in warning messages
  • Solves problem 5: clear and filterable log messages

5. Code Quality Improvements

  • Remove redundant data structures (e.g., existingFilesList, existingDirsList)
  • Direct use of List<Path> for cleaner code
  • Maintain comprehensive logging for debugging

6. Restore Commented Features

  • Re-enable the operation phase configurations (beg, mid, end) that were commented out in the Hadoop 3.5.0 slive source (see problem 4 above), restoring phase-based operation scheduling

Technical Details

Modified Files (14 files)

  • PathFinder.java: New file (177 lines) - Core implementation of path scanning and selection
  • RenameOp.java: Auto-create target directory, operation-specific path selection
  • CreateOp.java: Fixed FileAlreadyExistsException import, improved error handling
  • DeleteOp.java: Better error categorization
  • ReadOp/AppendOp/TruncateOp.java: FileNotFoundException handling improvements
  • ConfigExtractor.java: Add USE_NEW_ALGORITHM support
  • WeightSelector.java: Support for different operation contexts
  • And more...

Testing

  • Tested with 100-1000 ops on 10 mappers
  • Verified reduced failure rates in concurrent scenarios
  • Confirmed backward compatibility with default settings
  • Validated with multiple FileSystem implementations (HDFS, SFS)

Benefits

  1. Improved Test Reliability: Substantially reduced NOT_FOUND and FAILURES; in the 5000-operation comparison discussed in the review below, the failure rate dropped from roughly 50% with the original algorithm to a small minority of operations
  2. Better Concurrent Behavior: Operations now target valid files, reducing conflicts
  3. Backward Compatible: Existing tests continue to work without modification
  4. Production Ready: No need for JAR replacement, works alongside other test tools
  5. Enhanced Debugging: Clearer logs and more accurate error reporting

Usage

Enable the new algorithm:

-D slive.use.new.algorithm=true
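
A fuller invocation might look like the following; the tests jar name and the -maps/-ops option spellings are assumptions about the standard slive entry point, so check them against your Hadoop build:

  # hypothetical example; adjust the jar name to your build
  hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar SliveTest \
    -D slive.use.new.algorithm=true \
    -maps 10 -ops 1000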

Run standard tests (existing behavior):

# Default uses original algorithm

Related Issues

  • Addresses test reliability issues in concurrent MapReduce scenarios
  • Resolves conflicts with other Hadoop test tools

Commit: Improve slive stress tests with path scanning and better file handling

Signed-off-by: zhengchen.zhao <zhengchen.zhao@iomesh.com>
    if (isExistingFileOperation(operationType)) {
      if (useNewAlgorithm) {
        LOG.info("Use new algorithm mode: scanning base_dir for " + operationType + " operation");
        scanBaseDirectory();

@NorthCedar commented Oct 28, 2025


A full path scan for each getFile operation will slow down the test

@zczhao0809 (Owner, Author) replied:

Yes, I also tried the following approach: short-term caching of directory scan results, for example reusing the previous scan results for 5 seconds, combined with an error-driven refresh that forces a rescan after three consecutive file_not_found errors. However, this method still had a high failure rate.
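
A minimal sketch of such a cache, assuming only the 5-second TTL and three-miss threshold from the description above; everything else (names, the Scanner hook) is invented for illustration:

  import java.io.IOException;
  import java.util.Collections;
  import java.util.List;

  import org.apache.hadoop.fs.Path;

  // Illustrative time- and error-driven cache; not code from this PR.
  final class CachedScan {
      interface Scanner { List<Path> scan() throws IOException; }

      private static final long TTL_MS = 5_000;   // reuse scan results for 5 s
      private static final int MISS_LIMIT = 3;    // force a rescan after 3 NOT_FOUNDs

      private List<Path> cached = Collections.emptyList();
      private long scannedAtMs = 0;
      private int consecutiveMisses = 0;

      synchronized List<Path> files(Scanner scanner) throws IOException {
          long now = System.currentTimeMillis();
          if (now - scannedAtMs > TTL_MS || consecutiveMisses >= MISS_LIMIT) {
              cached = scanner.scan();
              scannedAtMs = now;
              consecutiveMisses = 0;
          }
          return cached;
      }

      synchronized void recordNotFound() { consecutiveMisses++; }
      synchronized void recordHit() { consecutiveMisses = 0; }
  }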

Using the new algorithm, 5000 operations took approximately 16 minutes, with most operations succeeding. Using the original algorithm, 5000 operations took 3 minutes, with approximately 50% of the operations failing. This trade-off seems acceptable if accuracy, rather than performance, is the primary concern.

Do you have any suggestions for avoiding this problem?

if ("LS".equals(operationType)) {
if (useNewAlgorithm) {
LOG.info("Starting to scan base_dir and select existing directories for LS operation");
scanBaseDirectory();


ditto
