
Conversation


@zczhao0809 (Owner) commented Oct 28, 2025

Improve slive stress tests with path scanning and better file handling

Problem Statement

The slive stress test framework currently has several issues that impact test reliability and usability:

  1. High Failure Rate: All operations (create, read, rename, delete) use random path generation, leading to numerous FAILURES and NOT_FOUND errors when targeting non-existent files.

  2. Random Algorithm in Upstream Source: In the Hadoop 3.5.0 slive test source code, every operation uses a random algorithm to generate paths, which is the root cause of the high FAILURES and NOT_FOUND error counts.

  3. Runtime Conflict: Directly replacing the JAR file to apply patches conflicts with other Hadoop test tools such as TestDFSIO and NNBench, making it impractical for production use.

  4. Commented-Out Features: In the Hadoop 3.5.0 slive test source code, critical operation phase configurations (beg, mid, end) were commented out, preventing proper test execution control and phase-based operation scheduling.

  5. Unclear Logging: Some log messages lack clarity, making it difficult to filter and diagnose issues during test execution.

  6. Concurrent Operation Issues: Random path generation causes significant failures when multiple mappers operate on the same directory concurrently, reducing test accuracy.

Solution Overview

This PR introduces a backward-compatible improvement to the slive test framework with the following changes:

1. Configurable Algorithm Selection

  • Add USE_NEW_ALGORITHM configuration option to enable/disable the new path selection algorithm
  • Maintains full backward compatibility with existing configurations
  • Default behavior preserves original functionality
  • Solves problem 3: no need to replace the JAR file; the new mode can be used alongside other test tools
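
As a minimal sketch, assuming the slive.use.new.algorithm property name shown in the Usage section below; the class and method names here are illustrative, not the PR's actual ConfigExtractor change:

  import org.apache.hadoop.conf.Configuration;

  // Illustrative toggle: defaults to false so existing tests keep the original behavior.
  final class NewAlgorithmToggle {
      static final String USE_NEW_ALGORITHM_KEY = "slive.use.new.algorithm";

      private final Configuration conf;

      NewAlgorithmToggle(Configuration conf) {
          this.conf = conf;
      }

      boolean useNewAlgorithm() {
          return conf.getBoolean(USE_NEW_ALGORITHM_KEY, false);
      }
  }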

2. Smart Path Selection for Operations

  • CREATE operations: Use UUID-based unique path generation to ensure file uniqueness
    • May still result in FileAlreadyExistsException in rare cases, but this exception is now properly handled
  • READ/DELETE/RENAME/APPEND/TRUNCATE operations:
    • Scan existing files before each operation
    • Randomly select from existing files to ensure operations target valid files
    • Significantly reduces NOT_FOUND errors
  • LS operations:
    • Scan existing directories before listing
    • Randomly select from existing directories
    • Improves test reliability
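
A hedged sketch of the scan-then-select idea described above; the class and method names are invented for illustration and are not the PR's actual PathFinder.java:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.UUID;
  import java.util.concurrent.ThreadLocalRandom;

  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Illustrative only: scan base_dir, then pick a real target instead of a random path.
  final class PathSelectionSketch {
      private final FileSystem fs;
      private final Path baseDir;

      PathSelectionSketch(FileSystem fs, Path baseDir) {
          this.fs = fs;
          this.baseDir = baseDir;
      }

      // CREATE: a UUID suffix makes collisions between concurrent mappers very unlikely.
      Path newUniqueFile() {
          return new Path(baseDir, "slive-" + UUID.randomUUID());
      }

      // READ/DELETE/RENAME/APPEND/TRUNCATE: select from files that actually exist.
      Path pickExistingFile() throws IOException {
          List<Path> files = new ArrayList<>();
          collectFiles(baseDir, files);
          if (files.isEmpty()) {
              return null; // caller records NOT_FOUND instead of a hard failure
          }
          return files.get(ThreadLocalRandom.current().nextInt(files.size()));
      }

      private void collectFiles(Path dir, List<Path> out) throws IOException {
          for (FileStatus status : fs.listStatus(dir)) {
              if (status.isDirectory()) {
                  collectFiles(status.getPath(), out);
              } else {
                  out.add(status.getPath());
              }
          }
      }
  }

Note that a race window between scanning and using a path still exists under concurrency, which is why the CREATE path keeps its FileAlreadyExistsException handling and the read-side operations keep their NOT_FOUND handling.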

3. Improved Exception Handling

  • Distinguish between FileAlreadyExistsException and general IOException in CREATE operations
  • Correct error classification in DELETE operations (treat false return as NOT_FOUND instead of FAILURES)
  • More accurate error statistics and reporting
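
A short sketch of this classification, under the assumption that SUCCESS/NOT_FOUND/FAILURES map to slive's report counters; the methods below are illustrative, not the PR's exact diff:

  import java.io.IOException;

  import org.apache.hadoop.fs.FileAlreadyExistsException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  final class ErrorClassificationSketch {
      // CREATE: separate "file already there" from genuine I/O failures.
      static void create(FileSystem fs, Path target) {
          try {
              fs.create(target, false).close(); // overwrite=false: an existing file throws
              // record SUCCESS
          } catch (FileAlreadyExistsException e) {
              // rare with UUID-based paths; counted on its own, not as a generic FAILURE
          } catch (IOException e) {
              // record FAILURES: a real error, not a benign collision
          }
      }

      // DELETE: a false return means the path was absent, not that the call failed.
      static void delete(FileSystem fs, Path target) throws IOException {
          if (fs.delete(target, false)) {
              // record SUCCESS
          } else {
              // record NOT_FOUND instead of FAILURES
          }
      }
  }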

4. Enhanced Logging

  • Simplified log messages for better readability
  • Remove unnecessary variable references that could cause compilation issues
  • Better error context in warning messages
  • Solves problem 5: clear and filterable log messages

5. Code Quality Improvements

  • Remove redundant data structures (e.g., existingFilesList, existingDirsList)
  • Direct use of List<Path> for cleaner code
  • Maintain comprehensive logging for debugging

6. Restore Commented Features

  • Re-enable the operation phase configurations (beg, mid, end) that were commented out in the Hadoop 3.5.0 slive source (see problem 4 above), restoring phase-based operation scheduling

Technical Details

Modified Files (14 files)

  • PathFinder.java: New file (177 lines) - Core implementation of path scanning and selection
  • RenameOp.java: Auto-create target directory, operation-specific path selection
  • CreateOp.java: Fixed FileAlreadyExistsException import, improved error handling
  • DeleteOp.java: Better error categorization
  • ReadOp/AppendOp/TruncateOp.java: FileNotFoundException handling improvements
  • ConfigExtractor.java: Add USE_NEW_ALGORITHM support
  • WeightSelector.java: Support for different operation contexts
  • And more...

Testing

  • Tested with 100-1000 ops on 10 mappers
  • Verified reduced failure rates in concurrent scenarios
  • Confirmed backward compatibility with default settings
  • Validated with multiple FileSystem implementations (HDFS, SFS)

Benefits

  1. Improved Test Reliability: Substantially reduced NOT_FOUND and FAILURES; in the 5000-operation comparison discussed in the review below, the failure rate dropped from roughly 50% with the original algorithm to a small minority of operations
  2. Better Concurrent Behavior: Operations now target valid files, reducing conflicts
  3. Backward Compatible: Existing tests continue to work without modification
  4. Production Ready: No need for JAR replacement, works alongside other test tools
  5. Enhanced Debugging: Clearer logs and more accurate error reporting

Usage

Enable the new algorithm:

-D slive.use.new.algorithm=true
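
A fuller invocation might look like the following; the tests jar name and the -maps/-ops option spellings are assumptions about the standard slive entry point, so check them against your Hadoop build:

  # hypothetical example; adjust the jar name to your build
  hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar SliveTest \
    -D slive.use.new.algorithm=true \
    -maps 10 -ops 1000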

Run standard tests (existing behavior):

# Default uses original algorithm

Related Issues

  • Addresses test reliability issues in concurrent MapReduce scenarios
  • Resolves conflicts with other Hadoop test tools

Commit: Improve slive stress tests with path scanning and better file handling

Signed-off-by: zhengchen.zhao <zhengchen.zhao@iomesh.com>
    if (isExistingFileOperation(operationType)) {
      if (useNewAlgorithm) {
        LOG.info("Use new algorithm mode: scanning base_dir for " + operationType + " operation");
        scanBaseDirectory();

@NorthCedar commented Oct 28, 2025


A full path scan for each getFile operation will slow down the test

@zczhao0809 (Owner, Author) replied:

Yes, I also tried the following approach: short-term caching of directory scan results, for example reusing the previous scan results for 5 seconds, combined with an error-driven refresh that forces a rescan after three consecutive file_not_found errors. However, this method still had a high failure rate.
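
A minimal sketch of such a cache, assuming only the 5-second TTL and three-miss threshold from the description above; everything else (names, the Scanner hook) is invented for illustration:

  import java.io.IOException;
  import java.util.Collections;
  import java.util.List;

  import org.apache.hadoop.fs.Path;

  // Illustrative time- and error-driven cache; not code from this PR.
  final class CachedScan {
      interface Scanner { List<Path> scan() throws IOException; }

      private static final long TTL_MS = 5_000;   // reuse scan results for 5 s
      private static final int MISS_LIMIT = 3;    // force a rescan after 3 NOT_FOUNDs

      private List<Path> cached = Collections.emptyList();
      private long scannedAtMs = 0;
      private int consecutiveMisses = 0;

      synchronized List<Path> files(Scanner scanner) throws IOException {
          long now = System.currentTimeMillis();
          if (now - scannedAtMs > TTL_MS || consecutiveMisses >= MISS_LIMIT) {
              cached = scanner.scan();
              scannedAtMs = now;
              consecutiveMisses = 0;
          }
          return cached;
      }

      synchronized void recordNotFound() { consecutiveMisses++; }
      synchronized void recordHit() { consecutiveMisses = 0; }
  }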

Using the new algorithm, 5000 operations took approximately 16 minutes, with most operations succeeding. Using the original algorithm, 5000 operations took 3 minutes, with approximately 50% of the operations failing. This trade-off seems acceptable if accuracy, rather than performance, is the primary concern.

Do you have any suggestions for avoiding this problem?

if ("LS".equals(operationType)) {
if (useNewAlgorithm) {
LOG.info("Starting to scan base_dir and select existing directories for LS operation");
scanBaseDirectory();


ditto
