Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NC | NSFS | Glacier] Filter out failed recalls #8044

Merged

Conversation

tangledbytes
Copy link
Member

@tangledbytes tangledbytes commented May 13, 2024

Explain the changes

This PR fixes a major issue where NooBaa records the failures for retries but does NOT filters out the failed recalls while assigning extended attributes. This will make it seem to NSFS process that the object is restored and it will allow read which will result in a blocking read!

Testing Instructions:

  1. Run ./node_modules/.bin/mocha src/test/unit_tests/test_nsfs_glacier_backend.js
  • Doc added/updated
  • Tests added

@tangledbytes tangledbytes force-pushed the utkarsh/fix/tapecloud-recall branch 2 times, most recently from 440bda9 to 508f79b Compare May 13, 2024 15:00
@tangledbytes tangledbytes force-pushed the utkarsh/fix/tapecloud-recall branch from 508f79b to 203a0ad Compare May 13, 2024 15:03
@tangledbytes tangledbytes requested a review from guymguym May 13, 2024 16:14
@tangledbytes tangledbytes changed the title [NC | NSFS] Filter out failed recalls [NC | NSFS | Glacier] Filter out failed recalls May 13, 2024
@tangledbytes tangledbytes force-pushed the utkarsh/fix/tapecloud-recall branch from 203a0ad to 5b6fc3d Compare May 14, 2024 06:06
@tangledbytes tangledbytes requested a review from naveenpaul1 May 16, 2024 08:11
src/sdk/nsfs_glacier_backend/tapecloud.js Outdated Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Outdated Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Outdated Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Outdated Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Show resolved Hide resolved
@tangledbytes tangledbytes force-pushed the utkarsh/fix/tapecloud-recall branch 2 times, most recently from f502d35 to 0ed81fb Compare May 16, 2024 10:26
@y-sasn
Copy link

y-sasn commented May 23, 2024

Hi, Very quick review/comment.

The command "task show -r" displays success or failure counts at the head of each task.
We can't use the "--json" with "-r" unfortunately, so yes, parsing the result seems like a natural choice.
It should only show success/fail for the first column, so parsing the status with success/fail status with line.startsWith("Fail/Start"); seems ok too.

(It might be more concrete to run "eeadm file state" for each of the files given after the task has finished since the information we need is the actual status of the file, not the result of the recall command. Though I guess this could be handled in the future.)

What we need to care about is the definition of the "Success" and "Failed" tasks.
"Success" stands for recalled/already_premigrated/skipped_resident, so that seems ok as is.
"Fail" stands for duplicate/not_found/other_failure. "duplicate" might be an issue depending on how dup files are handled.

If NooBaa removes any dup file requests before handing it to EE's recall, we are good.
If NooBaa does hand dup files to EE, I recommend we add a logic to delete the dups prior to giving it to EE.
If that is not possible, we might need more work on post-proc of the "Fail"ed files.

@tangledbytes
Copy link
Member Author

Hi, Very quick review/comment.

The command "task show -r" displays success or failure counts at the head of each task.

We can't use the "--json" with "-r" unfortunately, so yes, parsing the result seems like a natural choice.

It should only show success/fail for the first column, so parsing the status with success/fail status with line.startsWith("Fail/Start"); seems ok too.

(It might be more concrete to run "eeadm file state" for each of the files given after the task has finished since the information we need is the actual status of the file, not the result of the recall command. Though I guess this could be handled in the future.)

What we need to care about is the definition of the "Success" and "Failed" tasks.

"Success" stands for recalled/already_premigrated/skipped_resident, so that seems ok as is.

"Fail" stands for duplicate/not_found/other_failure. "duplicate" might be an issue depending on how dup files are handled.

If NooBaa removes any dup file requests before handing it to EE's recall, we are good.

If NooBaa does hand dup files to EE, I recommend we add a logic to delete the dups prior to giving it to EE.

If that is not possible, we might need more work on post-proc of the "Fail"ed files.

@y-sasn, if we find a failure which has the error code of "duplicate task" then we don't consider it to be a failure and don't requeue the task. Does that sound okay?

@y-sasn
Copy link

y-sasn commented May 23, 2024

@tangledbytes Yes, if one of the other "dup" tasks fails due to a different reason, I'm guessing that that file will be queued. If that's the case I'm guessing we're good.

@tangledbytes
Copy link
Member Author

@tangledbytes Yes, if one of the other "dup" tasks fails due to a different reason, I'm guessing that that file will be queued. If that's the case I'm guessing we're good.

@y-sasn, yes if we find any other failure code then we requeue the item.

src/sdk/nsfs_glacier_backend/backend.js Outdated Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Outdated Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Show resolved Hide resolved
src/sdk/nsfs_glacier_backend/tapecloud.js Outdated Show resolved Hide resolved
@tangledbytes tangledbytes force-pushed the utkarsh/fix/tapecloud-recall branch 2 times, most recently from fac31c6 to 132d4e6 Compare May 23, 2024 12:33
@tangledbytes tangledbytes requested a review from guymguym May 23, 2024 12:38
@tangledbytes tangledbytes force-pushed the utkarsh/fix/tapecloud-recall branch from 132d4e6 to ccaa458 Compare May 27, 2024 06:32
@tangledbytes tangledbytes requested a review from guymguym May 27, 2024 06:36
@tangledbytes tangledbytes force-pushed the utkarsh/fix/tapecloud-recall branch 2 times, most recently from 1408f27 to 58a29c4 Compare May 27, 2024 10:37
Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

update comments

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

fix linting issue

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

revamp restore code and add tests

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

add more tests

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

add cleanup to the test

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

address self review comments

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

fix partial failure of finalize_restore

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

improving task_show output parsing

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

fix parsing edge case

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

fix formatting

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

add error logging

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>
@tangledbytes tangledbytes force-pushed the utkarsh/fix/tapecloud-recall branch from 58a29c4 to b1c3570 Compare May 27, 2024 14:54
@tangledbytes tangledbytes merged commit 6beca0b into noobaa:master May 27, 2024
10 checks passed
@guymguym guymguym added this to the 5.15.3 milestone May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants