
Mount fails because "LTFS17285E Failed to search the final index in IP (1)" even when ltfs can try to search on DP. #479

amissael95 opened this issue Aug 23, 2024 · 8 comments · May be fixed by #480

@amissael95
Contributor

amissael95 commented Aug 23, 2024

Describe the bug

When a tape cartridge with a write permanent error is mounted and the MAM (cartridge memory) attribute of the Index Partition (IP) stores a generation number lower than the MAM attribute of the Data Partition (DP), the mount process fails with error LTFS17285E even though ltfs could still search for the index in the DP.

The following log shows that scenario:

LTFS11005I Mounting the volume.
LTFS30252I Logical block protection is disabled.
LTFS11333I A cartridge with write-perm error is detected on IP. Seek the newest index (IP: Gen = 26, VCR = 152) (DP: Gen = 27, VCR = 252) (VCR = 180).
LTFS17283I Detected unmatched VCR value between MAM and VCR (152, 180).
LTFS17284I Seaching the final index in IP.
LTFS17285E Failed to search the final index in IP (1).
LTFS14013E Cannot mount the volume.

To Reproduce

  1. Select a tape with a write permanent error to be mounted.
  2. Look for message LTFS11333I and confirm that the IP generation is lower than the DP generation:
LTFS11333I A cartridge with write-perm error is detected on %s. Seek the newest index (IP: Gen = %llu, VCR = %llu) (DP: Gen = %llu, VCR = %llu) (VCR = %llu)
  3. The mount process fails with LTFS14013E because the search for the final index in the IP fails (LTFS17285E).

Note: This is hard to reproduce since, as mentioned above, the tape cartridge needs to have a write permanent error.

Expected behavior

It seems the issue can be solved by making _ltfs_search_index_wp (in ltfs/src/libltfs/ltfs.c) continue searching on the DP even if the search on the IP fails (this can be done by setting can_skip_ip = true).

ltfs/src/libltfs/ltfs.c, lines 1464 to 1507 at commit 7271446:

static inline int _ltfs_search_index_wp(bool recover_symlink, bool can_skip_ip,
                                        struct tc_position *seekpos, struct ltfs_volume *vol)
{
    int ret = 0;
    tape_block_t end_pos, index_end_pos;
    bool fm_after, blocks_after;

    ltfsmsg(LTFS_INFO, 17284I, "IP");
    ret = ltfs_seek_index(vol->label->partid_ip, &end_pos, &index_end_pos, &fm_after,
                          &blocks_after, recover_symlink, vol);
    if (ret) {
        if (can_skip_ip) {
            ltfsmsg(LTFS_INFO, 17289I);
            vol->ip_coh.count = 0;
            vol->ip_coh.set_id = 0;
        } else {
            ltfsmsg(LTFS_ERR, 17285E, "IP", ret);
            return -LTFS_INDEX_INVALID;
        }
    }

    ltfsmsg(LTFS_INFO, 17284I, "DP");
    ret = ltfs_seek_index(vol->label->partid_dp, &end_pos, &index_end_pos, &fm_after,
                          &blocks_after, recover_symlink, vol);
    if (ret < 0) {
        ltfsmsg(LTFS_ERR, 17285E, "DP", ret);
        return -LTFS_INDEX_INVALID;
    }

    /* Use the latest index on the tape */
    ltfsmsg(LTFS_INFO, 17288I,
            (unsigned long long)vol->ip_coh.count, (unsigned long long)vol->ip_coh.set_id,
            (unsigned long long)vol->dp_coh.count, (unsigned long long)vol->dp_coh.set_id);
    if (vol->ip_coh.count > vol->dp_coh.count) {
        seekpos->partition = ltfs_part_id2num(vol->label->partid_ip, vol);
        seekpos->block = vol->ip_coh.set_id;
    } else {
        seekpos->partition = ltfs_part_id2num(vol->label->partid_dp, vol);
        seekpos->block = vol->dp_coh.set_id;
    }

    return 0;
}

Additional context

This makes me ask: was there any reason to avoid searching the index on the Data Partition?

The "can_skip_ip" flag was explicitaly set to false in the following commit 3287850, was there any special reason to do that?

@piste-jp
Member

It looks like a bug.

The blocks

ltfs/src/libltfs/ltfs.c, lines 1661 to 1663 at commit 7271446:

ret = _ltfs_search_index_wp(recover_symlink, false, &seekpos, vol);
if (ret < 0)
    goto out_unlock;

and

ltfs/src/libltfs/ltfs.c, lines 1690 to 1693 at commit 7271446:

/* Index of IP could be corrupted. So set skip flag */
ret = _ltfs_search_index_wp(recover_symlink, true, &seekpos, vol);
if (ret < 0)
    goto out_unlock;

shall be swapped.

The upper code belongs to the logic that handles a WP that happens on the IP. So the index on the IP might be corrupted, and thus the skip flag shall be true.

But the lower code belongs to the logic that handles a WP that happens on the DP. The index shall be searched from the IP, so the skip flag shall be false.
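
In other words, after the swap the two call sites would look roughly like this (a sketch against commit 7271446; only the boolean argument changes):

/* WP happened on IP (around line 1661): the index on the IP might be
 * corrupted, so allow the search to skip the IP and fall back to the DP. */
ret = _ltfs_search_index_wp(recover_symlink, true, &seekpos, vol);
if (ret < 0)
    goto out_unlock;

/* WP happened on DP (around line 1690): the index shall be searched
 * from the IP, so do not allow skipping it. */
ret = _ltfs_search_index_wp(recover_symlink, false, &seekpos, vol);
if (ret < 0)
    goto out_unlock;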

@amissael95
Contributor Author

Hello @piste-jp,

Thanks for the quick response. I am curious: could we just remove the "can_skip_ip" flag and let the _ltfs_search_index_wp function try to search the index on both the IP and the DP?

In the end, the logic consists of using the latest index on the tape, so it would not hurt to simply try to search the index on both partitions, zero out the coherency info of any partition whose search fails, and use the latest index, as in the sketch below.
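
To make the idea concrete, I mean something roughly like this (an untested sketch derived from the function quoted above; the name _ltfs_search_index_wp_both is hypothetical, and the ltfsmsg logging is omitted for brevity):

/* Search both partitions, zero out the coherency info of any partition
 * whose search fails, and give up only if both searches fail. */
static inline int _ltfs_search_index_wp_both(bool recover_symlink,
                                             struct tc_position *seekpos, struct ltfs_volume *vol)
{
    int ret_ip, ret_dp;
    tape_block_t end_pos, index_end_pos;
    bool fm_after, blocks_after;

    ret_ip = ltfs_seek_index(vol->label->partid_ip, &end_pos, &index_end_pos,
                             &fm_after, &blocks_after, recover_symlink, vol);
    if (ret_ip) {
        vol->ip_coh.count = 0;
        vol->ip_coh.set_id = 0;
    }

    ret_dp = ltfs_seek_index(vol->label->partid_dp, &end_pos, &index_end_pos,
                             &fm_after, &blocks_after, recover_symlink, vol);
    if (ret_dp) {
        vol->dp_coh.count = 0;
        vol->dp_coh.set_id = 0;
    }

    /* Fail only when neither partition yields an index */
    if (ret_ip && ret_dp)
        return -LTFS_INDEX_INVALID;

    /* Use the latest index on the tape */
    if (vol->ip_coh.count > vol->dp_coh.count) {
        seekpos->partition = ltfs_part_id2num(vol->label->partid_ip, vol);
        seekpos->block = vol->ip_coh.set_id;
    } else {
        seekpos->partition = ltfs_part_id2num(vol->label->partid_dp, vol);
        seekpos->block = vol->dp_coh.set_id;
    }
    return 0;
}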

Regards

@piste-jp
Member

Could we just remove the "can_skip_ip" flag and let the _ltfs_search_index_wp function try to search the index on both the IP and the DP?

I believe it's a little bit dangerous, because the block starting at L1680 means the tape says the IP has the latest index. So an index must exist on the IP, at least. Why would we provide a skip flag there, or obsolete the skip flag and always allow the skip?

ltfs/src/libltfs/ltfs.c, lines 1680 to 1695 at commit 7271446:

if (volume_change_ref != vol->ip_coh.volume_change_ref) {
    /*
     * Cannot trust the index info on MAM, search the last indexes
     * This would happen when the drive returns an error against acquiring the VCR
     * while write error handling.
     */
    ltfsmsg(LTFS_INFO, 17283I,
            (unsigned long long)vol->dp_coh.volume_change_ref,
            (unsigned long long)volume_change_ref);
    /* Index of IP could be corrupted. So set skip flag */
    ret = _ltfs_search_index_wp(recover_symlink, true, &seekpos, vol);
    if (ret < 0)
        goto out_unlock;
} else {

Your proposal might relax the acceptable tape condition a little bit, but it just ignores unexpected behavior of the tape drive or of LTFS itself. I believe we need to understand why this happens, if it really happens, and fix it correctly. Your proposal would just hide that fact without any knowledge of the cause.

I believe it's not the time to do that yet.

amissael95 linked a pull request Aug 27, 2024 that will close this issue
@amissael95
Contributor Author

@piste-jp,

I have created PR #480 with the modifications that you pointed out.

Do you think we can ensure that the change will not break the tape, and that it is safe to implement? I am currently trying to replicate this scenario using itdt... I think the only problem would be if we wrote incorrect index data onto the tape.

In addition, it is worth emphasizing that this involves a "data loss" scenario, since the index found may not point to all the files on the tape.

Regards

piste-jp linked a pull request Aug 27, 2024 that will close this issue
piste-jp added the bug label Aug 27, 2024
@piste-jp
Member

piste-jp commented Aug 27, 2024

I have created PR #480 with the modifications that you pointed out.

Do you think we can ensure that the change will not break the tape, and that it is safe to implement? I am currently trying to replicate this scenario using itdt... I think the only problem would be if we wrote incorrect index data onto the tape.

For PR discussion, please use the comment thread on the PR. Let's use #480.

In addition, it is worth emphasizing that this involves a "data loss" scenario, since the index found may not point to all the files on the tape.

I cannot understand this ... Why?

@amissael95
Contributor Author

I cannot understand this ... Why?

What I meant is that, because of the write perm error, I am not sure we can trust the state of the indexes on the tape. According to the LTFS standard v2.4:

A volume that has been locked because of a permanent write error "shall be mounted as read-only using the highest generation index available on the tape in either partition".

Is it possible that the highest index found on the tape corresponds to a previous generation and therefore does not reference the latest files on the tape?

Could you confirm whether, after a write perm error, if the latest index is successfully found in either partition, that index will always point to the latest files on the tape?

Really appreciate your support

Regards

@piste-jp
Member

First of all, "data lost" or "data loss" is a really strong term for storage engineers. It must be used only when data that was once written on a medium disappears unexpectedly for some reason. So we should say this is a data loss only when it happens because of a bug in LTFS's logic.

In this case, it is clear that your scenario is not a data loss problem at all, because LTFS never writes (or overwrites) anything during a read-only mount.

Second, it looks like you identified this scenario by reading through only the mount-process logic. I believe that is not the correct approach; you need to understand the implementation of the write side as well.

Long story short, when LTFS gets a write perm from the drive, it writes an index to the other partition, writes the current index information to the MAM, and marks the tape as a single write-perm tape. So the latest index shall be readable by the tape drive at mount time, as outlined below.
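
Schematically (pseudo-C for illustration only; the helper names below are placeholders I'm using to summarize, not the real libltfs functions):

/* Placeholder declarations, not the real libltfs API. */
struct ltfs_volume;
void write_index_to_other_partition(struct ltfs_volume *vol);
void update_index_info_on_mam(struct ltfs_volume *vol);
void mark_as_single_write_perm(struct ltfs_volume *vol);

/* The write-side sequence described above, schematically. */
static void handle_write_perm(struct ltfs_volume *vol)
{
    /* 1. Write the current index to the other (still writable) partition. */
    write_index_to_other_partition(vol);

    /* 2. Record the current index information in the MAM so the next
     *    mount knows which index is the newest. */
    update_index_info_on_mam(vol);

    /* 3. Mark the cartridge as a single write-perm tape; from now on it
     *    mounts read-only. */
    mark_as_single_write_perm(vol);
}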

Is it possible that the highest index found on the tape corresponds to a previous generation and therefore does not reference the latest files on the tape?

The drive returned a GOOD response after writing the latest index on the tape, so LTFS marked it as a single write-perm tape. From a specification point of view, the drive must either find the latest index correctly or return a read perm error.

Could you confirm whether, after a write perm error, if the latest index is successfully found in either partition, that index will always point to the latest files on the tape?

I can review it if you provide such code. But honestly, I'm not sure I need to, because:

  1. The required information is already logged.
  2. The final result (mounting with the index found by this scan) may be the same.

I believe that reporting a mount error (and failing) when the generation of the index that was read matches the one on the MAM is of no benefit to users.

@perezle

perezle commented Sep 10, 2024

Nice talking to you, @piste-jp. Yes, we will take care of the pull request. Thanks!
