
Metadata not rebuilt after replacing one store and hot reloading #131

Open
scottyeager opened this issue Nov 22, 2024 · 17 comments · May be fixed by #134

@scottyeager

scottyeager commented Nov 22, 2024

I am trying to test this behavior, as described in the README:

A metadata store can be replaced by a new one, by removing the old one in the config and inserting the new one. The repair subsystem will take care of rebuilding the data, regenerating the shards, and storing the new shards on the new metadata store.

As I understand it, these steps should be sufficient:

  1. Start zstor with an initial config and write some data
  2. Replace the config file with a new one in which three of the four data backends and three of the four metadata backends are unchanged and one of each is replaced (one original backend of each type has been decommissioned, so its namespace is deleted)
  3. Do a hot reload of the config with SIGUSR1

Here's my status output after step 1:

+----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                        | reachable | objects | used space | free space | usage percentage |
+================================================================+===========+=========+============+============+==================+
| [2a02:1802:5e:0:e8f8:dcff:fe44:6b2]:9900 - 5545-708153-meta10  | Yes       |      44 |      20380 | 1073741824 |                0 |
+----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [2a02:1802:5e:0:8c0e:bfff:fe33:fa8b]:9900 - 5545-708150-meta11 | Yes       |      44 |      20380 | 1073741824 |                0 |
+----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [2a02:1802:5e:0:107d:6dff:fe38:c095]:9900 - 5545-708152-meta13 | Yes       |      44 |      20380 | 1073741824 |                0 |
+----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [2a02:1802:5e:0:a89b:bfff:fec4:5205]:9900 - 5545-708151-meta8  | Yes       |      44 |      20380 | 1073741824 |                0 |
+----------------------------------------------------------------+-----------+---------+------------+------------+------------------+

Then after step 3:

+----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                        | reachable | objects | used space | free space | usage percentage |
+================================================================+===========+=========+============+============+==================+
| [2a02:1802:5e:0:e8f8:dcff:fe44:6b2]:9900 - 5545-708153-meta10  | Yes       |      44 |      20380 | 1073741824 |                0 |
+----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [2a02:1802:5e:0:8c0e:bfff:fe33:fa8b]:9900 - 5545-708150-meta11 | Yes       |      44 |      20380 | 1073741824 |                0 |
+----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [2a02:1802:5e:0:107d:6dff:fe38:c095]:9900 - 5545-708152-meta24 | Yes       |       0 |          0 | 1073741824 |                0 |
+----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [2a02:1802:5e:0:a89b:bfff:fec4:5205]:9900 - 5545-708151-meta8  | Yes       |      44 |      20380 | 1073741824 |                0 |
+----------------------------------------------------------------+-----------+---------+------------+------------+------------------+

Even after some hours of waiting for the repair system to kick in and recreate the metadata shards on the new backend, there are no objects and no space used on the new backend.

I also noticed that zstor rejects writes in this state, indicating that it doesn't detect that it has four healthy metadata stores for writing new data.

@scottyeager
Author

scottyeager commented Nov 23, 2024

See #132 (comment). After writing new data with the new config in place, the metadata is rebuilt as expected.

I need to double check the bit about zstor rejecting writes, as that doesn't seem consistent. I was able to perform writes after trying to recreate the same conditions from the test that prompted this issue.

@iwanbk
Member

iwanbk commented Nov 25, 2024

From checking the code a bit, there is indeed no logic that triggers a rebuild based on the metastore backends; the detection is based only on the data backends.

cc @LeeSmet
This is what I mean:

```rust
// Query the backend manager for the state of the backends referenced by the
// object's shards. Note: backend_requests only covers the *data* backends,
// so the state of the metadata backends never influences this decision.
let backends = match backend_manager
    .send(RequestBackends {
        backend_requests,
        interest: StateInterest::Readable,
    })
    .await
{
    Err(e) => {
        error!("Failed to request backends: {}", e);
        return;
    }
    Ok(backends) => backends,
};
// A rebuild is only triggered when one of those data backends is missing or
// not readable.
let must_rebuild = backends.into_iter().any(|b| !matches!(b, Ok(Some(_))));
```
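For illustration, here is a minimal sketch of the missing piece: compare the metadata backends of the old and new config on a reload and flag a metadata rebuild when they differ. The names (`needs_meta_rebuild`, plain `SocketAddr` identifiers) are hypothetical and not zstor's actual API; in zstor a backend is identified by more than an address, but the set comparison is the relevant idea.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Hypothetical helper: decide whether a metadata rebuild is needed after a
/// config reload, based purely on the configured metadata backend addresses.
fn needs_meta_rebuild(
    old_meta_backends: &[SocketAddr],
    new_meta_backends: &[SocketAddr],
) -> bool {
    let old: HashSet<_> = old_meta_backends.iter().collect();
    let new: HashSet<_> = new_meta_backends.iter().collect();
    // Any backend that was removed or added means the metadata shards are no
    // longer stored on the full configured set, so a rebuild should be queued.
    old != new
}
```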

@iwanbk
Member

iwanbk commented Nov 25, 2024

I need to double check the bit about zstor rejecting writes, as that doesn't seem consistent. I was able to perform writes after trying to recreate the same conditions from the test that prompted this issue.

Do you mean you could write when one of the metastores was down?

@scottyeager
Author

Do you mean you could write when one of the metastores was down?

No, that part is okay.

I also noticed that zstor rejects writes in this state, indicating that it doesn't detect that it has four healthy metadata stores for writing new data.

I was referring to this condition: writes were rejected even though a full set of live metadata backends was present after a hot reload of the config. It only happened once and I'm not sure how to reproduce it reliably.

To summarize, these are the concerns:

  1. After replacing one meta and one data backend in the manner I described, repair operations for both are blocked until something happens. Making a new write unblocks them, but there may be other events that do so as well
  2. In one case I also wasn't able to make new writes, despite reloading the config with four live meta backends (this is more of a side note for now, pending reproducibility)

@iwanbk
Member

iwanbk commented Nov 26, 2024

Found a few things that we could improve:

  1. When replacing meta backends

     Needs to be implemented.

  2. When replacing data backends

     Same as for meta backends, it is not implemented yet.
     But it is already handled by the periodic checking, so perhaps we don't need it at all.

  3. Periodic checking

     Some improvements here:
  • get all the keys from the metadata: this currently returns a Vector, which does not scale if the number of keys is large
  • checking the dead data backends: currently the same backend can be checked multiple times, which is a waste of time (see the sketch after this list)
  • not sure how, but detecting replaced meta backends should be doable as well
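On the duplicate-checking bullet above, a minimal sketch (illustrative types only, not zstor's own) of collapsing the backends referenced by all shards into a distinct set before probing them:

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Collect the distinct set of backends referenced by the shards, so each
/// backend is probed at most once per periodic check.
fn backends_to_check(shard_locations: &[SocketAddr]) -> Vec<SocketAddr> {
    let mut seen = HashSet::new();
    shard_locations
        .iter()
        .copied()
        .filter(|addr| seen.insert(*addr))
        .collect()
}
```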

The periodic checking is currently run every 10 minutes, which might be overkill.

I'm thinking along these lines:

  • The main mechanism would be (1) and (2)
  • The periodic checking frequency could be reduced to once a day

wdyt @LeeSmet ?

@scottyeager
Author

I think it definitely makes sense to trigger a repair whenever the backends are updated.

As for the ten-minute cycle, I'd say maybe we can reduce it to one hour, but a day feels too long.

@iwanbk
Member

iwanbk commented Nov 26, 2024

As for the ten-minute cycle, I'd say maybe we can reduce it to one hour, but a day feels too long.

Actually, if (1) and (2) work properly, we should not need this periodic checking at all.
I haven't proposed removing it because I'm still not sure I know the whole architecture.

@scottyeager
Author

The periodic check will always be relevant, I think. The reason is that the user can provide more data backends than expected shards. So there can still be a case where some backend goes down and there's a chance to restore the shard count for the affected files by rebuilding them on the remaining data backends.
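To make that reasoning concrete, a small hypothetical example (the numbers and the function are made up purely for illustration, assuming each shard of a file is placed on a distinct backend):

```rust
/// Hypothetical check: can the shard count for a file be restored after some
/// backends went down, given how many data backends are still healthy?
fn can_restore_shards(expected_shards: usize, healthy_backends: usize) -> bool {
    // Each shard of a file lives on a distinct backend, so we need at least
    // `expected_shards` healthy backends to hold a full set again.
    healthy_backends >= expected_shards
}

fn main() {
    // 6 configured backends, 1 down: 5 healthy >= 4 expected shards -> rebuild possible.
    assert!(can_restore_shards(4, 5));
    // Only 3 healthy backends left: a full set of 4 shards can no longer be placed.
    assert!(!can_restore_shards(4, 3));
}
```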

@iwanbk
Member

iwanbk commented Nov 26, 2024

So there can still be a case where some backend goes down and there's a chance to restore the shard count for the affected files by rebuilding them on the remaining data backends.

In this case, 10 minutes is definitely too fast.
Imagine that a node is temporarily down, we rebuild the data, and then it comes back up again.

So I think the definition of "down" should be more precise, e.g. down for 1 hour?

@scottyeager
Author

Agreed. I think adding this time period as a config option with a one hour default makes sense.
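A minimal sketch of what such an option could look like, assuming an Instant-based tracker inside the periodic checker. The field name `backend_down_threshold` and the surrounding types are made up for illustration; only the one-hour default mirrors the suggestion above.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Hypothetical config knob: how long a backend must stay unreachable before
/// the periodic checker treats it as down and triggers a rebuild.
struct RepairConfig {
    backend_down_threshold: Duration,
}

impl Default for RepairConfig {
    fn default() -> Self {
        // One-hour default, as discussed above.
        Self { backend_down_threshold: Duration::from_secs(3600) }
    }
}

/// Track when each backend was first seen unreachable; only report it as down
/// once it has stayed unreachable longer than the configured threshold.
struct DowntimeTracker {
    first_unreachable: HashMap<SocketAddr, Instant>,
}

impl DowntimeTracker {
    fn is_down(&mut self, backend: SocketAddr, reachable: bool, cfg: &RepairConfig) -> bool {
        if reachable {
            // Backend recovered: forget any previous outage.
            self.first_unreachable.remove(&backend);
            return false;
        }
        let since = *self.first_unreachable.entry(backend).or_insert_with(Instant::now);
        since.elapsed() >= cfg.backend_down_threshold
    }
}
```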

@LeeSmet
Contributor

LeeSmet commented Nov 26, 2024

2. When replacing data backends
Need to be implemented

This is actually not needed (unless the data backend is down, in which case it's picked up by the periodic rebuilder, though you could manually trigger it). The reasoning here is as follows: the location of the shards (i.e. the 0-db backend) is recorded in the object metadata. Therefore, as long as the backend is reachable, the data can be recovered, even if said backend is no longer part of the running config, and thus it does not need to be rebuilt. The running config is relevant for stores, but not for loads.

For metadata backends we should rebuild the data if they are replaced, since we don't have a separate reference, so here the running config is relevant for both stores and loads.

The vector is indeed not good, and should be replaced by a Stream or channel.
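A rough sketch of the channel-based variant, assuming tokio and tokio-stream; `ObjectMeta` and `load_meta_page` are stand-ins for zstor's real metadata type and lookup, not its actual API:

```rust
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;
use tokio_stream::Stream;

/// Stand-in for zstor's object metadata type.
#[derive(Debug, Clone)]
struct ObjectMeta {
    key: String,
}

/// Instead of collecting every key into a Vec up front, stream them out of a
/// bounded channel so the caller can process metadata incrementally.
fn object_metas_stream() -> impl Stream<Item = ObjectMeta> {
    let (tx, rx) = mpsc::channel(128);
    tokio::spawn(async move {
        // Hypothetical paged lookup against the metadata store; each page is
        // forwarded as soon as it is fetched. Sending fails if the consumer
        // dropped the stream, in which case we simply stop.
        for page in 0.. {
            let metas = load_meta_page(page).await;
            if metas.is_empty() {
                break;
            }
            for meta in metas {
                if tx.send(meta).await.is_err() {
                    return;
                }
            }
        }
    });
    ReceiverStream::new(rx)
}

/// Placeholder for the actual metadata-store lookup.
async fn load_meta_page(_page: u64) -> Vec<ObjectMeta> {
    Vec::new()
}
```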

Periodic check frequency can be reduced, but if the period is longer we should make sure it still runs even across multiple process restarts.

@iwanbk
Member

iwanbk commented Nov 26, 2024

This is actually not needed (unless the data backend is down, in which case it's picked up by the periodic rebuilder, though you could manually trigger it). The reasoning here is as follows: the location of the shards (i.e. the 0-db backend) is recorded in the object metadata.

Yes, on second thought, after more checking, I also don't think it is needed; periodic checking will do it anyway, and the procedure will be the same.

Therefore, as long as the backend is reachable, the data can be recovered, even if said backend is no longer part of the running config, and thus it does not need to be rebuilt. The running config is relevant for stores, but not for loads.

Oh, I didn't know about this. More reason to only do it in the periodic checking.

@iwanbk
Member

iwanbk commented Nov 26, 2024

For metadata backends we should rebuild the data if they are replaced, since we don't have a separate reference, so here the running config is relevant for both stores and loads.

I was thinking about which keys need to be rebuilt, but then I realized that it should be ALL the keys, CMIIW @LeeSmet

@iwanbk
Member

iwanbk commented Nov 27, 2024

To summarize, this is the plan:

  1. For data backends, improve periodic checking:
    a. replace object_metas (the function that gets all metadata), which returns a Vector, with something that returns a Stream
    b. avoid duplicated backend checking
    c. make the period longer; 10 minutes is too fast, especially for big storage

  2. For meta backends, implement a metadata rebuild when the meta backends are replaced

Today I tried to implement 1.a, but faced many difficulties.
At first I thought it was because I'm still not advanced in the Rust language, but after some thinking, implementing it as a stream is indeed tricky, because the related backends could go away at any time.
So I think I'll change 1.a to use Redis SCAN and expose the cursor to the caller.
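For reference, a minimal sketch of that cursor-based approach with the `redis` crate's SCAN command. This assumes the metadata backend can be scanned with standard Redis SCAN semantics; the function name and the COUNT hint of 100 are illustrative choices, not zstor's actual code.

```rust
use redis::aio::MultiplexedConnection;
use redis::RedisResult;

/// One page of a cursor-based key scan: returns the keys found plus the cursor
/// to pass back in on the next call. A returned cursor of 0 means the scan is done.
async fn scan_meta_keys(
    con: &mut MultiplexedConnection,
    cursor: u64,
) -> RedisResult<(u64, Vec<Vec<u8>>)> {
    redis::cmd("SCAN")
        .arg(cursor)
        .arg("COUNT")
        .arg(100)
        .query_async(con)
        .await
}
```

Exposing the cursor to the caller keeps the function stateless, so the periodic checker can resume or abandon a scan at any point without holding a long-lived stream open against a backend that might disappear.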

@iwanbk linked a pull request Nov 29, 2024 that will close this issue
@iwanbk
Member

iwanbk commented Nov 29, 2024

For meta backends, implement a metadata rebuild when the meta backends are replaced

I've fixed this on #134

@iwanbk
Member

iwanbk commented Nov 29, 2024

For data backends, improve periodic checking:
a. replace object_metas (the function that gets all metadata), which returns a Vector, with something that returns a Stream
b. avoid duplicated backend checking
c. make the period longer; 10 minutes is too fast, especially for big storage

For this, it looks better to create a new issue.
Created #136.

@iwanbk
Member

iwanbk commented Nov 29, 2024

As a side note, we also have to make sure that only one rebuild runs at a time, tracked in #135.
