fix: bug in distributed wal(obkv, kafka) #422

Merged: 6 commits, Dec 1, 2022


@Rachelint Rachelint commented Nov 26, 2022

Which issue does this PR close?

Closes #441

Rationale for this change

If a shard (which is mapped to a region in wal) is moved between nodes like this:

A --> B --> ... --> A

Then the wal module on node A can't tell whether the shard has been moved.
This may cause a serious bug:

  • the shard's wal meta information may be modified on other nodes.
  • when the shard is moved back to A (the original node), A doesn't know about the move and assumes the meta information is the same as before.
  • A persists the old meta information to storage and overwrites the new meta information persisted by the other nodes.
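
The overwrite can be reproduced with a small model. This is only a sketch under assumptions: `WalMeta`, `Node`, and `replay_move` are hypothetical names invented for illustration, not types from the wal crate, and a `HashMap` stands in for the shared storage. The cache is keyed by shard id only, which is what lets node A reuse stale state.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a shard's wal meta information.
#[derive(Clone, Debug, PartialEq)]
struct WalMeta {
    next_sequence: u64,
}

/// Hypothetical node whose in-memory cache is keyed by shard id only.
struct Node {
    cache: HashMap<u64, WalMeta>,
}

impl Node {
    fn new() -> Self {
        Node { cache: HashMap::new() }
    }

    /// Open a shard: reuse the cached meta if present, otherwise load it from storage.
    fn open(&mut self, shard_id: u64, storage: &HashMap<u64, WalMeta>) {
        if !self.cache.contains_key(&shard_id) {
            let meta = storage
                .get(&shard_id)
                .cloned()
                .unwrap_or(WalMeta { next_sequence: 0 });
            self.cache.insert(shard_id, meta);
        }
    }

    /// Advance the cached meta and persist it to the shared storage.
    fn write(&mut self, shard_id: u64, storage: &mut HashMap<u64, WalMeta>) {
        let meta = self.cache.get_mut(&shard_id).expect("shard must be opened first");
        meta.next_sequence += 1;
        storage.insert(shard_id, meta.clone());
    }
}

/// Replays the move A --> B --> A and returns the meta left in storage.
fn replay_move() -> WalMeta {
    let mut storage = HashMap::new();
    let (mut a, mut b) = (Node::new(), Node::new());
    let shard = 1;

    a.open(shard, &storage);
    a.write(shard, &mut storage); // storage now holds next_sequence = 1

    b.open(shard, &storage); // shard moved to B, which loads 1 from storage
    b.write(shard, &mut storage);
    b.write(shard, &mut storage); // storage now holds next_sequence = 3

    a.open(shard, &storage); // shard moved back; A still has its old cache entry
    a.write(shard, &mut storage); // A persists 2 and overwrites B's 3

    storage[&shard].clone()
}
```

In this model `replay_move()` leaves `next_sequence` at 2 in storage even though node B had already persisted 3, which is the overwrite described above.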

What changes are included in this PR?

1. For rocksdb and obkv impls

before:

+----------+         +----------+                     
| table id |-------->|table unit|                     
+----------+         +----------+                     

now:

+----------+----------+----------+        +----------+
| region   | region   |  table   | ------>|  table   |
| version  |   id     |   id     |        |   unit   |
+----------+----------+----------+        +----------+

Before, a table unit in the wal manager was indexed only by table id, so the wal manager knew nothing about the shard id and shard version (shards are mapped to regions). When the shard id or shard version of a table changed, the wal manager couldn't perceive it.
Now, a table unit is indexed by region version + region id + table id, so the wal manager can perceive any such change.
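
A minimal sketch of the composite index follows. `TableUnitKey`, `TableUnit`, `TableUnits`, and `reopen_after_move` are hypothetical names for illustration and are not claimed to match the code in wal/src. The `load` closure stands in for reading the persisted meta from storage.

```rust
use std::collections::HashMap;

/// Hypothetical composite key mirroring the diagram:
/// region version + region id + table id.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct TableUnitKey {
    region_version: u64,
    region_id: u64,
    table_id: u64,
}

/// Hypothetical stand-in for a table unit's cached state.
#[derive(Clone, Debug, PartialEq)]
struct TableUnit {
    next_sequence: u64,
}

/// Hypothetical cache of table units held by one node.
#[derive(Default)]
struct TableUnits {
    units: HashMap<TableUnitKey, TableUnit>,
    /// Counts how many times a unit had to be loaded from storage.
    loads: usize,
}

impl TableUnits {
    /// Return the cached unit for `key`, calling `load` only on a cache miss.
    fn get_or_load<F>(&mut self, key: TableUnitKey, load: F) -> &mut TableUnit
    where
        F: FnOnce() -> TableUnit,
    {
        let loads = &mut self.loads;
        self.units.entry(key).or_insert_with(|| {
            *loads += 1;
            load()
        })
    }
}

/// Opens the same table under two region versions and returns
/// (sequence seen after the move, number of loads from storage).
fn reopen_after_move() -> (u64, usize) {
    let mut units = TableUnits::default();

    let v1 = TableUnitKey { region_version: 1, region_id: 7, table_id: 42 };
    units.get_or_load(v1, || TableUnit { next_sequence: 1 });

    // The shard moved away and came back with a higher version, so the
    // lookup misses the old entry and the state is loaded again.
    let v2 = TableUnitKey { region_version: 2, ..v1 };
    let seq = units.get_or_load(v2, || TableUnit { next_sequence: 3 }).next_sequence;

    (seq, units.loads)
}
```

Because the version is part of the key, a version bump can never hit the entry cached before the move. The cost is that old entries stay in the map until something removes them; this sketch does not clean them up.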

2. For kafka impl

before:

+----------+         +----------+                     
|region id |-------> |  region  |                     
+----------+         +----------+                     

now:

+----------+----------+          +----------+
| region   | region   |--------->|  region  |
| version  |   id     |          |          |
+----------+----------+          +----------+

Similarly, a region in the kafka impl can now perceive the shard version.
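
The same idea for a region index can be sketched as below. `RegionKey`, `Region`, `Regions`, and `reopen_region` are hypothetical names, and evicting entries cached under other versions of the same region is an assumption made for this sketch, not something the PR description states.

```rust
use std::collections::HashMap;

/// Hypothetical key mirroring the diagram: region version + region id.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct RegionKey {
    region_version: u64,
    region_id: u64,
}

/// Hypothetical stand-in for a region's in-memory state.
#[derive(Debug, Default)]
struct Region {
    high_watermark: u64,
}

/// Hypothetical set of regions opened on one node.
#[derive(Default)]
struct Regions {
    regions: HashMap<RegionKey, Region>,
}

impl Regions {
    /// Open a region under `key`, starting from fresh state on a miss.
    fn open(&mut self, key: RegionKey) -> &mut Region {
        // Assumption: drop state cached under any other version of this region.
        self.regions
            .retain(|k, _| k.region_id != key.region_id || k.region_version == key.region_version);
        self.regions.entry(key).or_default()
    }
}

/// Reopens region 7 under a newer version and returns
/// (watermark seen after reopening, number of cached regions).
fn reopen_region() -> (u64, usize) {
    let mut regions = Regions::default();

    let old = RegionKey { region_version: 1, region_id: 7 };
    regions.open(old).high_watermark = 10;

    // A newer version does not see the state cached under version 1.
    let new = RegionKey { region_version: 2, region_id: 7 };
    let hw = regions.open(new).high_watermark;

    (hw, regions.regions.len())
}
```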

Are there any user-facing changes?

None.

How is this change tested?

Tested by unit tests.

@Rachelint Rachelint changed the title from "fix: distributed bug in wal on obkv" to "fix: bug in distributed wal(obkv, kafka)" Nov 26, 2022
@chunshao90

Please describe the solution in What changes are included in this PR?.

@chunshao90

I don't think the current tests are sufficient to cover this bug.

@chunshao90

The amount of code modification is large. Please tell me where the core modified code is.

@Rachelint Rachelint force-pushed the fix-distributed-bug-in-wal-on-obkv branch from 272ee34 to 1a0c6aa Compare November 28, 2022 11:04
@Rachelint

> Please describe the solution in What changes are included in this PR?.

Done.

@Rachelint

> I don't think the current tests are sufficient to cover this bug.

Done.

@Rachelint

Rachelint commented Nov 28, 2022

> The amount of code modification is large. Please tell me where the core modified code is.

The main changes are in wal/src.

Review threads (resolved):
analytic_engine/src/table/data.rs (outdated)
wal/src/manager.rs
wal/src/tests/read_write.rs (outdated)

@jiacai2050 jiacai2050 left a comment


LGTM

@Rachelint Rachelint merged commit 4d39868 into apache:main Dec 1, 2022
chunshao90 pushed a commit to chunshao90/ceresdb that referenced this pull request May 15, 2023
* change `shard_id` in `TableData` to `shard_info`

* modify wal to adapt adding shard version.

* add test for making wal perceive shard version.

* add more logs in key path to help debug.

* change the mapping from `shard version -> region version` to `cluster version -> region version`.

* address CR.
@Rachelint Rachelint deleted the fix-distributed-bug-in-wal-on-obkv branch May 27, 2023 12:20
Successfully merging this pull request may close these issues.

Serious meta information overwritten bug of distributed wal in cluster mode
3 participants