Feat/multi group snapshot #42

Merged: 25 commits into master, Apr 6, 2019

Conversation

fengjiachun (Contributor) commented on Mar 15, 2019:

Background:

#37

rheaKV is a multi-raft-group design, but the KV storage is shared. Previously, a snapshot captured the full data of the current node (Store), which has several problems:

  1. Many duplicate copies of the data: local storage is wasted, and shipping the snapshot over the network during install snapshot is also a large cost.
  2. Every region periodically triggers a snapshot, each one covering the full data, and reads and writes are blocked while a snapshot runs, so a single region taking a snapshot can leave the whole node unable to serve for a while.

The improvement:

  1. For RocksRawKVStore, use sst files: generate an sst file for the region's key range, and snapshot load simply ingests that sst file (see the sketches after this list).
  2. For MemoryRawKVStore, since it is a skip-list store, it is easy to subMap out exactly the data the region contains from its key range.
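
To make item 1 concrete, here is a minimal sketch against the RocksDB Java API (SstFileWriter and ingestExternalFile are standard org.rocksdb calls; the class, helper names, and region bounds around them are hypothetical, and a real implementation must also handle empty ranges and null bounds):

    import java.util.Collections;

    import org.rocksdb.EnvOptions;
    import org.rocksdb.IngestExternalFileOptions;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;
    import org.rocksdb.RocksIterator;
    import org.rocksdb.SstFileWriter;

    public class RegionSstSnapshotSketch {

        // Writes every key in [startKey, endKey) to a standalone sst file.
        // Note: finish() fails on an empty file, which a real implementation
        // has to guard against.
        static void writeRegionSst(final RocksDB db, final byte[] startKey, final byte[] endKey,
                                   final String sstPath) throws RocksDBException {
            try (EnvOptions envOpts = new EnvOptions();
                 Options opts = new Options();
                 SstFileWriter writer = new SstFileWriter(envOpts, opts);
                 RocksIterator it = db.newIterator()) {
                writer.open(sstPath);
                for (it.seek(startKey); it.isValid(); it.next()) {
                    final byte[] key = it.key();
                    if (compare(key, endKey) >= 0) {
                        break; // past the region's range
                    }
                    writer.put(key, it.value());
                }
                writer.finish();
            }
        }

        // Loading the snapshot is just ingesting the sst file back into the db.
        static void ingestRegionSst(final RocksDB db, final String sstPath) throws RocksDBException {
            try (IngestExternalFileOptions ingestOpts = new IngestExternalFileOptions()) {
                db.ingestExternalFile(Collections.singletonList(sstPath), ingestOpts);
            }
        }

        // Unsigned lexicographic comparison, matching RocksDB's default key order.
        static int compare(final byte[] a, final byte[] b) {
            final int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                final int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
                if (diff != 0) {
                    return diff;
                }
            }
            return a.length - b.length;
        }
    }

And for item 2, a tiny sketch of slicing out a region view with ConcurrentSkipListMap.subMap (field and method names hypothetical):

    import java.util.NavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class RegionSubMapSketch {

        // A live view of [startKey, endKey): exactly the region's data, ready
        // to be written out for the snapshot. Assumes the map's comparator is
        // the same unsigned byte-wise ordering used for the key ranges.
        static NavigableMap<byte[], byte[]> regionView(final ConcurrentSkipListMap<byte[], byte[]> defaultDB,
                                                       final byte[] startKey, final byte[] endKey) {
            return defaultDB.subMap(startKey, true, endKey, false);
        }
    }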

The benefits after the change:

  1. The biggest win: while one region is taking a snapshot, the whole KV store can still serve concurrent reads and writes for the other regions.
  2. Each region snapshots only its own data, saving both disk and network.
  3. The snapshots of all regions combined are only 1/4 of the original size, so compression time shrinks accordingly. I only measured compressing all regions together, not a single region's snapshot; in a real run each region compresses individually, which is faster still.
  4. The snapshot files are smaller (even the total across all regions is still much smaller than a rocksdb backup).
  5. Loading a snapshot is very fast.

Drawback:
Generating an sst file is not particularly fast, so a region's size needs to be kept under control. A quick test with 100-byte values suggests a region's key count had better not exceed the order of ten million.

Below is snapshot benchmark data for ten million keys; the entries prefixed with sst are the new snapshot implementation. The benchmark code is in SnapshotBenchmark:

    -----------------------------------------------
    db size = 10000000
    slow save snapshot time cost: 2552
    slow compressed file size: 41915298
    slow compress time cost: 9173
    slow load snapshot time cost: 5119
    -----------------------------------------------
    db size = 10000000
    sst save snapshot time cost: 4296
    sst compressed file size: 10741032
    sst compress time cost: 2005
    sst load snapshot time cost: 593
    -----------------------------------------------
    db size = 10000000
    slow save snapshot time cost: 2248
    slow compressed file size: 41918551
    slow compress time cost: 8705
    slow load snapshot time cost: 4485
    -----------------------------------------------
    db size = 10000000
    sst save snapshot time cost: 4206
    sst compressed file size: 10741032
    sst compress time cost: 1950
    sst load snapshot time cost: 599
    -----------------------------------------------
    db size = 10000000
    slow save snapshot time cost: 2327
    slow compressed file size: 41916640
    slow compress time cost: 8643
    slow load snapshot time cost: 4590
    -----------------------------------------------
    db size = 10000000
    sst save snapshot time cost: 4253
    sst compressed file size: 10741032
    sst compress time cost: 1947
    sst load snapshot time cost: 590
    -----------------------------------------------

fengjiachun (Author) commented on Mar 17, 2019:

Hold off on merging; I have a new idea and will push more code tomorrow.
The fencing token sequence of DistributedLock will be changed to be region-isolated, one sequence per region. On a range split, the new region inherits the parent region's value and then increments independently, so that tokens generated after the split cannot be smaller than earlier ones.
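
A minimal sketch of that rule; initFencingToken(parentKey, splitKey) matches the call visible in a diff hunk below, while the field and the rest of the internals are guesses:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class FencingTokenSketch {

        // One monotonic token sequence per region fencing key.
        private final ConcurrentMap<String, AtomicLong> fencingKeyDB = new ConcurrentHashMap<>();

        // Every lock acquisition bumps the region's own sequence.
        long nextToken(final String fencingKey) {
            return this.fencingKeyDB.computeIfAbsent(fencingKey, k -> new AtomicLong(0)).incrementAndGet();
        }

        // On a range split the child region starts from the parent's current
        // value, so a token issued after the split can never be smaller than
        // one issued before it.
        void initFencingToken(final String parentKey, final String splitKey) {
            final AtomicLong parent = this.fencingKeyDB.computeIfAbsent(parentKey, k -> new AtomicLong(0));
            this.fencingKeyDB.putIfAbsent(splitKey, new AtomicLong(parent.get()));
        }
    }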

killme2008 modified the milestones: 1.2.4 → 1.2.5 on Mar 19, 2019
fengjiachun modified the milestones: 1.2.5 → 1.2.6 on Mar 30, 2019
fengjiachun changed the base branch from develop to master on Apr 1, 2019 13:00
@@ -39,7 +39,9 @@ public static boolean isWindows() {
}

private static boolean isWindows0() {
boolean windows = SystemPropertyUtil.get("os.name", "").toLowerCase(Locale.US).contains("win");
final boolean windows = SystemPropertyUtil.get("os.name", "") //
Contributor:

I think this could be a global static variable initialized in a static block, so this method becomes a getter. Wouldn't that perform a bit better? Though if the method isn't called frequently, it doesn't really matter.

fengjiachun (Author):

Well, the current code actually already follows the approach you describe: isWindows0() is for internal use only, and there is a static field,

private static final boolean IS_WINDOWS = isWindows0();

The method exposed to users is isWindows(), so isWindows0() is only ever called once.
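
Assembled, the pattern reads (the IS_WINDOWS field and isWindows0() are quoted from the code above; the getter body is the obvious completion):

    private static final boolean IS_WINDOWS = isWindows0();

    // Public API: just a getter on the value computed once at class load.
    public static boolean isWindows() {
        return IS_WINDOWS;
    }

    private static boolean isWindows0() {
        return SystemPropertyUtil.get("os.name", "").toLowerCase(Locale.US).contains("win");
    }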

this.rawKVStore.initFencingToken(parentKey, splitKey);
this.storeEngine.doSplit(regionIds.getKey(), regionIds.getValue(), splitKey, closure);
} catch (final Exception e) {
if (closure != null) {
Contributor:

Log this, and include the stack trace. Handling it this way drops the stack, which will make troubleshooting very painful later.

fengjiachun (Author):

done
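
(The commit list below includes "(fix) add error log on splitting fail"; presumably something along these lines, reusing the StackTraceUtil pattern that appears later in this diff — the message text is a guess:)

    LOG.error("Fail to split on splitKey: {}, {}.", splitKey, StackTraceUtil.stackTrace(e));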

} finally {
writeLock.unlock();
}
this.opts = opts;
Contributor:

Were the read/write locks removed here because the maps were changed to concurrent maps? But then how do you guarantee the data stays consistent across the several maps?

fengjiachun (Author):

Not many methods touch several maps at once, e.g. tryLockWith and onSnapshotLoad().

  1. There is no concurrency within a single region, because every operation goes through the raft log and is applied serially.
  2. The linearizable-read path only operates on defaultDB, which has no relation to the other maps.
  3. Concurrent operations on different regions don't affect each other either, because regions are strictly range-isolated.

} finally {
writeLock.unlock();
}
this.defaultDB.clear();
Contributor:

Take this method for example: other operations might run in between while you are clearing the several maps, so the maps would not be fully in sync. What would be the consequence?

fengjiachun (Author):

This is in the store's close method. Before the store closes, all raft group services on the current node are shut down first, so no operations can come in.

try {
closeMemoryDB();
openMemoryDB(this.opts, defaultDB, sequenceDB.data(), fencingKeyDB.data(), lockerDB.data());
final File file = new File(snapshotPath);
Contributor:

I suggest wrapping this method into a class along the lines of MemoryKeyStoreSnapshotFile and moving this logic into it, similar to Counter's snapshot handling.

fengjiachun (Author):

The main problem is that the snapshot depends heavily on the store's internal data and state. I'll try to extract everything beyond that, such as the IO operations.

@Override
public void onSnapshotLoad(final String snapshotPath, final LocalFileMeta meta, final Region region)
throws Exception {
final Timer.Context timeCtx = getTimeContext("SNAPSHOT_LOAD");
Contributor:

Same as above.

openRocksDB(opts);
LOG.info("[RocksRawKVStore] start successfully, options: {}.", opts);
return true;
} catch (final RocksDBException e) {
} catch (final Exception e) {
LOG.error("Fail to open rocksDB at path {}, {}.", opts.getDbPath(), StackTraceUtil.stackTrace(e));
Contributor:

On the exception path, isn't there a missing return false?

fengjiachun (Author):

It does return false; it's further down.

}

public long getDatabaseVersion() {
return this.databaseVersion.get();
}

public void addStatisticsCollectorCallback(final StatisticsCollectorCallback callback) {
if (this.statisticsCollector == null || this.statistics == null) {
Contributor:

This could be changed to a Requires call.

fengjiachun (Author):

done
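
(Presumably along these lines, using jraft's Requires helper; the message strings are guesses:)

    Requires.requireNonNull(this.statisticsCollector, "statisticsCollector");
    Requires.requireNonNull(this.statistics, "statistics");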

@@ -1069,6 +1107,7 @@ public void createSstFiles(final EnumMap<SstColumnFamily, File> sstFileTable, fi
it.seek(startKey);
}
sstFileWriter.open(sstFile.getAbsolutePath());
long count = 0;
Contributor:

int seems sufficient.

fengjiachun (Author):

I still feel long is safer; int's positive range only goes up to about 2.1 billion.

@@ -1222,6 +1270,66 @@ private void readSnapshot(final String snapshotPath) {
}
}

private LocalFileMeta writeSstSnapshot(final String snapshotPath, final Region region) {
final Timer.Context timeCtx = getTimeContext("WRITE_SST_SNAPSHOT");
Contributor:

For this snapshot handling I'd also suggest refactoring it into something like a SnapshotFile interface: first, it's more cohesive and doesn't expose this logic in the state machine; second, as an interface it can be extended later.
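
A rough sketch of the suggested shape (interface and method names are hypothetical; LocalFileMeta and Region are the types already used in this PR):

    public interface KVStoreSnapshotFile {

        // Saves a snapshot of the given region under snapshotPath and returns
        // the metadata to be recorded with it.
        LocalFileMeta save(final String snapshotPath, final Region region) throws Exception;

        // Loads the region's snapshot from snapshotPath, validating it
        // against the recorded metadata.
        void load(final String snapshotPath, final LocalFileMeta meta, final Region region) throws Exception;
    }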

}
};
this.kvStore.addStatisticsCollectorCallback(callback);
// StatisticsCollectorCallback callback = new StatisticsCollectorCallback() {
Contributor:

Delete the code that's no longer needed.

fengjiachun (Author):

done

killme2008 (Contributor):

@fengjiachun I've gone through it; please update accordingly. The tests don't seem to pass right now?

fengjiachun (Author):

@killme2008 Hold on; I haven't finished addressing the snapshot suggestions yet.

fengjiachun (Author):

@killme2008 I've refactored the snapshot-related code; please take another look when you have time.

killme2008 merged commit bc4523d into master on Apr 6, 2019
killme2008 deleted the fieat/multi-group-snapshot branch on Apr 6, 2019 02:50
fengjiachun added a commit that referenced this pull request Apr 17, 2019
* (fix) multi-group snapshot with memoryStore

* (fix) multi-group snapshot with rocksDBStore

* (fix) add multi sst snapshot benchmark

* (fix) add meta check on load snapshot

* (fix) add meta check on load snapshot

* (fix) add timer metric with memory snapshot

* (fix) typo

* (fix) log

* (bugfix) fencing token isolation

* (bugfix) rename local variable

* (bugfix) fencing token isolation by region

* (fix) typo

* (fix) region info visibility

* (feat) add comment

* (fix) delete existing data on rocksdb restart, because raft will play them back

* (fix) delete existing data on rocksdb restart, because raft will play them back

* (fix) add restart test #86

* (fix) add restart test #86

* (fix) update restart test

* (fix) add error log on splitting fail

* (fix) requires

* (fix) remove useless code

* (fix) code refactoring with snapshot

* (fix) fix bad name

* (fix) typo
fengjiachun mentioned this pull request on Aug 15, 2019