[Question] How to quickly import a billion vertices and edges #2607

zhaorui9303 · 2024-07-31T07:47:25Z

Problem Type

None

Before submit

I have searched the existing issues and the FAQ and confirmed that there is no similar or duplicate problem

Environment

Server Version: 1.0.0 (Apache Release Version)
Backend: RocksDB x nodes, HDD or SSD
OS: xx CPUs, xx G RAM, Ubuntu 2x.x / CentOS 7.x
Data Size: xx vertices, xx edges

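Your Question

We have roughly 1 billion records of incremental data per day, and the underlying storage is HBase. When importing with the loader tool, after running for a while we hit "rest server is too busy to write". I tried limiting the import rate (rate_limit.write=10000) and raising the share of threads used for batch writes (batch.max_write_ratio=90). With these settings the import no longer fails, but import performance becomes very poor. Are there any tuning suggestions for importing data at this scale? The goal is to maximize import performance while keeping the import free of errors.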

Vertex/Edge example

No response

Schema [VertexLabel, EdgeLabel, IndexLabel]

No response

dosubot · 2024-07-31T07:57:43Z

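To efficiently import about 1 billion incremental records per day into HugeGraph while avoiding the "rest server is too busy to write" error, consider the following tuning suggestions:

Use bulk load: use HBase's bulk load feature instead of writing to HBase directly. This involves generating HFiles and then loading them into HBase. The following code example generates the HFiles that are later loaded into HBase:

```java
// Partition the serialized cells and sort them within each partition.
JavaPairRDD<ImmutableBytesWritable, KeyValue> repartitionedRdd =
    buildAndSerRdd.repartitionAndSortWithinPartitions(partitioner);
// Configure the HFile output format for the target table.
Configuration conf = sinkToHBase.getHBaseConfiguration().get();
Job job = Job.getInstance(conf);
HFileOutputFormat2.configureIncrementalLoadMap(job, tableDescriptor);
conf.set("hbase.mapreduce.hfileoutputformat.table.name",
         tableDescriptor.getTableName().getNameAsString());
// Write the sorted cells out as HFiles under a fresh output path.
String path = getHFilePath(job.getConfiguration());
repartitionedRdd.saveAsNewAPIHadoopFile(path, ImmutableBytesWritable.class,
                                        KeyValue.class, HFileOutputFormat2.class,
                                        conf);
LOG.info("Saved HFiles to: '{}'", path);
flushPermission(conf, path);
```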

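Repartition and sort the data: before generating HFiles, repartition and sort the data so that it is evenly distributed and ordered. This helps reduce the load on the HBase RegionServers.

```java
JavaPairRDD<ImmutableBytesWritable, KeyValue> repartitionedRdd =
    buildAndSerRdd.repartitionAndSortWithinPartitions(partitioner);
```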
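
To make the idea concrete, here is a minimal, library-free sketch of what range partitioning followed by sorting within each partition accomplishes. It is not the loader's Spark code: the class `RangePartitionSketch`, its methods, and the sample split keys are hypothetical, and plain strings stand in for HBase's byte-array row keys (an assumption that only holds for ASCII keys).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not part of HugeGraph or HBase.
final class RangePartitionSketch {

    /**
     * Returns the partition index for a row key, given sorted split keys.
     * Partition i covers [splits[i-1], splits[i]), so a key equal to a
     * split key belongs to the partition that starts at that split.
     */
    static int partitionFor(String rowKey, String[] sortedSplits) {
        int pos = Arrays.binarySearch(sortedSplits, rowKey);
        // Found: the key starts the next partition.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    /** Groups row keys by partition, then sorts each partition. */
    static List<List<String>> repartitionAndSort(List<String> rowKeys,
                                                 String[] sortedSplits) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i <= sortedSplits.length; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : rowKeys) {
            partitions.get(partitionFor(key, sortedSplits)).add(key);
        }
        for (List<String> partition : partitions) {
            Collections.sort(partition);
        }
        return partitions;
    }

    public static void main(String[] args) {
        String[] splits = {"g", "p"}; // hypothetical region boundaries
        List<String> keys = Arrays.asList("z", "a", "m", "g", "b", "p");
        System.out.println(repartitionAndSort(keys, splits));
    }
}
```

Each resulting partition covers one contiguous key range in sorted order, which is the shape a per-region HFile needs.
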
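
Tune the HBase configuration: adjust the HBase configuration to handle large data volumes. Key parameters include:
- hbase.regionserver.handler.count: increase the number of handlers to allow more concurrent operations.
- hbase.regionserver.global.memstore.upperLimit: increase the memstore size so that more writes can be buffered (in newer HBase releases this setting is named hbase.regionserver.global.memstore.size).
- hbase.hstore.blockingStoreFiles: raise the number of store files a store may accumulate before further writes are blocked until compaction catches up.

Monitor and scale the HBase cluster: make sure the HBase cluster is scaled appropriately for the load. Monitor cluster performance and add more RegionServers when necessary.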

Prepare the HFile output path: make sure the HFile output directory is set up properly and cleaned before use.

```java
public String getHFilePath(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    String fileID = fileID();
    String pathStr = fs.getWorkingDirectory().toString() + "/hfile-gen" + "/" + fileID + "/";
    Path hfileGenPath = new Path(pathStr);
    // Remove any leftover output from a previous run.
    if (fs.exists(hfileGenPath)) {
        LOG.info("\n Delete the path where the hfile is generated,path {} ", pathStr);
        fs.delete(hfileGenPath, true);
    }
    return pathStr;
}
```

Bulk load the HFiles: finally, use the bulk load feature to load the generated HFiles into HBase.

```java
public void loadFiles(String path) {
    try {
        // BulkLoad HFile to HBase
        sinkToHBase.loadHfiles(path, getTableName());
    } catch (Exception e) {
        LOG.error(" Failed to load hfiles", e);
    }
}
```

By following these practices, you can import large volumes of data into HBase efficiently and minimize the risk of hitting the "rest server is too busy to write" error [1].


zhaorui9303 · 2024-07-31T07:58:14Z

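Before I adjusted the rate_limit.write and batch.max_write_ratio parameters, the import also threw an error like the one below. After adjusting the parameters, the error no longer reproduces. Is this kind of error caused by writing to HBase too fast?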

dosubot · 2024-07-31T08:08:47Z

Yes, an InsertException is usually caused by writing to HBase too fast. To optimize the import and avoid this error while keeping performance high, consider the following:

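- Retry settings: increase the number of retries for failed inserts and the interval between retries.
- Error handling: set the maximum number of parse errors and insert errors that are tolerated.
- Timeout settings: adjust the timeout for insert operations.

Here is an example configuration:

```
{
  "import-settings": {
    "requestTimesWhenInterpolationFailed": 5,  // number of retries
    "requestTicksWhenInterpolationFailed": 2,  // retry interval (seconds)
    "maximumAnalyzedErrorRow": 100,            // maximum number of parse errors allowed
    "maxiumInterpolateErrorRow": 50,           // maximum number of insert errors allowed
    "InterpolationTimeout": 30                 // insert timeout (seconds)
  }
}
```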

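By controlling retry behavior and error tolerance, these settings help manage the load on HBase, which reduces the occurrence of InsertException while keeping performance high [1].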
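
As a rough illustration of what a retry count and a retry interval amount to, the sketch below retries a failing write a bounded number of times with a fixed pause in between. The names `RetrySketch`, `callWithRetry`, `retryTimes`, and `retryIntervalMillis` are hypothetical and are not taken from HugeGraph-Loader; the loader's actual retry logic may differ.

```java
import java.util.function.Supplier;

// Hypothetical illustration; not the loader's implementation.
final class RetrySketch {

    /**
     * Runs the action, retrying up to retryTimes more times after a failure
     * and sleeping retryIntervalMillis between attempts. Rethrows the last
     * failure once the retries are exhausted or the thread is interrupted.
     */
    static <T> T callWithRetry(Supplier<T> action, int retryTimes,
                               long retryIntervalMillis) {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < retryTimes) {
                    try {
                        Thread.sleep(retryIntervalMillis);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated write that fails twice before succeeding.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("rest server is too busy to write");
            }
            return "batch written";
        }, 5, 200L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A longer interval gives the backend more time to drain its queue, at the cost of slower recovery; retries alone do not raise the sustainable write rate.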


JackyYangPassion · 2024-08-01T02:45:03Z

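Take a look at the following two resources:

- loader doc
- bulkload to HBase PR

With either approach, an import of a billion or more vertices and edges can finish within hours.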
@zhaorui9303

cc @haohao0103

JackyYangPassion · 2024-08-01T03:32:46Z

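> We have roughly 1 billion records of incremental data per day, and the underlying storage is HBase. When importing with the loader tool, after running for a while we hit "rest server is too busy to write". I tried limiting the import rate (rate_limit.write=10000) and raising the share of threads used for batch writes (batch.max_write_ratio=90). With these settings the import no longer fails, but import performance becomes very poor. Are there any tuning suggestions for importing data at this scale? The goal is to maximize import performance while keeping the import free of errors.

First you need to determine where the write bottleneck is: HBase, HugeGraph Server, or the Loader.

If it is the Server, you can put nginx or HAProxy in front of it for load balancing.
If it is the Loader, you can use Spark.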
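
A quick back-of-the-envelope calculation helps when judging whether a throttled import can keep up at all. The sketch below assumes that rate_limit.write is a cap in items per second and that the daily volume is exactly 1 billion records; the class and method names are hypothetical.

```java
// Hypothetical estimate; assumes rate_limit.write caps writes in items per second.
final class ImportRateEstimate {

    /** Seconds needed to write `records` items at a sustained rate (rounded up). */
    static long secondsToImport(long records, long ratePerSecond) {
        if (ratePerSecond <= 0) {
            throw new IllegalArgumentException("ratePerSecond must be positive");
        }
        return (records + ratePerSecond - 1) / ratePerSecond;
    }

    /** Sustained rate in items/s (rounded up) needed to finish within the window. */
    static long requiredRate(long records, long windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        return (records + windowSeconds - 1) / windowSeconds;
    }

    public static void main(String[] args) {
        long daily = 1_000_000_000L;
        long seconds = secondsToImport(daily, 10_000L);
        System.out.printf("At 10000 items/s: %d s (%.1f h)%n", seconds, seconds / 3600.0);
        System.out.printf("To finish in 24 h: %d items/s%n", requiredRate(daily, 24 * 3600L));
        System.out.printf("To finish in 1 h: %d items/s%n", requiredRate(daily, 3600L));
    }
}
```

Under these assumptions, 1 billion records at 10,000 items/s takes 100,000 seconds, about 27.8 hours, so a daily increment cannot complete within a day at that cap. Finishing within 24 hours needs roughly 11,600 items/s sustained, and finishing within one hour roughly 278,000 items/s.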

zhaorui9303 · 2024-08-01T09:59:48Z

Regarding the error in the screenshot posted above: does it indicate that the write bottleneck is on HBase? @JackyYangPassion

dosubot bot added the hbase (HBase backend) and question (Further information is requested) labels Jul 31, 2024

imbajin assigned JackyYangPassion Jul 31, 2024
