Data Summary performance optimization #49

ZhengshuaiPENG · 2022-08-16T03:41:57Z

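Goal: optimize the performance of the SQLDataSummary and SQLPatternDistribution ETs

Current state:

Running the SQLDataSummary ET triggers a large number of jobs, and some of them are triggered serially, so the Spark cluster's parallelism is not fully used. This is one candidate for optimization.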

Issues found so far:

tech.mlsql.plugins.mllib.ets.fe.SQLDataSummary#train

   a. tech.mlsql.plugins.mllib.ets.fe.SQLDataSummary#getQuantileNum
        For each field in the schema, calls data.isEmpty() && computePercentile

   b. tech.mlsql.plugins.mllib.ets.fe.SQLDataSummary#computePercentile
       Internally runs count and index.lookup.
       It is also called 3 times sequentially; change this to concurrent calls.
       When k < c, index.lookup is triggered 3 times.

   c. tech.mlsql.plugins.mllib.ets.fe.SQLDataSummary#getModeNum
       When !dfWithoutNa.isEmpty, calls count for each field in the schema.

tech.mlsql.plugins.mllib.ets.fe.SQLPatternDistribution#train
For each field in the schema, runs pattern_group_df.count().
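
The serial calls listed above could be submitted from separate threads, since Spark's scheduler accepts jobs from multiple threads of one application. The following is a minimal sketch of that idea, not the code from the ET: `ConcurrentJobs` and `runAll` are hypothetical names, and the pool size of 4 is an arbitrary assumption.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object ConcurrentJobs {
  // Hypothetical helper: runs independent tasks on a bounded thread pool
  // and returns their results in the same order as the input tasks.
  def runAll[A](tasks: Seq[() => A], parallelism: Int = 4): Seq[A] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Each task starts as soon as a pool thread is free.
      val futures = tasks.map(task => Future(task()))
      // Block the caller until every task has finished.
      Await.result(Future.sequence(futures), Duration.Inf)
    } finally {
      pool.shutdown()
    }
  }
}
```

With a DataFrame `df`, each task would wrap one Spark action, for example `ConcurrentJobs.runAll(df.columns.toSeq.map(c => () => df.filter(df(c).isNotNull).count()))`. Whether this helps depends on how many executor cores are idle while a single job runs.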

Note:

The list above is not necessarily complete. If new hot spots turn up during development and tuning, add them to this issue or link them here.

One more point:

@hellozepp found that caching the df gives a clear performance improvement, so caching on the ET side is another optimization point.
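
One way to apply the caching idea is to cache the input once, run every metric against the cached copy, and release it afterwards. This is a sketch under assumptions: `CachedSummary` and `withCached` are hypothetical names, and the ET may manage its cache differently.

```scala
import org.apache.spark.sql.DataFrame

object CachedSummary {
  // Hypothetical helper: runs `body` against a cached DataFrame and
  // releases the cache afterwards, even if `body` throws.
  def withCached[A](df: DataFrame)(body: DataFrame => A): A = {
    // cache() is lazy: the data is materialized by the first action in `body`
    // and reused by every later action.
    val cached = df.cache()
    try body(cached)
    finally cached.unpersist()
  }
}
```

For example, `CachedSummary.withCached(df)(d => (d.count(), d.distinct().count()))` reads the source once for both actions. Note that the helper also unpersists a DataFrame the caller had already cached, so it suits inputs the ET owns.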


fishcus · 2022-08-17T07:55:02Z

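from @hellozepp
3. Algorithm pruning

a. The subordinate operators can compute all metric values in a single select, instead of computing each metric separately and then unioning the results

b. Drop the primary key candidate metric: a unique-value ratio of 100% means primary key candidate = true; below 100% means primary key candidate = false

c. Drop the blank and non-empty counts: non-empty count = total - (null + blank), so once the total and the null count are available, the null, blank, and non-empty counts are all covered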
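
Points a to c can be sketched as one aggregation that returns every base count in a single row, with the remaining metrics derived on the driver. This is an illustration, not the ET's code: `SinglePassSummary`, `ColumnStats`, and `summarize` are hypothetical names, "blank" is assumed to mean a string that is empty after trimming, the unique-value ratio is assumed to be distinct / total, and column names are assumed to contain no dots or backticks. The blank count is still computed here because the formula in c needs it.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, count, countDistinct, lit, trim, when}

object SinglePassSummary {
  // Base counts per column; the other metrics are derived without extra jobs.
  final case class ColumnStats(total: Long, nulls: Long, blanks: Long, distinct: Long) {
    // non-empty count = total - (null + blank)
    def nonEmpty: Long = total - (nulls + blanks)
    // unique-value ratio of 100% <=> primary key candidate
    def isPrimaryKeyCandidate: Boolean = total > 0 && distinct == total
  }

  def summarize(df: DataFrame): Map[String, ColumnStats] = {
    val perColumn: Seq[Column] = df.columns.toSeq.flatMap { c =>
      Seq(
        // count() skips nulls, so only rows matching the condition are counted.
        count(when(col(c).isNull, lit(1))).as(s"${c}__nulls"),
        count(when(trim(col(c).cast("string")) === "", lit(1))).as(s"${c}__blanks"),
        countDistinct(col(c)).as(s"${c}__distinct")
      )
    }
    // A single select/agg instead of one job per metric followed by a union.
    val row = df.agg(count(lit(1)).as("__total"), perColumn: _*).head()
    val total = row.getAs[Long]("__total")
    df.columns.map { c =>
      c -> ColumnStats(
        total,
        row.getAs[Long](s"${c}__nulls"),
        row.getAs[Long](s"${c}__blanks"),
        row.getAs[Long](s"${c}__distinct"))
    }.toMap
  }
}
```

Several exact distinct counts in one aggregation can be expensive on wide tables; an approximate distinct count may be an acceptable trade-off if the primary key check tolerates it.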

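Cache data wherever possible to reduce repeated computation

a. Only a few metrics do not use colWithFilterBlank. Could colWithFilterBlank be applied once to the source table, with the result set cached and shared by the other metrics?

b. computePercentile is called multiple times, and the sortBy and count inside it are recomputed each time; these can be reused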
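
The reuse described in b could look roughly like the sketch below: sort and index once, cache the result, count once, and fetch all requested ranks together instead of issuing one lookup per percentile. `SharedPercentiles` and `percentiles` are hypothetical names, and the nearest-rank definition used here is an assumption; the actual computePercentile may interpolate differently.

```scala
import org.apache.spark.rdd.RDD

object SharedPercentiles {
  // Hypothetical helper: nearest-rank percentiles for several probabilities,
  // sharing one sort, one count, and one fetch of the selected ranks.
  def percentiles(values: RDD[Double], probs: Seq[Double]): Seq[Double] = {
    // Sort and index once; cache so the count and the fetch reuse the result.
    val indexed = values.sortBy(x => x).zipWithIndex().map(_.swap).cache()
    try {
      val n = indexed.count()
      require(n > 0, "cannot compute percentiles of an empty RDD")
      // Nearest-rank: the value at zero-based position ceil(p * n) - 1, clamped to [0, n - 1].
      val ranks = probs.map(p => math.min(n - 1, math.max(0L, math.ceil(p * n).toLong - 1)))
      val wanted = ranks.toSet
      // Fetch every requested rank together rather than one lookup per percentile.
      val byRank = indexed.filter { case (i, _) => wanted.contains(i) }.collectAsMap()
      ranks.map(byRank)
    } finally {
      indexed.unpersist()
    }
  }
}
```

Calling `SharedPercentiles.percentiles(rdd, Seq(0.25, 0.5, 0.75))` replaces three separate sort-count-lookup rounds with a single shared one.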

ckeys · 2022-08-17T08:09:33Z

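Optimization points:
1) Drop the primary key candidate metric: a unique-value ratio of 100% means primary key candidate = true; below 100% means primary key candidate = false
2) Drop the blank and non-empty counts: non-empty count = total - (null + blank)
3) In the computePercentile calculation, the RDD is cached
4) Avoid negating df.isEmpty; remove the !df.isEmpty checks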

ZhengshuaiPENG · 2022-09-02T10:21:08Z

PR: https://github.com/byzer-org/byzer-extension/pull/54/files

ZhengshuaiPENG added the byzer-lang 2.3.3 label Aug 16, 2022

ZhengshuaiPENG assigned ckeys Aug 16, 2022

ZhengshuaiPENG added byzer-lang 2.3.4 byzer-lang 2.3.3 and removed byzer-lang 2.3.3 byzer-lang 2.3.4 labels Aug 16, 2022

ZhengshuaiPENG mentioned this issue Sep 2, 2022

Data Summary refactoring #55

Closed

ZhengshuaiPENG closed this as completed Sep 2, 2022
