
[OSPP2023] Refactor OpenDigger by sql query builder #1265

Closed
xgdyp opened this issue Apr 24, 2023 · 14 comments
Labels
kind/enhancement Category issues or prs related to enhancement. waiting for author need issue author's feedback

Comments

@xgdyp
Contributor

xgdyp commented Apr 24, 2023

Description

Description: Complex, long SQL is difficult to maintain, which is not conducive to the development of the project. One way to solve this problem is to generate SQL through a SQL builder, which can make the SQL easier to read, especially subqueries. So we want to explore whether there is a mature framework that can help OpenDigger.
In Python, we can use pypika, a Python package that supports some of ClickHouse's SQL syntax. What we need to do is investigate in detail whether it can cover our SQL, and whether it will really improve OpenDigger.

If yes, we can refactor the Python kernel first as the OSPP task. I think I can follow up on this project.
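To make the proposal concrete, here is a minimal pure-Python sketch of the query-builder idea. This is hypothetical illustration code, not pypika itself (pypika's actual entry points are `Query.from_(...)`, `Table`, and `Field`); it only shows how composable parts replace hand-concatenated SQL strings.

```python
# Hypothetical sketch of a SQL builder (NOT pypika); shows the idea of
# composing a query from parts instead of concatenating strings.

class SimpleQuery:
    """Builds a SELECT statement from composable, chainable parts."""

    def __init__(self, table):
        self._table = table
        self._columns = []
        self._where = []
        self._group_by = []

    def select(self, *cols):
        self._columns.extend(cols)
        return self  # chainable, like pypika's fluent API

    def where(self, cond):
        self._where.append(cond)
        return self

    def group_by(self, *cols):
        self._group_by.extend(cols)
        return self

    def to_sql(self):
        sql = "SELECT {} FROM {}".format(", ".join(self._columns), self._table)
        if self._where:
            sql += " WHERE " + " AND ".join(self._where)
        if self._group_by:
            sql += " GROUP BY " + ", ".join(self._group_by)
        return sql

q = (SimpleQuery("opensource.gh_events")
     .select("repo_id", "COUNT() AS count")
     .where("type = 'IssuesEvent'")
     .group_by("repo_id"))
print(q.to_sql())
```

Each clause lives in one place, so a reader can follow the query structure without mentally reassembling string fragments.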

Expected outcomes:
Metrics generated by a Python SQL builder

Skills:
Python, TypeScript, JavaScript, SQL

References:
https://github.com/didi/gendry
https://github.com/sqlkata/querybuilder
https://github.com/ibis-project/ibis
https://github.com/doug-martin/goqu

@open-digger-bot open-digger-bot bot added the kind/enhancement Category issues or prs related to enhancement. label Apr 24, 2023
@github-actions github-actions bot added the waiting for repliers need other's feedback label Apr 24, 2023
@tc2000731
Contributor

Hello @xgdyp,

I am TianchenZhou from East China Normal University, and I am very interested in this project. I understand that OpenDigger is an open source analysis project that aims to combine the wisdom of developers worldwide to jointly analyze and gain insight into open source data, helping everyone better understand and participate in open source. This project makes me curious and eager to contribute to it.

Therefore, I would like to ask you, what pre-tasks can I do to help me start my contribution journey? Or, what preparations can I do to increase my chances of being selected?

Thank you very much for your time and guidance. I look forward to your reply!


@github-actions github-actions bot added waiting for author need issue author's feedback and removed waiting for repliers need other's feedback labels Apr 28, 2023
@tc2000731
Contributor

I would like to use PyPika as a reference to implement a Python SQL builder. PyPika is a simple and flexible SQL query builder that supports multiple database platforms and analytical features. I think PyPika is a good model because it reflects the richness and logic of the SQL language, rather than simply concatenating strings. Although PyPika cannot fully support ClickHouse queries, it already covers most of the required features.

I strongly agree with using a SQL builder to construct queries, because it can improve the readability and maintainability of the code and also avoid security risks such as SQL injection. Therefore, I would like to learn more about PyPika's design and implementation, and discuss my work plan and progress with you. Could you please give me some guidance or suggestions, so that I know what standard my work should meet?

@xgdyp
Contributor Author

xgdyp commented Apr 29, 2023

Hi, welcome. Can you try pypika first to see how much of ClickHouse it supports?

@github-actions github-actions bot added waiting for repliers need other's feedback and removed waiting for author need issue author's feedback labels Apr 29, 2023
@xgdyp
Contributor Author

xgdyp commented Apr 30, 2023

Some information in this link.

@tc2000731
Contributor

Hi, welcome. Can you try pypika first to see how much of ClickHouse it supports?

OK, I'll report what I learn later.

@github-actions github-actions bot added waiting for author need issue author's feedback and removed waiting for repliers need other's feedback labels Apr 30, 2023
@tc2000731
Contributor

@xgdyp There are multiple queries that call this function:

def getGroupArrayInsertAtClauseForClickhouse(config, option) -> str:
    """Build a groupArrayInsertAt clause for ClickHouse.

    Args:
        config (dict): QueryConfig
        option (dict): { key: string; defaultValue?: string; value?: string; }
    """
    def process():
        # Without a group time range, every value is inserted at position 0.
        if config.get('groupTimeRange') is None:
            return '0'
        startTime = "toDate('{}-{}-1')".format(config.get('startYear'), config.get('startMonth'))
        if config.get('groupTimeRange') == 'quarter':
            startTime = 'toStartOfQuarter({})'.format(startTime)
        elif config.get('groupTimeRange') == 'year':
            startTime = 'toStartOfYear({})'.format(startTime)
        # Position = number of group intervals between startTime and the row's time.
        return "toUInt32(dateDiff('{}', {}, time))".format(config.get('groupTimeRange'), startTime)

    params = '({})'.format(option.get('defaultValue')) if option.get('defaultValue') is not None else ''
    value = option.get('value') if option.get('value') is not None else option.get('key')
    return 'groupArrayInsertAt{}({}, {}) AS {}'.format(params, value, process(), option.get('key'))

This function uses ClickHouse's built-in function groupArrayInsertAt, a very special kind of aggregate function. Apparently, this aggregate function does not exist in pypika.

  • How many other unique functions like this exist?
    • I am not completely sure yet, but according to the official manual, the number of aggregate functions alone is not small. 1
    • TODO: I need further work to retrieve and confirm how many of them are used in the OpenDigger project.
  • How should this type of unique function be addressed?
    • TODO: I currently think the pypika extension mechanism is a good lead 2, but not necessarily the perfect solution.
    • Worst case: implement a separate ClickHouse-specific SQL generator.

Footnotes

  1. List of Aggregate Functions | ClickHouse Docs

  2. Extending PyPika --- PyPika 0.35.16 documentation
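As a thought experiment for the "how should this be addressed" question above, here is a hypothetical pure-Python helper that renders the parametric-aggregate syntax `groupArrayInsertAt(default)(value, pos)`. This is not pypika code; pypika does ship a `CustomFunction` helper and an extension mechanism for ordinary functions, but ClickHouse's two-argument-list parametric form would likely need custom rendering along these lines. All names below are illustrative.

```python
# Hypothetical renderer for a ClickHouse parametric aggregate; not part
# of pypika. Shows the two-argument-list form groupArrayInsertAt(x)(v, pos).

def group_array_insert_at(value, pos, default=None, alias=None):
    """Render groupArrayInsertAt[(default)](value, pos) [AS alias]."""
    params = "({})".format(default) if default is not None else ""
    sql = "groupArrayInsertAt{}({}, {})".format(params, value, pos)
    if alias:
        sql += " AS {}".format(alias)
    return sql

print(group_array_insert_at("ROUND(count, 2)", "idx", default=0, alias="counts"))
# -> groupArrayInsertAt(0)(ROUND(count, 2), idx) AS counts
```

A real pypika integration would instead subclass a pypika term so the result composes with the rest of a query, but the string shape it must emit is the one shown here.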

@tc2000731
Contributor

@xgdyp I need further support on the points below; I hope you can tutor me or give me a little help.

I need a few simple SQL statements to use as examples. Most SQL in this codebase is embedded in string templates, and I need complete SQL statements for tests, so please tell me where I can find some, or provide some if possible.

@xgdyp
Contributor Author

xgdyp commented May 2, 2023

@xgdyp I need further support on the points below; I hope you can tutor me or give me a little help.

I need a few simple SQL statements to use as examples. Most SQL in this codebase is embedded in string templates, and I need complete SQL statements for tests, so please tell me where I can find some, or provide some if possible.

Nice work.

You can use the sample data as your test playground. For SQL, you can also look in the js kernel; just print the spliced query.

In OpenDigger, we don't use many ClickHouse-specific syntax features (maybe arrayJoin, multiIf, groupArrayInsertAt and some time functions). I think functions like groupArrayInsertAt are easy to extend in pypika, but that project has not been updated since last year, so it may be a little tricky.

I agree with you that the first thing we should do is find out whether using pypika is feasible. You can also try to run OpenDigger first; feel free to ask me if you have any questions.

@github-actions github-actions bot added waiting for repliers need other's feedback and removed waiting for author need issue author's feedback labels May 2, 2023
@tc2000731
Contributor

@xgdyp I need further support on the points below; I hope you can tutor me or give me a little help.
I need a few simple SQL statements to use as examples. Most SQL in this codebase is embedded in string templates, and I need complete SQL statements for tests, so please tell me where I can find some, or provide some if possible.

Nice work.

You can use the sample data as your test playground. For SQL, you can also look in the js kernel; just print the spliced query.

In OpenDigger, we don't use many ClickHouse-specific syntax features (maybe arrayJoin, multiIf, groupArrayInsertAt and some time functions). I think functions like groupArrayInsertAt are easy to extend in pypika, but that project has not been updated since last year, so it may be a little tricky.

I agree with you that the first thing we should do is find out whether using pypika is feasible. You can also try to run OpenDigger first; feel free to ask me if you have any questions.

@xgdyp It's hard for me to reconstruct the original SQL from these string templates; could you please point me to some simple ones?

@github-actions github-actions bot added waiting for author need issue author's feedback and removed waiting for repliers need other's feedback labels May 2, 2023
@xgdyp
Contributor Author

xgdyp commented May 8, 2023

Here are some steps for your onboarding:

  1. Learn Docker and other tools so you can use the sample data; just get ClickHouse running first.
  2. Use a tool like DataGrip or Navicat to connect to the ClickHouse container, try writing some queries, and get familiar with the database schema.
  3. Run the js kernel and use its metrics to explore some GitHub repos.
  4. Then you can get the raw SQL by console.log()-ing the metric query.
  5. Explore the query builder.

@github-actions github-actions bot added waiting for repliers need other's feedback and removed waiting for author need issue author's feedback labels May 8, 2023
@tc2000731
Contributor

Here are some steps for your onboarding:

  1. Learn Docker and other tools so you can use the sample data; just get ClickHouse running first.
  2. Use a tool like DataGrip or Navicat to connect to the ClickHouse container, try writing some queries, and get familiar with the database schema.
  3. Run the js kernel and use its metrics to explore some GitHub repos.
  4. Then you can get the raw SQL by console.log()-ing the metric query.
  5. Explore the query builder.

Thanks, I'll start from here.

@github-actions github-actions bot added waiting for author need issue author's feedback and removed waiting for repliers need other's feedback labels May 13, 2023
@l1tok
Contributor

l1tok commented Jul 23, 2023

The CustomFunction helper or a custom class in the pypika library can implement ClickHouse-specific functions. However, I think the SQL-builder refactoring is actually not significantly different from the current SQL-string implementation. I think we should focus on improving the open-digger Python kernel, which serves the same ultimate goal as refactoring with a SQL builder: there are some problems with the Python kernel implementation, as well as inconsistencies with the js kernel, and solving these problems is also a prerequisite for any SQL-builder refactoring.

My work so far is as follows:

  • The first problem I solved was that the Python kernel SQL printed with print() was extremely unreadable. I used f-strings instead of the original format() calls to build the SQL. Here is the difference in the output of print(sql) in chaoss.py:
    SELECT     id,     argMax(name, time) AS name,     SUM(count) AS total_count,     groupArrayInsertAt(
        0, 
        toUInt32(dateDiff('year', toDate('2016-1-1'), toDate('2022-12-1'))) + 1)(ROUND(count, 2), 
        toUInt32(dateDiff('year', toStartOfYear(toDate('2016-1-1')), time))) AS issues_new_count     FROM     (SELECT         toStartOfYear(created_at) AS time, repo_id AS id, argMax(repo_name, time) AS name,         COUNT() AS count     FROM opensource.gh_events WHERE type = 'IssuesEvent' AND action = 'opened' AND (repo_id IN [288431943]) AND  created_at >= toDate('2016-1-1') AND created_at < toDate('2023-1-1')      GROUP BY id, time     ORDER BY count DESC LIMIT 10 BY time     )     GROUP BY id     ORDER BY issues_new_count[-1] DESC     FORMAT JSONCompact
    SELECT 
    id, 
    argMax(name, time) AS name, 
    SUM(count) AS total_count, 
    groupArrayInsertAt(
        0, 
        toUInt32(dateDiff('year', toDate('2016-1-1'), toDate('2022-12-1'))) + 1)(ROUND(count, 2), 
        toUInt32(dateDiff('year', toStartOfYear(toDate('2016-1-1')), time))) AS issues_new_count 
    FROM 
    (
    
    SELECT 
        toStartOfYear(created_at) AS time, repo_id AS id, argMax(repo_name, time) AS name, 
        COUNT() AS count 
    FROM opensource.gh_events           
    WHERE type = 'IssuesEvent' AND action = 'opened' AND (repo_id IN [288431943]) AND  created_at >= toDate('2016-1-1') AND created_at < toDate('2023-1-1')  
    GROUP BY id, time 
    ) 
    GROUP BY id 
    ORDER BY issues_new_count[-1] DESC LIMIT 10
    
    FORMAT JSONCompact
  • Re-implemented the getGroupArrayInsertAtClauseForClickhouse function of the Python kernel. The problem with this function was that if the metric for the last year is 0, it goes missing from the final output list. The js kernel has no such problem, and the change makes the Python kernel consistent with js.

  • Changed the implementation of the ORDER BY and LIMIT clauses in the SQL statement. For the following query options:

options = {
    'orgIds': [1342004],
    'limit': 3,
    'startYear': 2015,
    'endYear': 2016,
    'startMonth': 1,
    'endMonth': 12,
    'order': 'DESC',
    'groupTimeRange': 'year'
}

The screenshots below show the results of the query; the Python kernel output differs from the js kernel output. The problem is caused by the ORDER BY and LIMIT clauses.

[Screenshots: python kernel results vs js kernel results]
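The f-string change in the first bullet above can be sketched in isolation. This hypothetical snippet builds the same WHERE clause with str.format and with an f-string; both yield identical SQL, but the f-string keeps each placeholder next to its surrounding context, which is what makes the printed multi-line SQL readable.

```python
# Hypothetical comparison of the two templating styles mentioned above.
start_year, start_month = 2016, 1
end_year = 2023

# str.format: placeholders and values are far apart.
where_format = ("created_at >= toDate('{}-{}-1') AND "
                "created_at < toDate('{}-1-1')").format(start_year, start_month, end_year)

# f-string: each value sits inline where it is used.
where_fstring = (f"created_at >= toDate('{start_year}-{start_month}-1') AND "
                 f"created_at < toDate('{end_year}-1-1')")

assert where_format == where_fstring  # identical SQL, different readability
print(where_fstring)
```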

@l1tok
Contributor

l1tok commented Aug 8, 2023

The main task of the Python refactoring is to organize the open-digger Python kernel in a more object-oriented way. Recently I have been learning about Python design patterns, but I don't think any single design pattern fits open-digger as a whole; in the refactoring I only used the singleton pattern, for the database instantiation.
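The singleton used for the database instantiation can be sketched as follows. This is a minimal hypothetical illustration (class and attribute names are invented, not the actual open-digger code): `__new__` caches the one instance so every caller shares the same connection object.

```python
# Minimal singleton sketch for a database handle; names are hypothetical.

class ClickhouseDB:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Create the object only once; later constructor calls return it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, url="http://localhost:8123"):
        # Note: __init__ still runs on every call in this simple form.
        self.url = url

db1 = ClickhouseDB()
db2 = ClickhouseDB()
assert db1 is db2  # both names refer to the single shared instance
```

A module-level instance achieves the same effect more simply; the class form is shown because the comment above speaks of the singleton pattern in the instantiation of the database.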

In addition, I did the following:

  • All metrics, dbs and open-digger are defined as classes.
  • No instantiation happens inside calls to the metric classes.
  • Refactored the code for the chaoss metrics, reducing redundancy. The functions now return both the SQL and the metrics; they can also provide the original SQL and metrics before aggregation if needed.

Changes after refactoring:

  • open-digger needs to be instantiated with openDigger = openDigger() before it can be used.
  • There is no need to type '()' after each class to call a function.
    [Before/after screenshots]
  • Returns both the metrics and the SQL; you can also choose to return the original SQL and metrics before aggregation.
    [Screenshots]
  • The chaoss metrics code was reorganized to encapsulate all count-based metric queries into one function to reduce redundancy, and similar metric queries may be encapsulated the same way later.
    [Screenshot]
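The count-metric encapsulation described in the last bullet could look roughly like the following hypothetical factory (all names, table, and column choices here are illustrative, not the actual open-digger code): each count-style metric only declares its event type and action, and one shared function builds the query.

```python
# Hypothetical factory for count-style metrics; names are illustrative.

def make_count_metric(event_type, action=None):
    """Return a metric function that counts gh_events rows of one type."""
    def metric(repo_id, start, end):
        conds = [f"type = '{event_type}'",
                 f"repo_id = {repo_id}",
                 f"created_at >= toDate('{start}')",
                 f"created_at < toDate('{end}')"]
        if action:
            conds.append(f"action = '{action}'")
        return ("SELECT COUNT() FROM opensource.gh_events WHERE "
                + " AND ".join(conds))
    return metric

# Each metric becomes a one-line declaration instead of a copied query.
issues_new = make_count_metric("IssuesEvent", action="opened")
stars = make_count_metric("WatchEvent")
print(issues_new(288431943, "2016-1-1", "2023-1-1"))
```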

Thanks very much to @xgdyp for the guidance and help.

@l1tok
Contributor

l1tok commented Sep 12, 2023

By now, the previously unimplemented chaoss metrics in the Python kernel have been implemented and refactored. In #1375, I have put all the OSPP changes in a new folder called python_v2 so as not to affect use of the previous version. Finally, further thought is needed about the form of the SQL and metrics that the functions return.

4 participants