
[OSPP2023] Refactor OpenDigger by sql query builder #1265

Closed
xgdyp opened this issue Apr 24, 2023 · 14 comments
Labels
kind/enhancement Category issues or prs related to enhancement. waiting for author need issue author's feedback

Comments

@xgdyp
Contributor

xgdyp commented Apr 24, 2023

Description

Description: Complex, long SQL is difficult to maintain, which is not conducive to the development of the project. One way to solve this problem is to generate SQL through a SQL builder, which can make the SQL easier to read, especially subqueries. So we want to explore whether there is a mature framework that can help OpenDigger.
In Python, we can use pypika, a Python package that supports some of ClickHouse's SQL syntax. What we need to do is investigate in detail whether it can cover our SQL, and whether it will really improve OpenDigger.

If yes, we can refactor the Python kernel first as the OSPP task. I think I can follow up on this project.
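To make the proposal concrete, here is a minimal pure-Python sketch of the query-builder idea. This is hypothetical illustration code, not pypika itself (pypika's actual entry points are `Query.from_(...)`, `Table`, and `Field`); it only shows how composable parts replace hand-concatenated SQL strings.

```python
# Hypothetical sketch of a SQL builder (NOT pypika); shows the idea of
# composing a query from parts instead of concatenating strings.

class SimpleQuery:
    """Builds a SELECT statement from composable, chainable parts."""

    def __init__(self, table):
        self._table = table
        self._columns = []
        self._where = []
        self._group_by = []

    def select(self, *cols):
        self._columns.extend(cols)
        return self  # chainable, like pypika's fluent API

    def where(self, cond):
        self._where.append(cond)
        return self

    def group_by(self, *cols):
        self._group_by.extend(cols)
        return self

    def to_sql(self):
        sql = "SELECT {} FROM {}".format(", ".join(self._columns), self._table)
        if self._where:
            sql += " WHERE " + " AND ".join(self._where)
        if self._group_by:
            sql += " GROUP BY " + ", ".join(self._group_by)
        return sql

q = (SimpleQuery("opensource.gh_events")
     .select("repo_id", "COUNT() AS count")
     .where("type = 'IssuesEvent'")
     .group_by("repo_id"))
print(q.to_sql())
```

Each clause lives in one place, so a reader can follow the query structure without mentally reassembling string fragments.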

Expected outcomes:
Metrics generated by a Python SQL builder

Skills:
Python, TypeScript, JavaScript, SQL

References:
https://github.com/didi/gendry
https://github.com/sqlkata/querybuilder
https://github.com/ibis-project/ibis
https://github.com/doug-martin/goqu

@open-digger-bot open-digger-bot bot added the kind/enhancement Category issues or prs related to enhancement. label Apr 24, 2023
@github-actions github-actions bot added the waiting for repliers need other's feedback label Apr 24, 2023
@tc2000731
Contributor

Hello @xgdyp,

I am TianchenZhou from East China Normal University, and I am very interested in this project. I understand that OpenDigger is an open source analysis project that aims to combine the wisdom of developers worldwide to jointly analyze and gain insight into open source data, helping everyone better understand and participate in open source. This project makes me curious and eager to contribute to it.

Therefore, I would like to ask you, what pre-tasks can I do to help me start my contribution journey? Or, what preparations can I do to increase my chances of being selected?

Thank you very much for your time and guidance. I look forward to your reply!


@github-actions github-actions bot added waiting for author need issue author's feedback and removed waiting for repliers need other's feedback labels Apr 28, 2023
@tc2000731
Contributor

I would like to use PyPika as a reference to implement a Python SQL builder. PyPika is a simple and flexible SQL query builder that supports multiple database platforms and analytical features. I think PyPika is a good model because it reflects the richness and logic of the SQL language, rather than simply concatenating strings. Although PyPika cannot fully support ClickHouse queries, it already covers most of the required features.

I strongly agree with using a SQL builder to construct queries, because it can improve the readability and maintainability of the code and also avoid security risks such as SQL injection. Therefore, I would like to learn more about PyPika's design and implementation, and discuss my work plan and progress with you. Could you please give me some guidance or suggestions, so that I know what standard my work should meet?

@xgdyp
Contributor Author

xgdyp commented Apr 29, 2023

Hi, welcome. Can you try pypika first to see how much of ClickHouse it supports?

@github-actions github-actions bot added waiting for repliers need other's feedback and removed waiting for author need issue author's feedback labels Apr 29, 2023
@xgdyp
Contributor Author

xgdyp commented Apr 30, 2023

Some information in this link.

@tc2000731
Contributor

Hi, welcome. Can you try pypika first to see how much of ClickHouse it supports?

OK, I'll report what I learn later.

@github-actions github-actions bot added waiting for author need issue author's feedback and removed waiting for repliers need other's feedback labels Apr 30, 2023
@tc2000731
Contributor

@xgdyp There are multiple queries that call this function:

def getGroupArrayInsertAtClauseForClickhouse(config, option) -> str:
    """Build a groupArrayInsertAt clause for ClickHouse.

    Args:
        config (dict): QueryConfig
        option (dict): { key: string; defaultValue?: string; value?: string; }
    """
    def process():
        # Without a group time range, every value is inserted at position 0.
        if config.get('groupTimeRange') is None:
            return '0'
        startTime = "toDate('{}-{}-1')".format(config.get('startYear'), config.get('startMonth'))
        if config.get('groupTimeRange') == 'quarter':
            startTime = 'toStartOfQuarter({})'.format(startTime)
        elif config.get('groupTimeRange') == 'year':
            startTime = 'toStartOfYear({})'.format(startTime)
        # Position = number of group intervals between startTime and the row's time.
        return "toUInt32(dateDiff('{}', {}, time))".format(config.get('groupTimeRange'), startTime)

    params = '({})'.format(option.get('defaultValue')) if option.get('defaultValue') is not None else ''
    value = option.get('value') if option.get('value') is not None else option.get('key')
    return 'groupArrayInsertAt{}({}, {}) AS {}'.format(params, value, process(), option.get('key'))

This function uses ClickHouse's built-in function groupArrayInsertAt, a very special kind of aggregate function. Apparently, this aggregate function does not exist in pypika.

  • How many other unique functions like this exist?
    • I am not completely sure yet, but according to the official manual, the number of aggregate functions alone is not small. 1
    • TODO: I need further work to retrieve and confirm how many of them are used in the OpenDigger project.
  • How should this type of unique function be addressed?
    • TODO: I currently think the pypika extension mechanism is a good lead 2, but not necessarily the perfect solution.
    • Worst case: implement a separate ClickHouse-specific SQL generator.

Footnotes

  1. List of Aggregate Functions | ClickHouse Docs

  2. Extending PyPika --- PyPika 0.35.16 documentation
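As a thought experiment for the "how should this be addressed" question above, here is a hypothetical pure-Python helper that renders the parametric-aggregate syntax `groupArrayInsertAt(default)(value, pos)`. This is not pypika code; pypika does ship a `CustomFunction` helper and an extension mechanism for ordinary functions, but ClickHouse's two-argument-list parametric form would likely need custom rendering along these lines. All names below are illustrative.

```python
# Hypothetical renderer for a ClickHouse parametric aggregate; not part
# of pypika. Shows the two-argument-list form groupArrayInsertAt(x)(v, pos).

def group_array_insert_at(value, pos, default=None, alias=None):
    """Render groupArrayInsertAt[(default)](value, pos) [AS alias]."""
    params = "({})".format(default) if default is not None else ""
    sql = "groupArrayInsertAt{}({}, {})".format(params, value, pos)
    if alias:
        sql += " AS {}".format(alias)
    return sql

print(group_array_insert_at("ROUND(count, 2)", "idx", default=0, alias="counts"))
# -> groupArrayInsertAt(0)(ROUND(count, 2), idx) AS counts
```

A real pypika integration would instead subclass a pypika term so the result composes with the rest of a query, but the string shape it must emit is the one shown here.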

@tc2000731
Contributor

@xgdyp I need further support on the points below; I hope you can tutor me or give me a little help.

I need a few simple SQL statements to use as examples. Most SQL in this codebase is embedded in string templates, and I need complete SQL statements for tests, so please tell me where I can find some, or provide some if possible.

@xgdyp
Contributor Author

xgdyp commented May 2, 2023

@xgdyp I need further support on the points below; I hope you can tutor me or give me a little help.

I need a few simple SQL statements to use as examples. Most SQL in this codebase is embedded in string templates, and I need complete SQL statements for tests, so please tell me where I can find some, or provide some if possible.

Nice work.

You can use the sample data as your test playground. For SQL, you can also look in the js kernel; just print the spliced query.

In OpenDigger, we don't use many ClickHouse-specific syntax features (maybe arrayJoin, multiIf, groupArrayInsertAt and some time functions). I think functions like groupArrayInsertAt are easy to extend in pypika, but that project has not been updated since last year, so it may be a little tricky.

I agree with you that the first thing we should do is find out whether using pypika is feasible. You can also try to run OpenDigger first; feel free to ask me if you have any questions.

@github-actions github-actions bot added waiting for repliers need other's feedback and removed waiting for author need issue author's feedback labels May 2, 2023
@tc2000731
Contributor

@xgdyp I need further support on the points below; I hope you can tutor me or give me a little help.
I need a few simple SQL statements to use as examples. Most SQL in this codebase is embedded in string templates, and I need complete SQL statements for tests, so please tell me where I can find some, or provide some if possible.

Nice work.

You can use the sample data as your test playground. For SQL, you can also look in the js kernel; just print the spliced query.

In OpenDigger, we don't use many ClickHouse-specific syntax features (maybe arrayJoin, multiIf, groupArrayInsertAt and some time functions). I think functions like groupArrayInsertAt are easy to extend in pypika, but that project has not been updated since last year, so it may be a little tricky.

I agree with you that the first thing we should do is find out whether using pypika is feasible. You can also try to run OpenDigger first; feel free to ask me if you have any questions.

@xgdyp It's hard for me to reconstruct the original SQL from these string templates; could you please point me to some simple ones?

@github-actions github-actions bot added waiting for author need issue author's feedback and removed waiting for repliers need other's feedback labels May 2, 2023
@xgdyp
Contributor Author

xgdyp commented May 8, 2023

Here are some steps for your onboarding:

  1. Learn Docker and other tools so you can use the sample data; just get ClickHouse running first.
  2. Use a tool like DataGrip or Navicat to connect to the ClickHouse container, try writing some queries, and get familiar with the database schema.
  3. Run the js kernel and use its metrics to explore some GitHub repos.
  4. Then you can get the raw SQL by console.log()-ing the metric query.
  5. Explore the query builder.

@github-actions github-actions bot added waiting for repliers need other's feedback and removed waiting for author need issue author's feedback labels May 8, 2023
@tc2000731
Contributor

Here are some steps for your onboarding:

  1. Learn Docker and other tools so you can use the sample data; just get ClickHouse running first.
  2. Use a tool like DataGrip or Navicat to connect to the ClickHouse container, try writing some queries, and get familiar with the database schema.
  3. Run the js kernel and use its metrics to explore some GitHub repos.
  4. Then you can get the raw SQL by console.log()-ing the metric query.
  5. Explore the query builder.

Thanks, I'll start from here.

@github-actions github-actions bot added waiting for author need issue author's feedback and removed waiting for repliers need other's feedback labels May 13, 2023
@l1tok
Contributor

l1tok commented Jul 23, 2023

The CustomFunction helper or a custom class in the pypika library can implement ClickHouse-specific functions. However, I think the SQL-builder refactoring is actually not significantly different from the current SQL-string implementation. I think we should focus on improving the open-digger Python kernel, which serves the same ultimate goal as refactoring with a SQL builder: there are some problems with the Python kernel implementation, as well as inconsistencies with the js kernel, and solving these problems is also a prerequisite for any SQL-builder refactoring.

My work so far is as follows:

  • The first problem I solved was that the Python kernel SQL printed with print() was extremely unreadable. I used f-strings instead of the original format() calls to build the SQL. Here is the difference in the output of print(sql) in chaoss.py:
    SELECT     id,     argMax(name, time) AS name,     SUM(count) AS total_count,     groupArrayInsertAt(
        0, 
        toUInt32(dateDiff('year', toDate('2016-1-1'), toDate('2022-12-1'))) + 1)(ROUND(count, 2), 
        toUInt32(dateDiff('year', toStartOfYear(toDate('2016-1-1')), time))) AS issues_new_count     FROM     (SELECT         toStartOfYear(created_at) AS time, repo_id AS id, argMax(repo_name, time) AS name,         COUNT() AS count     FROM opensource.gh_events WHERE type = 'IssuesEvent' AND action = 'opened' AND (repo_id IN [288431943]) AND  created_at >= toDate('2016-1-1') AND created_at < toDate('2023-1-1')      GROUP BY id, time     ORDER BY count DESC LIMIT 10 BY time     )     GROUP BY id     ORDER BY issues_new_count[-1] DESC     FORMAT JSONCompact
    SELECT 
    id, 
    argMax(name, time) AS name, 
    SUM(count) AS total_count, 
    groupArrayInsertAt(
        0, 
        toUInt32(dateDiff('year', toDate('2016-1-1'), toDate('2022-12-1'))) + 1)(ROUND(count, 2), 
        toUInt32(dateDiff('year', toStartOfYear(toDate('2016-1-1')), time))) AS issues_new_count 
    FROM 
    (
    
    SELECT 
        toStartOfYear(created_at) AS time, repo_id AS id, argMax(repo_name, time) AS name, 
        COUNT() AS count 
    FROM opensource.gh_events           
    WHERE type = 'IssuesEvent' AND action = 'opened' AND (repo_id IN [288431943]) AND  created_at >= toDate('2016-1-1') AND created_at < toDate('2023-1-1')  
    GROUP BY id, time 
    ) 
    GROUP BY id 
    ORDER BY issues_new_count[-1] DESC LIMIT 10
    
    FORMAT JSONCompact
  • Re-implemented the getGroupArrayInsertAtClauseForClickhouse function of the Python kernel. The problem with this function was that if the metric for the last year is 0, it goes missing from the final output list. The js kernel has no such problem, and the change makes the Python kernel consistent with js.

  • Changed the implementation of the ORDER BY and LIMIT clauses in the SQL statement. For the following query options:

options = {
    'orgIds': [1342004],
    'limit': 3,
    'startYear': 2015,
    'endYear': 2016,
    'startMonth': 1,
    'endMonth': 12,
    'order': 'DESC',
    'groupTimeRange': 'year'
}

The screenshots below show the results of the query; the Python kernel output differs from the js kernel output. The problem is caused by the ORDER BY and LIMIT clauses.

[Screenshots: python kernel results vs js kernel results]
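The f-string change in the first bullet above can be sketched in isolation. This hypothetical snippet builds the same WHERE clause with str.format and with an f-string; both yield identical SQL, but the f-string keeps each placeholder next to its surrounding context, which is what makes the printed multi-line SQL readable.

```python
# Hypothetical comparison of the two templating styles mentioned above.
start_year, start_month = 2016, 1
end_year = 2023

# str.format: placeholders and values are far apart.
where_format = ("created_at >= toDate('{}-{}-1') AND "
                "created_at < toDate('{}-1-1')").format(start_year, start_month, end_year)

# f-string: each value sits inline where it is used.
where_fstring = (f"created_at >= toDate('{start_year}-{start_month}-1') AND "
                 f"created_at < toDate('{end_year}-1-1')")

assert where_format == where_fstring  # identical SQL, different readability
print(where_fstring)
```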

@l1tok
Contributor

l1tok commented Aug 8, 2023

The main task of the Python refactoring is to organize the open-digger Python kernel in a more object-oriented way. Recently I have been learning about Python design patterns, but I don't think any single design pattern fits open-digger as a whole; in the refactoring I only used the singleton pattern, for the database instantiation.
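The singleton used for the database instantiation can be sketched as follows. This is a minimal hypothetical illustration (class and attribute names are invented, not the actual open-digger code): `__new__` caches the one instance so every caller shares the same connection object.

```python
# Minimal singleton sketch for a database handle; names are hypothetical.

class ClickhouseDB:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Create the object only once; later constructor calls return it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, url="http://localhost:8123"):
        # Note: __init__ still runs on every call in this simple form.
        self.url = url

db1 = ClickhouseDB()
db2 = ClickhouseDB()
assert db1 is db2  # both names refer to the single shared instance
```

A module-level instance achieves the same effect more simply; the class form is shown because the comment above speaks of the singleton pattern in the instantiation of the database.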

In addition, I did the following:

  • All metrics, dbs and open-digger are defined as classes.
  • No instantiation happens inside calls to the metric classes.
  • Refactored the code for the chaoss metrics, reducing redundancy. The functions now return both the SQL and the metrics; they can also provide the original SQL and metrics before aggregation if needed.

Changes after refactoring:

  • open-digger needs to be instantiated with openDigger = openDigger() before it can be used.
  • There is no need to type '()' after each class to call a function.
    [Before/after screenshots]
  • Returns both the metrics and the SQL; you can also choose to return the original SQL and metrics before aggregation.
    [Screenshots]
  • The chaoss metrics code was reorganized to encapsulate all count-based metric queries into one function to reduce redundancy, and similar metric queries may be encapsulated the same way later.
    [Screenshot]
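The count-metric encapsulation described in the last bullet could look roughly like the following hypothetical factory (all names, table, and column choices here are illustrative, not the actual open-digger code): each count-style metric only declares its event type and action, and one shared function builds the query.

```python
# Hypothetical factory for count-style metrics; names are illustrative.

def make_count_metric(event_type, action=None):
    """Return a metric function that counts gh_events rows of one type."""
    def metric(repo_id, start, end):
        conds = [f"type = '{event_type}'",
                 f"repo_id = {repo_id}",
                 f"created_at >= toDate('{start}')",
                 f"created_at < toDate('{end}')"]
        if action:
            conds.append(f"action = '{action}'")
        return ("SELECT COUNT() FROM opensource.gh_events WHERE "
                + " AND ".join(conds))
    return metric

# Each metric becomes a one-line declaration instead of a copied query.
issues_new = make_count_metric("IssuesEvent", action="opened")
stars = make_count_metric("WatchEvent")
print(issues_new(288431943, "2016-1-1", "2023-1-1"))
```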

Thanks very much to @xgdyp for the guidance and help.

@l1tok
Contributor

l1tok commented Sep 12, 2023

By now, the previously unimplemented chaoss metrics in the Python kernel have been implemented and refactored. In #1375, I have put all the OSPP changes in a new folder called python_v2 so as not to affect use of the previous version. Finally, further thought is needed about the form of the SQL and metrics that the functions return.

4 participants