Downloads table is borked #154

Closed
astrojuanlu opened this issue Oct 16, 2024 · 4 comments

astrojuanlu commented Oct 16, 2024

SELECT *
FROM KEDRO_BI_DB.PYPI.AGGREGATED_DOWNLOADS
LIMIT 5

(screenshot of the query result)

Unsure when this started happening


DimedS commented Oct 22, 2024

As I understand it, some optimisations have been made on the data source side (GitHub: ClickHouse/clickpy):

  1. String Storage Change:
    Strings are now stored in binary format. This is easy to fix on our side by decoding the affected columns (see the sketch after this list).

  2. Partition Pruning:
    This is a more complex change. Previously, we loaded the entire dataset daily with the following query:

    SELECT * 
    FROM pypi_downloads_per_day_by_version_by_installer_by_type 
    WHERE project LIKE 'kedro%'

    This query used to return around 600,000 rows. However, it now returns only 1,500–2,000 rows.
    I tried using FINAL and filtering by date (the dataset is likely partitioned by date), which gave slightly better results. For example:

    SELECT * 
    FROM pypi_downloads_per_day_by_version_by_installer_by_type FINAL 
    WHERE project LIKE 'kedro%' AND date >= '2024-01-01'

    This query returns around 16,000 rows - about 15% of the expected result - but the number is unstable and varies with each execution.
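
For the string conversion in item 1, a minimal sketch of the fix, assuming the query results arrive as a pandas DataFrame and the affected columns come back as raw bytes (the decoding logic is illustrative, not the actual project code):

    import pandas as pd

    def decode_binary_strings(df: pd.DataFrame) -> pd.DataFrame:
        """Decode columns that now arrive as raw bytes into UTF-8 strings."""
        for column in df.columns:
            # Only touch columns that actually contain bytes; leave everything else as-is.
            has_bytes = df[column].map(lambda value: isinstance(value, bytes)).any()
            if has_bytes:
                df[column] = df[column].map(
                    lambda value: value.decode("utf-8", errors="replace")
                    if isinstance(value, bytes)
                    else value
                )
        return df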

It seems that we can no longer retrieve the entire table as we did previously. Here are two potential solutions to address the issue:

Option 1: Switch to an Incremental Approach

We could use previously archived data stored in our database (KEDRO_BI_DB.PYPI.TEST_ALL_KEDRO_AGGREGATES) and append new data daily from ClickHouse. Selecting by a specific date from ClickHouse appears to work well.
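
For example, a rough sketch of such a per-date selection using the clickhouse-connect client (the host, credentials, and exact date column name are placeholders to be confirmed against the ClickPy instance we query):

    import clickhouse_connect

    # Connection details are placeholders; reuse whatever the ETL project already configures.
    client = clickhouse_connect.get_client(
        host="<clickhouse-host>", username="<user>", password="<password>"
    )

    # Fetch a single day's rows; in the incremental approach this would run once per date.
    df = client.query_df(
        """
        SELECT *
        FROM pypi_downloads_per_day_by_version_by_installer_by_type
        WHERE project LIKE 'kedro%'
          AND date = '2024-10-21'
        """
    )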

Cons:

  • Incremental loading can be less reliable over time, given the instability of our data source.

Option 2: Switch to the Raw PyPI Dataset in BigQuery

We could replace the ClickHouse data source with the raw PyPI dataset available in BigQuery. Although this approach would take longer to process (around 3 minutes instead of 30 seconds), it would provide a more robust solution by removing the intermediate data layer.

Additionally, this would allow us to continue with our current approach of replacing the entire table daily and offer the opportunity to include more comprehensive data in our aggregates.
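
For illustration, a minimal sketch of what this could look like against the public PyPI dataset in BigQuery, using the google-cloud-bigquery client (the aggregation columns are assumptions and would need to match our current aggregates):

    from google.cloud import bigquery

    # Uses whatever Google Cloud credentials are configured in the environment.
    client = bigquery.Client()

    query = """
        SELECT
            DATE(timestamp) AS date,
            project,
            file.version AS version,
            details.installer.name AS installer,
            COUNT(*) AS download_count
        FROM `bigquery-public-data.pypi.file_downloads`
        WHERE project LIKE 'kedro%'
          AND DATE(timestamp) >= DATE '2024-01-01'
        GROUP BY date, project, version, installer
    """

    # The bytes scanned by this query count against the monthly free query allowance.
    df = client.query(query).to_dataframe()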

Cons:

  • We need to set up a Google Cloud account, but we can initially use a personal account. Google Cloud provides free query allowances for public datasets each month, which would help manage costs.

Personally, I prefer the second approach because it offers greater stability and flexibility. What do you think, @astrojuanlu ?

astrojuanlu commented

Thanks for the analysis @DimedS, very detailed.

I agree option 2 is the simplest one to implement and gives more opportunities for future data aggregations. However, it requires a credit card. That's already a big barrier to entry, and on top of that we'd need to analyse whether our data needs would fit within the free plan of BigQuery.

Also, if we need more data than we can query with the free plan, then we'll need an incremental approach anyway.

It's not insurmountable, but could we try option 1 first? Then, if we see the source is too unstable or unreliable, we can take what we learned and try to minimise BigQuery costs. This also helps us explore possible pain points for the incremental workloads work in Kedro; see kedro-org/kedro#3578

astrojuanlu moved this to In Progress in Kedro Framework on Oct 22, 2024

DimedS commented Oct 28, 2024

I modified our PyPI downloads Kedro ETL project kedro-pypi-to-snowflake (internal link) by adding a new pipeline, load_dates, which does the following (a rough sketch follows the list):

  1. Requires dynamic parameters, start_date and end_date, to run.
  2. Retrieves data from ClickHouse for each date from start_date to end_date (inclusive), day by day.
  3. Concatenates all input data into a single Pandas DataFrame.
  4. Removes all rows within the start_date to end_date range from the table KEDRO_BI_DB.PYPI.AGGREGATED_DOWNLOADS in Snowflake storage.
  5. Appends the new data to KEDRO_BI_DB.PYPI.AGGREGATED_DOWNLOADS in Snowflake.
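
A rough sketch of what the node logic amounts to, outside the catalog (the connection handling, the fetch callable, and the date column name are illustrative, not the actual project code):

    from datetime import date, timedelta
    from typing import Callable

    import pandas as pd
    from snowflake.connector.pandas_tools import write_pandas

    def load_dates(
        start_date: date, end_date: date, fetch_day: Callable[[date], pd.DataFrame]
    ) -> pd.DataFrame:
        """Steps 1-3: fetch one DataFrame per day (via the supplied ClickHouse
        query function) and concatenate everything into a single DataFrame."""
        frames = []
        current = start_date
        while current <= end_date:
            frames.append(fetch_day(current))
            current += timedelta(days=1)
        return pd.concat(frames, ignore_index=True)

    def replace_date_range(conn, df: pd.DataFrame, start_date: date, end_date: date) -> None:
        """Steps 4-5: delete the start_date..end_date rows from the Snowflake
        table, then append the freshly loaded data."""
        with conn.cursor() as cursor:
            cursor.execute(
                "DELETE FROM KEDRO_BI_DB.PYPI.AGGREGATED_DOWNLOADS"
                " WHERE date BETWEEN %s AND %s",
                (start_date, end_date),
            )
        write_pandas(
            conn, df, table_name="AGGREGATED_DOWNLOADS",
            database="KEDRO_BI_DB", schema="PYPI",
        )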

I set up a GitHub Actions CI/CD workflow to run this pipeline daily at 6 AM, reloading data from the current day and up to five days back.

I scheduled it last Friday, and it seems to be working well. However, as mentioned earlier, the ClickHouse source we are using is unpredictable and appears to be unreliable, so let’s keep a close eye on it.

I would appreciate it if anyone could review the project. Unlike in the previous "etl" pipeline, which was more elegant (all I/O operations were encapsulated within a custom dataset), I couldn't make the ClickHouse dataset support dynamic parameters for day-by-day loading.

As a result, I had to perform the loading operations directly within the node code instead of through the catalog. Additionally, to remove data from the Snowflake dataset, I had to implement a special method, which I admit is not the most elegant solution.

astrojuanlu commented

Thanks a lot for the update @DimedS! Glad that you found a workaround for the incremental load. I looked at the pipeline runs and they look good. I'm closing this issue...

> I would appreciate it if anyone could review the project.

...and opening a new one about this.
