Downloads table is borked #154

Closed
astrojuanlu opened this issue Oct 16, 2024 · 4 comments

astrojuanlu commented Oct 16, 2024

SELECT *
FROM KEDRO_BI_DB.PYPI.AGGREGATED_DOWNLOADS
LIMIT 5

(screenshot of the query result)

Unsure when this started happening


DimedS commented Oct 22, 2024

As I understand it, some optimisations have been made on the data source side (GitHub: ClickHouse/clickpy):

  1. String Storage Change:
    Strings are now stored in binary format. This is easy to fix on our side by decoding the affected columns (see the sketch after this list).

  2. Partition Pruning:
    This is a more complex change. Previously, we loaded the entire dataset daily with the following query:

    SELECT * 
    FROM pypi_downloads_per_day_by_version_by_installer_by_type 
    WHERE project LIKE 'kedro%'

    This query used to return around 600,000 rows. However, it now returns only 1,500–2,000 rows.
    I tried using FINAL and filtering by date (the dataset is likely partitioned by date), which gave slightly better results. For example:

    SELECT * 
    FROM pypi_downloads_per_day_by_version_by_installer_by_type FINAL 
    WHERE project LIKE 'kedro%' AND date >= '2024-01-01'

    This query returns around 16,000 rows - about 15% of the expected result - but the number is unstable and varies with each execution.
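
For the string conversion in item 1, a minimal sketch of the fix, assuming the query results arrive as a pandas DataFrame and the affected columns come back as raw bytes (the decoding logic is illustrative, not the actual project code):

    import pandas as pd

    def decode_binary_strings(df: pd.DataFrame) -> pd.DataFrame:
        """Decode columns that now arrive as raw bytes into UTF-8 strings."""
        for column in df.columns:
            # Only touch columns that actually contain bytes; leave everything else as-is.
            has_bytes = df[column].map(lambda value: isinstance(value, bytes)).any()
            if has_bytes:
                df[column] = df[column].map(
                    lambda value: value.decode("utf-8", errors="replace")
                    if isinstance(value, bytes)
                    else value
                )
        return df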

It seems that we can no longer retrieve the entire table as we did previously. Here are two potential solutions to address the issue:

Option 1: Switch to an Incremental Approach

We could use previously archived data stored in our database (KEDRO_BI_DB.PYPI.TEST_ALL_KEDRO_AGGREGATES) and append new data daily from ClickHouse. Selecting by a specific date from ClickHouse appears to work well.
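
For example, a rough sketch of such a per-date selection using the clickhouse-connect client (the host, credentials, and exact date column name are placeholders to be confirmed against the ClickPy instance we query):

    import clickhouse_connect

    # Connection details are placeholders; reuse whatever the ETL project already configures.
    client = clickhouse_connect.get_client(
        host="<clickhouse-host>", username="<user>", password="<password>"
    )

    # Fetch a single day's rows; in the incremental approach this would run once per date.
    df = client.query_df(
        """
        SELECT *
        FROM pypi_downloads_per_day_by_version_by_installer_by_type
        WHERE project LIKE 'kedro%'
          AND date = '2024-10-21'
        """
    )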

Cons:

  • Incremental loading can be less reliable over time, given the instability of our data source.

Option 2: Switch to the Raw PyPI Dataset in BigQuery

We could replace the ClickHouse data source with the raw PyPI dataset available in BigQuery. Although this approach would take longer to process (around 3 minutes instead of 30 seconds), it would provide a more robust solution by removing the intermediate data layer.

Additionally, this would allow us to continue with our current approach of replacing the entire table daily and offer the opportunity to include more comprehensive data in our aggregates.
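
For illustration, a minimal sketch of what this could look like against the public PyPI dataset in BigQuery, using the google-cloud-bigquery client (the aggregation columns are assumptions and would need to match our current aggregates):

    from google.cloud import bigquery

    # Uses whatever Google Cloud credentials are configured in the environment.
    client = bigquery.Client()

    query = """
        SELECT
            DATE(timestamp) AS date,
            project,
            file.version AS version,
            details.installer.name AS installer,
            COUNT(*) AS download_count
        FROM `bigquery-public-data.pypi.file_downloads`
        WHERE project LIKE 'kedro%'
          AND DATE(timestamp) >= DATE '2024-01-01'
        GROUP BY date, project, version, installer
    """

    # The bytes scanned by this query count against the monthly free query allowance.
    df = client.query(query).to_dataframe()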

Cons:

  • We need to set up a Google Cloud account, but we can initially use a personal account. Google Cloud provides free query allowances for public datasets each month, which would help manage costs.

Personally, I prefer the second approach because it offers greater stability and flexibility. What do you think, @astrojuanlu ?

astrojuanlu commented

Thanks for the analysis @DimedS, very detailed.

I agree option 2 is the simplest one to implement and gives more opportunities for future data aggregations. However, it requires a credit card. That's already a big barrier to entry, and on top of that we'd need to analyse whether our data needs would fit within the free plan of BigQuery.

Also, if we need more data than we can query with the free plan, then we'll need an incremental approach anyway.

It's not insurmountable, but could we try option 1 first? Then, if we see the source is too unstable or unreliable, we can take what we learned and try to minimise BigQuery costs. This also helps us explore possible pain points for the incremental workloads work in Kedro; see kedro-org/kedro#3578

astrojuanlu moved this to In Progress in Kedro Framework on Oct 22, 2024

DimedS commented Oct 28, 2024

I modified our PyPI downloads Kedro ETL project kedro-pypi-to-snowflake (internal link) by adding a new pipeline, load_dates, which does the following (a rough sketch follows the list):

  1. Requires dynamic parameters, start_date and end_date, to run.
  2. Retrieves data from ClickHouse for each date from start_date to end_date (inclusive), day by day.
  3. Concatenates all input data into a single Pandas DataFrame.
  4. Removes all rows within the start_date to end_date range from the table KEDRO_BI_DB.PYPI.AGGREGATED_DOWNLOADS in Snowflake storage.
  5. Appends the new data to KEDRO_BI_DB.PYPI.AGGREGATED_DOWNLOADS in Snowflake.
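
A rough sketch of what the node logic amounts to, outside the catalog (the connection handling, the fetch callable, and the date column name are illustrative, not the actual project code):

    from datetime import date, timedelta
    from typing import Callable

    import pandas as pd
    from snowflake.connector.pandas_tools import write_pandas

    def load_dates(
        start_date: date, end_date: date, fetch_day: Callable[[date], pd.DataFrame]
    ) -> pd.DataFrame:
        """Steps 1-3: fetch one DataFrame per day (via the supplied ClickHouse
        query function) and concatenate everything into a single DataFrame."""
        frames = []
        current = start_date
        while current <= end_date:
            frames.append(fetch_day(current))
            current += timedelta(days=1)
        return pd.concat(frames, ignore_index=True)

    def replace_date_range(conn, df: pd.DataFrame, start_date: date, end_date: date) -> None:
        """Steps 4-5: delete the start_date..end_date rows from the Snowflake
        table, then append the freshly loaded data."""
        with conn.cursor() as cursor:
            cursor.execute(
                "DELETE FROM KEDRO_BI_DB.PYPI.AGGREGATED_DOWNLOADS"
                " WHERE date BETWEEN %s AND %s",
                (start_date, end_date),
            )
        write_pandas(
            conn, df, table_name="AGGREGATED_DOWNLOADS",
            database="KEDRO_BI_DB", schema="PYPI",
        )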

I set up a GitHub Actions CI/CD workflow to run this pipeline daily at 6 AM, reloading data from the current day and up to five days back.

I scheduled it last Friday, and it seems to be working well. However, as mentioned earlier, the ClickHouse source we are using is unpredictable and appears to be unreliable, so let’s keep a close eye on it.

I would appreciate it if anyone could review the project. Unlike in the previous "etl" pipeline, which was more elegant (all I/O operations were encapsulated within a custom dataset), I couldn't make the ClickHouse dataset support dynamic parameters for day-by-day loading.

As a result, I had to perform the loading operations directly within the node code instead of through the catalog. Additionally, to remove data from the Snowflake dataset, I had to implement a special method, which I admit is not the most elegant solution.

astrojuanlu commented

Thanks a lot for the update @DimedS! Glad that you found a workaround for the incremental load. I looked at the pipeline runs and they look good. I'm closing this issue...

> I would appreciate it if anyone could review the project.

...and opening a new one about this.
