Downloads table is borked #154

Unsure when this started happening.

Comments
As I understand it, some optimisations have been made on the data source side (GitHub: ClickHouse/clickpy):
It seems that we can no longer retrieve the entire table as we did previously. Here are two potential solutions to address the issue:

Option 1: Switch to an Incremental Approach
We could use previously archived data stored in our database.

Cons:
Option 2: Switch to the Raw PyPI Dataset in BigQuery
We could replace the ClickHouse data source with the raw PyPI dataset available in BigQuery (see the sketch below). Although this approach would take longer to process (around 3 minutes instead of 30 seconds), it would provide a more robust solution by removing the intermediate data layer. Additionally, this would allow us to continue with our current approach of replacing the entire table daily, and it would offer the opportunity to include more comprehensive data in our aggregates.

Cons:
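For illustration, here is a minimal sketch of what option 2 could look like. It assumes the public bigquery-public-data.pypi.file_downloads table and the google-cloud-bigquery client; the package name (kedro) and the 30-day window are placeholder choices, not the project's actual query:

```python
# Hypothetical sketch of querying the raw PyPI download logs in BigQuery.
# Assumes application-default credentials are configured for a GCP project.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT
        DATE(timestamp) AS download_date,
        COUNT(*) AS num_downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE project = 'kedro'   -- placeholder package name
      AND DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY download_date
    ORDER BY download_date
"""

# Run the query and pull the aggregated daily counts into a pandas DataFrame.
df = client.query(query).to_dataframe()
print(df.head())
```

Restricting the query to a recent window on the timestamp column should also help keep the bytes scanned (and therefore the cost counted against BigQuery's free query quota) manageable.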
Personally, I prefer the second approach because it offers greater stability and flexibility. What do you think, @astrojuanlu? |
Thanks for the analysis @DimedS, very detailed. I agree option 2 is the simplest one to implement and gives more opportunities for future data aggregations. However, it requires a credit card. That's already a big barrier to entry, and on top of that we'd need to analyse whether our data needs would fit within the free plan of BigQuery. Also, if we need more data than what we can query with the free plan, then we'll need an incremental approach anyway. It's not insurmountable, but could we try option 1 first? Then, if we see the source is too unstable or unreliable, we can take what we learned to try to minimise BigQuery costs. This also helps us explore possible pain points for incremental workloads in Kedro, see kedro-org/kedro#3578 |
I modified our PyPI downloads Kedro ETL project kedro-pypi-to-snowflake (internal link) by adding a new pipeline.
I set up a GitHub Actions CI/CD workflow to run this pipeline daily at 6 AM, reloading data for the current day and up to five days back. I scheduled it last Friday, and it seems to be working well. However, as mentioned earlier, the ClickHouse source we are using is unpredictable and appears to be unreliable, so let's keep a close eye on it. I would appreciate it if anyone could review the project.

Unlike the previous "etl" pipeline, which was more elegant (all I/O operations were encapsulated within a custom dataset), I couldn't modify the ClickHouse dataset to support dynamic parameters for day-by-day loading. As a result, I had to perform the loading operations directly within the node code instead of through the catalog (a rough sketch of the idea follows below). Additionally, to remove data from the Snowflake dataset, I had to implement a special method, which I admit is not the most elegant solution. |
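For context, here is a rough, hypothetical sketch of what such a node-level incremental reload might look like. This is not the actual kedro-pypi-to-snowflake code; the host, credentials, table, and column names are all illustrative assumptions:

```python
# Hypothetical Kedro-style node that reloads the last few days of download
# counts from a ClickHouse source into Snowflake (delete window, then append).
from datetime import date, timedelta

import clickhouse_connect
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas


def reload_recent_downloads(days_back: int = 5) -> None:
    start_date = date.today() - timedelta(days=days_back)

    # Pull only the recent window from the ClickHouse source.
    ch_client = clickhouse_connect.get_client(
        host="clickhouse.example.com",  # placeholder host
        username="play",                # placeholder user
        secure=True,
    )
    df = ch_client.query_df(
        """
        SELECT date, project, sum(count) AS downloads
        FROM pypi.pypi_downloads_per_day   -- placeholder table name
        WHERE project = 'kedro' AND date >= %(start)s
        GROUP BY date, project
        """,
        parameters={"start": start_date},
    )

    # Replace the same window in an existing Snowflake table: delete, then append.
    # Note: as written this is not transactional; a failure between the two steps
    # would leave the window empty until the next scheduled run.
    sf_conn = snowflake.connector.connect(
        account="...", user="...", password="...",
        database="PYPI", schema="PUBLIC",  # placeholder identifiers
    )
    try:
        sf_conn.cursor().execute(
            "DELETE FROM DOWNLOADS WHERE DOWNLOAD_DATE >= %s", (start_date,)
        )
        write_pandas(sf_conn, df, "DOWNLOADS")
    finally:
        sf_conn.close()
```

Deleting and re-inserting the same trailing window on every run keeps the job idempotent even if a given day is reprocessed several times.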
Thanks a lot for the update @DimedS! Glad that you found a workaround for the incremental load. I looked at the pipeline runs and they look good. I'm closing this issue...
...and opening a new one about this. |