Skip to content

Conversation

@Sprengmeister-dev
Copy link

Summary

This PR improves test data generation by downloading the GEO series matrix directly with progress output and more robust error handling. It also upgrades the download helper to support resumable downloads, progress reporting, and forced re-downloads.

Changes

  • Use cached_download with progress output in the test data generator.
  • Parse GEO metadata directly and limit the test dataset to 10 samples without DataSource.
  • Handle corrupted cached downloads by retrying with a forced re-download.
  • Enhance cached_download with .part files, HTTP range resume, and optional progress output.
  • Remove unused plotting imports.

@marcbal77 marcbal77 self-requested a review January 6, 2026 05:56
@marcbal77
Copy link
Member

I can see you're trying to solve real pain points around download feedback and robustness. Did you run into this issue yourself or did someone you know experience this?

some concerns:

  1. Scope: This significantly changes cached_download, which is used in 12+ places across the codebase. Changes to core utilities need careful consideration and test coverage - maybe we should add tests.

  2. Test generator changes: The rewrite of generate.py bypasses DataSource and reimplements parsing logic directly. The existing test data generator should use the library's own code - otherwise we lose test coverage of DataSource.

  3. No tests: The new download logic (resume, progress, retry) has no test coverage.

  4. Formatting: make format needs to be run per development.md

I'd suggest instead:

Split this into smaller PRs:
- PR 1: Remove unused imports (I'll accept that immediately and also loop @sarudak into it)
- PR 2: Add optional progress output to cached_download (simpler scope, maybe use logging instead of print())

  • Keep generate.py using DataSource
  • The HTTP Range resume and .part file handling is nice-to-have but adds a decent amount of complextiy. Unless users are frequently hitting interrupted downloads, maybe we should holdoff.

@sarudak
Copy link
Member

sarudak commented Jan 13, 2026

Could you elaborate more on what problem you're trying to solve with these changes to cached download? I am concerned about the added complexity and we don't have any testing around it to cover any possible failure modes.

@marcbal77 marcbal77 removed their request for review January 14, 2026 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants