Refactor pulling dataset rows #617

ilongin · 2024-11-23T00:45:33Z

As we are changing the way we export files to S3 in https://github.com/iterative/studio/pull/10966, we needed to adjust the CLI code as well to get better performance.
Chunks are now exported to S3 in order, so chunks with a lower index become available first. The CLI takes advantage of this by fetching lower-index chunks first, which avoids idling while later chunks are still being exported.
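
For illustration only, here is a minimal sketch of the ordering idea, not the code in this PR. It assumes each chunk is identified by an `(index, url)` pair; the function name `order_chunk_urls` and the example URLs are hypothetical.

```python
from collections import deque


def order_chunk_urls(indexed_urls):
    """Return a queue of chunk URLs sorted by ascending chunk index.

    Chunks with a lower index are exported first, so they should be
    requested first to reduce time spent waiting on missing chunks.
    """
    ordered = sorted(indexed_urls, key=lambda item: item[0])
    return deque(url for _, url in ordered)


# Hypothetical chunk URLs, listed in arbitrary order.
chunks = [
    (2, "https://example.com/chunk-2"),
    (0, "https://example.com/chunk-0"),
    (1, "https://example.com/chunk-1"),
]
queue = order_chunk_urls(chunks)
```

Workers that pop from the left of such a queue would request chunk 0 before chunk 2, matching the export order.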

codecov · 2024-11-23T00:51:32Z

Codecov Report

Attention: Patch coverage is 85.71429% with 5 lines in your changes missing coverage. Please review.

Project coverage is 87.46%. Comparing base (10e90c5) to head (8e56e86).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/datachain/catalog/catalog.py	86.66%	2 Missing and 2 partials ⚠️
src/datachain/data_storage/sqlite.py	80.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #617      +/-   ##
==========================================
+ Coverage   87.44%   87.46%   +0.01%     
==========================================
  Files         114      114              
  Lines       10898    10910      +12     
  Branches     1499     1501       +2     
==========================================
+ Hits         9530     9542      +12     
  Misses        990      990              
  Partials      378      378

Flag	Coverage Δ
datachain	`87.39% <85.71%> (+0.01%)`	⬆️


cloudflare-workers-and-pages · 2024-11-25T14:43:04Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`8e56e86`
Status:	✅ Deploy successful!
Preview URL:	https://27534ae8.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-616-refactor-datacha.datachain-documentation.pages.dev



amritghimire · 2024-12-06T14:20:34Z

src/datachain/catalog/catalog.py

+            if self.should_check_for_status():
+                self.check_for_status()
+            r = requests.get(url, timeout=PULL_DATASET_CHUNK_TIMEOUT)
+            if r.status_code == 404:


Sorry if this is a silly question. How likely is it that we get stuck in an infinite loop here if the URL is actually incorrect and keeps returning a 404 response?

Nice catch! Do we need a retry counter here?

Good question, but we don't need a retry counter here. A 404 is expected, because this particular chunk may not have been exported to S3 yet (Studio exports chunks in parallel with this loop). If something actually fails and a chunk never reaches S3, so the 404 would repeat forever, the export itself fails in Studio. This loop also checks the status of the whole export job on Studio every 20 seconds, so once we see that the export failed, we print an error and end the loop.
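
To make the reasoning above concrete, here is a rough sketch of such a loop. It is a sketch under assumptions, not the implementation in `catalog.py`: the names `fetch_chunk`, `ExportFailedError`, `get`, and `check_export_status`, the `"failed"` status value, and the 1-second retry delay are all hypothetical. The HTTP call and the clock are passed in as parameters so the example runs without network access; only the 20-second status interval comes from the comment above.

```python
import time


class ExportFailedError(Exception):
    """Raised when the export job is reported as failed (hypothetical)."""


def fetch_chunk(url, get, check_export_status, *, status_interval=20.0,
                retry_delay=1.0, sleep=time.sleep, clock=time.monotonic):
    """Retry on 404 until the chunk appears or the export job fails.

    `get(url)` must return a `(status_code, body)` tuple and
    `check_export_status()` must return a status string.
    """
    last_check = clock()
    while True:
        # Periodically ask whether the whole export job has failed,
        # so a permanent 404 cannot keep the loop alive forever.
        if clock() - last_check >= status_interval:
            last_check = clock()
            if check_export_status() == "failed":
                raise ExportFailedError("dataset export failed")

        status_code, body = get(url)
        if status_code == 404:
            # Expected: the chunk may simply not be exported yet.
            sleep(retry_delay)
            continue
        if status_code != 200:
            raise RuntimeError(f"unexpected status {status_code} for {url}")
        return body
```

The loop has no retry counter. Termination relies on the export job eventually either producing the chunk or reporting failure.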

ilongin added 2 commits November 23, 2024 01:40

refactor pulling dataset rows

421c018

Merge branch 'main' into ilongin/616-refactor-datachain-pull

172245a

ilongin linked an issue Nov 23, 2024 that may be closed by this pull request

Refactor pulling dataset rows from Studio #616

Closed

ilongin marked this pull request as draft November 23, 2024 00:45

Merge branch 'main' into ilongin/616-refactor-datachain-pull

962b525

ilongin marked this pull request as ready for review November 25, 2024 23:19

ilongin requested a review from a team November 25, 2024 23:20

ilongin added 2 commits November 26, 2024 13:57

Merge branch 'main' into ilongin/616-refactor-datachain-pull

6f848ec

added coverage of checking for export status

2134ce8

ilongin requested review from dreadatour, amritghimire, skshetry and mattseddon November 28, 2024 14:50

mattseddon approved these changes Nov 29, 2024


ilongin added 5 commits November 29, 2024 11:11

Merge branch 'main' into ilongin/616-refactor-datachain-pull

6754072

Merge branch 'ilongin/616-refactor-datachain-pull' of github.com:iter…

2a8fd56

…ative/datachain into ilongin/616-refactor-datachain-pull

merging with main

b45e434

remove comment

b286c39

Merge branch 'main' into ilongin/616-refactor-datachain-pull

8f76e2f

amritghimire reviewed Dec 6, 2024


Merge branch 'main' into ilongin/616-refactor-datachain-pull

8e56e86

ilongin temporarily deployed to internal December 17, 2024 10:54 — with GitHub Actions Inactive

ilongin requested a review from amritghimire December 18, 2024 00:54

ilongin merged commit 983cbd8 into main Dec 19, 2024
33 of 34 checks passed

ilongin deleted the ilongin/616-refactor-datachain-pull branch December 19, 2024 10:39


amritghimire Dec 6, 2024

dreadatour Dec 7, 2024

ilongin Dec 18, 2024
