Refactor pulling dataset rows #617
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@           Coverage Diff            @@
##             main     #617    +/-   ##
==========================================
+ Coverage   87.44%   87.46%   +0.01%
==========================================
  Files         114      114
  Lines       10898    10910      +12
  Branches     1499     1501       +2
==========================================
+ Hits         9530     9542      +12
  Misses        990      990
  Partials      378      378

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
if self.should_check_for_status():
    self.check_for_status()
r = requests.get(url, timeout=PULL_DATASET_CHUNK_TIMEOUT)
if r.status_code == 404:
Sorry if this is a silly question. How likely is it that we get stuck in an infinite loop here if the URL is actually incorrect and keeps returning a 404 response?
Nice catch! Do we need a retry counter here?
Good question, but we don't need a retry counter here. A 404 is expected, since this particular chunk may not have been exported to S3 yet (Studio exports chunks in parallel with this loop). If something actually fails and a chunk can never be exported to S3, which would mean a 404 forever, the export itself fails in Studio, and in this loop we also check the status of the whole export job on Studio every 20 seconds. Once we see that exporting the dataset failed on Studio, we print an error and end the loop.
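Roughly, the loop behaves like this (a minimal sketch, not the actual implementation; the constants and the `get_export_status` callable are assumptions made for illustration, loosely based on the snippet above):

```python
import time
import requests

PULL_DATASET_CHUNK_TIMEOUT = 30  # assumed per-request timeout, in seconds
STATUS_CHECK_INTERVAL = 20       # export-job status is polled roughly every 20s


def pull_chunk(url: str, get_export_status) -> bytes:
    """Fetch one exported chunk, tolerating 404s while the chunk is still
    being exported. `get_export_status` is a hypothetical callable that
    returns the state of the whole export job in Studio."""
    last_status_check = 0.0
    while True:
        # Periodically ask Studio how the whole export job is doing, so a
        # permanently missing chunk cannot keep us in this loop forever.
        if time.monotonic() - last_status_check >= STATUS_CHECK_INTERVAL:
            last_status_check = time.monotonic()
            if get_export_status() == "failed":
                raise RuntimeError("Dataset export failed in Studio")

        r = requests.get(url, timeout=PULL_DATASET_CHUNK_TIMEOUT)
        if r.status_code == 404:
            # Expected: this chunk has not been exported to S3 yet.
            time.sleep(1)
            continue
        r.raise_for_status()
        return r.content
```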
Because we are changing the way we export files to S3 in https://github.com/iterative/studio/pull/10966, we needed to adjust the CLI code as well to get better performance.
Now chunks are exported to S3 in order, which means chunks with a lower index become available first. We use that to fetch lower-index chunks first and avoid idling while later chunks are still being prepared.
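For illustration, fetching in index order could look roughly like this (a sketch under the same assumptions as above, reusing the `pull_chunk` helper; `base_url`, the chunk naming scheme, and `num_chunks` are hypothetical, not the real API):

```python
def pull_dataset(base_url: str, num_chunks: int, get_export_status) -> list[bytes]:
    """Download all exported chunks, lowest index first. Because Studio
    exports chunks in ascending order, the chunk we are currently waiting
    for is always the next one to appear in S3, so we rarely idle."""
    chunks = []
    for index in range(num_chunks):
        url = f"{base_url}/chunk-{index}.parquet"  # hypothetical naming scheme
        chunks.append(pull_chunk(url, get_export_status))
    return chunks
```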