pre_fetch option in additional to cache for lib.File #40
Comments
Note that with the current architecture, pre_fetch won't do much, since only one
@rlamy, we should change it in a way that pre-caching helps.
Depends on the file API refactoring, i.e. moving indexing to the app level. For now, moving back to backlog.
Since we are more or less done with indexing, moving it back to the ready stage, cc @rlamy. It might still depend on some of the work Ronan is doing now on decoupling datasetquery and datachain. One of the use cases I have at the moment is:
One thing that is a bit annoying is that some tools (e.g. OpenCV) seem to require a local path. Yes, cache helps in that case and pre-fetch can help, but both require downloading the whole file, while for some operations I just need the header. If someone has ideas for how this can be improved, let me know. Is there a way to create a file-like object that is actually a stream from the cloud underneath?
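One way to avoid downloading the whole file is a file-like object backed by byte-range requests, so that reading a small header only fetches those bytes. The sketch below is a minimal, self-contained illustration of that idea; `RangeStreamFile` and `fake_cloud_fetch` are hypothetical names, with the fetch callback standing in for a cloud range request (e.g. an S3 GET with a `Range` header). Libraries like fsspec already provide lazily reading file objects of this kind for `s3://`, `gs://`, etc.

```python
import io


class RangeStreamFile(io.RawIOBase):
    """File-like object that fetches byte ranges on demand via a callback.

    Only the bytes actually read are ever fetched, so reading a small
    header never downloads the whole object.
    """

    def __init__(self, fetch_range, size):
        self._fetch = fetch_range  # callable(start, length) -> bytes
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def read(self, n=-1):
        if n < 0:
            n = self._size - self._pos
        n = max(0, min(n, self._size - self._pos))
        data = self._fetch(self._pos, n)
        self._pos += len(data)
        return data


# Simulate a 1 MiB remote object; record which ranges get "downloaded".
blob = bytes(range(256)) * 4096
fetched = []


def fake_cloud_fetch(start, length):
    fetched.append((start, length))
    return blob[start:start + length]


f = RangeStreamFile(fake_cloud_fetch, len(blob))
header = f.read(16)  # fetches just the first 16 bytes, not the whole blob
```

Tools that accept any binary file object (rather than a local path) can consume such a stream directly; for tools that insist on a path, pre-fetch/cache remains the fallback.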
Some notes:
This means that
Where do we receive raw DB rows there? (I wonder if this is related or should be taken into account: https://github.com/iterative/studio/issues/10531#issuecomment-2379390308)
After probably too much refactoring, I can confirm that this can be implemented inside
Ignoring a lot of details, the basic idea is to change the implementation from

```python
for db_row in udf_inputs:
    obj_row = self._prepare(db_row)
    obj_result = self.process(obj_row)
    yield self._convert_result(obj_result)
```

to this:

```python
obj_rows = (self._prepare(db_row) for db_row in udf_inputs)
obj_rows = AsyncMapper(_prefetch_row, obj_rows, workers=pre_fetch)
for obj_row in obj_rows:
    obj_result = self.process(obj_row)
    yield self._convert_result(obj_result)
```

where

```python
async def _prefetch_row(row):
    for obj in row:
        if isinstance(obj, File):
            await obj._prefetch()
    return row
```

Note that the latter can easily be generalised to arbitrary models, if we define some kind of "prefetching protocol".
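A standalone sketch of the idea above, with a "prefetching protocol" modelled as a `typing.Protocol`. `AsyncMapper` is DataChain-internal, so this substitutes a minimal ordered look-ahead window (`prefetch_rows`); `Prefetchable`, `prefetch_rows`, and `FakeFile` are hypothetical names, not DataChain API.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class Prefetchable(Protocol):
    """Any model that opts into prefetching by implementing _prefetch()."""

    async def _prefetch(self) -> None: ...


async def prefetch_row(row):
    # Prefetch every object in the row that supports the protocol.
    for obj in row:
        if isinstance(obj, Prefetchable):
            await obj._prefetch()
    return row


async def prefetch_rows(rows, pre_fetch=2):
    """Yield rows in input order while up to `pre_fetch` rows fetch ahead."""
    window = []
    for row in rows:
        window.append(asyncio.ensure_future(prefetch_row(row)))
        if len(window) > pre_fetch:
            yield await window.pop(0)
    while window:
        yield await window.pop(0)


class FakeFile:
    """Stand-in for lib.File: _prefetch() marks the file as cached."""

    def __init__(self, name):
        self.name = name
        self.cached = False

    async def _prefetch(self):
        await asyncio.sleep(0)  # pretend to download
        self.cached = True


async def main():
    rows = [(FakeFile(f"file{i}"), i) for i in range(5)]
    return [row async for row in prefetch_rows(rows, pre_fetch=2)]


results = asyncio.run(main())
```

Because the window is FIFO, downstream `process()` still sees rows in their original order; only the downloads overlap.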
It looks like the right way of solving this. Thank you!
The proposed implementation has a problem: it hangs when run in distributed mode, i.e. when using something like
Possible solutions
Using threading in
minor observation -
I think prefetch still makes sense (it can start fetching the next batch?). This can definitely be a followup / separate ticket to discuss and prioritize.
@rlamy, was that fixed? I see that now. I tried to fix a hanging issue on interrupt/error in #597, which was causing test failures. If you have a moment, I would appreciate your feedback on the PR. Thank you.
@skshetry can it be closed? |
We need to download items in async mode before processing them:
pre_fetch
this should enable async file download (per thread) for a given limit of files (e.g. pre_fetch=10), like pre_fetch in PyTorch datasets. Default should be pre_fetch=2.

OUTDATED:
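The semantics described above (a bounded download look-ahead per thread, default 2) can be sketched with a thread pool and a sliding window of futures. This is an illustration only, assuming a generic `download` callable rather than DataChain's actual File API; `prefetched` and `fake_download` are hypothetical names.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetched(download, items, pre_fetch=2):
    """Yield download(item) for each item, in input order, while up to
    `pre_fetch` downloads run ahead on background threads (similar in
    spirit to prefetch_factor in a PyTorch DataLoader)."""
    with ThreadPoolExecutor(max_workers=pre_fetch) as pool:
        window = deque()
        for item in items:
            window.append(pool.submit(download, item))
            if len(window) > pre_fetch:
                # Block on the oldest download; newer ones keep running.
                yield window.popleft().result()
        while window:
            yield window.popleft().result()


downloaded = []


def fake_download(name):
    downloaded.append(name)  # record what got "downloaded"
    return f"local/{name}"


paths = list(prefetched(fake_download, ["a", "b", "c", "d"], pre_fetch=2))
```

While the UDF processes item `i`, items up to `i + pre_fetch` are already downloading, so per-file latency is hidden without unbounded memory use.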