Download and save a large file as an artifact #135
Discussion on Gitter from @tuulos:
I'm happy to write a PR to handle this. My immediate thought on how to fix this for the common use case of "storing a file" is to add a new check. A more flexible approach, which allows any data type to be persisted even if it's not pickleable, would be to expose the target file object somehow (which would correspond to a local file for the local datastore, an S3 stream for the AWS datastore, etc.) and let us define types that know how to load and save themselves using that file:

```python
import shutil
import requests
from io import IOBase

from metaflow import FlowSpec, step

class RawFile(ArtifactType):  # ArtifactType is the proposed base class
    # Artifact types all have a value field that indicates the data they're storing
    value: IOBase

    # Tell metaflow how to save this to a file
    def serialize(self, fp):
        shutil.copyfileobj(self.value, fp)

    # Tell metaflow how to hydrate this from a file
    def deserialize(self, fp):
        self.value = fp


# Use the RawFile type in a step
class Workflow(FlowSpec):

    @step
    def download_file(self):
        req = requests.get(self.input['url'], allow_redirects=True, stream=True)
        self.big_file = RawFile(req.raw)  # req.raw is the file-like response stream
        self.next(self.use_file)

    @step
    def use_file(self):
        process_file(self.big_file)
```
Handling (intermediate) files is a useful feature for data scientists working in the life sciences (a.k.a. bioinformaticians), as the data files are often too big to keep in memory, and many efficient algorithms are implemented as standalone applications. Given that most of these bioinformatics tools are available through the bioconda conda channel, using metaflow seems straightforward for anything except handling (intermediate) data files.
The new datastore implementation now allows for custom serde.
Great! I guess that isn't yet stable though? Are there usage examples that involve file storage anywhere?
One of the steps of my workflow is simply downloading a large data file:
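Roughly like this (a minimal sketch; the flow scaffolding and the URL here are illustrative, not my exact code):

```python
from metaflow import FlowSpec, step
import requests

class DownloadFlow(FlowSpec):

    @step
    def start(self):
        # hypothetical source URL
        self.url = 'https://example.com/reference.fasta.gz'
        self.next(self.download_file)

    @step
    def download_file(self):
        req = requests.get(self.url, allow_redirects=True)
        # req.content buffers the entire response body in memory
        self.big_file = req.content
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    DownloadFlow()
```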
Now, this fails with a `MemoryError`, because `req.content` tries to read the whole file into memory. However, even though `requests` has a streaming API, via `iter_content()`, I don't think it's possible to use this because `metaflow` doesn't expose a file object to write into. If I try to store a generator object as an artifact, it doesn't work either:
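(A sketch of that attempt, as a variation of the `download_file` step above; the artifact assignment blows up when metaflow tries to pickle the generator:)

```python
@step
def download_file(self):
    req = requests.get(self.url, stream=True)
    # pickling fails with a TypeError: generators can't be pickled
    self.big_file = req.iter_content(chunk_size=1024 * 1024)
    self.next(self.end)
```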
Finally, I can't use `req.raw`:
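(Presumably for the same reason: `req.raw` is the underlying `urllib3` response wrapping an open connection, which can't be pickled either:)

```python
@step
def download_file(self):
    req = requests.get(self.url, stream=True)
    # req.raw wraps the open socket, so storing it as an
    # artifact fails at pickling time as well
    self.big_file = req.raw
    self.next(self.end)
```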
If you somehow exposed the file object we were writing to, I could stream each chunk of the file separately and pickle them:
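(A sketch of the idea; `artifact_file()` is a hypothetical API for the exposed file object, not something metaflow provides today:)

```python
import pickle

@step
def download_file(self):
    req = requests.get(self.url, stream=True)
    with self.artifact_file('big_file') as fp:  # hypothetical API
        for chunk in req.iter_content(chunk_size=1024 * 1024):
            pickle.dump(chunk, fp)
    self.next(self.end)
```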
Or ideally not use pickle at all:
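(Using the same hypothetical `artifact_file()`, but streaming the raw bytes straight through:)

```python
import shutil

@step
def download_file(self):
    req = requests.get(self.url, stream=True)
    with self.artifact_file('big_file') as fp:  # hypothetical API, as above
        shutil.copyfileobj(req.raw, fp)
    self.next(self.end)
```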
Is exposing the file object, or allowing non-pickle files, currently possible? If not, is it on the radar?