feat: add properties for accessing artifact, stdout and stderr file sizes #752
Conversation
…dout/stderr log size property
… use for artifact file size checks.
update datastorage file_size docs
A few comments, mainly revolving around using the info we already record in the datastore about the size of the object (it's the size pre-gzip, so actually a bit more relevant than the size on disk). I think it will be more effective overall, particularly if we cache the task DS in the filecache (although maybe we need a way to clean them out; maybe a simple LRU algorithm there).
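A minimal sketch of the kind of simple LRU eviction mentioned above, assuming a hypothetical standalone cache object and a made-up size cap rather than the actual filecache internals:

from collections import OrderedDict

class TaskDatastoreLRU:
    # Tiny LRU cache keyed by (flow, run, step, task, attempt); the cap is illustrative.
    def __init__(self, max_size=8):
        self._cache = OrderedDict()
        self._max_size = max_size

    def get(self, key, fetch):
        # 'fetch' is a zero-argument callable that opens the task datastore on a miss.
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        value = fetch()
        self._cache[key] = value
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return value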
"""Gets the size of the artifact content (in bytes) for the name""" | ||
ds = self._get_flow_datastore(ds_type, ds_root, flow_name) | ||
|
||
task_ds = ds.get_task_datastore( |
So here we would probably need to pass the attempt, and it may be interesting to cache the task_ds here to avoid fetching the same thing from S3 multiple times. I didn't cache the task_ds here, but it should be doable to do the same as for the flow datastores (you need to index based on flow/run/step/task/attempt and re-fetch if the attempt isn't a fixed thing, i.e. "last attempt"). You would not pass the data_metadata here, though, and let it load it itself.
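A hedged sketch of how the filecache could key such a cache on the full task coordinates plus attempt, and skip caching for the non-fixed "last attempt" case; the helper name, the _task_ds_cache dict, and the exact get_task_datastore keyword arguments are assumptions for illustration:

def _cached_task_datastore(self, ds, flow_name, run_id, step_name, task_id, attempt):
    # Hypothetical filecache helper: index by flow/run/step/task/attempt.
    key = (flow_name, run_id, step_name, task_id, attempt)
    if attempt is None:
        # "Last attempt" is not a fixed thing, so never trust a cached entry;
        # re-fetch and let the task datastore resolve the latest attempt itself.
        return ds.get_task_datastore(run_id, step_name, task_id, mode='r')
    if key not in self._task_ds_cache:
        # No data_metadata is passed in; the task datastore loads it itself.
        self._task_ds_cache[key] = ds.get_task_datastore(
            run_id, step_name, task_id, attempt=attempt, mode='r')
    return self._task_ds_cache[key]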
metaflow/datastore/task_datastore.py
Outdated
@@ -314,6 +314,39 @@ def load_artifacts(self, names):
        for sha, blob in self._ca_store.load_blobs(to_load):
            yield sha_to_names[sha], pickle.loads(blob)

    @require_mode(None)
    def get_artifact_size(self, name):
So for this, I would actually not go this route since we already store the size of the artifacts in S3 in the _info portion of the file. We would open the task datastore in 'r' mode at the proper attempt and then read directly from _info.
Nit: to keep with the other functions here, I would make names a list and return an iterator just like load_artifacts.
We might also be able to use require_mode('r') here to be on the safer side.
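A hedged sketch of what reading sizes from the recorded artifact info could look like on the task datastore, following the iterator style of load_artifacts; the layout of self._info (a per-name dict with a 'size' field holding the pre-gzip size) is an assumption based on the discussion above:

@require_mode('r')
def get_artifact_sizes(self, names):
    # Yields (name, size_in_bytes) pairs, mirroring the load_artifacts interface.
    for name in names:
        info = self._info.get(name, {})
        # 'size' is assumed to be the pre-gzip size recorded when the artifact was saved.
        yield name, info.get('size', 0)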
Enforced the read-only mode for now, as things work with it, and if I understand correctly, this ensures that the related artifact file metadata has been written prior to attempting a read?
Implemented the other suggested changes as well :)
… continuously requesting filesizes. Fix possible issue with task_datastore not retaining passed in task attempt for further use.
if self._attempt is None:
    for i in range(metaflow_config.MAX_ATTEMPTS):
        check_meta = self._metadata_name_for_attempt(
            self.METADATA_ATTEMPT_SUFFIX, i)
        if self.has_metadata(check_meta, add_attempt=False):
            self._attempt = i
@romain-intel I had to change this up a bit; not sure if it's an oversight from previous changes or what is going on here.
Without this, the task_datastore is not retaining the passed-in attempt
for further use at all. Could you see if the changes make sense as a whole?
That was an oversight, as that path wasn't used previously. I will update it appropriately to make sure we still validate that the attempt is valid.
I'm going to merge this and touch it up in the per-attempt-task branch. Just leaving this comment here for context. Looks very close though, so thanks a bunch @saikonen.
* Missed attempt_id on attempt_done
* Allow access to per-attempt Task and DataArtifact
  You can now specify a specific attempt for a Task or a DataArtifact in the client like so:
  - Task('flow/run/step/id/attempt')
  - DataArtifact('flow/run/step/id/name/attempt')
  This gives you a specific view of that particular attempt. Note that attempts are only valid for Task and Artifacts.
* Added service component for task/artifact attempt
  This requires the attempt-fix branch of the metadata service.
  TODO:
  - still need to add version check to make sure we are hitting a modern enough service
* Py2 compatibility
* Moved the attempt specification from the pathspec to a separate argument
  Also added the version check (make sure the service returns 2.0.6 to test it out). Also addressed comments.
* Typos
* Add check to make sure attempts are only on Task and DataArtifact objects
* feat: add properties for accessing artifact, stdout and stderr file sizes (#752)
* wip: rough implementation of artifact size gets and suggestion for stdout/stderr log size property
* add file_size to the datastore interface, implement for s3storage and use for artifact file size checks.
* wip: implement log sizes for legacy and MFLOG type logs.
* implement file_size for LocalStorage as well. update datastorage file_size docs
* cleanup core docstrings for log_size properties
* update docs and rename get_size to be specific about artifact size
* refactor: move current attempt to a property
* cleanup artifact size return
* cleanup comment and rename file_size to be in line with other methods
* change to require_mode('r') for size getters
* fix indent
* use cached filesize found in 'info' metadata for artifacts instead of continuously requesting filesizes. Fix possible issue with task_datastore not retaining passed in task attempt for further use.
* change artifact size function to return an iterator to adhere to existing styles.
* Remove visible tags/system_tags from metadata
* Address issue when None is the value returned for all_tags
* Add TaskDatastore caching to filecache; a few other fixes
* Fix bug
* Updated comment strings to more accurately reflect reality
* Addressed comments

Co-authored-by: Sakari Ikonen <64256562+saikonen@users.noreply.github.com>
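A hedged usage sketch of the per-attempt client access described in the commit notes above, with the attempt passed as a separate argument rather than in the pathspec; the pathspecs are made up:

from metaflow import Task, DataArtifact

# Default view: the latest attempt of the task.
task = Task('MyFlow/1234/train/5678')

# A specific attempt of the same task (attempt as a separate argument,
# not part of the pathspec).
task_attempt_1 = Task('MyFlow/1234/train/5678', attempt=1)

# A specific attempt's view of a data artifact.
model_v1 = DataArtifact('MyFlow/1234/train/5678/model', attempt=1)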
Feature mainly used for the UI backend, where some of the cache invalidation will be done based on file size changes.

Introduces a file_size method to the DataStoreStorage interface and implements it for both the S3Storage and LocalStorage stores. Also introduces a current_attempt property to Task, which will either try to infer the latest attempt from task metadata or return the attempt used for initialization. Simple refactor, since this is required in multiple places.
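A hedged example of how a UI backend might consume this from the client; the pathspec is made up, and the size property names (stdout_size/stderr_size on Task, size on DataArtifact) are inferred from the PR title and commit list, so treat them as assumptions:

from metaflow import Task

task = Task('MyFlow/1234/train/5678')            # made-up pathspec
print(task.current_attempt)                      # latest attempt inferred from task metadata
print(task.stdout_size, task.stderr_size)        # log sizes in bytes (assumed property names)
for artifact in task:                            # a Task iterates over its DataArtifact objects
    print(artifact.id, artifact.size)            # per-artifact size in bytes (assumed property)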