
feat: add properties for accessing artifact, stdout and stderr file sizes #752

Merged: 13 commits merged into Netflix:per-attempt-task on Oct 14, 2021

Conversation

saikonen (Collaborator)

Feature mainly used for the UI backend, where some of the cache revoking will be done based on file size changes.

  • Artifact sizes could possibly be cached if necessary, as I believe these are write-once and will not change afterwards.
  • Logs can keep growing during runtime, so these size checks bypass the filecache and pass the check directly to the datastore.

Introduces a file_size method to the DataStoreStorage interface. Implements this for both the S3Storage and LocalStorage stores.

Also introduces a current_attempt property to Task, which will either try to infer the latest attempt from the task metadata or return the attempt used for initialization. This is a simple refactor, since the logic is needed in multiple places.
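For context on the new storage hook, here is a minimal sketch of what file_size could look like for the local store (the method was later renamed in a follow-up commit; names and signatures here are illustrative, not the exact code merged in this PR):

```python
import os


class DataStoreStorage(object):
    """Illustrative slice of the storage interface; only the new hook is shown."""

    def file_size(self, path):
        """Return the size, in bytes, of the object at `path`, or None if it is missing."""
        raise NotImplementedError


class LocalStorage(DataStoreStorage):
    def __init__(self, datastore_root):
        self.datastore_root = datastore_root

    def file_size(self, path):
        full_path = os.path.join(self.datastore_root, path)
        try:
            return os.path.getsize(full_path)
        except OSError:
            # Missing object: mirror the "not found" behavior of the load paths.
            return None
```

An S3-backed implementation would presumably answer the same question with a HEAD-style metadata request rather than downloading the object; that part is omitted here since it depends on Metaflow's internal S3 client.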

@saikonen saikonen marked this pull request as draft October 12, 2021 11:01
@romain-intel romain-intel (Contributor) left a comment

A few comments, mainly revolving around using the info we already record in the datastore about the size of the object (it's the size pre-gzip, so actually a bit more relevant than the size on disk). I think it will be more effective overall, particularly if we cache the task DS in the filecache (although maybe we need a way to clean them out; maybe a simple LRU algorithm there).
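To illustrate the "simple LRU" idea, a hypothetical helper (not code from this PR or from Metaflow's filecache) might look like this:

```python
from collections import OrderedDict


class TaskDatastoreLRU(object):
    """Tiny LRU wrapper for caching opened task datastores (illustrative only)."""

    def __init__(self, max_entries=32):
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key, loader):
        # `key` identifies a task datastore; `loader` opens it on a cache miss.
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = loader()
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value
```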

metaflow/client/filecache.py (outdated review comment, resolved)
metaflow/datastore/datastore_storage.py (outdated review comment, resolved)
metaflow/client/core.py (outdated review comment, resolved)
"""Gets the size of the artifact content (in bytes) for the name"""
ds = self._get_flow_datastore(ds_type, ds_root, flow_name)

task_ds = ds.get_task_datastore(
romain-intel (Contributor):

So here we would probably need to pass the attempt, and it may be interesting to cache the task_ds here to avoid fetching the same thing from S3 multiple times. I didn't cache the task_ds here, but it should be doable to do the same as for the flow datastores (you would need to index based on flow/run/step/task/attempt and re-fetch if the attempt isn't a fixed thing, i.e. "last attempt"). You would not pass the data_metadata here, though, and would let it load it itself.
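A sketch of the keying scheme described above; the helper name is made up, and None stands in for "latest attempt", which should not be cached since it can change between calls:

```python
def task_ds_cache_key(flow_name, run_id, step_name, task_id, attempt):
    # A fixed attempt can be cached safely; "latest attempt" (attempt=None)
    # can change between calls, so callers should skip the cache and
    # re-fetch the task datastore in that case.
    if attempt is None:
        return None
    return (flow_name, run_id, step_name, task_id, attempt)
```

The filecache would then look that key up in something like the LRU wrapper sketched earlier and fall through to ds.get_task_datastore(...) on a miss, or whenever the key is None.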

@@ -314,6 +314,39 @@ def load_artifacts(self, names):
        for sha, blob in self._ca_store.load_blobs(to_load):
            yield sha_to_names[sha], pickle.loads(blob)

    @require_mode(None)
    def get_artifact_size(self, name):
romain-intel (Contributor):

So for this, I would actually not go this route, since we already store the size of the artifacts in S3 in the _info portion of the file. We would open the task datastore in 'r' mode at the proper attempt and then read the size directly from _info.

Nit: to keep with the other functions here, I would make names a list and return an iterator, just like load_artifacts.

We might also be able to use require_mode('r') here to be on the safer side.
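Along those lines, a rough sketch of an iterator-style size getter that reads from _info (this assumes the method lives on the task datastore class next to load_artifacts and that each _info entry records a 'size' field; the merged implementation may differ):

```python
# Sketch of a method on the task datastore class, next to load_artifacts.
@require_mode('r')
def get_artifact_sizes(self, names):
    """Yield (name, size_in_bytes) for each artifact name, using the sizes
    already recorded in the _info metadata (i.e. the pre-gzip size)."""
    for name in names:
        info = self._info.get(name)
        if info is None:
            # A real implementation would raise the datastore's own error type here.
            raise KeyError("No artifact named '%s'" % name)
        yield name, info['size']
```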

saikonen (Collaborator, Author):

Enforced the read-only mode for now, since everything works with it. If I understand correctly, this ensures that the related artifact metadata has been written before we attempt a read?

Implemented the other suggested changes as well :)

metaflow/datastore/task_datastore.py (outdated review comment, resolved)
metaflow/datastore/task_datastore.py (outdated review comment, resolved)
Comment on lines +129 to +134
if self._attempt is None:
    for i in range(metaflow_config.MAX_ATTEMPTS):
        check_meta = self._metadata_name_for_attempt(
            self.METADATA_ATTEMPT_SUFFIX, i)
        if self.has_metadata(check_meta, add_attempt=False):
            self._attempt = i
saikonen (Collaborator, Author):

@romain-intel I had to change this up a bit; not sure if it's an oversight from previous changes or something else. Without this, the task_datastore does not retain the passed-in attempt for further use at all. Could you check whether the changes make sense as a whole?

romain-intel (Contributor):

That was an oversight, as that path wasn't used previously. I will update it appropriately to make sure we still validate that the attempt is valid.

@saikonen saikonen marked this pull request as ready for review October 13, 2021 14:35
@romain-intel (Contributor)

I'm going to merge this and touch it up in the per-attempt-task branch; just leaving this comment here for context. Looks very close though, so thanks a bunch @saikonen.

@romain-intel romain-intel merged commit 788c7fd into Netflix:per-attempt-task Oct 14, 2021
romain-intel added a commit that referenced this pull request Oct 18, 2021
* Missed attempt_id on attempt_done

* Allow access to per-attempt Task and DataArtifact

You can now specify a specific attempt for a Task or a DataArtifact
in the client like so:
  - Task('flow/run/step/id/attempt')
  - DataArtifact('flow/run/step/id/name/attempt')

This gives you a specific view of that particular attempt. Note that
attempts are only valid for Task and Artifacts.

* Added service component for task/artifact attempt

This requires the attempt-fix branch of the metadata service.

TODO:
  - still need to add version check to make sure we are hitting a modern enough service

* Py2 compatibility

* Moved the attempt specification from the pathspec to a separate argument

Also added the version check (make sure the service returns 2.0.6 to test it out).

Also addressed comments.

* Typos

* Add check to make sure attempts are only on Task and DataArtifact objects

* feat: add properties for accessing artifact, stdout and stderr file sizes (#752)

* wip: rough implementation of artifact size gets and suggestion for stdout/stderr log size property

* add file_size to the datastore interface, implement for s3storage and use for artifact file size checks.

* wip: implement log sizes for legacy and MFLOG type logs.

* implement file_size for LocalStorage as well.

update datastorage file_size docs

* cleanup core docstrings for log_size properties

* update docs and rename get_size to be specific about artifact size

* refactor: move current attempt to a property

* cleanup artifact size return

* cleanup comment and rename file_size to be in line with other methods

* change to require_mode('r') for size getters

* fix indent

* use cached filesize found in 'info' metadata for artifacts instead of continuously requesting filesizes.

Fix possible issue with task_datastore not retaining passed in task attempt for further use.

* change artifact size function to return an iterator to adhere to existing styles.

* Remove visible tags/system_tags from metadata

* Address issue when None is the value returned for all_tags

* Add TaskDatastore caching to filecache; a few other fixes

* Fix bug

* Updated comment strings to more accurately reflect reality

* Addressed comments

Co-authored-by: Sakari Ikonen <64256562+saikonen@users.noreply.github.com>