
feat: add properties for accessing artifact, stdout and stderr file sizes #752

Merged: 13 commits merged into Netflix:per-attempt-task on Oct 14, 2021

Conversation

saikonen (Collaborator)

Feature mainly used for the UI backend, where some of the cache revoking will be done based on file size changes.

  • Artifact sizes could possibly be cached if necessary, as I believe these are write-once and will not change afterwards.
  • Logs can keep growing during runtime, so these size checks bypass the filecache and pass the check directly to the datastore.

Introduces a file_size method to the DataStoreStorage interface. Implements this for both the S3Storage and LocalStorage stores.

Also introduces a current_attempt property to Task, which will either try to infer the latest attempt from the task metadata or return the attempt used for initialization. This is a simple refactor, since the logic is needed in multiple places.
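For context on the new storage hook, here is a minimal sketch of what file_size could look like for the local store (the method was later renamed in a follow-up commit; names and signatures here are illustrative, not the exact code merged in this PR):

```python
import os


class DataStoreStorage(object):
    """Illustrative slice of the storage interface; only the new hook is shown."""

    def file_size(self, path):
        """Return the size, in bytes, of the object at `path`, or None if it is missing."""
        raise NotImplementedError


class LocalStorage(DataStoreStorage):
    def __init__(self, datastore_root):
        self.datastore_root = datastore_root

    def file_size(self, path):
        full_path = os.path.join(self.datastore_root, path)
        try:
            return os.path.getsize(full_path)
        except OSError:
            # Missing object: mirror the "not found" behavior of the load paths.
            return None
```

An S3-backed implementation would presumably answer the same question with a HEAD-style metadata request rather than downloading the object; that part is omitted here since it depends on Metaflow's internal S3 client.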

@saikonen saikonen marked this pull request as draft October 12, 2021 11:01
@romain-intel romain-intel (Contributor) left a comment

A few comments, mainly revolving around using the info we already record in the datastore about the size of the object (it's the size pre-gzip, so actually a bit more relevant than the size on disk). I think it will be more effective overall, particularly if we cache the task DS in the filecache (although maybe we need a way to clean them out; maybe a simple LRU algorithm there).
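To illustrate the "simple LRU" idea, a hypothetical helper (not code from this PR or from Metaflow's filecache) might look like this:

```python
from collections import OrderedDict


class TaskDatastoreLRU(object):
    """Tiny LRU wrapper for caching opened task datastores (illustrative only)."""

    def __init__(self, max_entries=32):
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key, loader):
        # `key` identifies a task datastore; `loader` opens it on a cache miss.
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = loader()
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value
```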

metaflow/client/filecache.py (outdated review comment, resolved)
metaflow/datastore/datastore_storage.py (outdated review comment, resolved)
metaflow/client/core.py (outdated review comment, resolved)
"""Gets the size of the artifact content (in bytes) for the name"""
ds = self._get_flow_datastore(ds_type, ds_root, flow_name)

task_ds = ds.get_task_datastore(
romain-intel (Contributor):

So here we would probably need to pass the attempt, and it may be interesting to cache the task_ds here to avoid fetching the same thing from S3 multiple times. I didn't cache the task_ds here, but it should be doable to do the same as for the flow datastores (you would need to index based on flow/run/step/task/attempt and re-fetch if the attempt isn't a fixed thing, i.e. "last attempt"). You would not pass the data_metadata here, though, and would let it load it itself.
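A sketch of the keying scheme described above; the helper name is made up, and None stands in for "latest attempt", which should not be cached since it can change between calls:

```python
def task_ds_cache_key(flow_name, run_id, step_name, task_id, attempt):
    # A fixed attempt can be cached safely; "latest attempt" (attempt=None)
    # can change between calls, so callers should skip the cache and
    # re-fetch the task datastore in that case.
    if attempt is None:
        return None
    return (flow_name, run_id, step_name, task_id, attempt)
```

The filecache would then look that key up in something like the LRU wrapper sketched earlier and fall through to ds.get_task_datastore(...) on a miss, or whenever the key is None.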

@@ -314,6 +314,39 @@ def load_artifacts(self, names):
        for sha, blob in self._ca_store.load_blobs(to_load):
            yield sha_to_names[sha], pickle.loads(blob)

    @require_mode(None)
    def get_artifact_size(self, name):
romain-intel (Contributor):

So for this, I would actually not go this route, since we already store the size of the artifacts in S3 in the _info portion of the file. We would open the task datastore in 'r' mode at the proper attempt and then read the size directly from _info.

Nit: to keep with the other functions here, I would make names a list and return an iterator, just like load_artifacts.

We might also be able to use require_mode('r') here to be on the safer side.
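Along those lines, a rough sketch of an iterator-style size getter that reads from _info (this assumes the method lives on the task datastore class next to load_artifacts and that each _info entry records a 'size' field; the merged implementation may differ):

```python
# Sketch of a method on the task datastore class, next to load_artifacts.
@require_mode('r')
def get_artifact_sizes(self, names):
    """Yield (name, size_in_bytes) for each artifact name, using the sizes
    already recorded in the _info metadata (i.e. the pre-gzip size)."""
    for name in names:
        info = self._info.get(name)
        if info is None:
            # A real implementation would raise the datastore's own error type here.
            raise KeyError("No artifact named '%s'" % name)
        yield name, info['size']
```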

saikonen (Collaborator, Author):

Enforced the read-only mode for now, since everything works with it. If I understand correctly, this ensures that the related artifact metadata has been written before we attempt a read?

Implemented the other suggested changes as well :)

metaflow/datastore/task_datastore.py (outdated review comment, resolved)
metaflow/datastore/task_datastore.py (outdated review comment, resolved)
Comment on lines +129 to +134
if self._attempt is None:
    for i in range(metaflow_config.MAX_ATTEMPTS):
        check_meta = self._metadata_name_for_attempt(
            self.METADATA_ATTEMPT_SUFFIX, i)
        if self.has_metadata(check_meta, add_attempt=False):
            self._attempt = i
saikonen (Collaborator, Author):

@romain-intel I had to change this up a bit; not sure if it's an oversight from previous changes or something else. Without this, the task_datastore does not retain the passed-in attempt for further use at all. Could you check whether the changes make sense as a whole?

romain-intel (Contributor):

That was an oversight, as that path wasn't used previously. I will update it appropriately to make sure we still validate that the attempt is valid.

@saikonen saikonen marked this pull request as ready for review October 13, 2021 14:35
@romain-intel (Contributor)

I'm going to merge this and touch it up in the per-attempt-task branch; just leaving this comment here for context. Looks very close though, so thanks a bunch @saikonen.

@romain-intel romain-intel merged commit 788c7fd into Netflix:per-attempt-task Oct 14, 2021
romain-intel added a commit that referenced this pull request Oct 18, 2021
* Missed attempt_id on attempt_done

* Allow access to per-attempt Task and DataArtifact

You can now specify a specific attempt for a Task or a DataArtifact
in the client like so:
  - Task('flow/run/step/id/attempt')
  - DataArtifact('flow/run/step/id/name/attempt')

This gives you a specific view of that particular attempt. Note that
attempts are only valid for Task and Artifacts.

* Added service component for task/artifact attempt

This requires the attempt-fix branch of the metadata service.

TODO:
  - still need to add version check to make sure we are hitting a modern enough service

* Py2 compatibility

* Moved the attempt specification from the pathspec to a separate argument

Also added the version check (make sure the service returns 2.0.6 to test it out).

Also addressed comments.

* Typos

* Add check to make sure attempts are only on Task and DataArtifact objects

* feat: add properties for accessing artifact, stdout and stderr file sizes (#752)

* wip: rough implementation of artifact size gets and suggestion for stdout/stderr log size property

* add file_size to the datastore interface, implement for s3storage and use for artifact file size checks.

* wip: implement log sizes for legacy and MFLOG type logs.

* implement file_size for LocalStorage as well.

update datastorage file_size docs

* cleanup core docstrings for log_size properties

* update docs and rename get_size to be specific about artifact size

* refactor: move current attempt to a property

* cleanup artifact size return

* cleanup comment and rename file_size to be in line with other methods

* change to require_mode('r') for size getters

* fix indent

* use cached filesize found in 'info' metadata for artifacts instead of continuously requesting filesizes.

Fix possible issue with task_datastore not retaining passed in task attempt for further use.

* change artifact size function to return an iterator to adhere to existing styles.

* Remove visible tags/system_tags from metadata

* Address issue when None is the value returned for all_tags

* Add TaskDatastore caching to filecache; a few other fixes

* Fix bug

* Updated comment strings to more accurately reflect reality

* Addressed comments

Co-authored-by: Sakari Ikonen <64256562+saikonen@users.noreply.github.com>