2.8.4
- Features
- Improvements
Features
Introduce support for tmpfs for executions on Kubernetes
It is typical for the user code in a Metaflow step to download assets from an object store, e.g. S3. Examples include serialized models and raw input data, such as unstructured media or structured Parquet files. The amount of data loaded in a task is typically 10-100GB, allowing even terabytes to be handled in a foreach.
To reduce IO bottlenecks in such tasks, we provide an optimized client for S3, metaflow.S3, that makes it possible to download data using all available network bandwidth. Notably, in a modern instance the available network bandwidth can be higher than the local disk bandwidth. Consider: SATA 3.0 provides 6Gbit/s whereas a large instance can have 20Gbit/s network throughput. Even Gen3 NVMe provides just 16Gbit/s. To benefit from the full network bandwidth, local disk IO must be bypassed. The metaflow.S3 client accomplishes this by relying on the page cache: nominally, files are downloaded to a temporary directory on disk, but in practice all data stays in the page cache. This assumes that the downloaded data fits in memory, which can be ensured with a high enough @resources(memory=) setting.
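To make the bandwidth comparison above concrete, here is a back-of-the-envelope sketch that converts the quoted link speeds into transfer times. The 100 GB payload is an assumed example size, and the calculation ignores protocol overhead and real-world contention, so treat the numbers as rough upper bounds on throughput rather than measurements.

```python
# Rough transfer-time estimate using the link speeds quoted in the text.
# The 100 GB payload is an assumed example, not a measured workload.


def transfer_seconds(size_gb: float, link_gbit_per_s: float) -> float:
    """Seconds needed to move size_gb gigabytes over a link of the given speed."""
    size_gbit = size_gb * 8  # 1 byte = 8 bits
    return size_gbit / link_gbit_per_s


links = {
    "SATA 3.0 disk": 6.0,
    "Gen3 NVMe disk": 16.0,
    "network (large instance)": 20.0,
}

for name, gbit_per_s in links.items():
    seconds = transfer_seconds(100, gbit_per_s)
    print(f"{name}: {seconds:.0f} s for 100 GB")
```

Under these assumptions, the disk rather than the network would be the limiting factor whenever downloaded bytes have to be written through to local storage.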
The above setup, which can provide excellent IO performance in general, has a small gotcha: the instance needs to have enough local disk space to back all the data, although no data actually hits the disk. Increasingly, instances may have more memory than local disk space available, so this superfluous requirement becomes a problem. This puts users in a strange situation: the instance has enough RAM to hold all the data in memory, and there are ways to download it quickly from S3, but the lack of local disk space (which is not even needed) makes it impossible to access the data.
Kubernetes supports mounting a tmpfs filesystem on the fly. Using this feature, the user can create a memory-backed file system which can be used as temporary space for downloaded data. This removes the need to deal with any local disks. One can simply use a minimal root filesystem, which greatly simplifies the infrastructure setup.
With this release, we introduce a new config option, METAFLOW_TEMPDIR, which, if defined, is used as the default metaflow.S3(tmproot). If METAFLOW_TEMPDIR is not defined, tmproot='.' as before. In addition, a few new attributes are introduced for the @kubernetes decorator:
Attribute (default) | Default behavior | Override semantics |
---|---|---|
use_tmpfs=False | tmpfs disabled | use_tmpfs=True enables tmpfs |
tmpfs_tempdir=True | sets METAFLOW_TEMPDIR=tmpfs_path | tmpfs_tempdir=False doesn't set METAFLOW_TEMPDIR |
tmpfs_size=None | sets tmpfs size to 50% of @resources(memory) | tmpfs size in megabytes |
tmpfs_path=None | use /metaflow_temp as tmpfs_path | custom mount point |
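The defaults in the table can be read as a small resolution procedure. The sketch below is an illustration only, not Metaflow's actual implementation: `resolve_tmpfs` and `default_tmproot` are hypothetical helper names, and the rule that a non-empty `tmpfs_size` implies tmpfs is enabled is inferred from the examples in these notes.

```python
# Illustrative sketch of the table's semantics. These helpers are
# hypothetical and are NOT part of the Metaflow API.
from typing import Mapping, Optional

DEFAULT_TMPFS_PATH = "/metaflow_temp"  # default mount point from the table


def resolve_tmpfs(
    memory_mb: int,
    use_tmpfs: bool = False,
    tmpfs_tempdir: bool = True,
    tmpfs_size: Optional[int] = None,
    tmpfs_path: Optional[str] = None,
) -> dict:
    """Return the effective tmpfs settings for a task."""
    # Assumption: giving an explicit size implies that tmpfs is wanted.
    enabled = use_tmpfs or tmpfs_size is not None
    if not enabled:
        return {"enabled": False, "size_mb": None, "path": None, "env": {}}

    # Default size is 50% of the memory requested for the task.
    size_mb = tmpfs_size if tmpfs_size is not None else memory_mb // 2
    path = tmpfs_path or DEFAULT_TMPFS_PATH

    # METAFLOW_TEMPDIR is only set when tmpfs_tempdir is left enabled.
    env = {"METAFLOW_TEMPDIR": path} if tmpfs_tempdir else {}
    return {"enabled": True, "size_mb": size_mb, "path": path, "env": env}


def default_tmproot(environ: Mapping[str, str]) -> str:
    """Default tmproot: METAFLOW_TEMPDIR if defined, otherwise '.'."""
    return environ.get("METAFLOW_TEMPDIR", ".")


settings = resolve_tmpfs(memory_mb=100000, use_tmpfs=True)
print(settings["size_mb"], settings["path"], default_tmproot(settings["env"]))
```

With `tmpfs_tempdir=False`, the returned environment is empty, so `default_tmproot` falls back to `'.'` and the tmpfs volume is left entirely to user code.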
Examples
Handle large amounts of data in-memory with Kubernetes:
@kubernetes(memory=100000, use_tmpfs=True)
In this case, at most 50GB is available for tmpfs and it is used by S3 by default. Note that tmpfs only consumes the amount of memory corresponding to the data stored, so there is no downside in setting a large size by default.
Increase tmpfs size:
@kubernetes(memory=100000, tmpfs_size=100000)
Let tmpfs use all available memory. Note that use_tmpfs=True doesn’t have to be specified redundantly.
Custom tmpfs use case:
@kubernetes(memory=100000, tmpfs_size=10000, tmpfs_path='/data', tmpfs_tempdir=False)
Full control over settings - metaflow.S3 doesn’t use the tmpfs volume in this case.
Besides metaflow.S3, the user may want to use the tmpfs volume for their own use cases. In particular, many modern ML libraries require a local cache. To support these use cases, tmpfs_path is exposed through the current object, as current.tempdir.
This allows the user to leverage the volume straightforwardly:
AutoModelForSeq2SeqLM.from_pretrained(
    model_path,
    cache_dir=current.tempdir,
    device_map='auto',
    load_in_8bit=True,
)
Introduce current.run and current.task in current singleton
With this release, you can access current.run and current.task within a running flow, allowing for use cases like:
from metaflow import current
# add tags from inside a run
current.run.add_tag('foobar')
Improvements
Make metaflow client objects backward compatible
The previous release broke backward compatibility in cases where the metaflow client object is deserialized from an older version of Metaflow. This release preserves the functionality and provides explicit compatibility guarantees going forward.
In case you need any assistance or have feedback for us, ping us at chat.metaflow.org or open a GitHub issue.
What's Changed
- Fix: Check all steps for MetaflowCode and return if any by @bsridatta in #1338
- chore: comment on run.code by @saikonen in #1357
- Add kubernetes labels by @dhpollack in #1236
- Revert "Add kubernetes labels" by @savingoyal in #1359
- fix: batch tmpfs enabling logic by @saikonen in #1365
- feature: tmpfs for kubernetes and argo by @saikonen in #1361
- Fix: Validate pathspec argument for MetaflowObject by @bsridatta in #1350
- Fix: METAFLOW_S3_ENDPOINT_URL as a part of airflow by @valayDave in #1368
- Introduce support for event-triggered workflows by @savingoyal in #1271
- feature: remove pylint dependency by @saikonen in #1378
- Fixing a MetaflowObject backward compatibility issue by @pjoshi30 in #1363
- added missing return statement by @felipeGarciaDiaz in #1383
- fix: batch decorator missing metadata handling by @saikonen in #1385
- mute argo event emmission by @savingoyal in #1386
- Update current object adding run and task object. by @romain-intel in #1384
- release 2.8.4 by @savingoyal in #1388
New Contributors
- @dhpollack made their first contribution in #1236
- @felipeGarciaDiaz made their first contribution in #1383
Full Changelog: 2.8.3...2.8.4