GCSFuse performance on Vertex AI custom training job #1830

miguelalba96 · 2024-04-07T14:21:06Z

Description

I am encountering data loading throughput issues while training a large model on Google Cloud Platform (GCP). Here's some context:

I am utilizing Vertex AI pipelines for my training process. According to GCP documentation, Vertex AI custom training jobs automatically mount GCS (Google Cloud Storage) buckets using GCSFuse. Upon debugging my training setup, I've identified that the bottleneck in data loading seems to be related to GCSFuse, leading to data starvation and subsequent drops in GPU utilization.

I've come across performance tips that discuss caching as a potential solution. However, since Vertex AI configures GCSFuse automatically, it's unclear how to enable caching.

Should I configure caching at runtime when running the training job?
When building the Docker image that contains my code to run as a custom job, should I mount the bucket manually and specify cache-dir? Won't that be reconfigured by Vertex AI when the job is submitted?

Additional context

I am running distributed training on a 4-node setup within Vertex AI pipelines. Each worker node is an n1-highmemory-16 machine equipped with 2 GPUs.

I am using google_cloud_pipeline_components.v1.custom_job.create_custom_training_job_from_component to create the custom training job.

In my code, I'm simply replacing gs:// with /gcs/ as per the GCP documentation for Vertex AI.
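The path substitution described above can be sketched as a small helper. This is a minimal illustration, assuming the bucket is exposed at `/gcs/<bucket-name>/` as the thread describes; the function name `gcs_uri_to_fuse_path` and the sample bucket are hypothetical.

```python
def gcs_uri_to_fuse_path(uri: str, mount_root: str = "/gcs") -> str:
    """Map gs://bucket/object to <mount_root>/bucket/object."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    remainder = uri[len(prefix):]
    # Reject inputs such as "gs://" or "gs:///object" that carry no bucket name.
    if not remainder or remainder.startswith("/"):
        raise ValueError(f"missing bucket name in {uri!r}")
    return f"{mount_root.rstrip('/')}/{remainder}"


# Hypothetical bucket and object names, for illustration only.
print(gcs_uri_to_fuse_path("gs://my-bucket/data/train.tfrecord"))
# prints /gcs/my-bucket/data/train.tfrecord
```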

Type of issue

Information - request

gargnitingoogle · 2024-04-08T07:24:19Z

Hi @miguelalba96, thanks for asking the question and for providing the full context of the problem.

> it's unclear how to enable caching

File caching in GCSFuse is a new feature that was added in v2.0.0, about 3 weeks ago. Unfortunately, Vertex AI pipelines still use an older version of GCSFuse that doesn't support file caching, so for now there is no way to enable or configure file caching in GCSFuse through a Vertex AI pipeline.

> Should I configure caching at runtime when running the training job?

As I said above, this is not possible through the Vertex AI job creation interface right now.

> should I mount manually the bucket and specify cache-dir, won't that be reconfigured by vertex AI when submitting the job?

If you can control which GCSFuse version is installed in your container and how it is used to mount buckets, then this might be possible. In that case, you need to install GCSFuse v2.0.0 (instructions) in your container and mount your buckets with a config file that sets the desired file-cache parameters (doc).
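A manual mount along these lines could be prepared as sketched below. This is a hedged illustration, not a confirmed Vertex AI recipe: it assumes the GCSFuse v2 config-file keys `cache-dir`, `file-cache.max-size-mb` and `file-cache.cache-file-for-range-read` and the `--config-file` flag, all of which should be checked against the linked doc for the installed version. The helper names, bucket, cache directory and mount point are hypothetical, and the code only builds the config text and command without running anything.

```python
import shlex


def build_file_cache_config(cache_dir: str, max_size_mb: int = -1,
                            cache_range_reads: bool = False) -> str:
    """Render a small GCSFuse config file that enables the file cache.

    Assumption: -1 for max-size-mb lets the cache use the full capacity of
    cache_dir; verify this against the GCSFuse documentation.
    """
    lines = [
        f"cache-dir: {cache_dir}",
        "file-cache:",
        f"  max-size-mb: {max_size_mb}",
        f"  cache-file-for-range-read: {str(cache_range_reads).lower()}",
    ]
    return "\n".join(lines) + "\n"


def build_mount_command(bucket: str, mount_point: str, config_path: str) -> list[str]:
    """Build the gcsfuse invocation as an argument list (not executed here)."""
    return ["gcsfuse", f"--config-file={config_path}", bucket, mount_point]


# Hypothetical paths: a local SSD cache directory and a mount point other
# than /gcs/, since Vertex AI may manage /gcs/ itself.
config_text = build_file_cache_config("/mnt/local-ssd/gcsfuse-cache")
command = build_mount_command("my-bucket", "/mnt/gcs-cached", "/etc/gcsfuse/config.yaml")

print(config_text)
print(shlex.join(command))
```

In a container entrypoint, the config text would be written to the config path and the command run (for example with `subprocess.run(command, check=True)`) before the training script starts.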

> won't that be reconfigured by vertex AI when submitting the job?

This is outside my area of expertise, but I can try to guess. If you install gcsfuse in your training container and mount your bucket at /gcs/<bucket-name>/, Vertex AI might override that mount. But I am not sure about it.

marcoa6 · 2024-04-08T14:16:25Z

Hi @miguelalba96, Vertex AI currently uses GCSFuse v1.x, which has stat and type caching enabled by default. The new file cache feature is only available in GCSFuse v2, which Vertex has not rolled out yet. We can let you know once it's available, but I also suggest opening a ticket/feature request with the Vertex AI team directly so they can track this.

gargnitingoogle · 2024-04-10T11:48:00Z

Lowering the priority to P2, as the upgrade of Vertex AI to GCSFuse v2 is already planned and is outside the scope of the GCSFuse team.

gargnitingoogle · 2024-04-23T05:05:28Z

Wanted to update here that VertexAI training upgrade to GCSFuse v2.0.0 is now complete. The VertexAI training team is now working on enabling GCSFuse file-cache feature in training jobs. Once that completes, the original problem that @miguelalba96 faced might get fixed. Though, AFAIK, VertexAI won't be providing users controls to configure file-cache feature parameters as Miguel asked.

tiagovrtr · 2024-05-09T11:30:48Z

> Wanted to update here that VertexAI training upgrade to GCSFuse v2.0.0 is now complete. The VertexAI training team is now working on enabling GCSFuse file-cache feature in training jobs. Once that completes, the original problem that @miguelalba96 faced might get fixed. Though, AFAIK, VertexAI won't be providing users controls to configure file-cache feature parameters as Miguel asked.

I'm keen to be able to use the file cache on the /gcs/ mount point. Can we get updates on this somewhere? Thanks.

marcoa6 · 2024-05-09T20:59:35Z

@tiagovrtr the Vertex team is working on the integration but doesn't have a timeline yet. Will report back.

jasonbrancazio · 2024-05-21T19:25:43Z

keep us posted! This feature will also be helpful to me. I've been using gcsfuse, file caching, and webdataset in a Vertex AI Workbench instance where I have more control over the environment, but I prefer to use Vertex AI Training once I'm out of the prototyping phase.

JasperW01 · 2024-06-27T23:20:43Z

keen to know where we are in terms of enabling GCSFuse file-cache feature in Vertex AI custom training jobs.

miguelalba96 · 2024-06-28T12:02:32Z

After months of experimenting with distributed training on GCP, I think the most cost-efficient solutions (especially when dealing with terabytes of data) for custom training job workflows or Workbench instances are:

- Download the data to the machine running the training job. Having the data on a local SSD makes training faster, which means fewer GPU hours for deep learning workflows.
- If persistent storage is needed (e.g. you are debugging a Vertex AI pipeline component and don't want the infrastructure to disappear once the process crashes), mount an NFS share (Filestore) containing the data to the custom training job. Mounting also gives you higher data throughput during training, reducing GPU costs.

Then use GCSFuse only for model artifacts, metadata and small dataframes.
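The first option, staging the dataset on local disk before training, can be sketched with the standard library alone. This is one possible approach under stated assumptions: it copies files through an already mounted source directory (for example a `/gcs/<bucket-name>/...` path) instead of calling a storage client, and the function name `stage_dataset`, the worker count and the demo paths are hypothetical.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def stage_dataset(src_root: str, dst_root: str, workers: int = 16) -> int:
    """Copy every file under src_root to dst_root, preserving the layout.

    Returns the number of files found under src_root.
    """
    src = Path(src_root)
    dst = Path(dst_root)
    files = [p for p in src.rglob("*") if p.is_file()]

    def copy_one(path: Path) -> None:
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Skip files already staged with the same size (cheap resume check).
        if target.exists() and target.stat().st_size == path.stat().st_size:
            return
        shutil.copyfile(path, target)

    # Threads overlap the per-file latency of reads from the mounted bucket.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_one, files))  # list() re-raises any copy error
    return len(files)


# Demo on temporary directories standing in for the mount and the local SSD.
with tempfile.TemporaryDirectory() as fake_mount, tempfile.TemporaryDirectory() as local_ssd:
    shard = Path(fake_mount) / "train" / "shard-000.bin"
    shard.parent.mkdir(parents=True)
    shard.write_bytes(b"example bytes")
    count = stage_dataset(fake_mount, local_ssd, workers=4)
    print(f"staged {count} file(s)")
```

In a training job, the call would run once per worker before the data loader is created, with the loader then pointed at the local copy.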

gargnitingoogle · 2024-07-01T10:10:17Z

@miguelalba96 I'm glad that you have found a workaround that works for you.

> Mounting also gives you higher data throughput during training reducing GPU costs

I am curious: higher compared to which alternative, the local SSD or reads through GCSFuse? Also, could you share what kind of read speeds you are getting with the Filestore NFS mount?

I am asking to find out whether GCSFuse with file-cache enabled will beat these once Vertex AI has rolled out support for it.
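For comparing read speeds across a local SSD, a Filestore NFS mount and a GCSFuse mount, a simple sequential-read timer is one way to get comparable numbers. This is a minimal sketch, not a benchmark methodology from the thread: the function name `measure_read_throughput` and the chunk size are hypothetical choices, and repeated reads of the same file may be served from the OS page cache, which inflates the result.

```python
import os
import tempfile
import time


def measure_read_throughput(path: str, chunk_size: int = 8 * 1024 * 1024) -> float:
    """Read a file sequentially and return the throughput in MiB/s."""
    total_bytes = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS may still cache.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return (total_bytes / (1024 * 1024)) / max(elapsed, 1e-9)


# Demo on a temporary 4 MiB file; in practice, point it at one large file on
# each storage option being compared.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    sample_path = tmp.name
try:
    print(f"{measure_read_throughput(sample_path):.1f} MiB/s")
finally:
    os.remove(sample_path)
```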

gargnitingoogle · 2024-07-01T10:31:47Z

> keen to know where we are in terms of enabling GCSFuse file-cache feature in Vertex AI custom training jobs.

@JasperW01 when I last spoke to the Vertex AI training team, they told me that they would be rolling out the change within Q3. I'll update here as soon as it is fully rolled out and you can expect it to make a difference.

JasperW01 · 2024-09-09T01:58:36Z

Thanks a lot, @gargnitingoogle. Is there any update on the rollout? My customers are keen to leverage it in their production workloads. Thanks in advance.

gargnitingoogle · 2024-09-09T04:02:20Z

@JasperW01 AFAIK, the plan is to roll it out for a limited set of machine types at first, and initially only for Autopilot GKE clusters. I'll share more concrete information as soon as I have it.

miguelalba96 added the p1 (P1) and question (Customer Issue: question about how to use tool) labels Apr 7, 2024

gargnitingoogle self-assigned this Apr 8, 2024

gargnitingoogle mentioned this issue Apr 8, 2024

gcsfuse not working with custom container in Vertex AI on NVIDIA_L4 #1677

Closed

gargnitingoogle added the p2 (P2) label and removed the p1 (P1) label Apr 10, 2024
