From bad0c3f7bb1610edab57d26a61ccc2470dc1e47c Mon Sep 17 00:00:00 2001 From: aireenmei Date: Wed, 27 Nov 2024 23:33:43 +0000 Subject: [PATCH] update setup_gcsfuse for better perf --- getting_started/Data_Input_Pipeline.md | 4 +++- setup_gcsfuse.sh | 18 +++++++++++------- 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/getting_started/Data_Input_Pipeline.md b/getting_started/Data_Input_Pipeline.md index fb53e63ee..84351325b 100644 --- a/getting_started/Data_Input_Pipeline.md +++ b/getting_started/Data_Input_Pipeline.md @@ -96,7 +96,9 @@ In HF or TFDS data pipeline, global shuffle is performed by a shuffle buffer wit 1. The dataset needs to be in a format that supports random access. The default format is [ArrayRecord](https://github.com/google/array_record). For converting a dataset into ArrayRecord, see [instructions](https://github.com/google/array_record/tree/main/beam). Additionally, other randomly accessible data sources can be supported via a custom data source class ([docs](https://github.com/google/grain/blob/main/docs/data_sources.md)). 2. An ArrayRecord dataset, when hosted on a GCS bucket, can only be read through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/setup.sh). The user then needs to mount the GCS bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/setup_gcsfuse.sh). The script configures several parameters for the mount. ``` -bash setup_gcsfuse.sh DATASET_GCS_BUCKET=$BUCKET_NAME MOUNT_PATH=$MOUNT_PATH +bash setup_gcsfuse.sh DATASET_GCS_BUCKET=$BUCKET_NAME MOUNT_PATH=$MOUNT_PATH [FILE_PATH=$MOUNT_PATH/my_dataset] +# FILE_PATH is optional; when provided, the script runs "ls -R" to pre-fill the metadata cache: +# https://cloud.google.com/storage/docs/cloud-storage-fuse/performance#improve-first-time-reads ``` 3. 
Set `dataset_type=grain` and set `grain_train_files` to match the ArrayRecord files via a local path since the bucket has been mounted. 4. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scene](https://github.com/google/grain/blob/main/docs/behind_the_scenes.md), [code](https://github.com/google/grain/blob/main/grain/_src/python/grain_pool.py)). If you use a large number of workers, please check your gcsfuse config in [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/setup_gcsfuse.sh) to avoid gcsfuse throttling. diff --git a/setup_gcsfuse.sh b/setup_gcsfuse.sh index 53e3baa7f..40c2066b7 100644 --- a/setup_gcsfuse.sh +++ b/setup_gcsfuse.sh @@ -15,7 +15,7 @@ # limitations under the License. # Description: -# bash setup_gcsfuse.sh DATASET_GCS_BUCKET=maxtext-dataset MOUNT_PATH=dataset +# bash setup_gcsfuse.sh DATASET_GCS_BUCKET=maxtext-dataset MOUNT_PATH=/tmp/gcsfuse FILE_PATH=/tmp/gcsfuse/my_dataset set -e @@ -44,9 +44,13 @@ fi mkdir -p $MOUNT_PATH # see https://cloud.google.com/storage/docs/gcsfuse-cli for all configurable options of gcsfuse CLI -# Grain uses _PROCESS_MANAGEMENT_MAX_THREADS = 64 (https://github.com/google/grain/blob/main/grain/_src/python/grain_pool.py) -# Please make sure max-conns-per-host > grain_worker_count * _PROCESS_MANAGEMENT_MAX_THREADS - -gcsfuse -o ro --implicit-dirs --http-client-timeout=5s --max-conns-per-host=2000 \ - --debug_fuse_errors --debug_fuse --debug_gcs --debug_invariants --debug_mutex \ - --log-file=$HOME/gcsfuse.json "$DATASET_GCS_BUCKET" "$MOUNT_PATH" +TIMESTAMP=$(date +%Y%m%d-%H%M) +gcsfuse -o ro --implicit-dirs --log-severity=debug \ + --type-cache-max-size-mb=-1 --stat-cache-max-size-mb=-1 --kernel-list-cache-ttl-secs=-1 --metadata-cache-ttl-secs=-1 \ + --log-file=$HOME/gcsfuse_$TIMESTAMP.json "$DATASET_GCS_BUCKET" "$MOUNT_PATH" + +# Use ls to pre-fill the metadata cache: 
https://cloud.google.com/storage/docs/cloud-storage-fuse/performance#improve-first-time-reads +if [[ -n ${FILE_PATH} ]]; then + FILE_COUNT=$(ls -R "$FILE_PATH" | wc -l) + echo "$FILE_COUNT files found in $FILE_PATH" +fi
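Viewed on its own, the prefill step this patch adds amounts to one recursive listing of the mounted dataset. A minimal sketch of that logic as a reusable function (the function name is illustrative and not part of setup_gcsfuse.sh):

```shell
# Sketch of the optional FILE_PATH prefill step: recursively listing the
# tree forces gcsfuse to stat every object once, warming the metadata
# caches enabled by the --*-cache-* flags in the mount command above.
# prefill_metadata_cache is a hypothetical helper, not in the patch.
prefill_metadata_cache() {
  local file_path="$1"
  local file_count
  file_count=$(ls -R "$file_path" | wc -l)
  echo "$file_count files found in $file_path"
}
```

On a real mount this only pays off once per mount, right after mounting and before the first read of the dataset.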