
Commit

No slurm (#39)
* changed metaseq_internal to metaseq

* fixed naming of script in docs, added resharding script without slurm
tsor13 authored May 4, 2022
1 parent d679a4e commit 6ade896
Showing 2 changed files with 31 additions and 1 deletion.
26 changes: 26 additions & 0 deletions metaseq/scripts/reshard_mp_launch_no_slurm.sh
@@ -0,0 +1,26 @@
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

prefix=$1      # path prefix of the sharded checkpoint, e.g. .../checkpoint_last
save_dir=$2    # directory to write the resharded checkpoint files to
mparts=$3      # number of model-parallel parts
tgt_size=$4    # target DDP (data-parallel) size
shift 4
mkdir -p "$save_dir"
last_part=$((mparts - 1))
echo "$@"
# Launch one resharding process per model-parallel part in the background.
for i in $(seq 0 "$last_part")
do
  echo "python3 -m metaseq.scripts.reshard_mp $prefix $save_dir --part $i --target-ddp-size $tgt_size"
  jname=reshard_mp"$i"_ddp"$tgt_size"
  echo "$jname"
  python3 -m metaseq.scripts.reshard_mp "$prefix" "$save_dir" --part "$i" --target-ddp-size "$tgt_size" &
done
# Block until every background resharding job has finished.
echo "Waiting on jobs..."
wait $(jobs -p)
echo "Done"

6 changes: 5 additions & 1 deletion projects/OPT/download_opt175b.md
@@ -17,7 +17,11 @@ bash metaseq/scripts/download_opt175b.sh "<presigned_url_given_in_email>"
## Reshard the shards
To consolidate the 992 shards into 8 files for model-parallel evaluation, run (assuming you have SLURM set up already):
```
-bash metaseq/scripts/reshard_sbatch.sh <directory_where_all_the_shards_are>/checkpoint_last <output_dir>/ 8 1
+bash metaseq/scripts/reshard_mp_launch.sh <directory_where_all_the_shards_are>/checkpoint_last <output_dir>/ 8 1
```
If you don't have SLURM on your machine, run:
```
bash metaseq/scripts/reshard_mp_launch_no_slurm.sh <directory_where_all_the_shards_are>/checkpoint_last <output_dir>/ 8 1
```
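
The no-SLURM launcher runs every part as a background process on the local machine and blocks until all of them finish, so for a checkpoint this large you may want to detach it and capture its output when working over SSH. A minimal sketch (the log file name is illustrative):
```
# Hypothetical log file name; redirect output and detach so the jobs survive a dropped session.
nohup bash metaseq/scripts/reshard_mp_launch_no_slurm.sh <directory_where_all_the_shards_are>/checkpoint_last <output_dir>/ 8 1 > reshard_no_slurm.log 2>&1 &
```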

Note that most of our models expect to run with Model (Tensor) Parallelism. For smaller models, some
