support tf2 resnet50 #1

Closed
wants to merge 72 commits

Changes from 4 commits

Commits (72)
64e97e1
support resnet50-tf2
Jan 20, 2023
49eecd5
support resnet50-tf2
Jan 29, 2023
1a73ac4
rm
Jan 29, 2023
7d14804
rm unnecessary file
Jan 31, 2023
5e71532
Modify according to review comments
Feb 6, 2023
5f74151
Modify according to review comments
Feb 6, 2023
80798ab
check pep8
Feb 9, 2023
c17e0c2
check pep8
Feb 9, 2023
caf2032
support adapter
Feb 9, 2023
30b7aaa
update pip source to baidu for more stable accessibility (#20)
yuzhou03 Mar 10, 2023
2eb5bce
add standard case modification PR requirements
Mar 10, 2023
62f587f
update run_pretraining.example.py. config should be declared as globa…
Mar 13, 2023
e7d7139
rename file to skip flake8 check
Mar 13, 2023
a2abc50
init_dist_training_env support single gpu (#23)
upvenly Mar 13, 2023
d106cfc
Merge pull request #25 from yuzhou03/docs-dev
upvenly Mar 13, 2023
b59cbf7
Update tox.ini
yuzhou03 Mar 14, 2023
a0a01e9
Create .github/workflows/yapf-check.yaml
yuzhou03 Mar 15, 2023
b7d374b
GLM: fix distributed handling for single-card training (#28)
yuzhou03 Mar 16, 2023
8caf9a7
fix single-card training for CPM (#27)
yuzhou03 Mar 17, 2023
a9f5625
GLM: add kunlunxin R300x1x8 config
dynamicheart Mar 16, 2023
d410c08
fix yapf style check warnings
Mar 17, 2023
f059298
update markdown:1.solve display issues of angle brackets. 2.replace l…
yuzhou03 Mar 18, 2023
2b697c9
source environment_variables.sh before start task (#33)
dynamicheart Mar 20, 2023
05f517b
Merge pull request #32 from yuzhou03/yapf-style
upvenly Mar 20, 2023
7353e9f
Merge branch 'main' into doc-case-standard
yuzhou03 Mar 20, 2023
e977f03
Merge pull request #24 from yuzhou03/doc-case-standard
upvenly Mar 20, 2023
12ec9ef
doc: update supported case table for repo readme
Mar 21, 2023
114e67a
update case-adaption spec: add configs
Mar 21, 2023
b8226b3
remove get_finished_info from helper.py. update related content in st…
Mar 21, 2023
99f52db
GLM: add nv configs, add running statistics to GLM-pytorch case readme
Mar 21, 2023
d210b62
process logEvent for evaluate, step_begin and init_evaluate (#35)
yuzhou03 Mar 22, 2023
a5f0aff
Merge pull request #30 from yuzhou03/repo-readme
upvenly Mar 22, 2023
83b605f
Merge pull request #36 from yuzhou03/doc-adaption
upvenly Mar 22, 2023
e5be778
Merge pull request #29 from yuzhou03/glm-running-stat
upvenly Mar 22, 2023
4ef6057
GLM: add comments for kunlunxin config
dynamicheart Mar 23, 2023
48ad35b
Merge pull request #31 from dynamicheart/main
upvenly Mar 24, 2023
d229491
add pytorch glm by iluvatar
yan-rui Mar 24, 2023
b0397dd
fix some bugs
yan-rui Mar 24, 2023
14efa78
print cmd result after running for easy debug
yan-rui Mar 24, 2023
f1c5d7d
update iluvatar GLM-pytorch info
yan-rui Mar 24, 2023
f0e8f32
update iluvatar GLM-pytorch info
yan-rui Mar 24, 2023
2389c3f
first commit for iluvatar cpm-pytorch
yan-rui Mar 24, 2023
843095c
fix some bugs
yan-rui Mar 25, 2023
98ef713
add configs : 1x1 1x2 1x4 1x8 2x8
yan-rui Mar 25, 2023
8845a3c
Rename config_BI100x1x8.py to config_BI-V100x1x8.py
yan-rui Mar 25, 2023
75809c1
rename configs
Mar 25, 2023
f712404
update performance of iluvatar cpm-pytorch
yan-rui Mar 25, 2023
07949a1
iluvatar glm-pytorch reformat by yapf
Mar 25, 2023
c222721
iluvatar cpm-pytorch reformat by yapf
Mar 25, 2023
37cb352
Merge pull request #37 from yuzhou03/finished_info
upvenly Mar 27, 2023
25f5795
modify tf2-res50
Mar 27, 2023
95c11c9
Iluvatar glm-pytorch: remove unused code
yan-rui Mar 27, 2023
6d01d59
add email for software; fix some bugs;
yan-rui Mar 27, 2023
e86056b
1. Adjusted the config file.
forestlee95 Mar 27, 2023
ec6d202
GLM: add GLM case document for kunlunxin
dynamicheart Mar 27, 2023
107c1a4
modify tf2-res50
Mar 28, 2023
b8f09f9
modify tf2-res50
Mar 28, 2023
2ac585e
modify tf2-res50
Mar 28, 2023
da56212
Iluvatar GLM-pytorch: remove unused files
yan-rui Mar 28, 2023
a3378ef
reset doc
Mar 28, 2023
0254099
Iluvatar CPM-pytorch: merge glm
yan-rui Mar 28, 2023
8f380af
rm unnecessary file
Mar 28, 2023
f6ce7fa
Merge pull request #40 from yan-rui/iluvatar_cpm
upvenly Mar 28, 2023
bbb4eeb
Merge pull request #41 from dynamicheart/kunlunxin_doc
upvenly Mar 29, 2023
7607122
rm unnecessary file
Mar 29, 2023
cb598c9
support resume & pep8
Mar 29, 2023
8dee852
update readme
Mar 29, 2023
d9e6055
update readme
Mar 29, 2023
800b9c8
update readme
Mar 29, 2023
a18737f
update readme
Mar 29, 2023
ef80bcb
update readme
Mar 29, 2023
bb54b1e
rm unnecessary code & update readme
Mar 30, 2023
7 changes: 6 additions & 1 deletion .gitignore
@@ -3,4 +3,9 @@
.ijwb/
.vscode/
__pycache__/
.pytest_cache
.pytest_cache
training/result/*
training/result_ckpt/*
training/result_ckpt/events.*
training/result_save/*
training/result_save/model.*
13 changes: 12 additions & 1 deletion training/benchmarks/driver/base.py
@@ -21,6 +21,11 @@ def __init__(self, config, mutable_params):
        self.logger = None

    def setup_config(self, parser):
        parser.add_argument(
            "--data_dir",
            type=str,
            required=False,
            help="The full path to the root of external modules")
        parser.add_argument(
            "--extern_module_dir",
            type=str,
@@ -49,8 +54,12 @@ def setup_config(self, parser):
                                            self.extern_modules)
        self.logger = perf_logger.PerfLogger.get_default_logger(
            rank=self.config.local_rank)
        try:
            log_freq = self.config.log_freq
        except AttributeError:
            log_freq = self.config.train.time_history.log_steps
        event_manager = log_event.LogEventManager(
            self.config.local_rank, self.logger, log_freq=self.config.log_freq)
            self.config.local_rank, self.logger, log_freq=log_freq)
        event_manager.register_event_handlers(self)
        for _, mod in self.extern_modules.items():
            for cls in mod_util.find_derived_classes(EventManager, mod):
@@ -65,6 +74,8 @@ def setup_modules(self, *args):
            elif isinstance(arg, dict):
                print(str(arg) + " remap by " + str(self.extern_modules))
                mod_util.remap_modules(arg, self.extern_modules)
            elif isinstance(arg, object):  # TODO
                pass
            else:
                raise TypeError('Can either be a module or a dict')

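For context on the `log_freq` change above: the try/except lets the same driver serve both the flat attribute-style configs used by the existing PyTorch cases and the nested tf2 config object, which keeps its logging interval under `train.time_history.log_steps`. A minimal sketch of that fallback, using `types.SimpleNamespace` stand-ins rather than FlagPerf's real config classes (`resolve_log_freq` is an illustrative helper name, not part of this PR):

```python
from types import SimpleNamespace


def resolve_log_freq(config):
    """Prefer a flat `log_freq` attribute; otherwise fall back to the
    nested tf2-style `train.time_history.log_steps` field."""
    try:
        return config.log_freq
    except AttributeError:
        return config.train.time_history.log_steps


# Flat, pytorch-style config: log_freq is a top-level attribute.
flat_cfg = SimpleNamespace(log_freq=10)
# Nested, tf2-style config: the logging interval lives under train.time_history.
tf2_cfg = SimpleNamespace(
    train=SimpleNamespace(time_history=SimpleNamespace(log_steps=100)))

assert resolve_log_freq(flat_cfg) == 10
assert resolve_log_freq(tf2_cfg) == 100
```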
7 changes: 5 additions & 2 deletions training/benchmarks/driver/config_manager.py
@@ -144,8 +144,11 @@ def activate(base_config,

    parsed_params = parse_from_args_and_config(params, cmd_args, ext_config,
                                               enable_extern_config)

    _merge_dict_to_config(parsed_params.__dict__, base_config.__dict__)
    # tf2
    if isinstance(base_config, object):
        base_config.override(parsed_params.__dict__, False)
    else:
        _merge_dict_to_config(parsed_params.__dict__, base_config.__dict__)

    if ext_config:
        config_path = ext_config
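The branch above routes the parsed command-line overrides through the tf2 config object's `override()` method instead of the flat dict merge. Note that `isinstance(base_config, object)` is true for any Python object, so as written the `_merge_dict_to_config` branch is unreachable; a duck-typed check on the `override` attribute would express the intent more directly. A rough sketch under that assumption, with an illustrative `ParamsDict` stand-in rather than FlagPerf's actual config class:

```python
class ParamsDict:
    """Illustrative tf2-style config object that supports override()."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def override(self, overrides, is_strict=True):
        for key, value in overrides.items():
            if is_strict and key not in self.__dict__:
                raise KeyError(f"Unknown config key: {key}")
            setattr(self, key, value)


def merge_parsed_params(base_config, parsed_params):
    # Duck-type on override() rather than isinstance(base_config, object),
    # which every Python object satisfies.
    if hasattr(base_config, "override"):
        base_config.override(parsed_params, False)
    else:
        base_config.__dict__.update(parsed_params)  # plain namespace merge


cfg = ParamsDict(batch_size=32, log_freq=10)
merge_parsed_params(cfg, {"batch_size": 256})
assert cfg.batch_size == 256
```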
248 changes: 248 additions & 0 deletions training/benchmarks/driver/dist_tensorflow2.py
@@ -0,0 +1,248 @@
# Copyright © 2022 BAAI. All rights reserved.

import logging
import random
import os
import json
from contextlib import contextmanager
import tensorflow as tf


def generate_seeds(rng, size):
    """
    Generate list of random seeds

    :param rng: random number generator
    :param size: length of the returned list
    """
    seeds = [rng.randint(0, 2**32 - 1) for _ in range(size)]
    return seeds


def global_batch_size(config):
    return config.train_dataset.batch_size * config.runtime.num_gpus  # TODO get_world_size()


def format_step(step):
    if isinstance(step, str):
        return step
    s = ""
    if len(step) > 0:
        s += "Training Epoch: {} ".format(step[0])
    if len(step) > 1:
        s += "Training Iteration: {} ".format(step[1])
    if len(step) > 2:
        s += "Validation Iteration: {} ".format(step[2])
    return s


def _mirrored_cross_device_ops(all_reduce_alg, num_packs):
    """Return a CrossDeviceOps based on all_reduce_alg and num_packs.

    Args:
      all_reduce_alg: a string specifying which cross device op to pick, or None.
      num_packs: an integer specifying number of packs for the cross device op.

    Returns:
      tf.distribute.CrossDeviceOps object or None.

    Raises:
      ValueError: if `all_reduce_alg` not in [None, "nccl", "hierarchical_copy"].
    """
    if all_reduce_alg is None:
        return None
    mirrored_all_reduce_options = {
        "nccl": tf.distribute.NcclAllReduce,
        "hierarchical_copy": tf.distribute.HierarchicalCopyAllReduce
    }
    if all_reduce_alg not in mirrored_all_reduce_options:
        raise ValueError(
            "When used with `mirrored`, valid values for all_reduce_alg are "
            "[`nccl`, `hierarchical_copy`]. Supplied value: {}".format(
                all_reduce_alg))
    cross_device_ops_class = mirrored_all_reduce_options[all_reduce_alg]
    return cross_device_ops_class(num_packs=num_packs)


def tpu_initialize(tpu_address):
    """Initializes TPU for TF 2.x training.

    Args:
      tpu_address: string, bns address of master TPU worker.

    Returns:
      A TPUClusterResolver.
    """
    cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        tpu=tpu_address)
    if tpu_address not in ("", "local"):
        tf.config.experimental_connect_to_cluster(cluster_resolver)
    tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
    return cluster_resolver


def configure_cluster(worker_hosts=None, task_index=-1):
    """Set multi-worker cluster spec in TF_CONFIG environment variable.

    Args:
      worker_hosts: comma-separated list of worker ip:port pairs.
      task_index: index of the worker.

    Returns:
      Number of workers in the cluster.
    """
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    if tf_config:
        num_workers = (
            len(tf_config["cluster"].get("chief", [])) +
            len(tf_config["cluster"].get("worker", [])))
    elif worker_hosts:
        workers = worker_hosts.split(",")
        num_workers = len(workers)
        if num_workers > 1 and task_index < 0:
            raise ValueError("Must specify task_index when number of workers > 1")
        task_index = 0 if num_workers == 1 else task_index
        os.environ["TF_CONFIG"] = json.dumps({
            "cluster": {
                "worker": workers
            },
            "task": {
                "type": "worker",
                "index": task_index
            }
        })
    else:
        num_workers = 1
    return num_workers


def _collective_communication(all_reduce_alg):
    """Return a CollectiveCommunication based on all_reduce_alg.

    Args:
      all_reduce_alg: a string specifying which collective communication to pick,
        or None.

    Returns:
      tf.distribute.experimental.CollectiveCommunication object

    Raises:
      ValueError: if `all_reduce_alg` not in [None, "ring", "nccl"]
    """
    collective_communication_options = {
        None: tf.distribute.experimental.CollectiveCommunication.AUTO,
        "ring": tf.distribute.experimental.CollectiveCommunication.RING,
        "nccl": tf.distribute.experimental.CollectiveCommunication.NCCL
    }
    if all_reduce_alg not in collective_communication_options:
        raise ValueError(
            "When used with `multi_worker_mirrored`, valid values for "
            "all_reduce_alg are [`ring`, `nccl`]. Supplied value: {}".format(
                all_reduce_alg))
    return collective_communication_options[all_reduce_alg]


def get_distribution_strategy(distribution_strategy="mirrored",
                              num_gpus=0,
                              all_reduce_alg=None,
                              num_packs=1,
                              tpu_address=None,
                              **kwargs):
    """Return a Strategy for running the model.

    Args:
      distribution_strategy: a string specifying which distribution strategy to
        use. Accepted values are "off", "one_device", "mirrored",
        "parameter_server", "multi_worker_mirrored", and "tpu" -- case
        insensitive. "tpu" means to use TPUStrategy using `tpu_address`.
        "off" means to use the default strategy which is obtained from
        tf.distribute.get_strategy (for details on the default strategy, see
        https://www.tensorflow.org/guide/distributed_training#default_strategy).
      num_gpus: Number of GPUs to run this model.
      all_reduce_alg: Optional. Specifies which algorithm to use when performing
        all-reduce. For `MirroredStrategy`, valid values are "nccl" and
        "hierarchical_copy". For `MultiWorkerMirroredStrategy`, valid values are
        "ring" and "nccl". If None, DistributionStrategy will choose based on
        device topology.
      num_packs: Optional. Sets the `num_packs` in `tf.distribute.NcclAllReduce`
        or `tf.distribute.HierarchicalCopyAllReduce` for `MirroredStrategy`.
      tpu_address: Optional. String that represents TPU to connect to. Must not be
        None if `distribution_strategy` is set to `tpu`.
      **kwargs: Additional kwargs for internal usages.

    Returns:
      tf.distribute.Strategy object.
    Raises:
      ValueError: if `distribution_strategy` is "off" or "one_device" and
        `num_gpus` is larger than 1; or `num_gpus` is negative or if
        `distribution_strategy` is `tpu` but `tpu_address` is not specified.
    """
    del kwargs
    if num_gpus < 0:
        raise ValueError("`num_gpus` can not be negative.")

    if not isinstance(distribution_strategy, str):
        msg = ("distribution_strategy must be a string but got: %s." %
               (distribution_strategy,))
        if distribution_strategy == False:  # pylint: disable=singleton-comparison,g-explicit-bool-comparison
            msg += (" If you meant to pass the string 'off', make sure you add "
                    "quotes around 'off' so that yaml interprets it as a string "
                    "instead of a bool.")
        raise ValueError(msg)

    distribution_strategy = distribution_strategy.lower()
    if distribution_strategy == "off":
        if num_gpus > 1:
            raise ValueError(f"When {num_gpus} GPUs are specified, "
                             "distribution_strategy flag cannot be set to `off`.")
        # Return the default distribution strategy.
        return tf.distribute.get_strategy()

    if distribution_strategy == "tpu":
        # When tpu_address is an empty string, we communicate with local TPUs.
        cluster_resolver = tpu_initialize(tpu_address)
        return tf.distribute.TPUStrategy(cluster_resolver)

    if distribution_strategy == "multi_worker_mirrored":
        return tf.distribute.experimental.MultiWorkerMirroredStrategy(
            communication=_collective_communication(all_reduce_alg))

    if distribution_strategy == "one_device":
        if num_gpus == 0:
            return tf.distribute.OneDeviceStrategy("device:CPU:0")
        if num_gpus > 1:
            raise ValueError("`OneDeviceStrategy` can not be used for more than "
                             "one device.")
        return tf.distribute.OneDeviceStrategy("device:GPU:0")

    if distribution_strategy == "mirrored":
        if num_gpus == 0:
            devices = ["device:CPU:0"]
        else:
            devices = ["device:GPU:%d" % i for i in range(num_gpus)]
        return tf.distribute.MirroredStrategy(
            devices=devices,
            cross_device_ops=_mirrored_cross_device_ops(all_reduce_alg, num_packs))

    if distribution_strategy == "parameter_server":
        cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
        return tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

    raise ValueError("Unrecognized Distribution Strategy: %r" %
                     distribution_strategy)


def get_strategy_scope(strategy):
    if strategy:
        strategy_scope = strategy.scope()
    else:
        strategy_scope = DummyContextManager()

    return strategy_scope


class DummyContextManager(object):

    def __enter__(self):
        pass

    def __exit__(self, *args):
        pass
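To show how the new module is meant to be used by the tf2 ResNet50 case, here is a hypothetical end-to-end sketch; the `driver.dist_tensorflow2` import path and the Keras model below are placeholders, not code from this PR:

```python
import tensorflow as tf

# Placeholder import path for the module added above.
from driver import dist_tensorflow2

# Build a MirroredStrategy over the visible GPUs (falls back to CPU if none).
num_gpus = len(tf.config.list_physical_devices("GPU"))
strategy = dist_tensorflow2.get_distribution_strategy(
    distribution_strategy="mirrored", num_gpus=num_gpus)

# Variables created inside the scope are mirrored across the selected devices;
# get_strategy_scope() degrades to a no-op context manager when strategy is None.
with dist_tensorflow2.get_strategy_scope(strategy):
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```

For multi-node runs, `configure_cluster()` would first populate `TF_CONFIG` from the worker host list before requesting the `multi_worker_mirrored` strategy.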