After enabling memory optimization, fetched loss values are abnormal when training with multiple losses #11320
@dyning Can you give a screenshot of your code? The part that uses memory_optimizer and ParallelExecutor.run. @dzhwinter @reyoung @chengduoZH I suspect the memory_optimizer somehow renamed the variables, so ParallelExecutor fetched the wrong ones.
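A rough way to check that hypothesis is to diff the program's variable names before and after the pass; the sketch below assumes `Program.list_vars()` enumerates the variables and that `fluid.memory_optimize` rewrites the program in place:

```python
import paddle.fluid as fluid

# Assumes the network has already been built on the default main program.
prog = fluid.default_main_program()

names_before = set(v.name for v in prog.list_vars())
fluid.memory_optimize(prog)  # the pass mutates the program in place
names_after = set(v.name for v in prog.list_vars())

print("variables dropped or renamed:", sorted(names_before - names_after))
print("variables introduced:        ", sorted(names_after - names_before))
```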
Screenshot of the code is as follows:

```python
optimizer = fluid.optimizer.Momentum(
    learning_rate=fluid.layers.piecewise_decay(
        boundaries=self.propsalparam['bd'], values=self.propsalparam['lr']),
    momentum=0.9,
    regularization=fluid.regularizer.L2Decay(1e-4))
opts = optimizer.minimize(avg_loss)

# fluid.memory_optimize(fluid.default_main_program())
# place = fluid.CPUPlace()
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

pretrain_models_path = self.propsalparam['pretrain_models_path']
save_interval = self.propsalparam['save_interval']
if pretrain_models_path is not None:
    def if_exist(var):
        """if_exist"""
        return os.path.exists(os.path.join(pretrain_models_path, var.name))
    fluid.io.load_vars(exe, pretrain_models_path, predicate=if_exist)

train_reader = paddle.batch(
    logodet_reader.reader_creator_logodet(self.configfile, self.sectionname),
    batch_size=self.propsalparam['batch_size'])
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_loss.name)

fetch_list_var = [avg_rpn_loss_cls.name, avg_rpn_loss_loc.name,
                  avg_loss_cls.name, avg_loss_loc.name, avg_loss.name,
                  rpn_acc_top1.name, acc_top1.name, "learning_rate"]
fetch_list_name = ["rpn_loss_cls", "rpn_loss_loc", "loss_cls",
                   "loss_loc", "loss", "rpn_acc_top1", "acc_top1", "lr"]

with open("./output/train_log.txt", "wb") as fout_log:
    for pass_id in range(self.propsalparam['epoch_num']):
        begtime = time.time()
        for batch_id, blobs in enumerate(train_reader()):
            feed_dict = self.convert_blobs_to_feed_dict(blobs, place)
            results = train_exe.run(fetch_list_var, feed=feed_dict)
```
@dyning Could you try the patch in #11372 and confirm whether it works? First list the values you want to fetch, then pass them to the memory_optimize interface:

```python
fetch_list_var = [avg_rpn_loss_cls.name, avg_rpn_loss_loc.name,
                  avg_loss_cls.name, avg_loss_loc.name, avg_loss.name,
                  rpn_acc_top1.name, acc_top1.name, "learning_rate"]
fluid.memory_optimize(fluid.default_main_program(), fetch_list_var)
```
@dyning Has this issue been resolved?
Both @QiJune's PR and mine can fix this issue. With memory_optimize turned on and off, the results align exactly. The verification is as follows:

```python
train_reader = paddle.batch(reader.fake_reader(), batch_size=train_batch_size)
fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name,
              avg_cost0.name, avg_cost1.name, avg_cost2.name]
if with_memory_optimization:
    fluid.memory_optimize(fluid.default_main_program(), skip_opt_set=set(fetch_list))

for pass_id in range(params["num_epochs"]):
    train_info = [[], [], []]
    test_info = [[], [], []]
    for batch_id, data in enumerate(train_reader()):
        t1 = time.time()
        loss, acc1, acc5, cost0, cost1, cost2 = train_exe.run(fetch_list, feed=feeder.feed(data))
        t2 = time.time()
        period = t2 - t1
        loss = np.mean(np.array(loss))
        acc1 = np.mean(np.array(acc1))
        acc5 = np.mean(np.array(acc5))
        train_info[0].append(loss)
        train_info[1].append(acc1)
        train_info[2].append(acc5)
        if batch_id % 1 == 0:
            print("Pass {0}, trainbatch {1}, loss {2}, "
                  "acc1 {3}, acc5 {4}, {5}, {6}, {7}, time {8}"
                  .format(pass_id, batch_id, loss, acc1, acc5,
                          np.array(cost0), np.array(cost1), np.array(cost2),
                          "%2.2f sec" % period))
            sys.stdout.flush()
```

The reader part:

```python
shape = [3, 224, 224]
label = range(102)
np.random.seed(100)

def fake_reader():
    def reader():
        while True:
            yield np.random.uniform(size=shape), np.random.choice(label)
    return reader
```

With memory_optimize disabled:

```
Pass 0, trainbatch 0, loss 7.40484714508, acc1 0.0, acc5 0.0625, [4.618476], [4.6716123], [4.616293], time 0.11 sec
Pass 0, trainbatch 1, loss 7.40602397919, acc1 0.03125, acc5 0.0625, [4.629113], [4.6268578], [4.629512], time 0.09 sec
Pass 0, trainbatch 2, loss 7.42118453979, acc1 0.0, acc5 0.0, [4.647031], [4.606136], [4.641044], time 0.11 sec
Pass 0, trainbatch 3, loss 7.39293956757, acc1 0.03125, acc5 0.0625, [4.620225], [4.600328], [4.6420546], time 0.10 sec
Pass 0, trainbatch 4, loss 7.45125865936, acc1 0.0, acc5 0.0, [4.6322255], [4.7699347], [4.6268435], time 0.10 sec
Pass 0, trainbatch 5, loss 7.39793109894, acc1 0.0, acc5 0.0625, [4.6226783], [4.621888], [4.628953], time 0.09 sec
Pass 0, trainbatch 6, loss 7.39033412933, acc1 0.0, acc5 0.03125, [4.6119056], [4.6370993], [4.6243286], time 0.10 sec
Pass 0, trainbatch 7, loss 7.37090873718, acc1 0.03125, acc5 0.0625, [4.6033278], [4.618459], [4.606809], time 0.09 sec
Pass 0, trainbatch 8, loss 7.45732116699, acc1 0.0, acc5 0.03125, [4.650306], [4.720114], [4.636603], time 0.10 sec
Pass 0, trainbatch 9, loss 7.50588703156, acc1 0.03125, acc5 0.09375, [4.711724], [4.682214], [4.6316633], time 0.09 sec
Pass 0, trainbatch 10, loss 7.42192602158, acc1 0.0, acc5 0.0, [4.638735], [4.6542664], [4.623038], time 0.10 sec
```

With memory_optimize enabled:

```
Pass 0, trainbatch 0, loss 7.40484714508, acc1 0.0, acc5 0.0625, [4.618476], [4.6716123], [4.616293], time 0.11 sec
Pass 0, trainbatch 1, loss 7.40602397919, acc1 0.03125, acc5 0.0625, [4.629113], [4.6268578], [4.629512], time 0.09 sec
Pass 0, trainbatch 2, loss 7.42118453979, acc1 0.0, acc5 0.0, [4.647031], [4.6061325], [4.641046], time 0.09 sec
Pass 0, trainbatch 3, loss 7.39287471771, acc1 0.03125, acc5 0.0625, [4.6201744], [4.600312], [4.642022], time 0.08 sec
Pass 0, trainbatch 4, loss 7.45143222809, acc1 0.0, acc5 0.0, [4.6320643], [4.7702603], [4.627632], time 0.10 sec
Pass 0, trainbatch 5, loss 7.39721155167, acc1 0.0, acc5 0.0625, [4.622399], [4.6212826], [4.6280923], time 0.09 sec
Pass 0, trainbatch 6, loss 7.39040565491, acc1 0.0, acc5 0.03125, [4.611952], [4.636217], [4.625294], time 0.09 sec
Pass 0, trainbatch 7, loss 7.37325382233, acc1 0.03125, acc5 0.0625, [4.604336], [4.6227126], [4.6070127], time 0.09 sec
Pass 0, trainbatch 8, loss 7.4660615921, acc1 0.0, acc5 0.03125, [4.657354], [4.7219033], [4.6404552], time 0.10 sec
Pass 0, trainbatch 9, loss 7.50238180161, acc1 0.03125, acc5 0.09375, [4.711801], [4.669479], [4.6324573], time 0.10 sec
Pass 0, trainbatch 10, loss 7.42162179947, acc1 0.0, acc5 0.0, [4.640833], [4.6434155], [4.625881], time 0.10 sec
```
Pay particular attention to the order here: the variables to be fetched must either be set to persistable before calling memory_optimize, or added to the skip_set.
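As a sketch, the two workarounds look roughly like this (assumed usage; `avg_loss` and `acc_top1` stand in for whatever variables you actually fetch, and in practice you would pick one of the two options):

```python
import paddle.fluid as fluid

# Placeholder variables you intend to fetch later.
fetch_vars = [avg_loss, acc_top1]

# Option 1: mark them persistable BEFORE calling memory_optimize,
# so the optimization pass will not reuse their memory.
for v in fetch_vars:
    v.persistable = True
fluid.memory_optimize(fluid.default_main_program())

# Option 2: exclude them explicitly via skip_opt_set.
fluid.memory_optimize(fluid.default_main_program(),
                      skip_opt_set=set(v.name for v in fetch_vars))
```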
Great, the loss values are displayed correctly after the fix, thanks.
During actual runs I noticed that memory usage keeps growing throughout training. Any idea what is going on?
Please follow up on this with high priority. Let's see whether we can reproduce it together first.
Is host memory or GPU memory growing? Which version of Paddle are you on?
Host memory is growing, on the latest version of Paddle.
update:
Hello, this issue has had no updates for nearly a month, so we will close it within the day. If you still need to follow up after it is closed, you can reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by the closure. Thank you for supporting PaddlePaddle!
After my program with multiple losses calls fluid.memory_optimize(fluid.default_main_program()) for memory optimization, the fetched loss values are displayed abnormally. In the same environment, the loss values also differ between runs with fluid.memory_optimize enabled and disabled.
Some of the key calls I make:

```python
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_loss.name)
fetch_list_var = []
results = train_exe.run(fetch_list_var, feed=feed_dict)
```
I suggest trying GoogLeNet training to see whether this can be reproduced.
@panyx0718
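A hypothetical, self-contained sketch of what such a two-loss reproduction could look like on random data (the network, sizes, and names below are illustrative, not the reporter's actual model):

```python
import numpy as np
import paddle
import paddle.fluid as fluid

def fake_reader():
    def reader():
        for _ in range(64):
            yield np.random.uniform(size=[8]).astype('float32'), np.random.randint(0, 4)
    return reader

# A toy network with two losses, mirroring the multi-loss setup in the report.
img = fluid.layers.data(name='img', shape=[8], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
fc = fluid.layers.fc(input=img, size=4, act='softmax')
loss_cls = fluid.layers.mean(fluid.layers.cross_entropy(input=fc, label=label))
loss_reg = fluid.layers.mean(fluid.layers.square(fc))
avg_loss = fluid.layers.elementwise_add(loss_cls, loss_reg)
fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_loss)

fetch_list = [loss_cls.name, loss_reg.name, avg_loss.name]
# Toggle this call on and off and compare the fetched values batch by batch.
fluid.memory_optimize(fluid.default_main_program(), skip_opt_set=set(fetch_list))

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_loss.name)

for batch_id, data in enumerate(paddle.batch(fake_reader(), batch_size=16)()):
    c, r, t = train_exe.run(fetch_list, feed=feeder.feed(data))
    print(batch_id, np.mean(c), np.mean(r), np.mean(t))
```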