After enabling memory optimization, fetched loss values are abnormal when training with multiple losses #11320
@dyning Can you give a screenshot of your code? The part that uses memory_optimizer and ParallelExecutor.run. @dzhwinter @reyoung @chengduoZH I suspect the memory_optimizer somehow renamed the variables, so ParallelExecutor fetched the wrong ones.
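A rough way to check that hypothesis is to diff the program's variable names before and after the pass; the sketch below assumes `Program.list_vars()` enumerates the variables and that `fluid.memory_optimize` rewrites the program in place:

```python
import paddle.fluid as fluid

# Assumes the network has already been built on the default main program.
prog = fluid.default_main_program()

names_before = set(v.name for v in prog.list_vars())
fluid.memory_optimize(prog)  # the pass mutates the program in place
names_after = set(v.name for v in prog.list_vars())

print("variables dropped or renamed:", sorted(names_before - names_after))
print("variables introduced:        ", sorted(names_after - names_before))
```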
Screenshot of the code is as follows:

```python
optimizer = fluid.optimizer.Momentum(
    learning_rate=fluid.layers.piecewise_decay(
        boundaries=self.propsalparam['bd'], values=self.propsalparam['lr']),
    momentum=0.9,
    regularization=fluid.regularizer.L2Decay(1e-4))
opts = optimizer.minimize(avg_loss)

# fluid.memory_optimize(fluid.default_main_program())
# place = fluid.CPUPlace()
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

pretrain_models_path = self.propsalparam['pretrain_models_path']
save_interval = self.propsalparam['save_interval']
if pretrain_models_path is not None:
    def if_exist(var):
        """if_exist"""
        return os.path.exists(os.path.join(pretrain_models_path, var.name))
    fluid.io.load_vars(exe, pretrain_models_path, predicate=if_exist)

train_reader = paddle.batch(
    logodet_reader.reader_creator_logodet(self.configfile, self.sectionname),
    batch_size=self.propsalparam['batch_size'])
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_loss.name)

fetch_list_var = [avg_rpn_loss_cls.name, avg_rpn_loss_loc.name,
                  avg_loss_cls.name, avg_loss_loc.name, avg_loss.name,
                  rpn_acc_top1.name, acc_top1.name, "learning_rate"]
fetch_list_name = ["rpn_loss_cls", "rpn_loss_loc", "loss_cls",
                   "loss_loc", "loss", "rpn_acc_top1", "acc_top1", "lr"]

with open("./output/train_log.txt", "wb") as fout_log:
    for pass_id in range(self.propsalparam['epoch_num']):
        begtime = time.time()
        for batch_id, blobs in enumerate(train_reader()):
            feed_dict = self.convert_blobs_to_feed_dict(blobs, place)
            results = train_exe.run(fetch_list_var, feed=feed_dict)
```
@dyning Could you try the patch in #11372 and confirm whether it works? First list the values you want to fetch, then pass them to the memory_optimize interface:

```python
fetch_list_var = [avg_rpn_loss_cls.name, avg_rpn_loss_loc.name,
                  avg_loss_cls.name, avg_loss_loc.name, avg_loss.name,
                  rpn_acc_top1.name, acc_top1.name, "learning_rate"]
fluid.memory_optimize(fluid.default_main_program(), fetch_list_var)
```
@dyning Has this issue been resolved?
Both @QiJune's PR and mine can fix this issue. With memory_optimize turned on and off, the results align exactly. The verification is as follows:

```python
train_reader = paddle.batch(reader.fake_reader(), batch_size=train_batch_size)
fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name,
              avg_cost0.name, avg_cost1.name, avg_cost2.name]
if with_memory_optimization:
    fluid.memory_optimize(fluid.default_main_program(), skip_opt_set=set(fetch_list))

for pass_id in range(params["num_epochs"]):
    train_info = [[], [], []]
    test_info = [[], [], []]
    for batch_id, data in enumerate(train_reader()):
        t1 = time.time()
        loss, acc1, acc5, cost0, cost1, cost2 = train_exe.run(fetch_list, feed=feeder.feed(data))
        t2 = time.time()
        period = t2 - t1
        loss = np.mean(np.array(loss))
        acc1 = np.mean(np.array(acc1))
        acc5 = np.mean(np.array(acc5))
        train_info[0].append(loss)
        train_info[1].append(acc1)
        train_info[2].append(acc5)
        if batch_id % 1 == 0:
            print("Pass {0}, trainbatch {1}, loss {2}, "
                  "acc1 {3}, acc5 {4}, {5}, {6}, {7}, time {8}"
                  .format(pass_id, batch_id, loss, acc1, acc5,
                          np.array(cost0), np.array(cost1), np.array(cost2),
                          "%2.2f sec" % period))
            sys.stdout.flush()
```

The reader part:

```python
shape = [3, 224, 224]
label = range(102)
np.random.seed(100)

def fake_reader():
    def reader():
        while True:
            yield np.random.uniform(size=shape), np.random.choice(label)
    return reader
```

With memory_optimize disabled:

```
Pass 0, trainbatch 0, loss 7.40484714508, acc1 0.0, acc5 0.0625, [4.618476], [4.6716123], [4.616293], time 0.11 sec
Pass 0, trainbatch 1, loss 7.40602397919, acc1 0.03125, acc5 0.0625, [4.629113], [4.6268578], [4.629512], time 0.09 sec
Pass 0, trainbatch 2, loss 7.42118453979, acc1 0.0, acc5 0.0, [4.647031], [4.606136], [4.641044], time 0.11 sec
Pass 0, trainbatch 3, loss 7.39293956757, acc1 0.03125, acc5 0.0625, [4.620225], [4.600328], [4.6420546], time 0.10 sec
Pass 0, trainbatch 4, loss 7.45125865936, acc1 0.0, acc5 0.0, [4.6322255], [4.7699347], [4.6268435], time 0.10 sec
Pass 0, trainbatch 5, loss 7.39793109894, acc1 0.0, acc5 0.0625, [4.6226783], [4.621888], [4.628953], time 0.09 sec
Pass 0, trainbatch 6, loss 7.39033412933, acc1 0.0, acc5 0.03125, [4.6119056], [4.6370993], [4.6243286], time 0.10 sec
Pass 0, trainbatch 7, loss 7.37090873718, acc1 0.03125, acc5 0.0625, [4.6033278], [4.618459], [4.606809], time 0.09 sec
Pass 0, trainbatch 8, loss 7.45732116699, acc1 0.0, acc5 0.03125, [4.650306], [4.720114], [4.636603], time 0.10 sec
Pass 0, trainbatch 9, loss 7.50588703156, acc1 0.03125, acc5 0.09375, [4.711724], [4.682214], [4.6316633], time 0.09 sec
Pass 0, trainbatch 10, loss 7.42192602158, acc1 0.0, acc5 0.0, [4.638735], [4.6542664], [4.623038], time 0.10 sec
```

With memory_optimize enabled:

```
Pass 0, trainbatch 0, loss 7.40484714508, acc1 0.0, acc5 0.0625, [4.618476], [4.6716123], [4.616293], time 0.11 sec
Pass 0, trainbatch 1, loss 7.40602397919, acc1 0.03125, acc5 0.0625, [4.629113], [4.6268578], [4.629512], time 0.09 sec
Pass 0, trainbatch 2, loss 7.42118453979, acc1 0.0, acc5 0.0, [4.647031], [4.6061325], [4.641046], time 0.09 sec
Pass 0, trainbatch 3, loss 7.39287471771, acc1 0.03125, acc5 0.0625, [4.6201744], [4.600312], [4.642022], time 0.08 sec
Pass 0, trainbatch 4, loss 7.45143222809, acc1 0.0, acc5 0.0, [4.6320643], [4.7702603], [4.627632], time 0.10 sec
Pass 0, trainbatch 5, loss 7.39721155167, acc1 0.0, acc5 0.0625, [4.622399], [4.6212826], [4.6280923], time 0.09 sec
Pass 0, trainbatch 6, loss 7.39040565491, acc1 0.0, acc5 0.03125, [4.611952], [4.636217], [4.625294], time 0.09 sec
Pass 0, trainbatch 7, loss 7.37325382233, acc1 0.03125, acc5 0.0625, [4.604336], [4.6227126], [4.6070127], time 0.09 sec
Pass 0, trainbatch 8, loss 7.4660615921, acc1 0.0, acc5 0.03125, [4.657354], [4.7219033], [4.6404552], time 0.10 sec
Pass 0, trainbatch 9, loss 7.50238180161, acc1 0.03125, acc5 0.09375, [4.711801], [4.669479], [4.6324573], time 0.10 sec
Pass 0, trainbatch 10, loss 7.42162179947, acc1 0.0, acc5 0.0, [4.640833], [4.6434155], [4.625881], time 0.10 sec
```
Pay particular attention to the order here: the variables to be fetched must either be set to persistable before calling memory_optimize, or added to the skip_set.
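As a sketch, the two workarounds look roughly like this (assumed usage; `avg_loss` and `acc_top1` stand in for whatever variables you actually fetch, and in practice you would pick one of the two options):

```python
import paddle.fluid as fluid

# Placeholder variables you intend to fetch later.
fetch_vars = [avg_loss, acc_top1]

# Option 1: mark them persistable BEFORE calling memory_optimize,
# so the optimization pass will not reuse their memory.
for v in fetch_vars:
    v.persistable = True
fluid.memory_optimize(fluid.default_main_program())

# Option 2: exclude them explicitly via skip_opt_set.
fluid.memory_optimize(fluid.default_main_program(),
                      skip_opt_set=set(v.name for v in fetch_vars))
```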
Great, the loss values are displayed correctly after the fix, thanks.
During actual runs I noticed that memory usage keeps growing throughout training. Any idea what is going on?
Please follow up on this with high priority. Let's see whether we can reproduce it together first.
Is host memory or GPU memory growing? Which version of Paddle are you on?
Host memory is growing, on the latest version of Paddle.
update:
Hello, this issue has had no updates for nearly a month, so we will close it within the day. If you still need to follow up after it is closed, you can reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by the closure. Thank you for supporting PaddlePaddle!
After my program with multiple losses calls fluid.memory_optimize(fluid.default_main_program()) for memory optimization, the fetched loss values are displayed abnormally. In the same environment, the loss values also differ between runs with fluid.memory_optimize enabled and disabled.
Some of the key calls I make:

```python
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_loss.name)
fetch_list_var = []
results = train_exe.run(fetch_list_var, feed=feed_dict)
```
I suggest trying GoogLeNet training to see whether this can be reproduced.
@panyx0718
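A hypothetical, self-contained sketch of what such a two-loss reproduction could look like on random data (the network, sizes, and names below are illustrative, not the reporter's actual model):

```python
import numpy as np
import paddle
import paddle.fluid as fluid

def fake_reader():
    def reader():
        for _ in range(64):
            yield np.random.uniform(size=[8]).astype('float32'), np.random.randint(0, 4)
    return reader

# A toy network with two losses, mirroring the multi-loss setup in the report.
img = fluid.layers.data(name='img', shape=[8], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
fc = fluid.layers.fc(input=img, size=4, act='softmax')
loss_cls = fluid.layers.mean(fluid.layers.cross_entropy(input=fc, label=label))
loss_reg = fluid.layers.mean(fluid.layers.square(fc))
avg_loss = fluid.layers.elementwise_add(loss_cls, loss_reg)
fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_loss)

fetch_list = [loss_cls.name, loss_reg.name, avg_loss.name]
# Toggle this call on and off and compare the fetched values batch by batch.
fluid.memory_optimize(fluid.default_main_program(), skip_opt_set=set(fetch_list))

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_loss.name)

for batch_id, data in enumerate(paddle.batch(fake_reader(), batch_size=16)()):
    c, r, t = train_exe.run(fetch_list, feed=feeder.feed(data))
    print(batch_id, np.mean(c), np.mean(r), np.mean(t))
```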