This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

program crash when run sparse model predict #8500

Closed
liumilan opened this issue Nov 1, 2017 · 36 comments
liumilan commented Nov 1, 2017

I just ran linear_classification.py and saved the model. Then I need to run prediction based on this model.
The predict code is as below:

    if __name__ == '__main__':
        args = parse_args()

        prefix = './dnn'

        devs = mx.cpu() if args.gpus is None else [
            mx.gpu(int(i)) for i in args.gpus.split(',')]

        model_load = mx.model.FeedForward.load(prefix, args.iteration, ctx=devs)

        test = mx.io.LibSVMIter(data_libsvm=args.test_dir, data_shape=(3864334,),
                                batch_size=args.batch_size, num_parts=1,
                                part_index=0,
                                num_epoch=1)

        [prob, data, label] = model_load.predict(test, return_data=True)
        data_len = len(prob)

        for i in range(0, data_len):
            print '%s\t%d\t%.6f' % ('1', label[i], prob[i])

The shell command is as below:
python predict.py --test-dir="./test41.xls" --iteration=1 --batch-size=2000

and then it reports:
include/mxnet/././tensor_blob.h:275: Check failed: this->shape_.Size() == shape.Size() (7728668000 vs. 3433700704) TBlob.get_with_shape: new and old shape do not match total elements

How can I fix it?
@eric-haibin-lin

eric-haibin-lin (Member) commented Nov 2, 2017

How large is your weight matrix? This is one of the mshadow-related legacy issues mentioned in #7319 (comment)
The problem is that a 32-bit int is not sufficient to hold the number of elements in a large matrix.

https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tensor_blob.h#L270

It's on our roadmap, but we haven't had time to fix it recently...
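
For illustration, a minimal plain-Python sketch of the overflow (the batch size and feature count are taken from the report above): the true element count of the prediction output exceeds 2^31, and wrapping it at 32 bits yields exactly the second number in the CHECK failure.

    # Sketch of the 32-bit overflow behind the shape-size check failure.
    batch_size = 2000
    num_features = 3864334

    true_size = batch_size * num_features      # 7728668000 elements in the output blob
    wrapped_size = true_size % (2 ** 32)       # what a 32-bit counter would hold

    print(true_size)       # 7728668000 -> first number in the CHECK failure
    print(wrapped_size)    # 3433700704 -> second number in the CHECK failure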

liumilan (Author) commented Nov 2, 2017

I have no idea how to calculate the size of the weight matrix. Is it batch_size * num_features?
The batch_size is 2000 and num_features is 3864334.
Also, in the error log, how is 3433700704 calculated?
In fact, during training the batch_size and num_features are set the same as for prediction, and training runs fine.
@eric-haibin-lin

eric-haibin-lin (Member) commented:

If you reduce batch_size to 1000 or 100, is the error still there?

I suspect there's an integer overflow problem.

liumilan (Author) commented Nov 3, 2017

Yes, there is no error when the batch size is reduced to 200. Maybe there is a bug in the code.
But why is it OK during training? Is the result correct if the batch size is set to 2000 during training?
@eric-haibin-lin

eric-haibin-lin (Member) commented:

The code only breaks when it calls this function: https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tensor_blob.h#L270
It's just doing a sanity check to make sure the shape matches, but unfortunately 32 bits are not enough to perform the check, hence the failure.
It won't affect the quality of the model. You can either use a small batch size for now as a workaround, or remove that CHECK and build from source to predict with a large batch size.
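
To make the small-batch workaround concrete, here is a minimal sketch based on the predict snippet earlier in the thread; the paths and the batch size of 200 are illustrative (200 * 3864334 stays well below 2^31):

    # Illustrative workaround: same model, smaller prediction batch.
    import mxnet as mx

    safe_batch_size = 200   # small enough that batch_size * num_features fits in 32 bits

    model_load = mx.model.FeedForward.load('./dnn', 1, ctx=mx.cpu())
    test = mx.io.LibSVMIter(data_libsvm='./test41.xls', data_shape=(3864334,),
                            batch_size=safe_batch_size, num_parts=1, part_index=0)
    prob = model_load.predict(test)   # predictions for the whole file, in small batches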

liumilan (Author) commented Nov 3, 2017

OK.
One more question:
the training data size is 98994453 and the batch size is 200,
and it takes 2 hours to finish 1 epoch, which seems too slow.
Is it possible to speed up sparse training?
Other sparse LR training implementations should be much faster.
@eric-haibin-lin

eric-haibin-lin (Member) commented:

batch size = 200 seems to be very small. You can increase the batch size to reduce the training time per epoch (you have to increase the learning rate accordingly).

Module.load just loads the model symbol and model parameters. You can train with a large batch and predict with a small batch.
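
One common heuristic (an assumption here, not something prescribed in this thread) is to scale the learning rate roughly linearly with the batch size. A minimal sketch, with illustrative base values:

    # Illustrative linear learning-rate scaling when the batch size grows.
    import mxnet as mx

    base_batch_size, base_lr = 200, 0.01          # assumed starting point
    new_batch_size = 20000
    new_lr = base_lr * new_batch_size / float(base_batch_size)   # 1.0

    optim = mx.optimizer.create('sgd', learning_rate=new_lr,
                                rescale_grad=1.0 / new_batch_size)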

liumilan (Author) commented Nov 4, 2017

Is the learning rate related to the batch size? In the code, the gradient is already divided by the batch size.
With my data, if the batch size is increased to 20000 or 200000, the model result becomes bad.
@eric-haibin-lin

liumilan (Author) commented Nov 4, 2017

In fact, even when the batch size is increased to 200000, it still takes a long time to finish one epoch:
Time cost=2174.153
@eric-haibin-lin

eric-haibin-lin (Member) commented:

Usually you want to increase the learning rate if the batch size is increased.

What's the density of your model and how many classes are there?

If you have a GPU, you can change the ctx to mx.gpu(0) and run the same code without kvstore. The sparse.dot is implemented on GPU.
If you have multiple machines, you can do distributed training.
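
A minimal sketch of the single-GPU suggestion, assuming the symbol and iterators from the training code posted later in this thread (the variable names are placeholders):

    # Illustrative: train the same symbol on one GPU, without a kvstore.
    import mxnet as mx

    devs = mx.gpu(0)   # instead of mx.cpu()
    mod = mx.mod.Module(symbol=model, data_names=['data'],
                        label_names=['softmax_label'], context=devs)
    mod.fit(train_data, eval_data=eval_data,
            num_epoch=num_epoch, kvstore=None)   # no kvstore on a single device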

liumilan (Author) commented:

It costs 1868.217 s per epoch on one GPU, which seems very slow.
There are only 2 classes, 0 and 1. The batch size is set to 80000, the number of features is 3864334, and the training data size is 100 million.
The same training data runs much faster with my legacy code on one machine.
@eric-haibin-lin

eric-haibin-lin self-assigned this Nov 11, 2017
eric-haibin-lin (Member) commented:

What was your legacy code based on? Is it an MXNet dense NDArray implementation?
For CPU training, did you try setting OMP_NUM_THREADS to different values to see if that affects training time? What version of MXNet are you using? Is it built from source?
If you run your program in a terminal, do you see any warning messages?
What optimizer are you using?
I wonder if you can post your training code so we can see if there's any obvious place for improvement.

liumilan (Author) commented:

The legacy code is an MPI implementation of LR used in my product, not MXNet.
The training code is modified slightly from your sparse examples, changing from the weighted loss to LogisticRegressionOutput.
The MXNet version is 0.12.0, and it is built from source.
The log is as below.
The AUC metric was added by myself; it won't affect the speed, will it?

/search/ted/mxnet_new/example/../../python/mxnet/metric.py:1191: UserWarning: New metric mxnet.metric.AUC registered with name auc is overriding existing metric mxnet.metric.AUC
  class AUC(EvalMetric):
2017-11-08 16:47:05,419 Namespace(batch_size=80000, gpus=0, kvstore=None, num_epoch=100, optimizer='sgd')
2017-11-08 16:47:09,010 Already bound, ignoring bind()
/search/ted/mxnet_new/example/../../python/mxnet/module/base_module.py:464: UserWarning: Parameters already initialized and force_init=False. init_params call ignored.
  allow_missing=allow_missing, force_init=force_init)
2017-11-08 16:47:09,012 optimizer already initialized, ignoring...
/search/ted/mxnet_new/example/../../python/mxnet/metric.py:1191: UserWarning: New metric mxnet.metric.AUC registered with name auc is overriding existing metric mxnet.metric.AUC
  class AUC(EvalMetric):
2017-11-08 17:12:12,118 Namespace(batch_size=80000, gpus=0, kvstore=None, num_epoch=100, optimizer='sgd')
2017-11-08 17:12:15,516 Already bound, ignoring bind()
/search/ted/mxnet_new/example/../../python/mxnet/module/base_module.py:464: UserWarning: Parameters already initialized and force_init=False. init_params call ignored.
  allow_missing=allow_missing, force_init=force_init)

and line_model.py is as below:

    def linear_model(num_features, positive_cls_weight):
        # data with csr storage type to enable feeding data with CSRNDArray
        x = mx.symbol.Variable("data", stype='csr')
        norm_init = mx.initializer.Normal(sigma=0.01)
        # weight with row_sparse storage type to enable sparse gradient updates
        #weight = mx.symbol.Variable("weight", shape=(num_features, 1),
        #                            init=norm_init, stype='row_sparse')
        weight = mx.symbol.Variable("weight", shape=(num_features, 1), stype='row_sparse')
        bias = mx.symbol.Variable("bias", shape=(1,))
        dot = mx.symbol.sparse.dot(x, weight)
        pred = mx.symbol.broadcast_add(dot, bias)
        y = mx.symbol.Variable("softmax_label")
        #model = mx.sym.Custom(pred, y, op_type='weighted_softmax_ce_loss',
        #                      positive_cls_weight=positive_cls_weight, name='out')
        #return mx.sym.MakeLoss(model)
        net = mx.symbol.LogisticRegressionOutput(data=dot, name="softmax")
        return net

@eric-haibin-lin

liumilan (Author) commented:

I just set 2 environment variables for MXNet:
export MXNET_GPU_COPY_NTHREADS=4
export MXNET_CPU_WORKER_NTHREADS=20
@eric-haibin-lin

liumilan (Author) commented Nov 11, 2017

linear_classification.py is also changed to call model.fit directly:

    # model

    # The positive class weight, says how much more we should upweight the importance of
    # positive instances in the objective function.
    # This is used to combat the extreme class imbalance.
    positive_class_weight = 2
    model = linear_model(num_features, positive_class_weight)

    # module
    mod = mx.mod.Module(symbol=model, data_names=['data'], label_names=['softmax_label'], context=devs)
    mod.bind(data_shapes=train_data.provide_data, label_shapes=train_data.provide_label)
    mod.init_params()
    optim = mx.optimizer.create(optimizer, learning_rate=0.7, momentum=0.8,
                                rescale_grad=1.0/batch_size/num_worker)
    mod.init_optimizer(optimizer=optim, kvstore=kv)
    # use accuracy as the metric
    metric = mx.metric.create(['ce', 'auc'])

    # get the sparse weight parameter
    weight_index = mod._exec_group.param_names.index('weight')
    weight_param = mod._exec_group.param_arrays[weight_index]
    speedometer = mx.callback.Speedometer(batch_size, 100)

    mod.fit(train_data=train_data,
            eval_data=eval_data,
            eval_metric=metric,
            optimizer=optim,
            kvstore=kvstore,
            num_epoch=args.num_epoch,
            #batch_end_callback=mx.callback.Speedometer(batch_size, 10000),
            batch_end_callback=None,
            epoch_end_callback=save_model())

eric-haibin-lin (Member) commented:

  • io: Are you using LibSVMIter for IO?
  • metrics: it might affect speed depending on your implementation. Are you using numpy to implement it? (As a side note, contributing it back to MXNet is welcome!)
  • What commit id did you build MXNet from?
  • Do you mind building it with USE_PROFILER=1 to enable the profiler, and setting the env var MXNET_PROFILER_AUTOSTART to get the profiling result for the first 20 batches of your training? That would be very helpful to diagnose any performance bottleneck.
  • BTW, we observed some room for performance improvement in the backward of sparse.dot, and there's a PR for it: optimization for dot(csr.T, dense) = rsp #8611, which will potentially be helpful in your case.

eric-haibin-lin (Member) commented:

The CPU _backward_dot operator was improved by at least 3x in #8611.
Do you want to sync with master and run it again?

ZiyueHuang (Member) commented:

The current LR example is broken due to the custom op. This is fixed in #8721.

liumilan (Author) commented Nov 20, 2017

1. After updating to the new code, the training speed is still slow.
The batch size is still 80000 and the training data size is 100 million, the same as the last test.
It costs 1977.781 s per epoch on GPU and 2038.416 s on CPU.
2. I don't think the metric affects the speed, because I skipped the metric during training.
3. How can I collect the profiler log? I have built it with profile=1 and set export MXNET_PROFILER_AUTOSTART=0, but no profiler log is printed in the nohup logs.
Is there any other issue that I missed?
@eric-haibin-lin

eric-haibin-lin (Member) commented:

You can either inject this code into the training loop:

    mx.profiler.profiler_set_config(mode='all', filename='profile_output.json')
    mx.profiler.profiler_set_state('run')

    # Code to be profiled goes here...

    mx.profiler.profiler_set_state('stop')

or set MXNET_PROFILER_AUTOSTART=1 to enable auto start for profiling.

You can post the xxx_profile.json file here for us to take a look at which part is the bottleneck.
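
For example, a hedged sketch of wrapping the Module training call from earlier in this thread with those profiler calls (the variable names and the one-epoch run are illustrative):

    # Illustrative: profile a short training run and dump the trace to JSON.
    import mxnet as mx

    mx.profiler.profiler_set_config(mode='all', filename='profile_output.json')
    mx.profiler.profiler_set_state('run')

    mod.fit(train_data, eval_data=eval_data, eval_metric=metric,
            optimizer=optim, kvstore=kvstore, num_epoch=1)   # a short run is enough

    mx.profiler.profiler_set_state('stop')
    # profile_output.json can then be posted here or opened in chrome://tracing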

eric-haibin-lin (Member) commented:

I think the performance of the _backward_dot operator needs to be improved further. I'll keep you posted.

liumilan (Author) commented:

I only touched profile_output.json first, and when I run python linear_classification.py the logs are written to profile_output.json, but no more logs appear once the program starts training.

liumilan (Author) commented:

profile_output.txt

eric-haibin-lin (Member) commented:

I see a lot of gaps between batches. It could be thread scheduling or IO overhead.
I don't think export MXNET_CPU_WORKER_NTHREADS=20 will help. This is useful when you have parallel paths in the graph (for example, two operators that can be executed at the same time). Maybe setting this to 1 or 2 is good enough; setting it to too high a value will lead to some overhead.
MXNET_CPU_WORKER_NTHREADS controls how many operators are executed in parallel, while OMP_NUM_THREADS controls how many threads run in parallel within any particular operator.
Could you also run with MXNET_CPU_WORKER_NTHREADS=1 and see if there's any IO bottleneck?
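
A minimal sketch of that setup (the OMP_NUM_THREADS value is only an example, not a recommendation from this comment); setting the variables before MXNet is imported ensures they take effect for the whole run:

    # Illustrative: pin the engine and OpenMP thread counts before importing mxnet.
    import os
    os.environ['MXNET_CPU_WORKER_NTHREADS'] = '1'   # one operator at a time
    os.environ['OMP_NUM_THREADS'] = '8'             # example value; tune per machine

    import mxnet as mx   # import after the env vars are set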

liumilan (Author) commented Nov 21, 2017

profile_output.txt
This is the updated profile, taken after 1 epoch finished. I will try MXNET_CPU_WORKER_NTHREADS=1 right now.

liumilan (Author) commented:

profile_output.txt
Here is the profile with MXNET_CPU_WORKER_NTHREADS=1; it costs the same time as MXNET_CPU_WORKER_NTHREADS=20.

eric-haibin-lin (Member) commented:

There's a 2-second gap between batches. What metric are you using? Did you verify how much time computing the metric takes? What data iterator are you using?

liumilan (Author) commented Nov 24, 2017

I use the LibSVM data iterator. I set batch_end_callback=None and re-ran the new code. It costs 1303.680 s per epoch, while the older code cost 2038.416 s, so it is much faster.
But it is still slower than my MPI LR code: on the same data size, my code only costs 1.6 hours to run 100 epochs.
Do you think MXNet can be made much faster?
profile.txt
@eric-haibin-lin

eric-haibin-lin (Member) commented:

For this model you don't need to change the default value of MXNET_CPU_WORKER_NTHREADS. This env var enables running multiple operators in parallel when you have a computation graph with divergent paths. I recommend setting it back to 1, which will make the profiler result much easier to read.

The time spent on forward and backward is only a small fraction of the total time. I don't see the profiler block for the updater. What optimizer are you using? Is it a custom optimizer?

eric-haibin-lin (Member) commented Dec 1, 2017

I think IO (LibSVMIter) is probably the bottleneck.
Edit: I'm able to read 3GB of Criteo libsvm test data within 4 seconds using LibSVMIter. If you simply skip forward/backward/update/update_metric, how much time is spent on IO? You can also use a DummyIter to see whether IO is the bottleneck and what speed you get without it.
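
As an illustration of the DummyIter idea (this is a hypothetical helper written for this note, not necessarily an iterator shipped with MXNet): cache one real batch and replay it, so any remaining per-batch time is compute rather than IO.

    # Hypothetical DummyIter: replays a single cached batch to take IO out of the timing.
    import mxnet as mx

    class DummyIter(mx.io.DataIter):
        def __init__(self, real_iter, num_batches):
            super(DummyIter, self).__init__()
            self.the_batch = real_iter.next()            # read one real batch up front
            self.provide_data = real_iter.provide_data
            self.provide_label = real_iter.provide_label
            self.num_batches = num_batches
            self.cur = 0

        def next(self):
            if self.cur == self.num_batches:
                raise StopIteration
            self.cur += 1
            return self.the_batch                        # same batch every time, no file reads

        def reset(self):
            self.cur = 0

Training on DummyIter(train_data, num_batches) and comparing the epoch time with a normal run would show how much LibSVMIter contributes.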

liumilan (Author) commented Dec 6, 2017

profile (2).txt
This is the latest profile, and I have set the env variable to 1.
The SGD optimizer is used, not a custom optimizer.
It is still slow, though it is faster than the last version.

liumilan (Author) commented Dec 6, 2017

If I skip forward/backward/update/update_metric, the time cost is reduced from 1303.680 s to 276.323 s. Getting the batch data really does cost a lot of time.
It may be an issue for every IO iterator, not only LibSVMIter?
@eric-haibin-lin
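
For reference, a minimal sketch of timing the iterator alone (the file path is a placeholder; the shapes follow the numbers reported in this thread):

    # Illustrative: time one pass over the LibSVM iterator with no compute at all.
    import time
    import mxnet as mx

    train_data = mx.io.LibSVMIter(data_libsvm='./train.libsvm', data_shape=(3864334,),
                                  batch_size=80000)

    start = time.time()
    num_batches = 0
    for batch in train_data:
        num_batches += 1           # just pull batches; no forward/backward/update
    print('%d batches read in %.3f s' % (num_batches, time.time() - start))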

eric-haibin-lin (Member) commented:

@liumilan I don't know why your profiler doesn't show the SGD optimizer at all. That's very strange. How did you compile MXNet? Did you set DEV=0, DEBUG=0, USE_PROFILER=1? Did you do a clean build when the build config was changed?

lanking520 (Member) commented:

@liumilan Just following up on this thread: are you still facing the issue?

ankkhedia (Contributor) commented:

@sandeep-krishnamurthy @nswamy @anirudh2290 Could you please close this issue due to lack of activity?

@liumilan Please feel free to reopen if it was closed in error.

zachgk (Contributor) commented Nov 15, 2018

@mxnet-label-bot add [suggest-closed]
