Bug: SomeTimes Coredumped using tfjob #1456

whybeyoung · 2021-11-01T11:09:20Z

Hello, I am using TFJob to train a Keras model.

Most of the time it works fine, but sometimes it crashes after training and saving the model.

Part of our training code is here:

    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="/workspace/model/fit_logs/", histogram_freq=1)

    model.fit(
        dataset,
        epochs=epochs,
        verbose=2,
        # validation_data=test_dataset,
        # callbacks=[tensorboard_callback],
        # callbacks=[EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)],
    )

    # checkpoint_path = '/'.join((args.checkpoint_path, save_day, save_hour))
    # if TASK_INDEX == 0:
    #    checkpoint_path = checkpoint_path
    # else:
        # Save to a path that is unique across workers.
    #    checkpoint_path = checkpoint_path + '/worker_tmp_' + str(TASK_INDEX)

    inputs = tf.keras.layers.Input(shape=(input_length,), dtype=tf.int64, name='input')
    outs = model(inputs)
    mymodel = tf.keras.Model(inputs, outs)
    mymodel.save(checkpoint_path)

    if TASK_INDEX == 0:
        # tf2onnx
        try:
            onnx_args = OnnxArgs()
            onnx_args.saved_model = checkpoint_path
            onnx_args.output = checkpoint_path + '/deepfm.onnx'
            onnx_args.tag = 'serve'


            parse2onnx(onnx_args)
            logging.info("Success")
            logging.info(onnx_args.saved_model)
        except Exception as e:
            logging.error("Failed convert")
            logging.error(str(e))
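
The commented-out lines in the code above hint at giving each worker its own save path. A minimal sketch of that idea follows, assuming `TASK_INDEX` is the worker index and that index 0 is the chief. `worker_save_path` is a hypothetical helper name, not part of the original code, and whether the save path has anything to do with the crash is not established in this thread.

```python
def worker_save_path(base_path: str, task_index: int) -> str:
    """Return base_path for the chief (index 0), else a worker-specific path.

    Hypothetical helper: mirrors the commented-out logic in the snippet above.
    """
    if task_index == 0:
        return base_path
    # Non-chief workers save to a path that is unique across workers.
    return base_path + "/worker_tmp_" + str(task_index)
```

With this, `mymodel.save(worker_save_path(checkpoint_path, TASK_INDEX))` would keep workers from writing to the same directory at the same time.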

And the log is:

I got the core dump file.

It shows this:

The operator log is:

I don't know the reason, and I don't know whether it is a TensorFlow bug...

Any help will be appreciated. Thanks a lot.

gaocegege · 2021-11-01T11:22:11Z

Hi, can you please show us the version of your keras/tensorflow?

gaocegege · 2021-11-01T11:25:27Z

https://stackoverflow.com/questions/7381757/c-terminate-called-without-an-active-exception

Maybe there is a thread that did not join or detach.
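
One way to gather more evidence for the thread theory is sketched below, using only the Python standard library. `faulthandler` prints Python tracebacks for all threads when the process receives a fatal signal such as SIGSEGV or SIGABRT, and `threading.enumerate()` lists the Python threads that are still alive. The function names are hypothetical, and this only sees Python-level threads: native threads created inside TensorFlow's C++ runtime will not appear in the list.

```python
import faulthandler
import logging
import threading


def enable_crash_tracebacks():
    """Dump Python tracebacks of all threads to stderr on a fatal signal."""
    faulthandler.enable(all_threads=True)


def live_thread_names():
    """Return the names of alive Python threads other than the main thread."""
    main = threading.main_thread()
    return [t.name for t in threading.enumerate() if t is not main]


def log_live_threads():
    """Log which Python threads are still running, e.g. right before exit."""
    names = live_thread_names()
    logging.info("Live non-main Python threads: %s", names)
    return names
```

Calling `enable_crash_tracebacks()` at the top of the training script and `log_live_threads()` after `mymodel.save(...)` would show what is still running when the process aborts.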

whybeyoung · 2021-11-01T11:31:04Z

> Hi, can you please show us the version of your keras/tensorflow?

The first picture shows the version.

It is 2.6, and the problem also occurs in 2.7dev.

My model actually trained successfully. The exception is not thrown by my training code.

whybeyoung · 2021-11-01T11:53:44Z

I also submitted the same issue: tensorflow-52894

stale · 2022-03-02T09:12:40Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

gaocegege added the kind/question label Nov 1, 2021

gaocegege added the area/upstream label Nov 15, 2021

stale bot added the lifecycle/stale label Mar 2, 2022

stale bot closed this as completed Apr 17, 2022
