early stopping invalid #535

Closed
zhang-ge-hao opened this issue Mar 31, 2023 · 4 comments
Labels: bug (Something isn't working)

Comments

@zhang-ge-hao (Contributor) commented Mar 31, 2023

model:    megatron-345m
image:    nvcr.io/nvidia/pytorch:21.11-py3
command:  python /workspace/FasterTransformer/examples/pytorch/gpt/multi_gpu_gpt_example.py --output_len 512 --max_batch_size 1 --end_id 13 --time

I set end_id to 13, the id that corresponds to the English full-stop punctuation mark ("."), expecting it to trigger the early-stopping mechanism.
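
(A quick sanity check, not part of the original report: the id-to-token mapping can be read straight from the gpt2-vocab.json file shown as vocab_file in the arguments dump below. A minimal Python sketch, assuming that path is valid:)

# Sanity-check sketch: confirm that token id 13 maps to "." in the
# GPT-2 BPE vocabulary file used by the example (path from the arguments dump).
import json

with open("../models/gpt2-vocab.json") as f:
    vocab = json.load(f)                       # token string -> token id

id_to_token = {tid: tok for tok, tid in vocab.items()}
print(repr(id_to_token[13]))                   # should print '.'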

Result: it takes 1543.90 ms, which suggests that early stopping is not happening:

Loading layer_num from config.ini,    previous: 24,    current: 24
Loading max_seq_len from config.ini,    previous: 1024,    current: 1024
Loading weights_data_type from config.ini,    previous: fp32,    current: fp32
Loading head_num from config.ini,    previous: 16,    current: 16
Loading size_per_head from config.ini,    previous: 64,    current: 64
Loading tensor_para_size from config.ini,    previous: 1,    current: 1

=================== Arguments ===================
layer_num.....................: 24
input_len.....................: 1
output_len....................: 512
head_num......................: 16
size_per_head.................: 64
vocab_size....................: 50304
beam_width....................: 1
top_k.........................: 1
top_p.........................: 0.0
temperature...................: 1.0
len_penalty...................: 0.0
beam_search_diversity_rate....: 0.0
tensor_para_size..............: 1
pipeline_para_size............: 1
ckpt_path.....................: ../models/megatron-models/c-model/345m/1-gpu
lib_path......................: ./lib/libth_transformer.so
vocab_file....................: ../models/gpt2-vocab.json
merges_file...................: ../models/gpt2-merges.txt
start_id......................: 50256
end_id........................: 13
max_batch_size................: 1
repetition_penalty............: 1.0
presence_penalty..............: 0.0
min_length....................: 0
max_seq_len...................: 1024
inference_data_type...........: fp32
time..........................: True
sample_input_file.............: None
sample_output_file............: None
enable_random_seed............: False
skip_end_tokens...............: False
detokenize....................: True
use_jieba_tokenizer...........: False
int8_mode.....................: 0
weights_data_type.............: fp32
return_cum_log_probs..........: 0
shared_contexts_ratio.........: 1.0
banned_words..................: 
use_gpt_decoder_ops...........: False
=================================================

[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[FT][INFO] Device NVIDIA TITAN RTX
[INFO] batch 0, beam 0:
[Context]
<|endoftext|>

[Output]


The first of the two-day conference, which will be held at the University of California, Berkeley, will be held on Thursday, March 15, from 9 a.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

[INFO] GPT time costs: 1543.90 ms
zhang-ge-hao added the bug label Mar 31, 2023
@zhang-ge-hao (Contributor, Author) commented Mar 31, 2023

I found that some early-stopping code was removed after v5.3.

The code that was removed from ParallelGpt.cc:

if (*generation_should_stop_) {
    break;
}
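
For context, here is a minimal sketch of the pattern that check implements, written in Python with hypothetical names such as step_fn (not FT's actual code): the per-step generation loop exits as soon as every sequence in the batch has produced end_id, instead of always running the full output_len steps.

# Conceptual sketch only (hypothetical helper names, not FT code).
def generate(step_fn, output_len, end_id, batch_size):
    finished = [False] * batch_size
    outputs = [[] for _ in range(batch_size)]
    for _ in range(output_len):
        tokens = step_fn()                     # one decoding step -> one token per sequence
        for i, tok in enumerate(tokens):
            if not finished[i]:
                outputs[i].append(tok)
                finished[i] = (tok == end_id)
        if all(finished):                      # analogue of *generation_should_stop_
            break                              # early stop: skip the remaining steps
    return outputs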

@byshiue (Collaborator) commented Mar 31, 2023

This is a known issue; please refer to issue #487.

@zhang-ge-hao (Contributor, Author) commented Mar 31, 2023

@byshiue

Well, I fixed this issue in my scenario. You can take a look at my PR #536 and check whether the fix is fully correct for FT.

I have also updated the description at the top to make it clearer.

The case originally takes 1543.90 ms, but only 114.79 ms once early stopping is activated.

@zhang-ge-hao (Contributor, Author) commented Mar 31, 2023

@byshiue

Oh, I see that you have already fixed this problem in the same way, but that fix causes another problem.

zhang-ge-hao closed this as not planned Mar 31, 2023