Llama3 Branch Still Suffers Segmentation Fault When Generating Datastore Using Qwen2.5 #26

Closed
csAugust opened this issue Nov 30, 2024 · 4 comments

@csAugust

I'm trying to build a datastore for Qwen2.5 series models using the DraftRetriever, but I encountered a Segmentation fault (core dumped) error when calling writer.finalize() at get_datastore_chat.py Line 54. The dataset I used is ShareGPT_Vicuna_unfiltered, the same as the default option.

I'm using the "llama3" branch (since it fixes the vocabulary size limit) with Python 3.9 and the prebuilt wheel. I'm not familiar with Rust, so I would sincerely appreciate it if someone could help me out.

To reproduce:
Just modify get_datastore_chat.py Line 13 and Line 45 as below and run it with no arguments.

Line 13

parser.add_argument(
    "--model-path",
    type=str,
    default="Qwen/Qwen2.5-0.5B-Instruct",
    # default="lmsys/vicuna-7b-v1.5",
    help="The path to the weights. This can be a local folder or a Hugging Face repo ID.",
)

Line 45

    dataset_path = 'path/to/datasets/ShareGPT_Vicuna_unfiltered/ShareGPT_2023.05.04v0_Wasteland_Edition.json'

Output:

number of samples:  52052
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 52052/52052 [05:09<00:00, 167.97it/s]
Segmentation fault (core dumped)
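
In case it helps with debugging, Python's standard faulthandler module can print the interpreter's traceback when the native crash happens (a small sketch, added near the top of get_datastore_chat.py or enabled with python -X faulthandler), which at least confirms the crash is inside writer.finalize():

import faulthandler

# Dump the Python traceback if the process receives SIGSEGV, so the
# crash can be tied to the writer.finalize() call.
faulthandler.enable()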

Thanks a lot.

@csAugust csAugust changed the title Segmentation Fault When Generating Datastore Using Qwen2.5 Llama3 Branch Still Suffers Segmentation Fault When Generating Datastore Using Qwen2.5 Nov 30, 2024
@csAugust
Author

After switching to the main branch and trying the solution in #19, the issue was fixed. So there may be a problem specific to the llama3 branch (even though I made the same modification there, the llama3 branch code still fails).

@csAugust
Author

I also found that even though there is no core dump or any other logged error, the Reader still hits OSError: failed to fill whole buffer when loading a datastore generated with the main branch modified as follows:

To fix the issue, you may change this line in the Writer from self.index_file.write_u16::<LittleEndian>(item as u16)?; to self.index_file.write_u32::<LittleEndian>(item as u32)?;
Besides, change this line in the Reader from let int = LittleEndian::read_u16(&data_u8[i..i+2]) as i32; to let int = LittleEndian::read_u32(&data_u8[i..i+4]) as i32;

I solved it by modifying Line 114 from
self.index_file.write_u32::<LittleEndian>((self.buffer.len() * 2) as u32)?;
to
self.index_file.write_u32::<LittleEndian>((self.buffer.len() * 4) as u32)?;
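
For anyone else hitting this, here is a minimal Python sketch (illustrative only, not the repository's Rust code) of why the writer and the index entry have to agree on the number of bytes per token once the storage width grows from u16 to u32:

import struct

# Illustrative token IDs: Qwen2.5 assigns IDs well above 65535,
# while the Vicuna/Llama-2 vocabulary stays below that limit.
token_ids = [151643, 77, 42]

# A little-endian u16 can only hold 0..65535, so large IDs get
# truncated (masked here to mimic a wrapping cast).
u16_bytes = b"".join(struct.pack("<H", t & 0xFFFF) for t in token_ids)

# A little-endian u32 preserves every ID.
u32_bytes = b"".join(struct.pack("<I", t) for t in token_ids)

# The index must record the real byte length of each entry:
# 2 bytes per token for u16, 4 bytes per token for u32. If the data
# is written as u32 but the index still claims len(buffer) * 2 bytes,
# the recorded lengths no longer match the bytes on disk, and the
# Reader can end up asking for data that isn't there ("failed to
# fill whole buffer").
assert len(u16_bytes) == len(token_ids) * 2
assert len(u32_bytes) == len(token_ids) * 4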

I'm not sure whether the other modifications in #23 are necessary (sorry, I don't understand Rust :().

@scandukuri

scandukuri commented Dec 2, 2024

Hi @csAugust - thanks so much for your work. Would you be willing to try the llama3 branch again with the newest changes from @zhenyuhe00? The setup as-is in the llama3 branch was working fine for me, at least for Llama models. I'll also investigate when I get time, but I don't have the bandwidth at the moment.

Is it possible that the Qwen tokenizer has additional added tokens? There seems to have been a second vocab size issue that could impact datastore writes, which the most recent commit may have fixed.
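
If it helps to check, a quick sketch (assuming the Hugging Face transformers tokenizer API and the model ID from the issue) compares the base vocabulary with the total size including added tokens:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Base vocabulary vs. total size including added/special tokens.
print("vocab_size:     ", tok.vocab_size)
print("len(tokenizer): ", len(tok))

# Any token ID above 65535 cannot be stored as a u16 without truncation.
print("IDs exceed u16 range:", len(tok) - 1 > 0xFFFF)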

@csAugust
Author

csAugust commented Dec 3, 2024

Thank you for your help! I retried the newest llama3 branch and it works fine. It seems that the additional tokens added by the Qwen tokenizer caused these problems.

@csAugust csAugust closed this as completed Dec 3, 2024