Llama3 Branch Still Suffers Segmentation Fault When Generating Datastore Using Qwen2.5 #26

Closed
csAugust opened this issue Nov 30, 2024 · 4 comments

@csAugust

I'm trying to build a datastore for Qwen2.5 series models using the DraftRetriever, but I encountered a Segmentation fault (core dumped) error when calling writer.finalize() at get_datastore_chat.py Line 54. The dataset I used is ShareGPT_Vicuna_unfiltered, the same as the default option.

I'm using the "llama3" branch (since it fixes the vocabulary size limit) with Python 3.9 and the prebuilt wheel. I'm not familiar with Rust, so I would sincerely appreciate it if someone could help me out.

To reproduce:
Just modify get_datastore_chat.py Line 13 and Line 45 as below and run it with no arguments.

Line 13

parser.add_argument(
    "--model-path",
    type=str,
    default="Qwen/Qwen2.5-0.5B-Instruct",
    # default="lmsys/vicuna-7b-v1.5",
    help="The path to the weights. This can be a local folder or a Hugging Face repo ID.",
)

Line 45

    dataset_path = 'path/to/datasets/ShareGPT_Vicuna_unfiltered/ShareGPT_2023.05.04v0_Wasteland_Edition.json'

Output:

number of samples:  52052
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 52052/52052 [05:09<00:00, 167.97it/s]
Segmentation fault (core dumped)
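
In case it helps with debugging, Python's standard faulthandler module can print the interpreter's traceback when the native crash happens (a small sketch, added near the top of get_datastore_chat.py or enabled with python -X faulthandler), which at least confirms the crash is inside writer.finalize():

import faulthandler

# Dump the Python traceback if the process receives SIGSEGV, so the
# crash can be tied to the writer.finalize() call.
faulthandler.enable()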

Thanks a lot.

@csAugust csAugust changed the title Segmentation Fault When Generating Datastore Using Qwen2.5 Llama3 Branch Still Suffers Segmentation Fault When Generating Datastore Using Qwen2.5 Nov 30, 2024
@csAugust
Author

After switching to the main branch and trying the solution in #19, the issue was fixed. So there may be a problem specific to the llama3 branch (even though I made the same modification there, the llama3 branch code still fails).

@csAugust
Author

I also found that even though there is no core dump or any other logged error, the Reader still hits OSError: failed to fill whole buffer when loading a datastore generated with the main branch modified as follows:

To fix the issue, you may change this line in the Writer from self.index_file.write_u16::<LittleEndian>(item as u16)?; to self.index_file.write_u32::<LittleEndian>(item as u32)?;
Besides, change this line in the Reader from let int = LittleEndian::read_u16(&data_u8[i..i+2]) as i32; to let int = LittleEndian::read_u32(&data_u8[i..i+4]) as i32;

I solved it by modifying Line 114 from
self.index_file.write_u32::<LittleEndian>((self.buffer.len() * 2) as u32)?;
to
self.index_file.write_u32::<LittleEndian>((self.buffer.len() * 4) as u32)?;
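
For anyone else hitting this, here is a minimal Python sketch (illustrative only, not the repository's Rust code) of why the writer and the index entry have to agree on the number of bytes per token once the storage width grows from u16 to u32:

import struct

# Illustrative token IDs: Qwen2.5 assigns IDs well above 65535,
# while the Vicuna/Llama-2 vocabulary stays below that limit.
token_ids = [151643, 77, 42]

# A little-endian u16 can only hold 0..65535, so large IDs get
# truncated (masked here to mimic a wrapping cast).
u16_bytes = b"".join(struct.pack("<H", t & 0xFFFF) for t in token_ids)

# A little-endian u32 preserves every ID.
u32_bytes = b"".join(struct.pack("<I", t) for t in token_ids)

# The index must record the real byte length of each entry:
# 2 bytes per token for u16, 4 bytes per token for u32. If the data
# is written as u32 but the index still claims len(buffer) * 2 bytes,
# the recorded lengths no longer match the bytes on disk, and the
# Reader can end up asking for data that isn't there ("failed to
# fill whole buffer").
assert len(u16_bytes) == len(token_ids) * 2
assert len(u32_bytes) == len(token_ids) * 4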

I'm not sure whether the other modifications in #23 are necessary (sorry, I don't understand Rust :().

@scandukuri

scandukuri commented Dec 2, 2024

Hi @csAugust - thanks so much for your work. Would you be willing to try the llama3 branch again with the newest changes from @zhenyuhe00? The setup as-is in the llama3 branch was working fine for me, at least for Llama models. I'll also investigate when I get time, but I don't have the bandwidth at the moment.

Is it possible that the Qwen tokenizer has additional added tokens? There seems to have been a second vocab size issue that could impact datastore writes, which the most recent commit may have fixed.
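
If it helps to check, a quick sketch (assuming the Hugging Face transformers tokenizer API and the model ID from the issue) compares the base vocabulary with the total size including added tokens:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Base vocabulary vs. total size including added/special tokens.
print("vocab_size:     ", tok.vocab_size)
print("len(tokenizer): ", len(tok))

# Any token ID above 65535 cannot be stored as a u16 without truncation.
print("IDs exceed u16 range:", len(tok) - 1 > 0xFFFF)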

@csAugust
Author

csAugust commented Dec 3, 2024

Thank you for your help! I retried the newest llama3 branch and it works fine. It seems that the additional tokens added by the Qwen tokenizer caused these problems.

@csAugust csAugust closed this as completed Dec 3, 2024